Regression on a synthetic dataset

On this page we will see how to perform Least Squares Support Vector Regression with LeastSquaresSVM. To accomplish this task, we will use synthetic data created by the make_regression function from MLJ.

First, we need to import all the necessary packages.

using LeastSquaresSVM
using MLJ, MLJBase
using DataFrames, CSV
using CategoricalArrays, Random
using Plots
gr();
rng = MersenneTwister(88);
[ Info: Model metadata loaded from registry.

We then create the regression problem. To really push the implementation, we will use 5 features and 500 instances/observations, and we will also add a little Gaussian noise to the targets.

X, y = MLJ.make_regression(500, 5; noise=1.0, rng=rng);

We construct a DataFrame from the generated arrays to handle the data more conveniently and to integrate better with MLJ.

df = DataFrame(X);
df.y = y;

A very important part of the MLJ framework is its use of scitypes, special scientific types that work together with the objects from the framework. Because make_regression returns plain Julia types, we need to coerce these to the correct scitypes so that the MLJ machines work properly.

dfnew = coerce(df, autotype(df));
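
As a quick sanity check (an optional extra step, not required for the rest of the workflow), we could also inspect the scientific type of every column directly with MLJ's schema function:

schema(dfnew)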

We can then observe the first three rows, together with the new types of each column.

first(dfnew, 3) |> pretty
┌────────────┬────────────┬────────────┬────────────┬────────────┬────────────┐
│ x1         │ x2         │ x3         │ x4         │ x5         │ y          │
│ Float64    │ Float64    │ Float64    │ Float64    │ Float64    │ Float64    │
│ Continuous │ Continuous │ Continuous │ Continuous │ Continuous │ Continuous │
├────────────┼────────────┼────────────┼────────────┼────────────┼────────────┤
│ 0.548308   │ -0.138211  │ -0.972058  │ 0.208522   │ -0.176455  │ -0.227663  │
│ -0.719038  │ -0.0480897 │ 2.04784    │ 0.171147   │ 0.307089   │ 4.35437    │
│ 0.96055    │ 0.306524   │ -0.445852  │ 0.611987   │ 1.3417     │ -4.34867   │
└────────────┴────────────┴────────────┴────────────┴────────────┴────────────┘

We should also check out the basic statistics of the dataset.

describe(dfnew, :mean, :std, :eltype)

6 rows × 4 columns

│ Row │ variable │ mean       │ std      │ eltype   │
│     │ Symbol   │ Float64    │ Float64  │ DataType │
├─────┼──────────┼────────────┼──────────┼──────────┤
│ 1   │ x1       │ -0.0821964 │ 1.06794  │ Float64  │
│ 2   │ x2       │ -0.0304101 │ 1.0045   │ Float64  │
│ 3   │ x3       │ 0.00888428 │ 0.99942  │ Float64  │
│ 4   │ x4       │ -0.053676  │ 0.972936 │ Float64  │
│ 5   │ x5       │ -0.0263042 │ 1.00545  │ Float64  │
│ 6   │ y        │ 0.526471   │ 3.12977  │ Float64  │

Recall that we also need to standardize the dataset: the means are close to zero, but not quite, and we also need a unit standard deviation.

We now split the dataset into training and testing sets, and then standardize the features.

y, X = unpack(dfnew, ==(:y), colname -> true);
train, test = partition(eachindex(y), 0.75, shuffle=true, rng=rng);
stand1 = Standardizer();
X = MLJBase.transform(MLJBase.fit!(MLJBase.machine(stand1, X)), X);
[ Info: Training Machine{Standardizer} @106.

We should make sure that the features now have a mean close to zero and a unit standard deviation.

describe(X |> DataFrame, :mean, :std, :eltype)

5 rows × 4 columns

│ Row │ variable │ mean         │ std     │ eltype   │
│     │ Symbol   │ Float64      │ Float64 │ DataType │
├─────┼──────────┼──────────────┼─────────┼──────────┤
│ 1   │ x1       │ 3.73035e-17  │ 1.0     │ Float64  │
│ 2   │ x2       │ -2.75335e-17 │ 1.0     │ Float64  │
│ 3   │ x3       │ 3.33067e-18  │ 1.0     │ Float64  │
│ 4   │ x4       │ 1.24345e-17  │ 1.0     │ Float64  │
│ 5   │ x5       │ -2.26485e-17 │ 1.0     │ Float64  │

We now need to find a good set of hyperparameters for this problem and train the regressor. We will use MLJ's tuning capabilities to search over the model's hyperparameters and return the best model found.

For this we have made some judicious guesses about the ranges the hyperparameters should take. We employ 5-fold cross-validation and a grid with a goal of roughly 400 points to do an exhaustive search.

We will tune the regressor using the root mean square error (RMSE) as the measure, which is defined as follows

\[RMSE = \sqrt{\frac{\sum_{i=1}^{N} \left(\hat{y}_i - y_i \right)^2}{N}}\]

where $\hat{y}_i$ is the predicted value and $y_i$ is the true value.
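
To make the metric concrete, here is a minimal sketch of how the RMSE could be computed by hand; the helper name rmse is just illustrative, and its result should agree with MLJBase.rms:

# Hand-rolled RMSE for two vectors of equal length
rmse(ŷ, y) = sqrt(sum((ŷ .- y) .^ 2) / length(y))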

model = LeastSquaresSVM.LSSVRegressor();
r1 = MLJBase.range(model, :σ, lower=7e-4, upper=1e-3);
r2 = MLJBase.range(model, :γ, lower=120.0, upper=130.0);
self_tuning_model = TunedModel(
    model=model,
    tuning=Grid(goal=400, rng=rng),
    resampling=CV(nfolds=5),
    range=[r1, r2],
    measure=MLJBase.rms,
    acceleration=CPUThreads(), # We use this to enable multithreading
);

And now we proceed to train all the models and find the best one!

mach = MLJ.machine(self_tuning_model, X, y);
MLJBase.fit!(mach, rows=train, verbosity=0);
fitted_params(mach).best_model
LSSVRegressor(
    kernel = :rbf,
    γ = 130.0,
    σ = 0.0007315789473684211,
    degree = 0) @053
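
Besides the best model itself, we could also inspect the tuning report; assuming the report layout provided by MLJ's TunedModel (field names may vary between versions), something like the following shows the history entry of the best model, including its cross-validated measurement:

report(mach).best_history_entry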

Having found the best hyperparameters for the regressor, we check how well the model generalizes by evaluating its performance on the test set.

ŷ = MLJBase.predict(mach, rows=test);
result = round(MLJBase.rms(ŷ, y[test]), sigdigits=4);
@show result # Check the result
1.035

We can see that we did quite well. Recall that we added Gaussian noise with a standard deviation of 1.0 to the targets, so an RMSE close to 1 is essentially at the noise level of the data; a finer grid search might improve it slightly, but we should not expect values much closer to zero.
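
To put that number in perspective, an optional check (not part of the original workflow) is to compare against a trivial baseline that always predicts the mean of the training targets; its RMSE should be close to the standard deviation of y, roughly 3, so a value near 1 is a substantial improvement:

using Statistics # for mean

# Baseline: predict the training mean for every test observation
baseline = fill(mean(y[train]), length(test));
round(MLJBase.rms(baseline, y[test]), sigdigits=4)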

We can also plot the predicted values against the true values. The closer the points lie to the diagonal, the better the model performed.

scatter(ŷ, y[test], markersize=9)
r = range(minimum(y[test]), maximum(y[test]); length=length(test))
plot!(r, r, linewidth=9)
plot!(size=(3000, 2100))

We can see that we are not far off; a finer hyperparameter search could improve the performance of our model a little further.


This page was generated using Literate.jl.