Regression on a synthetic dataset

On this page we will see how to perform Least Squares Support Vector Regression with LeastSquaresSVM. To accomplish this task, we will use synthetic data created by the make_regression function from MLJ.

First, we need to import all the necessary packages.

using LeastSquaresSVM
using MLJ, MLJBase
using DataFrames, CSV
using CategoricalArrays, Random
using Plots
gr();
rng = MersenneTwister(88);
[ Info: Model metadata loaded from registry.

We then create the regression problem. To really push the implementation, we will use 5 features and 500 instances/observations, and we will also add a little Gaussian noise to the targets.

X, y = MLJ.make_regression(500, 5; noise=1.0, rng=rng);

We construct a DataFrame from the generated arrays to handle the data more conveniently and to integrate better with MLJ.

df = DataFrame(X);
df.y = y;

A very important part of the MLJ framework is its use of scitypes, special scientific types that work together with the objects from the framework. Because make_regression returns plain Julia types, we need to coerce these to the correct scitypes so that the MLJ machines work properly.

dfnew = coerce(df, autotype(df));
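
As a quick sanity check (an optional extra step, not required for the rest of the workflow), we could also inspect the scientific type of every column directly with MLJ's schema function:

schema(dfnew)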

We can then observe the first three rows, together with the new types of each column.

first(dfnew, 3) |> pretty
┌────────────┬────────────┬────────────┬────────────┬────────────┬────────────┐
│ x1         │ x2         │ x3         │ x4         │ x5         │ y          │
│ Float64    │ Float64    │ Float64    │ Float64    │ Float64    │ Float64    │
│ Continuous │ Continuous │ Continuous │ Continuous │ Continuous │ Continuous │
├────────────┼────────────┼────────────┼────────────┼────────────┼────────────┤
│ 0.548308   │ -0.138211  │ -0.972058  │ 0.208522   │ -0.176455  │ -0.227663  │
│ -0.719038  │ -0.0480897 │ 2.04784    │ 0.171147   │ 0.307089   │ 4.35437    │
│ 0.96055    │ 0.306524   │ -0.445852  │ 0.611987   │ 1.3417     │ -4.34867   │
└────────────┴────────────┴────────────┴────────────┴────────────┴────────────┘

We should also check out the basic statistics of the dataset.

describe(dfnew, :mean, :std, :eltype)

6 rows × 4 columns

│ Row │ variable │ mean       │ std      │ eltype   │
│     │ Symbol   │ Float64    │ Float64  │ DataType │
├─────┼──────────┼────────────┼──────────┼──────────┤
│ 1   │ x1       │ -0.0821964 │ 1.06794  │ Float64  │
│ 2   │ x2       │ -0.0304101 │ 1.0045   │ Float64  │
│ 3   │ x3       │ 0.00888428 │ 0.99942  │ Float64  │
│ 4   │ x4       │ -0.053676  │ 0.972936 │ Float64  │
│ 5   │ x5       │ -0.0263042 │ 1.00545  │ Float64  │
│ 6   │ y        │ 0.526471   │ 3.12977  │ Float64  │

Recall that we also need to standardize the dataset: the means are close to zero, but not quite, and we also need a unit standard deviation.

We now split the dataset into training and testing sets, and then standardize the features.

y, X = unpack(dfnew, ==(:y), colname -> true);
train, test = partition(eachindex(y), 0.75, shuffle=true, rng=rng);
stand1 = Standardizer();
X = MLJBase.transform(MLJBase.fit!(MLJBase.machine(stand1, X)), X);
[ Info: Training Machine{Standardizer} @106.

We should make sure that the features now have a mean close to zero and a unit standard deviation.

describe(X |> DataFrame, :mean, :std, :eltype)

5 rows × 4 columns

│ Row │ variable │ mean         │ std     │ eltype   │
│     │ Symbol   │ Float64      │ Float64 │ DataType │
├─────┼──────────┼──────────────┼─────────┼──────────┤
│ 1   │ x1       │ 3.73035e-17  │ 1.0     │ Float64  │
│ 2   │ x2       │ -2.75335e-17 │ 1.0     │ Float64  │
│ 3   │ x3       │ 3.33067e-18  │ 1.0     │ Float64  │
│ 4   │ x4       │ 1.24345e-17  │ 1.0     │ Float64  │
│ 5   │ x5       │ -2.26485e-17 │ 1.0     │ Float64  │

We now need to find a good set of hyperparameters for this problem and train the regressor. We will use MLJ's tuning capabilities to search over the model's hyperparameters and return the best model found.

For this we have made some judicious guesses about the ranges the hyperparameters should take. We employ 5-fold cross-validation and a grid with a goal of roughly 400 points to do an exhaustive search.

We will tune the regressor using the root mean square error (RMSE) as the measure, which is defined as follows

\[RMSE = \sqrt{\frac{\sum_{i=1}^{N} \left(\hat{y}_i - y_i \right)^2}{N}}\]

where $\hat{y}_i$ is the predicted value and $y_i$ is the true value.
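
To make the metric concrete, here is a minimal sketch of how the RMSE could be computed by hand; the helper name rmse is just illustrative, and its result should agree with MLJBase.rms:

# Hand-rolled RMSE for two vectors of equal length
rmse(ŷ, y) = sqrt(sum((ŷ .- y) .^ 2) / length(y))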

model = LeastSquaresSVM.LSSVRegressor();
r1 = MLJBase.range(model, :σ, lower=7e-4, upper=1e-3);
r2 = MLJBase.range(model, :γ, lower=120.0, upper=130.0);
self_tuning_model = TunedModel(
    model=model,
    tuning=Grid(goal=400, rng=rng),
    resampling=CV(nfolds=5),
    range=[r1, r2],
    measure=MLJBase.rms,
    acceleration=CPUThreads(), # We use this to enable multithreading
);

And now we proceed to train all the models and find the best one!

mach = MLJ.machine(self_tuning_model, X, y);
MLJBase.fit!(mach, rows=train, verbosity=0);
fitted_params(mach).best_model
LSSVRegressor(
    kernel = :rbf,
    γ = 130.0,
    σ = 0.0007315789473684211,
    degree = 0) @053
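
Besides the best model itself, we could also inspect the tuning report; assuming the report layout provided by MLJ's TunedModel (field names may vary between versions), something like the following shows the history entry of the best model, including its cross-validated measurement:

report(mach).best_history_entry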

Having found the best hyperparameters for the regressor, we check how well the model generalizes by evaluating its performance on the test set.

ŷ = MLJBase.predict(mach, rows=test);
result = round(MLJBase.rms(ŷ, y[test]), sigdigits=4);
@show result # Check the result
1.035

We can see that we did quite well. Recall that we added Gaussian noise with a standard deviation of 1.0 to the targets, so an RMSE close to 1 is essentially at the noise level of the data; a finer grid search might improve it slightly, but we should not expect values much closer to zero.
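
To put that number in perspective, an optional check (not part of the original workflow) is to compare against a trivial baseline that always predicts the mean of the training targets; its RMSE should be close to the standard deviation of y, roughly 3, so a value near 1 is a substantial improvement:

using Statistics # for mean

# Baseline: predict the training mean for every test observation
baseline = fill(mean(y[train]), length(test));
round(MLJBase.rms(baseline, y[test]), sigdigits=4)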

We can also plot the predicted values against the true values. The closer the points lie to the diagonal, the better the model performed.

scatter(ŷ, y[test], markersize=9)
r = range(minimum(y[test]), maximum(y[test]); length=length(test))
plot!(r, r, linewidth=9)
plot!(size=(3000, 2100))

We can see that we are not far off; a finer hyperparameter search could improve the performance of our model a little further.


This page was generated using Literate.jl.