Using different kernels

Starting with version v0.6, two new kernels can be chosen: a linear kernel and a polynomial kernel. In this document we will see how to choose a different kernel, and we'll showcase an example.

First, we need to import all the necessary packages.

using LeastSquaresSVM
using MLJ, MLJBase
using DataFrames, CSV
using CategoricalArrays, Random
using Plots
gr();
rng = MersenneTwister(812);

For this example, we will create a large classification problem, inspired by a similar classification problem from scikit-learn.

The idea is to have a very large number of features (2000) and a small number of instances (500). This has been reported as a good use case for the following rule of thumb:

Whenever the number of features is larger than the number of instances, use a linear kernel.

X, y = MLJ.make_blobs(500, 2_000; centers=2, cluster_std=[1.5, 0.5]);

Of course, this is just to showcase the implementation within LeastSquaresSVM. There are actually better ways to handle this kind of problem, e.g. dimensionality-reduction algorithms.

The make_blobs function is very similar to its scikit-learn counterpart. The idea is to create blobs, or clusters of points, and to classify between them.

We'll use them to test our linear kernel.

We need to construct a DataFrame from the arrays we created, both to handle the data more easily and for better integration with MLJ.

df = DataFrame(X);
df.y = y;
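
Since we already loaded Plots, we can take a quick look at the data. The following sketch plots only the first two of the 2000 features, colored by class; with this many dimensions, a two-coordinate slice is not guaranteed to show the clusters as separated.

# Scatter of the first two features only; just an illustrative 2-D slice
scatter(df.x1, df.x2; group=df.y, xlabel="x1", ylabel="x2")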

Recall that we need to change the primitive types of Julia to scitypes.

dfnew = coerce(df, autotype(df));
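
As a quick sanity check (assuming autotype maps the integer class labels to a finite scitype, as it normally does for columns with few unique values), we can inspect the scitype of the target column:

# The target should now have a finite (categorical) scitype
scitype(dfnew.y)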

We can then observe the first three rows, together with the columns' new types. We'll just look at the first 8 features to avoid cluttering the space.

first(dfnew[:, 1:8], 3) |> pretty
┌────────────┬────────────┬────────────┬────────────┬────────────┬────────────┬────────────┬────────────┐
│ x1         │ x2         │ x3         │ x4         │ x5         │ x6         │ x7         │ x8         │
│ Float64    │ Float64    │ Float64    │ Float64    │ Float64    │ Float64    │ Float64    │ Float64    │
│ Continuous │ Continuous │ Continuous │ Continuous │ Continuous │ Continuous │ Continuous │ Continuous │
├────────────┼────────────┼────────────┼────────────┼────────────┼────────────┼────────────┼────────────┤
│ -1.38381   │ -10.0134   │ -3.41697   │ -0.330675  │ -2.51671   │ -3.36892   │ 5.17333    │ 9.19215    │
│ 7.8381     │ -3.24597   │ -1.97893   │ -6.60269   │ -5.00902   │ 0.111031   │ -7.87598   │ 7.4602     │
│ 8.98682    │ -2.5873    │ -1.78164   │ -5.34829   │ -6.32319   │ 0.190421   │ -8.23561   │ 7.67024    │
└────────────┴────────────┴────────────┴────────────┴────────────┴────────────┴────────────┴────────────┘

We should also check out the basic statistics of the dataset. We'll only use a small subset, as the data frame is quite large.

describe(dfnew[1:20, 1:10], :mean, :std, :eltype)
10×4 DataFrame
 Row │ variable  mean       std      eltype
     │ Symbol    Float64    Float64  DataType
─────┼────────────────────────────────────────
   1 │ x1         1.81163   5.90859  Float64
   2 │ x2        -6.40472   3.06965  Float64
   3 │ x3        -4.33766   2.51566  Float64
   4 │ x4        -4.35342   1.90329  Float64
   5 │ x5        -3.09062   2.26507  Float64
   6 │ x6        -2.276     2.61437  Float64
   7 │ x7        -0.521823  7.09574  Float64
   8 │ x8         8.30596   1.34516  Float64
   9 │ x9         2.51241   1.68655  Float64
  10 │ x10       -0.283946  7.60654  Float64

Recall that we also need to standardize the dataset: we can see here that the means are close to zero, but not quite, and we also need a unit standard deviation.
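
To make this concrete, here is a minimal sketch, separate from the pipeline built below, of what the Standardizer does to a handful of feature columns:

# Illustrative only: standardize the first 10 features and re-check the stats;
# in the actual workflow the standardization happens inside the pipeline.
stand = Standardizer();
mach_std = machine(stand, dfnew[:, 1:10]);
fit!(mach_std, verbosity=0);
X_std = MLJ.transform(mach_std, dfnew[:, 1:10]);
describe(DataFrame(X_std), :mean, :std) # means ≈ 0, standard deviations ≈ 1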

Split the dataset into training and testing sets.

y, X = unpack(dfnew, ==(:y), colname -> true);
train, test = partition(eachindex(y), 0.75, shuffle=true, rng=rng);

In this document, we will use a pipeline to integrate both the standardization and the LS-SVM classifier into one step. This should make training easier, and it also showcases the ease of use of the @pipeline macro from MLJ.jl.

First, we define our classifier and the hyperparameter range. The self_tuning_model variable will hold the model with the best performance.

For the case of a linear kernel, no kernel hyperparameter is needed. Instead, the only hyperparameter that needs to be adjusted is the $\sigma$ value that is intrinsic to the least-squares formulation. We will search for a good value now.

We will use the accuracy as a metric. The accuracy is simply defined as

\[\text{accuracy} = \frac{\text{number of correct predictions}}{\text{total number of predictions}}\]

Note that the accuracy is not always a good measure of classification, but it will do fine on this dataset.
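
In code, this is nothing more than counting matches. A hand-rolled version, equivalent to the accuracy measure from MLJ that we use below, could look like this:

# Fraction of predictions that match the ground truth
manual_accuracy(ŷ, y) = sum(ŷ .== y) / length(y)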

model = LSSVClassifier(kernel=:linear);
r1 = range(model, :σ, lower=1.0, upper=1000.0);
self_tuning_model = TunedModel(
    model=model,
    tuning=Grid(goal=400, rng=rng),
    resampling=StratifiedCV(nfolds=5),
    range=[r1],
    measure=accuracy,
    acceleration=CPUThreads(), # We use this to enable multithreading
);
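
As an aside, the other kernels are selected through the same kernel keyword. A sketch for the polynomial kernel follows; the :poly symbol is an assumption inferred from the degree field in the model's parameter list, so check the package documentation for the exact name:

# Hypothetical polynomial-kernel model (the :poly symbol is assumed)
model_poly = LSSVClassifier(kernel=:poly, degree=3);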

Then, we will build the pipeline. The first step standardizes the inputs, which are then passed to the classifier.

pipe = @pipeline(Standardizer(), self_tuning_model);
Warning

Remember that the least-squares formulation uses all the data samples, so the following will actually consume at least 6 GB of RAM or more. Do not run this on your hardware if you are not sure you have these resources available. At the very least, replace CPUThreads() with CPU1() to disable multithreading. Methods to handle memory more efficiently will be available in future versions.
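
For reference, that swap is just a change of the acceleration keyword; a single-threaded variant of the tuned model (with the hypothetical name self_tuning_model_serial) would be:

self_tuning_model_serial = TunedModel(
    model=model,
    tuning=Grid(goal=400, rng=rng),
    resampling=StratifiedCV(nfolds=5),
    range=[r1],
    measure=accuracy,
    acceleration=CPU1(), # single-threaded; slower, but lighter on system resources
);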

And now we proceed to train all the models and find the best one!

mach = machine(pipe, X, y);
fit!(mach, rows=train, verbosity=0);
fitted_params(mach).deterministic_tuned_model.best_model
LSSVClassifier(
    kernel = :linear,
    γ = 1.0,
    σ = 283.9248120300752,
    degree = 0) @564

Having found the best hyperparameters for the classifier, we proceed to check how the model generalizes. To do this, we use the test set to check its performance.

ŷ = MLJBase.predict(mach, rows=test);
result = accuracy(ŷ, y[test]);
result
1.0

We can see that we did quite well. A value of 1, or close to it, means the classifier is essentially perfect; that is, it can correctly distinguish between the classes.

Finally, let us look at the so-called confusion matrix. This table shows us useful information about the performance of our classifier.

Let us compute it first, and then we'll analyse it. Notice, however, that we need to first coerce the types to OrderedFactor scitypes in order for the confusion matrix to be computed correctly.

ŷ = coerce(ŷ, OrderedFactor);
y_ordered = coerce(y[test], OrderedFactor);
confusion_matrix(ŷ, y_ordered)
              ┌───────────────────────────┐
              │       Ground Truth        │
┌─────────────┼─────────────┬─────────────┤
│  Predicted  │      1      │      2      │
├─────────────┼─────────────┼─────────────┤
│      1      │     60      │      0      │
├─────────────┼─────────────┼─────────────┤
│      2      │      0      │     65      │
└─────────────┴─────────────┴─────────────┘

The way you read the confusion matrix is the following. The main diagonal tells us how many correct predictions the classifier obtained for both classes. The off-diagonal values are read as follows:

  • The upper right value is known as the false positives. This is the number of instances classified as belonging to a given class when they were actually instances of the other one. An example would be an instance $x_1$ that belonged to class b, but which the classifier predicted as belonging to class a.
  • The lower left value is known as the false negatives. This is the number of instances classified as not belonging to a given class when they actually did belong to it. An example would be an instance $x_2$ that belonged to class a, but which the classifier predicted as belonging to class b.

It might be a little bit confusing, so a good starting point for more information on the subject is the excellent Wikipedia article. You might also be interested in the related document from Google's Machine Learning Crash Course.

In this case, we can see that no false negatives or false positives were found, which means that the classifier did outstandingly well. Normally, we can expect at least some percentage of false negatives or positives.
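
As a quick sanity check, the accuracy reported earlier can be recovered directly from the matrix, since the diagonal entries are the correct predictions and everything off the diagonal is an error:

# From the matrix above: 60 and 65 correct predictions, 0 errors
(60 + 65) / (60 + 65 + 0 + 0) # = 1.0, matching the test-set accuracy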


This page was generated using Literate.jl.