Baseline benchmark for multiclass classification

In this document we will benchmark the multiclass classification problem of the Least Squares Support Vector Classifier from LeastSquaresSVM and the one from the fantastic and very well-known libSVM solver.

The idea is to test both their prediction and training complexities, both in time and space. To accomplish this, we will be using the BenchmarkTools package.

We will test two cases here:

Training with hyperparameter tuning;
training without tuning.

This is because a well-tuned SVM is very important to predict good results. We almost always need a well-tuned SVM, so this is an important step.

The dataset we'll be using is the Iris dataset. This is a very simple and easy dataset, and we will deal with larger datasets in another document elsewhere.

21.9 μs

First, some imports and all that good stuff.

2.6 μs

xxxxxxxxxx
 
begin
    using MLJ, MLJModels
    using Random
    using BenchmarkTools
    using DataFrames
    using LeastSquaresSVM
end

18.0 s

SVC(
    kernel = LIBSVM.Kernel.RadialBasis,
    gamma = 0.0,
    weights = nothing,
    cost = 1.0,
    cachesize = 200.0,
    degree = 3,
    coef0 = 0.0,
    tolerance = 0.001,
    shrinking = true,
    probability = false)[34m @279[39m

xxxxxxxxxx
 
@load SVC pkg=LIBSVM

439 ms

Loading the dataset

We now load the dataset, and we'll convert it to a DataFrame to inspect it. We just want to check that the scitypes are correct.

3.8 μs

xxxxxxxxxx
 
X, y = @load_iris;

4.3 s

xxxxxxxxxx
 
dfIris = DataFrame(X);

150 ms

xxxxxxxxxx
 
dfIris.y = y;

4.7 μs

	sepal_length	sepal_width	petal_length	petal_width	y
	Float64	Float64	Float64	Float64	CategoricalValue
1	5.1	3.5	1.4	0.2	CategoricalValue{String,UInt32} "setosa"
2	4.9	3.0	1.4	0.2	CategoricalValue{String,UInt32} "setosa"
3	4.7	3.2	1.3	0.2	CategoricalValue{String,UInt32} "setosa"

xxxxxxxxxx
 
first(dfIris, 3)

71.0 ms

Training setup

Great, so everything looks good with the dataset. Let us now setup the dataset for training and testing. We'll do a 60-40% split on this dataset. Recall that this dataset only has 150 datasamples. We also always shuffle the splitting.

3.0 μs

xxxxxxxxxx
 
train, test = partition(eachindex(y), 0.6, shuffle=true);

129 ms

Training functions

To correctly benchmark the training steps, we need to setup and create some functions that collect all the training logic.

Here, we are doing two pairs of functions, one without hyperparameter tuning and the other with tuning. You can tell them apart because the ones with hyperparameter tuning have the suffix hp.

Also, to differentiate between both implementations, svm corresponds to the libsvm implementation. On the other hand, lssvm corresponds to the LeastSquaresSVM implementation.

7.3 μs

xxxxxxxxxx
 
function train_lssvm(X, y, train)
    model = LSSVClassifier()
    
    mach = machine(model, X, y)
    fit!(mach, rows=train, verbosity=0)
    
    return mach
end;

22.8 μs

xxxxxxxxxx
 
function train_svm(X, y, train)
    model = SVC()
    
    mach = machine(model, X, y)
    fit!(mach, rows=train, verbosity=0)
    
    return mach
end;

25.6 μs

x
 
function train_lssvm_hp(X, y, train)
    model = LSSVClassifier()
    
    r1 = range(model, :σ, lower=1e-2, upper=10.0)
    r2 = range(model, :γ, lower=1, upper=200.0)
    
    self_tuning_model = TunedModel(
        model=model,
        tuning=Grid(goal=625),
        resampling=CV(nfolds=5),
        range=[r1, r2],
        measure=accuracy
    )
    
    mach = machine(self_tuning_model, X, y)
    fit!(mach, rows=train)
    
    return mach
end;

51.9 μs

x
 
function train_svm_hp(X, y, train)
    model = SVC()
    
    r1 = range(model, :cost, lower=1, upper=1000.0)
    r2 = range(model, :gamma, lower=1e-2, upper=10.0)
    
    self_tuning_model = TunedModel(
        model=model,
        tuning=Grid(goal=625),
        resampling=CV(nfolds=5),
        range=[r1, r2],
        measure=accuracy
    )
    
    mach = machine(self_tuning_model, X, y)
    fit!(mach, rows=train)
    
    return mach
end;

48.4 μs

Dataset prep

With that out of the way, we need to remove the mean from the dataset, as well as setting its standard deviation to one. This is referred to as standardizing the dataset, and MLJ already has a model for such tasks.

We'll standardize our model before feeding it to our models, because SVMs are quite picky of non-standardized data.

4.5 μs

xxxxxxxxxx
 
Xstd = MLJ.transform(MLJ.fit!(MLJ.machine(Standardizer(), X)), X);

2.0 s

Benchmarking

Let's begin with benchmarking.

Without tuning

First, let's do a baseline benchmark, without hyperparameter tuning. Here, we are looking for time and space estimations for both implementations.

We'll start with the LeastSquaresSVM implementation.

5.7 μs

BenchmarkTools.Trial: 
  memory estimate:  421.30 KiB
  allocs estimate:  565
  --------------
  minimum time:     325.362 μs (0.00% GC)
  median time:      339.546 μs (0.00% GC)
  mean time:        383.158 μs (6.06% GC)
  maximum time:     3.779 ms (87.23% GC)
  --------------
  samples:          10000
  evals/sample:     1

xxxxxxxxxx
 
@benchmark train_lssvm($Xstd, $y, $train)

16.2 s

Memory allocation

Right out of the bat, we can see that there is quite a lot of allocation. In the order of KiB it might not seem much, but this will surely be more promiment in the hyperparameter tuning estimations.

This is expected due to the following reasons:

This formulation will always need all the dataset for both training and testing.
The solver (conjugate gradient, the iterative formulation) has to deal with allocations due to having to create a basis for each iteration.

Time estimation

Following the manual for BenchmarkTools we will use the median as out metric for comparing both models. This is because it is not really affected by noise or some other type of outliers.

8.1 ms

With this in mind our median time is of ≈ 400 microseconds, which is pretty okay. Some of this might come from some compiler optimizations and clever machine code.

We'll move on to the next implementation to have something to compare it with.

3.0 μs

BenchmarkTools.Trial: 
  memory estimate:  40.02 KiB
  allocs estimate:  367
  --------------
  minimum time:     296.172 μs (0.00% GC)
  median time:      308.793 μs (0.00% GC)
  mean time:        334.393 μs (1.13% GC)
  maximum time:     6.323 ms (91.58% GC)
  --------------
  samples:          10000
  evals/sample:     1

xxxxxxxxxx
 
@benchmark train_svm($Xstd, $y, $train)

11.0 s

Memory allocation

Wow, this is an order of maginitude lower compared to the LeastSquaresSVM implementation. But then, again, this is expected. The libSVM solver is sparse, meaning that it only selects a couple of data instances as its support vectors.

Time estimation

By comparing with the previous median value, this implementation is faster, and better. I kind of expected this result due to the fact the the libSVM solver is very highly optimized code, and written almost entirely in C/C++.

And by highly optimized code I mean that they use really clever tricks in various numerical computations throughout their implementation.

But even so the LeastSquaresSVM implementation is very close. Not too shabby.

33.5 μs

With tuning. Setup and general info.

The hyperparameter search is done with the Grid algorithm from MLJ, which is a conventional grid seach.

The reason for this choice is simple: we want to be fair.

Each model tunes its hyperparameters using 5-fold cross-validation on the training set only.

Each model has to tune two hyperparameters:

an intrinsic one; cost for the libSVM and γ for the LeastSquaresSVM,
the kernel hyperparameter; both models are using the same RBF kernel.

Each model will search for the best pair of hyperparameters in a square grid of 625 points. That is, 25 values for each hyperparameter.

It might seem like very few elements, but keep in mind that no parallelization is used for this search (which normally it is recommended to do). To make the comparison fair, we are doing this in a single thread, for each model.

The performance metric will be the classification accuracy.

31.9 μs

BenchmarkTools.Trial: 
  memory estimate:  1.18 GiB
  allocs estimate:  3038323
  --------------
  minimum time:     2.030 s (4.67% GC)
  median time:      2.042 s (5.16% GC)
  mean time:        2.040 s (5.07% GC)
  maximum time:     2.049 s (5.14% GC)
  --------------
  samples:          3
  evals/sample:     1

x
 
t1 = @benchmark train_lssvm_hp($Xstd, $y, $train)

18.4 s

Memory allocations

Well, this is not a surprise. For each pair of points, a model is trained. The whole dataset set (the training portion of it) is used each time. For now it is only comprised of 90 data instances. But imagine how large can this grow for datasets with more than just 150 data samples.

This is one of the reasons why the least squares formulation is not the popular one. Even though it is very simple to implement, it consumes memory like there is no tomorrow.

Time estimation

This is quite a surprise. I seriously did not expect it to perform as well as it does. It is fast, efficient, but I think this is mostly due to the nature of the iterative solver it uses. If a more direct approach were used, I firmly believe the running time would have blown up.

21.2 μs

BenchmarkTools.Trial: 
  memory estimate:  169.82 MiB
  allocs estimate:  1521240
  --------------
  minimum time:     2.146 s (1.10% GC)
  median time:      2.160 s (1.34% GC)
  mean time:        2.169 s (1.25% GC)
  maximum time:     2.202 s (1.31% GC)
  --------------
  samples:          3
  evals/sample:     1

xxxxxxxxxx
 
t2 = @benchmark train_svm_hp($Xstd, $y, $train)

17.8 s

Memory allocations

Again, this is not a surprise. By using just a few data instances, the libSVM implementation is very conservative, memory-wise.

Do note, however, that the difference in memory usage is mind-blowing. The difference between this implementation and the least squares one is 7.1222 smaller.

Time estimation

This is really the weird part. I would have expected that this implementation would be faster. Maybe it has something to do with the number of samples from the benchmark. Anyway, the results show that both implementations are equivalent.

This is great news, because it means that no matter which implementation you use, you will always have your results in a timely manner. You can choose between both implementations and the prediction time is almost the same.

39.5 μs

Prediction

In this last section I just wanted to show the prediction time for both implementations.

We are assuming that both models have been tuned correctly.

First, I will wrap the prediction steps in a function, evaluation the accuracy of prediction for each model.

20.1 μs

xxxxxxxxxx
 
function predict_model(y, test, mach)
    results = MLJ.predict(mach, rows=test)
    acc = MLJ.accuracy(results, y[test])
    
    return acc
end;

33.3 μs

I need a tuned least squares SVM, so I do that first.

10.2 μs

xxxxxxxxxx
 
mach1 = train_lssvm_hp(Xstd, y, train);

2.2 s

BenchmarkTools.Trial: 
  memory estimate:  310.78 KiB
  allocs estimate:  512
  --------------
  minimum time:     131.098 μs (0.00% GC)
  median time:      139.659 μs (0.00% GC)
  mean time:        172.403 μs (9.78% GC)
  maximum time:     5.905 ms (94.55% GC)
  --------------
  samples:          10000
  evals/sample:     1

xxxxxxxxxx
 
@benchmark predict_model($y, $test, $mach1)

8.2 s

acc1

0.9666666666666667

xxxxxxxxxx
 
acc1 = predict_model(y, test, mach1)

230 μs

Results for least squares implementation

Good accuracy, as expected. Very high memory consumption, as expected. The running time is quite high, but nothing to really worry about.

10.8 μs

xxxxxxxxxx
 
mach2 = train_svm_hp(Xstd, y, train);

2.0 s

BenchmarkTools.Trial: 
  memory estimate:  18.73 KiB
  allocs estimate:  85
  --------------
  minimum time:     76.689 μs (0.00% GC)
  median time:      79.796 μs (0.00% GC)
  mean time:        85.311 μs (1.28% GC)
  maximum time:     6.371 ms (94.00% GC)
  --------------
  samples:          10000
  evals/sample:     1

xxxxxxxxxx
 
@benchmark predict_model($y, $test, $mach2)

7.0 s

acc2

0.95

xxxxxxxxxx
 
acc2 = predict_model(y, test, mach2)

164 μs

Results for `libSVM` implementation

Good accuracy, as expected. Low memory consumption, as expected.

Now, regarding the running time, this is definetely what I was expecting on the training step. We can clearly see, by comparing the median time between both benchmarks, that the libSVM implementation is almost twice as fast as the least squares implementation.

If it is using less data, that means less computations. But also, recall that this solver (and code) is very optimized.

Conclusions

Most of the results presented here represent a baseline benchmark result on a simple multiclass classification problem.

The results are expected in most cases. Although it is good to see the LeastSquaresSVM doing well on the running time, both in training and in prediction.

20.1 μs

Baseline benchmark for multiclass classification

Loading the dataset

Training setup

Training functions

Dataset prep

Benchmarking

Without tuning

Memory allocation

Time estimation

Memory allocation

Time estimation

With tuning. Setup and general info.

Memory allocations

Time estimation

Memory allocations

Time estimation

Prediction

Results for least squares implementation

Results for libSVM implementation

Conclusions

Results for `libSVM` implementation