Baseline benchmark for multiclass classification
In this document we will benchmark the Least Squares Support Vector Classifier from LeastSquaresSVM against the classifier from the fantastic and very well-known libSVM solver, on a multiclass classification problem.
The idea is to test their training and prediction performance, in both time and space. To accomplish this, we will be using the BenchmarkTools package.
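For readers unfamiliar with BenchmarkTools, here is a minimal, self-contained sketch of how we will use it. The $ interpolation tells the macro to treat a global variable as a local one, so that the cost of accessing globals does not pollute the measurement:

begin
    using BenchmarkTools
    x = rand(100)
    # Collects many samples and reports time and memory statistics.
    @benchmark sum($x)
end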
We will test two cases here:
- training with hyperparameter tuning;
- training without tuning.
This is because a well-tuned SVM is essential for good predictions. We almost always need a well-tuned SVM, so tuning is an important step to include in the benchmark.
The dataset we'll be using is the Iris dataset. This is a very simple and easy dataset; we will deal with larger datasets in another document.
First, some imports and all that good stuff.
begin
    using MLJ, MLJModels
    using Random
    using BenchmarkTools
    using DataFrames
    using LeastSquaresSVM
end
SVC(
    kernel = LIBSVM.Kernel.RadialBasis,
    gamma = 0.0,
    weights = nothing,
    cost = 1.0,
    cachesize = 200.0,
    degree = 3,
    coef0 = 0.0,
    tolerance = 0.001,
    shrinking = true,
    probability = false) @279
@load SVC pkg=LIBSVM
Loading the dataset
We now load the dataset, and we'll convert it to a DataFrame to inspect it. We just want to check that the scitypes are correct.
X, y = @load_iris;
dfIris = DataFrame(X);
dfIris.y = y;
|   | sepal_length | sepal_width | petal_length | petal_width | y |
|---|---|---|---|---|---|
|   | Float64 | Float64 | Float64 | Float64 | CategoricalValue{String,UInt32} |
| 1 | 5.1 | 3.5 | 1.4 | 0.2 | "setosa" |
| 2 | 4.9 | 3.0 | 1.4 | 0.2 | "setosa" |
| 3 | 4.7 | 3.2 | 1.3 | 0.2 | "setosa" |
first(dfIris, 3)
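By the way, a quick way to do this check is MLJ's schema function; a minimal sketch:

schema(dfIris)   # lists column names, machine types, and scitypes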
Training setup
Great, so everything looks good with the dataset. Let us now set up the dataset for training and testing. We'll do a 60-40% split on this dataset. Recall that this dataset only has 150 data samples. We also always shuffle the split.
train, test = partition(eachindex(y), 0.6, shuffle=true);
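As a quick sanity check, a 60-40% split of the 150 samples should leave 90 indices for training and 60 for testing:

(length(train), length(test))   # expected: (90, 60)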
Training functions
To correctly benchmark the training steps, we need to write some functions that collect all the training logic.
Here, we define two pairs of functions: one pair without hyperparameter tuning and one with it. You can tell them apart because the ones with hyperparameter tuning carry the suffix hp.
Also, to differentiate between both implementations, svm corresponds to the libSVM implementation, while lssvm corresponds to the LeastSquaresSVM implementation.
function train_lssvm(X, y, train)
    # Train a LeastSquaresSVM classifier with default hyperparameters.
    model = LSSVClassifier()
    mach = machine(model, X, y)
    fit!(mach, rows=train, verbosity=0)
    return mach
end;
function train_svm(X, y, train)
    # Train a libSVM classifier with default hyperparameters.
    model = SVC()
    mach = machine(model, X, y)
    fit!(mach, rows=train, verbosity=0)
    return mach
end;
function train_lssvm_hp(X, y, train)
    model = LSSVClassifier()
    # Grid ranges: σ is the RBF kernel width, γ the regularization parameter.
    r1 = range(model, :σ, lower=1e-2, upper=10.0)
    r2 = range(model, :γ, lower=1, upper=200.0)
    self_tuning_model = TunedModel(
        model=model,
        tuning=Grid(goal=625),      # 25 × 25 grid
        resampling=CV(nfolds=5),
        range=[r1, r2],
        measure=accuracy
    )
    mach = machine(self_tuning_model, X, y)
    fit!(mach, rows=train)
    return mach
end;
function train_svm_hp(X, y, train)
    model = SVC()
    # Grid ranges: cost is the penalty parameter, gamma the RBF kernel parameter.
    r1 = range(model, :cost, lower=1, upper=1000.0)
    r2 = range(model, :gamma, lower=1e-2, upper=10.0)
    self_tuning_model = TunedModel(
        model=model,
        tuning=Grid(goal=625),      # 25 × 25 grid
        resampling=CV(nfolds=5),
        range=[r1, r2],
        measure=accuracy
    )
    mach = machine(self_tuning_model, X, y)
    fit!(mach, rows=train)
    return mach
end;
Dataset prep
With that out of the way, we need to remove the mean from the dataset and set its standard deviation to one. This is referred to as standardizing the dataset, and MLJ already has a model for such tasks.
We'll standardize our data before feeding it to our models, because SVMs are quite sensitive to non-standardized data.
Xstd = MLJ.transform(MLJ.fit!(MLJ.machine(Standardizer(), X)), X);
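If you want to verify the transformation, every feature column should now have mean ≈ 0 and standard deviation ≈ 1. A minimal sketch of such a check, using the Statistics standard library (the name dfStd is just for illustration):

begin
    using Statistics
    dfStd = DataFrame(Xstd)
    # Every standardized column should have mean ≈ 0 and std ≈ 1.
    all(c -> isapprox(mean(dfStd[!, c]), 0; atol=1e-8) &&
             isapprox(std(dfStd[!, c]), 1; atol=1e-8), names(dfStd))
end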
Benchmarking
Let's begin with benchmarking.
Without tuning
First, let's do a baseline benchmark, without hyperparameter tuning. Here, we are looking for time and space estimations for both implementations.
We'll start with the LeastSquaresSVM implementation.
BenchmarkTools.Trial:
memory estimate: 421.30 KiB
allocs estimate: 565
--------------
minimum time: 325.362 μs (0.00% GC)
median time: 339.546 μs (0.00% GC)
mean time: 383.158 μs (6.06% GC)
maximum time: 3.779 ms (87.23% GC)
--------------
samples: 10000
evals/sample: 1
@benchmark train_lssvm($Xstd, $y, $train)
Memory allocation
Right off the bat, we can see that there is quite a lot of allocation. In the order of KiB it might not seem like much, but this will surely be more prominent in the hyperparameter tuning estimations.
This is expected for the following reasons:
- this formulation always needs the full dataset, for both training and prediction;
- the solver (conjugate gradient, the iterative formulation) allocates memory on each iteration, because it has to build a new basis vector every time.
Time estimation
Following the manual for BenchmarkTools, we will use the median as our metric for comparing both models, because it is robust against noise and outliers.
With this in mind, our median time is ≈ 340 microseconds, which is pretty okay. Some of this performance might come from compiler optimizations and clever machine code.
We'll move on to the next implementation to have something to compare it with.
BenchmarkTools.Trial:
memory estimate: 40.02 KiB
allocs estimate: 367
--------------
minimum time: 296.172 μs (0.00% GC)
median time: 308.793 μs (0.00% GC)
mean time: 334.393 μs (1.13% GC)
maximum time: 6.323 ms (91.58% GC)
--------------
samples: 10000
evals/sample: 1
@benchmark train_svm($Xstd, $y, $train)
Memory allocation
Wow, this is an order of magnitude lower compared to the LeastSquaresSVM implementation. But then again, this is expected. The libSVM solver is sparse, meaning that it only selects a handful of data instances as its support vectors.
Time estimation
Comparing with the previous median value, this implementation is faster, and better. I kind of expected this result, given that the libSVM solver is highly optimized code, written almost entirely in C/C++.
And by highly optimized code I mean that they use really clever tricks in various numerical computations throughout their implementation.
But even so, the LeastSquaresSVM implementation is very close. Not too shabby.
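As a side note, BenchmarkTools can compare two trials directly, which saves us from eyeballing the numbers. A minimal sketch, assuming the two trials above had been captured in variables t_lssvm and t_svm (hypothetical names; the cells above do not bind their results):

begin
    m_ls, m_svm = median(t_lssvm), median(t_svm)
    # `judge` classifies the difference as improvement, regression, or invariant;
    # `ratio` reports the raw time and memory ratios.
    judge(m_ls, m_svm), ratio(m_ls, m_svm)
end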
With tuning: setup and general info
The hyperparameter search is done with the Grid algorithm from MLJ, which is a conventional grid search.
The reason for this choice is simple: we want to be fair.
Each model tunes its hyperparameters using 5-fold cross-validation on the training set only.
Each model has to tune two hyperparameters:
- an intrinsic one: cost for libSVM and γ for LeastSquaresSVM;
- the kernel hyperparameter: gamma for libSVM and σ for LeastSquaresSVM. Both models use the same RBF kernel.
Each model will search for the best pair of hyperparameters on a square grid of 625 points, that is, 25 values for each hyperparameter. With 5-fold cross-validation, this amounts to 625 × 5 = 3125 model fits per search.
It might seem like very few points, but keep in mind that no parallelization is used for this search (which is normally recommended). To make the comparison fair, we are doing this in a single thread for each model.
The performance metric will be the classification accuracy.
BenchmarkTools.Trial:
memory estimate: 1.18 GiB
allocs estimate: 3038323
--------------
minimum time: 2.030 s (4.67% GC)
median time: 2.042 s (5.16% GC)
mean time: 2.040 s (5.07% GC)
maximum time: 2.049 s (5.14% GC)
--------------
samples: 3
evals/sample: 1
t1 = @benchmark train_lssvm_hp($Xstd, $y, $train)
Memory allocations
Well, this is not a surprise. For each pair of grid points, a model is trained, and the whole training portion of the dataset is used each time. Here that is only 90 data instances, but imagine how large this can grow for datasets with more than 150 samples.
This is one of the reasons why the least squares formulation is not the most popular one. Even though it is very simple to implement, it consumes memory like there is no tomorrow.
Time estimation
This is quite a surprise. I seriously did not expect it to perform as well as it does. It is fast and efficient, but I think this is mostly due to the nature of the iterative solver it uses. If a more direct approach were used, I firmly believe the running time would have blown up.
BenchmarkTools.Trial:
memory estimate: 169.82 MiB
allocs estimate: 1521240
--------------
minimum time: 2.146 s (1.10% GC)
median time: 2.160 s (1.34% GC)
mean time: 2.169 s (1.25% GC)
maximum time: 2.202 s (1.31% GC)
--------------
samples: 3
evals/sample: 1
t2 = @benchmark train_svm_hp($Xstd, $y, $train)
Memory allocations
Again, this is not a surprise. By using just a few data instances, the libSVM implementation is very conservative, memory-wise.
Do note, however, that the difference in memory usage is mind-blowing: the least squares implementation allocates roughly 7.12 times more memory than this one.
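For reference, that factor follows directly from the two memory estimates reported above:

(1.18 * 1024) / 169.82   # 1.18 GiB vs 169.82 MiB, both in MiB: ≈ 7.12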
Time estimation
This is really the weird part. I would have expected this implementation to be faster. Maybe it has something to do with the low number of samples in the benchmark. Anyway, the results show that both implementations are equivalent in this regard.
This is great news, because it means that no matter which implementation you use, you will have your results in a timely manner; the time to tune and train is almost the same either way.
Prediction
In this last section I just want to show the prediction time for both implementations.
We are assuming that both models have been tuned correctly.
First, I will wrap the prediction steps in a function that evaluates the prediction accuracy for each model.
function predict_model(y, test, mach)
    # Predict on the test rows and compute the classification accuracy.
    results = MLJ.predict(mach, rows=test)
    acc = MLJ.accuracy(results, y[test])
    return acc
end;
I need a tuned least squares SVM, so I do that first.
mach1 = train_lssvm_hp(Xstd, y, train);
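Since mach1 wraps a TunedModel, we can also peek at what the grid search selected. A minimal sketch, assuming the standard MLJTuning accessors:

fitted_params(mach1).best_model      # the winning LSSVClassifier
report(mach1).best_history_entry     # its cross-validated measurement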
BenchmarkTools.Trial:
memory estimate: 310.78 KiB
allocs estimate: 512
--------------
minimum time: 131.098 μs (0.00% GC)
median time: 139.659 μs (0.00% GC)
mean time: 172.403 μs (9.78% GC)
maximum time: 5.905 ms (94.55% GC)
--------------
samples: 10000
evals/sample: 1
@benchmark predict_model($y, $test, $mach1)
0.9666666666666667
acc1 = predict_model(y, test, mach1)
Results for least squares implementation
Good accuracy, as expected. Very high memory consumption, as expected. The running time is quite high, but nothing to really worry about.
mach2 = train_svm_hp(Xstd, y, train);
BenchmarkTools.Trial:
memory estimate: 18.73 KiB
allocs estimate: 85
--------------
minimum time: 76.689 μs (0.00% GC)
median time: 79.796 μs (0.00% GC)
mean time: 85.311 μs (1.28% GC)
maximum time: 6.371 ms (94.00% GC)
--------------
samples: 10000
evals/sample: 1
@benchmark predict_model($y, $test, $mach2)
0.95
acc2 = predict_model(y, test, mach2)
Results for the libSVM implementation
Good accuracy, as expected. Low memory consumption, as expected.
Now, regarding the running time, this is definitely what I was expecting in the training step. By comparing the median time between both benchmarks, we can clearly see that the libSVM implementation is almost twice as fast as the least squares implementation.
If it is using less data, that means fewer computations. But also, recall that this solver (and its code) is very optimized.
Conclusions
The results presented here are a baseline benchmark on a simple multiclass classification problem.
They are as expected in most cases, although it is good to see LeastSquaresSVM doing well on running time, both in training and in prediction.