Model Validation in ML (Part II)
In this article I will complete what I started in Part I.
In Part I, I covered the following validation methods:
- Re-Substitution
- Train/test split
- k-Fold Cross-Validation
- Leave-one-out Cross-Validation
- Leave-one-group-out Cross-Validation
So let’s continue what we started by covering the following validation methods:
- Random Subsampling
- Bootstrapping
- Nested Cross-Validation
- Time-Series Cross-Validation
- Stratified Cross-Validation
- Wilcoxon signed-rank test
- McNemar’s test
- 5x2CV paired t-test
- 5x2CV combined F test
1. Random Subsampling
In this technique, a random subset of the data is chosen to form the test set, and the remaining data forms the training set; this is repeated for several iterations. The following diagram represents the random subsampling validation technique. The error rate of the model is the average of the error rates across the iterations.
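As a sketch, scikit-learn's `ShuffleSplit` implements exactly this repeated random train/test split. The Iris dataset, the logistic regression model, and the 10 iterations here are illustrative assumptions:

```python
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import ShuffleSplit, cross_val_score

X, y = load_iris(return_X_y=True)

# 10 random train/test splits, each holding out 25% of the data as the test set
cv = ShuffleSplit(n_splits=10, test_size=0.25, random_state=42)
scores = cross_val_score(LogisticRegression(max_iter=1000), X, y, cv=cv)

# The model's error rate is the average error over the iterations
print(f"mean error rate: {1 - scores.mean():.3f}")
```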
2. Bootstrapping
In this technique, the training dataset is sampled from the data at random with replacement. The examples that were not selected for training (the out-of-bag examples) are used for testing. Unlike k-fold cross-validation, the size of the test set is likely to change from iteration to iteration. The error rate of the model is the average of the error rates across the iterations. The following diagram represents the same.
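A minimal sketch of the bootstrap with NumPy: draw n training indices with replacement and evaluate on the out-of-bag examples. The dataset, model, and number of iterations are illustrative choices:

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression

X, y = load_iris(return_X_y=True)
rng = np.random.default_rng(0)
n = len(X)

error_rates = []
for _ in range(20):  # 20 bootstrap iterations (an arbitrary choice)
    # Sample n training indices with replacement
    train_idx = rng.integers(0, n, size=n)
    # Examples never drawn (out-of-bag) form the test set; its size varies per iteration
    test_mask = np.ones(n, dtype=bool)
    test_mask[train_idx] = False
    if not test_mask.any():
        continue
    model = LogisticRegression(max_iter=1000).fit(X[train_idx], y[train_idx])
    error_rates.append(1 - model.score(X[test_mask], y[test_mask]))

print(f"bootstrap error estimate: {np.mean(error_rates):.3f}")
```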
3. Nested Cross-Validation
When you are optimizing the hyperparameters of your model and you use the same k-Fold CV strategy to tune the model and evaluate performance you run the risk of overfitting. You do not want to estimate the accuracy of your model on the same split that you found the best hyperparameters for.
Instead, we use a Nested Cross-Validation strategy, which separates the hyperparameter tuning step from the error estimation step. To do this, we nest two k-fold cross-validation loops:
- The inner loop for hyperparameter tuning and
- the outer loop for estimating accuracy.
The algorithm is as follows:
1. Divide the dataset into K cross-validation folds at random.
2. For each fold k = 1, 2, …, K (outer loop, for evaluation of the model with the selected hyperparameters):
 - 2.1 Let `test` be fold k.
 - 2.2 Let `trainval` be all the data except those in fold k.
 - 2.3 Randomly split `trainval` into L folds.
 - 2.4 For each fold l = 1, 2, …, L (inner loop, for hyperparameter tuning):
   - 2.4.1 Let `val` be fold l.
   - 2.4.2 Let `train` be all the data except those in `test` or `val`.
   - 2.4.3 Train with each hyperparameter setting on `train`, evaluate it on `val`, and keep track of the performance metrics.
 - 2.5 For each hyperparameter setting, calculate the average metric score over the L folds, and choose the best hyperparameter setting.
 - 2.6 Train a model with the best hyperparameters on `trainval`. Evaluate its performance on `test` and save the score for fold k.
3. Calculate the mean score over all K folds, and report it as the generalization error.
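The procedure above maps directly onto scikit-learn by nesting a `GridSearchCV` (the inner loop) inside `cross_val_score` (the outer loop). A minimal sketch, where the Iris dataset, the SVC model, and the parameter grid are illustrative assumptions:

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import GridSearchCV, KFold, cross_val_score
from sklearn.svm import SVC

X, y = load_iris(return_X_y=True)

inner_cv = KFold(n_splits=3, shuffle=True, random_state=1)  # hyperparameter tuning
outer_cv = KFold(n_splits=5, shuffle=True, random_state=1)  # error estimation

# Inner loop: grid search over C, evaluated on the inner folds
param_grid = {"C": [0.1, 1, 10]}
clf = GridSearchCV(SVC(kernel="rbf"), param_grid, cv=inner_cv)

# Outer loop: each fold refits the whole grid search on its own trainval split
scores = cross_val_score(clf, X, y, cv=outer_cv)
print(f"generalization accuracy: {scores.mean():.3f}")
```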
4. Time series cross-validation
Splitting time-series data randomly does not work, because it destroys the temporal ordering of the data. If we are predicting stock prices, for example, a random split will not help. Hence we need a different approach to cross-validation.
For time-series cross-validation we use forward chaining, also referred to as rolling-origin evaluation: the origin on which the forecast is based rolls forward in time.
In time-series cross-validation each day is test data, and the previous days’ data form the training set.
D1, D2, D3, etc. are each day’s data; days highlighted in blue are used for training and days highlighted in yellow are used for testing.
We start training the model with a minimum number of observations, use the next day’s data to test the model, and keep moving through the dataset. This ensures that we respect the time-series nature of the data when making predictions.
NOTE: Make sure your data is ordered according to the time index you use, since you do not supply TimeSeriesSplit with a time index. It will create the splits based simply on the order in which the records appear.
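The forward-chaining scheme above can be sketched with scikit-learn's `TimeSeriesSplit`; the six toy observations stand in for days D1 through D6:

```python
import numpy as np
from sklearn.model_selection import TimeSeriesSplit

# Six ordered observations standing in for days D1..D6
X = np.arange(6).reshape(-1, 1)

# Each split trains on all earlier observations and tests on the next one
tscv = TimeSeriesSplit(n_splits=5)
splits = [(list(train), list(test)) for train, test in tscv.split(X)]
for train, test in splits:
    print(f"train: {train}  test: {test}")
```

The first split trains on D1 and tests on D2; the last trains on D1 through D5 and tests on D6, so later data never leaks into earlier training sets.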
5. Stratified cross-validation
Stratification is a technique where we rearrange the data so that each fold is a good representative of the whole dataset. It forces each fold to have at least m instances of each class. This approach ensures that no single class is over-represented, which matters especially when the target variable is imbalanced.
For example, consider a binary classification problem where we want to predict whether a passenger on the Titanic survived or not. There are two classes: the passenger either survived or did not survive. We ensure that each fold has roughly the same percentage of passengers that survived and that did not survive as the full dataset.
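A small sketch with scikit-learn's `StratifiedKFold`, using a made-up imbalanced label vector in place of the Titanic data:

```python
import numpy as np
from sklearn.model_selection import StratifiedKFold

# Imbalanced binary target: 8 passengers who did not survive (0), 4 who did (1)
y = np.array([0] * 8 + [1] * 4)
X = np.zeros((len(y), 1))  # placeholder features; only the labels drive the split

skf = StratifiedKFold(n_splits=4, shuffle=True, random_state=0)
fold_counts = []
for train_idx, test_idx in skf.split(X, y):
    # Count class occurrences in each test fold
    fold_counts.append(np.bincount(y[test_idx]).tolist())

# Every test fold preserves the 2:1 class ratio of the full dataset
print(fold_counts)  # [[2, 1], [2, 1], [2, 1], [2, 1]]
```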
Comparing Models: which model is better?
When do you consider one model to be better than another? If one model’s accuracy is only insignificantly higher than another’s, is that sufficient reason to choose it?
As a Data Scientist, I want to make sure that I understand if a model is actually significantly more accurate than another. Fortunately, many methods exist that apply statistics to the selection of Machine Learning models.
1. Wilcoxon signed-rank test
One such method is the Wilcoxon signed-rank test which is the non-parametric version of the paired Student’s t-test. It can be used when the sample size is small and the data does not follow a normal distribution.
We can apply this significance test for comparing two Machine Learning models. Using k-fold cross-validation we can create, for each model, k accuracy scores. This will result in two samples, one for each model.
Then, we can use the Wilcoxon signed-rank test to test if the two samples differ significantly from each other. If they do, then one is more accurate than the other.
The result will be a p-value. If that value is lower than 0.05 we can reject the null hypothesis that there are no significant differences between the models.
NOTE: It is important that you keep the same folds between the models to make sure the samples are drawn from the same population. This is achieved by simply setting the same random_state in the cross-validation procedure.
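A sketch of the whole procedure with `scipy.stats.wilcoxon`, where the breast-cancer dataset and the two compared models are illustrative assumptions:

```python
from scipy.stats import wilcoxon
from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import KFold, cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.tree import DecisionTreeClassifier

X, y = load_breast_cancer(return_X_y=True)

# Identical folds for both models, so the paired test compares like with like
cv = KFold(n_splits=10, shuffle=True, random_state=42)
model_a = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))
model_b = DecisionTreeClassifier(random_state=0)

scores_a = cross_val_score(model_a, X, y, cv=cv)
scores_b = cross_val_score(model_b, X, y, cv=cv)

# Paired, non-parametric comparison of the two score samples
stat, p = wilcoxon(scores_a, scores_b)
print(f"Wilcoxon p-value: {p:.4f}")
```

If the printed p-value falls below 0.05, we would reject the null hypothesis of no difference between the two models.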
2. McNemar’s Test
McNemar’s test is used to check the extent to which the predictions of one model and another match. This is referred to as the homogeneity of the contingency table. From that table, we can calculate χ², which can be used to compute the p-value.
Again, if the p-value is lower than 0.05 we can reject the null hypothesis and see that one model is significantly better than the other.
We can use the mlxtend package to create the table and calculate the corresponding p-value:
References:
An Introduction to Statistical Learning by Gareth James, Daniela Witten, Trevor Hastie, Robert Tibshirani