Model Validation in ML (Part II)
In this article I will complete what I started in Part I.
In Part I, I covered the following validation methods:
- Re-Substitution
- Train/test split
- k-Fold Cross-Validation
- Leave-one-out Cross-Validation
- Leave-one-group-out Cross-Validation
So let’s continue what we started by covering the following validation methods:
- Random Subsampling
- Bootstrapping
- Nested Cross-Validation
- Time-Series Cross-Validation
- Stratified Cross-Validation
- Wilcoxon signed-rank test
- McNemar’s test
- 5x2CV paired t-test
- 5x2CV combined F test
1. Random Subsampling
In this technique, a random subset of the data is chosen to form the test set, and the remaining data forms the training set; this is repeated for several iterations. The following diagram represents the random subsampling validation technique. The error rate of the model is the average of the error rates across the iterations.
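As a sketch, scikit-learn's `ShuffleSplit` implements exactly this repeated random train/test split. The Iris dataset, the logistic regression model, and the 10 iterations here are illustrative assumptions:

```python
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import ShuffleSplit, cross_val_score

X, y = load_iris(return_X_y=True)

# 10 random train/test splits, each holding out 25% of the data as the test set
cv = ShuffleSplit(n_splits=10, test_size=0.25, random_state=42)
scores = cross_val_score(LogisticRegression(max_iter=1000), X, y, cv=cv)

# The model's error rate is the average error over the iterations
print(f"mean error rate: {1 - scores.mean():.3f}")
```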
2. Bootstrapping
In this technique, the training dataset is sampled from the data at random with replacement. The examples that were not selected for training (the out-of-bag examples) are used for testing. Unlike k-fold cross-validation, the size of the test set is likely to change from iteration to iteration. The error rate of the model is the average of the error rates across the iterations. The following diagram represents the same.
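A minimal sketch of the bootstrap with NumPy: draw n training indices with replacement and evaluate on the out-of-bag examples. The dataset, model, and number of iterations are illustrative choices:

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression

X, y = load_iris(return_X_y=True)
rng = np.random.default_rng(0)
n = len(X)

error_rates = []
for _ in range(20):  # 20 bootstrap iterations (an arbitrary choice)
    # Sample n training indices with replacement
    train_idx = rng.integers(0, n, size=n)
    # Examples never drawn (out-of-bag) form the test set; its size varies per iteration
    test_mask = np.ones(n, dtype=bool)
    test_mask[train_idx] = False
    if not test_mask.any():
        continue
    model = LogisticRegression(max_iter=1000).fit(X[train_idx], y[train_idx])
    error_rates.append(1 - model.score(X[test_mask], y[test_mask]))

print(f"bootstrap error estimate: {np.mean(error_rates):.3f}")
```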
3. Nested Cross-Validation
When you are optimizing the hyperparameters of your model and you use the same k-Fold CV strategy to tune the model and evaluate performance you run the risk of overfitting. You do not want to estimate the accuracy of your model on the same split that you found the best hyperparameters for.
Instead, we use a Nested Cross-Validation strategy, which separates the hyperparameter tuning step from the error estimation step. To do this, we nest two k-fold cross-validation loops:
- The inner loop for hyperparameter tuning and
- the outer loop for estimating accuracy.
The algorithm is as follows:
1. Divide the dataset into K cross-validation folds at random.
2. For each fold k = 1, 2, …, K (outer loop, for evaluation of the model with the selected hyperparameters):
 - 2.1 Let `test` be fold k.
 - 2.2 Let `trainval` be all the data except those in fold k.
 - 2.3 Randomly split `trainval` into L folds.
 - 2.4 For each fold l = 1, 2, …, L (inner loop, for hyperparameter tuning):
   - 2.4.1 Let `val` be fold l.
   - 2.4.2 Let `train` be all the data except those in `test` or `val`.
   - 2.4.3 Train with each hyperparameter setting on `train`, evaluate it on `val`, and keep track of the performance metrics.
 - 2.5 For each hyperparameter setting, calculate the average metric score over the L folds, and choose the best hyperparameter setting.
 - 2.6 Train a model with the best hyperparameters on `trainval`. Evaluate its performance on `test` and save the score for fold k.
3. Calculate the mean score over all K folds, and report it as the generalization error.
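The procedure above maps directly onto scikit-learn by nesting a `GridSearchCV` (the inner loop) inside `cross_val_score` (the outer loop). A minimal sketch, where the Iris dataset, the SVC model, and the parameter grid are illustrative assumptions:

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import GridSearchCV, KFold, cross_val_score
from sklearn.svm import SVC

X, y = load_iris(return_X_y=True)

inner_cv = KFold(n_splits=3, shuffle=True, random_state=1)  # hyperparameter tuning
outer_cv = KFold(n_splits=5, shuffle=True, random_state=1)  # error estimation

# Inner loop: grid search over C, evaluated on the inner folds
param_grid = {"C": [0.1, 1, 10]}
clf = GridSearchCV(SVC(kernel="rbf"), param_grid, cv=inner_cv)

# Outer loop: each fold refits the whole grid search on its own trainval split
scores = cross_val_score(clf, X, y, cv=outer_cv)
print(f"generalization accuracy: {scores.mean():.3f}")
```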
4. Time series cross-validation
Splitting time-series data randomly does not work, because it destroys the temporal ordering of the data. If we are predicting stock prices, for example, a random split will not help. Hence we need a different approach to cross-validation.
For time-series cross-validation we use forward chaining, also referred to as rolling-origin evaluation: the origin on which the forecast is based rolls forward in time.
In time-series cross-validation each day is test data, and the previous days’ data form the training set.
D1, D2, D3, etc. are each day’s data; days highlighted in blue are used for training and days highlighted in yellow are used for testing.
We start training the model with a minimum number of observations, use the next day’s data to test the model, and keep moving through the dataset. This ensures that we respect the time-series nature of the data when making predictions.
NOTE: Make sure your data is ordered according to the time index you use, since you do not supply TimeSeriesSplit with a time index. It will create the splits based simply on the order in which the records appear.
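The forward-chaining scheme above can be sketched with scikit-learn's `TimeSeriesSplit`; the six toy observations stand in for days D1 through D6:

```python
import numpy as np
from sklearn.model_selection import TimeSeriesSplit

# Six ordered observations standing in for days D1..D6
X = np.arange(6).reshape(-1, 1)

# Each split trains on all earlier observations and tests on the next one
tscv = TimeSeriesSplit(n_splits=5)
splits = [(list(train), list(test)) for train, test in tscv.split(X)]
for train, test in splits:
    print(f"train: {train}  test: {test}")
```

The first split trains on D1 and tests on D2; the last trains on D1 through D5 and tests on D6, so later data never leaks into earlier training sets.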
5. Stratified cross-validation
Stratification is a technique where we rearrange the data so that each fold is a good representative of the whole dataset. It forces each fold to have at least m instances of each class. This approach ensures that no single class is over-represented, which matters especially when the target variable is imbalanced.
For example, consider a binary classification problem where we want to predict whether a passenger on the Titanic survived or not. There are two classes: the passenger either survived or did not survive. We ensure that each fold has roughly the same percentage of passengers that survived and that did not survive as the full dataset.
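A small sketch with scikit-learn's `StratifiedKFold`, using a made-up imbalanced label vector in place of the Titanic data:

```python
import numpy as np
from sklearn.model_selection import StratifiedKFold

# Imbalanced binary target: 8 passengers who did not survive (0), 4 who did (1)
y = np.array([0] * 8 + [1] * 4)
X = np.zeros((len(y), 1))  # placeholder features; only the labels drive the split

skf = StratifiedKFold(n_splits=4, shuffle=True, random_state=0)
fold_counts = []
for train_idx, test_idx in skf.split(X, y):
    # Count class occurrences in each test fold
    fold_counts.append(np.bincount(y[test_idx]).tolist())

# Every test fold preserves the 2:1 class ratio of the full dataset
print(fold_counts)  # [[2, 1], [2, 1], [2, 1], [2, 1]]
```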
Comparing Models: which model is better?
When do you consider one model to be better than another? If one model’s accuracy is only insignificantly higher than another’s, is that sufficient reason to choose it?
As a Data Scientist, I want to make sure that I understand if a model is actually significantly more accurate than another. Fortunately, many methods exist that apply statistics to the selection of Machine Learning models.
1. Wilcoxon signed-rank test
One such method is the Wilcoxon signed-rank test which is the non-parametric version of the paired Student’s t-test. It can be used when the sample size is small and the data does not follow a normal distribution.
We can apply this significance test for comparing two Machine Learning models. Using k-fold cross-validation we can create, for each model, k accuracy scores. This will result in two samples, one for each model.
Then, we can use the Wilcoxon signed-rank test to test if the two samples differ significantly from each other. If they do, then one is more accurate than the other.
The result will be a p-value. If that value is lower than 0.05 we can reject the null hypothesis that there are no significant differences between the models.
NOTE: It is important that you keep the same folds between the models to make sure the samples are drawn from the same population. This is achieved by simply setting the same random_state in the cross-validation procedure.
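A sketch of the whole procedure with `scipy.stats.wilcoxon`, where the breast-cancer dataset and the two compared models are illustrative assumptions:

```python
from scipy.stats import wilcoxon
from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import KFold, cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.tree import DecisionTreeClassifier

X, y = load_breast_cancer(return_X_y=True)

# Identical folds for both models, so the paired test compares like with like
cv = KFold(n_splits=10, shuffle=True, random_state=42)
model_a = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))
model_b = DecisionTreeClassifier(random_state=0)

scores_a = cross_val_score(model_a, X, y, cv=cv)
scores_b = cross_val_score(model_b, X, y, cv=cv)

# Paired, non-parametric comparison of the two score samples
stat, p = wilcoxon(scores_a, scores_b)
print(f"Wilcoxon p-value: {p:.4f}")
```

If the printed p-value falls below 0.05, we would reject the null hypothesis of no difference between the two models.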
2. McNemar’s Test
McNemar’s test is used to check the extent to which the predictions of one model and another match. This is referred to as the homogeneity of the contingency table. From that table, we can calculate χ², which can be used to compute the p-value.
Again, if the p-value is lower than 0.05 we can reject the null hypothesis and see that one model is significantly better than the other.
We can use the mlxtend package to create the table and calculate the corresponding p-value:
References:
An Introduction to Statistical Learning by Gareth James, Daniela Witten, Trevor Hastie, Robert Tibshirani