Model Validation in ML (Part I)

Definition:

Mohamed Abdelrazek
7 min readMay 5, 2021

--

In machine learning, model validation refers to the process in which a trained model is evaluated on a testing data set. The testing data set is a separate portion of the same data set from which the training set is derived. The main purpose of the testing data set is to assess the generalization ability of a trained model. Model validation is carried out after model training; together, the two steps aim to find an optimal model with the best performance.

Steps for Machine learning model

Using proper validation techniques helps you understand your model, but most importantly, estimate an unbiased generalization performance.

There is no single validation method that works in all scenarios. It is important to understand whether you are dealing with groups or time-indexed data, and whether you are leaking data in your validation procedure. The basis of all validation techniques is splitting your data when training your model, in order to understand what would happen when your model faces data it has not seen before. If we don’t have held-out data to compare our predictions against, we cannot validate our model.

Which validation method is right for my use case?

When researching these aspects I found plenty of articles describing evaluation techniques, but articles on validation techniques typically stop at k-Fold cross-validation. I would like to show you some different validation methods.

The following methods for validation will be demonstrated:

  • Re-substitution
  • Train/test split
  • k-Fold Cross-Validation
  • Leave-one-out Cross-Validation
  • Leave-one-group-out Cross-Validation
  • Random subsampling
  • Bootstrapping
  • Nested Cross-Validation
  • Time-series Cross-Validation
  • Wilcoxon signed-rank test
  • McNemar’s test
  • 5x2CV paired t-test
  • 5x2CV combined F test

1- Re-substitution

If all the data is used for training the model and the error rate is evaluated by comparing predicted versus actual values on that same training data, the resulting error is called the re-substitution error, and this technique is called re-substitution validation.
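To make this concrete, here is a minimal sketch of measuring the re-substitution error, assuming scikit-learn; the toy iris dataset and the logistic regression model are purely illustrative choices:

```python
# A minimal re-substitution sketch, assuming scikit-learn and its iris toy dataset.
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score

X, y = load_iris(return_X_y=True)

model = LogisticRegression(max_iter=1000)
model.fit(X, y)  # train on ALL the data

# Evaluate on the very same data we trained on: the resulting error is the
# re-substitution error, which is usually an optimistic estimate of true performance.
resub_accuracy = accuracy_score(y, model.predict(X))
print(f"Re-substitution accuracy: {resub_accuracy:.3f}")
```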

2- Train-Test Split method

As I said before, the data we use is usually split into training data and test data. The training set contains a known output, and the model learns on this data so that it can generalize to other data later on. The test data set (or subset) is used to assess the model’s predictions on data it has not seen.

Train-test split explanation
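As a concrete illustration, here is a minimal train/test split sketch, assuming scikit-learn; the 80/20 ratio, the iris dataset, and the random seed are illustrative choices:

```python
# A minimal train/test split sketch, assuming scikit-learn.
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

X, y = load_iris(return_X_y=True)

# Hold out 20% of the data as a test set; shuffle to guard against ordered files.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, shuffle=True, random_state=42
)

model = LogisticRegression(max_iter=1000).fit(X_train, y_train)
print(f"Test accuracy: {accuracy_score(y_test, model.predict(X_test)):.3f}")
```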

However, a train/test split does have its dangers: what if the split we make isn’t random? What if one subset of our data contains only people from a certain state, only employees with a certain income level, only women, or only people of a certain age? (Imagine a file ordered by one of these.) This results in sampling bias, even though we’re trying to estimate generalization! This is where cross-validation comes in.

Sampling bias is a systematic error due to a non-random sample of a population, causing some members of the population to be less likely to be included than others, resulting in a biased sample.

This problem can be avoided by using another validation technique, such as k-fold cross-validation.

3- Holdout Validation

When optimizing the hyperparameters of your model, you might overfit your model if you were to optimize using the train/test split.

Why? Because the model searches for the hyperparameters that fit the specific train/test you made.

To solve this issue, you can create an additional holdout set. This is often 10% of the data which you have not used in any of your processing/validation steps.

The typical steps to execute model validation using Holdout are:

  1. Train the model (or, commonly, multiple candidate models) on the training set. A validation set, i.e. a portion of the training data kept aside, is then used to optimize the hyper-parameters of the models and to evaluate them.
  2. The validation set is thus used to tune the various hyper-parameters and select the best-performing algorithm. However, to fully confirm that the selected algorithm is correct, we apply the model to the held-out test set.
  3. This is necessary because, when we tune the hyper-parameters based on the validation set, we end up slightly overfitting our model to the validation set.
  4. The accuracy obtained on the validation set is therefore not considered final; another hold-out set, the test set, is used to evaluate the final selected model, and the error found there is taken as the generalization error (see the sketch below).
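Here is a sketch of this train/validation/test workflow, assuming scikit-learn; the 60/20/20 proportions, the SVC model, and the hyper-parameter grid are illustrative assumptions:

```python
# A train/validation/test sketch, assuming scikit-learn.
from sklearn.datasets import load_iris
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

X, y = load_iris(return_X_y=True)

# First split off the final test set (20%), then carve a validation set
# (20% of the total) out of what remains, leaving 60% for training.
X_rest, X_test, y_rest, y_test = train_test_split(X, y, test_size=0.2, random_state=0)
X_train, X_val, y_train, y_val = train_test_split(X_rest, y_rest, test_size=0.25, random_state=0)

# Tune a hyper-parameter using only the validation set.
best_C, best_val_acc = None, -1.0
for C in [0.01, 0.1, 1.0, 10.0]:
    model = SVC(C=C).fit(X_train, y_train)
    val_acc = accuracy_score(y_val, model.predict(X_val))
    if val_acc > best_val_acc:
        best_C, best_val_acc = C, val_acc

# Only the final, selected model touches the test set;
# its error is taken as the generalization error.
final_model = SVC(C=best_C).fit(X_train, y_train)
print(f"Best C={best_C}, test accuracy: "
      f"{accuracy_score(y_test, final_model.predict(X_test)):.3f}")
```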

TIP: If you only use a train/test split, I would advise comparing the distributions of your train and test sets. If they differ significantly, you might run into problems with generalization. Use Facets to easily compare their distributions.

4- k-Fold Cross-Validation (k-Fold CV)

In k-Fold Cross-Validation we split our data into k different subsets (or folds). We use k-1 folds to train our model and leave the remaining fold out as test data. We repeat this for each fold, average the scores across the folds, and then finalize our model. After that we test it against the held-out test set.

k-Fold validation, here with k=5

The advantage is that all observations are used for both training and validation, and each observation is used exactly once for validation. We typically choose either k=5 or k=10, as they strike a nice balance between computational complexity and validation accuracy.

TIP: The scores of each fold from cross-validation techniques are more insightful than one may think. They are mostly used to simply extract the average performance. However, one might also look at the variance or standard deviation of the resulting folds as it will give information about the stability of the model across different data inputs.
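The following minimal sketch, assuming scikit-learn, shows k-Fold CV with k=5 and reports both the mean and the standard deviation of the fold scores, in the spirit of the tip above; the dataset and model are illustrative:

```python
# A k-Fold CV sketch, assuming scikit-learn; k=5 is an illustrative choice.
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import KFold, cross_val_score

X, y = load_iris(return_X_y=True)
model = LogisticRegression(max_iter=1000)

cv = KFold(n_splits=5, shuffle=True, random_state=42)
scores = cross_val_score(model, X, y, cv=cv)

# The mean summarizes performance; the standard deviation hints at
# how stable the model is across different data splits.
print(f"Fold scores: {scores}")
print(f"Mean accuracy: {scores.mean():.3f} +/- {scores.std():.3f}")
```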

Side note:

Two types of cross-validation can be distinguished: exhaustive and non-exhaustive. Exhaustive cross-validation methods learn and test on all possible ways to divide the original sample into a training and a validation set, such as Leave-p-out and Leave-one-out cross-validation. Non-exhaustive cross-validation methods do not compute all ways of splitting the original sample; they include the holdout method and k-fold cross-validation.

5- Leave-one-out Cross-Validation (LOOCV)

A variant of k-Fold CV is Leave-one-out Cross-Validation (LOOCV). LOOCV uses each sample in the data as a separate test set while all remaining samples form the training set. This variant is identical to k-fold CV when k = n (number of observations).

Leave-one-out Cross-Validation

LOOCV is a variant of K fold where k=n.

Source: Introduction to Statistical Learning. The blue line is the true test error, the black dashed line is the LOOCV test error, and the orange line is the 10-fold CV test error.

NOTE: LOOCV is computationally very costly as the model needs to be trained n times. Only do this if the data is small or if you can handle that many computations.
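Here is a minimal LOOCV sketch, assuming scikit-learn; note that it fits the model once per sample, so the dataset and model below are deliberately small, illustrative choices:

```python
# A LOOCV sketch, assuming scikit-learn; this trains the model n times.
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import LeaveOneOut, cross_val_score

X, y = load_iris(return_X_y=True)
model = LogisticRegression(max_iter=1000)

# Each of the n samples is used exactly once as the test set.
scores = cross_val_score(model, X, y, cv=LeaveOneOut())
print(f"LOOCV accuracy over {len(scores)} fits: {scores.mean():.3f}")
```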

Advantages of LOOCV

  • Far less bias, as we use (almost) the entire dataset for training, compared to the validation set approach where only a subset of the data is used for training.
  • No randomness in the training/test splits: performing LOOCV multiple times will yield the same results.

Disadvantages of LOOCV

  • The MSE will vary because each test set consists of a single observation. This introduces variability; if the data point is an outlier, the variability will be much higher.
  • Execution is expensive, as the model has to be fitted n times.

6- Leave-one-group-out Cross-Validation (LOGOCV)

The issue with k-Fold CV is that it does not guarantee that each fold contains only a single group. For example, let’s say you have a dataset of 20 companies and their clients and you want to predict the success of these companies. To keep the folds “pure” and only contain a single company, you would create a fold for each company. That way, you create a version of k-Fold CV and LOOCV where you leave one company/group out.

Leave-one-group-out Cross-Validation
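A minimal LOGOCV sketch, assuming scikit-learn; the group labels below are synthetic stand-ins for something like a company ID attached to each row:

```python
# A Leave-one-group-out CV sketch, assuming scikit-learn.
import numpy as np
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import LeaveOneGroupOut, cross_val_score

X, y = load_iris(return_X_y=True)
# Pretend every row belongs to one of 5 groups (e.g. 5 companies).
groups = np.arange(len(y)) % 5

model = LogisticRegression(max_iter=1000)
# Each fold holds out all the rows of exactly one group.
scores = cross_val_score(model, X, y, groups=groups, cv=LeaveOneGroupOut())
print(f"Per-group scores: {scores}")
```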

Conclusion:

Which method should we use? How many folds? Well, the more folds we have, the more we reduce the error due to bias but increase the error due to variance; the computational price goes up too, obviously: the more folds you have, the longer it takes to compute and the more memory you need. With a lower number of folds, we reduce the error due to variance, but the error due to bias will be bigger, and it is also computationally cheaper. Therefore, for big datasets, k=3 is usually advised. For smaller datasets, as I mentioned before, it’s best to use LOOCV.
