
Model Cross Validation is a term used in Machine Learning for estimating the predictive accuracy of a model when it is fed with real-life data.

 

An application-oriented question on the topic along with responses can be seen below. The best answer was provided by Glory Gerald on 27th Oct 2020.

 

Applause for all the respondents - Aritra Das Gupta, Glory Gerald, Sanat Kumar.

 

Also review the answer provided by Mr Venugopal R, Benchmark Six Sigma's in-house expert.
 

Question

Q 308. What is model cross-validation? What are the different methods available? Discuss advantages and disadvantages of each method.

 

Note for website visitors - Two questions are asked every week on this platform. One on Tuesday and the other on Friday.


5 answers to this question


Introduction: In Machine Learning it has become very common practice to test various models in search of a better-performing one. When one model scores better than another, it can be hard to tell whether it has captured the relationships in the data better or is simply overfitting the data. Validation techniques help us out of this dilemma, and are also helpful in achieving generalized relationships.

 

What is Model Cross Validation?

Cross Validation is a technique where a particular subset of the data is reserved as a sample on which the model is not trained; the same sample is used later as a validation sample to test the model before finalizing it.

Here the basic idea is to divide the data into two sets, the Training set and the Validation set.

  • The training set is used to train the model
  • The validation set is used to validate the model by estimating the prediction error

Common validation techniques, and the pros and cons of using each of them:

 

1) The Validation Set Approach (Data Split): In this approach the data is randomly split into two sets: one set, 50% of the dataset, is used to train the model, and the remaining 50% is used to test it.

 

 Pros:

  • The technique is useful when you have a large data set that can be partitioned. 

Cons:  

  • The test error rate can be highly variable, depending on which observations are included in the training set and which are included in the validation set. There is therefore a high chance of missing out on some interesting information about the data, which will lead to a higher bias.
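As an illustration, here is a minimal sketch of this approach in Python with scikit-learn (the iris dataset, the model choice and the 50/50 ratio are assumptions for demonstration):

```python
# Validation set approach: one random 50/50 split (illustrative values).
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

X, y = load_iris(return_X_y=True)

# Train on one half, estimate the prediction error on the other half.
X_train, X_val, y_train, y_val = train_test_split(
    X, y, test_size=0.5, random_state=42)

model = LogisticRegression(max_iter=1000).fit(X_train, y_train)
print("Validation accuracy:", accuracy_score(y_val, model.predict(X_val)))
```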

 

2) Leave one out Cross Validation (LOOCV): In this approach only one data point from the dataset is reserved and the rest of the data is used to train the model, recording the test error associated with the prediction. This process is repeated for every data point. Finally, the overall prediction error is computed by taking the average of all the recorded test error estimates.

 

Pros:  

  • As all the data points are used, the bias will be low.

Cons:

  • As the process is repeated n times, where n is the number of data points, it results in a higher execution time.
  • This approach might lead to higher variation in the prediction error, as we test the model's performance against a single data point at each iteration. If some data points are outliers, the estimate is highly influenced by them, leading to higher variation.
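A minimal LOOCV sketch with scikit-learn, assuming the same toy dataset and model as above:

```python
# LOOCV: each data point is held out once; n iterations for n points.
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import LeaveOneOut, cross_val_score

X, y = load_iris(return_X_y=True)

scores = cross_val_score(LogisticRegression(max_iter=1000), X, y,
                         cv=LeaveOneOut())
print("Iterations run:", len(scores))         # equals the number of points
print("Mean LOOCV accuracy:", scores.mean())  # average of all test estimates
```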

 

3) k fold Cross Validation: This approach addresses the cons of the two approaches above. Here the model's performance on different subsets of the training data is evaluated, and the average prediction error rate is then calculated.

 

The following steps are followed in this approach:

  • The data is split randomly into k folds/subsets.
  • One subset is reserved and the rest of the subsets are used to train the model.
  • Record the error observed on each of the predictions.
  • The process is repeated until each of the k subsets has been tested.
  • Finally, the average of the k recorded errors is taken to compute the cross-validation error, which serves as the performance metric for the model.

Pros:

  • This approach is a robust method for estimating the accuracy of a model. It generally gives more accurate estimates of the test error rate than the LOOCV method does.
  • k fold CV is computationally cheaper when compared to LOOCV.

Cons:

  • A randomly selected fold might not adequately represent the minority class, especially in cases where there is a huge class imbalance.

A lower value of k takes us towards the validation set approach, which is more biased and hence undesirable, whereas a higher value of k leads towards the LOOCV approach, which is comparatively less biased but carries a chance of high variation.
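A minimal k-fold sketch with scikit-learn; k = 5 is an assumed, commonly used value:

```python
# k-fold CV: each of the k folds serves once as the test set.
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import KFold, cross_val_score

X, y = load_iris(return_X_y=True)

kf = KFold(n_splits=5, shuffle=True, random_state=42)
scores = cross_val_score(LogisticRegression(max_iter=1000), X, y, cv=kf)
print("Per-fold accuracy:", scores)
print("Cross-validation error estimate:", 1 - scores.mean())
```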

 

4) Repeated k fold Cross Validation: In this approach the process of splitting the data into k folds is repeated n times, resulting in n random partitions of the original sample. The results are then averaged to come up with a single estimate.

 

Pros:

  • This method is advantageous if the train set does not adequately represent the entire population.
  • The selected folds can be a good representation of the whole dataset.
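A minimal repeated k-fold sketch with scikit-learn; k = 5 folds and n = 10 repeats are assumed values:

```python
# Repeated k-fold: k folds re-drawn n times -> k*n train/test partitions.
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import RepeatedKFold, cross_val_score

X, y = load_iris(return_X_y=True)

rkf = RepeatedKFold(n_splits=5, n_repeats=10, random_state=42)
scores = cross_val_score(LogisticRegression(max_iter=1000), X, y, cv=rkf)
print("Total partitions:", len(scores))     # 5 folds x 10 repeats = 50
print("Single averaged estimate:", scores.mean())
```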

 

5) Stratified k fold Cross Validation: In this approach the data is rearranged to ensure that each fold is a good representation of the whole dataset, for example by preserving the class proportions in each fold.

 

Pros:

  • A better approach when dealing with both bias and variance.
  • The selected folds are a good representation of the whole dataset.

Cons:

  • If a selected fold is still not a good representative of the whole dataset, it is advisable to use the Repeated k fold Cross Validation technique instead.
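A minimal stratified k-fold sketch; the imbalanced toy labels are an assumption chosen to show that each fold preserves the class proportions:

```python
# Stratified k-fold: every test fold keeps the overall 80/20 class ratio.
import numpy as np
from sklearn.model_selection import StratifiedKFold

X = np.arange(20).reshape(-1, 1)
y = np.array([0] * 16 + [1] * 4)   # 80/20 class imbalance

skf = StratifiedKFold(n_splits=4, shuffle=True, random_state=42)
for fold, (train_idx, test_idx) in enumerate(skf.split(X, y)):
    # Each test fold gets 4 samples of class 0 and 1 sample of class 1.
    print(f"Fold {fold}: test labels = {y[test_idx]}")
```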

 

6) Adversarial Validation: This approach checks the degree of similarity between the test/validation and train sets in terms of feature distribution. If there is not much similarity, we can suspect that the datasets are quite different from each other. This can be quantified by combining the train and test sets, assigning labels (0 = train, 1 = test) and evaluating a binary classification task.

 

Pros:

  • This technique is useful when other cross-validation techniques give scores that are not even in the ballpark of the test score, which happens when the test and train sets are quite different or highly dissimilar; using adversarial validation makes the validation strategy more robust.

Cons:

  • This technique may not be useful if the distribution of the test set changes, as the validation set might no longer be a good subset to evaluate the model.
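A minimal adversarial validation sketch; the random data, the shift between the two sets and the classifier choice are assumptions for illustration. An AUC near 0.5 suggests similar distributions, while an AUC near 1.0 flags very different ones:

```python
# Adversarial validation: can a classifier tell train rows from test rows?
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(42)
X_train = rng.normal(0.0, 1.0, size=(500, 5))  # "train" distribution
X_test = rng.normal(0.5, 1.0, size=(500, 5))   # shifted "test" distribution

# Combine the sets and assign labels: 0 = train, 1 = test.
X_all = np.vstack([X_train, X_test])
y_all = np.array([0] * len(X_train) + [1] * len(X_test))

auc = cross_val_score(RandomForestClassifier(random_state=42),
                      X_all, y_all, cv=5, scoring="roc_auc").mean()
print("Adversarial AUC:", auc)  # well above 0.5 here => dissimilar sets
```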

 

7) Cross Validation for Time Series: A time series dataset cannot be randomly split, because the temporal ordering of the data would be broken. For a time series forecasting problem, the folds for cross-validation are created in a forward-chaining fashion.

 

Pros:

  • Recommended technique for a time series data set.
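A minimal forward-chaining sketch with scikit-learn's TimeSeriesSplit; the 10-point series and n_splits = 3 are assumptions for illustration:

```python
# Forward chaining: training data always precedes the test data in time.
import numpy as np
from sklearn.model_selection import TimeSeriesSplit

X = np.arange(10).reshape(-1, 1)   # observations in time order

tscv = TimeSeriesSplit(n_splits=3)
for train_idx, test_idx in tscv.split(X):
    # The train window grows; no future observation leaks into training.
    print("train:", train_idx, "-> test:", test_idx)
```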

 

8) Custom Cross Validation Techniques: There is no single method that works best for all kinds of problem statements, hence a custom cross-validation technique can be created based on a feature, or a combination of features, that gives the user stable cross-validation scores.
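One possible custom scheme, sketched with scikit-learn's GroupKFold: splitting on a grouping feature (a hypothetical customer id here) so that all records of a group stay in the same fold. The feature used is an assumption; any feature or combination that stabilizes the scores could play this role:

```python
# Custom CV via grouping: whole groups are held out together.
import numpy as np
from sklearn.model_selection import GroupKFold

X = np.arange(12).reshape(-1, 1)
y = np.array([0, 1] * 6)
groups = np.array([1, 1, 1, 2, 2, 2, 3, 3, 3, 4, 4, 4])  # e.g. customer id

gkf = GroupKFold(n_splits=4)
for train_idx, test_idx in gkf.split(X, y, groups=groups):
    print("held-out group:", set(groups[test_idx]))
```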

 

Conclusion: In order to achieve a better predictive model, we must always focus on achieving a balance between bias and variance. Cross-validation is, in other words, also called a resampling method, as it involves fitting the same statistical method multiple times using different subsets of the data.

 


Benchmark Six Sigma Expert View by Venugopal R

The Context

Model Cross Validation is a phrase used in the context of Machine Learning. In order to perform a Machine Learning exercise, we need a large number of historical data sets pertaining to the model. For example, a software company wants to know, of all the customers who used their trial version, how many are likely to convert to the paid version. They identify certain characteristics, which may include customer data and usage patterns, for those who have registered as trial users, and note which of them converted to the paid version.

 

Once such a model is deployed, we are interested in the accuracy of the decisions it makes; this needs to be validated before deployment and also on an ongoing basis. A large amount of such data may be used to train the machine learning model, while a portion of the data may be used for testing the effectiveness of the classification.

 

Train-Test Split

The broad stages for preparing data for machine learning include Data Gathering, Feature Engineering, Feature Selection, Model Creation and Model Deployment. As part of Model Creation, a decision on the ‘Train-Test split’ is taken; i.e. a portion of the data records will be identified for use in training (the machine) and the remaining portion for use in testing (the accuracy of the model).

 

Various methods have evolved for this train-test split for model cross validation. We will discuss a few of them as follows.

 

Repeated Random state based split

For example, 70% of the data may be used for training and 30% for testing. This ratio can vary for different situations. We can perform several 'randomized picks' of the train data and the test data by defining 'random state' numbers. Each random state number represents a different randomization, and the advantage of this method is that we will obtain the same randomization whenever a particular random state number is repeated. The accuracy levels obtained by performing such repeated train-test cycles may be averaged, and we can also obtain the maximum and minimum estimates. The disadvantage of this method is that any bias that survives the randomization will influence the accuracy results: some records may never get selected into the test sample, while others could get selected multiple times.
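A minimal sketch of such repeated random-state based splits; the 70/30 ratio, the model and the list of random states are assumed values. Re-running with the same random_state reproduces the identical split:

```python
# Repeated random-state splits: average, min and max of the accuracies.
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

X, y = load_iris(return_X_y=True)

scores = []
for state in [0, 1, 2, 3, 4]:        # each state = one fixed randomization
    X_tr, X_te, y_tr, y_te = train_test_split(
        X, y, test_size=0.3, random_state=state)
    model = LogisticRegression(max_iter=1000).fit(X_tr, y_tr)
    scores.append(accuracy_score(y_te, model.predict(X_te)))

print("Average accuracy:", sum(scores) / len(scores))
print("Min/Max estimates:", min(scores), max(scores))
```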

 

Leave One Out Cross Validation (LOOCV)

In this method, one record at a time is selected for testing and all the remaining records are used for training. The advantage is that every record gets the opportunity to act as a test sample. The number of iterations required by this method equals the number of records. The disadvantage is the requirement for high computing power. Another concern is that, since the train data contains all the records except one, the method results in ‘low bias’, which could cause accuracy issues when a new data set is loaded into the system.

 

K-fold Cross Validation

In this method, a ‘k’ value is decided for the data set under consideration. The ‘k’ value is the number of iterations for which we want to run the test. If the total number of records is N, then for each iteration we select N/k records as the test data and use the remaining records as train data. For example, if we have 1000 records and decide on a ‘k’ value of 5, we will run 5 iterations, taking N/k = 200 records as the test data each time: the first 200 records for the first iteration, the second 200 records for the second iteration, and so on. The advantage of this method over LOOCV is that the number of iterations is much smaller. The accuracy values for each of the iterations are obtained and their average is taken as the accuracy for the model; we may note the maximum and minimum accuracy values as well. The disadvantage is that if a pattern of change exists across the records from first to last, it can impact the accuracies, due to the significant variation between the test data sets.

 

Stratified Cross Validation

Let’s consider a data set where each record needs to be classified into two classes, viz. Yes and No. Ideally, we would like to train the system in such a manner that it correctly classifies all the ‘Yes’ cases and all the ‘No’ cases. Sometimes we have more classes, and we want to train the system to correctly classify each class. The advantage of the stratified CV method is that the train sample and the test sample are selected in such a way that a reasonable representation of each class of records is maintained. This way, we train and evaluate the system for its capability to perform on each of the classes present in the data set.

 

Time Series cross validation

This method is used when the data is a time series, for example stock prices over time. In such data sets, the train-test split cannot be done in the same way as in the earlier methods, since the test data always has to be taken from the latest observations in the series. We can take either the last observation or the last few; all the preceding data becomes the train data. As new data keeps getting added, the train data keeps growing while the test data is always the last one or last few observations. There is a variant of this method known as the “Sliding window” method, where the amount of train data is kept constant. This is done by omitting the earliest observation whenever a new one gets added at the end.
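A minimal sketch of the sliding-window variant, assuming a recent scikit-learn where TimeSeriesSplit accepts max_train_size and test_size; the window length of 4 is an assumed value:

```python
# Sliding window: the train window keeps a fixed length and slides forward.
import numpy as np
from sklearn.model_selection import TimeSeriesSplit

X = np.arange(12).reshape(-1, 1)   # time-ordered observations

tscv = TimeSeriesSplit(n_splits=4, max_train_size=4, test_size=2)
for train_idx, test_idx in tscv.split(X):
    # Oldest observations drop out as newer ones are appended.
    print("train:", train_idx, "-> test:", test_idx)
```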

 

The above are a few of the methods used for model cross-validation; there are more.


Cross validation is a method by which a model is tested, ensuring that the model is robust. In this method a portion of the sample is held out to check the effectiveness of the model.

 

After a model is built, a data set is picked to check the effectiveness of the model. Ordinarily, the entire data set would be used to build and check the model. In cross validation, however, a part of the data is retained. For example, in a cross validation maybe 80% of the data is used whereas 20% of the data is retained.

 

K Fold Cross Validation - In this the data set is divided into K equal parts. For example, with K = 5, each part is 1/K = 1/5 = 20% of the data. A total of 80% of the data is sent to train the model whereas the remaining 20% is retained. The process is to score the model based on 80% of the sample; once that is finished, the remaining 20% which was retained is run through the model and scored.

 

Monte Carlo Cross Validation - In Monte Carlo the data set is divided into parts as in K Fold cross validation; however, in K Fold this is only done once, whereas in Monte Carlo it is repeated multiple times, with the sample picked independently each time. For example, given 25 data points and groups of 5, a data set is randomly selected and the results are noted; then the data set is created again using a new random sample and the model is checked again.
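A minimal Monte Carlo sketch using scikit-learn's ShuffleSplit, mirroring the example above: 25 data points with test groups of 5 and an independent random split on every repetition (the repeat count of 10 is an assumption):

```python
# Monte Carlo CV: repeated, independently drawn random train/test splits.
import numpy as np
from sklearn.model_selection import ShuffleSplit

X = np.arange(25).reshape(-1, 1)

ss = ShuffleSplit(n_splits=10, test_size=5, random_state=42)
for i, (train_idx, test_idx) in enumerate(ss.split(X)):
    print(f"Repetition {i}: test points = {sorted(test_idx)}")
```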

 

The advantage is that this method is very effective when an iterative approach is needed, and the model is repeatedly tested for accuracy. The disadvantage is that it can be computationally very expensive.

 

 


Model cross validation is a method commonly used in Machine Learning which helps in estimating the variability (consistency) and reliability (performance of the model over a period of time) of a model.

While creating a model there are two steps:

1)      Train the model (which means estimating the parameters)

2)      Test the algorithm (evaluate the performance)

 

Out of the total data collected, part is used for creating the model and part for testing the model. Usually multiple sets of data are drawn from the population and testing is performed on these sets using different methods, which is what is called cross validation. This helps in understanding which model is reliable and accurate.

 

There are mainly two types of cross validation: exhaustive and non-exhaustive.

 

Exhaustive – methods which test all possible ways of dividing the sample into the various sets used for training and testing

1)      Leave p out cross-validation: in this approach p observations of the data are left out and the remaining data is used for training the model. This is repeated for every possible choice of p observations from the original sample (see the sketch after this list).

Pros: Simple and easy to implement

Cons: Time-consuming approach

2)      Leave one out cross-validation: in this approach 1 observation of the data is left out and the remaining “n-1” observations are used for training the model. This is repeated throughout the original sample.

Pros: Simple and easy to implement

Cons: Time-consuming approach
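A minimal sketch of both exhaustive schemes with scikit-learn; the tiny 4-point data set and p = 2 are assumptions. LeavePOut grows combinatorially, which is why both approaches are time-consuming on real data:

```python
# Exhaustive CV: every possible hold-out set is enumerated.
import numpy as np
from sklearn.model_selection import LeaveOneOut, LeavePOut

X = np.arange(4).reshape(-1, 1)

print("Leave-one-out splits:", LeaveOneOut().get_n_splits(X))  # 4
print("Leave-2-out splits:", LeavePOut(p=2).get_n_splits(X))   # C(4,2) = 6
for train_idx, test_idx in LeavePOut(p=2).split(X):
    print("train:", train_idx, "test:", test_idx)
```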

 

Non-Exhaustive – methods which do not test all possible ways of dividing the sample for training and testing

 

1)      K fold cross validation – in this the original dataset is divided into K sets; one set is used for testing and the remaining “k-1” are used for training

Pros: Low bias, less complexity, and the entire data is used for training and testing

Cons: Not advised for imbalanced data sets

2)      Hold out cross validation – two data sets are randomly created from the original data, one for testing and one for training

Pros: Simple and easy to implement

Cons: A lot of data is not used for training (creating the model), which might negatively impact the accuracy of the model

3)      Stratified k-fold cross validation: This approach is used for imbalanced data sets. The original data is divided into K sets while ensuring that no particular class or instance is over-represented when the data set is imbalanced. Apart from the test set, all the other data is used for training

Pros: Well suited for imbalanced data

Cons: Not suitable for time series data


Glory Gerald has provided the best answer to this question by providing the pluses and minuses for all the various model validation methods.

 


 

 
