Cross Validation and Grid Search for Model Selection in Python
Introduction
A typical machine learning process involves training several models on a dataset and selecting the one with the best performance. However, evaluating an algorithm's performance is not always a straightforward task. There are several factors that can help you determine which algorithm performs best. One such factor is performance on a cross validation set, and another is the choice of parameters for the algorithm.
In this article we will explore these two factors in detail. We will first study what cross validation is, why it is necessary, and how to perform it via Python’s Scikit-Learn library. We will then move on to the Grid Search algorithm and see how it can be used to automatically select the best parameters for an algorithm.
Cross Validation
Normally in a machine learning process, data is divided into training and test sets; the training set is used to train the model and the test set is used to evaluate its performance. However, this approach may lead to variance problems. In simpler words, a variance problem refers to the scenario where the accuracy obtained on one test set is very different from the accuracy obtained on another test set using the same algorithm.
The solution to this problem is to use K-Fold Cross-Validation for performance evaluation, where K is any number. The process of K-Fold Cross-Validation is straightforward. You divide the data into K folds. Out of the K folds, K-1 are used for training while the remaining fold is used for testing. The algorithm is trained and tested K times; each time a different fold serves as the test set while the remaining folds are used for training. Finally, the result of K-Fold Cross-Validation is the average of the results obtained on each fold.
Suppose we want to perform 5-fold cross validation. To do so, the data is divided into 5 sets; for instance, we name them SET A, SET B, SET C, SET D, and SET E. The algorithm is then trained and tested 5 times. In the first fold, SET A to SET D are used as the training set and SET E is used as the test set.
In the second fold, SET A, SET B, SET C, and SET E are used for training and SET D is used for testing. The process continues until every set has been used exactly once for testing (and K-1 times for training). The final result is the average of the results obtained on all folds. This way the variance of the evaluation is reduced, and the standard deviation of the per-fold results tells us how much the score varies from fold to fold.
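To make the fold construction concrete, here is a minimal sketch using Scikit-Learn's KFold class on a small toy array (the toy data and shuffling settings are just for illustration):
import numpy as np
from sklearn.model_selection import KFold

data = np.arange(10)  # ten toy samples, purely for illustration

# 5 folds: each sample ends up in the test set exactly once
kf = KFold(n_splits=5, shuffle=True, random_state=0)
for fold_number, (train_idx, test_idx) in enumerate(kf.split(data), start=1):
    print("Fold", fold_number, "train:", data[train_idx], "test:", data[test_idx])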
Cross Validation with Scikit-Learn
In this section we will use cross validation to evaluate the performance of the Random Forest algorithm for classification. The problem that we are going to solve is to predict the quality of wine based on 11 physicochemical attributes. The details of the dataset are available at the following link:
https://archive.ics.uci.edu/ml/datasets/wine+quality
We are only using the data for red wine in this article.
Follow these steps to implement cross validation using Scikit-Learn:
1. Importing Required Libraries
The following code imports a few of the required libraries:
import pandas as pd
import numpy as np
2. Importing the Dataset
Download the dataset, which is available online at this link:
https://www.kaggle.com/piyushgoyal443/red-wine-dataset
Once downloaded, we placed the file in the “Datasets” folder of our “D” drive for the sake of this article. The dataset file is named “winequality-red.csv”. Note that you’ll need to change the file path to match the location where you saved the file on your computer.
Execute the following command to import the dataset:
dataset = pd.read_csv(r"D:/Datasets/winequality-red.csv", sep=';')
The dataset is semicolon-separated, therefore we pass “;” to the “sep” parameter so that pandas is able to properly parse the file.
3. Data Analysis
Execute the following script to get an overview of the data:
dataset.head()
The output looks like this:
|   | fixed acidity | volatile acidity | citric acid | residual sugar | chlorides | free sulfur dioxide | total sulfur dioxide | density | pH | sulphates | alcohol | quality |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 7.4 | 0.70 | 0.00 | 1.9 | 0.076 | 11.0 | 34.0 | 0.9978 | 3.51 | 0.56 | 9.4 | 5 |
| 1 | 7.8 | 0.88 | 0.00 | 2.6 | 0.098 | 25.0 | 67.0 | 0.9968 | 3.20 | 0.68 | 9.8 | 5 |
| 2 | 7.8 | 0.76 | 0.04 | 2.3 | 0.092 | 15.0 | 54.0 | 0.9970 | 3.26 | 0.65 | 9.8 | 5 |
| 3 | 11.2 | 0.28 | 0.56 | 1.9 | 0.075 | 17.0 | 60.0 | 0.9980 | 3.16 | 0.58 | 9.8 | 6 |
| 4 | 7.4 | 0.70 | 0.00 | 1.9 | 0.076 | 11.0 | 34.0 | 0.9978 | 3.51 | 0.56 | 9.4 | 5 |
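Beyond head(), a couple of optional checks give a sense of the dataset's size and how the quality labels are distributed (purely for orientation):
print(dataset.shape)                      # number of rows and columns, (1599, 12) for the red wine data
print(dataset['quality'].value_counts())  # number of wines per quality score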
4. Data Preprocessing
Execute the following script to divide the data into feature and label sets:
X = dataset.iloc[:, 0:11].values
y = dataset.iloc[:, 11].values
Since we are using cross validation, we don’t need to divide our data into training and test sets. We want all of the data available so that cross validation can build the folds itself. The simplest way to do this is to use the full feature and label arrays as the training data (recent versions of Scikit-Learn’s train_test_split may not accept test_size=0, so we skip the split entirely):
X_train = X
y_train = y
5. Scaling the Data
If you look at the dataset you’ll notice that it is not scaled well. For instance, the “volatile acidity” and “citric acid” columns have values between 0 and 1, while most of the other columns have much larger values. Therefore, before training the algorithm, we need to scale our data.
Here we will use the StandardScaler class.
from sklearn.preprocessing import StandardScaler
feature_scaler = StandardScaler()
X_train = feature_scaler.fit_transform(X_train)
6. Training and Cross Validation
The first step in the training and cross validation phase is simple. You just have to import the algorithm class from the sklearn library as shown below:
from sklearn.ensemble import RandomForestClassifier
classifier = RandomForestClassifier(n_estimators=300, random_state=0)
Next, to implement cross validation, the cross_val_score function of the sklearn.model_selection module can be used. cross_val_score returns the accuracy for each of the folds. Four parameters need to be passed to cross_val_score. The first parameter is estimator, which specifies the algorithm that you want to use for cross validation. The second and third parameters, X and y, contain the X_train and y_train data, i.e. features and labels. Finally, the number of folds is passed to the cv parameter, as shown in the following code:
from sklearn.model_selection import cross_val_score
all_accuracies = cross_val_score(estimator=classifier, X=X_train, y=y_train, cv=5)
Once you’ve executed this, let’s simply print the accuracies returned for the five folds by the cross_val_score function by calling print on all_accuracies.
print(all_accuracies)
Output:
[ 0.72360248 0.68535826 0.70716511 0.68553459 0.68454259 ]
To find the average of all the accuracies, simply use the mean() method of the array returned by cross_val_score, as shown below:
print(all_accuracies.mean())
The mean value is 0.6972, or 69.72%.
Finally, let’s find the standard deviation of the results to see the degree of variance in the scores obtained by our model. To do so, call the std() method on the all_accuracies object.
print(all_accuracies.std())
The result is 0.01572, or 1.57%. This is very low, which means that our model has a very low variance. That is good news, since it means that the score we obtained on one fold is not down to chance; rather, the model will perform more or less similarly on all test sets.
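If you want to report both numbers together, a one-line summary such as the following (just a formatting convenience) does the job:
print(f"Accuracy: {all_accuracies.mean():.4f} +/- {all_accuracies.std():.4f}")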
Grid Search for Parameter Selection
A machine learning model has two types of parameters. The first type are the parameters that are learned from the data during training, while the second type are the hyper parameters that we pass to the model before training.
In the last section, while predicting the quality of wine, we used the Random Forest algorithm with 300 estimators. Similarly, for the KNN algorithm we have to specify the value of K, and for the SVM algorithm we have to specify the type of kernel. The number of estimators, the K value, and the kernel type are all examples of hyper parameters.
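For concreteness, here is a minimal sketch of where such hyper parameters appear in Scikit-Learn's API (the specific values are placeholders, not recommendations):
from sklearn.ensemble import RandomForestClassifier
from sklearn.neighbors import KNeighborsClassifier
from sklearn.svm import SVC

# Each constructor argument below is a hyper parameter that we choose; it is not learned from data
rf = RandomForestClassifier(n_estimators=300)  # number of trees
knn = KNeighborsClassifier(n_neighbors=5)      # the K in KNN
svm = SVC(kernel='rbf')                        # kernel type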
Normally we set the values for these hyper parameters more or less arbitrarily and see which ones result in the best performance. However, manually trying out parameter values for an algorithm can be tedious and time-consuming.
Also, it is not easy to compare the performance of different algorithms when the hyper parameters are set arbitrarily, because one algorithm may outperform another with one set of parameters and fall behind it with a different set.
Therefore, instead of randomly selecting the values of the parameters, a better approach would be to develop an algorithm which automatically finds the best parameters for a particular model. Grid Search is one such algorithm.
Grid Search with Scikit-Learn
Let’s implement the grid search algorithm with the help of an example. The script in this section should be run after the script that we created in the last section.
To implement the Grid Search algorithm we need to import the GridSearchCV class from the sklearn.model_selection module.
The first step is to create a dictionary of all the parameters and the corresponding sets of values that you want to test for best performance. The keys of the dictionary are the parameter names and the values are the lists of values to try for each parameter.
Let’s create a dictionary of parameters and their corresponding values for our Random Forest algorithm. Details of all the parameters for the random forest algorithm are available in the Scikit-Learn docs.
To do this, execute the following code:
grid_param = {
'n_estimators': [100, 300, 500, 800, 1000],
'criterion': ['gini', 'entropy'],
'bootstrap': [True, False]
}
Take a careful look at the above code. Here we create a grid_param dictionary with three parameters: n_estimators, criterion, and bootstrap. The values that we want to try for each parameter are passed as a list. For instance, in the above script we want to find which value of n_estimators (out of 100, 300, 500, 800, and 1000) provides the highest accuracy.
Similarly, we want to find which value results in the highest performance for the criterion parameter: “gini” or “entropy”? The Grid Search algorithm basically tries all possible combinations of parameter values and returns the combination with the highest accuracy. For instance, in the above case the algorithm will check 20 combinations (5 x 2 x 2 = 20).
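If you want to see exactly which combinations will be evaluated, you can enumerate them with Scikit-Learn's ParameterGrid helper, using the grid_param dictionary defined above:
from sklearn.model_selection import ParameterGrid

# Every combination Grid Search will try: 5 x 2 x 2 = 20 in total
combinations = list(ParameterGrid(grid_param))
print(len(combinations))
print(combinations[0])  # e.g. {'bootstrap': True, 'criterion': 'gini', 'n_estimators': 100}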
The Grid Search algorithm can be very slow, owing to the potentially huge number of combinations to test. Furthermore, cross validation further increases the execution time and complexity.
Once the parameter dictionary is created, the next step is to create an instance of the GridSearchCV class. You need to pass a value for the estimator parameter, which is the algorithm that you want to evaluate. The param_grid parameter takes the parameter dictionary that we just created, the scoring parameter takes the performance metric, the cv parameter corresponds to the number of folds, which is 5 in our case, and finally the n_jobs parameter refers to the number of CPUs that you want to use for execution. A value of -1 for the n_jobs parameter means that all available computing power will be used. This can be handy if you have a large amount of data.
Take a look at the following code:
from sklearn.model_selection import GridSearchCV
gd_sr = GridSearchCV(estimator=classifier,
param_grid=grid_param,
scoring='accuracy',
cv=5,
n_jobs=-1)
Once the GridSearchCV class is initialized, the last step is to call its fit method and pass it the training data, as shown in the following code:
gd_sr.fit(X_train, y_train)
This method can take some time to execute because we have 20 combinations of parameters and 5-fold cross validation. Therefore the model will be fit a total of 100 times (GridSearchCV then refits once more on the full training set using the best parameters found).
Once the method completes execution, the next step is to check the parameters that return the highest accuracy. To do so, print the best_params_ attribute of the GridSearchCV object, as shown below:
best_parameters = gd_sr.best_params_
print(best_parameters)
Output:
{'bootstrap': True, 'criterion': 'gini', 'n_estimators': 1000}
The result shows that the highest accuracy is achieved when n_estimators is 1000, bootstrap is True, and criterion is “gini”.
Note: It would be a good idea to try a larger number of estimators and see if performance increases further, since the highest value of n_estimators in our grid was the one chosen.
The last step of the Grid Search process is to find the accuracy obtained using the best parameters. Previously we had a mean accuracy of 69.72% with 300 estimators.
To find the best accuracy achieved, execute the following code:
best_result = gd_sr.best_score_
print(best_result)
The accuracy achieved is 0.6985, or 69.85%, which is only slightly better than 69.72%. To improve this further, it would be good to test values for other parameters of the Random Forest algorithm, such as max_features, max_depth, and max_leaf_nodes, to see if the accuracy improves further.
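As a rough sketch only (the candidate values below are illustrative, not tuned recommendations), an expanded grid could look like this; keep in mind that every extra parameter multiplies the number of combinations and therefore the run time:
grid_param = {
    'n_estimators': [500, 1000, 1500],
    'criterion': ['gini', 'entropy'],
    'bootstrap': [True, False],
    'max_features': ['sqrt', 'log2', None],
    'max_depth': [None, 10, 20]
}
gd_sr = GridSearchCV(estimator=classifier,
                     param_grid=grid_param,
                     scoring='accuracy',
                     cv=5,
                     n_jobs=-1)
gd_sr.fit(X_train, y_train)
print(gd_sr.best_params_, gd_sr.best_score_)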
Conclusion
In this article we studied two very commonly used techniques for performance evaluation and model selection. K-Fold Cross-Validation can be used to evaluate the performance of a model while handling the variance problem in the results. Furthermore, to identify the best algorithm and the best parameters, we can use the Grid Search algorithm.