Machine Learning Model Selection and Validation
Overview
Teaching: 30 min
Exercises: 30 min
Questions
How can we be sure that our model is performing as well as we think?
What are some techniques to overcome common ML issues?
Objectives
Apply a Student’s t-test to validation accuracies.
Overcome challenges posed by imbalanced data.
In the last session we trained a classification model on some microscope imagery containing protein crystals. Towards the end of the session we applied a range of performance metrics that illuminated different aspects of the model's performance. Ultimately we observed that the model was heavily biased by the imbalanced distribution of classes in our training set. How do we know this wasn't just by chance? Can we get some statistical guarantee about our findings?
Single Model Validation
A nice way to test statistical significance is through the use of cross-validation. For each fold of validation we can measure an accuracy (or some other metric of interest), and with a sample of accuracies we can estimate a distribution over accuracies. With that distribution in hand we can apply a Student's t-test to check for a statistically significant difference from some benchmark value. We can do it all with a few lines of Python code. In case your environment has reset, or you are coming into this session fresh, let's just re-download our data and define our classifier.
from utils.datasets import c3
(X_train, y_train), (X_test, y_test) = c3.load_data()
Downloading datafile to /root/data/crystals.npy ...
... done
Downloading datafile to /root/data/clear.npy ...
... done
Sub-sampling dataset...
... shuffling and splitting
... done
Authentication
The first time you execute one of the dataset utility functions you will be asked to authenticate with your Google credentials; don't be afraid.
Go to the following link in your browser: https://accounts.google.com/o/oauth2/auth?redirect_uri=..... Enter verification code:
from sklearn import svm
classifier = svm.SVC(gamma=0.001, verbose=True)
SVC(C=1.0, cache_size=200, class_weight=None, coef0=0.0,
decision_function_shape='ovr', degree=3, gamma=0.001, kernel='rbf',
max_iter=-1, probability=False, random_state=None, shrinking=True,
tol=0.001, verbose=True)
Cross Validation
Cross-validation can be implemented using scikit-learn in a single line:
from sklearn.model_selection import cross_validate
cv_results = cross_validate(classifier, X_train, y=y_train, cv=10, verbose=3, n_jobs=-1)
[Parallel(n_jobs=-1)]: Using backend LokyBackend with 2 concurrent workers.
[Parallel(n_jobs=-1)]: Done 10 out of 10 | elapsed: 5.3min finished
The above code performs a 10-fold cross-validation on our training data set. You can specify which metrics you would like to collect across the folds, but some sensible defaults are chosen for you:
print(cv_results.keys())
dict_keys(['fit_time', 'score_time', 'test_score', 'train_score'])
For our t-test we are interested in the test_score metric, which in this case is simply the accuracy. We can now calculate the one-sided t-statistic comparing our accuracy against some benchmark value, let's say an accuracy of 80%.
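For reference, the quantity the code below computes is the standard one-sample t-statistic, written here with symbols matching the variable names in the code:

t = (m - benchmark) / SE,  where SE = S / sqrt(k)

Here m is the mean of the k fold accuracies, S is their standard deviation, and benchmark is the accuracy we are testing against (0.8).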
import numpy as np
from scipy import stats
accuracies = cv_results["test_score"]
k = len(accuracies)              # number of folds
benchmark = 0.8                  # benchmark accuracy to test against
m = np.mean(accuracies)          # mean accuracy across the folds
S = np.std(accuracies)           # standard deviation of the fold accuracies
SE = S / np.sqrt(k)              # standard error of the mean
t = (m - benchmark) / SE         # t-statistic
p = stats.t.cdf(t, df=2*k - 2)   # one-sided p-value from the t-distribution
print("The accuracy of your model is %.2f\u00B1%.2f" % (m, 2.58*SE))
print("The p-value for this test is %.3f" % p)
The accuracy of your model is 0.80±0.00
The p-value for this test is 0.103
In this case it looks like we cannot claim with confidence that our model's accuracy exceeds 80%.
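If you prefer an explicit decision rule, you can compare the p-value against a significance level. The threshold below is an assumption (0.05 is the conventional choice); pick whatever suits your application.

alpha = 0.05   # significance level: an assumed, conventional choice
if p < alpha:
    print("The difference from the benchmark is statistically significant.")
else:
    print("We cannot claim a statistically significant difference from the benchmark.")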
Resampling
So, we have identified that our dataset is imbalanced and is causing our model to underperform. How can we fix this problem? A simple way is to resample our dataset, that is, oversample the minority class or undersample the majority class. There is a nice project dedicated to providing easy-to-use solutions to such problems, imbalanced-learn. It can both over- and under-sample datasets for us with relative ease. Let's start with oversampling.
from sklearn.utils import shuffle
from imblearn.over_sampling import RandomOverSampler
ros = RandomOverSampler()
# randomly duplicate minority-class samples until the two classes are balanced
X_oversampled, y_oversampled = ros.fit_resample(X_train, y_train)
# shuffle so the duplicated samples are not clustered together
X_oversampled, y_oversampled = shuffle(X_oversampled, y_oversampled)
print("Fraction of samples assigned to the crystal class %.2f%%" % (np.sum(y_oversampled) / y_oversampled.shape[0] * 100.))
Fraction of samples assigned to the crystal class 50.00%
Too easy. Let’s run our cross-validation analysis again and see if this approach improves the model at all.
classifier = svm.SVC(gamma=0.001)
cv_results = cross_validate(classifier, X_oversampled, y=y_oversampled, cv=10, verbose=3, n_jobs=-1)
accuracies = cv_results["test_score"]
k = len(accuracies)
benchmark = 0.8
m = np.mean(accuracies)
S = np.std(accuracies)
SE = S / np.sqrt(k)
t = np.abs((m - benchmark) / SE)
p = 1. - stats.t.cdf(t,df=2*k - 2)
print("The accuracy of you model is %.2f\u00B1%.2f" %(m, 2.58*SE))
print("The p-value for this test is %.3f" % p)
[Parallel(n_jobs=-1)]: Using backend LokyBackend with 2 concurrent workers.
/usr/local/lib/python3.6/dist-packages/sklearn/externals/joblib/externals/loky/process_executor.py:706: UserWarning: A worker stopped while some jobs were given to the executor. This can be caused by a too short worker timeout or by a memory leak.
"timeout or by a memory leak.", UserWarning
[Parallel(n_jobs=-1)]: Done 10 out of 10 | elapsed: 10.2min finished
The accuracy of your model is 0.99±0.00
The p-value for this test is 0.000
Wow! Well, it has certainly improved our accuracy, with a statistically significant difference from our benchmark value, but something is not right here. We have a confidence interval of ±0.00? That seems a little suspicious. Let's run some of the performance metrics we covered in the previous section.
from sklearn import metrics
classifier.fit(X_oversampled, y_oversampled)  # cross_validate fits clones, so fit the classifier itself first
predicted = classifier.predict(X_test)
print("Classification report for classifier %s:\n%s\n"
      % (classifier, metrics.classification_report(y_test, predicted)))
Classification report for classifier SVC(C=1.0, cache_size=200, class_weight=None, coef0=0.0,
decision_function_shape='ovr', degree=3, gamma=0.001, kernel='rbf',
max_iter=-1, probability=False, random_state=None, shrinking=True,
tol=0.001, verbose=False):
precision recall f1-score support
0.0 0.00 0.00 0.00 123
1.0 0.80 1.00 0.89 477
micro avg 0.80 0.80 0.80 600
macro avg 0.40 0.50 0.44 600
weighted avg 0.63 0.80 0.70 600
Hmmmm. We seem to have gone back to square one. Why are we not performing as well on our test set? We were getting 99% accuracy on our cross-validation run above. What we are seeing here is an extreme difference between our training accuracy and our test accuracy, a classic red flag for over-fitting.
Over-fitting is essentially the point where your model starts remembering individual data points instead of learning generalisable patterns. The zero variance in the cross-validation accuracy and the large divergence between the test and train accuracies are dead giveaways that we have overfit. This is most likely because we oversampled the minority class. Given the 4:1 ratio of crystal to clear images, each clear image would be, on average, sampled four times. So perhaps the model has memorised individual clear samples (having been shown them many more times than the crystal images) while only learning some general features of the crystal class. Thus, when it is shown clear images it hasn't seen before, it does not recognise them as clear.
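As a quick sanity check (a sketch, not part of the original analysis), you can put a number on that divergence by comparing the mean cross-validation accuracy with the accuracy of the fitted classifier on the held-out test set:

# a large gap between cross-validation accuracy and held-out test accuracy
# is a classic symptom of over-fitting
cv_acc = np.mean(cv_results["test_score"])
test_acc = classifier.score(X_test, y_test)   # accuracy on the held-out test set
print("Mean cross-validation accuracy: %.3f" % cv_acc)
print("Held-out test accuracy:         %.3f" % test_acc)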
Let's try the alternative: let's undersample.
from imblearn.under_sampling import ClusterCentroids
cc = ClusterCentroids(random_state=0)
X_resampled, y_resampled = cc.fit_resample(X_train, y_train)
X_resampled, y_resampled = shuffle(X_resampled, y_resampled)
classifier = svm.SVC(gamma=0.001)
cv_results = cross_validate(classifier, X_resampled, y=y_resampled, cv=10, verbose=3, n_jobs=-1)
accuracies = cv_results["test_score"]
k = len(accuracies)
benchmark = 0.8
m = np.mean(accuracies)
S = np.std(accuracies)
SE = S / np.sqrt(k)
t = np.abs((m - benchmark) / SE)
p = 1. - stats.t.cdf(t,df=2*k - 2)
print("The accuracy of you model is %.2f\u00B1%.2f" %(m, 2.58*SE))
print("The p-value for this test is %.3f" % p)
[Parallel(n_jobs=-1)]: Using backend LokyBackend with 2 concurrent workers.
[Parallel(n_jobs=-1)]: Done 10 out of 10 | elapsed: 51.5s finished
The accuracy of your model is 0.51±0.01
The p-value for this test is 0.000
So now we are back to random guessing? It doesn’t look like our model is really having a good time here.
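To see how close this is to guessing, one option (an aside, not part of the original lesson) is to compare against scikit-learn's DummyClassifier, which ignores the images entirely and predicts classes uniformly at random; on a balanced dataset that gives a floor of roughly 50% accuracy, which is about what our SVC achieves.

from sklearn.dummy import DummyClassifier
# a dummy baseline that predicts each class uniformly at random
dummy = DummyClassifier(strategy="uniform", random_state=0)
dummy_results = cross_validate(dummy, X_resampled, y=y_resampled, cv=10, n_jobs=-1)
print("Dummy baseline accuracy: %.2f" % np.mean(dummy_results["test_score"]))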
classifier.fit(X_resampled, y_resampled)  # again, fit the classifier itself before predicting
predicted = classifier.predict(X_test)
print("Classification report for classifier %s:\n%s\n"
      % (classifier, metrics.classification_report(y_test, predicted)))
Classification report for classifier SVC(C=1.0, cache_size=200, class_weight=None, coef0=0.0,
decision_function_shape='ovr', degree=3, gamma=0.001, kernel='rbf',
max_iter=-1, probability=False, random_state=None, shrinking=True,
tol=0.001, verbose=False):
precision recall f1-score support
0.0 1.00 0.01 0.02 123
1.0 0.80 1.00 0.89 477
micro avg 0.80 0.80 0.80 600
macro avg 0.90 0.50 0.45 600
weighted avg 0.84 0.80 0.71 600
One Last Model
Okay. One last try. Let’s see what a decision tree can do.
from sklearn import tree
classifier = tree.DecisionTreeClassifier()
classifier.fit(X_train, y_train)
expected = y_test
predicted = classifier.predict(X_test)
print("Classification report for classifier %s:\n%s\n"
% (classifier, metrics.classification_report(expected, predicted)))
plot_confusion_matrix(y_test, predicted, classes=np.array(["Clear", "Crystal"]))  # helper from the previous session
cv_results = cross_validate(classifier, X_train, y=y_train, cv=10, verbose=3, n_jobs=-1)
accuracies = cv_results["test_score"]
k = len(accuracies)
benchmark = 0.8
m = np.mean(accuracies)
S = np.std(accuracies)
SE = S / np.sqrt(k)
t = np.abs((m - benchmark) / SE)
p = 1. - stats.t.cdf(t,df=2*k - 2)
print("The accuracy of you model is %.2f\u00B1%.2f" %(m, 2.58*SE))
print("The p-value for this test is %.3f" % p)
Classification report for classifier DecisionTreeClassifier(class_weight=None, criterion='gini', max_depth=None,
max_features=None, max_leaf_nodes=None,
min_impurity_decrease=0.0, min_impurity_split=None,
min_samples_leaf=1, min_samples_split=2,
min_weight_fraction_leaf=0.0, presort=False, random_state=None,
splitter='best'):
precision recall f1-score support
0.0 0.67 0.68 0.68 123
1.0 0.92 0.91 0.92 477
micro avg 0.87 0.87 0.87 600
macro avg 0.79 0.80 0.80 600
weighted avg 0.87 0.87 0.87 600
Confusion matrix, without normalization
[[ 84 39]
[ 41 436]]
[Parallel(n_jobs=-1)]: Using backend LokyBackend with 2 concurrent workers.
[Parallel(n_jobs=-1)]: Done 10 out of 10 | elapsed: 38.6s finished
The accuracy of your model is 0.85±0.02
The p-value for this test is 0.000
We seem to have improved in every respect here. We have a respectable accuracy that is statistically distinct from the benchmark value, and we seem to have reduced the effects of over-fitting while still maintaining a high accuracy. Yay decision trees!
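As a final aside (not covered in the lesson itself), many scikit-learn estimators also accept a class_weight parameter, which re-weights the classes in the loss function instead of resampling the data; a minimal sketch:

# class_weight="balanced" weights samples inversely to class frequency,
# an alternative to over- or under-sampling the data itself
weighted_tree = tree.DecisionTreeClassifier(class_weight="balanced")
cv_weighted = cross_validate(weighted_tree, X_train, y=y_train, cv=10, n_jobs=-1)
print("Mean CV accuracy with class_weight='balanced': %.2f" % np.mean(cv_weighted["test_score"]))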
Key Points
Traditional statistical methods for validation are still extremely useful.
Most of the time your data is much more important than the model itself.