Cross validation
================

.. currentmodule:: imblearn.model_selection


.. _instance_hardness_threshold_cv:

The term instance hardness is used in literature to express the difficulty to correctly
classify an instance. An instance for which the predicted probability of the true class
is low has large instance hardness. The way these hard-to-classify instances are
distributed over train and test sets in cross validation has a significant effect on the
test set performance metrics. The :class:`~imblearn.model_selection.InstanceHardnessCV`
splitter distributes samples with large instance hardness equally over the folds,
resulting in more robust cross validation.

We will discuss instance hardness in this document and explain how to use the
:class:`~imblearn.model_selection.InstanceHardnessCV` splitter.

Instance hardness and average precision
=======================================

Instance hardness is defined as 1 minus the probability of the most probable class:

.. math::

   H(x) = 1 - P(\hat{y}|x)

In this equation :math:`H(x)` is the instance hardness for a sample with features
:math:`x` and :math:`P(\hat{y}|x)` the probability of predicted label :math:`\hat{y}`
given the features. If the model predicts label 0 and gives a `predict_proba` output
of [0.9, 0.1], the probability of the most probable class (0) is 0.9 and the
instance hardness is `1 - 0.9 = 0.1`.
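
The hardness computation above can be sketched in a few lines of NumPy. This is an
illustrative helper over a `predict_proba`-style array, not part of the
imbalanced-learn API:

```python
import numpy as np

def instance_hardness(proba):
    """Instance hardness: 1 minus the probability of the most probable class.

    `proba` is an (n_samples, n_classes) array, as returned by a classifier's
    `predict_proba` method.
    """
    proba = np.asarray(proba)
    return 1.0 - proba.max(axis=1)

# The sample from the text ([0.9, 0.1]) has hardness 0.1; a sample close to
# the decision boundary ([0.55, 0.45]) has a larger hardness of 0.45.
hardness = instance_hardness([[0.9, 0.1], [0.55, 0.45]])
```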

Samples with large instance hardness have a significant effect on the area under the
precision-recall curve, or average precision. Especially samples with label 0 with
large instance hardness affect the part of the curve where the area is largest; the
precision is lowered in the range of low recall and high thresholds. When doing cross
validation, e.g. in case of hyperparameter tuning or recursive feature elimination,
random gathering of these points in some folds introduces variance in CV results that
deteriorates the robustness of the cross validation task. The
:class:`~imblearn.model_selection.InstanceHardnessCV` splitter aims to distribute the
samples with large instance hardness over the folds in order to reduce undesired
variance. Note that one should use this splitter to make model *selection* tasks
robust, like hyperparameter tuning and recursive feature elimination, and not for model
*evaluation* tasks where you want to know the variance of performance to be expected in
production.
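
To picture how such a splitter can spread hard samples over the folds, here is a
simplified, hypothetical sketch of the idea: sort samples from hardest to easiest and
deal them out round-robin. The real `InstanceHardnessCV` splitter is more involved (it
estimates hardness from cross-validated predictions and keeps folds balanced), so treat
this only as an illustration:

```python
import numpy as np

def stripe_folds_by_hardness(hardness, n_splits=5):
    """Assign a fold index to each sample so that high-hardness samples are
    spread evenly: sort hardest-first, then deal out round-robin."""
    hardness = np.asarray(hardness)
    order = np.argsort(-hardness)               # sample indices, hardest first
    fold_of = np.empty(len(hardness), dtype=int)
    fold_of[order] = np.arange(len(hardness)) % n_splits
    return fold_of

# The two hardest samples (hardness 0.9 and 0.8) land in different folds
# instead of possibly clustering in the same test fold.
h = np.array([0.1, 0.9, 0.05, 0.2, 0.8, 0.1, 0.15, 0.05, 0.3, 0.1])
folds = stripe_folds_by_hardness(h, n_splits=5)
```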

Create imbalanced dataset with samples with large instance hardness
===================================================================

Let's start by creating a dataset to work with. We create a dataset with 5% class
imbalance using scikit-learn's :func:`~sklearn.datasets.make_blobs` function.

>>> import numpy as np
>>> from matplotlib import pyplot as plt
>>> from sklearn.datasets import make_blobs
>>> random_state = 10
>>> X, y = make_blobs(n_samples=[950, 50], centers=((-3, 0), (3, 0)),
...                   random_state=random_state)
>>> plt.scatter(X[:, 0], X[:, 1], c=y)
>>> plt.show()

.. image:: ./auto_examples/model_selection/images/sphx_glr_plot_instance_hardness_cv_001.png
   :target: ./auto_examples/model_selection/plot_instance_hardness_cv.html
   :align: center

Now we add some samples with large instance hardness

>>> X_hard, y_hard = make_blobs(n_samples=10, centers=((3, 0), (-3, 0)),
...                             random_state=random_state)
>>> X = np.vstack((X, X_hard))
>>> y = np.hstack((y, y_hard))
>>> plt.scatter(X[:, 0], X[:, 1], c=y)
>>> plt.show()

.. image:: ./auto_examples/model_selection/images/sphx_glr_plot_instance_hardness_cv_002.png
   :target: ./auto_examples/model_selection/plot_instance_hardness_cv.html
   :align: center

Assess cross validation performance variance using `InstanceHardnessCV` splitter
================================================================================

Then we take a :class:`~sklearn.linear_model.LogisticRegression` classifier and assess
the cross validation performance using a :class:`~sklearn.model_selection.StratifiedKFold`
cv splitter and the :func:`~sklearn.model_selection.cross_validate` function.

>>> from sklearn.linear_model import LogisticRegression
>>> from sklearn.model_selection import StratifiedKFold, cross_validate
>>> clf = LogisticRegression(random_state=random_state)
>>> skf_cv = StratifiedKFold(n_splits=5, shuffle=True,
...                          random_state=random_state)
>>> skf_result = cross_validate(clf, X, y, cv=skf_cv, scoring="average_precision")

Now, we do the same using an :class:`~imblearn.model_selection.InstanceHardnessCV`
splitter. We provide our classifier to the splitter to calculate instance hardness and
distribute samples with large instance hardness equally over the folds.

>>> from imblearn.model_selection import InstanceHardnessCV
>>> ih_cv = InstanceHardnessCV(estimator=clf, n_splits=5,
...                            random_state=random_state)
>>> ih_result = cross_validate(clf, X, y, cv=ih_cv, scoring="average_precision")

When we plot the test scores for both cv splitters, we see that the variance using the
:class:`~imblearn.model_selection.InstanceHardnessCV` splitter is lower than for the
:class:`~sklearn.model_selection.StratifiedKFold` splitter.

>>> plt.boxplot([skf_result['test_score'], ih_result['test_score']],
...             tick_labels=["StratifiedKFold", "InstanceHardnessCV"],
...             vert=False)
>>> plt.xlabel('Average precision')
>>> plt.tight_layout()

.. image:: ./auto_examples/model_selection/images/sphx_glr_plot_instance_hardness_cv_003.png
   :target: ./auto_examples/model_selection/plot_instance_hardness_cv.html
   :align: center

Be aware that the most important job of a cross-validation splitter is to simulate the
conditions that one will encounter in production. Therefore, if difficult samples are
likely to occur in production, one should use a cross-validation splitter that emulates
this situation. In our case, the :class:`~sklearn.model_selection.StratifiedKFold`
splitter did not allow distributing the difficult samples over the folds and was thus
likely a problem for our use case.