Cross-Validation

Cross-validation is a technique for evaluating machine learning models by splitting the dataset into training and testing subsets, ensuring robust performance. In a K-Fold cross-validation example using the Iris dataset and a RandomForestClassifier, accuracy scores for five folds were calculated, resulting in a mean score of 0.9134 and a standard deviation of 0.034. This method helps assess the model's reliability across different data partitions.

Uploaded by

ayeshasadiqa148
Copyright
© © All Rights Reserved
We take content rights seriously. If you suspect this is your content, claim it here.
Available Formats
Download as PPTX, PDF, TXT or read online on Scribd
0% found this document useful (0 votes)
8 views21 pages

Cross Validation

Cross-validation is a technique for evaluating machine learning models by splitting the dataset into training and testing subsets, ensuring robust performance. In a K-Fold cross-validation example using the Iris dataset and a RandomForestClassifier, accuracy scores for five folds were calculated, resulting in a mean score of 0.9134 and a standard deviation of 0.034. This method helps assess the model's reliability across different data partitions.

Uploaded by

ayeshasadiqa148
Copyright
© © All Rights Reserved
We take content rights seriously. If you suspect this is your content, claim it here.
Available Formats
Download as PPTX, PDF, TXT or read online on Scribd
You are on page 1/ 21

Cross-Validation

Definition
• Cross-validation is a technique used to evaluate the performance of a machine learning model by partitioning the original dataset into training and testing subsets. This helps ensure that the performance estimate is robust and not dependent on one particular train/test split.
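As a point of reference, scikit-learn exposes this whole procedure as a one-line helper; a minimal sketch (note that for classifiers `cross_val_score` defaults to stratified folds, which differs slightly from the plain KFold walked through in this dry run, and the fixed `random_state=42` here is an illustrative choice):

```python
# Hedged sketch: high-level cross-validation with scikit-learn's helper.
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

X, y = load_iris(return_X_y=True)
scores = cross_val_score(RandomForestClassifier(random_state=42), X, y, cv=5)
print(scores.mean(), scores.std())  # one accuracy per fold, then mean and spread
```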
Dry Run of Cross-Validation with K-Fold
• Context:

• Dataset: Iris dataset with 150 samples and 4 features (sepal length, sepal width, petal length, petal width).
• Classes: 3 classes (0, 1, 2) representing different species of Iris.
• Model: RandomForestClassifier (default settings).
• Cross-Validation: KFold with 5 splits.
• Load Dataset:

• X: 150 samples with 4 features each.


• y: 150 target labels, each being 0, 1, or 2.

• KFold Definition:

• Number of splits (n_splits=5): 150 samples / 5 splits = 30 samples per fold.


• Shuffle: Enabled for randomness.
• Random State: Set to 42 for reproducibility.
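The setup described above can be sketched directly in code:

```python
# Load the Iris dataset and define the 5-fold splitter from the dry run.
from sklearn.datasets import load_iris
from sklearn.model_selection import KFold

X, y = load_iris(return_X_y=True)   # X: (150, 4) features, y: (150,) labels in {0, 1, 2}
kf = KFold(n_splits=5, shuffle=True, random_state=42)

print(X.shape, y.shape)   # (150, 4) (150,)
```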
Fold 1
• Data Splitting:

• Training data: 120 samples (4 of the 5 folds are used for training).
• Testing data: 30 samples (1 fold reserved for testing).

• For this dry run, assume the testing indices in the first fold are [0, 1, 2, ..., 29]. (With shuffle=True the actual indices would be a random subset; contiguous blocks are assumed here only to keep the walkthrough readable.)

• Training indices: [30, 31, 32, ..., 149]
• Testing indices: [0, 1, 2, ..., 29]
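The splitting step looks like this in code (with shuffle=True the test indices are a random subset, not the contiguous blocks assumed in the dry run, but the fold sizes match):

```python
# Iterate over the 5 folds and show the train/test split sizes.
from sklearn.datasets import load_iris
from sklearn.model_selection import KFold

X, y = load_iris(return_X_y=True)
kf = KFold(n_splits=5, shuffle=True, random_state=42)

for fold, (train_idx, test_idx) in enumerate(kf.split(X), start=1):
    print(f"Fold {fold}: {len(train_idx)} train / {len(test_idx)} test samples")
    # Each fold: 120 train / 30 test samples
```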
Continue
• Training:
• The RandomForestClassifier is trained on the 120 training samples. Training builds an ensemble of decision trees.
• Prediction:
• The trained model predicts the classes of the 30 testing samples.
• Assume the predictions are [0, 0, 1, 1, 2, ..., 1] for these 30 samples.
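The train-and-predict step for the first fold can be sketched as follows (the `random_state=42` on the classifier is an added assumption for reproducibility; the slides use default settings):

```python
# Train on the first fold's 120 training samples, predict the 30 held-out ones.
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import KFold

X, y = load_iris(return_X_y=True)
kf = KFold(n_splits=5, shuffle=True, random_state=42)
train_idx, test_idx = next(iter(kf.split(X)))    # first fold only

model = RandomForestClassifier(random_state=42)  # builds an ensemble of decision trees
model.fit(X[train_idx], y[train_idx])
y_pred = model.predict(X[test_idx])              # class labels for the 30 test samples
print(len(y_pred))   # 30
```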
Continue
• Evaluation:

• Calculate the accuracy by comparing the predicted labels with the actual labels.
• Assume the actual labels are [0, 0, 1, 1, 2, ..., 2].
• If 27 out of 30 predictions are correct, the accuracy score for this fold is 27/30 = 0.90.

• Store Score:

• Accuracy score of 0.90 is stored in the scores array.
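The evaluation step can be checked with a small constructed example (the specific labels below are hypothetical, chosen only to reproduce the assumed 27-of-30 outcome):

```python
# 30 labels with exactly 3 errors introduced -> accuracy 27/30 = 0.90.
import numpy as np
from sklearn.metrics import accuracy_score

y_true = np.array([0]*10 + [1]*10 + [2]*10)
y_pred = y_true.copy()
y_pred[[3, 14, 25]] = (y_pred[[3, 14, 25]] + 1) % 3   # flip 3 labels to wrong classes

acc = accuracy_score(y_true, y_pred)
print(acc)   # 0.9
```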


Fold 2
• Data Splitting:

• Testing data: Another set of 30 samples, say indices [30, 31, 32, ..., 59].
• Training data: Remaining 120 samples.

• Training:

• The model is trained again from scratch on these new 120 training samples.

• Prediction:

• The model makes predictions on the new testing set of 30 samples.


• Assume predictions are [1, 1, 0, ..., 2].
Continue
• Evaluation:

• Assume actual labels are [1, 1, 0, ..., 1].


• If 28 out of 30 predictions are correct, the accuracy score is 28/30 ≈ 0.933.

• Store Score:

• Accuracy score of 0.933 is stored.


Fold 3
• Data Splitting:

• Testing data: Next set of 30 samples, say indices [60, 61, 62, ..., 89].
• Training data: Remaining 120 samples.

• Training:

• The model is retrained on the new 120 training samples.

• Prediction:

• Predictions made on the 30 testing samples.


• Assume predictions are [2, 2, 1, ..., 0].
• Evaluation:

• Actual labels are [2, 2, 1, ..., 1].


• If 29 out of 30 predictions are correct, accuracy is 29/30 ≈ 0.967.

• Store Score:

• Accuracy score of 0.967 is stored.


Fold 4
• Data Splitting:

• Testing data: Next set of 30 samples, say indices [90, 91, 92, ..., 119].
• Training data: Remaining 120 samples.

• Training:

• The model is retrained on the new training samples.

• Prediction:

• Predictions made on the testing samples.


• Assume predictions are [0, 0, 1, ..., 2].
• Evaluation:

• Actual labels are [0, 0, 1, ..., 2].


• If 26 out of 30 predictions are correct, accuracy is 26/30 ≈ 0.867.

• Store Score:

• Accuracy score of 0.867 is stored.


Fold 5
• Data Splitting:

• Testing data: Last set of 30 samples, say indices [120, 121, 122, ..., 149].
• Training data: Remaining 120 samples.

• Training:

• The model is trained one last time on the new training samples.

• Prediction:

• Predictions made on the testing samples.


• Assume predictions are [2, 2, 2, ..., 0].
• Evaluation:

• Actual labels are [2, 2, 2, ..., 0].


• If 27 out of 30 predictions are correct, accuracy is 27/30 = 0.90.

• Store Score:

• Accuracy score of 0.90 is stored.


Final Computation
• scores = [0.90, 0.933, 0.967, 0.867, 0.90]
• Mean Score: Calculated as the average of the scores.

• Mean = (0.90 + 0.933 + 0.967 + 0.867 + 0.90) / 5 = 0.9134


Steps to Calculate Standard Deviation
• Variance is the average of the squared differences from the mean (population formula, dividing by n = 5).

• Compute the squared differences from the mean:

  (0.900 − 0.9134)² = (−0.0134)² = 0.00018
  (0.933 − 0.9134)² = (0.0196)² = 0.000384
  (0.967 − 0.9134)² = (0.0536)² = 0.00287
  (0.867 − 0.9134)² = (−0.0464)² = 0.00215
  (0.900 − 0.9134)² = (−0.0134)² = 0.00018

• Sum of squared differences:
  0.00018 + 0.000384 + 0.00287 + 0.00215 + 0.00018 = 0.005764

• Variance:
  0.005764 / 5 = 0.001153

• Standard deviation is the square root of the variance:
  √0.001153 ≈ 0.034
Summary

• Mean Score: 0.9134


• Standard Deviation: 0.034
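The entire dry run can be reproduced end to end; a sketch follows (actual per-fold scores will differ from the assumed ones above, since they come from real model predictions, and the classifier's `random_state=42` is an added assumption for reproducibility):

```python
# Full K-Fold cross-validation loop: split, train, predict, score, aggregate.
import numpy as np
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import KFold

X, y = load_iris(return_X_y=True)
kf = KFold(n_splits=5, shuffle=True, random_state=42)
scores = []

for train_idx, test_idx in kf.split(X):
    model = RandomForestClassifier(random_state=42)   # retrained from scratch each fold
    model.fit(X[train_idx], y[train_idx])
    y_pred = model.predict(X[test_idx])
    scores.append(accuracy_score(y[test_idx], y_pred))  # store this fold's accuracy

scores = np.array(scores)
print("Scores:", scores)
print("Mean:", scores.mean(), "Std:", scores.std())
```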
