8/21/2019 Train/Test Split and Cross Validation in Python - Towards Data Science

Train/Test Split and Cross Validation in Python
Adi Bronshtein
May 17, 2017 · 9 min read

Hi everyone! After my last post on linear regression in Python, I thought it would only
be natural to write a post about Train/Test Split and Cross Validation. As usual, I am
going to give a short overview of the topic and then give an example of implementing
it in Python. These are two rather important concepts in data science and data analysis
and are used as tools to prevent (or at least minimize) overfitting. I’ll explain what that
is: when we’re using a statistical model (like linear regression, for example), we
usually fit the model on a training set in order to make predictions on data that
it wasn’t trained on (general data). Overfitting means that we’ve fit the model too
closely to the training data. It will all make sense pretty soon, I promise!

What is Overfitting/Underfitting a Model?

As mentioned, in statistics and machine learning we usually split our data into two
subsets: training data and testing data (and sometimes into three: train, validate and
test), and fit our model on the training data in order to make predictions on the test data.
When we do that, one of two things might happen: we overfit our model or we underfit
our model. We don’t want either of these to happen, because both hurt the
predictive power of our model: we might end up with a model that has lower accuracy
and/or does not generalize (meaning you can’t carry its predictions over to other data).
Let’s see what under- and overfitting actually mean:

Overfitting
Overfitting means that the model we trained has trained “too well” and is now, well, fit too
closely to the training dataset. This usually happens when the model is too complex
(e.g. too many features/variables compared to the number of observations). This model
will be very accurate on the training data but will probably be much less accurate on
untrained or new data. This is because the model is not generalized (or not AS
generalized), meaning you can’t generalize the results and can’t make any inferences on
other data, which is, ultimately, what you are trying to do. Basically, when this
happens, the model learns or describes the “noise” in the training data instead of the
actual relationships between variables in the data. This noise, obviously, isn’t part of
any new dataset, and cannot be applied to it.

Underfitting
In contrast to overfitting, when a model is underfitted, it means that the model does
not fit the training data well and therefore misses the trends in the data. It also means the
model cannot be generalized to new data. As you probably guessed (or figured out!),
this is usually the result of a model that is too simple (not enough predictors/independent
variables). It could also happen when, for example, we fit a linear model (like linear
regression) to data that is not linear. It almost goes without saying that this model will
have poor predictive ability (both on the training data and on any other data).

An example of overfitting, underfitting and a model that’s “just right!”
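To make the three cases a bit more tangible, here is a minimal plain-Python sketch (not from the original post). It assumes a made-up true relationship y = 2x + 1 with Gaussian noise, and compares three toy models: one that always predicts the mean (too simple), a least-squares line, and one that simply memorizes the training points (too flexible). All names in it are hypothetical.

```python
import random

random.seed(0)  # fixed seed so the sketch is repeatable


def make_data(n):
    """Noisy samples from the assumed true relationship y = 2x + 1."""
    xs = [random.uniform(0, 10) for _ in range(n)]
    ys = [2 * x + 1 + random.gauss(0, 1) for x in xs]
    return xs, ys


def mse(model, xs, ys):
    """Mean squared error of a prediction function on (xs, ys)."""
    return sum((model(x) - y) ** 2 for x, y in zip(xs, ys)) / len(xs)


x_train, y_train = make_data(40)
x_test, y_test = make_data(200)

mean_x = sum(x_train) / len(x_train)
mean_y = sum(y_train) / len(y_train)


def constant_model(x):
    """Underfit: ignore x and always predict the training mean."""
    return mean_y


# Least-squares slope and intercept for a straight line
slope = (sum((x - mean_x) * (y - mean_y) for x, y in zip(x_train, y_train))
         / sum((x - mean_x) ** 2 for x in x_train))
intercept = mean_y - slope * mean_x


def line_model(x):
    """A straight line, which matches the shape of the assumed data."""
    return slope * x + intercept


def memorizing_model(x):
    """Overfit: return the target of the nearest training point, noise included."""
    nearest = min(range(len(x_train)), key=lambda i: abs(x_train[i] - x))
    return y_train[nearest]


for name, model in [("constant", constant_model),
                    ("line", line_model),
                    ("memorizing", memorizing_model)]:
    print(name,
          "| train MSE:", round(mse(model, x_train, y_train), 2),
          "| test MSE:", round(mse(model, x_test, y_test), 2))
```

The memorizing model has zero error on the training data because it has stored the noise, yet it does worse than the line on the fresh test data. The constant model does badly on both.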

It is worth noting that underfitting is not as prevalent as overfitting. Nevertheless, we
want to avoid both of those problems in data analysis. You might say we are trying to
find the middle ground between under- and overfitting our model. As you will see,
train/test split and cross validation help to avoid overfitting more than underfitting.
Let’s dive into both of them!

Train/Test Split
As I said before, the data we use is usually split into training data and test data. The
training set contains a known output and the model learns from this data in order to
generalize to other data later on. We keep the test dataset (or subset) aside in order to test
our model’s predictions on data it has not seen.
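Before reaching for a library, the idea can be sketched in a few lines of plain Python. The helper below, holdout_split, is a hypothetical name of my own and not part of Scikit-Learn; it just shuffles a copy of the rows and cuts it in two.

```python
import random


def holdout_split(rows, test_fraction=0.2, seed=0):
    """Shuffle a copy of `rows`, then cut it into (train, test) lists."""
    shuffled = list(rows)
    random.Random(seed).shuffle(shuffled)  # seeded so the split is repeatable
    n_test = max(1, round(len(shuffled) * test_fraction))
    return shuffled[n_test:], shuffled[:n_test]


rows = list(range(10))  # stand-in for ten observations
train, test = holdout_split(rows, test_fraction=0.2)
print("train:", train)
print("test:", test)
```

Every observation ends up in exactly one of the two lists, which is the property that matters: the model never sees the test rows while it is being fitted.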


Train/Test Split

Let’s see how to do this in Python. We’ll do this using the Scikit-Learn library and
specifically the train_test_split function. We’ll start with importing the necessary
libraries:

import pandas as pd
from sklearn import datasets, linear_model
from sklearn.model_selection import train_test_split
from matplotlib import pyplot as plt

Let’s quickly go over the libraries I’ve imported:

Pandas — to load the data file as a Pandas data frame and analyze the data. If you
want to read more on Pandas, feel free to check out my post!

From Sklearn, I’ve imported the datasets module, so I can load a sample dataset,
and the linear_model module, so I can run a linear regression.

From Sklearn, sub-library model_selection, I’ve imported train_test_split so I
can, well, split the data into training and test sets.

From Matplotlib I’ve imported pyplot in order to plot graphs of the data

OK, all set! Let’s load in the diabetes dataset, turn it into a data frame and define the
columns’ names:

# Load the Diabetes dataset
columns = "age sex bmi map tc ldl hdl tch ltg glu".split()  # declare the column names
diabetes = datasets.load_diabetes()  # load the diabetes dataset from sklearn
df = pd.DataFrame(diabetes.data, columns=columns)  # load the dataset as a pandas data frame
y = diabetes.target  # define the target variable (dependent variable) as y

Now we can use the train_test_split function in order to make the split. The
test_size=0.2 inside the function indicates the fraction of the data that should be
held out for testing. The train/test ratio is usually around 80/20 or 70/30.

# create training and testing vars
X_train, X_test, y_train, y_test = train_test_split(df, y, test_size=0.2)
print(X_train.shape, y_train.shape)
print(X_test.shape, y_test.shape)

(353, 10) (353,)
(89, 10) (89,)
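Where do 353 and 89 come from? Here is a small arithmetic sketch, assuming the test count is the fraction of the rows rounded up and the training set gets the rest. That assumption is consistent with the shapes printed above; split_sizes is a hypothetical helper, not a Scikit-Learn function.

```python
import math


def split_sizes(n_samples, test_size):
    """Return (n_train, n_test) for a fractional test_size, rounding the test set up."""
    n_test = math.ceil(test_size * n_samples)
    return n_samples - n_test, n_test


print(split_sizes(442, 0.2))  # the diabetes data has 442 rows -> (353, 89)
print(split_sizes(442, 0.3))  # a 70/30 split -> (309, 133)
```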

Now we’ll fit the model on the training data:

# fit a model
lm = linear_model.LinearRegression()

model = lm.fit(X_train, y_train)

predictions = lm.predict(X_test)

As you can see, we’re fitting the model on the training data and trying to predict the
test data. Let’s see what (some of) the predictions are:

predictions[0:5]
array([ 205.68012533, 64.58785513, 175.12880278, 169.95993301,
128.92035866])

Note: because I used [0:5] after predictions, it only showed the first five predicted
values. Removing the [0:5] would have made it print all of the predicted values that
our model created.

Let’s plot the model:


## The line / model
plt.scatter(y_test, predictions)
plt.xlabel("True Values")
plt.ylabel("Predictions")

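And print the score (for a regression model like this one, score returns R², the
coefficient of determination, rather than a classification accuracy):

print("Score:", model.score(X_test, y_test))

Score: 0.485829586737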
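One way to look for overfitting with a single split is to score the model on both subsets and compare. The sketch below is my own addition and makes two assumptions that are not in the walkthrough above: it passes random_state=42 so the shuffle is repeatable, and it works on the raw arrays instead of the data frame.

```python
from sklearn import datasets, linear_model
from sklearn.model_selection import train_test_split

X, y = datasets.load_diabetes(return_X_y=True)

# random_state fixes the shuffle, so the split and the scores are repeatable
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42)

model = linear_model.LinearRegression().fit(X_train, y_train)
train_r2 = model.score(X_train, y_train)  # R^2 on data the model has seen
test_r2 = model.score(X_test, y_test)     # R^2 on held-out data
print("Train R^2:", round(train_r2, 3))
print("Test R^2:", round(test_r2, 3))
```

A training score far above the test score would hint at overfitting; two low scores would hint at underfitting.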

There you go! Here is a summary of what I did: I’ve loaded in the data, split it into
training and testing sets, fitted a regression model to the training data, made
predictions for the test data and scored those predictions against the true test values. Seems good,
right? But train/test split does have its dangers: what if the split we make isn’t
random? What if one subset of our data has only people from a certain state, employees
with a certain income level but not other income levels, only women or only people of a
certain age? (Imagine a file ordered by one of these.) This will result in overfitting,
even though we’re trying to avoid it! This is where cross validation comes in.
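Here is a small sketch of that ordered-file problem, using made-up rows (80 from one state followed by 20 from another; the states and counts are purely illustrative). Slicing the file as it is puts a single state in each subset, while shuffling first mixes them.

```python
import random

# Hypothetical file: 80 rows from state "CA" followed by 20 rows from "NY"
rows = [("CA", i) for i in range(80)] + [("NY", i) for i in range(20)]
cut = int(len(rows) * 0.8)

# Slicing without shuffling: the last 20% of the file becomes the test set
ordered_train, ordered_test = rows[:cut], rows[cut:]
print("ordered train states:", {state for state, _ in ordered_train})
print("ordered test states:", {state for state, _ in ordered_test})

# Shuffling first makes it very likely that both states land in both subsets
shuffled = rows[:]
random.Random(0).shuffle(shuffled)
shuffled_train, shuffled_test = shuffled[:cut], shuffled[cut:]
print("shuffled train states:", sorted({state for state, _ in shuffled_train}))
print("shuffled test states:", sorted({state for state, _ in shuffled_test}))
```

In the ordered case the model is trained only on "CA" rows and tested only on "NY" rows, so the test score says little about how it handles the population as a whole.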

Cross Validation
In the previous paragraph, I mentioned the caveats of the train/test split method. In
order to avoid them, we can perform something called cross validation. It’s very similar
to train/test split, but it’s applied to more subsets. Meaning, we split our data into k
subsets, and train on k-1 of those subsets. What we do is hold the last subset out for
testing. We’re able to do this for each of the subsets in turn.

Visual Representation of Train/Test Split and Cross Validation. H/t to my DSI instructor, Joseph Nelson!

There are a bunch of cross validation methods; I’ll go over two of them: the first is K-
Folds Cross Validation and the second is Leave One Out Cross Validation (LOOCV).

K-Folds Cross Validation

In K-Folds Cross Validation we split our data into k different subsets (or folds). We use
k-1 subsets to train our model and leave the last subset (or the last fold) as test data. We
repeat this so that each fold serves once as the test data, average the model’s scores
across the folds and then finalize our model. After that we test it against a held-out test set.


Visual representation of K-Folds. Again, H/t to Joseph Nelson!

Here is a very simple example from the Sklearn documentation for K-Folds:

import numpy as np
from sklearn.model_selection import KFold  # import KFold

X = np.array([[1, 2], [3, 4], [1, 2], [3, 4]])  # create an array
y = np.array([1, 2, 3, 4])  # create another array
kf = KFold(n_splits=2)  # define the split - into 2 folds
kf.get_n_splits(X)  # returns the number of splitting iterations in the cross-validator

print(kf)

KFold(n_splits=2, random_state=None, shuffle=False)

And let’s see the result — the folds:

for train_index, test_index in kf.split(X):
    print("TRAIN:", train_index, "TEST:", test_index)
    X_train, X_test = X[train_index], X[test_index]
    y_train, y_test = y[train_index], y[test_index]

TRAIN: [2 3] TEST: [0 1]
TRAIN: [0 1] TEST: [2 3]

As you can see, the function split the original data into different subsets of the data.
Again, very simple example but I think it explains the concept pretty well.
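If you are curious what is going on under the hood, the unshuffled case can be sketched in plain Python. This is my own illustration, not Sklearn’s actual code; k_fold_indices is a hypothetical name, and it assumes contiguous folds where any leftover samples go to the first folds.

```python
def k_fold_indices(n_samples, n_splits):
    """Yield (train_indices, test_indices) for contiguous, unshuffled folds."""
    # The first n_samples % n_splits folds get one extra sample
    base, extra = divmod(n_samples, n_splits)
    start = 0
    for fold in range(n_splits):
        size = base + (1 if fold < extra else 0)
        test = list(range(start, start + size))
        train = [i for i in range(n_samples) if i < start or i >= start + size]
        yield train, test
        start += size


for train, test in k_fold_indices(4, 2):
    print("TRAIN:", train, "TEST:", test)
# TRAIN: [2, 3] TEST: [0, 1]
# TRAIN: [0, 1] TEST: [2, 3]
```

For four samples and two folds this gives the same index pairs as the example above: every index appears in a test fold exactly once.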


Leave One Out Cross Validation (LOOCV)

This is another method for cross validation, Leave One Out Cross Validation (by the
way, these methods are not the only two; there are a bunch of other methods for cross
validation. Check them out on the Sklearn website). In this type of cross validation, the
number of folds (subsets) equals the number of observations we have in the dataset.
Each observation is held out once as a one-row test set while the model is trained on all
the others, and we then average the results over ALL of these folds. Because we get a
large number of training sets (equal to the number of samples), this method is very
computationally expensive and should be used on small datasets. If the dataset is big, it
would most likely be better to use a different method, like K-Folds.

Let’s check out another example from Sklearn:

from sklearn.model_selection import LeaveOneOut

X = np.array([[1, 2], [3, 4]])
y = np.array([1, 2])
loo = LeaveOneOut()
loo.get_n_splits(X)

for train_index, test_index in loo.split(X):
    print("TRAIN:", train_index, "TEST:", test_index)
    X_train, X_test = X[train_index], X[test_index]
    y_train, y_test = y[train_index], y[test_index]
    print(X_train, X_test, y_train, y_test)

And this is the output:

TRAIN: [1] TEST: [0]
[[3 4]] [[1 2]] [2] [1]
TRAIN: [0] TEST: [1]
[[1 2]] [[3 4]] [1] [2]

Again, simple example, but I really do think it helps in understanding the basic concept
of this method.
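To see the whole leave-one-out loop end to end, here is a tiny plain-Python sketch. It assumes the simplest possible "model", one that predicts the mean of the training targets, so that the arithmetic is easy to follow by hand; loocv_mse is a hypothetical helper name.

```python
def loocv_mse(ys):
    """Leave-one-out MSE of a model that predicts the mean of the training targets."""
    errors = []
    for i, held_out in enumerate(ys):
        training = ys[:i] + ys[i + 1:]              # every observation except one
        prediction = sum(training) / len(training)  # "fit" on the remaining rows
        errors.append((prediction - held_out) ** 2)
    return sum(errors) / len(errors)                # one error per observation, averaged


print(loocv_mse([1.0, 2.0, 3.0, 4.0]))
```

With four observations the model is fitted four times, once per held-out row, which is exactly why the cost grows with the size of the dataset.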

So, what method should we use? How many folds? Well, the more folds we have, the more
we reduce the error due to bias but increase the error due to variance; the
computational price goes up too, obviously: the more folds you have, the longer
it takes to compute, and you might need more memory. With a lower number of
folds, we’re reducing the error due to variance, but the error due to bias would be
bigger. It would also be computationally cheaper. Therefore, in big datasets, k=3 is
usually advised (k=5 and k=10 are also common choices). In smaller datasets, as I’ve
mentioned before, it’s best to use LOOCV.
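The computational side of that trade-off is easy to put into numbers. The sketch below assumes the rows are divided as evenly as possible among the folds and uses the 442 rows of the diabetes data as an example; fold_sizes is a hypothetical helper.

```python
def fold_sizes(n_samples, n_splits):
    """Test-fold sizes when n_samples are divided as evenly as possible."""
    base, extra = divmod(n_samples, n_splits)
    return [base + 1] * extra + [base] * (n_splits - extra)


n = 442  # number of rows in the diabetes data used in this post
for k in (3, 6, 10, n):  # k = n is leave-one-out
    sizes = fold_sizes(n, k)
    print("k =", k,
          "| model fits:", k,
          "| largest test fold:", max(sizes),
          "| training rows per fit: at least", n - max(sizes))
```

More folds mean more model fits, each one trained on a larger share of the data and tested on a smaller slice.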

. . .

Let’s check out the example I used before, this time using cross validation. I’ll use
the cross_val_predict function to return the predicted value for each data point when
it’s in the testing slice.

# Necessary imports:
from sklearn.model_selection import cross_val_score, cross_val_predict
from sklearn import metrics

As you remember, earlier on I created the train/test split for the diabetes dataset and
fitted a model. Let’s see what the score is after cross validation:

# Perform 6-fold cross validation
scores = cross_val_score(model, df, y, cv=6)
print("Cross-validated scores:", scores)

Cross-validated scores: [ 0.4554861 0.46138572 0.40094084 0.55220736 0.43942775 0.56923406]

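As you can see, the score varies from fold to fold, from about 0.40 to about 0.57, and
averages roughly 0.48, close to the 0.486 we got from the single split. Each fold fits a
fresh copy of the model, so these numbers show how much the score depends on the split
rather than an improvement of the original model. Not an amazing result, but hey, we’ll
take what we can get :)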
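A common way to summarize fold scores is their mean and spread. Here is a minimal sketch using only the standard library, with the six scores copied from the output above:

```python
import statistics

# Fold scores copied from the cross_val_score output above
scores = [0.4554861, 0.46138572, 0.40094084, 0.55220736, 0.43942775, 0.56923406]

mean_score = statistics.mean(scores)
spread = statistics.stdev(scores)  # sample standard deviation across the folds
print("Mean R^2: %.3f (+/- %.3f)" % (mean_score, spread))
```

The mean is the number to compare against the single-split score; the spread gives a feel for how sensitive the score is to which rows land in the test fold.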

Now, let’s plot the new predictions, after performing cross validation:

# Make cross validated predictions

predictions = cross_val_predict(model, df, y, cv=6)
plt.scatter(y, predictions)


You can see it’s very different from the original plot from earlier. It has about five times
as many points as the original plot (442 rather than 89) because with cross validation every
observation ends up in a test fold exactly once.

Finally, let’s check the R² score of the model (R² is a “number that indicates the
proportion of the variance in the dependent variable that is predictable from the
independent variable(s)”. Basically, how accurate is our model):

accuracy = metrics.r2_score(y, predictions)
print("Cross-Predicted Accuracy:", accuracy)

Cross-Predicted Accuracy: 0.490806583864
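For anyone who wants to see what goes into that number, R² can be computed by hand as 1 minus the ratio of the residual sum of squares to the total sum of squares. The function below is a plain-Python sketch of that formula on made-up values; r2_score_by_hand is a hypothetical name, not the Sklearn function.

```python
def r2_score_by_hand(y_true, y_pred):
    """R^2 = 1 - SS_res / SS_tot."""
    mean_true = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))  # unexplained variation
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)          # total variation
    return 1 - ss_res / ss_tot


y_true = [3.0, -0.5, 2.0, 7.0]
print(r2_score_by_hand(y_true, [2.5, 0.0, 2.0, 8.0]))  # close fit, R^2 near 0.95
print(r2_score_by_hand(y_true, y_true))                 # perfect predictions give 1.0
print(r2_score_by_hand(y_true, [2.875] * 4))            # always predicting the mean gives 0.0
```

So a score of about 0.49 means the model explains roughly half of the variation in the target.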

. . .

That’s it for this time! I hope you enjoyed this post. As always, I welcome questions,
notes, comments and requests for posts on topics you’d like to read. See you next time!
