
83_sklearn_pipeline

May 5, 2024

0.1 Let’s redefine a model


[ ]: # Let's import some packages

from dataidea.packages import * # imports np, pd, plt, etc


from sklearn.neighbors import KNeighborsRegressor

[ ]: # loading the data set

data = pd.read_csv('../assets/boston.csv')

The Boston Housing Dataset


The Boston Housing Dataset is derived from information collected by the U.S. Census Service
concerning housing in the area of Boston, MA. The following describes the dataset columns:
• CRIM - per capita crime rate by town
• ZN - proportion of residential land zoned for lots over 25,000 sq.ft.
• INDUS - proportion of non-retail business acres per town.
• CHAS - Charles River dummy variable (1 if tract bounds river; 0 otherwise)
• NOX - nitric oxides concentration (parts per 10 million)
• RM - average number of rooms per dwelling
• AGE - proportion of owner-occupied units built prior to 1940
• DIS - weighted distances to five Boston employment centres
• RAD - index of accessibility to radial highways
• TAX - full-value property-tax rate per $10,000
• PTRATIO - pupil-teacher ratio by town
• B - 1000(Bk - 0.63)^2 where Bk is the proportion of blacks by town
• LSTAT - % lower status of the population
• MEDV - Median value of owner-occupied homes in $1000’s
[ ]: # looking at the top part

data.head()

[ ]:       CRIM    ZN  INDUS  CHAS    NOX     RM   AGE     DIS  RAD    TAX  \
     0  0.00632  18.0   2.31     0  0.538  6.575  65.2  4.0900    1  296.0
     1  0.02731   0.0   7.07     0  0.469  6.421  78.9  4.9671    2  242.0
     2  0.02729   0.0   7.07     0  0.469  7.185  61.1  4.9671    2  242.0
     3  0.03237   0.0   2.18     0  0.458  6.998  45.8  6.0622    3  222.0
     4  0.06905   0.0   2.18     0  0.458  7.147  54.2  6.0622    3  222.0

        PTRATIO       B  LSTAT  MEDV
     0     15.3  396.90   4.98  24.0
     1     17.8  396.90   9.14  21.6
     2     17.8  392.83   4.03  34.7
     3     18.7  394.63   2.94  33.4
     4     18.7  396.90   5.33  36.2
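Before training anything, here is a quick optional check (not part of the original notebook) of the dataset's shape and of how many distinct values CHAS and RAD take; their small number of distinct values is why they are treated as categorical later on.

[ ]: # optional: dataset size and the two categorical-like columns
print(data.shape)                       # (number of rows, number of columns)
print(data[['CHAS', 'RAD']].nunique())  # few distinct values, hence treated as categorical later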

0.1.1 Training our first model


In week 4, we learned that to train a model (for supervised machine learning), we need a set of
X variables (also called independent or predictor variables) and a y variable (also called the
dependent, outcome, or predicted variable).

[ ]: # Selecting our X set and y

X = data.drop('MEDV', axis=1)
y = data.MEDV

Now we can train the KNeighborsRegressor model. By default, this model makes predictions by
averaging the values of the 5 nearest neighbors of the point you want to predict.
[ ]: # let's train the KNeighborsRegressor

knn_model = KNeighborsRegressor() # instantiate the model class


knn_model.fit(X, y) # train the model on X, y
score = knn_model.score(X, y) # obtain the model score on X, y
predicted_y = knn_model.predict(X) # make predictions on X

print('score:', score)

score: 0.716098217736928
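To make the averaging idea concrete, here is a small optional check (not part of the original notebook) that a single prediction is just the mean of the target values of the 5 nearest neighbors, using the fitted model's kneighbors method:

[ ]: # optional: a KNN regression prediction is the mean of its neighbors' targets
distances, indices = knn_model.kneighbors(X.iloc[[0]])  # 5 nearest neighbors of row 0
manual_prediction = y.iloc[indices[0]].mean()           # average their MEDV values

print('manual average  :', manual_prediction)
print('model prediction:', knn_model.predict(X.iloc[[0]])[0])  # should match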
Now let's go ahead and try to visualize the performance of the model. The scatter plot shows the
true labels against the predicted labels. Do you think the model is doing well?
[ ]: # looking at the performance

plt.scatter(y, predicted_y)
plt.title('Model Performance')
plt.xlabel('True y')
plt.ylabel('Predicted y')
plt.show()

0.2 Some feature selection.
Feature selection is a process where you automatically select those features in your data that
contribute most to the prediction variable or output in which you are interested.
In week 7 we learned that having irrelevant features in your data can decrease the accuracy of many
models. In the code below, we try to find the features that contribute most to the outcome variable.
[ ]: from sklearn.feature_selection import SelectKBest
from sklearn.feature_selection import f_regression  # score function for ANOVA with a continuous outcome

[ ]: # lets do some feature selection using ANOVA

data_num = data.drop(['CHAS','RAD'], axis=1) # dropping categorical


X = data_num.drop("MEDV", axis=1)
y = data_num.MEDV

# using SelectKBest
test_reg = SelectKBest(score_func=f_regression, k=6)

fit_boston = test_reg.fit(X, y)
indexes = fit_boston.get_support(indices=True)

print(fit_boston.scores_)
print(indexes)

[ 89.48611476  75.2576423  153.95488314 112.59148028 471.84673988
   83.47745922  33.57957033 141.76135658 175.10554288  63.05422911
  601.61787111]
[ 2  3  4  7  8 10]
From the above, we can see that the best features for now are those at indexes [2 3 4 7 8 10] of
the X set built from data_num. Let's find them in the data and add our categorical columns back
to set up our new X set.
[ ]: # redefining the X set

new_X = data[['INDUS', 'NOX', 'RM', 'TAX', 'PTRATIO', 'LSTAT', 'CHAS','RAD']]
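If you would rather not hard-code the names, the selected numeric columns can be recovered from the indexes programmatically; a small optional sketch (not in the original notebook):

[ ]: # optional: recover the selected column names from the SelectKBest indexes
selected_cols = X.columns[indexes].tolist()
print(selected_cols)  # expected: ['INDUS', 'NOX', 'RM', 'TAX', 'PTRATIO', 'LSTAT']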

0.2.1 Training our second model


Now that we have selected the features X that we think contribute best to the outcome, let's
retrain our machine learning model and see whether we get better results.
[ ]: knn_model = KNeighborsRegressor()
knn_model.fit(new_X, y)
new_score = knn_model.score(new_X, y)
new_predicted_y = knn_model.predict(new_X)

print('Feature selected score:', new_score)

Feature selected score: 0.8324963639640872


The model seems to score better, with a noticeable improvement in score from 0.71 to 0.83. As
last time, let us try to visualize the difference in performance.
[ ]: plt.scatter(y, new_predicted_y)
plt.title('Model Performance')
plt.xlabel('True y')
plt.ylabel('New Predicted y')
plt.show()

I do not know about you, but as for me, I notice a meaningful improvement in the predictions made
by the model, judging from this scatter plot.

0.3 Scaling the data


In week 7, we learned some advantages of scaling our data like:
• preventing dominance by features with larger scales
• faster convergence in optimization algorithms
• reducing the impact of outliers
In the next section, we will use the sklearn StandardScaler to rescale our data, read more about
it in the sklearn documentation
[ ]: # importing the StandardScaler

from sklearn.preprocessing import StandardScaler

[ ]: scaler = StandardScaler() # instantiating the StandardScaler


standardized_data_num = scaler.fit_transform(
data[['INDUS', 'NOX', 'RM', 'TAX', 'PTRATIO', 'LSTAT']]
) # rescaling numeric features

standardized_data_num_df = pd.DataFrame(
standardized_data_num,
columns=['INDUS', 'NOX', 'RM', 'TAX', 'PTRATIO', 'LSTAT']
) # converting the standardized to dataframe
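As a quick optional check (not part of the original notebook): StandardScaler rescales each column to z = (x - mean) / std, so every standardized column should end up with a mean of roughly 0 and a standard deviation of roughly 1.

[ ]: # optional: each standardized column should have mean ~0 and std ~1
print(standardized_data_num_df.mean().round(2))
print(standardized_data_num_df.std().round(2))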

[ ]: from sklearn.preprocessing import OneHotEncoder

[ ]: one_hot_encoder = OneHotEncoder()
encoded_data_cat = one_hot_encoder.fit_transform(data[['CHAS', 'RAD']])
encoded_data_cat_array = encoded_data_cat.toarray()
# Get feature names
feature_names = one_hot_encoder.get_feature_names_out(['CHAS', 'RAD'])

encoded_data_cat_df = pd.DataFrame(
data=encoded_data_cat_array,
columns=feature_names
)

Let us combine the standardized numeric features with the encoded categorical ones to form a standardized new X set
[ ]: standardized_new_X = pd.concat(
[standardized_data_num_df, encoded_data_cat_df],
axis=1
)

0.3.1 Training our third model


Now that we have the right features selected and standardized, let us train a new model and see
whether it beats the earlier models.
[ ]: knn_model = KNeighborsRegressor()
knn_model.fit(standardized_new_X, y)
new_standard_score = knn_model.score(standardized_new_X, y)
new_predicted_y = knn_model.predict(standardized_new_X)

print('Standardized score:', new_standard_score)

Standardized score: 0.8734524530397529


This new model appears to do better than the earlier ones, with an improvement in score from
0.83 to 0.87. Do you think this is now a good model?

0.4 The Pipeline


It turns out the above efforts to improve the performance of the model add extra steps that must
be carried out before you can have a good model. But what if we could put the transformers
together into one object that does most of that work for us?

The sklearn Pipeline allows you to sequentially apply a list of transformers to preprocess the data
and, if desired, conclude the sequence with a final predictor for predictive modeling.
Intermediate steps of the pipeline must be ‘transforms’, that is, they must implement fit and
transform methods. The final estimator only needs to implement fit.
Let us build a model that puts together transformation and modelling steps into one pipeline
object
[ ]: # let's import the Pipeline from sklearn

from sklearn.pipeline import Pipeline


from sklearn.compose import ColumnTransformer

[ ]: numeric_cols = ['INDUS', 'NOX', 'RM', 'TAX', 'PTRATIO', 'LSTAT']


categorical_cols = ['CHAS', 'RAD']

[ ]: # Preprocessing steps
numeric_transformer = StandardScaler()
categorical_transformer = OneHotEncoder()

# Combine preprocessing steps


preprocessor = ColumnTransformer(
transformers=[
('num', numeric_transformer, numeric_cols),
('cat', categorical_transformer, categorical_cols)
])

# Pipeline
pipe = Pipeline([
('preprocessor', preprocessor),
('model', KNeighborsRegressor())
])

# Fit the pipeline


pipe.fit(new_X, y)

# Score the pipeline


pipe_score = pipe.score(new_X, y)

# Predict using the pipeline


pipe_predicted_y = pipe.predict(new_X)

print('Pipe Score:', pipe_score)

Pipe Score: 0.8734524530397529

[ ]: plt.scatter(y, pipe_predicted_y)
plt.title('Pipe Performance')
plt.xlabel('True y')
plt.ylabel('Pipe Predicted y')
plt.show()

We can observe that the model still gets the same good score, but now all the transformation steps,
on both numeric and categorical variables, live in a single pipeline object together with the model.
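Because everything is bundled into one estimator, the fitted pipeline can be used like any other model on raw, untransformed rows; here is a small optional illustration (not in the original notebook):

[ ]: # optional: the fitted pipeline accepts raw rows; scaling and encoding happen inside it
sample = new_X.head(3)          # raw feature values, no manual preprocessing
print(pipe.predict(sample))     # pipeline transforms, then predicts
print(y.head(3).values)         # corresponding true MEDV values for comparison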
