ML Book Notes

Chapter 2 – End-to-End Machine Learning Project

Here are a few places you can look to get data: popular open data repositories (e.g., OpenML, Kaggle, the UCI Machine Learning Repository), meta portals that list open data repositories, and other pages listing popular open data sources.

Remember to follow the machine learning project checklist.


The data we will build our model on includes metrics such as the population, median income, and median housing price for each block group in California.

A sequence of data processing components is called a data pipeline.


The problem we are solving is supervised learning: a typical regression task (univariate regression, since we predict a single value per district). There is no continuous flow of data, so batch learning will do fine. If the data were huge, you could either split your batch learning work across multiple servers (using the MapReduce technique) or use an online learning technique.

The next step is selecting a performance measure; a typical choice for regression tasks is the root mean squared error (RMSE).
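As a reminder, the standard definition of the RMSE over a dataset of m instances is:

$$\mathrm{RMSE}(\mathbf{X}, h) = \sqrt{\frac{1}{m}\sum_{i=1}^{m}\left(h(\mathbf{x}^{(i)}) - y^{(i)}\right)^{2}}$$

where h is the model's prediction function, x^{(i)} is the feature vector of the i-th instance, and y^{(i)} is its label.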


(Refer to pages 44 and 45 of the book for the notation reference.)

If there are many outlier districts, you may consider using the mean absolute error (MAE, also called the average absolute deviation), shown in Equation 2-2:
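The standard definition of the MAE (it corresponds to the ℓ1 norm, whereas the RMSE corresponds to the ℓ2 norm):

$$\mathrm{MAE}(\mathbf{X}, h) = \frac{1}{m}\sum_{i=1}^{m}\left|h(\mathbf{x}^{(i)}) - y^{(i)}\right|$$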

The higher the norm index, the more it focuses on large values and neglects small ones; this is why the RMSE (ℓ2 norm) is more sensitive to outliers than the MAE (ℓ1 norm).

Rather than manually downloading and decompressing the data, it's usually preferable to write a function that does it for you. This is useful if the data changes regularly: you can write a small script that uses the function to fetch the latest data (or you can set up a scheduled job to do that automatically at regular intervals). Automating the process of fetching the data is also useful if you need to install the dataset on multiple machines.
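A minimal sketch of such a helper, assuming the data ships as a .tgz archive containing a CSV file; the URL and paths below are placeholders, not the book's exact code:

from pathlib import Path
import tarfile
import urllib.request
import pandas as pd

def load_housing_data():
    # Download and extract the archive only if it is not already present
    tarball_path = Path("datasets/housing.tgz")
    if not tarball_path.is_file():
        Path("datasets").mkdir(parents=True, exist_ok=True)
        url = "https://example.com/datasets/housing.tgz"  # placeholder URL
        urllib.request.urlretrieve(url, tarball_path)
        with tarfile.open(tarball_path) as tarball:
            tarball.extractall(path="datasets")
    return pd.read_csv(Path("datasets/housing/housing.csv"))

housing = load_housing_data()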

Steps you need to take now:

You start by looking at the top five rows of data using the DataFrame's head() method.

The info() method is useful to get a quick description of the data, in particular the total number of rows, each attribute's type, and the number of non-null values.

Now for the categorical attributes: you can find out what categories exist and how many districts belong to each category by using the value_counts() method.
The describe() method shows a summary of the numerical attributes.
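For example (assuming the DataFrame from the loading sketch above, with the book's ocean_proximity column as the categorical attribute):

housing.head()                               # first five rows
housing.info()                               # row count, dtypes, non-null counts
housing["ocean_proximity"].value_counts()    # categories and how many districts in each
housing.describe()                           # count, mean, std, min, quartiles, max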

Another quick way to get a feel of the type of data you are dealing with is to plot a histogram for each numerical attribute.
A histogram shows the number of instances (on the vertical axis) that have a given value range (on the horizontal axis).
You can either plot this one attribute at a time, or you can call the hist() method on the whole dataset (as shown in the
following code example), and it will plot a histogram for each numerical attribute.
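A minimal sketch of that call (the bin count and figure size are arbitrary choices):

import matplotlib.pyplot as plt

housing.hist(bins=50, figsize=(12, 8))   # one histogram per numerical attribute
plt.show()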

You will find some attributes in your dataset that need preprocessing; that's normal!
For example, the target values are capped (at $500,000 for the median house value). If that capping becomes a problem, you have two options:
—Collect proper labels for the districts whose labels were capped.
—Remove those districts from the training set (and also from the test set, since
your system should not be evaluated poorly if it predicts values beyond $500,000).

Right-skewed distributions extend much farther to the right of the median than to the left; you can transform them by computing their logarithm or square root.
Also, be careful not to explore the test set: your brain may spot patterns in it and lead you to select a particular model, making your evaluation overly optimistic. This kind of overfitting is called data snooping bias.

So we split the data into a training set and a test set:

To have a stable train/test split even after updating the dataset, a common solution is to use each instance's identifier to decide whether it should go in the test set (e.g., compute a hash of the identifier and keep the instances whose hash falls in the lowest 20% of the hash range).
If you use the row index as the identifier, you need to make sure that new data gets appended to the end of the dataset and that no row ever gets deleted.
A plain random split can also be done using a function from the Scikit-Learn library, as sketched below:
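A sketch of both approaches; the hash-based helper assumes the dataset has (or is given) a stable identifier column, and train_test_split is the Scikit-Learn function mentioned above:

from zlib import crc32
import numpy as np
from sklearn.model_selection import train_test_split

def is_id_in_test_set(identifier, test_ratio):
    # Put the instance in the test set if its hash falls in the lowest test_ratio fraction
    return crc32(np.int64(identifier)) < test_ratio * 2**32

def split_data_with_id_hash(data, test_ratio, id_column):
    ids = data[id_column]
    in_test_set = ids.apply(lambda id_: is_id_in_test_set(id_, test_ratio))
    return data.loc[~in_test_set], data.loc[in_test_set]

# Plain random split, reproducible thanks to random_state
train_set, test_set = train_test_split(housing, test_size=0.2, random_state=42)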

Stratified sampling: the population is divided into homogeneous subgroups called strata, and the right number of instances
are sampled from each stratum to guarantee that the test set is representative of the overall population.
If a domain expert tells you that the median income is very important for predicting housing prices, then you should make sure that the test set is stratified on this attribute.
But since it is a continuous attribute, you first need to create an income category attribute.
It is important to have enough instances in your dataset for each stratum, or else the
estimate of a stratum’s importance may be biased.
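A sketch of stratified splitting on an income category built with pd.cut (the bin edges below are one reasonable choice, roughly matching the book's):

import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Discretize the continuous median_income into 5 income categories
housing["income_cat"] = pd.cut(housing["median_income"],
                               bins=[0.0, 1.5, 3.0, 4.5, 6.0, np.inf],
                               labels=[1, 2, 3, 4, 5])

strat_train_set, strat_test_set = train_test_split(
    housing, test_size=0.2, stratify=housing["income_cat"], random_state=42)

# Drop the helper column once the split is done
for set_ in (strat_train_set, strat_test_set):
    set_.drop("income_cat", axis=1, inplace=True)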

Having multiple splits can be useful if you want to better estimate the performance of your model (cross-validation)

If the training set is too large you may want to sample an exploration set, to make manipulations easy and fast during the
exploration phase.

So now we are moving on to the visualization phase. Since you're going to experiment with various transformations of the full training set, you should make a copy of the original so you can revert to it afterwards.

Now we plot the data, starting with a geographical scatterplot of the districts:
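A sketch of that plot, continuing from the stratified split above (alpha is lowered so high-density areas stand out):

import matplotlib.pyplot as plt

housing = strat_train_set.copy()   # work on a copy of the training set

housing.plot(kind="scatter", x="longitude", y="latitude", alpha=0.2)
plt.show()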


Since the dataset is not too large, you can easily compute the standard correlation coefficient (also called Pearson’s r)
between every pair of attributes using the corr() method:
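A short sketch (numeric_only=True skips the text column in recent Pandas versions):

corr_matrix = housing.corr(numeric_only=True)
corr_matrix["median_house_value"].sort_values(ascending=False)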

Another way to check for correlation between attributes is to use the Pandas scatter_matrix() function; a sketch of the code is shown below.
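A sketch of that call, restricted to a few promising attributes (the selection below is illustrative):

from pandas.plotting import scatter_matrix
import matplotlib.pyplot as plt

attributes = ["median_house_value", "median_income",
              "total_rooms", "housing_median_age"]
scatter_matrix(housing[attributes], figsize=(12, 8))
plt.show()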
Looking at the correlation scatterplots, the most promising attribute for predicting the median house value is the median income, because:
- the correlation is indeed quite strong
- clearly see the upward trend
- points are not too dispersed
- price cap you noticed earlier is clearly visible at $500,000
- less obvious straight lines: a horizontal line around $450,000, another around $350,000, perhaps one around
$280,000
- You may want to try removing the corresponding districts to prevent your algorithms from learning to reproduce
these data quirks.
There are limitations to using correlation coefficients to find patterns in datasets: they only capture linear relationships and can completely miss nonlinear ones.
One last thing you may want to do before preparing the data for machine learning algorithms is to try out various attribute
combinations and choose some meaningful combinations from them.
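For example (the combined-feature names below are illustrative ratios, not necessarily the book's exact ones):

housing["rooms_per_house"] = housing["total_rooms"] / housing["households"]
housing["bedrooms_ratio"] = housing["total_bedrooms"] / housing["total_rooms"]
housing["people_per_house"] = housing["population"] / housing["households"]

# Re-check how the new combinations correlate with the target
corr_matrix = housing.corr(numeric_only=True)
corr_matrix["median_house_value"].sort_values(ascending=False)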

Now, first revert to a clean training set (by copying strat_train_set once again). You should also separate the predictors and the labels, since you don't necessarily want to apply the same transformations to the predictors and to the target values.
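A minimal sketch:

housing = strat_train_set.drop("median_house_value", axis=1)   # predictors only
housing_labels = strat_train_set["median_house_value"].copy()  # labels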

So first we start with the filling of missing values and for that we have three options:
- Get rid of the corresponding district.
- Get rid of the whole attribute.
- Set the missing values to some value (zero, the mean, the median, etc.). This is called imputation.

Of course we go with option 3, as it is the least destructive one. Rather than filling the values manually with Pandas, you can use a handy Scikit-Learn class: SimpleImputer. The benefit is that it stores the median value of each feature, which makes it possible to impute missing values not only on the training set, but also on the validation set, the test set, and any new data fed to the model.
Missing values can also be replaced with the mean value
(strategy="mean"), or with the most frequent value
(strategy="most_frequent"), or with a constant value
(strategy="constant", fill_value=…).
The last two strategies support nonnumerical data.
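A sketch of the SimpleImputer usage (the median strategy only works on numerical attributes, so the text column is left out):

import numpy as np
from sklearn.impute import SimpleImputer

imputer = SimpleImputer(strategy="median")
housing_num = housing.select_dtypes(include=[np.number])  # numerical columns only

imputer.fit(housing_num)            # learns (and stores) the median of each feature
X = imputer.transform(housing_num)  # fills the missing values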

All the estimator's hyperparameters are accessible directly via public instance variables (e.g., imputer.strategy), and all the estimator's learned parameters are accessible via public instance variables with an underscore suffix (e.g., imputer.statistics_).

All transformers also have a convenience method called fit_transform(), which is equivalent to calling fit() and then transform() (but sometimes fit_transform() is optimized and runs much faster).

Scikit-Learn transformers output NumPy arrays (or sometimes SciPy sparse matrices) even when they are fed Pandas DataFrames as input.

Now we have to handle the categorical attributes and convert them from text to numbers; for that we use the OrdinalEncoder.

One issue with this representation is that ML algorithms will assume that
two nearby values are more similar than two distant values.

To resolve this we use the OneHotEncoder, which creates one binary attribute per category:
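A sketch of both encoders on the book's ocean_proximity column (note the double brackets: encoders expect 2D input):

from sklearn.preprocessing import OrdinalEncoder, OneHotEncoder

housing_cat = housing[["ocean_proximity"]]

ordinal_encoder = OrdinalEncoder()
housing_cat_encoded = ordinal_encoder.fit_transform(housing_cat)  # 0.0, 1.0, 2.0, ...

cat_encoder = OneHotEncoder()
housing_cat_1hot = cat_encoder.fit_transform(housing_cat)         # SciPy sparse matrix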


A sparse matrix is a very efficient representation for matrices that contain mostly zeros. Indeed, internally it only stores the
nonzero values and their positions.

The OneHotEncoder remembers which categories it was trained on, so it will detect an unknown category and raise an exception. If you prefer, you can set the handle_unknown hyperparameter to "ignore", in which case it will just represent the unknown category with zeros.

If a categorical attribute has a large number of possible categories, this may slow down training and degrade performance. If this happens, you may want to replace the categorical input with useful numerical features related to the categories.
Never use fit() or fit_transform() on anything other than the training set.
Note that while the training set values will always be scaled to the specified range, if new data contains outliers, these
may end up scaled outside the range. If you want to avoid this, just set the clip hyperparameter to True.

Now let's delve into feature scaling, for which there are two common approaches: min-max scaling and standardization.
1. Min-max scaling (normalization) – values are rescaled to the range 0 to 1. This is performed by subtracting the min value and dividing by the difference between the min and the max.
2. Standardization – first it subtracts the mean value (so standardized values have a zero mean), then it divides the result by the standard deviation (so standardized values have a standard deviation equal to 1).

If you want to scale a sparse matrix without converting it to a dense matrix first, you can use a StandardScaler with its with_mean hyperparameter set to False: it will only divide the data by the standard deviation, without subtracting the mean (as this would break sparsity).
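A sketch of both scalers (the feature range and the clip option are the knobs mentioned above):

from sklearn.preprocessing import MinMaxScaler, StandardScaler

min_max_scaler = MinMaxScaler(feature_range=(-1, 1), clip=True)  # clip=True keeps outliers in range
housing_num_min_max_scaled = min_max_scaler.fit_transform(housing_num)

std_scaler = StandardScaler()                 # use with_mean=False for sparse inputs
housing_num_std_scaled = std_scaler.fit_transform(housing_num)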

In a heavy-tailed distribution, normalization will squash most of the values into a small range, which is a major problem, so we should first try to make the distribution roughly symmetrical. Common solutions are:
1. Raise the feature to a power between 0 and 1 (most commonly the square root).
2. If the feature has a very long tail, such as a power-law distribution, replace the feature with its logarithm.
3. Bucketize the feature: chop its distribution into roughly equal-sized buckets and replace each value with the index of the bucket it belongs to (see the sketch after this list).
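A small sketch of options 2 and 3, using the population feature as an example:

import numpy as np
import pandas as pd

# Option 2: replace a long-tailed feature with its logarithm
log_population = np.log(housing["population"])

# Option 3: bucketize into 10 roughly equal-sized buckets and keep the bucket index
population_bucket = pd.qcut(housing["population"], q=10, labels=False)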

When a feature has a multimodal distribution, it is also helpful to bucketize it, but this time treating the bucket IDs as categories rather than numbers, which means using the OneHotEncoder. Another approach is to add a feature representing the similarity between the housing median age and a particular mode, typically computed using a radial basis function (the most common one being the Gaussian RBF):
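A sketch of such a similarity feature, assuming a mode around a housing median age of 35 (gamma controls how quickly the similarity decays):

from sklearn.metrics.pairwise import rbf_kernel

age_simil_35 = rbf_kernel(housing[["housing_median_age"]], [[35]], gamma=0.1)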
We also have to pay attention to the target variable: it may need to be transformed in the same way as the features, for example when it is skewed. We have to convert the labels from a Pandas Series to a DataFrame, since StandardScaler expects 2D inputs. Scikit-Learn offers an inverse_transform() method to undo the transformation and get predictions back in the original scale:
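A sketch of scaling the labels manually and inverting the transformation on the predictions (the single-feature model below is only for illustration):

from sklearn.linear_model import LinearRegression
from sklearn.preprocessing import StandardScaler

target_scaler = StandardScaler()
scaled_labels = target_scaler.fit_transform(housing_labels.to_frame())  # Series -> 2D DataFrame

model = LinearRegression()
model.fit(housing[["median_income"]], scaled_labels)

some_new_data = housing[["median_income"]].iloc[:5]     # pretend these are new districts
scaled_predictions = model.predict(some_new_data)
predictions = target_scaler.inverse_transform(scaled_predictions)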

We can also use a TransformedTargetRegressor:
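A sketch with the same illustrative single-feature regressor; TransformedTargetRegressor scales the labels and inverse-transforms the predictions automatically:

from sklearn.compose import TransformedTargetRegressor
from sklearn.linear_model import LinearRegression
from sklearn.preprocessing import StandardScaler

model = TransformedTargetRegressor(LinearRegression(), transformer=StandardScaler())
model.fit(housing[["median_income"]], housing_labels)
predictions = model.predict(some_new_data)   # already back in the original label scale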

Now let's see how to create custom transformers.

The RBF kernel does not treat features separately: if you pass it an array with two features, it will compute the 2D Euclidean distance to measure similarity.
Custom transformers are also used to combine features, as sketched below.
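A sketch using FunctionTransformer for a stateless custom transformer, first applying a log transform and then combining two features into a ratio (the column choices are illustrative):

import numpy as np
from sklearn.preprocessing import FunctionTransformer

# Stateless transformer: apply the log, and declare its inverse for inverse_transform()
log_transformer = FunctionTransformer(np.log, inverse_func=np.exp)
log_pop = log_transformer.transform(housing[["population"]])

# Combining features: divide the first column by the second
ratio_transformer = FunctionTransformer(lambda X: X[:, [0]] / X[:, [1]])
rooms_per_house = ratio_transformer.transform(
    housing[["total_rooms", "households"]].values)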
