ML Book Notes

Chapter 2 – End-to-End Machine Learning Project

Here are a few places you can look to get data: popular open data repositories (e.g., OpenML, Kaggle, the UCI Machine Learning Repository), meta portals that list open data repositories, and other pages listing popular open data sources.

Remember to follow the machine learning project checklist.


The data we will build our model on includes metrics such as the population, median income, and median housing price for each block group in California.

A sequence of data processing components is called a data pipeline.


The problem we are solving is supervised learning: a typical regression task (univariate regression, since we predict a single value per district). There is no continuous flow of data, so batch learning will do fine. If the data were huge, you could either split your batch learning work across multiple servers (using the MapReduce technique) or use an online learning technique.

The next step is selecting a performance measure; a typical choice for regression tasks is the root mean squared error (RMSE).
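As a reminder, the standard definition of the RMSE over a dataset of m instances is:

$$\mathrm{RMSE}(\mathbf{X}, h) = \sqrt{\frac{1}{m}\sum_{i=1}^{m}\left(h(\mathbf{x}^{(i)}) - y^{(i)}\right)^{2}}$$

where h is the model's prediction function, x^{(i)} is the feature vector of the i-th instance, and y^{(i)} is its label.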


(Refer to pages 44 and 45 of the book for the notation reference.)

If there are many outlier districts, you may consider using the mean absolute error (MAE, also called the average absolute deviation), shown in Equation 2-2:
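The standard definition of the MAE (it corresponds to the ℓ1 norm, whereas the RMSE corresponds to the ℓ2 norm):

$$\mathrm{MAE}(\mathbf{X}, h) = \frac{1}{m}\sum_{i=1}^{m}\left|h(\mathbf{x}^{(i)}) - y^{(i)}\right|$$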

The higher the norm index, the more it focuses on large values and neglects small ones; this is why the RMSE (ℓ2 norm) is more sensitive to outliers than the MAE (ℓ1 norm).

Rather than manually downloading and decompressing the data, it's usually preferable to write a function that does it for you. This is useful if the data changes regularly: you can write a small script that uses the function to fetch the latest data (or you can set up a scheduled job to do that automatically at regular intervals). Automating the process of fetching the data is also useful if you need to install the dataset on multiple machines.
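A minimal sketch of such a helper, assuming the data ships as a .tgz archive containing a CSV file; the URL and paths below are placeholders, not the book's exact code:

from pathlib import Path
import tarfile
import urllib.request
import pandas as pd

def load_housing_data():
    # Download and extract the archive only if it is not already present
    tarball_path = Path("datasets/housing.tgz")
    if not tarball_path.is_file():
        Path("datasets").mkdir(parents=True, exist_ok=True)
        url = "https://example.com/datasets/housing.tgz"  # placeholder URL
        urllib.request.urlretrieve(url, tarball_path)
        with tarfile.open(tarball_path) as tarball:
            tarball.extractall(path="datasets")
    return pd.read_csv(Path("datasets/housing/housing.csv"))

housing = load_housing_data()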

Steps you need to take now:

You start by looking at the top five rows of data using the DataFrame's head() method.

The info() method is useful to get a quick description of the data, in particular the total number of rows, each attribute's type, and the number of non-null values.

Now for the categorical attributes: you can find out what categories exist and how many districts belong to each category by using the value_counts() method.
The describe() method shows a summary of the numerical attributes.
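For example (assuming the DataFrame from the loading sketch above, with the book's ocean_proximity column as the categorical attribute):

housing.head()                               # first five rows
housing.info()                               # row count, dtypes, non-null counts
housing["ocean_proximity"].value_counts()    # categories and how many districts in each
housing.describe()                           # count, mean, std, min, quartiles, max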

Another quick way to get a feel of the type of data you are dealing with is to plot a histogram for each numerical attribute.
A histogram shows the number of instances (on the vertical axis) that have a given value range (on the horizontal axis).
You can either plot this one attribute at a time, or you can call the hist() method on the whole dataset (as shown in the
following code example), and it will plot a histogram for each numerical attribute.
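A minimal sketch of that call (the bin count and figure size are arbitrary choices):

import matplotlib.pyplot as plt

housing.hist(bins=50, figsize=(12, 8))   # one histogram per numerical attribute
plt.show()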

You will find some attributes in your dataset that need preprocessing; that's normal!
For example, the target values are capped (at $500,000 for the median house value). If that capping becomes a problem, you have two options:
—Collect proper labels for the districts whose labels were capped.
—Remove those districts from the training set (and also from the test set, since
your system should not be evaluated poorly if it predicts values beyond $500,000).

Right-skewed distributions extend much farther to the right of the median than to the left; you can transform them by computing their logarithm or square root.
Also, be careful not to explore the test set: your brain may spot patterns in it and lead you to select a particular model, making your evaluation overly optimistic. This kind of overfitting is called data snooping bias.

So we split the data into a training set and a test set:

To have a stable train/test split even after updating the dataset, a common solution is to use each instance's identifier to decide whether it should go in the test set (e.g., compute a hash of the identifier and keep the instances whose hash falls in the lowest 20% of the hash range).
If you use the row index as the identifier, you need to make sure that new data gets appended to the end of the dataset and that no row ever gets deleted.
A plain random split can also be done using a function from the Scikit-Learn library, as sketched below:
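A sketch of both approaches; the hash-based helper assumes the dataset has (or is given) a stable identifier column, and train_test_split is the Scikit-Learn function mentioned above:

from zlib import crc32
import numpy as np
from sklearn.model_selection import train_test_split

def is_id_in_test_set(identifier, test_ratio):
    # Put the instance in the test set if its hash falls in the lowest test_ratio fraction
    return crc32(np.int64(identifier)) < test_ratio * 2**32

def split_data_with_id_hash(data, test_ratio, id_column):
    ids = data[id_column]
    in_test_set = ids.apply(lambda id_: is_id_in_test_set(id_, test_ratio))
    return data.loc[~in_test_set], data.loc[in_test_set]

# Plain random split, reproducible thanks to random_state
train_set, test_set = train_test_split(housing, test_size=0.2, random_state=42)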

Stratified sampling: the population is divided into homogeneous subgroups called strata, and the right number of instances
are sampled from each stratum to guarantee that the test set is representative of the overall population.
If a domain expert tells you that the median income is very important for predicting housing prices, then you should make sure that the test set is stratified on this attribute.
But since it is a continuous attribute, you first need to create an income category attribute.
It is important to have enough instances in your dataset for each stratum, or else the
estimate of a stratum’s importance may be biased.
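A sketch of stratified splitting on an income category built with pd.cut (the bin edges below are one reasonable choice, roughly matching the book's):

import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Discretize the continuous median_income into 5 income categories
housing["income_cat"] = pd.cut(housing["median_income"],
                               bins=[0.0, 1.5, 3.0, 4.5, 6.0, np.inf],
                               labels=[1, 2, 3, 4, 5])

strat_train_set, strat_test_set = train_test_split(
    housing, test_size=0.2, stratify=housing["income_cat"], random_state=42)

# Drop the helper column once the split is done
for set_ in (strat_train_set, strat_test_set):
    set_.drop("income_cat", axis=1, inplace=True)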

Having multiple splits can be useful if you want to better estimate the performance of your model (cross-validation)

If the training set is too large you may want to sample an exploration set, to make manipulations easy and fast during the
exploration phase.

So now we are moving on to the visualization phase. Since you're going to experiment with various transformations of the full training set, you should make a copy of the original so you can revert to it afterwards.

Now we plot the data, starting with a geographical scatterplot of the districts:
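A sketch of that plot, continuing from the stratified split above (alpha is lowered so high-density areas stand out):

import matplotlib.pyplot as plt

housing = strat_train_set.copy()   # work on a copy of the training set

housing.plot(kind="scatter", x="longitude", y="latitude", alpha=0.2)
plt.show()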


Since the dataset is not too large, you can easily compute the standard correlation coefficient (also called Pearson’s r)
between every pair of attributes using the corr() method:
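A short sketch (numeric_only=True skips the text column in recent Pandas versions):

corr_matrix = housing.corr(numeric_only=True)
corr_matrix["median_house_value"].sort_values(ascending=False)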

Another way to check for correlation between attributes is to use the Pandas scatter_matrix() function; a sketch of the code is shown below.
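A sketch of that call, restricted to a few promising attributes (the selection below is illustrative):

from pandas.plotting import scatter_matrix
import matplotlib.pyplot as plt

attributes = ["median_house_value", "median_income",
              "total_rooms", "housing_median_age"]
scatter_matrix(housing[attributes], figsize=(12, 8))
plt.show()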
Looking at the correlation scatterplots, the most promising attribute for predicting the median house value is the median income, because:
- the correlation is indeed quite strong
- clearly see the upward trend
- points are not too dispersed
- price cap you noticed earlier is clearly visible at $500,000
- less obvious straight lines: a horizontal line around $450,000, another around $350,000, perhaps one around
$280,000
- You may want to try removing the corresponding districts to prevent your algorithms from learning to reproduce
these data quirks.
There are limitations to using correlation coefficients to find patterns in datasets: they only capture linear relationships and can completely miss nonlinear ones.
One last thing you may want to do before preparing the data for machine learning algorithms is to try out various attribute
combinations and choose some meaningful combinations from them.
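For example (the combined-feature names below are illustrative ratios, not necessarily the book's exact ones):

housing["rooms_per_house"] = housing["total_rooms"] / housing["households"]
housing["bedrooms_ratio"] = housing["total_bedrooms"] / housing["total_rooms"]
housing["people_per_house"] = housing["population"] / housing["households"]

# Re-check how the new combinations correlate with the target
corr_matrix = housing.corr(numeric_only=True)
corr_matrix["median_house_value"].sort_values(ascending=False)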

Now, first revert to a clean training set (by copying strat_train_set once again). You should also separate the predictors and the labels, since you don't necessarily want to apply the same transformations to the predictors and to the target values.
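A minimal sketch:

housing = strat_train_set.drop("median_house_value", axis=1)   # predictors only
housing_labels = strat_train_set["median_house_value"].copy()  # labels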

So first we start with the filling of missing values and for that we have three options:
- Get rid of the corresponding district.
- Get rid of the whole attribute.
- Set the missing values to some value (zero, the mean, the median, etc.). This is called imputation.

Of course we go with option 3, as it is the least destructive one. Rather than filling the values manually with Pandas, you can use a handy Scikit-Learn class: SimpleImputer. The benefit is that it stores the median value of each feature, which makes it possible to impute missing values not only on the training set, but also on the validation set, the test set, and any new data fed to the model.
Missing values can also be replaced with the mean value
(strategy="mean"), or with the most frequent value
(strategy="most_frequent"), or with a constant value
(strategy="constant", fill_value=…).
The last two strategies support nonnumerical data.
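A sketch of the SimpleImputer usage (the median strategy only works on numerical attributes, so the text column is left out):

import numpy as np
from sklearn.impute import SimpleImputer

imputer = SimpleImputer(strategy="median")
housing_num = housing.select_dtypes(include=[np.number])  # numerical columns only

imputer.fit(housing_num)            # learns (and stores) the median of each feature
X = imputer.transform(housing_num)  # fills the missing values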

All the estimator's hyperparameters are accessible directly via public instance variables (e.g., imputer.strategy), and all the estimator's learned parameters are accessible via public instance variables with an underscore suffix (e.g., imputer.statistics_).

All transformers also have a convenience method called fit_transform(), which is equivalent to calling fit() and then transform() (but sometimes fit_transform() is optimized and runs much faster).

Scikit-Learn transformers output NumPy arrays (or sometimes SciPy sparse matrices) even when they are fed Pandas DataFrames as input.

Now we have to handle the categorical attributes and convert them from text to numbers; for that we use the OrdinalEncoder.

One issue with this representation is that ML algorithms will assume that
two nearby values are more similar than two distant values.

To resolve this we use the OneHotEncoder, which creates one binary attribute per category:
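A sketch of both encoders on the book's ocean_proximity column (note the double brackets: encoders expect 2D input):

from sklearn.preprocessing import OrdinalEncoder, OneHotEncoder

housing_cat = housing[["ocean_proximity"]]

ordinal_encoder = OrdinalEncoder()
housing_cat_encoded = ordinal_encoder.fit_transform(housing_cat)  # 0.0, 1.0, 2.0, ...

cat_encoder = OneHotEncoder()
housing_cat_1hot = cat_encoder.fit_transform(housing_cat)         # SciPy sparse matrix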


A sparse matrix is a very efficient representation for matrices that contain mostly zeros. Indeed, internally it only stores the
nonzero values and their positions.

The OneHotEncoder remembers which categories it was trained on, so it will detect an unknown category and raise an exception. If you prefer, you can set the handle_unknown hyperparameter to "ignore", in which case it will just represent the unknown category with zeros.

If a categorical attribute has a large number of possible categories, this may slow down training and degrade performance. If this happens, you may want to replace the categorical input with useful numerical features related to the categories.
Never use fit() or fit_transform() on anything other than the training set.
Note that while the training set values will always be scaled to the specified range, if new data contains outliers, these
may end up scaled outside the range. If you want to avoid this, just set the clip hyperparameter to True.

Now let's delve into feature scaling, for which there are two common approaches: min-max scaling and standardization.
1. Min-max scaling (normalization) – values are rescaled to the range 0 to 1. This is performed by subtracting the min value and dividing by the difference between the min and the max.
2. Standardization – first it subtracts the mean value (so standardized values have a zero mean), then it divides the result by the standard deviation (so standardized values have a standard deviation equal to 1).

If you want to scale a sparse matrix without converting it to a dense matrix first, you can use a StandardScaler with its with_mean hyperparameter set to False: it will only divide the data by the standard deviation, without subtracting the mean (as this would break sparsity).
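A sketch of both scalers (the feature range and the clip option are the knobs mentioned above):

from sklearn.preprocessing import MinMaxScaler, StandardScaler

min_max_scaler = MinMaxScaler(feature_range=(-1, 1), clip=True)  # clip=True keeps outliers in range
housing_num_min_max_scaled = min_max_scaler.fit_transform(housing_num)

std_scaler = StandardScaler()                 # use with_mean=False for sparse inputs
housing_num_std_scaled = std_scaler.fit_transform(housing_num)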

In a heavy-tailed distribution, normalization will squash most of the values into a small range, which is a major problem, so we should first try to make the distribution roughly symmetrical. Common solutions are:
1. Raise the feature to a power between 0 and 1 (most commonly the square root).
2. If the feature has a very long tail, such as a power-law distribution, replace the feature with its logarithm.
3. Bucketize the feature: chop its distribution into roughly equal-sized buckets and replace each value with the index of the bucket it belongs to (see the sketch after this list).
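A small sketch of options 2 and 3, using the population feature as an example:

import numpy as np
import pandas as pd

# Option 2: replace a long-tailed feature with its logarithm
log_population = np.log(housing["population"])

# Option 3: bucketize into 10 roughly equal-sized buckets and keep the bucket index
population_bucket = pd.qcut(housing["population"], q=10, labels=False)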

When a feature has a multimodal distribution, it is also helpful to bucketize it, but this time treating the bucket IDs as categories rather than numbers, which means using the OneHotEncoder. Another approach is to add a feature representing the similarity between the housing median age and a particular mode, typically computed using a radial basis function (the most common one being the Gaussian RBF):
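A sketch of such a similarity feature, assuming a mode around a housing median age of 35 (gamma controls how quickly the similarity decays):

from sklearn.metrics.pairwise import rbf_kernel

age_simil_35 = rbf_kernel(housing[["housing_median_age"]], [[35]], gamma=0.1)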
We also have to pay attention to the target variable: it may need to be transformed in the same way as the features, for example when it is skewed. We have to convert the labels from a Pandas Series to a DataFrame, since StandardScaler expects 2D inputs. Scikit-Learn offers an inverse_transform() method to undo the transformation and get predictions back in the original scale:
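A sketch of scaling the labels manually and inverting the transformation on the predictions (the single-feature model below is only for illustration):

from sklearn.linear_model import LinearRegression
from sklearn.preprocessing import StandardScaler

target_scaler = StandardScaler()
scaled_labels = target_scaler.fit_transform(housing_labels.to_frame())  # Series -> 2D DataFrame

model = LinearRegression()
model.fit(housing[["median_income"]], scaled_labels)

some_new_data = housing[["median_income"]].iloc[:5]     # pretend these are new districts
scaled_predictions = model.predict(some_new_data)
predictions = target_scaler.inverse_transform(scaled_predictions)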

We can also use a TransformedTargetRegressor:
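A sketch with the same illustrative single-feature regressor; TransformedTargetRegressor scales the labels and inverse-transforms the predictions automatically:

from sklearn.compose import TransformedTargetRegressor
from sklearn.linear_model import LinearRegression
from sklearn.preprocessing import StandardScaler

model = TransformedTargetRegressor(LinearRegression(), transformer=StandardScaler())
model.fit(housing[["median_income"]], housing_labels)
predictions = model.predict(some_new_data)   # already back in the original label scale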

Now let's see how to create custom transformers.

The RBF kernel does not treat features separately: if you pass it an array with two features, it will compute the 2D Euclidean distance to measure similarity.
Custom transformers are also used to combine features, as sketched below.
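A sketch using FunctionTransformer for a stateless custom transformer, first applying a log transform and then combining two features into a ratio (the column choices are illustrative):

import numpy as np
from sklearn.preprocessing import FunctionTransformer

# Stateless transformer: apply the log, and declare its inverse for inverse_transform()
log_transformer = FunctionTransformer(np.log, inverse_func=np.exp)
log_pop = log_transformer.transform(housing[["population"]])

# Combining features: divide the first column by the second
ratio_transformer = FunctionTransformer(lambda X: X[:, [0]] / X[:, [1]])
rooms_per_house = ratio_transformer.transform(
    housing[["total_rooms", "households"]].values)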
