ML Book Notes
For example, if there are many outlier districts, you may consider using the mean absolute error (MAE, also called the average absolute deviation), shown in Equation 2-2:
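For reference, with m instances, hypothesis h, feature vectors x⁽ⁱ⁾ and labels y⁽ⁱ⁾, Equation 2-2 is:

    \mathrm{MAE}(\mathbf{X}, h) = \frac{1}{m} \sum_{i=1}^{m} \left| h\!\left(\mathbf{x}^{(i)}\right) - y^{(i)} \right|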
You start by looking at the top five rows of data using the
DataFrame’s head() method
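A minimal sketch, assuming the data has already been loaded into a DataFrame named housing (the file path here is an assumption):

    import pandas as pd

    housing = pd.read_csv("datasets/housing/housing.csv")  # path is an assumption
    housing.head()  # returns the top five rows of the DataFrame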
Another quick way to get a feel of the type of data you are dealing with is to plot a histogram for each numerical attribute.
A histogram shows the number of instances (on the vertical axis) that have a given value range (on the horizontal axis).
You can either plot this one attribute at a time, or you can call the hist() method on the whole dataset (as shown in the
following code example), and it will plot a histogram for each numerical attribute.
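A sketch of that hist() call, assuming the same housing DataFrame:

    import matplotlib.pyplot as plt

    housing.hist(bins=50, figsize=(12, 8))  # one histogram per numerical attribute
    plt.show()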
It is normal to find some attributes in your dataset that need to be preprocessed!
So if the capping becomes a problem, then you have two options:
—Collect proper labels for the districts whose labels were capped.
—Remove those districts from the training set (and also from the test set, since
your system should not be evaluated poorly if it predicts values beyond $500,000).
Right-skewed distributions: they extend much farther to the right of the median than to the left. You can transform such attributes by computing their logarithm or square root.
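A minimal sketch, assuming population is one such right-skewed attribute:

    import numpy as np

    # replace the skewed attribute with its log (or use np.sqrt for a milder transformation)
    housing["log_population"] = np.log(housing["population"])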
Looking at the test set too early and (mentally) overfitting to patterns in it leads to what we call data snooping bias: an overly optimistic estimate of the generalization error.
So we split the data into a training set and a test set:
To have a stable train/test split even after updating the dataset, a common solution is to use each instance's identifier to decide whether or not it should go in the test set (e.g., compute a hash of each instance's identifier and put that instance in the test set if the hash is lower than or equal to 20% of the maximum hash value). But if you use the row index as the identifier, you need to make sure that new data gets appended to the end of the dataset and that no row ever gets deleted.
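A sketch along the lines of that hash-based approach (the function names are just illustrative):

    from zlib import crc32
    import numpy as np

    def is_id_in_test_set(identifier, test_ratio):
        # keep the instance in the test set if its hash falls in the lowest test_ratio fraction
        return crc32(np.int64(identifier)) < test_ratio * 2**32

    def split_data_with_id_hash(data, test_ratio, id_column):
        ids = data[id_column]
        in_test_set = ids.apply(lambda id_: is_id_in_test_set(id_, test_ratio))
        return data.loc[~in_test_set], data.loc[in_test_set]

    housing_with_id = housing.reset_index()  # adds an "index" column to use as the identifier
    train_set, test_set = split_data_with_id_hash(housing_with_id, 0.2, "index")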
The same thing can be done using a function from the Scikit-Learn library, train_test_split:
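A minimal sketch (random_state fixes the random seed so the split is reproducible):

    from sklearn.model_selection import train_test_split

    train_set, test_set = train_test_split(housing, test_size=0.2, random_state=42)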
Stratified sampling: the population is divided into homogeneous subgroups called strata, and the right number of instances
are sampled from each stratum to guarantee that the test set is representative of the overall population.
So if someone tells you that the median income is very important for predicting housing prices, you should make sure that the test set is representative of the various income categories, i.e., stratify on this attribute. But since it is a continuous numerical attribute, you first need to create an income category attribute.
It is important to have enough instances in your dataset for each stratum, or else the
estimate of a stratum’s importance may be biased.
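A sketch of creating the income categories and the stratified split (the bin edges follow the book's example and are otherwise an arbitrary choice):

    import numpy as np
    import pandas as pd
    from sklearn.model_selection import train_test_split

    housing["income_cat"] = pd.cut(housing["median_income"],
                                   bins=[0., 1.5, 3.0, 4.5, 6., np.inf],
                                   labels=[1, 2, 3, 4, 5])
    strat_train_set, strat_test_set = train_test_split(
        housing, test_size=0.2, stratify=housing["income_cat"], random_state=42)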
Having multiple splits can be useful if you want to better estimate the performance of your model (cross-validation)
If the training set is too large you may want to sample an exploration set, to make manipulations easy and fast during the
exploration phase.
Now we move on to the visualization phase. Since you're going to experiment with various transformations of the full training set, you should make a copy of the original so you can revert to it afterwards.
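A minimal sketch of that copy, plus a geographical scatterplot and a correlation check (the attribute names are the housing dataset's):

    import matplotlib.pyplot as plt

    housing = strat_train_set.copy()

    # geographical scatterplot; a low alpha makes high-density areas stand out
    housing.plot(kind="scatter", x="longitude", y="latitude", alpha=0.2)
    plt.show()

    # standard correlation coefficient between every pair of numerical attributes
    corr_matrix = housing.corr(numeric_only=True)
    corr_matrix["median_house_value"].sort_values(ascending=False)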
Another way to check for correlation between attributes is to use the Pandas scatter_matrix() function, a sketch of which is shown below.
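Assuming the same housing DataFrame, and plotting only a few promising attributes:

    from pandas.plotting import scatter_matrix

    attributes = ["median_house_value", "median_income", "total_rooms",
                  "housing_median_age"]
    scatter_matrix(housing[attributes], figsize=(12, 8))
    plt.show()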
Looking at the correlation scatterplots, the most promising attribute for predicting the median house value seems to be the median income, because:
- the correlation is indeed quite strong
- clearly see the upward trend
- points are not too dispersed
- price cap you noticed earlier is clearly visible at $500,000
- less obvious straight lines: a horizontal line around $450,000, another around $350,000, perhaps one around
$280,000
- You may want to try removing the corresponding districts to prevent your algorithms from learning to reproduce
these data quirks.
There are limitations to using correlation coefficients to find patterns in datasets: they only capture linear relationships and can completely miss nonlinear ones.
One last thing you may want to do before preparing the data for machine learning algorithms is to try out various attribute
combinations and choose some meaningful combinations from them.
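A sketch of a few such combinations (the new column names are just illustrative):

    housing["rooms_per_house"] = housing["total_rooms"] / housing["households"]
    housing["bedrooms_ratio"] = housing["total_bedrooms"] / housing["total_rooms"]
    housing["people_per_house"] = housing["population"] / housing["households"]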
Now first, revert to a clean training set (by copying strat_train_set once again). You should also separate the predictors and the labels, since you don't necessarily want to apply the same transformations to the predictors and the target values.
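A minimal sketch (drop() returns a copy and does not modify strat_train_set itself):

    housing = strat_train_set.drop("median_house_value", axis=1)   # predictors
    housing_labels = strat_train_set["median_house_value"].copy()  # labels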
So first we start with the filling of missing values and for that we have three options:
- Get rid of the corresponding district.
- Get rid of the whole attribute.
- Set the missing values to some value (zero, the mean, the median, etc.). This is called imputation.
So of course we would go with option 3, as it is the least destructive one, but instead of doing this manually with Pandas you will use a handy Scikit-Learn class: SimpleImputer. The benefit is that it will store the median value of each feature: this will make it possible to impute missing values not only on the training set, but also on the validation set, the test set, and any new data fed to the model.
Missing values can also be replaced with the mean value
(strategy="mean"), or with the most frequent value
(strategy="most_frequent"), or with a constant value
(strategy="constant", fill_value=…).
The last two strategies support nonnumerical data.
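A sketch using the median strategy (the median can only be computed on numerical attributes, hence the select_dtypes filter):

    import numpy as np
    from sklearn.impute import SimpleImputer

    imputer = SimpleImputer(strategy="median")
    housing_num = housing.select_dtypes(include=[np.number])  # numerical attributes only
    imputer.fit(housing_num)            # learns (and stores) each column's median
    X = imputer.transform(housing_num)  # NumPy array with missing values filled in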
Now we have to handle the categorical attributes and convert them from text to numbers; for that we can use OrdinalEncoder.
One issue with this representation is that ML algorithms will assume that
two nearby values are more similar than two distant values.
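A minimal sketch, using the dataset's categorical attribute ocean_proximity:

    from sklearn.preprocessing import OrdinalEncoder

    housing_cat = housing[["ocean_proximity"]]  # 2D input: a DataFrame, not a Series
    ordinal_encoder = OrdinalEncoder()
    housing_cat_encoded = ordinal_encoder.fit_transform(housing_cat)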
Now let's delve into feature scaling, for which there are two common approaches: min-max scaling and standardization.
1. Min-max scaling (normalization) – values are rescaled so they range from 0 to 1.
This is performed by subtracting the min value and dividing by the difference between the min and the max.
2. Standardization - first it subtracts the mean value (so
standardized values have a zero mean), then it divides the result
by the standard deviation (so standardized values have a
standard deviation equal to 1)
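A sketch of both scalers applied to the numerical attributes (housing_num from the imputation step above):

    from sklearn.preprocessing import MinMaxScaler, StandardScaler

    min_max_scaler = MinMaxScaler()  # default feature_range is (0, 1)
    housing_num_min_max_scaled = min_max_scaler.fit_transform(housing_num)

    std_scaler = StandardScaler()    # zero mean, unit standard deviation
    housing_num_std_scaled = std_scaler.fit_transform(housing_num)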
If you want to scale a sparse matrix without converting it to a dense matrix first, you can use a StandardScaler with its
with_mean hyperparameter set to False: it will only divide the data by the standard deviation, without subtracting the
mean (as this would
break sparsity).
When a feature has a multimodal distribution (two or more clear peaks, called modes), it can be helpful to bucketize it, but this time treating the bucket IDs as categories rather than numbers, which means using a OneHotEncoder. Another approach is to add a feature representing the similarity between the housing median age and each particular mode; this similarity is typically computed with a radial basis function (the most common being the Gaussian RBF).
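A minimal sketch, assuming a mode around age 35 (the gamma value controls how quickly the similarity decays as you move away from 35):

    from sklearn.metrics.pairwise import rbf_kernel

    age_simil_35 = rbf_kernel(housing[["housing_median_age"]], [[35]], gamma=0.1)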
We also have to pay attention to the target variable: it may need to be transformed in the same way as the features, e.g., when it is skewed. We have to convert the labels from a Pandas Series to a DataFrame, since StandardScaler expects 2D inputs. Scikit-Learn offers an inverse_transform() method to reverse the transformation and get the final predictions back in the original scale.
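A sketch of scaling the labels, training on the scaled labels, and inverting the transformation (using a single median_income predictor just for illustration):

    from sklearn.linear_model import LinearRegression
    from sklearn.preprocessing import StandardScaler

    target_scaler = StandardScaler()
    scaled_labels = target_scaler.fit_transform(housing_labels.to_frame())  # Series -> 2D

    model = LinearRegression()
    model.fit(housing[["median_income"]], scaled_labels)
    some_new_data = housing[["median_income"]].iloc[:5]  # pretend this is new data

    scaled_predictions = model.predict(some_new_data)
    predictions = target_scaler.inverse_transform(scaled_predictions)  # back to the original scale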
The RBF kernel does not treat features differently: if you pass it an array with two features, it will compute the 2D Euclidean distance to measure similarity.
Custom transformers can also be used to combine features, e.g., to compute a ratio between two features (see the sketch below).
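A sketch using FunctionTransformer to build such a ratio feature (the column choice is just an example):

    from sklearn.preprocessing import FunctionTransformer

    # divides the first column by the second one
    ratio_transformer = FunctionTransformer(lambda X: X[:, [0]] / X[:, [1]])
    rooms_per_house = ratio_transformer.transform(
        housing[["total_rooms", "households"]].values)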