
Ch2: End-to-End Machine Learning Project

Machine Learning project checklist (from Appendix B):


1. Frame the problem and look at the big picture.
2. Get the data.
3. Explore the data to gain insights.
4. Prepare the data to better expose the underlying data patterns to ML algorithms.
5. Explore many different models and shortlist the best ones.
6. Fine-tune your models and combine them into a great solution.
7. Present your solution.
8. Launch, monitor, and maintain your system.

You'll often see slight variations of this checklist. Modify to suit your own situation and needs!

An alternate ML Project checklist (from Andrew Ng's Coursera stuff):

Introduction to our provided data set: California Housing Prices dataset (StatLib repository)
● Based on data from the 1990 California census (hence currently unrealistic).
● Author removed some features and added a categorical feature (more instructive).
● Point of possible confusion: Examples (rows) are not individual houses!!!
Block groups (a.k.a "districts"): smallest geographical unit for which the US Census
Bureau publishes sample data (a block group typically has a population of 600 to 3,000).
1.Frame the Problem

Gather Information:
Talk to people. Hunt them down! This includes not only data stewards, developers responsible
for the systems that created/compiled the data, and project leads or C-suite, but also anyone
who is responsible for downstream components that could be affected.

The more you nail down at the start, the easier it will be to manage risk and expectations.

What is the business objective?


“Knowing the objective is important because it will determine how you frame the problem, which
algorithms you will select, which performance measure you will use to evaluate your model, and
how much effort you will spend tweaking it.”

Proposed Objective: Prediction of a district’s median housing price.

(almost) Universal Objective of Data Science: Turn data into money!

Ancillary provided details:


● Our predictions will be fed to another ML system along with many other signals.
● Median district house price is currently estimated manually by experts via complex rules.
○ Costly and time-consuming.
○ Estimates were off by more than 20%.
● Owners of the downstream system confirmed that they want a numeric value, not a
coarse-grained approximation via category (cheap/medium/expensive).

Determinations:
● We have labels -> supervised learning task.
● We're trying to predict a continuous target -> regression problem (multiple regression).
● We're only predicting one target from each example -> univariate regression.

Recall (from the footnotes of Chapter1): Why is regression called regression?


"Fun fact: this odd-sounding name is a statistics term introduced by Francis Galton while he
was studying the fact that the children of tall people tend to be shorter than their parents.
Since the children were shorter, he called this regression to the mean. This name was then
applied to the methods he used to analyze correlations between variables."
Performance Measure:
Basically, how are we going to evaluate and compare models? How can we tell when we've
satisfied the objective of the project?
Root Mean Square Error (RMSE), a.k.a. the "l2 norm" (of the error vector).

m is the number of instances


x(i) is a vector of all the feature values
y(i) is its label
h(x(i)) is your system’s prediction function, a.k.a "hypothesis function"
ŷ(i) = h(x(i)) is a predicted value for the target/label for that instance (ŷ is pronounced “y-hat”).
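For reference, the formula these symbols plug into (standard RMSE, as given in the book):

RMSE(X, h) = \sqrt{ \frac{1}{m} \sum_{i=1}^{m} \left( h(x^{(i)}) - y^{(i)} \right)^{2} }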
Some notes regarding RMSE, if you're curious:
● The squaring--why?
○ To penalize large errors.
○ One of the main reasons is that it is very easy to differentiate (important for
derivative-based methods such as gradient descent).
● The square root--why?
○ Brings us back to the natural, interpretable units of the problem. Without the
square root, we'd end up talking about squared dollars ($²), whatever the heck that is.
● Why RMSE and not MAE?
○ The higher the norm index, the more it focuses on large values and neglects
small ones. The choice of index 2 is slightly arbitrary.
○ MAE is preferable when you've got plenty of outliers and you know the residuals
are not going to end up having a Normal/Gaussian distribution.

Author's suggestion: Check assumptions!


Aside: Pipelines
A sequence of data processing components is called a data pipeline.

“Components typically run asynchronously... each component pulls in a large amount of data,
processes it, and spits out the result in another data store.”

Is this true?

Synchronous - i.e. sequential, blocking; one task executed at a time; coordinated, or aligned
with a clock or timer; executing task must return before proceeding with next task

Asynchronous - non-blocking; "fire and forget" i.e. call functions and continue doing other stuff,
knowing that those functions will eventually return results on their own time

Python's asyncio (standard library package) provides typical async/await.

The key difference between synchronous and asynchronous processing is in what the processor
does while it waits for an I/O task to complete.

In synchronous execution, the processor remains idle and waits for the I/O task to complete
before executing the next set of instructions.

Asynchronous execution is not necessarily parallel execution; think about making breakfast.

Good example: Preparing breakfast (source?).


Pour a cup of coffee.
Heat up a pan, then fry two eggs.
Fry three slices of bacon.
Toast two pieces of bread.
Add butter and jam to the toast.
Pour a glass of orange juice.
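A minimal asyncio sketch of the breakfast idea (the task names and sleep times are made up for illustration): a single "cook" starts each slow task and switches between them instead of standing idle.

import asyncio

async def fry_eggs():
    await asyncio.sleep(3)      # simulate waiting on the pan (non-blocking)
    return "eggs"

async def fry_bacon():
    await asyncio.sleep(4)
    return "bacon"

async def make_toast():
    await asyncio.sleep(2)
    return "toast"

async def make_breakfast():
    coffee = "coffee"           # quick, synchronous step
    # Start eggs, bacon, and toast concurrently; total wait is ~4s, not 3+4+2.
    results = await asyncio.gather(fry_eggs(), fry_bacon(), make_toast())
    return [coffee, *results]

print(asyncio.run(make_breakfast()))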

2.Get the Data

Setup:
All notebooks, data, extra goodies available at author's github repo:
https://github.com/ageron/handson-ml2

If you just want to run the code/notebooks, tinker around, and not deal with installing a bunch of
stuff, you can just use Google Colab:
If you want to run things on your own machine:
Preferred: Anaconda--the easiest way to get up and running.
Less Preferred: What the author does in the book i.e. venv/virtualenv/whatever (unless you
have a particular reason to use these).
For Cool Kids: Docker!

Aside: The importance of virtual environments--it's all about dependencies.


Official Python Documentation: "A virtual environment is a Python environment such that the
Python interpreter, libraries and scripts installed into it are isolated from those installed in other
virtual environments, and (by default) any libraries installed in a “system” Python, i.e., one which
is installed as part of your operating system."

Why bother?
● Easier to work on different projects while avoiding package version conflicts--different
envs for different types of projects.
● By keeping project dependencies static, or at least isolated, predictable, and explicit, you
ensure that if you revisit your project at a later time, it'll actually run. The python DS/ML
ecosystem evolves at a rapid pace; functions get deprecated, and package APIs change
all the time.
● For ease of sharing and collaboration. If someone wants to run your code, rather than
having to guess which versions of the dependencies you used, they can just recreate
their own version of your environment (from a file, which you ought to have provided).
○ Ex: $ conda env export > environment.yaml
$ conda env create -f <path_to_yaml_file>
● Avoiding headaches.
○ If you bork one env, at least your others are fine.
○ conda 'solving environment'... wait 487593453 hours, or just bail?

DS/ML folks frequently use 'conda' for package/environment management. It comes with
Anaconda/Miniconda distributions, and it works pretty dang well (until it doesn't).
General python developers tend to use 'venv' (part of the python standard library), a 'venv'
extension that tries to fix particular limitations/annoyances ('virtualenv', 'virtualenvwrapper'), or
a 'venv' analog ('pyenv', 'pipenv').

Suggestion: As you build up your various environments over time, try to stick with 'conda' as
long as you can. When the time comes that you can't, just switch to using 'pip'.

Basic example of first time using:


$ conda create -n <env_name> jupyter matplotlib numpy pandas scipy scikit-learn
$ conda activate <env_name>
$ jupyter notebook
# If your default browser hasn't popped up, just manually go to http://localhost:8888/

Helpful:
$ conda init # if you didn't already agree to have the Anaconda installer do this for you
$ conda info # spits out a bunch of version/config info
$ conda env list # shows you available envs and which one is currently active (*)
$ conda list --explicit # lists the packages installed in the currently active env (as exact URLs)

Take a Quick Look at the Data Structure:


The core pandas object is the DataFrame; think of it like an excel sheet on steroids or a table in
a SQL database. Each column in the DataFrame is a Series object, which is a one-dimensional
ndarray with axis labels; think of it as a snazzy array/list.

Most useful methods for DataFrames/Series:


head(), or alternatively sample()
info()

Note: With Pandas, 'object' dtype usually means 'text/string'.

value_counts()

describe() - Very good for a sanity check; look at min/max/mean.
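Quick usage sketch of the above, assuming the housing CSV from the author's repo has been read into a DataFrame named 'housing' (the path is an assumption). In a notebook, the last expression in each cell displays itself.

import pandas as pd

housing = pd.read_csv("datasets/housing/housing.csv")   # path assumed from the repo layout

housing.head()                              # first 5 rows
housing.info()                              # dtypes + non-null counts (spot missing values)
housing["ocean_proximity"].value_counts()   # category frequencies for the lone text column
housing.describe()                          # count/mean/std/min/quartiles/max for numeric columns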

Histograms - Quick & easy way to see an approximation of the distribution of numerical data.
What to look for:
● Lines/cutoffs.
○ Why? Could indicate problems with data collection, corruption, preprocessing like
clipping/winsorization...
● Basic/known distribution types.
○ Why? We have more math and tools available for known distributions. Also,
some model types base their theoretical foundations on assumptions about their input
distributions (e.g., OLS/linear regression). Homoscedasticity?!
● Very pronounced skew.
○ Why? Again, considerations for model assumptions. But also, skew will be one of
the factors involved with decisions about feature scaling and preprocessing.
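A minimal way to get these histograms for every numerical attribute (bin count and figure size are arbitrary choices):

import matplotlib.pyplot as plt

housing.hist(bins=50, figsize=(12, 8))   # one histogram per numeric column
plt.show()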

Takeaways:
● Median income attribute does not look like it is expressed in US dollars.
○ "...the data has been scaled and capped at 15 (actually, 15.0001) for higher
median incomes, and at 0.5 (actually, 0.4999) for lower median incomes. The
numbers represent roughly tens of thousands of dollars (e.g., 3 actually means
about $30,000)."
○ It's an example of a preprocessed feature.
● Housing median age, median house value were also capped. Potentially problematic.
● Attributes have very different scales.
● There are some tail-heavy distributions.

Recall left skew vs right skew: Where is the mean in relation to the median?
Warning: Histograms can be deceiving!
https://towardsdatascience.com/6-reasons-why-you-should-stop-using-histograms-and-which-plot-you-should-use-instead-31f937a0a81c
Create a Test Set:
"your brain is an amazing pattern detection system, which means that it is highly prone to
overfitting: if you look at the test set, you may stumble upon some seemingly interesting pattern
in the test data that leads you to select a particular kind of Machine Learning model"

It's vitally important to avoid "data snooping bias", "data leakage"--anything that would give us
false confidence in our model.

Many different ways to sample and split a dataset into train and test sets...
Some cool visuals:
https://scikit-learn.org/stable/auto_examples/model_selection/plot_cv_indices.html

Most mature data science packages have some mechanism for controlling the randomness used
throughout their code/algorithms, typically by setting a seed for the pseudo-random number
generator that's used under the hood.
Scikit-Learn's train_test_split() function:
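Typical usage, with a fixed seed so the split is reproducible from run to run:

from sklearn.model_selection import train_test_split

train_set, test_set = train_test_split(housing, test_size=0.2, random_state=42)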

Discussion: Anybody actually using hashing, low-level techniques for sampling/splitting, or
anything other than tried-and-tested utility functions?

Stratified Sampling: The population is divided into homogeneous subgroups called strata.
You should not have too many strata, and each stratum should be large enough.
To stratify, we need groups, hence using Pandas' pd.cut() function to do "binning".
'median_income' -> bin into temporary column 'income_cat' -> use 'income_cat' for the purpose
of stratification -> drop 'income_cat'

Scikit-Learn's StratifiedShuffleSplit() object:
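Roughly the book's recipe (the bin edges follow the book; a default integer index on 'housing' is assumed):

import numpy as np
import pandas as pd
from sklearn.model_selection import StratifiedShuffleSplit

# Bin median_income into a temporary category used only for stratification.
housing["income_cat"] = pd.cut(housing["median_income"],
                               bins=[0., 1.5, 3.0, 4.5, 6., np.inf],
                               labels=[1, 2, 3, 4, 5])

split = StratifiedShuffleSplit(n_splits=1, test_size=0.2, random_state=42)
for train_index, test_index in split.split(housing, housing["income_cat"]):
    strat_train_set = housing.loc[train_index]
    strat_test_set = housing.loc[test_index]

# Drop the helper column once the split is done (avoiding inplace=True, per the later note).
strat_train_set = strat_train_set.drop("income_cat", axis=1)
strat_test_set = strat_test_set.drop("income_cat", axis=1)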

"We spent quite a bit of time on test set generation for a good reason: this is an often neglected
but critical part of a Machine Learning project."

We tend to miss out on this when not working on a real problem; Kaggle handles test sets for us,
but in an actual business setting it would be our responsibility to ensure the test set is
representative of the data we want to make predictions on!

Suggestion: Data Dictionary. Probably a good idea, if it hasn't already been done, to accumulate
and document all the basic info you have about the data set.

3.Discover and Visualize the Data to Gain Insights


This part of the process is usually called Exploratory Data Analysis (EDA).

Author's tip: "If the training set is very large, you may want to sample an exploration set, to make
manipulations easy and fast."
Ex: df_for_plotting = df.sample(n=df.shape[0]//10)
Also to mitigate visual clutter!
This is particularly true when playing with visualizations using unsupervised methods like t-SNE,
UMAP, etc.

Author's tip: "Our brains are very good at spotting patterns in pictures, but you may need to play
around with visualization parameters to make the patterns stand out."
-> adjust 'alpha', use various sizes/colors/shapes/'hue'

Location matters! Proximity to the ocean is important, as well as proximity to urban/city centers.

"A clustering algorithm should be useful for detecting the main cluster and for adding new
features that measure the proximity to the cluster centers."

Extra plotting tip: Get them plots bigger!


1) Set a better default plot size, up in the 'imports' section of your notebook:

2) Explicitly create a figure object, then plot:


3) Use 'figsize' arg that's available in many plotting functions (see above CA scatterplot).
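Sketches of the three options (the sizes and scatterplot columns are just examples):

import matplotlib.pyplot as plt

# 1) Better default size for every plot in the notebook.
plt.rcParams["figure.figsize"] = (10, 7)

# 2) Explicitly create a figure/axes object, then plot into it.
fig, ax = plt.subplots(figsize=(10, 7))
housing.plot(kind="scatter", x="longitude", y="latitude", alpha=0.1, ax=ax)

# 3) Pass figsize straight to the plotting call.
housing.plot(kind="scatter", x="longitude", y="latitude", alpha=0.1, figsize=(10, 7))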

Other miscellaneous plotting tips:


● Assign the plot function to a variable, or use ';' to mute plot function object/handle output.
● Try not to be lazy... label axes, enable legends, customize tick labels/format/spacing, etc.
● Anyone else have some tips???

Correlations:
Standard correlation coefficient (also called Pearson’s r) with Pandas corr() method.
Correlation coefficient ranges over [–1, 1]; ~0 means no linear correlation.

Warning: Common correlation coefficients (like Pearson's r) have limitations:


● Correlation coefficient only measures linear correlations.
● May completely miss out on nonlinear relationships.
● Strength of correlation is not related to slope.
Hence the need for scatter plots.
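Both in code: the correlation ranking and a scatter matrix for a handful of promising attributes (the attribute list follows the book):

from pandas.plotting import scatter_matrix

corr_matrix = housing.corr(numeric_only=True)   # numeric_only keeps the text column out (needed on newer pandas)
corr_matrix["median_house_value"].sort_values(ascending=False)

attributes = ["median_house_value", "median_income", "total_rooms", "housing_median_age"]
scatter_matrix(housing[attributes], figsize=(12, 8))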

Takeaways:
● Correlation is noticeable/strong; imagine drawing an upward line along the density.
● Price cap at $500,000 is evident from the horizontal line at that value.
● Other artifacts are evident from lines around 460k, 350k, 280k, 220k, and so on.

Author's suggestion: "You may want to try removing the corresponding districts to prevent your
algorithms from learning to reproduce these data quirks."

Remember, we have control over the train set. This may sound sketchy, but whatever we can do
(with appropriate justification) to make the data more representative and lead to better
generalization in the end is fair game.

Intuitive Feature Engineering:


"try out various attribute combinations. For example, the total number of rooms in a district is not
very useful if you don’t know how many households there are. What you really want is the
number of rooms per household. Similarly, the total number of bedrooms by itself is not very
useful: you probably want to compare it to the number of rooms. And the population per
household also seems like an interesting attribute combination to look at."

With ML problems, this sort of feature engineering can matter much more than model choice.
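The combinations from the quote, roughly as the book computes them:

housing["rooms_per_household"] = housing["total_rooms"] / housing["households"]
housing["bedrooms_per_room"] = housing["total_bedrooms"] / housing["total_rooms"]
housing["population_per_household"] = housing["population"] / housing["households"]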
Takeaways:
● Houses with a lower bedroom/room ratio tend to be more expensive. Think about this for
a moment... larger houses tend to allot a reasonable number of bedrooms, and the
remaining 'room budget' goes to leisure rooms.
● Number of rooms per household is more informative than the total number of rooms
(larger houses tend to be more expensive).

Remember: Data science is an iterative process. Once you get a prototype up and running, you
can analyze the output to gain more insights. Some packages/models may provide you with
statistics like p-values (linear regression implementations like OLS in the statsmodels package),
or some type of feature importance.

4.Prepare the Data for Machine Learning Algorithms


Author's suggestion: Write functions.
● Allows you to reproduce these transformations easily on any dataset.
● Accumulate your own snippets and boilerplate code.
● You can use these functions when deploying your model.
● Easy experimentation.

Remember: Separate the predictors and the labels--you don't necessarily want to apply the same
transformations to the predictors and the target values.
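As the book does it, starting from the stratified training set:

housing = strat_train_set.drop("median_house_value", axis=1)   # predictors only (drop returns a copy)
housing_labels = strat_train_set["median_house_value"].copy()  # labels kept separately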

Data Cleaning
Most Machine Learning algorithms cannot work with missing features.
Problem: 'total_bedrooms' attribute has some missing values. There are a few options...
● Get rid of the corresponding districts (drop rows).
● Get rid of the whole attribute (drop column).
● Set the values to some value (fill or impute).

Note: Usually a good idea to avoid inplace=True.

Scikit-Learn's SimpleImputer object:

"Only the total_bedrooms attribute had missing values, but we cannot be sure that there won’t
be any missing values in new data after the system goes live, so it is safer to apply the imputer
to all the numerical attributes"
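Roughly the book's usage: fit the imputer on the numeric columns only, since a median only makes sense for numbers.

from sklearn.impute import SimpleImputer

imputer = SimpleImputer(strategy="median")
housing_num = housing.drop("ocean_proximity", axis=1)   # numeric columns only
imputer.fit(housing_num)                                # learns each column's median (imputer.statistics_)
X = imputer.transform(housing_num)                      # NumPy array with the missing values filled in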

Aside: Scikit-Learn Design


A key aspect you'll notice while working with sklearn--Consistency!
● Estimators - Any object that can estimate some parameters based on a dataset
○ will always have a fit() method
● Transformers - Estimators (such as an imputer) that can also transform a dataset.
○ will always have transform(), fit_transform() methods
● Predictors - Estimators that are capable of making predictions on a dataset.
○ will always have predict(), score() methods

Handling Text and Categorical Attributes


Only one categorical variable: 'ocean_proximity'.
Check value_counts(), nunique().

Need to encode as a numeric type. Why? "ML algorithms prefer to work with numbers."

Scikit-Learn's OrdinalEncoder class:


Problem: "One issue with this representation is that ML algorithms will assume that two
nearby values are more similar than two distant values."

Scikit-Learn's OneHotEncoder class:
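Both encoders side by side on the single categorical column (the double brackets keep it 2D, which the encoders expect):

from sklearn.preprocessing import OrdinalEncoder, OneHotEncoder

housing_cat = housing[["ocean_proximity"]]

ordinal_encoder = OrdinalEncoder()
housing_cat_ordinal = ordinal_encoder.fit_transform(housing_cat)   # one integer per category

onehot_encoder = OneHotEncoder()
housing_cat_onehot = onehot_encoder.fit_transform(housing_cat)     # sparse matrix, one binary column per category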

Custom Transformers
Not everything is built-in.

Note: 'self' in python is the same thing as 'this' in C++/Java.

"Scikit-Learn relies on duck typing (not inheritance), all you need to do is create a class and
implement three methods: fit() (returning self), transform(), and fit_transform()."

Discussion: Is this really true? The custom class below inherits from BaseEstimator and
TransformerMixin...
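A sketch along the lines of the book's CombinedAttributesAdder; the column indices are assumptions tied to the column order of the numeric housing array:

import numpy as np
from sklearn.base import BaseEstimator, TransformerMixin

rooms_ix, bedrooms_ix, population_ix, households_ix = 3, 4, 5, 6   # assumed column positions

class CombinedAttributesAdder(BaseEstimator, TransformerMixin):
    def __init__(self, add_bedrooms_per_room=True):   # no *args/**kwargs, so get_params() works
        self.add_bedrooms_per_room = add_bedrooms_per_room

    def fit(self, X, y=None):
        return self                                    # nothing to learn; just return self

    def transform(self, X):
        rooms_per_household = X[:, rooms_ix] / X[:, households_ix]
        population_per_household = X[:, population_ix] / X[:, households_ix]
        if self.add_bedrooms_per_room:
            bedrooms_per_room = X[:, bedrooms_ix] / X[:, rooms_ix]
            return np.c_[X, rooms_per_household, population_per_household, bedrooms_per_room]
        return np.c_[X, rooms_per_household, population_per_household]

Answer to the discussion question: the duck-typing claim still holds; the inheritance is a convenience, not a requirement. TransformerMixin supplies fit_transform() for free, and BaseEstimator supplies get_params()/set_params(), which hyperparameter search needs later.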
Feature Scaling
"With few exceptions, Machine Learning algorithms don’t perform well when the input numerical
attributes have very different scales."

https://scikit-learn.org/stable/auto_examples/preprocessing/plot_all_scaling.html#sphx-glr-auto-examples-preprocessing-plot-all-scaling-py

Scaling options:
● MinMaxScaler
● StandardScaler
● RobustScaler
● QuantileTransformer
● PowerTransformer

As with all the transformations, fit to training data only.
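A minimal illustration with StandardScaler, reusing the imputed numeric array X from the imputer example above:

from sklearn.preprocessing import StandardScaler

scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)   # mean/std learned from the training data only
# Later, reuse the SAME fitted scaler on validation/test data: scaler.transform(X_new)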


Transformation Pipelines
Basically objects that you can compose sequentially.

"Pipeline constructor takes a list of name/estimator pairs defining a sequence of steps. All but
the last estimator must be transformers (i.e., they must have a fit_transform() method)."

Simple example:

Better example via ColumnTransformer:

ColumnTransformer - Applies each transformer to the appropriate columns then concatenates.

FeatureUnion - Applies each transformer to the entire data set, then concatenates the results.
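Roughly the book's full pipeline: a numeric sub-pipeline (impute, add combined attributes, scale), then a ColumnTransformer that routes numeric and categorical columns to the right place. It reuses the CombinedAttributesAdder sketched above.

from sklearn.pipeline import Pipeline
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import StandardScaler, OneHotEncoder

num_attribs = list(housing_num)          # names of the numeric columns
cat_attribs = ["ocean_proximity"]

num_pipeline = Pipeline([
    ("imputer", SimpleImputer(strategy="median")),
    ("attribs_adder", CombinedAttributesAdder()),
    ("std_scaler", StandardScaler()),
])

full_pipeline = ColumnTransformer([
    ("num", num_pipeline, num_attribs),
    ("cat", OneHotEncoder(), cat_attribs),
])

housing_prepared = full_pipeline.fit_transform(housing)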

5.Select and Train a Model


Suggestion: Start simple, gradually progress to more complex or computationally expensive
models.

LinearRegression()
Works, but it's pretty bad.
"most districts’median_housing_values range between $120,000 and $265,000, so a typical
prediction error of $68,628 is not very satisfying. "

What's happening here? Underfitting.


● features do not provide enough information to make good predictions
● that the model is not powerful enough.

DecisionTreeRegressor()

Weird! Overfitting. What does this mean? The model has effectively memorized all the training
examples. If you were to feed it data it hasn't seen before, it would most likely crap the bed.

Recall: Bias-Variance Tradeoff


Better Evaluation Using Cross-Validation
Rather than train_test_split(), we could use Scikit-Learn's K-fold cross-validation feature.

Note: Number of folds is somewhat arbitrary; the appropriate value depends on the data.

Scikit-Learn's cross_val_score convenience function:
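Typical usage; scores come back as negative MSE by sklearn convention, hence the sign flip:

import numpy as np
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeRegressor

tree_reg = DecisionTreeRegressor(random_state=42)
scores = cross_val_score(tree_reg, housing_prepared, housing_labels,
                         scoring="neg_mean_squared_error", cv=10)
tree_rmse_scores = np.sqrt(-scores)              # flip the sign, take the root: RMSE per fold
print(tree_rmse_scores.mean(), tree_rmse_scores.std())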


RandomForestRegressor() - Prototypical example of a "bagging" ensemble model.

Author's tip: Saving models via pickle.


● Warning: pickling and similar data serialization routines can introduce vulnerabilities in
your code!
● joblib is supposed to be the replacement for pickle; more efficient for ndarrays.
● Another, less flexible, but more space-efficient option: keep the validation setup fixed, and
for each model store the parameters (in case you need to re-train it) and the out-of-fold
(OOF) predictions, in case you want to use that output down the line for further ensembling
(blending/stacking, weighted averages of models, etc.).
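Saving/loading with joblib, roughly as the book shows ('my_model' stands in for any fitted estimator):

import joblib

joblib.dump(my_model, "my_model.pkl")            # persist the fitted model to disk
# ... later, possibly in another process ...
my_model_loaded = joblib.load("my_model.pkl")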

6.Fine-Tune Your Model

Grid Search
Evaluate all the possible combinations of hyperparameter values.
Key attributes:
● .best_params_
● .best_estimator_
● .cv_results_
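The book's grid search over the random forest, give or take the exact grid:

from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import GridSearchCV

param_grid = [
    {"n_estimators": [3, 10, 30], "max_features": [2, 4, 6, 8]},
    {"bootstrap": [False], "n_estimators": [3, 10], "max_features": [2, 3, 4]},
]

forest_reg = RandomForestRegressor(random_state=42)
grid_search = GridSearchCV(forest_reg, param_grid, cv=5,
                           scoring="neg_mean_squared_error",
                           return_train_score=True)
grid_search.fit(housing_prepared, housing_labels)

grid_search.best_params_      # best hyperparameter combination found
grid_search.best_estimator_   # the model refit on the full training set with those values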

Author's tip: "When you have no idea what value a hyperparameter should have, a simple
approach is to try out consecutive powers of 10." a.k.a. logspace.

Note: Once you become familiar with particular machine learning algorithms, you'll better be
able to tell what parameter values are odd... good to sanity check hyperparameter optimization.

Tip: Setting n_jobs=-1 is handy for many sklearn objects.


● -1 means using all processors.
● Or n_jobs = (number of CPUs) - 1, so the machine doesn't get completely bogged down.

Randomized Search
Preferable when hyperparameter search space is very large.

Main benefits:
● May yield a good configuration of hyperparameters in less time/compute than a really
extensive gridsearch.
● Some marginal control over computing budget via n_iter.
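A sketch with RandomizedSearchCV (the distributions below are illustrative):

from scipy.stats import randint
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import RandomizedSearchCV

param_distribs = {
    "n_estimators": randint(low=1, high=200),
    "max_features": randint(low=1, high=8),
}

forest_reg = RandomForestRegressor(random_state=42)
rnd_search = RandomizedSearchCV(forest_reg, param_distributions=param_distribs,
                                n_iter=10, cv=5, scoring="neg_mean_squared_error",
                                random_state=42)
rnd_search.fit(housing_prepared, housing_labels)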

Other Hip/Trendy Options (via external packages):


● hyperopt
● optuna
Ensemble Methods
"Another way to fine-tune your system is to try to combine the models that
perform best... especially if the individual models make very different types of errors."

Common types of ensembling:


● Bagging or weighted averages
● Boosting
● Stacking/Blending

Analyze the Best Models and Their Errors


grid_search.best_estimator_.feature_importances_
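Pairing the importances with attribute names makes them readable; the attribute list below assumes the pipeline built earlier (numeric columns, then the added combinations, then the one-hot categories):

feature_importances = grid_search.best_estimator_.feature_importances_
extra_attribs = ["rooms_per_household", "population_per_household", "bedrooms_per_room"]
cat_one_hot_attribs = list(full_pipeline.named_transformers_["cat"].categories_[0])
attributes = num_attribs + extra_attribs + cat_one_hot_attribs
sorted(zip(feature_importances, attributes), reverse=True)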

Illustrates the importance of diagnostic models!

Further options for improvement, which basically amount to removing noise:


● Drop bad features
● Remove outliers

Try to find patterns in the errors. If you can spot a pattern to exploit, chances are you can finagle
with the features and hyperparameters so that the model can exploit the pattern too.

"You should also look at the specific errors that your system makes, then try to understand why
it makes them and what could fix the problem (adding extra features or getting rid of
uninformative ones, cleaning up outliers, etc.)."
Evaluate Your System on the Test Set
"Run your full_pipeline to transform the data (call transform(), not fit_transform()—you do not
want to fit the test set!), and evaluate the final model on the test set"
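In code, as the book does it: the already-fitted pipeline only transforms the test set, and the final RMSE comes from the best model.

import numpy as np
from sklearn.metrics import mean_squared_error

final_model = grid_search.best_estimator_

X_test = strat_test_set.drop("median_house_value", axis=1)
y_test = strat_test_set["median_house_value"].copy()

X_test_prepared = full_pipeline.transform(X_test)    # transform(), NOT fit_transform()
final_predictions = final_model.predict(X_test_prepared)
final_rmse = np.sqrt(mean_squared_error(y_test, final_predictions))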

8.Launch, Monitor, and Maintain Your System


Probably better saved for the dedicated chapters with fleshed-out examples.

Hojillions of options of varying complexity; on-prem vs cloud, managed services vs not-so-managed.

The Reality (Andrew Ng's Coursera stuff):

Monica Rogati (https://hackernoon.com/the-ai-hierarchy-of-needs-18f111fcc007)


"Data-Centric AI" sentiment periodically resurfaces--or maybe it never really went anywhere?
(Andrew Ng's Coursera stuff, yet again):
(What happened to... ) 7.Present Your Solution
● What you have learned.
● What worked, what did not.
● What assumptions were made.
● What your system’s limitations are.
● Document everything!!!
● Create nice presentations with clear visualizations and easy-to-remember statements
(e.g., “the median income is the number one predictor of housing prices”).

Conclusion: "In this California housing example, the final performance of the system is not better
than the experts’ price estimates, which were often off by about 20%, but it may still be a good
idea to launch it, especially if this frees up some time for the experts so they can work on
more interesting and productive tasks."

Things that don't work:


● Emailing your tech-illiterate CEO a ghastly, unformatted MS Excel spreadsheet with lots
of numbers.
● Skipping the presentation altogether, appealing to your authority in all things data: "Just
trust me, bro".
● Getting distracted by other fires, subsequently neglecting documentation. Ideally your
notebooks and code should be at least somewhat self-documenting.
