
Ch2: End-to-End Machine Learning Project

Machine Learning project checklist (from Appendix B):


1. Frame the problem and look at the big picture.
2. Get the data.
3. Explore the data to gain insights.
4. Prepare the data to better expose the underlying data patterns to ML algorithms.
5. Explore many different models and shortlist the best ones.
6. Fine-tune your models and combine them into a great solution.
7. Present your solution.
8. Launch, monitor, and maintain your system.

You'll often see slight variations of this checklist. Modify to suit your own situation and needs!

An alternate ML Project checklist (from Andrew Ng's Coursera stuff):

Introduction to our provided data set: California Housing Prices dataset (StatLib repository)
● Based on data from the 1990 California census (hence currently unrealistic).
● Author removed some features and added a categorical feature (more instructive).
● Point of possible confusion: Examples (rows) are not individual houses!!!
Block groups (a.k.a "districts"): smallest geographical unit for which the US Census
Bureau publishes sample data (a block group typically has a population of 600 to 3,000).
1.Frame the Problem

Gather Information:
Talk to people. Hunt them down! This includes not only data stewards, developers responsible
for the systems that created/compiled the data, and project leads or C-suite, but also anyone
who is responsible for downstream components that could be affected.

The more you nail down at the start, the easier it will be to manage risk and expectations.

What is the business objective?


“Knowing the objective is important because it will determine how you frame the problem, which
algorithms you will select, which performance measure you will use to evaluate your model, and
how much effort you will spend tweaking it.”

Proposed Objective: Prediction of a district’s median housing price.

(almost) Universal Objective of Data Science: Turn data into money!

Ancillary provided details:


● Our predictions will be fed to another ML system along with many other signals.
● Median district house price is currently estimated manually by experts via complex rules.
○ Costly and time-consuming.
○ Estimates were off by more than 20%.
● Owners of the downstream system confirmed that they want a numeric value, not a
coarse-grained approximation via category (cheap/medium/expensive).

Determinations:
● We have labels -> supervised learning task.
● We're trying to predict a continuous target -> regression problem (multiple regression).
● We're only predicting one target from each example -> univariate regression.

Recall (from the footnotes of Chapter1): Why is regression called regression?


"Fun fact: this odd-sounding name is a statistics term introduced by Francis Galton while he
was studying the fact that the children of tall people tend to be shorter than their parents.
Since the children were shorter, he called this regression to the mean. This name was then
applied to the methods he used to analyze correlations between variables."
Performance Measure:
Basically, how are we going to evaluate and compare models? How can we tell when we've
satisfied the objective of the project?
Root Mean Square Error (RMSE), a.k.a. the "l2 norm" (of the error vector).

m is the number of instances


x(i) is a vector of all the feature values
y(i) is its label
h(x(i)) is your system’s prediction function, a.k.a "hypothesis function"
ŷ(i) = h(x(i)) is a predicted value for the target/label for that instance (ŷ is pronounced “y-hat”).
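For reference, the formula these symbols plug into (standard RMSE, as given in the book):

RMSE(X, h) = \sqrt{ \frac{1}{m} \sum_{i=1}^{m} \left( h(x^{(i)}) - y^{(i)} \right)^{2} }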
Some notes regarding RMSE, if you're curious:
● The squaring--why?
○ To penalize large errors.
○ One of the main reasons is that it is very easy to differentiate (important for
derivative-based methods such as gradient descent).
● The square root--why?
○ Brings us back to the natural, interpretable units of the problem. Without the
square root, we'd end up talking about squared dollars ($²), whatever the heck that is.
● Why RMSE and not MAE?
○ The higher the norm index, the more it focuses on large values and neglects
small ones. The choice of index 2 is slightly arbitrary.
○ MAE is preferable when you've got plenty of outliers and you know the residuals
are not going to end up having a Normal/Gaussian distribution.

Author's suggestion: Check assumptions!


Aside: Pipelines
A sequence of data processing components is called a data pipeline.

“Components typically run asynchronously... each component pulls in a large amount of data,
processes it, and spits out the result in another data store.”

Is this true?

Synchronous - i.e. sequential, blocking; one task executed at a time; coordinated, or aligned
with a clock or timer; executing task must return before proceeding with next task

Asynchronous - non-blocking; "fire and forget" i.e. call functions and continue doing other stuff,
knowing that those functions will eventually return results on their own time

Python's asyncio (standard library package) provides typical async/await.

The key difference between synchronous and asynchronous processing is in what the processor
does while it waits for an I/O task to complete.

In synchronous execution, the processor remains idle and waits for the I/O task to complete
before executing the next set of instructions.

Asynchronous execution is not necessarily parallel execution; think about making breakfast.

Good example: Preparing breakfast (source?).


Pour a cup of coffee.
Heat up a pan, then fry two eggs.
Fry three slices of bacon.
Toast two pieces of bread.
Add butter and jam to the toast.
Pour a glass of orange juice.
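A minimal asyncio sketch of the breakfast idea (the task names and sleep times are made up for illustration): a single "cook" starts each slow task and switches between them instead of standing idle.

import asyncio

async def fry_eggs():
    await asyncio.sleep(3)      # simulate waiting on the pan (non-blocking)
    return "eggs"

async def fry_bacon():
    await asyncio.sleep(4)
    return "bacon"

async def make_toast():
    await asyncio.sleep(2)
    return "toast"

async def make_breakfast():
    coffee = "coffee"           # quick, synchronous step
    # Start eggs, bacon, and toast concurrently; total wait is ~4s, not 3+4+2.
    results = await asyncio.gather(fry_eggs(), fry_bacon(), make_toast())
    return [coffee, *results]

print(asyncio.run(make_breakfast()))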

2.Get the Data

Setup:
All notebooks, data, extra goodies available at author's github repo:
https://github.com/ageron/handson-ml2

If you just want to run the code/notebooks, tinker around, and not deal with installing a bunch of
stuff, you can just use Google Colab:
If you want to run things on your own machine:
Preferred: Anaconda--the easiest way to get up and running.
Less Preferred: What the author does in the book i.e. venv/virtualenv/whatever (unless you
have a particular reason to use these).
For Cool Kids: Docker!

Aside: The importance of virtual environments--it's all about dependencies.


Official Python Documentation: "A virtual environment is a Python environment such that the
Python interpreter, libraries and scripts installed into it are isolated from those installed in other
virtual environments, and (by default) any libraries installed in a “system” Python, i.e., one which
is installed as part of your operating system."

Why bother?
● Easier to work on different projects while avoiding package version conflicts--different
envs for different types of projects.
● By keeping project dependencies static, or at least isolated, predictable, and explicit, you
ensure that if you revisit your project at a later time, it'll actually run. The python DS/ML
ecosystem evolves at a rapid pace; functions get deprecated, and package APIs change
all the time.
● For ease of sharing and collaboration. If someone wants to run your code, rather than
having to guess which versions of the dependencies you used, they can just recreate
their own version of your environment (from a file, which you ought to have provided).
○ Ex: $ conda env export > environment.yaml
$ conda env create -f <path_to_yaml_file>
● Avoiding headaches.
○ If you bork one env, at least your others are fine.
○ conda 'solving environment'... wait 487593453 hours, or just bail?

DS/ML folks frequently use 'conda' for package/environment management. It comes with
Anaconda/Miniconda distributions, and it works pretty dang well (until it doesn't).
General python developers tend to use 'venv' (part of the python standard library), a 'venv'
extension that tries to fix particular limitations/annoyances ('virtualenv', 'virtualenvwrapper'), or
a 'venv' analog ('pyenv', 'pipenv').

Suggestion: As you build up your various environments over time, try to stick with 'conda' as
long as you can. When the time comes that you can't, just switch to using 'pip'.

Basic example of first time using:


$ conda create -n <env_name> jupyter matplotlib numpy pandas scipy scikit-learn
$ conda activate <env_name>
$ jupyter notebook
# If your default browser hasn't popped up, just manually go to http://localhost:8888/

Helpful:
$ conda init # if you didn't already agree to have the Anaconda installer do this for you
$ conda info # spits out a bunch of version/config info
$ conda env list # shows you available envs and which one is currently active (*)
$ conda list --explicit # lists the packages installed in the currently active env (as exact URLs)

Take a Quick Look at the Data Structure:


The core pandas object is the DataFrame; think of it like an excel sheet on steroids or a table in
a SQL database. Each column in the DataFrame is a Series object, which is a one-dimensional
ndarray with axis labels; think of it as a snazzy array/list.

Most useful methods for DataFrames/Series:


head(), or alternatively sample()
info()

Note: With Pandas, 'object' dtype usually means 'text/string'.

value_counts()

describe() - Very good for a sanity check; look at min/max/mean.
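Quick usage sketch of the above, assuming the housing CSV from the author's repo has been read into a DataFrame named 'housing' (the path is an assumption). In a notebook, the last expression in each cell displays itself.

import pandas as pd

housing = pd.read_csv("datasets/housing/housing.csv")   # path assumed from the repo layout

housing.head()                              # first 5 rows
housing.info()                              # dtypes + non-null counts (spot missing values)
housing["ocean_proximity"].value_counts()   # category frequencies for the lone text column
housing.describe()                          # count/mean/std/min/quartiles/max for numeric columns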

Histograms - Quick & easy way to see an approximation of the distribution of numerical data.
What to look for:
● Lines/cutoffs.
○ Why? Could indicate problems with data collection, corruption, preprocessing like
clipping/winsorization...
● Basic/known distribution types.
○ Why? We have more math and tools available for known distributions. Also,
some model types base their theoretical foundations on assumptions about their input
distributions (e.g., OLS/linear regression). Homoscedasticity?!
● Very pronounced skew.
○ Why? Again, considerations for model assumptions. But also, skew will be one of
the factors involved with decisions about feature scaling and preprocessing.
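A minimal way to get these histograms for every numerical attribute (bin count and figure size are arbitrary choices):

import matplotlib.pyplot as plt

housing.hist(bins=50, figsize=(12, 8))   # one histogram per numeric column
plt.show()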

Takeaways:
● Median income attribute does not look like it is expressed in US dollars.
○ "...the data has been scaled and capped at 15 (actually, 15.0001) for higher
median incomes, and at 0.5 (actually, 0.4999) for lower median incomes. The
numbers represent roughly tens of thousands of dollars (e.g., 3 actually means
about $30,000)."
○ It's an example of a preprocessed feature.
● Housing median age, median house value were also capped. Potentially problematic.
● Attributes have very different scales.
● There are some tail-heavy distributions.

Recall left skew vs right skew: Where is the mean in relation to the median?
Warning: Histograms can be deceiving!
https://towardsdatascience.com/6-reasons-why-you-should-stop-using-histograms-and-which-plot-you-should-use-instead-31f937a0a81c
Create a Test Set:
"your brain is an amazing pattern detection system, which means that it is highly prone to
overfitting: if you look at the test set, you may stumble upon some seemingly interesting pattern
in the test data that leads you to select a particular kind of Machine Learning model"

It's vitally important to avoid "data snooping bias", "data leakage"--anything that would give us
false confidence in our model.

Many different ways to sample and split a dataset into train and test sets...
Some cool visuals:
https://scikit-learn.org/stable/auto_examples/model_selection/plot_cv_indices.html

Most mature data science packages have some mechanism for controlling the randomness used
throughout their code/algorithms, typically by setting a seed for the pseudo-random number
generator that's used under the hood.
Scikit-Learn's train_test_split() function:
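Typical usage, with a fixed seed so the split is reproducible from run to run:

from sklearn.model_selection import train_test_split

train_set, test_set = train_test_split(housing, test_size=0.2, random_state=42)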

Discussion: Anybody actually using hashing, low-level techniques for sampling/splitting, or
anything other than tried-and-tested utility functions?

Stratified Sampling: The population is divided into homogeneous subgroups called strata.
You should not have too many strata, and each stratum should be large enough.
To stratify, we need groups, hence using Pandas' pd.cut() function to do "binning".
'median_income' -> bin into temporary column 'income_cat' -> use 'income_cat' for the purpose
of stratification -> drop 'income_cat'

Scikit-Learn's StratifiedShuffleSplit() object:
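Roughly the book's recipe (the bin edges follow the book; a default integer index on 'housing' is assumed):

import numpy as np
import pandas as pd
from sklearn.model_selection import StratifiedShuffleSplit

# Bin median_income into a temporary category used only for stratification.
housing["income_cat"] = pd.cut(housing["median_income"],
                               bins=[0., 1.5, 3.0, 4.5, 6., np.inf],
                               labels=[1, 2, 3, 4, 5])

split = StratifiedShuffleSplit(n_splits=1, test_size=0.2, random_state=42)
for train_index, test_index in split.split(housing, housing["income_cat"]):
    strat_train_set = housing.loc[train_index]
    strat_test_set = housing.loc[test_index]

# Drop the helper column once the split is done (avoiding inplace=True, per the later note).
strat_train_set = strat_train_set.drop("income_cat", axis=1)
strat_test_set = strat_test_set.drop("income_cat", axis=1)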

"We spent quite a bit of time on test set generation for a good reason: this is an often neglected
but critical part of a Machine Learning project."

We tend to miss out on this when not working on a real problem; Kaggle handles test sets for us,
but in an actual business setting it would be our responsibility to ensure the test set is
representative of the data we want to make predictions on!

Suggestion: Data Dictionary. Probably a good idea, if it hasn't already been done, to accumulate
and document all the basic info you have about the data set.

3.Discover and Visualize the Data to Gain Insights


This part of the process is usually called Exploratory Data Analysis (EDA).

Author's tip: "If the training set is very large, you may want to sample an exploration set, to make
manipulations easy and fast."
Ex: df_for_plotting = df.sample(n=df.shape[0]//10)
Also to mitigate visual clutter!
This is particularly true when playing with visualizations using unsupervised methods like t-SNE,
UMAP, etc.

Author's tip: "Our brains are very good at spotting patterns in pictures, but you may need to play
around with visualization parameters to make the patterns stand out."
-> adjust 'alpha', use various sizes/colors/shapes/'hue'

Location matters! Proximity to the ocean is important, as well as proximity to urban/city centers.

"A clustering algorithm should be useful for detecting the main cluster and for adding new
features that measure the proximity to the cluster centers."

Extra plotting tip: Get them plots bigger!


1) Set a better default plot size, up in the 'imports' section of your notebook:

2) Explicitly create a figure object, then plot:


3) Use 'figsize' arg that's available in many plotting functions (see above CA scatterplot).
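Sketches of the three options (the sizes and scatterplot columns are just examples):

import matplotlib.pyplot as plt

# 1) Better default size for every plot in the notebook.
plt.rcParams["figure.figsize"] = (10, 7)

# 2) Explicitly create a figure/axes object, then plot into it.
fig, ax = plt.subplots(figsize=(10, 7))
housing.plot(kind="scatter", x="longitude", y="latitude", alpha=0.1, ax=ax)

# 3) Pass figsize straight to the plotting call.
housing.plot(kind="scatter", x="longitude", y="latitude", alpha=0.1, figsize=(10, 7))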

Other miscellaneous plotting tips:


● Assign the plot function to a variable, or use ';' to mute plot function object/handle output.
● Try not to be lazy... label axes, enable legends, customize tick labels/format/spacing, etc.
● Anyone else have some tips???

Correlations:
Standard correlation coefficient (also called Pearson’s r) with Pandas corr() method.
Correlation coefficient ranges over [–1, 1]; ~0 means no linear correlation.

Warning: Common correlation coefficients (like Pearson's r) have limitations:


● Correlation coefficient only measures linear correlations.
● May completely miss out on nonlinear relationships.
● Strength of correlation is not related to slope.
Hence the need for scatter plots.
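Both in code: the correlation ranking and a scatter matrix for a handful of promising attributes (the attribute list follows the book):

from pandas.plotting import scatter_matrix

corr_matrix = housing.corr(numeric_only=True)   # numeric_only keeps the text column out (needed on newer pandas)
corr_matrix["median_house_value"].sort_values(ascending=False)

attributes = ["median_house_value", "median_income", "total_rooms", "housing_median_age"]
scatter_matrix(housing[attributes], figsize=(12, 8))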

Takeaways:
● Correlation is noticeable/strong; imagine drawing an upward line along the density.
● Price cap at $500,000 is evident from the horizontal line at that value.
● Other artifacts are evident from lines around 460k, 350k, 280k, 220k, and so on.

Author's suggestion: "You may want to try removing the corresponding districts to prevent your
algorithms from learning to reproduce these data quirks."

Remember, we have control over the train set. This may sound sketchy, but whatever we can do
(with appropriate justification) to make the data more representative and lead to better
generalization in the end is fair game.

Intuitive Feature Engineering:


"try out various attribute combinations. For example, the total number of rooms in a district is not
very useful if you don’t know how many households there are. What you really want is the
number of rooms per household. Similarly, the total number of bedrooms by itself is not very
useful: you probably want to compare it to the number of rooms. And the population per
household also seems like an interesting attribute combination to look at."

With ML problems, this sort of feature engineering can matter much more than model choice.
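The combinations from the quote, roughly as the book computes them:

housing["rooms_per_household"] = housing["total_rooms"] / housing["households"]
housing["bedrooms_per_room"] = housing["total_bedrooms"] / housing["total_rooms"]
housing["population_per_household"] = housing["population"] / housing["households"]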
Takeaways:
● Houses with a lower bedroom/room ratio tend to be more expensive. Think about this for
a moment... larger houses tend to allot a reasonable number of bedrooms, and the
remaining 'room budget' goes to leisure rooms.
● Number of rooms per household is more informative than the total number of rooms
(larger houses tend to be more expensive).

Remember: Data science is an iterative process. Once you get a prototype up and running, you
can analyze the output to gain more insights. Some packages/models may provide you with
statistics like p-values (linear regression implementations like OLS in the statsmodels package),
or some type of feature importance.

4.Prepare the Data for Machine Learning Algorithms


Author's suggestion: Write functions.
● Allows you to reproduce these transformations easily on any dataset.
● Accumulate your own snippets and boilerplate code.
● You can use these functions when deploying your model.
● Easy experimentation.

Remember: Separate the predictors and the labels--you don't necessarily want to apply the same
transformations to the predictors and the target values.
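As the book does it, starting from the stratified training set:

housing = strat_train_set.drop("median_house_value", axis=1)   # predictors only (drop returns a copy)
housing_labels = strat_train_set["median_house_value"].copy()  # labels kept separately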

Data Cleaning
Most Machine Learning algorithms cannot work with missing features.
Problem: 'total_bedrooms' attribute has some missing values. There are a few options...
● Get rid of the corresponding districts (drop rows).
● Get rid of the whole attribute (drop column).
● Set the values to some value (fill or impute).

Note: Usually a good idea to avoid inplace=True.

Scikit-Learn's SimpleImputer object:

"Only the total_bedrooms attribute had missing values, but we cannot be sure that there won’t
be any missing values in new data after the system goes live, so it is safer to apply the imputer
to all the numerical attributes"
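Roughly the book's usage: fit the imputer on the numeric columns only, since a median only makes sense for numbers.

from sklearn.impute import SimpleImputer

imputer = SimpleImputer(strategy="median")
housing_num = housing.drop("ocean_proximity", axis=1)   # numeric columns only
imputer.fit(housing_num)                                # learns each column's median (imputer.statistics_)
X = imputer.transform(housing_num)                      # NumPy array with the missing values filled in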

Aside: Scikit-Learn Design


A key aspect you'll notice while working with sklearn--Consistency!
● Estimators - Any object that can estimate some parameters based on a dataset
○ will always have a fit() method
● Transformers - Estimators (such as an imputer) that can also transform a dataset.
○ will always have transform(), fit_transform() methods
● Predictors - Estimators that are capable of making predictions on a dataset.
○ will always have predict(), score() methods

Handling Text and Categorical Attributes


Only one categorical variable: 'ocean_proximity'.
Check value_counts(), nunique().

Need to encode as a numeric type. Why? "ML algorithms prefer to work with numbers."

Scikit-Learn's OrdinalEncoder class:


Problem: "One issue with this representation is that ML algorithms will assume that two
nearby values are more similar than two distant values."

Scikit-Learn's OneHotEncoder class:
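Both encoders side by side on the single categorical column (the double brackets keep it 2D, which the encoders expect):

from sklearn.preprocessing import OrdinalEncoder, OneHotEncoder

housing_cat = housing[["ocean_proximity"]]

ordinal_encoder = OrdinalEncoder()
housing_cat_ordinal = ordinal_encoder.fit_transform(housing_cat)   # one integer per category

onehot_encoder = OneHotEncoder()
housing_cat_onehot = onehot_encoder.fit_transform(housing_cat)     # sparse matrix, one binary column per category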

Custom Transformers
Not everything is built-in.

Note: 'self' in python is the same thing as 'this' in C++/Java.

"Scikit-Learn relies on duck typing (not inheritance), all you need to do is create a class and
implement three methods: fit() (returning self), transform(), and fit_transform()."

Discussion: Is this really true? The custom class below inherits from BaseEstimator and
TransformerMixin...
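A sketch along the lines of the book's CombinedAttributesAdder; the column indices are assumptions tied to the column order of the numeric housing array:

import numpy as np
from sklearn.base import BaseEstimator, TransformerMixin

rooms_ix, bedrooms_ix, population_ix, households_ix = 3, 4, 5, 6   # assumed column positions

class CombinedAttributesAdder(BaseEstimator, TransformerMixin):
    def __init__(self, add_bedrooms_per_room=True):   # no *args/**kwargs, so get_params() works
        self.add_bedrooms_per_room = add_bedrooms_per_room

    def fit(self, X, y=None):
        return self                                    # nothing to learn; just return self

    def transform(self, X):
        rooms_per_household = X[:, rooms_ix] / X[:, households_ix]
        population_per_household = X[:, population_ix] / X[:, households_ix]
        if self.add_bedrooms_per_room:
            bedrooms_per_room = X[:, bedrooms_ix] / X[:, rooms_ix]
            return np.c_[X, rooms_per_household, population_per_household, bedrooms_per_room]
        return np.c_[X, rooms_per_household, population_per_household]

Answer to the discussion question: the duck-typing claim still holds; the inheritance is a convenience, not a requirement. TransformerMixin supplies fit_transform() for free, and BaseEstimator supplies get_params()/set_params(), which hyperparameter search needs later.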
Feature Scaling
"With few exceptions, Machine Learning algorithms don’t perform well when the input numerical
attributes have very different scales."

https://scikit-learn.org/stable/auto_examples/preprocessing/plot_all_scaling.html#sphx-glr-auto-examples-preprocessing-plot-all-scaling-py

Scaling options:
● MinMaxScaler
● StandardScaler
● RobustScaler
● QuantileTransformer
● PowerTransformer

As with all the transformations, fit to training data only.
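A minimal illustration with StandardScaler, reusing the imputed numeric array X from the imputer example above:

from sklearn.preprocessing import StandardScaler

scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)   # mean/std learned from the training data only
# Later, reuse the SAME fitted scaler on validation/test data: scaler.transform(X_new)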


Transformation Pipelines
Basically objects that you can compose sequentially.

"Pipeline constructor takes a list of name/estimator pairs defining a sequence of steps. All but
the last estimator must be transformers (i.e., they must have a fit_transform() method)."

Simple example:

Better example via ColumnTransformer:

ColumnTransformer - Applies each transformer to the appropriate columns then concatenates.

FeatureUnion - Applies each transformer to the entire data set, then concatenates the results.
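Roughly the book's full pipeline: a numeric sub-pipeline (impute, add combined attributes, scale), then a ColumnTransformer that routes numeric and categorical columns to the right place. It reuses the CombinedAttributesAdder sketched above.

from sklearn.pipeline import Pipeline
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import StandardScaler, OneHotEncoder

num_attribs = list(housing_num)          # names of the numeric columns
cat_attribs = ["ocean_proximity"]

num_pipeline = Pipeline([
    ("imputer", SimpleImputer(strategy="median")),
    ("attribs_adder", CombinedAttributesAdder()),
    ("std_scaler", StandardScaler()),
])

full_pipeline = ColumnTransformer([
    ("num", num_pipeline, num_attribs),
    ("cat", OneHotEncoder(), cat_attribs),
])

housing_prepared = full_pipeline.fit_transform(housing)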

5.Select and Train a Model


Suggestion: Start simple, gradually progress to more complex or computationally expensive
models.

LinearRegression()
Works, but it's pretty bad.
"most districts’median_housing_values range between $120,000 and $265,000, so a typical
prediction error of $68,628 is not very satisfying. "

What's happening here? Underfitting.


● features do not provide enough information to make good predictions
● that the model is not powerful enough.

DecisionTreeRegressor()

Weird! Overfitting. What does this mean? The model has effectively memorized all the training
examples. If you were to feed it data it hasn't seen before, it would most likely crap the bed.

Recall: Bias-Variance Tradeoff


Better Evaluation Using Cross-Validation
Rather than train_test_split(), we could use Scikit-Learn's K-fold cross-validation feature.

Note: Number of folds is somewhat arbitrary; the appropriate value depends on the data.

Scikit-Learn's cross_val_score convenience function:
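Typical usage; scores come back as negative MSE by sklearn convention, hence the sign flip:

import numpy as np
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeRegressor

tree_reg = DecisionTreeRegressor(random_state=42)
scores = cross_val_score(tree_reg, housing_prepared, housing_labels,
                         scoring="neg_mean_squared_error", cv=10)
tree_rmse_scores = np.sqrt(-scores)              # flip the sign, take the root: RMSE per fold
print(tree_rmse_scores.mean(), tree_rmse_scores.std())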


RandomForestRegressor() - Prototypical example of a "bagging" ensemble model.

Author's tip: Saving models via pickle.


● Warning: pickling and similar data serialization routines can introduce vulnerabilities in
your code!
● joblib is supposed to be the replacement for pickle; more efficient for ndarrays.
● Another, less flexible, but more space-efficient option: keep the validation setup fixed, and
for each model store the parameters (in case you need to re-train it) and the out-of-fold
(OOF) predictions, in case you want to use that output down the line for further ensembling
(blending/stacking, weighted averages of models, etc.).
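Saving/loading with joblib, roughly as the book shows ('my_model' stands in for any fitted estimator):

import joblib

joblib.dump(my_model, "my_model.pkl")            # persist the fitted model to disk
# ... later, possibly in another process ...
my_model_loaded = joblib.load("my_model.pkl")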

6.Fine-Tune Your Model

Grid Search
Evaluate all the possible combinations of hyperparameter values.
Key attributes:
● .best_params_
● .best_estimator_
● .cv_results_
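The book's grid search over the random forest, give or take the exact grid:

from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import GridSearchCV

param_grid = [
    {"n_estimators": [3, 10, 30], "max_features": [2, 4, 6, 8]},
    {"bootstrap": [False], "n_estimators": [3, 10], "max_features": [2, 3, 4]},
]

forest_reg = RandomForestRegressor(random_state=42)
grid_search = GridSearchCV(forest_reg, param_grid, cv=5,
                           scoring="neg_mean_squared_error",
                           return_train_score=True)
grid_search.fit(housing_prepared, housing_labels)

grid_search.best_params_      # best hyperparameter combination found
grid_search.best_estimator_   # the model refit on the full training set with those values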

Author's tip: "When you have no idea what value a hyperparameter should have, a simple
approach is to try out consecutive powers of 10." a.k.a. logspace.

Note: Once you become familiar with particular machine learning algorithms, you'll better be
able to tell what parameter values are odd... good to sanity check hyperparameter optimization.

Tip: Setting n_jobs=-1 is handy for many sklearn objects.


● -1 means using all processors.
● Or n_jobs = (number of CPUs) - 1, so the machine doesn't get completely bogged down.

Randomized Search
Preferable when hyperparameter search space is very large.

Main benefits:
● May yield a good configuration of hyperparameters in less time/compute than a really
extensive gridsearch.
● Some marginal control over computing budget via n_iter.
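A sketch with RandomizedSearchCV (the distributions below are illustrative):

from scipy.stats import randint
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import RandomizedSearchCV

param_distribs = {
    "n_estimators": randint(low=1, high=200),
    "max_features": randint(low=1, high=8),
}

forest_reg = RandomForestRegressor(random_state=42)
rnd_search = RandomizedSearchCV(forest_reg, param_distributions=param_distribs,
                                n_iter=10, cv=5, scoring="neg_mean_squared_error",
                                random_state=42)
rnd_search.fit(housing_prepared, housing_labels)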

Other Hip/Trendy Options (via external packages):


● hyperopt
● optuna
Ensemble Methods
"Another way to fine-tune your system is to try to combine the models that
perform best... especially if the individual models make very different types of errors."

Common types of ensembling:


● Bagging or weighted averages
● Boosting
● Stacking/Blending

Analyze the Best Models and Their Errors


grid_search.best_estimator_.feature_importances_
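Pairing the importances with attribute names makes them readable; the attribute list below assumes the pipeline built earlier (numeric columns, then the added combinations, then the one-hot categories):

feature_importances = grid_search.best_estimator_.feature_importances_
extra_attribs = ["rooms_per_household", "population_per_household", "bedrooms_per_room"]
cat_one_hot_attribs = list(full_pipeline.named_transformers_["cat"].categories_[0])
attributes = num_attribs + extra_attribs + cat_one_hot_attribs
sorted(zip(feature_importances, attributes), reverse=True)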

Illustrates the importance of diagnostic models!

Further options for improvement, which basically amount to removing noise:


● Drop bad features
● Remove outliers

Try to find patterns in the errors. If you can spot a pattern to exploit, chances are you can finagle
with the features and hyperparameters so that the model can exploit the pattern too.

"You should also look at the specific errors that your system makes, then try to understand why
it makes them and what could fix the problem (adding extra features or getting rid of
uninformative ones, cleaning up outliers, etc.)."
Evaluate Your System on the Test Set
"Run your full_pipeline to transform the data (call transform(), not fit_transform()—you do not
want to fit the test set!), and evaluate the final model on the test set"
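In code, as the book does it: the already-fitted pipeline only transforms the test set, and the final RMSE comes from the best model.

import numpy as np
from sklearn.metrics import mean_squared_error

final_model = grid_search.best_estimator_

X_test = strat_test_set.drop("median_house_value", axis=1)
y_test = strat_test_set["median_house_value"].copy()

X_test_prepared = full_pipeline.transform(X_test)    # transform(), NOT fit_transform()
final_predictions = final_model.predict(X_test_prepared)
final_rmse = np.sqrt(mean_squared_error(y_test, final_predictions))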

8.Launch, Monitor, and Maintain Your System


Probably better saved for the dedicated chapters with fleshed-out examples.

Hojillions of options of varying complexity; on-prem vs cloud, managed services vs not-so-managed.

The Reality (Andrew Ng's Coursera stuff):

Monica Rogati (https://hackernoon.com/the-ai-hierarchy-of-needs-18f111fcc007)


"Data-Centric AI" sentiment periodically resurfaces--or maybe it never really went anywhere?
(Andrew Ng's Coursera stuff, yet again):
(What happened to... ) 7.Present Your Solution
● What you have learned.
● What worked, what did not.
● What assumptions were made.
● What your system’s limitations are.
● Document everything!!!
● Create nice presentations with clear visualizations and easy-to-remember statements
(e.g., “the median income is the number one predictor of housing prices”).

Conclusion: "In this California housing example, the final performance of the system is not better
than the experts’ price estimates, which were often off by about 20%, but it may still be a good
idea to launch it, especially if this frees up some time for the experts so they can work on
more interesting and productive tasks."

Things that don't work:


● Emailing your tech-illiterate CEO a ghastly, unformatted MS Excel spreadsheet with lots
of numbers.
● Skipping the presentation altogether, appealing to your authority in all things data: "Just
trust me, bro".
● Getting distracted by other fires, subsequently neglecting documentation. Ideally your
notebooks and code should be at least somewhat self-documenting.
