0% found this document useful (0 votes)

92 views20 pages

Module 2

Uploaded by

Shipoto S

We take content rights seriously. If you suspect this is your content, claim it here.

Available Formats

Download as PDF, TXT or read online on Scribd

0% found this document useful (0 votes)

92 views20 pages

Module 2

Uploaded by

Shipoto S

We take content rights seriously. If you suspect this is your content, claim it here.

Available Formats

Download as PDF, TXT or read online on Scribd

You are on page 1/ 20

MODULE 2

Introduction: End-to-End Machine Learning Project

In this chapter, an example project end to end is presented, imagining the scenario of being a
recently hired data scientist in a real estate company. Here are the main steps to go through:

1. Look at the big picture.

2. Get the data.
3. Discover and visualize the data to gain insights.
4. Prepare the data for Machine Learning algorithms.
5. Select a model and train it.
6. Fine-tune your model.
7. Present your solution.
8. Launch, monitor, and maintain your system.

Look at the Big Picture

Welcome to Machine Learning Housing Corporation!
• The first task is to build a model of housing prices in California using the California
census data.
• This data includes metrics such as population, median income, median housing price,
and more for each block group in California.
• Block groups are the smallest geographical units for which the US Census Bureau
publishes sample data, typically having a population of 600 to 3,000 people. These
will be referred to as “districts” for simplicity.
• The model should learn from this data and be able to predict the median housing price
in any district based on all the other metrics

Frame the Problem

Each of questions helps in framing and understanding the machine learning project more
effectively.
1. What exactly is the business objective? - This question aims to clarify the ultimate
goal of the project and how the company expects to benefit from the model.

2. How does the company expect to use and benefit from this model? - This is important
because it will determine how to frame the problem, what algorithms to select, what
performance measure should use to evaluate model, and how much effort should
spend tweaking it.

Here, the model’s output (a prediction of a district’s median housing price) will be fed
to another Machine Learning system (Figure 2-2), along with many other signals. This
downstream system will determine whether it is worth investing in a given area or not.
Getting this right is critical, as it directly affects revenue.

3. What does the current solution look like (if any)? - It will give a reference
performance, as well as insights on how to solve the problem.

With all this information, it is ready to start designing the system

1. Is it supervised, unsupervised, or reinforcement learning? - This is a supervised

learning task because we have labeled training examples where each instance comes
with the expected output, i.e., the district’s median housing price.

2. Is it a classification task, a regression task, or something else? - It is a regression task

because we are asked to predict a continuous value (the median housing price). More
specifically, it is a multiple regression problem since the system will use multiple
features to make a prediction (such as population, median income, etc.). It is also a
univariate regression problem since we are only trying to predict a single value for
each district. If we were trying to predict multiple values per district, it would be a
multivariate regression problem

3. Should you use batch learning or online learning techniques? - Batch learning should
be chosen because there is no continuous flow of new data, no immediate need to
adjust to changing data, and the data is small enough to fit in memory.
Select a Performance Measure

1. Root Mean Square Error (RMSE)

A performance measure for regression problems is the Root Mean Square Error (RMSE). It
gives an idea of how much error the system typically makes in its predictions, with a higher
weight for large errors.

This equation introduces several very common Machine Learning notations

• m is the number of instances in the dataset. For example, if you are evaluating the
RMSE on a validation set of 2,000 districts, then m = 2,000.
• x(i) is a vector of all the feature values (excluding the label) of the ith instance in the
dataset, and y(i) is its label (the desired output value for that instance).
• X is a matrix containing all the feature values (excluding labels) of all instances in the
dataset.
• h is called a hypothesis. When system is given an instance’s feature vector x(i), it
outputs a predicted value ŷ(i) = h(x(i)) for that instance
• RMSE(X,h) is the cost function measured on the set of examples using hypothesis h.

2. Mean Absolute Error (Average Absolute Deviation)

This performance measure is used when there are many outliers.

Both the RMSE and the MAE are ways to measure the distance between two vectors: the
vector of predictions and the vector of target values.
Get the Data

Create the Workspace

• First, ensure Python is installed.

• Next, create a workspace directory for your Machine Learning code and datasets.
• Open a terminal and type the following commands:

• A number of Python modules are needed: Jupyter, NumPy, Pandas, Matplotlib, and
Scikit-Learn.
• The system’s packaging system (e.g., apt-get on Ubuntu, or MacPorts or HomeBrew
on MacOS) can be used. Install a Scientific Python distribution such as Anaconda and
its packaging system or Python’s own packaging system, pip, can be used.
• All the required modules and their dependencies can now be installed using this
simple pip command.

• To check your installation, try to import every module like this:

There should be no output and no error.

• Now you can fire up Jupyter by typing:

A Jupyter server is now running in your terminal, listening to port 8888

• Now create a new Python notebook by clicking on the New button and selecting the
appropriate Python version

Download the Data

• For this project, just download a single compressed file, housing.tgz, which contains a
comma-separated value (CSV) file called housing.csv with all the data.
• A simple method is to use web browser to download it, decompress the file and extract
the CSV file.
• But it is preferable to create a small function / script to download the data because it is
useful in particular if data changes regularly, it can run whenever you need to fetch the
latest data.
Here is the function to fetch the data:

Now when you call fetch_housing_data(), it creates a datasets/housing directory in

workspace, downloads the housing.tgz file, and extracts the housing.csv from it in this
directory.

Now let’s load the data using Pandas.

This function returns a Pandas DataFrame object containing all the data.
Take a Quick Look at the Data Structure

1. head() : Let’s take a look at the top five rows using the DataFrame’s head() method.
Each row represents one district. There are 10 attributes “longitude, latitude,
housing_median_age, total_rooms, total_bed_rooms, population, households,
median_income, median_house_value, and ocean_proximity.”

2. info(): The info() method is useful to get a quick description of the data, in particular
the total number of rows, and each attribute’s type and number of non-null values

3. value_counts(): You can find out what categories exist and how many districts belong
to each category by using the value_counts() method:
4. describe(): The describe() method shows a summary of the numerical attributes

• The count, mean, min, and max rows are self-explanatory. Note that the null values are
ignored (so, for example, count of total_bedrooms is 20,433, not 20,640).
• The std row shows the standard deviation, which measures how dispersed the values
are.
• The 25%, 50%, and 75% rows show the corresponding percentiles: a percentile
indicates the value below which a given percentage of observations in a group of
observations falls.
• For example, 25% of the districts have a housing_median_age lower than 18, while
50% are lower than 29 and 75% are lower than 37. These are often called the 25th
percentile (or 1st quartile), the median, and the 75th percentile (or 3rd quartile).

Create a Test Set

When splitting the data into training and test sets, it's important to ensure that test set remains
consistent across different runs of the program.

The Problem: If the dataset is randomly split into training and test sets each time the
program is run, different test sets will be generated each time. Over time, the model might see
the entire dataset, which defeats the purpose of having a separate test set.

Solution 1: Saving the Test Set

One way to address the issue of different test sets on each run is to save the test set when it is
first created. Then, load this saved test set in future runs. However, this approach has
limitations, especially if there is a need to update the dataset.
Solution 2: Using a Random Seed

Another option is to set the random number generator’s seed (e.g., np.random.seed(42)) so
that it always generates the same shuffled indices.

But both these solutions will break next time you fetch an updated dataset.

A more robust approach is to use each instance's unique identifier to determine whether it
should be in the test set. This way, even if dataset is refreshed, the split remains consistent.

Here it can do:

• Compute a hash of each instance’s identifier.
• Put the instance in the test set if the hash value is below a certain threshold (e.g., 20%
of the maximum hash value).
This method ensures that test set contains approximately 20% of the data and remains
consistent across runs, even when the dataset is updated

Discover and Visualize the Data to Gain Insights

Visualizing Geographical Data

The dataset has geographical information (latitude and longitude), it is a good idea to create a
scatterplot of all districts to visualize the data.
The above plot looks like California, but other than that it is hard to see any particular
pattern. Setting the alpha option to 0.1 makes it much easier to visualize the places where
there is a high density of data points

Now let’s look at the housing prices. The radius of each circle represents the district’s
population (s), and the color represents the price (c). We will use a predefined color map
(cmap) called jet, which ranges from blue (low values) to red (high prices)
Looking for Correlations

To compute the standard correlation coefficient (also called Pearson’s r) between every pair
of attributes use the corr() method:

Now let’s look at how much each attribute correlates with the median house value:

• The correlation coefficient ranges from –1 to 1.

• When it is close to 1, it means that there is a strong positive correlation; for example,
the median house value tends to go up when the median income goes up.
• When the coefficient is close to –1, it means that there is a strong negative correlation;
For example, there is small negative correlation between the latitude and the median
house value (i.e., prices have a slight tendency to go down when you go north).
• When coefficients close to zero mean that there is no linear correlation.

The below figure shows various plots along with the correlation coefficient between their
horizontal and vertical axes.

Figure: Standard correlation coefficient of various datasets

Another way to check for correlation between attributes is to use Pandas’ scatter_matrix
function, which plots every numerical attribute against every other numerical attribute
The main diagonal (top left to bottom right) would be full of straight lines if Pandas plotted
each variable against itself, which would not be very useful. So instead, Pandas displays a
histogram of each attribute

The most promising attribute to predict the median house value is the median income, so let’s
look in on their correlation scatterplot.

Figure: Median income versus median house value

Prepare the Data for Machine Learning Algorithms

Data Cleaning
Start to clean training set. Let’s separate the predictors and the labels since we don’t want to
apply the same transformations to the predictors and the target values.

Missing Features: Most Machine Learning algorithms cannot work with missing features. If
any attribute has some missing values there are three options to handle:
• Get rid of the corresponding attribute.
• Get rid of the whole attribute.
• Set the values to some value (zero, the mean, the median, etc.).

These can be accomplish easily by using DataFrame’s dropna(), drop(), and fillna() methods:

If option 3 is chosen, compute the median value on the training set and use it to fill the
missing values in the training set. Save the computed median value, as it will be needed later
to replace missing values in the test set for system evaluation, and also to handle missing
values in new data once the system goes live.

Scikit-Learn provides a class to take care of missing values: SimpleImputer

Since the median can only be computed on numerical attributes, we need to create a copy of
the data without the text attribute ocean_proximity:

Now, fit the imputer instance to the training data using the fit() method:

The imputer has simply computed the median of each attribute and stored the result in its
statistics_ instance variable.
Now you can use this “trained” imputer to transform the training set by replacing missing
values by the learned medians:

The result is a plain NumPy array containing the transformed features. If you want to put it
back into a Pandas DataFrame, it’s simple:

Handling Text and Categorical Attributes

To convert categories from text to numbers, we can use Scikit-Learn’s OrdinalEncoder

Another way to create one binary attribute per category: one attribute equal to 1 when the
category is “<1H OCEAN” (and 0 otherwise), another attribute equal to 1 when the category
is “INLAND” (and 0 otherwise), and so on. This is called one-hot encoding, because only
one attribute will be equal to 1 (hot), while the others will be 0 (cold). The new attributes are
sometimes called dummy attributes. Scikit-Learn provides a OneHotEn coder class to convert
categorical values into one-hot vectors.

By default, the OneHotEncoder class returns a sparse array, but we can convert it to a dense
array if needed by calling the toarray() method:
Feature Scaling
Machine Learning algorithms don’t perform well when the input numerical attributes have
very different scales.

There are two common ways to get all attributes to have the same scale:
1. Min-max scaling: In min-max scaling (normalization) the values are shifted and
rescaled so that they end up ranging from 0 to 1.

Scikit-Learn provides a transformer called MinMaxScaler for this.

2. Standardization (Z-score Normalization): The value x is subtracting the mean

value, and then it divides by the standard deviation so that the resulting distribution
has unit variance.

Scikit-Learn provides a transformer called StandardScaler for standardization

Transformation Pipelines
• There are many data transformation steps that need to be executed in the right order.
Scikit-Learn provides the Pipeline class to help with such sequences of
transformations.
• Here is a small pipeline for the numerical attributes:
First line imports the necessary classes from the sklearn library. Pipeline is used to create a
sequence of data processing steps.
StandardScaler is used to standardize features by removing the mean and scaling to unit
variance.
This code defines a pipeline named num_pipeline consisting of three steps:

1. 'imputer': Uses SimpleImputer to handle missing values by replacing them with the
median value of the column. This is specified by strategy="median".
2. 'attribs_adder': Uses a custom transformer CombinedAttributesAdder(), which is
assumed to be defined elsewhere. This step adds new attributes to the dataset based on
existing ones.
3. 'std_scaler': Uses StandardScaler to standardize the numerical attributes.
Standardization is the process of rescaling the features so that they have the properties
of a standard normal distribution with a mean of 0 and a standard deviation of 1.

The last line applies the pipeline to the housing_num data. The fit_transform method first fits
the pipeline to the data i.e., it computes the necessary statistics such as median values for
imputation and mean/standard deviation for scaling and then transforms the data according to
the fitted pipeline.

Select and Train a Model

Training and Evaluating on the Training Set

Let’s first train a Linear Regression model.

Let’s try it out on a few instances from the training set:

Let’s measure this regression model’s RMSE on the whole training set using Scikit-Learn’s
mean_squared_error function:

Fine-Tune Your Model

Grid Search
• Scikit-Learn’s GridSearchCV tell which hyperparameters you want it to experiment
with, and what values to try out, and it will evaluate all the possible combinations of
hyperparameter values, using cross-validation.
• For example, the following code searches for the best combination of hyperparameter
values for the RandomForestRegressor:

This param_grid tells Scikit-Learn to first evaluate all 3 × 4 = 12 combinations of

n_estimators and max_features hyperparameter values specified in the first dict, then try all
2 × 3 = 6 combinations of hyperparameter values in the second dict, but this time with the
bootstrap hyperparameter set to False instead of True.
The grid search will explore 12 + 6 = 18 combinations of RandomForestRegressor
hyperparameter values, and it will train each model five times. In other words, there will be
18 × 5 = 90 rounds of training!
Randomized Search

• When the hyperparameter search space is large, it is often preferable to use

RandomizedSearchCV instead.
• It evaluates a given number of random combinations by selecting a random value for
each hyperparameter at every iteration.

Ensemble Methods

Another way to fine-tune your system is to try to combine the models that perform best. The
group (or “ensemble”) will often perform better than the best individual model, especially if
the individual models make very different types of errors.

Analyze the Best Models and Their Errors

The RandomForestRegressor can indicate the relative importance of each attribute for
making accurate predictions.

With this information, you may want to try dropping some of the less useful features. You
should also look at the specific errors that your system makes, then try to understand why it
makes them and what could fix the problem.

Evaluate Your System on the Test Set

Get the predictors and the labels from test set, run your full_pipeline to transform the data,
and evaluate the final model on the test set.
Launch, Monitor, and Maintain Your System

• Production Readiness: Integrate the production input data sources into your system
and write necessary tests to ensure everything functions correctly.
• Performance Monitoring: Develop code to monitor your system’s live performance
regularly and trigger alerts if there is a performance drop, to catch both sudden
breakage and gradual performance degradation.
• Human Evaluation: Implement a pipeline for human analysis of your system’s
predictions, involving field experts or crowdsourcing platforms, to evaluate and
improve system accuracy.
• Input Data Quality Check: Regularly evaluate the quality of the system’s input data
to detect issues early, preventing minor problems from escalating and affecting system
performance.
• Automated Training: Automate the process of training models with fresh data
regularly to maintain consistent performance and save snapshots of the system's state
for easy rollback in online learning systems.
Explanation of Grid Search

GridSearchCV: This is a tool from Scikit-Learn that performs an exhaustive search over
specified parameter values for an estimator. It helps in finding the best combination of
hyperparameters for a given model.

• param_grid: This is a list of dictionaries, where each dictionary defines a set of

hyperparameters to search over.
o n_estimators: This parameter specifies the number of trees in the forest.
o max_features: This parameter specifies the maximum number of features to
consider when looking for the best split.
o The first dictionary searches over different combinations of n_estimators
and max_features with the default setting of bootstrap=True.
o The second dictionary adds an additional setting to search over:
bootstrap=False, with its own combinations of n_estimators and
max_features

forest_reg: This creates an instance of the RandomForestRegressor, which is the

model we want to tune.

• grid_search: This initializes GridSearchCV with several parameters:

o forest_reg: The estimator (model) to be tuned.
o param_grid: The parameter grid defined earlier, specifying the
hyperparameters to search over.
o cv=5: This sets the cross-validation strategy to 5-fold cross-validation. This
means the data will be split into 5 parts, and the model will be trained and
validated 5 times, each time using a different part of the data for validation and
the remaining parts for training.
o scoring='neg_mean_squared_error': This sets the scoring metric to
negative mean squared error. GridSearchCV will use this metric to evaluate
the performance of each combination of hyperparameters. The negative sign is
used because Scikit-Learn expects higher values to be better, but for mean
squared error, lower values are better.
o return_train_score=True: This ensures that the training scores for each
fold and parameter combination are stored in the results

fit: This method trains the GridSearchCV object using the prepared housing data
(housing_prepared) and the corresponding labels (housing_labels)

Data Analytics New Quantum AKTU
No ratings yet
Data Analytics New Quantum AKTU
210 pages
Apache Cassandra Administrator Associate - Exam Practice Tests
From Everand
Apache Cassandra Administrator Associate - Exam Practice Tests
Cristian Scutaru
No ratings yet
On The Development and Practice of AI Technology For Contemporary Popular Music Production
No ratings yet
On The Development and Practice of AI Technology For Contemporary Popular Music Production
15 pages
Trackpad Pro Ver. 5.0 Class 6
From Everand
Trackpad Pro Ver. 5.0 Class 6
Nidhi Arora
No ratings yet
CS178 Homework #1: Problem 0: Getting Connected
No ratings yet
CS178 Homework #1: Problem 0: Getting Connected
4 pages
DataMining Course Handout PDF
No ratings yet
DataMining Course Handout PDF
5 pages
Intro To Data Science Summary
No ratings yet
Intro To Data Science Summary
17 pages
Handling Missing Value
No ratings yet
Handling Missing Value
12 pages
Machine Learning Introduction
100% (1)
Machine Learning Introduction
20 pages
Mining Data Streams
No ratings yet
Mining Data Streams
67 pages
A Guide To Teaching Data Science PDF
No ratings yet
A Guide To Teaching Data Science PDF
26 pages
Mining Data Streams (Part 2)
No ratings yet
Mining Data Streams (Part 2)
56 pages
Unit V Big Data Analytics
No ratings yet
Unit V Big Data Analytics
47 pages
Introduction To Python For Data Science - Syllabus
100% (1)
Introduction To Python For Data Science - Syllabus
5 pages
Introduction To Data Mining: Dr. Dipti Chauhan Assistant Professor SCSIT, SUAS Indore
No ratings yet
Introduction To Data Mining: Dr. Dipti Chauhan Assistant Professor SCSIT, SUAS Indore
16 pages
Data Science Course Content
No ratings yet
Data Science Course Content
8 pages
IICT - Data Science
No ratings yet
IICT - Data Science
22 pages
How To Build An Effective Data Science Portfolio in 2021 - by Harshit Tyagi
No ratings yet
How To Build An Effective Data Science Portfolio in 2021 - by Harshit Tyagi
15 pages
CCchap 2
No ratings yet
CCchap 2
7 pages
Unit-3: Non-Linear Data Structure
No ratings yet
Unit-3: Non-Linear Data Structure
23 pages
Day 5 Supervised Technique-Decision Tree For Classification PDF
100% (1)
Day 5 Supervised Technique-Decision Tree For Classification PDF
58 pages
Data Mart Info
No ratings yet
Data Mart Info
5 pages
Super Study Guide: Data Science Tools: Afshine Amidi and Shervine Amidi August 21, 2020
No ratings yet
Super Study Guide: Data Science Tools: Afshine Amidi and Shervine Amidi August 21, 2020
23 pages
Visualizations in Spreadsheets and Tableau
No ratings yet
Visualizations in Spreadsheets and Tableau
4 pages
Lecture 01 05.08.2024 AI-ML Introduction
No ratings yet
Lecture 01 05.08.2024 AI-ML Introduction
46 pages
2nd Unit - 2.2 - Data Analytics
No ratings yet
2nd Unit - 2.2 - Data Analytics
22 pages
Neural
No ratings yet
Neural
35 pages
XV. Anomaly Detection
0% (1)
XV. Anomaly Detection
4 pages
Data Science Portfolio For Success
No ratings yet
Data Science Portfolio For Success
100 pages
DSML Curriculum Doc - Google Sheets
0% (1)
DSML Curriculum Doc - Google Sheets
12 pages
Barclays Data Engineer Interview Questions
No ratings yet
Barclays Data Engineer Interview Questions
17 pages
Lecture 4 Data Structure Linked List
No ratings yet
Lecture 4 Data Structure Linked List
30 pages
02 DataCategorization
No ratings yet
02 DataCategorization
41 pages
Angular 2 Essentials - Sample Chapter
0% (1)
Angular 2 Essentials - Sample Chapter
39 pages
Data Science Learning Path For 50 Days
No ratings yet
Data Science Learning Path For 50 Days
15 pages
KMBN It01 - Unit 4
No ratings yet
KMBN It01 - Unit 4
19 pages
Introduction To Data Engineering
No ratings yet
Introduction To Data Engineering
8 pages
Lecture Notes: Introduction To Data Science and Big Data
No ratings yet
Lecture Notes: Introduction To Data Science and Big Data
5 pages
Data Science M-1 Notes
No ratings yet
Data Science M-1 Notes
34 pages
BI Mini Project Report
No ratings yet
BI Mini Project Report
11 pages
Question Bank - CSE-DS
No ratings yet
Question Bank - CSE-DS
5 pages
Difference Between Data Science and Machine Learning
No ratings yet
Difference Between Data Science and Machine Learning
5 pages
Lecture-1 Introduction To Data Science
No ratings yet
Lecture-1 Introduction To Data Science
20 pages
PSD02 - Data Science Overview
No ratings yet
PSD02 - Data Science Overview
64 pages
1.2 Introduction To Applied Data Science
No ratings yet
1.2 Introduction To Applied Data Science
47 pages
Basic Electronics Course Humphrey Kimathi - New PDF
No ratings yet
Basic Electronics Course Humphrey Kimathi - New PDF
91 pages
CLOUD COMPUTING Presentation
No ratings yet
CLOUD COMPUTING Presentation
5 pages
Deep Learning and CNNFYTGS5101-Guoyangxie
No ratings yet
Deep Learning and CNNFYTGS5101-Guoyangxie
42 pages
Detailed Curriculum PDF
No ratings yet
Detailed Curriculum PDF
6 pages
Study Notes - Lesson 1 - 7 PDF
No ratings yet
Study Notes - Lesson 1 - 7 PDF
25 pages
Tungban Machine Learning Math Course
No ratings yet
Tungban Machine Learning Math Course
124 pages
TE Sem1 UNIT 1 (Data Science and Visualization) HONOURS - TE (SEM V)
No ratings yet
TE Sem1 UNIT 1 (Data Science and Visualization) HONOURS - TE (SEM V)
28 pages
The AI Hierarchy of Needs
No ratings yet
The AI Hierarchy of Needs
8 pages
Career Plans For Next 2 Years
No ratings yet
Career Plans For Next 2 Years
11 pages
Fuzzy Logic Applications: Bram Heyns
No ratings yet
Fuzzy Logic Applications: Bram Heyns
7 pages
99 Machine Learning Algorithm
No ratings yet
99 Machine Learning Algorithm
7 pages
Unit 4 Basics of Feature Engineering
100% (1)
Unit 4 Basics of Feature Engineering
33 pages
Lifecycle of A Data Science Project
No ratings yet
Lifecycle of A Data Science Project
1 page
Optimizing Hadoop for MapReduce
From Everand
Optimizing Hadoop for MapReduce
Khaled Tannir
No ratings yet
(Excerpts From) Investigating Performance: Design and Outcomes With Xapi
From Everand
(Excerpts From) Investigating Performance: Design and Outcomes With Xapi
Janet Laane Effron
No ratings yet
AppDynamics Third Edition
From Everand
AppDynamics Third Edition
Gerardus Blokdyk
No ratings yet
All It Takes Is A Goal Jon Acuff Download
100% (3)
All It Takes Is A Goal Jon Acuff Download
21 pages
Student Notes
No ratings yet
Student Notes
10 pages
Fundamentals of Microcontroller and Its Application: Unit N0.1
No ratings yet
Fundamentals of Microcontroller and Its Application: Unit N0.1
16 pages
ST2 WhiteBox
No ratings yet
ST2 WhiteBox
48 pages
Asynchronous Communication: SE3020 - Distributed Systems - Async. Communication - Dharshana Kasthurirathna
No ratings yet
Asynchronous Communication: SE3020 - Distributed Systems - Async. Communication - Dharshana Kasthurirathna
52 pages
Lesson - 5.2 - Root Cause Analysis - Improve - Phase
No ratings yet
Lesson - 5.2 - Root Cause Analysis - Improve - Phase
29 pages
Parth Agarwal Docs
No ratings yet
Parth Agarwal Docs
14 pages
AZ-900T0xModule 02core Azure Services
No ratings yet
AZ-900T0xModule 02core Azure Services
43 pages
Pratham CV
No ratings yet
Pratham CV
1 page
Addressing Modes of 8086 MR ARVIND VISHNUBHATLA
No ratings yet
Addressing Modes of 8086 MR ARVIND VISHNUBHATLA
17 pages
Visitor Management System
No ratings yet
Visitor Management System
41 pages
ICONICS Product Suite Getting Started Guide
No ratings yet
ICONICS Product Suite Getting Started Guide
111 pages
Plunet Guide For External Resources
No ratings yet
Plunet Guide For External Resources
18 pages
MT 2 T62 NRMQCM
No ratings yet
MT 2 T62 NRMQCM
33 pages
CC Notes Unit 2
No ratings yet
CC Notes Unit 2
26 pages
Install ONNX Runtime - Onnxruntime
No ratings yet
Install ONNX Runtime - Onnxruntime
8 pages
Ewd Audio Yaris
100% (1)
Ewd Audio Yaris
5 pages
Oracle Fusion Payroll Create Loan Using Reference Code 1718103834
No ratings yet
Oracle Fusion Payroll Create Loan Using Reference Code 1718103834
10 pages
Course File Foect Mca
No ratings yet
Course File Foect Mca
28 pages
BG - JPG Contact - HTM Contact - JPG Daisy - JPG Logo - JPG Owg - Css Text - TXT Vista - JPG Wall - JPG Yearend - CSV Task 1 - Evidence Document
No ratings yet
BG - JPG Contact - HTM Contact - JPG Daisy - JPG Logo - JPG Owg - Css Text - TXT Vista - JPG Wall - JPG Yearend - CSV Task 1 - Evidence Document
4 pages
Duan Et Al. - 2024 - MuxServe Flexible Multiplexing For Efficient Mult
No ratings yet
Duan Et Al. - 2024 - MuxServe Flexible Multiplexing For Efficient Mult
12 pages
Infinet Wireless Recover
No ratings yet
Infinet Wireless Recover
4 pages
Lab 11 Data Structure 16 December
No ratings yet
Lab 11 Data Structure 16 December
6 pages
CSC 1 Questions
No ratings yet
CSC 1 Questions
3 pages
Lecture Note 1
No ratings yet
Lecture Note 1
7 pages
Log
No ratings yet
Log
14 pages
Os Pyq CST 206
No ratings yet
Os Pyq CST 206
22 pages
Revit-MEP-FIRE FIGHTING
100% (2)
Revit-MEP-FIRE FIGHTING
68 pages
Ms. Priyanshu Singh
No ratings yet
Ms. Priyanshu Singh
2 pages

Module 2

Uploaded by

Module 2

Uploaded by

MODULE 2

Introduction: End-to-End Machine Learning Project

1. Look at the big picture.

Look at the Big Picture

Frame the Problem

With all this information, it is ready to start designing the system

1. Is it supervised, unsupervised, or reinforcement learning? - This is a supervised

2. Is it a classification task, a regression task, or something else? - It is a regression task

1. Root Mean Square Error (RMSE)

This equation introduces several very common Machine Learning notations

2. Mean Absolute Error (Average Absolute Deviation)

This performance measure is used when there are many outliers.

Create the Workspace

• First, ensure Python is installed.

• To check your installation, try to import every module like this:

There should be no output and no error.

• Now you can fire up Jupyter by typing:

A Jupyter server is now running in your terminal, listening to port 8888

Download the Data

Now when you call fetch_housing_data(), it creates a datasets/housing directory in

Now let’s load the data using Pandas.

Create a Test Set

Solution 1: Saving the Test Set

Here it can do:

Discover and Visualize the Data to Gain Insights

Visualizing Geographical Data

• The correlation coefficient ranges from –1 to 1.

Figure: Standard correlation coefficient of various datasets

Figure: Median income versus median house value

Scikit-Learn provides a class to take care of missing values: SimpleImputer

Handling Text and Categorical Attributes

To convert categories from text to numbers, we can use Scikit-Learn’s OrdinalEncoder

Scikit-Learn provides a transformer called MinMaxScaler for this.

2. Standardization (Z-score Normalization): The value x is subtracting the mean

Scikit-Learn provides a transformer called StandardScaler for standardization

Select and Train a Model

Training and Evaluating on the Training Set

Let’s first train a Linear Regression model.

Let’s try it out on a few instances from the training set:

Fine-Tune Your Model

This param_grid tells Scikit-Learn to first evaluate all 3 × 4 = 12 combinations of

• When the hyperparameter search space is large, it is often preferable to use

Analyze the Best Models and Their Errors

Evaluate Your System on the Test Set

• param_grid: This is a list of dictionaries, where each dictionary defines a set of

forest_reg: This creates an instance of the RandomForestRegressor, which is the

• grid_search: This initializes GridSearchCV with several parameters:

You might also like