Module 2

Anusha J
Assistant Professor
Dept of AIML
JSSATEB
End-to-End Machine Learning Project:
• In this chapter, you will go through an example project end to end, pretending to be a
recently hired data scientist in a real estate company. Here are the main steps you will go
through:
1. Look at the big picture.
2. Get the data.
3. Discover and visualize the data to gain insights.
4. Prepare the data for Machine Learning algorithms.
5. Select a model and train it.
6. Fine-tune your model.
7. Present your solution.
8. Launch, monitor, and maintain your system.
Working with Real Data

• When you are learning about machine learning, it is best to
actually experiment with real-world
data, not just artificial datasets.
• There are thousands of open datasets to choose from,
ranging across all sorts of domains.
• This chapter uses the California Housing Prices dataset
from the StatLib repository. This
dataset is based on data from
the 1990 California census.
Cont..
• Popular open data repositories:
– UC Irvine Machine Learning Repository
– Kaggle datasets
– Amazon’s AWS datasets
• Meta portals (they list open data repositories):
– http://dataportals.org/
– http://opendatamonitor.eu/
– http://quandl.com/
• Other pages listing many popular open data repositories:
– Wikipedia’s list of Machine Learning datasets
– Quora.com question
– Datasets subreddit
Project
• Task:
– Build a model to predict median housing price in districts of
California.
– The model will be used by another system to decide on
investments in those areas.
• Data:
– Data source: California census data
– This data includes features like population, median income,
and median housing price for each district.
Cont..
• Model Type:
– This is a supervised learning task because we have labeled data
(districts with known median housing prices).
– It's a regression task because we're predicting a continuous value
(housing price).
– More specifically, it's a multiple regression problem since the prediction
considers multiple features (population, income, etc.) and a univariate
regression problem because we're predicting a single value (housing
price) per district.
– Batch learning is suitable here because the data size is manageable
and there's no need for real-time updates.
Cont..
• Performance Measure:
– Root Mean Squared Error (RMSE) is chosen to measure the
model's performance.
– RMSE is the square root of the mean of the squared differences
between predicted and actual values, so it gives more weight to larger errors.
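As a minimal sketch of how RMSE behaves, the snippet below computes it for two sets of predictions with the same total absolute error. The helper name rmse and all the prices are hypothetical values chosen for illustration, not figures from the census data.

```python
import math

def rmse(predictions, labels):
    """Root Mean Squared Error between two equal-length sequences."""
    if len(predictions) != len(labels):
        raise ValueError("predictions and labels must have the same length")
    squared_errors = [(p - y) ** 2 for p, y in zip(predictions, labels)]
    return math.sqrt(sum(squared_errors) / len(squared_errors))

# Hypothetical median housing prices (in dollars) for three districts.
actual = [200_000, 150_000, 300_000]

# Both prediction sets have a total absolute error of 30,000.
predicted_small_errors = [210_000, 140_000, 310_000]   # each off by 10,000
predicted_one_big_error = [200_000, 150_000, 330_000]  # one off by 30,000

rmse_small = rmse(predicted_small_errors, actual)
rmse_big = rmse(predicted_one_big_error, actual)

print(f"three small errors: {rmse_small:.0f}")
print(f"one large error:    {rmse_big:.0f}")
```

The single large error produces the higher RMSE even though the total absolute error is the same, which is the sense in which RMSE penalizes large errors more heavily.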
Look at the Big Picture
• Company and Task: You are working at Machine Learning Housing
Corporation. Your first task is to build a model to predict housing prices in
California.
• Data Source: The model will be trained using California census data.
• Data Details: This data includes various metrics for each district (block
group), such as population, median income, and most importantly, the
median housing price you're trying to predict.
• Data Unit: The data refers to block groups, the smallest geographical unit for which
census data is published (typically 600-3,000 people). These will be called "districts" for
convenience.
• Model Goal: The model should learn from the data and use that knowledge
to predict the median housing price in any district, considering the other
available metrics (population, income, etc.).
Frame the Problem

Question:
• What exactly is the business objective? Building a model is
probably not the end goal. How does the company
expect to use and benefit from this model?
• This is important because it will determine how you frame
the problem, what algorithms you will select, what
performance measure you will use to evaluate your
model, and how much effort you should spend tweaking it.
• Why Understanding the Business Objective is Crucial
– The passage highlights that building a model is just a means to an end. You need
to understand the company's business objective for using the model.
– This objective will influence various aspects of your project, such as:
– Problem Framing: How you define the problem (supervised learning, regression, etc.)
depends on the desired outcome.
– Algorithm Selection: The best algorithms for your model depend on the problem type.
– Performance Measure: You'll choose a metric (like RMSE) that reflects how well the
model meets the business goal.
– Development Effort: The amount of time spent refining the model depends on its
impact on the business.
The Business Objective in this Scenario
• The company's goal is to decide on real estate
investments.
• The output of your housing price prediction model will be fed into
another Machine Learning system. This system likely
analyzes various factors (signals) beyond housing price.
• By predicting housing prices accurately, the overall
system can make better investment decisions, directly
impacting the company's revenue.
Frame the problem.
Current Solution and Expected Improvement
• Currently, housing prices are estimated manually by
experts using complex rules. This is expensive, time-
consuming, and inaccurate (estimates can be off by more
than 20%).
• The company expects a Machine Learning model to be
more efficient and accurate in predicting housing prices.
Understanding the Data Pipeline (Optional
Background)
• The passage also briefly introduces data pipelines, which
are common in Machine Learning.
• They involve a sequence of components that process and
transform data.
• These components can operate independently and
communicate through data storage.
• This modular design makes the system scalable and
easier to manage.
Pipelines
– A data pipeline is a series of connected components that process and
transform data.
– Imagine an assembly line in a factory, but instead of physical objects, the data
flows through different stations that perform specific tasks.

• Why are data pipelines important in Machine Learning?

– Machine learning often deals with vast datasets that require cleaning,
manipulation, and transformation before they can be used to train models.
– Data pipelines automate these processes, making them more efficient and
reliable.
How do data pipelines work?
– Each component in the pipeline acts independently.
– A component retrieves a chunk of data, processes it (e.g.,
cleaning, filtering, transforming), and stores the results in a
designated location (data store).
– The next component in the pipeline retrieves the processed
data from the store, performs its own operations, and stores its
output in another data store.
– This continues until all the data has been processed through
the entire pipeline.
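The flow described above can be sketched in a few lines. This is only an illustration under stated assumptions: a plain dict stands in for the data stores, the component names clean and transform are hypothetical, and the rows are made-up values (median income is assumed to be expressed in tens of thousands of dollars).

```python
# A dict stands in for the data stores between components (an assumption;
# a real pipeline might use files, a database, or a message queue).
data_store = {
    "raw": [
        {"district": "A", "population": 1200, "median_income": 3.5},
        {"district": "B", "population": None, "median_income": 4.1},
        {"district": "C", "population": 800, "median_income": 2.0},
    ]
}

def clean(store):
    """Component 1: read raw rows, drop incomplete ones, store the result."""
    rows = store["raw"]
    store["cleaned"] = [
        row for row in rows if all(value is not None for value in row.values())
    ]

def transform(store):
    """Component 2: read cleaned rows, add a derived column, store the result."""
    store["features"] = [
        {**row, "median_income_usd": row["median_income"] * 10_000}
        for row in store["cleaned"]
    ]

# Each component only reads from and writes to the store, so the two
# functions never call each other directly.
for component in (clean, transform):
    component(data_store)

print(data_store["features"])
```

Because the components share nothing except the store, either one could be replaced or rerun without touching the other.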
Benefits and challenges of data pipelines:
• Benefits of data pipelines:
• Modular design: Each component is self-contained, making the system
easier to understand, maintain, and scale. Different teams can work on
different parts of the pipeline.
• Robustness: If one component fails, others can often continue functioning
by using the last available processed data. This allows for easier
troubleshooting and minimizes downtime.
• Challenges of data pipelines:
• Monitoring: It's crucial to monitor the pipelines to ensure data isn't getting
stuck or corrupted somewhere. Unidentified issues can lead to stale or
inaccurate data, impacting the overall system's performance.
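One simple form of monitoring is a freshness check on each data store. The sketch below is a hypothetical example: the function is_stale, the store names, the timestamps, and the six-hour threshold are all assumptions made for illustration.

```python
from datetime import datetime, timedelta

def is_stale(last_updated, now, max_age):
    """Return True when a data store has not been refreshed within max_age."""
    return now - last_updated > max_age

# Hypothetical timestamps of the last successful write to each data store.
now = datetime(2024, 1, 2, 12, 0)
last_written = {
    "cleaned": datetime(2024, 1, 2, 11, 30),   # refreshed 30 minutes ago
    "features": datetime(2024, 1, 1, 9, 0),    # refreshed 27 hours ago
}

stale_stores = [
    name
    for name, timestamp in last_written.items()
    if is_stale(timestamp, now, max_age=timedelta(hours=6))
]

print("stale data stores:", stale_stores)
```

A downstream component can keep running on the last available data, but a check like this makes the stale store visible before it degrades the overall system.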
Why Consider Existing Solutions?
• Why Consider Existing Solutions?
– Examining the current approach can provide valuable insights for building your
machine learning model.
– It can offer a baseline for performance comparison and suggest ways to improve
upon the existing solution.
• Current Solution for Housing Price Estimation
– In this scenario, housing prices are currently estimated manually by a team of
experts.
– This process involves gathering data about each district and using complex rules to
estimate the median housing price if the actual value is unavailable.
• Challenges of the Manual Approach
– This method is expensive and time-consuming.
– The accuracy of the estimates is poor, often deviating from the actual price by more
than 20%.
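To use the manual process as a baseline, you can measure how often its estimates miss by more than 20%. The following is a rough sketch with made-up numbers; relative_error is a hypothetical helper, and the estimates and prices are not real data.

```python
def relative_error(estimate, actual):
    """Absolute error expressed as a fraction of the actual value."""
    return abs(estimate - actual) / actual

# Hypothetical expert estimates and actual median prices for four districts.
expert_estimates = [250_000, 120_000, 310_000, 180_000]
actual_prices = [200_000, 125_000, 300_000, 240_000]

errors = [
    relative_error(estimate, actual)
    for estimate, actual in zip(expert_estimates, actual_prices)
]

# Share of districts where the estimate is off by more than 20%.
share_off_by_20pct = sum(error > 0.20 for error in errors) / len(errors)

print(f"share of estimates off by more than 20%: {share_off_by_20pct:.0%}")
```

The same calculation applied to a model's predictions gives a like-for-like comparison with the existing solution.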
Justification for the Machine Learning Model
• The company aims to replace the manual process with a
machine learning model to achieve:
– Increased Efficiency: The model can automate price
prediction, saving time and resources.
– Improved Accuracy: The model is expected to be more
accurate than manual estimates.
Data Availability for Machine Learning
– The California census data appears to be a suitable source for
training the model.
– This data includes both the median housing prices (desired
output) and other relevant features (population, income, etc.) for
many districts.
Next Steps: Defining the Machine Learning Problem
– With this understanding of the current solution and available
data, you can now define the machine learning problem precisely.
– The passage asks you to consider the type of learning
(supervised, unsupervised, etc.) and the specific task
(classification, regression, etc.) based on the information provided.
Cont..
• By understanding the limitations of the current approach, you can
design a machine learning model that addresses those
shortcomings and leverages the available data to deliver better
results.
Framing the Machine Learning Problem
• Before designing the system (your machine learning model), you need to
clearly define the problem it's trying to solve. This involves specifying several
aspects:
• Learning Type: This scenario involves supervised learning because you
have labeled data. Each data point (district) has features (population,
income, etc.) and a corresponding label (median housing price). The model
learns the relationship between these features and labels to predict future
housing prices.
• Task Type: This is a regression task because you're predicting a continuous
value (housing price) as opposed to classifying something into categories.
Cont..
Regression Specifics:
– Multiple Regression: This problem is a multiple regression because the model will
use several features (population, income, etc.) to predict a single value (housing
price).
– Univariate Regression: It's also a univariate regression because you're only
predicting one value (housing price) per district. If you were predicting multiple
values like housing price and average commute time, it would be multivariate
regression.
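The difference between "multiple" and "univariate" can be seen in the shape of a prediction function: several inputs, one output. The sketch below assumes a simple linear model with hypothetical weights; in practice the parameters would be learned from the training data.

```python
def predict_price(features, weights, bias):
    """Multiple regression (several features), univariate output (one price)."""
    return bias + sum(w * x for w, x in zip(weights, features))

# Hypothetical parameters, not learned from real data.
weights = [10.0, 40_000.0]   # dollars per person, dollars per unit of median income
bias = 50_000.0

# One district described by two features: [population, median income].
district = [1_000, 3.0]
price = predict_price(district, weights, bias)

print(f"predicted median housing price: {price:,.0f}")
```

A multivariate version would return more than one value per district, for example a price and an average commute time.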
Learning Mode: This scenario is suitable for batch learning because:
– The dataset (California census data) is likely small enough to handle all at once.
– There's no need for real-time updates on housing prices. The model can be trained
using the entire dataset and then used for predictions.
Selecting a Performance Measure
