0% found this document useful (0 votes)
20 views

Module 2

Uploaded by

anushaj
Copyright
© © All Rights Reserved
Available Formats
Download as PPTX, PDF, TXT or read online on Scribd
0% found this document useful (0 votes)
20 views

Module 2

Uploaded by

anushaj
Copyright
© © All Rights Reserved
Available Formats
Download as PPTX, PDF, TXT or read online on Scribd
You are on page 1/ 24

Module 2

Anusha J
Assistant Professor
Dept of AIML
JSSATEB
End to end Machine learning Project:
• In this chapter, you will go through an example project end to end, pretending to be a
recently hired data scientist in a real estate company.1 Here are the main steps you will go
through:
1. Look at the big picture.
2. Get the data.
3. Discover and visualize the data to gain insights.
4. Prepare the data for Machine Learning algorithms.
5. Select a model and train it.
6. Fine-tune your model.
7. Present your solution.
8. Launch, monitor, and maintain your system
Working with Real Data

• When you are learning about


machine learning it is best to
actually experiment with real-world
data, not just artitificial datasets.
• Thousands of open datasets
ranging across all sorts of domains.
• California Housing Prices dataset
from the StatLib repository.This
dataset was based on data from
the 1990 California census.
Cont..
• Popular open data repositories:
– UC Irvine Machine Learning Repository
– Kaggle datasets
– Amazon’s AWS datasets
• Meta portals (they list open data repositories):
– http://dataportals.org/
– http://opendatamonitor.eu/
– http://quandl.com/
• Other pages listing many popular open data repositories:
– Wikipedia’s list of Machine Learning datasets
– Quora.com question
– Datasets subreddit
Project
• Task:
– Build a model to predict median housing price in districts of
California.
– The model will be used by another system to decide on
investments in those areas.
• Data:
– Data source: California census data
– This data includes features like population, median income,
and median housing price for each district.
Cont..
• Model Type:
– This is a supervised learning task because we have labeled data
(districts with known median housing prices).
– It's a regression task because we're predicting a continuous value
(housing price).
– More specifically, it's a multiple regression problem since the prediction
considers multiple features (population, income, etc.) and a univariate
regression problem because we're predicting a single value (housing
price) per district.
– Batch learning is suitable here because the data size is manageable
and there's no need for real-time updates.
Cont..
• Performance Measure:
– Root Mean Squared Error (RMSE) is chosen to measure the
model's performance.
– RMSE considers the squared difference between predicted
and actual values, giving higher weightage to larger errors.
Look at the Big Picture
• Company and Task: You are working at Machine Learning Housing
Corporation. Your first task is to build a model to predict housing prices in
California.
• Data Source: The model will be trained using California census data.
• Data Details: This data includes various metrics for each district (block
group), such as population, median income, and most importantly, the
median housing price you're trying to predict.
• Data Unit: The data refers to block groups, the smallest unit with census
data (typically 600-3,000 people). These will be called "districts" for
convenience.
• Model Goal: The model should learn from the data and use that knowledge
to predict the median housing price in any district, considering the other
available metrics (population, income, etc.).
Frame the Problem

Question:
• What exactly is the business objective; building a model is
probabaly not the end goal. How does the company
expect t use and beneift from this model??
• This is important because it will determine how you frame
the problem, what algorithms you will select, what
performance measure you will use to evaluate your
model, and hpw much effort you should spend tweaking it.
• Why Understanding the Business Objective is Crucial
– The passage highlights that building a model is just a means to an end. You need
to understand the company's business objective for using the model.
– This objective will influence various aspects of your project, such as:
– Problem Framing: How you define the problem (supervised learning, regression, etc.)
depends on the desired outcome.
– Algorithm Selection: The best algorithms for your model depend on the problem type.
– Performance Measure: You'll choose a metric (like RMSE) that reflects how well the
model meets the business goal.
– Development Effort: The amount of time spent refining the model depends on its
impact on the business.
The Business Objective in this Scenario
• The company's goal is to decide on real estate
investments.
• Your housing price prediction model will be fed into
another Machine Learning system . This system likely
analyzes various factors (signals) beyond housing price.
• By predicting housing prices accurately, the overall
system can make better investment decisions, directly
impacting the company's revenue.
Frame the problem.
Current Solution and Expected Improvement
• Currently, housing prices are estimated manually by
experts using complex rules. This is expensive, time-
consuming, and inaccurate (estimates can be off by more
than 20%).
• The company expects a Machine Learning model to be
more efficient and accurate in predicting housing prices.
Understanding the Data Pipeline (Optional
Background)
• The passage also briefly introduces data pipelines, which
are common in Machine Learning.
• They involve a sequence of components that process and
transform data.
• These components can operate independently and
communicate through data storage.
• This modular design makes the system scalable and
easier to manage.
Pipelines
– A data pipeline is a series of connected components that process and
transform data.
– Imagine an assembly line in a factory, but instead of physical objects, the data
flows through different stations that perform specific tasks.

• Why are data pipelines important in Machine Learning?


– Machine learning often deals with vast datasets that require cleaning,
manipulation, and transformation before they can be used to train models.
– Data pipelines automate these processes, making them more efficient and
reliable.
How do data pipelines work?
– Each component in the pipeline acts independently.
– A component retrieves a chunk of data, processes it (e.g.,
cleaning, filtering, transforming), and stores the results in a
designated location (data store).
– The next component in the pipeline retrieves the processed
data from the store, performs its own operations, and stores its
output in another data store.
– This continues until all the data has been processed through
the entire pipeline.
Benefits and challenges of data pipelines:
• Benefits of data pipelines:
• Modular design: Each component is self-contained, making the system
easier to understand, maintain, and scale. Different teams can work on
different parts of the pipeline.
• Robustness: If one component fails, others can often continue functioning
by using the last available processed data. This allows for easier
troubleshooting and minimizes downtime.
• Challenges of data pipelines:
• Monitoring: It's crucial to monitor the pipelines to ensure data isn't getting
stuck or corrupted somewhere. Unidentified issues can lead to stale or
inaccurate data, impacting the overall system's performance.
Why Consider Existing Solutions?
• Why Consider Existing Solutions?
– Examining the current approach can provide valuable insights for building your
machine learning model.
– It can offer a baseline for performance comparison and suggest ways to improve
upon the existing solution.
• Current Solution for Housing Price Estimation
– In this scenario, housing prices are currently estimated manually by a team of
experts.
– This process involves gathering data about each district and using complex rules to
estimate the median housing price if the actual value is unavailable.
• Challenges of the Manual Approach
– This method is expensive and time-consuming.
– The accuracy of the estimates is poor, often deviating from the actual price by more
than 20%.
Justification for the Machine Learning Model
• The company aims to replace the manual process with a
machine learning model to achieve:
– Increased Efficiency: The model can automate price
prediction, saving time and resources.
– Improved Accuracy: The model is expected to be more
accurate than manual estimates.
Data Availability for Machine Learning
– The California census data appears to be a suitable source for
training the model.
– This data includes both the median housing prices (desired
output) and other relevant features (population, income, etc.) for
many districts.
Next Steps: Defining the Machine Learning Problem
– With this understanding of the current solution and available
data, you can now define the machine learning problem precisely.
– The passage asks you to consider the type of learning
(supervised, unsupervised, etc.) and the specific task
classification, regression, etc.) based on the information provided.
Cont..
• By understanding the limitations of the current approach, you can
design a machine learning model that addresses those
shortcomings and leverages the available data to deliver better
results.
Framing the Machine Learning Problem
• Before designing the system (your machine learning model), you need to
clearly define the problem it's trying to solve. This involves specifying several
aspects:
• Learning Type: This scenario involves supervised learning because you
have labeled data. Each data point (district) has features (population,
income, etc.) and a corresponding label (median housing price). The model
learns the relationship between these features and labels to predict future
housing prices.
• Task Type: This is a regression task because you're predicting a continuous
value (housing price) as opposed to classifying something into categories.
Cont..
Regression Specifics:
– Multiple Regression: This problem is a multiple regression because the model will
use several features (population, income, etc.) to predict a single value (housing
price).
– Univariate Regression: It's also a univariate regression because you're only
predicting one value (housing price) per district. If you were predicting multiple
values like housing price and average commute time, it would be multivariate
regression.
Learning Mode: This scenario is suitable for batch learning because:
– The data size (California census data) is likely manageable to handle all at once.
– There's no need for real-time updates on housing prices. The model can be trained
using the entire dataset and then used for predictions.
Selecting a Performance Measure

You might also like