Introduction to Artificial Intelligence
Lesson 3: Machine Learning Workflow
© Simplilearn. All rights reserved.
Learning Objectives
Describe the seven steps of the machine learning workflow
Machine Learning Workflow
Topic 1: Machine Learning Process
The Machine Learning Workflow
All stakeholders, technical and non-technical, need to understand the machine learning
workflow in order to appreciate:
• The job of a data scientist
• The processes a data scientist follows to provide feedback to decision-makers
• The machine learning process in a business environment
The Machine Learning Process
[Flowchart: seven steps connected by Yes/No decision branches]
1. Get more data
2. Ask a sharp question
3. Add the data to the table
4. Check for quality
5. Transform features
6. Answer the question
7. Use the answer
Step 1: Get More Data
• The data collected is used to investigate a business challenge.
• The quality of the predictive model depends upon the quantity and quality of the data gathered.
• The data can be collected in different formats.
Data Format Examples
Names:
• Type: Sonia Tran
• Variety: Caramel latte
• Id: Air Force One
• Model number: R2-D2
• Category: Chocolate
• Text: Best. Show. Ever.
Numbers:
• Money: $300m
• Count: 69 pizzas
• Pixel brightness: 232/255
• Temperature: 30 degrees F
• Sound intensity: 0.64
Names that look like numbers:
• Zip code: 95126
• Social security number: 602-47-1899
• Serial number: 100000023987
• Credit card number: 5467-3345-2122-5508
Names that can be turned into numbers:
• Place: First, second, third
• Time zone: Pacific, mountain, central, eastern
• Train stops: Diridon, San Francisco, Sunnyvale, Menlo Park
• Side: Left, middle, right
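The distinction between these kinds of data can be sketched in Python. This is a minimal illustration, not part of the lesson: the variable names, the PLACE_ORDER mapping, and the encode_place helper are all hypothetical.

```python
# Hypothetical values illustrating the kinds of data listed above.
zip_code = "95126"   # a name that looks like a number: keep it as text
count = 69           # a true number: arithmetic on it is meaningful
place = "second"     # a name that can be turned into a number

# Ordered categories map naturally onto integers (assumed ordering).
PLACE_ORDER = {"first": 1, "second": 2, "third": 3}


def encode_place(value: str) -> int:
    """Turn an ordered category into its rank."""
    return PLACE_ORDER[value.lower()]


place_rank = encode_place(place)

# Averaging zip codes would be meaningless, so the zip code stays a string.
zip_is_text = isinstance(zip_code, str)
```

Storing identifiers such as zip codes as text is a design choice that prevents accidental arithmetic on values that only look numeric.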
Goals of Machine Learning Workflow
The goals of data science and machine learning are to:
• Derive answers to business challenges
• Derive meaningful conclusions from complicated issues
• Identify actionable steps given a wide set of variables
To be able to do so, you need to ask a sharp question as opposed to a vague one.
Step 2: Ask a Sharp Question
Need for a sharp question:
• It helps you get clear answers to the questions.
• It is direct and specific.
• It focuses on a single topic.
• It focuses on the exact need and requirement.
Vague vs. Sharp Questions
Vague questions:
1. What should you do?
2. How should you live your life?
3. Which career path should you take?
4. What can the data tell you about your business?
Sharp questions:
1. Which route will get you to work fastest?
2. How many times will a user use the new product features?
3. How can the revenue be maximized?
Sharp Question Example
Goal:
• Your ultimate goal is to analyze your historical data and predict the stock price at a
future date.
• You have to study the different tables of data in your database and analyze how
your company is doing month by month in terms of sales.
• This will ultimately help you understand how the company is doing in terms
of its market share.
Sharp Question:
• What will be your company’s stock price next week?
Sharp Question Example (contd.)
The company’s database is divided into the following tables:
Step 3: Add Data to the Table
01 A data analyst arranges data in database tables in a systematic manner.
02 Systematic arrangement of data helps in detailed analysis.
03 Data is stored in the table in the form of columns and rows.
04 Table columns represent data of a single type, and rows represent records pertaining to one entity.
05 The final step is to aggregate, distribute, compute, or measure the data to derive an analysis.
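The row-and-column arrangement can be sketched as a list of records. The table contents, the column names, and the column helper below are hypothetical and serve only to illustrate the structure.

```python
# Hypothetical table: each dict is a row (one record), each key is a column.
rows = [
    {"date": "2016-01-04", "stock_price": 101.5, "region": "West"},
    {"date": "2016-01-05", "stock_price": 102.1, "region": "East"},
    {"date": "2016-01-06", "stock_price": 100.8, "region": "West"},
]


def column(table, name):
    """Return every value of one column, in row order."""
    return [row[name] for row in table]


prices = column(rows, "stock_price")

# A column should hold data of a single type; here every price is a float.
single_type = len({type(p) for p in prices}) == 1
```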
Data Analysis in Machine Learning
• Data analysis is the process of deriving new findings from the historical data.
• It mainly focuses on aggregating table data to find the answers to various business problems.
• It is one of the essential steps data analysts perform to build a machine learning algorithm.
Example: Add Data to the Table
• Each table row represents observations across given attributes.
• The stock price column shows the stock value across different dates.
Example: Data Analysis
Aggregate and distribute the data as shown here:

Quarter   Total Sales
2015Q4    119.2M
2016Q1    221.0M
2016Q2    215.9M
2016Q3    189.3M
2016Q4    211.2M
…         …

Month     Total Sales
2016/01   43.0M
2016/02   60.1M
2016/03   55.5M
Example: Aggregate
• You can aggregate the data in the table to derive answers.
• This process is called data analysis and involves counting total observations in a table or
combining data from multiple tables.
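As a rough sketch of aggregation, monthly totals can be rolled up into quarterly totals. The sales figures below are made up for illustration and are not the figures from the slides; quarter_of is a hypothetical helper.

```python
from collections import defaultdict

# Hypothetical monthly sales in millions (not the figures from the slides).
monthly_sales = {
    "2016/01": 10.0, "2016/02": 20.0, "2016/03": 30.0,
    "2016/04": 15.0, "2016/05": 25.0, "2016/06": 5.0,
}


def quarter_of(month_key: str) -> str:
    """Map a 'YYYY/MM' key to its 'YYYYQn' quarter label."""
    year, month = month_key.split("/")
    return f"{year}Q{(int(month) - 1) // 3 + 1}"


# Aggregate: sum every month's sales into its quarter.
quarterly_sales = defaultdict(float)
for month_key, amount in monthly_sales.items():
    quarterly_sales[quarter_of(month_key)] += amount
```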
Example: Aggregate and Distribute
• You can focus on all observations for a particular column or feature and total it.
Example: Distribute, Compute, and Measure
• This is an example of performing aggregate, distribute, compute and measure
operations on data in tables.
• Each feature and its observations are distributed across the table and then combined.
Example: Estimate
• The market share column shows the estimated stock price values of the company that are
derived from the previous steps.
Step 4: Check for Quality
A quality check determines whether the data is acceptable for further investigation.
For an algorithm to work, the data in a column should be in a consistent format.
It involves computation and analysis of the data derived from previous steps.
Check for Quality: Example
• The Birth year column in the table has data format inconsistencies.
• The date in this column needs to be converted to a consistent format to make it readable for
the ML algorithm.
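One possible way to bring such a column into a consistent format is sketched below. The raw values are hypothetical, since the table from the slide is not reproduced here, and the rule that two-digit years belong to the 1900s is an assumption that would need to be checked against the real data.

```python
import re
from typing import Optional


def normalize_birth_year(raw: str) -> Optional[int]:
    """Return a four-digit year, or None if no year can be recovered."""
    text = raw.strip()
    # Look for a standalone run of exactly four digits, e.g. in "03/12/1990".
    match = re.search(r"(?<!\d)(\d{4})(?!\d)", text)
    if match:
        return int(match.group(1))
    if re.fullmatch(r"\d{2}", text):
        # Assumption: two-digit years belong to the 1900s.
        return 1900 + int(text)
    return None


# Hypothetical inconsistent entries for a "Birth year" column.
raw_values = ["1985", "03/12/1990", " 1978-06-01 ", "85", "unknown"]
cleaned = [normalize_birth_year(value) for value in raw_values]
```

Entries that cannot be parsed come back as None so they can be reviewed or handled as missing values rather than silently guessed.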
Step 5: Transform Features
• This step includes Feature Engineering.
• Each characteristic of a data element is known as a feature.
• Feature engineering enables you to make sense out of the data, especially when
there are multiple features.
• Some features may not give useful information for the model, whereas some features
may be combined to derive meaningful information.
• Feature engineering helps you overcome such challenges.
Tricks of Feature Engineering
Data-specific:
• Scale Invariant Feature Transform (SIFT): Images
• Term Frequency-Inverse Document Frequency (TF-IDF): Text
Domain-specific:
• Econometric, technological, agricultural, and sociological data engineering
Deep Learning:
• Images, text, and audio data engineering
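To make TF-IDF concrete, here is a minimal sketch using one common textbook variant: term frequency is the count divided by document length, and inverse document frequency is log(N / df). Libraries often use smoothed variants, so their numbers may differ. The tf_idf function and the sample documents are hypothetical.

```python
import math
from collections import Counter


def tf_idf(docs):
    """Return one {term: weight} dict per document.

    tf = count / document length, idf = log(N / document frequency).
    """
    tokenized = [doc.lower().split() for doc in docs]
    n_docs = len(tokenized)
    # Document frequency: in how many documents each term appears.
    doc_freq = Counter(term for tokens in tokenized for term in set(tokens))
    weights = []
    for tokens in tokenized:
        counts = Counter(tokens)
        weights.append({
            term: (count / len(tokens)) * math.log(n_docs / doc_freq[term])
            for term, count in counts.items()
        })
    return weights


docs = ["best show ever", "worst show ever", "best pizza"]
scores = tf_idf(docs)
```

Terms that appear in fewer documents receive a higher weight, and a term that appears in every document receives a weight of zero under this variant.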
Transform Features: Example
• There are 3 columns and 65,670 rows.
• Features 0 and 1 have similar values.
• The numbers are meaningless and scattered.
Transform Features: Example (contd.)
• The values of feature column 0 are multiplied with the observations in feature column 1.
• These values are plotted in image 2.
Image 1 Image 2
Transform Features: Example (contd.)
• When we subtract feature 0 from feature 1 and plot the result, we get a curve.
• This curve is a normal (Gaussian) distribution, also known as a bell-shaped curve.
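The subtraction step can be sketched on synthetic data. The data below is not the data set from the slides: it is generated so that the two features have similar values and differ by Gaussian noise, which means the bell shape of the difference holds by construction. The sketch only shows how an engineered feature can expose structure that the raw columns hide.

```python
import random
import statistics

random.seed(0)  # fixed seed so the sketch is repeatable

# Synthetic stand-in for the slide's data: two features with similar values.
feature_0 = [random.uniform(0, 1000) for _ in range(10_000)]
feature_1 = [x + random.gauss(0, 1) for x in feature_0]

# Engineered feature: feature 1 minus feature 0.
diff = [b - a for a, b in zip(feature_0, feature_1)]

mean = statistics.fmean(diff)
stdev = statistics.stdev(diff)

# For a normal distribution, roughly 68% of values fall within one
# standard deviation of the mean.
within_one_sd = sum(abs(d - mean) <= stdev for d in diff) / len(diff)
```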
Step 6: Answer the Questions
• This step helps analyze if the obtained answers are clear.
• These questions include:
1 How much or how many?
2 Which category?
3 Which group?
4 Does this look strange?
5 Which action?
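These five question types are commonly paired with families of machine learning tasks. The mapping below is a sketch of that common pairing; the task names are an addition for illustration and are not taken from the lesson, and task_for is a hypothetical helper.

```python
# Commonly cited pairing of question types with ML task families.
# The task names are an illustrative addition, not part of the lesson.
QUESTION_TO_TASK = {
    "How much or how many?": "regression",
    "Which category?": "classification",
    "Which group?": "clustering",
    "Does this look strange?": "anomaly detection",
    "Which action?": "reinforcement learning",
}


def task_for(question: str) -> str:
    """Look up the task family usually associated with a question type."""
    return QUESTION_TO_TASK.get(question, "unknown")
```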
Answer the Questions: Type 1
How much or how many?
1 What will be the temperature this Friday?
2 How many people will like my post?
3 What will be my product sales next month?
Answer the Questions: Type 2
Which category?
1 Is this an image of a dog?
2 What is the topic of this news article?
3 Which hotel in my area offers free Wi-Fi?
Answer the Questions: Type 3
Which group?
1 Which shoppers purchase similar products?
2 Which group of viewers like horror movies?
3 How best can you divide this book into ten topics?
Answer the Questions: Type 4
Does this look strange?
1 Is this internet message typical?
2 Is this heartbeat reading abnormal?
3 Do these transactions look unusual compared with the customer’s usual credit card transactions?
Answer the Questions: Type 5
Which action?
1 Should I vacuum again or should I not?
2 Should I run the red light?
3 Should I raise or lower the temperature?
Step 7: Use the Answer
There are many ways to use the answer derived from the previous step:
1 Making a decision
2 Proposing the price of an item
3 Publishing the results as part of a research paper
4 Constructing a dashboard in Power BI
5 Making changes to product features
Note: Power BI is a business analytics tool by Microsoft.
Demo
Machine Learning Workflow
A demo on how a buyer decides which property he can purchase.
Key Takeaways
The machine learning workflow involves seven steps.
Step one is getting more data: gathering the relevant data needed to answer
business questions.
The second step is to ask sharp questions rather than vague ones so that the
answers are clear.
The third step is to arrange the raw data in tables so that it can be analyzed more easily.
In the fourth step, data quality is checked to ensure data consistency.
In the fifth step, features are transformed to make the machine learning
model more efficient.
In the sixth step, answers are derived from the data model to help you answer the
business questions.
In the seventh step, the answer is put to use, for example in production or in an ML algorithm.
Quiz
QUIZ
1 What are the different kinds of data?
a. Data as numbers only
b. Data can only be names that can be changed into numbers
c. Data includes names, numbers, and names that can be turned into numbers, but it excludes names that look like numbers
d. Data includes names, numbers, names that can be turned into numbers, and names that look like numbers
The correct answer is d.
Data can be names, numbers, names that look like numbers, and names that can be turned into numbers.
QUIZ
2 What are the different ways to ensure data quality?
a. Data quality is due to business unit malfunction or due to providing incomplete data
b. Data quality can be handled through communicating with business unit(s), handling missing numbers, removing outliers, plotting the values in a column, and fitting to a distribution
c. Once missing values in a column are removed, every column has value/observations and data quality reaches close to 100%
d. Data quality is the job of data analysts and Database Administrators (DBA)
The correct answer is b.
Data can be made consistent by handling missing numbers, plotting the column values, fitting them to distributions, and removing outliers.
This concludes “Machine Learning Workflow.”
The next lesson is “Performance Metrics.”