
UNIT 3
PREDICTIVE MODELING
Linear Regression – Polynomial Regression – Multivariate Regression – Multi-Level Models – Bias/Variance Trade-Off – K-Fold Cross Validation – Data Cleaning and Normalization – Cleaning Web Log Data – Normalizing Numerical Data – Detecting Outliers – Introduction to Supervised and Unsupervised Learning – Reinforcement Learning – Time Series Analysis – Moving Averages – Missing Values – Serial Correlation – Autocorrelation.
PREDICTIVE MODELING
• Predictive modelling is a statistical
technique used to predict the
outcome of future events based on
historical data.
• It involves building a mathematical
model that takes relevant input
variables and generates a predicted
output variable.
Linear Regression
• The term regression is used
when you try to find the
relationship between variables.
• In Machine Learning and in
statistical modeling, that
relationship is used to predict
the outcome of events.
Least Square Method
• Linear regression uses the least squares method.
• The idea is to draw a line through the plotted data points, positioned so that it minimizes the total squared distance to all of the data points.
• These distances are called "residuals" or "errors".
• In the usual illustration, dashed lines drawn from each data point to the fitted line represent these residuals.
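A minimal sketch of a least-squares line fit in Python; the data values here are made up for illustration:

```python
import numpy as np

# Made-up example data: hours studied vs. exam score
x = np.array([1, 2, 3, 4, 5, 6], dtype=float)
y = np.array([52, 55, 61, 64, 70, 74], dtype=float)

# Fit a degree-1 polynomial (a straight line) by least squares
slope, intercept = np.polyfit(x, y, deg=1)

# Residuals: vertical distances from each point to the fitted line
residuals = y - (slope * x + intercept)

print(f"y = {slope:.2f}x + {intercept:.2f}")
print("residuals:", np.round(residuals, 2))
```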
Polynomial Regression
• Polynomial Regression is a regression algorithm that models the relationship between a dependent variable (y) and an independent variable (x) as an nth-degree polynomial. The Polynomial Regression equation is given below:
y = b0 + b1x1 + b2x1^2 + b3x1^3 + ... + bnx1^n
In simple terms
• If your data points clearly will not fit a linear regression (a straight line through all the data points), polynomial regression may be the better choice.
• Polynomial regression, like linear regression, uses the relationship between the variables x and y to find the best way to draw a curve through the data points.
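A minimal sketch of fitting such a polynomial in Python; the sample data and the choice of degree 2 are illustrative assumptions:

```python
import numpy as np

# Made-up example data with a clearly non-linear trend
x = np.array([1, 2, 3, 4, 5, 6, 7, 8], dtype=float)
y = np.array([3, 9, 20, 38, 62, 93, 131, 176], dtype=float)

# Fit a 2nd-degree polynomial by least squares
coeffs = np.polyfit(x, y, deg=2)   # coefficients, highest degree first
model = np.poly1d(coeffs)          # callable polynomial

print(model)      # the fitted polynomial
print(model(9))   # predicted y for a new x value
```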
Multivariate Regression
• Multivariate Regression is a method used to measure the degree to which more than one independent variable and more than one dependent variable are linearly related.
Example
• Multivariate analysis aims to identify patterns between multiple variables. For example, you might model how the amount of time spent on social media, together with other factors, relates to an employee's productivity.
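A minimal multi-output sketch using scikit-learn's LinearRegression, which can fit several dependent variables at once; the column meanings and all numbers are hypothetical:

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Hypothetical predictors: [hours on social media per day, years of experience]
X = np.array([[1.0, 2], [2.5, 1], [0.5, 5], [3.0, 3], [1.5, 4]])
# Hypothetical responses: [productivity score, job-satisfaction score]
Y = np.array([[78, 70], [62, 55], [90, 85], [58, 60], [81, 75]])

# LinearRegression fits all response columns simultaneously
model = LinearRegression().fit(X, Y)

print(model.coef_)                 # one coefficient row per response variable
print(model.predict([[2.0, 3]]))   # predicted [productivity, satisfaction]
```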
Bias/Variance Trade Off
What is Bias?
• While making predictions, a difference occurs between the values predicted by the model and the actual (expected) values; this difference is known as bias error, or error due to bias.
• Low Bias: A low bias model will
make fewer assumptions about the
form of the target function.
• High Bias: A model with a high bias
makes more assumptions, and the
model becomes unable to capture
the important features of our
dataset. A high bias model also
cannot perform well on new data.
What is a Variance Error?
• Variance tells how much a random variable differs from its expected value. In modeling terms, it measures how much a model's predictions would change if it were trained on a different sample of the data.
• Low-Bias, Low-Variance:
The combination of low bias and low variance is the ideal machine learning model. In practice, however, it is not achievable.
• Low-Bias, High-Variance: With low bias and high variance, model predictions are inconsistent but accurate on average. This case occurs when the model learns a large number of parameters, which leads to overfitting.
• High-Bias, Low-Variance:
1. Predictions are consistent but inaccurate on average. This case occurs when a model does not learn well from the training dataset or uses too few parameters.
2. It leads to underfitting problems in the model.
• High-Bias, High-Variance:
With high bias and high variance, predictions
are inconsistent and also inaccurate on
average.
Bias-Variance Trade-Off
• While building the machine learning model, it is
really important to take care of bias and variance
in order to avoid overfitting and underfitting in
the model.
• If the model is very simple with fewer parameters,
it may have low variance and high bias. Whereas,
if the model has a large number of parameters, it
will have high variance and low bias.
• So, it is required to make a balance between bias
and variance errors, and this balance between the
bias error and variance error is known as the Bias-
Variance trade-off.
• For an accurate model, algorithms need both low variance and low bias. But this is not possible in general, because bias and variance are tied to each other:
• If we decrease the variance, the bias tends to increase.
• If we decrease the bias, the variance tends to increase.
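One way to see the trade-off is to fit the same noisy data with polynomials of increasing degree; in this illustrative sketch, degree 1 underfits (high bias) while degree 9 overfits (high variance):

```python
import numpy as np

rng = np.random.default_rng(0)
x = np.linspace(0, 1, 20)
y_true = np.sin(2 * np.pi * x)                     # the underlying pattern
y = y_true + rng.normal(scale=0.2, size=x.size)    # noisy observations

for degree in (1, 3, 9):
    coeffs = np.polyfit(x, y, deg=degree)
    pred = np.polyval(coeffs, x)
    mse = np.mean((pred - y_true) ** 2)            # error against the true curve
    print(f"degree {degree}: error vs. true curve = {mse:.4f}")
```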
K Fold Cross Validation
• The concept of cross-validation is
widely used in data science and
machine learning.
• It’s a way to verify the performance
of a predictive model before using it
in an actual situation.
• Essentially, it helps you avoid
creating inaccurate predictions.
Steps in K Fold
• In K-fold cross-validation, the data set is divided into K equally sized folds.
• K represents the number of groups into which the data sample is divided.
• For example, if k is set to 5, it is called 5-fold cross-validation. Each fold is used as a test set at some point in the process.
• Randomly shuffle the dataset.
• Divide the dataset into k folds.
• For each unique fold (see the sketch after this list):
– Use that fold as the test data.
– Use the remaining folds as the training dataset.
– Fit the model on the training set and evaluate it on the test set.
– Keep the evaluation score.
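A minimal sketch of these steps with scikit-learn's KFold; the Iris dataset and logistic regression are stand-ins for any data and model:

```python
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import KFold

X, y = load_iris(return_X_y=True)
kf = KFold(n_splits=5, shuffle=True, random_state=0)   # shuffle, then 5 folds

scores = []
for train_idx, test_idx in kf.split(X):
    model = LogisticRegression(max_iter=1000)
    model.fit(X[train_idx], y[train_idx])              # fit on training folds
    preds = model.predict(X[test_idx])                 # evaluate on held-out fold
    scores.append(accuracy_score(y[test_idx], preds))  # keep the score

print("fold scores:", [round(s, 3) for s in scores])
print("mean accuracy:", round(sum(scores) / len(scores), 3))
```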
Data Cleaning & Normalization
• Data cleaning and
normalization are both
processes that improve the
quality of data, but they
have different goals and
approaches:
Data cleaning
• Focuses on the accuracy of data by
identifying and fixing errors,
duplicates, outliers, and missing
values.
Data normalization
• Focuses on the structure of data by
reorganizing it to remove redundancy
and ensure consistency.
Cleaning Web Log Data
• Web data cleaning is the
process of making raw data
more structured and readable.
• It can help improve data quality
and reduce the size of a
dataset.
Steps and tools that can help with web
data cleaning
Identify and remove duplicates
• This is a key step in data cleaning. You can
compare unique identifiers like email
addresses or ID numbers to find duplicates.
Validate data accuracy
• You can use cross-checks and other
verification methods to ensure data
accuracy. This is important for maintaining
the reliability of machine-learning models.
Discard outliers
• Outliers are unusual values in your
data that can introduce extreme
data variance. This can lead to
inaccurate conclusions.
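A small pandas sketch of these three steps on a hypothetical web-log table; the column names (ip, url, status, response_ms) and the outlier cutoff are assumptions made for the example:

```python
import pandas as pd

# Hypothetical web-log records
log = pd.DataFrame({
    "ip": ["10.0.0.1", "10.0.0.1", "10.0.0.2", "10.0.0.3"],
    "url": ["/home", "/home", "/login", "/home"],
    "status": [200, 200, 200, 200],
    "response_ms": [120, 120, 95, 9000],
})

# 1) Identify and remove exact duplicate requests
log = log.drop_duplicates()

# 2) Validate data accuracy: keep only well-formed HTTP status codes
log = log[log["status"].between(100, 599)]

# 3) Discard outliers: here a simple cutoff on response time
#    (IQR or z-score rules work better on larger samples)
log = log[log["response_ms"] < 5000]

print(log)
```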
Data cleaning tools
• OpenRefine: A free, open-source
tool that allows users to convert
data between formats, parse online
data, and work with collected data
locally.
• IBM InfoSphere DataStage: A data
cleaning tool that helps
organizations clean, organize, and
analyze their data.
• TIBCO: A cloud-based SaaS tool that
helps clean data from multiple
sources. It has advanced data profiling
and sampling functions.
• WinPure: A free, downloadable tool that
can clean large datasets by removing
duplicate data and standardizing it.
• DataCleaner: A data quality analysis
platform that helps with data cleaning,
data transformation, and data merging.
Normalizing numerical data
• Normalizing numerical data
is the process of changing the
values of numeric columns in a
dataset to use a common scale
without losing information or
distorting the differences in the
ranges of values.
Ways to normalize numerical data
Divide by the maximum value
A simple approach is to divide each value by the
maximum value in the dataset. For example, if
the maximum value is 100 and a value is 75, the
normalized value is 0.75.
Min-Max normalization:
This technique scales the values of a feature to a
range between 0 and 1. This is done by
subtracting the minimum value of the feature
from each value, and then dividing by the range
of the feature.
• Z-score normalization: This technique
scales the values of a feature to have a
mean of 0 and a standard deviation of
1. This is done by subtracting the mean
of the feature from each value, and
then dividing by the standard
deviation.
• Decimal Scaling: This technique scales
the values of a feature by dividing the
values of a feature by a power of 10.
• Logarithmic transformation: This
technique applies a logarithmic
transformation to the values of a
feature.
• Root transformation: This
technique applies a square root
transformation to the values of a
feature.
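A short numpy sketch of three of these techniques on a made-up array:

```python
import numpy as np

values = np.array([20.0, 35.0, 50.0, 80.0, 100.0])

# Min-max normalization: scale values to the [0, 1] range
min_max = (values - values.min()) / (values.max() - values.min())

# Z-score normalization: mean 0, standard deviation 1
z_score = (values - values.mean()) / values.std()

# Decimal scaling: divide by a power of 10 so that |x| < 1
power = 10 ** len(str(int(abs(values).max())))
decimal = values / power

print(min_max)
print(z_score.round(3))
print(decimal)
```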
Detecting Outliers
• An outlier is a data point that
significantly deviates from the
rest of the data. It can be either
much higher or much lower than
the other data points, and its
presence can have a significant
impact on the results.
(refer unit 2)
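As a minimal reminder of one common approach (the details are covered in Unit 2), here is an IQR-based check on made-up data:

```python
import numpy as np

data = np.array([12, 14, 15, 15, 16, 18, 19, 95])   # 95 looks suspicious

q1, q3 = np.percentile(data, [25, 75])
iqr = q3 - q1
low, high = q1 - 1.5 * iqr, q3 + 1.5 * iqr   # the common 1.5*IQR fences

print("outliers:", data[(data < low) | (data > high)])
```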
Supervised and Unsupervised
Learning
• In supervised learning, the machine is trained
on a set of labeled data, which means that the
input data is paired with the desired output.
• Supervised learning is often used for tasks
such as classification, regression, and object
detection.
• For example, a labeled dataset of images of elephants, camels and cows would have each image tagged with either "Elephant", "Camel" or "Cow".
• In unsupervised learning, the machine is trained
on a set of unlabeled data, which means that
the input data is not paired with the desired
output.
• The machine then learns to find patterns and
relationships in the data.
• Unsupervised learning is often used for tasks
such as clustering, dimensionality reduction,
and anomaly detection.
• The goal of unsupervised learning is to discover patterns and relationships in the data without any explicit guidance (that is, without labeled training data).
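A side-by-side sketch in scikit-learn; the Iris dataset and the two algorithms are used purely as illustrations:

```python
from sklearn.cluster import KMeans
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)

# Supervised: the labels y guide the training (classification)
clf = DecisionTreeClassifier().fit(X, y)
print(clf.predict(X[:3]))    # predicted class labels

# Unsupervised: only X is used; the algorithm finds groups on its own
km = KMeans(n_clusters=3, n_init=10, random_state=0).fit(X)
print(km.labels_[:3])        # discovered cluster assignments
```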
Reinforcement Learning
• Reinforcement learning (RL) is a machine
learning technique that teaches software
to make decisions to achieve the best
results.
• It's a powerful method that helps artificial
intelligence (AI) systems achieve optimal
outcomes in unseen environments.
• It mimics the trial-and-error learning
process that humans use to achieve their
goals.
Key Concepts of Reinforcement Learning
• Agent: The learner or decision-maker.
• Environment: Everything the agent
interacts with.
• State: A specific situation in which the
agent finds itself.
• Action: All possible moves the agent can
make.
• Reward: Feedback from the environment
based on the action taken.
Types of Reinforcement:
• Positive: Positive reinforcement occurs when an event, produced by a particular behaviour, increases the strength and frequency of that behaviour. In other words, it has a positive effect on behaviour.
• Negative: Negative reinforcement is the strengthening of a behaviour because a negative condition is stopped or avoided.
Elements of Reinforcement Learning
i) Policy: Defines the agent's behavior at a given time.
ii) Reward Function: Defines the goal of the RL problem by providing feedback.
iii) Value Function: Estimates long-term rewards from a state.
iv) Model of the Environment: Helps in predicting future states and rewards for planning.
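A tiny tabular Q-learning sketch that ties these elements together on a hypothetical 5-state corridor (the agent starts at state 0 and is rewarded only for reaching state 4); all parameter values are illustrative assumptions:

```python
import numpy as np

n_states, n_actions = 5, 2            # actions: 0 = move left, 1 = move right
Q = np.zeros((n_states, n_actions))   # tabular value function
alpha, gamma, eps = 0.5, 0.9, 0.2     # learning rate, discount, exploration rate
rng = np.random.default_rng(0)

for episode in range(200):
    s = 0
    while s != n_states - 1:
        # Epsilon-greedy policy: mostly exploit, occasionally explore
        a = rng.integers(n_actions) if rng.random() < eps else int(Q[s].argmax())
        s_next = max(0, s - 1) if a == 0 else s + 1
        r = 1.0 if s_next == n_states - 1 else 0.0    # reward function
        # Update toward the reward plus discounted future value
        Q[s, a] += alpha * (r + gamma * Q[s_next].max() - Q[s, a])
        s = s_next

print(Q.round(2))   # learned action values per state
```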
Time series analysis
• Time series analysis is a statistical
method that analyzes data points
collected at regular intervals over a
period of time to identify patterns,
trends, and seasonality.
• It's a powerful tool that can help
organizations make informed decisions
and accurate forecasts based on
historical data.
key features of time series analysis
Data collection
Data points are recorded at consistent intervals
over a set period of time.
Data type
Time series data can be classified as either metrics
(gathered at regular intervals) or events
(gathered at irregular intervals).
Data size
Time series analysis typically requires a large number of data points to ensure consistency and reliability.
Data analysis
Mathematical tools are used to
identify patterns, trends, seasonality,
and irregularities in the data.
Data use
Time series analysis can be used to
forecast future data based on
historical data.
Moving Averages
• In time series analysis, moving
averages are a way to smooth out a
series of data by calculating an
average of a set number of items.
• A moving average is a series of
averages, calculated from historic
data.
Moving Averages – Examples
Sales
To calculate the simple moving average (SMA)
for sales in the first quarter of the year, you
can divide the total sales for the quarter by
the number of months in the quarter.
Stock prices
To calculate a 7-day moving average for a
stock, you can add up the closing prices for
the last 7 days and divide by 7.
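A short pandas sketch of the 7-day moving average; the closing prices are made up:

```python
import pandas as pd

# Made-up daily closing prices
prices = pd.Series([101, 103, 102, 105, 107, 106, 108, 110, 109, 111])

# 7-day simple moving average: mean of each trailing 7-value window
sma7 = prices.rolling(window=7).mean()

print(sma7)   # the first 6 entries are NaN until a full window is available
```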