
UNIT 3
PREDICTIVE MODELING
Linear Regression – Polynomial Regression – Multivariate Regression – Multi-Level Models – Bias/Variance Trade-Off – K-Fold Cross Validation – Data Cleaning and Normalization – Cleaning Web Log Data – Normalizing Numerical Data – Detecting Outliers – Introduction to Supervised and Unsupervised Learning – Reinforcement Learning – Time Series Analysis – Moving Averages – Missing Values – Serial Correlation – Autocorrelation.
PREDICTIVE MODELING
• Predictive modelling is a statistical
technique used to predict the
outcome of future events based on
historical data.
• It involves building a mathematical
model that takes relevant input
variables and generates a predicted
output variable.
Linear Regression
• The term regression is used
when you try to find the
relationship between variables.
• In Machine Learning and in
statistical modeling, that
relationship is used to predict
the outcome of events.
Least Square Method
• Linear regression uses the least squares method.
• The idea is to draw a line through the plotted data points, positioned so that it minimizes the total squared distance to all of the data points.
• These distances are called "residuals" or "errors".
• In the usual illustration, dashed lines drawn from each data point to the fitted line represent these residuals.
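A minimal sketch of a least-squares line fit in Python; the data values here are made up for illustration:

```python
import numpy as np

# Made-up example data: hours studied vs. exam score
x = np.array([1, 2, 3, 4, 5, 6], dtype=float)
y = np.array([52, 55, 61, 64, 70, 74], dtype=float)

# Fit a degree-1 polynomial (a straight line) by least squares
slope, intercept = np.polyfit(x, y, deg=1)

# Residuals: vertical distances from each point to the fitted line
residuals = y - (slope * x + intercept)

print(f"y = {slope:.2f}x + {intercept:.2f}")
print("residuals:", np.round(residuals, 2))
```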
Polynomial Regression
• Polynomial Regression is a regression algorithm that models the relationship between a dependent variable (y) and an independent variable (x) as an nth-degree polynomial. The Polynomial Regression equation is given below:
y = b0 + b1x1 + b2x1^2 + b3x1^3 + ... + bnx1^n
In simple terms
• If your data points clearly will not fit a linear regression (a straight line through all the data points), polynomial regression may be the better choice.
• Polynomial regression, like linear regression, uses the relationship between the variables x and y to find the best way to draw a curve through the data points.
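A minimal sketch of fitting such a polynomial in Python; the sample data and the choice of degree 2 are illustrative assumptions:

```python
import numpy as np

# Made-up example data with a clearly non-linear trend
x = np.array([1, 2, 3, 4, 5, 6, 7, 8], dtype=float)
y = np.array([3, 9, 20, 38, 62, 93, 131, 176], dtype=float)

# Fit a 2nd-degree polynomial by least squares
coeffs = np.polyfit(x, y, deg=2)   # coefficients, highest degree first
model = np.poly1d(coeffs)          # callable polynomial

print(model)      # the fitted polynomial
print(model(9))   # predicted y for a new x value
```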
Multivariate Regression
• Multivariate Regression is a method used to measure the degree to which more than one independent variable and more than one dependent variable are linearly related.
Example
• Multivariate analysis aims to identify patterns between multiple variables. For example, you might model how the amount of time spent on social media, together with other factors, relates to an employee's productivity.
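A minimal multi-output sketch using scikit-learn's LinearRegression, which can fit several dependent variables at once; the column meanings and all numbers are hypothetical:

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Hypothetical predictors: [hours on social media per day, years of experience]
X = np.array([[1.0, 2], [2.5, 1], [0.5, 5], [3.0, 3], [1.5, 4]])
# Hypothetical responses: [productivity score, job-satisfaction score]
Y = np.array([[78, 70], [62, 55], [90, 85], [58, 60], [81, 75]])

# LinearRegression fits all response columns simultaneously
model = LinearRegression().fit(X, Y)

print(model.coef_)                 # one coefficient row per response variable
print(model.predict([[2.0, 3]]))   # predicted [productivity, satisfaction]
```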
Bias/Variance Trade Off
What is Bias?
• While making predictions, a difference occurs between the values predicted by the model and the actual (expected) values; this difference is known as bias error, or error due to bias.
• Low Bias: A low bias model will
make fewer assumptions about the
form of the target function.
• High Bias: A model with a high bias
makes more assumptions, and the
model becomes unable to capture
the important features of our
dataset. A high bias model also
cannot perform well on new data.
What is a Variance Error?
• Variance tells how much a random variable differs from its expected value. In modeling terms, it measures how much a model's predictions would change if it were trained on a different sample of the data.
• Low-Bias, Low-Variance:
The combination of low bias and low variance is the ideal machine learning model. In practice, however, it is not achievable.
• Low-Bias, High-Variance: With low bias and high variance, model predictions are inconsistent but accurate on average. This case occurs when the model learns a large number of parameters, which leads to overfitting.
• High-Bias, Low-Variance:
1. Predictions are consistent but inaccurate on average. This case occurs when a model does not learn well from the training dataset or uses too few parameters.
2. It leads to underfitting problems in the model.
• High-Bias, High-Variance:
With high bias and high variance, predictions
are inconsistent and also inaccurate on
average.
Bias-Variance Trade-Off
• While building the machine learning model, it is
really important to take care of bias and variance
in order to avoid overfitting and underfitting in
the model.
• If the model is very simple with fewer parameters,
it may have low variance and high bias. Whereas,
if the model has a large number of parameters, it
will have high variance and low bias.
• So, it is required to make a balance between bias
and variance errors, and this balance between the
bias error and variance error is known as the Bias-
Variance trade-off.
• For an accurate model, algorithms need both low variance and low bias. But this is not possible in general, because bias and variance are tied to each other:
• If we decrease the variance, the bias tends to increase.
• If we decrease the bias, the variance tends to increase.
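One way to see the trade-off is to fit the same noisy data with polynomials of increasing degree; in this illustrative sketch, degree 1 underfits (high bias) while degree 9 overfits (high variance):

```python
import numpy as np

rng = np.random.default_rng(0)
x = np.linspace(0, 1, 20)
y_true = np.sin(2 * np.pi * x)                     # the underlying pattern
y = y_true + rng.normal(scale=0.2, size=x.size)    # noisy observations

for degree in (1, 3, 9):
    coeffs = np.polyfit(x, y, deg=degree)
    pred = np.polyval(coeffs, x)
    mse = np.mean((pred - y_true) ** 2)            # error against the true curve
    print(f"degree {degree}: error vs. true curve = {mse:.4f}")
```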
K Fold Cross Validation
• The concept of cross-validation is
widely used in data science and
machine learning.
• It’s a way to verify the performance
of a predictive model before using it
in an actual situation.
• Essentially, it helps you avoid
creating inaccurate predictions.
Steps in K Fold
• In K-fold cross-validation, the data set is divided into K equally sized folds.
• K represents the number of groups into which the data sample is divided.
• For example, if k is set to 5, it is called 5-fold cross-validation. Each fold is used as a test set at some point in the process.
• Randomly shuffle the dataset.
• Divide the dataset into k folds.
• For each unique fold (see the sketch after this list):
– Use that fold as the test data.
– Use the remaining folds as the training dataset.
– Fit the model on the training set and evaluate it on the test set.
– Keep the evaluation score.
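A minimal sketch of these steps with scikit-learn's KFold; the Iris dataset and logistic regression are stand-ins for any data and model:

```python
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import KFold

X, y = load_iris(return_X_y=True)
kf = KFold(n_splits=5, shuffle=True, random_state=0)   # shuffle, then 5 folds

scores = []
for train_idx, test_idx in kf.split(X):
    model = LogisticRegression(max_iter=1000)
    model.fit(X[train_idx], y[train_idx])              # fit on training folds
    preds = model.predict(X[test_idx])                 # evaluate on held-out fold
    scores.append(accuracy_score(y[test_idx], preds))  # keep the score

print("fold scores:", [round(s, 3) for s in scores])
print("mean accuracy:", round(sum(scores) / len(scores), 3))
```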
Data Cleaning & Normalization
• Data cleaning and
normalization are both
processes that improve the
quality of data, but they
have different goals and
approaches:
Data cleaning
• Focuses on the accuracy of data by
identifying and fixing errors,
duplicates, outliers, and missing
values.
Data normalization
• Focuses on the structure of data by
reorganizing it to remove redundancy
and ensure consistency.
Cleaning Web Log Data
• Web data cleaning is the
process of making raw data
more structured and readable.
• It can help improve data quality
and reduce the size of a
dataset.
Steps and tools that can help with web
data cleaning
Identify and remove duplicates
• This is a key step in data cleaning. You can
compare unique identifiers like email
addresses or ID numbers to find duplicates.
Validate data accuracy
• You can use cross-checks and other
verification methods to ensure data
accuracy. This is important for maintaining
the reliability of machine-learning models.
Discard outliers
• Outliers are unusual values in your
data that can introduce extreme
data variance. This can lead to
inaccurate conclusions.
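A small pandas sketch of these three steps on a hypothetical web-log table; the column names (ip, url, status, response_ms) and the outlier cutoff are assumptions made for the example:

```python
import pandas as pd

# Hypothetical web-log records
log = pd.DataFrame({
    "ip": ["10.0.0.1", "10.0.0.1", "10.0.0.2", "10.0.0.3"],
    "url": ["/home", "/home", "/login", "/home"],
    "status": [200, 200, 200, 200],
    "response_ms": [120, 120, 95, 9000],
})

# 1) Identify and remove exact duplicate requests
log = log.drop_duplicates()

# 2) Validate data accuracy: keep only well-formed HTTP status codes
log = log[log["status"].between(100, 599)]

# 3) Discard outliers: here a simple cutoff on response time
#    (IQR or z-score rules work better on larger samples)
log = log[log["response_ms"] < 5000]

print(log)
```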
Data cleaning tools
• OpenRefine: A free, open-source
tool that allows users to convert
data between formats, parse online
data, and work with collected data
locally.
• IBM InfoSphere DataStage: A data
cleaning tool that helps
organizations clean, organize, and
analyze their data.
• TIBCO: A cloud-based SaaS tool that
helps clean data from multiple
sources. It has advanced data profiling
and sampling functions.
• WinPure: A free, downloadable tool that
can clean large datasets by removing
duplicate data and standardizing it.
• DataCleaner: A data quality analysis
platform that helps with data cleaning,
data transformation, and data merging.
Normalizing numerical data
• Normalizing numerical data
is the process of changing the
values of numeric columns in a
dataset to use a common scale
without losing information or
distorting the differences in the
ranges of values.
Ways to normalize numerical data
Divide by the maximum value
A simple approach is to divide each value by the
maximum value in the dataset. For example, if
the maximum value is 100 and a value is 75, the
normalized value is 0.75.
Min-Max normalization:
This technique scales the values of a feature to a
range between 0 and 1. This is done by
subtracting the minimum value of the feature
from each value, and then dividing by the range
of the feature.
• Z-score normalization: This technique
scales the values of a feature to have a
mean of 0 and a standard deviation of
1. This is done by subtracting the mean
of the feature from each value, and
then dividing by the standard
deviation.
• Decimal Scaling: This technique scales
the values of a feature by dividing the
values of a feature by a power of 10.
• Logarithmic transformation: This
technique applies a logarithmic
transformation to the values of a
feature.
• Root transformation: This
technique applies a square root
transformation to the values of a
feature.
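A short numpy sketch of three of these techniques on a made-up array:

```python
import numpy as np

values = np.array([20.0, 35.0, 50.0, 80.0, 100.0])

# Min-max normalization: scale values to the [0, 1] range
min_max = (values - values.min()) / (values.max() - values.min())

# Z-score normalization: mean 0, standard deviation 1
z_score = (values - values.mean()) / values.std()

# Decimal scaling: divide by a power of 10 so that |x| < 1
power = 10 ** len(str(int(abs(values).max())))
decimal = values / power

print(min_max)
print(z_score.round(3))
print(decimal)
```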
Detecting Outliers
• An outlier is a data point that
significantly deviates from the
rest of the data. It can be either
much higher or much lower than
the other data points, and its
presence can have a significant
impact on the results.
(refer unit 2)
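As a minimal reminder of one common approach (the details are covered in Unit 2), here is an IQR-based check on made-up data:

```python
import numpy as np

data = np.array([12, 14, 15, 15, 16, 18, 19, 95])   # 95 looks suspicious

q1, q3 = np.percentile(data, [25, 75])
iqr = q3 - q1
low, high = q1 - 1.5 * iqr, q3 + 1.5 * iqr   # the common 1.5*IQR fences

print("outliers:", data[(data < low) | (data > high)])
```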
Supervised and Unsupervised
Learning
• In supervised learning, the machine is trained
on a set of labeled data, which means that the
input data is paired with the desired output.
• Supervised learning is often used for tasks
such as classification, regression, and object
detection.
• For example, a labeled dataset of images of elephants, camels and cows would have each image tagged with either "Elephant", "Camel" or "Cow".
• In unsupervised learning, the machine is trained
on a set of unlabeled data, which means that
the input data is not paired with the desired
output.
• The machine then learns to find patterns and
relationships in the data.
• Unsupervised learning is often used for tasks
such as clustering, dimensionality reduction,
and anomaly detection.
• The goal of unsupervised learning is to discover patterns and relationships in the data without any explicit guidance (that is, without labeled training data).
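A side-by-side sketch in scikit-learn; the Iris dataset and the two algorithms are used purely as illustrations:

```python
from sklearn.cluster import KMeans
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)

# Supervised: the labels y guide the training (classification)
clf = DecisionTreeClassifier().fit(X, y)
print(clf.predict(X[:3]))    # predicted class labels

# Unsupervised: only X is used; the algorithm finds groups on its own
km = KMeans(n_clusters=3, n_init=10, random_state=0).fit(X)
print(km.labels_[:3])        # discovered cluster assignments
```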
Reinforcement Learning
• Reinforcement learning (RL) is a machine
learning technique that teaches software
to make decisions to achieve the best
results.
• It's a powerful method that helps artificial
intelligence (AI) systems achieve optimal
outcomes in unseen environments.
• It mimics the trial-and-error learning
process that humans use to achieve their
goals.
Key Concepts of Reinforcement Learning
• Agent: The learner or decision-maker.
• Environment: Everything the agent
interacts with.
• State: A specific situation in which the
agent finds itself.
• Action: All possible moves the agent can
make.
• Reward: Feedback from the environment
based on the action taken.
Types of Reinforcement:
• Positive: Positive reinforcement occurs when an event, produced by a particular behaviour, increases the strength and frequency of that behaviour. In other words, it has a positive effect on behaviour.
• Negative: Negative reinforcement is the strengthening of a behaviour because a negative condition is stopped or avoided.
Elements of Reinforcement Learning
i) Policy: Defines the agent's behavior at a given time.
ii) Reward Function: Defines the goal of the RL problem by providing feedback.
iii) Value Function: Estimates long-term rewards from a state.
iv) Model of the Environment: Helps in predicting future states and rewards for planning.
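A tiny tabular Q-learning sketch that ties these elements together on a hypothetical 5-state corridor (the agent starts at state 0 and is rewarded only for reaching state 4); all parameter values are illustrative assumptions:

```python
import numpy as np

n_states, n_actions = 5, 2            # actions: 0 = move left, 1 = move right
Q = np.zeros((n_states, n_actions))   # tabular value function
alpha, gamma, eps = 0.5, 0.9, 0.2     # learning rate, discount, exploration rate
rng = np.random.default_rng(0)

for episode in range(200):
    s = 0
    while s != n_states - 1:
        # Epsilon-greedy policy: mostly exploit, occasionally explore
        a = rng.integers(n_actions) if rng.random() < eps else int(Q[s].argmax())
        s_next = max(0, s - 1) if a == 0 else s + 1
        r = 1.0 if s_next == n_states - 1 else 0.0    # reward function
        # Update toward the reward plus discounted future value
        Q[s, a] += alpha * (r + gamma * Q[s_next].max() - Q[s, a])
        s = s_next

print(Q.round(2))   # learned action values per state
```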
Time series analysis
• Time series analysis is a statistical
method that analyzes data points
collected at regular intervals over a
period of time to identify patterns,
trends, and seasonality.
• It's a powerful tool that can help
organizations make informed decisions
and accurate forecasts based on
historical data.
key features of time series analysis
Data collection
Data points are recorded at consistent intervals
over a set period of time.
Data type
Time series data can be classified as either metrics
(gathered at regular intervals) or events
(gathered at irregular intervals).
Data size
Time series analysis typically requires a large number of data points to ensure consistency and reliability.
Data analysis
Mathematical tools are used to
identify patterns, trends, seasonality,
and irregularities in the data.
Data use
Time series analysis can be used to
forecast future data based on
historical data.
Moving Averages
• In time series analysis, moving
averages are a way to smooth out a
series of data by calculating an
average of a set number of items.
• A moving average is a series of
averages, calculated from historic
data.
Moving Averages – Examples
Sales
To calculate the simple moving average (SMA)
for sales in the first quarter of the year, you
can divide the total sales for the quarter by
the number of months in the quarter.
Stock prices
To calculate a 7-day moving average for a
stock, you can add up the closing prices for
the last 7 days and divide by 7.
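A short pandas sketch of the 7-day moving average; the closing prices are made up:

```python
import pandas as pd

# Made-up daily closing prices
prices = pd.Series([101, 103, 102, 105, 107, 106, 108, 110, 109, 111])

# 7-day simple moving average: mean of each trailing 7-value window
sma7 = prices.rolling(window=7).mean()

print(sma7)   # the first 6 entries are NaN until a full window is available
```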