Predictive Analytics - Regression
Introductory
Big Data
College of Engineering
Chapter 8 - Predictive Analytics - Regression
Learning Objectives
• Predictive analytics elements
• Regression Models
• Linear Regression
• Predictive performance measures
Predictive Analytics
Descriptive analytics: Answers “What happened?”
• Summarizes or condenses data to extract patterns
• The result of a given method or technique is obtained directly by applying an algorithm to the data
• Examples: relationship between height and weight, average grade in the class, students with similar study interests, clusters of mall customers or patient groups, etc.
Predictive analytics: Answers “What might happen in the future?”
• Produces predictions based on predictive models
• A predictive model is a generalization of the relationship between data and the desired output. It associates the hidden relationships in data with a sought or target prediction
• Predictive tasks do not predict what is going to happen in the future, but how likely or probable the outcomes of a given event are
• Examples: predicting the possibility of a new patient getting a certain disease, based on the history of genetic data in previous patients' records
Examples – Predictive analytics
[Table: columns Name, Age, Diagnosis]
Predictive Analytics
• Data Labels
• Predictive Analytics Elements
Data Labels
Data Labels – Desired Predictions
• Datasets can be labeled or unlabeled (i.e., annotated or unannotated)
• Predictive tasks use labeled data
• Labeled data is data whose outcome is already known; it guides the prediction of labels (desired outcomes) for new, unlabeled data
• A label represents a possible outcome of an event and can be of several types
• For example:
  • a person can be labeled “child” or “adult”, or “male” or “female” (binary labels)
  • a car can be of type “family”, “sport”, “terrain” or “truck” (multi-class labels)
  • movies can be rated “worst”, “bad”, “neutral”, “good” or “excellent” (multi-class labels)
Predictive Analytics Elements
• Training data: the historical data used to induce a model
• Training: the process of utilizing an algorithm to build a generalizable model
that can correctly associate the attributes of each instance to a true prediction
• Predictive attributes: all the features utilized in the training process
• Target attributes: data labels that indicate the desired prediction
• Test data: the data used to test the performance/quality of the predictive model
• Predictive model: predictive techniques usually build or induce a predictive
model from the labeled data
• Once a predictive model is induced on the training set, it can be used to predict
the correct label for new data in the next step
• Prediction models are not 100% accurate
• We aim to minimize the number or extent of future mispredictions produced by prediction models
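As a rough illustration of how these elements fit together, the sketch below uses a small made-up data set with the “child”/“adult” labels and a 1-nearest-neighbour model as a stand-in for any predictive technique. The ages, the training/test split, and the names `train` and `predict` are assumptions chosen for illustration; they are not part of the course material.

```python
# Hypothetical labeled data: (age, label).
# "age" is the predictive attribute, the label is the target attribute.
training_data = [(5, "child"), (14, "child"), (25, "adult"), (42, "adult")]
test_data = [(9, "child"), (18, "adult"), (30, "adult"), (67, "adult")]


def train(instances):
    """Induce a 1-nearest-neighbour model from labeled training data."""
    def predict(age):
        # Predict the label of the training instance with the closest age.
        nearest = min(instances, key=lambda inst: abs(inst[0] - age))
        return nearest[1]
    return predict


model = train(training_data)

# Evaluate on the test data: count the mispredictions.
mispredictions = sum(1 for age, label in test_data if model(age) != label)
print(f"{mispredictions} of {len(test_data)} test instances mispredicted")
```

On this toy data the model mislabels the 18-year-old, which illustrates that an induced model is not 100% accurate.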
Predictive Analytics - Example
• A hospital might have the records of several patients, and each record would be the result of a set of clinical examinations and the diagnosis (i.e., the feature space) for one of the patients. The set of patient records is called the training data
Univariate Linear Regression
• The model: $\hat{y} = \beta_0 + \beta_1 \times x_1$, where $x_1$ is the predictive attribute and $\hat{y}$ is the predicted value of the target attribute
• $\beta_1$ is associated with the importance of the predictive attribute (the weight, in our example); it is an unknown value that needs to be calculated
• $\beta_0$ is called the intercept and is the value of $\hat{y}$ where the linear model intercepts the y-axis, in other words when $x_1 = 0$
• When developing a univariate regression model, we aim to find the parameters $\beta_0$ and $\beta_1$ that produce predictions with the smallest possible prediction error
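One common way to find such parameters is ordinary least squares, which picks the intercept and slope that minimize the sum of squared errors on the training set. The slides do not name the estimation method, so treat this as an assumed, minimal sketch; the data points and the function name `fit_univariate` are made up for illustration.

```python
def fit_univariate(xs, ys):
    """Return (beta0, beta1) of the least-squares line y = beta0 + beta1 * x."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Slope: covariance of x and y divided by the variance of x.
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    beta1 = sxy / sxx
    # Intercept: the fitted line passes through the point of means.
    beta0 = mean_y - beta1 * mean_x
    return beta0, beta1


# Made-up training data lying exactly on the line y = 2 + 3x.
xs = [0.0, 1.0, 2.0, 3.0]
ys = [2.0, 5.0, 8.0, 11.0]
beta0, beta1 = fit_univariate(xs, ys)
print(beta0, beta1)
```

Because the toy points lie exactly on a line, the fit recovers an intercept of 2 and a slope of 3; real data would leave some residual error.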
Regression - Example
• Let us use the social network data set to predict the height of new friends
• The “weight” is the predictive attribute
• The “height” is the target attribute
• The data in the table is the training set
• The predictive model is a simple linear regression model:
height = 128.017 + 0.611 × weight
• Let us predict the height of our new friends Omar and Patricia, whose weights are 91 and 58
• The predicted height of Omar will be:
height = 128.017 + 0.611 × 91 = 183.618 cm
• Can you find the height of Patricia using the same regression model?
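The same calculation can be sketched in a few lines of Python. The coefficients come from the model above; the names `BETA_0`, `BETA_1` and `predict_height` are illustrative choices only.

```python
# Coefficients of the induced model: height = 128.017 + 0.611 * weight
BETA_0 = 128.017  # intercept
BETA_1 = 0.611    # coefficient of the predictive attribute (weight)


def predict_height(weight):
    """Predict height in cm from weight using the univariate linear model."""
    return BETA_0 + BETA_1 * weight


omar_height = predict_height(91)
print(f"Omar: {omar_height:.3f} cm")
```

Calling `predict_height(58)` in the same way gives the prediction for Patricia.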
Multivariate Linear Regression
• The linear model can be generalized to any number p of predictive attributes
• A multivariate regression model can be expressed as:
$\hat{y} = \beta_0 + \sum_{j=1}^{p} \beta_j \times x_j$
The diet score feature determines how closely it was associated with actual measurements of BMI
• What are the predictive attributes?
• What are the possible values for each attribute?
• What is the target attribute?
Multivariate Linear Regression - Example
• The regression mathematical function:
BMI = 18.0 + (1.5 (diet score) + 1.6 (male) + 4.2 (age>20))
• What is the BMI for the following new, unseen test instance: (J, 3, 1, 1)?
• Answer:
BMI = 18.0 + (1.5 × 3 + 1.6 × 1 + 4.2 × 1)
BMI = 28.3
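The worked answer can be checked with a short sketch of a general multivariate linear model: an intercept plus a weighted sum of the predictive attributes. The coefficients are those of the BMI function above; the function name `predict` and the ordering of the attributes (diet score, male, age>20) are assumptions made for this illustration.

```python
def predict(beta0, betas, xs):
    """Multivariate linear model: beta0 + sum of beta_j * x_j."""
    return beta0 + sum(b * x for b, x in zip(betas, xs))


BMI_INTERCEPT = 18.0
BMI_COEFFICIENTS = [1.5, 1.6, 4.2]  # diet score, male, age>20

# Test instance J: diet score = 3, male = 1, age>20 = 1
bmi_j = predict(BMI_INTERCEPT, BMI_COEFFICIENTS, [3, 1, 1])
print(round(bmi_j, 1))
```

Passing a different attribute list to `predict` gives the prediction for any other test instance.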
Class Activity
• Use the same regression model:
BMI = 18.0 + (1.5 (diet score) + 1.6 (male) + 4.2 (age>20))
to predict the BMI for the following new, unseen test instance:
(L, 3, 0, 0)
Advantages and disadvantages of linear regression
Advantages
• Strong mathematical foundation
• Easily interpretable
Disadvantages
• Poor fit if the relationship between predictive attributes and target is non-linear
• The number of instances must be larger than the number of attributes
• Sensitive to outliers
Predictive Performance Measures for Regression
• In the social network example, the regression model was induced from the 14-record training set
• The predicted heights for Omar and Patricia were 183.618 and 163.455 cm, respectively
• The real, measured heights of Omar and Patricia are 176 and 168 cm, respectively
• The real heights therefore differ from the predicted values: there are errors in our predictions!
Predictive Performance Measures for Regression
• It is important to measure and assess the quality of a regression model:
• Models that produce an unacceptable level of mispredictions must be
discarded
• The quality of the induced model is obtained by comparing the predicted values $\hat{y}_i$ with the respective true values $y_i$ on the given test set S.
• Various performance measures can be set up:
• Mean absolute error
• Mean square error
• Root mean square error
• Relative mean square error (RelMSE)
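To make the first three measures concrete, the sketch below applies their standard definitions to the two test instances from the social network example (Omar and Patricia). The function names are illustrative, and RelMSE is left out because its exact definition is not given here.

```python
import math


def mean_absolute_error(y_true, y_pred):
    """Average of the absolute differences between true and predicted values."""
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)


def mean_square_error(y_true, y_pred):
    """Average of the squared differences between true and predicted values."""
    return sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true)


def root_mean_square_error(y_true, y_pred):
    """Square root of the MSE, in the same unit as the target (cm here)."""
    return math.sqrt(mean_square_error(y_true, y_pred))


# Real and predicted heights (cm) of Omar and Patricia
y_true = [176.0, 168.0]
y_pred = [183.618, 163.455]

mae = mean_absolute_error(y_true, y_pred)
mse = mean_square_error(y_true, y_pred)
rmse = root_mean_square_error(y_true, y_pred)
print(f"MAE = {mae:.4f} cm, MSE = {mse:.4f} cm^2, RMSE = {rmse:.4f} cm")
```

MAE and RMSE are in centimetres, which makes them easier to interpret than the MSE, whose unit is squared centimetres.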
Predictive Performance Measures for Regression
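Mean absolute error (MAE): $\mathrm{MAE} = \frac{1}{n} \sum_{i=1}^{n} |y_i - \hat{y}_i|$
Mean square error (MSE): $\mathrm{MSE} = \frac{1}{n} \sum_{i=1}^{n} (y_i - \hat{y}_i)^2$
where n is the number of instances in the test set S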
• The MSE
Class Activity