
19CSCN1602 – Machine Intelligence

Unit 3 – Session 1

Unit 3 : Regression and Classification

Topic : Supervised and Unsupervised learning

Prabhu K, AP(SS)/CSE
Unit III

Regression and Classification

Supervised and Unsupervised learning – Classification and Regression – Assessing model accuracy – Simple Linear regression – Multiple linear regression – Qualitative predictors – Extensions of the linear Model – Logistic regression – Naïve Bayes classifier.
CO: Utilize regression and classification algorithms for data modeling and prediction

LO: Develop a data model using a classification technique

SO: Compare supervised and unsupervised learning

Topic: Supervised and Unsupervised learning


Supervised and Unsupervised learning
Supervised learning

• Supervised learning deals with, or learns from, "labeled" data

• Supervised learning, as the name indicates, involves the presence of a supervisor acting as a teacher

• It teaches or trains the machine using data that is well labeled

• The data is already tagged with the correct answer

• When the machine is provided with a new set of examples (data), the supervised learning algorithm analyses the training data (the set of training examples) and produces a correct outcome from the labeled data.
• Given a basket filled with different kinds of fruits
• The first step is to train the machine with all the different fruits one by one
• If the shape of the object is rounded with a depression at the top and it is red in color, then it will be labeled as "Apple"
• If the shape of the object is a long curving cylinder with a green-yellow color, then it will be labeled as "Banana"
• After training, when given a new fruit from the basket, say a banana, the machine is asked to identify it
• It will first classify the fruit by its shape and color and then confirm the fruit name as banana
• The machine learns from the training data (basket containing fruits) and then applies that knowledge to the test data (the new fruit).
Supervised learning
• Supervised learning is classified into two categories of algorithms:
• Classification: A classification problem is when the output variable
is a category, such as “Red” or “blue” or “disease” and “no disease”.
• Regression: A regression problem is when the output variable is a
real value, such as “dollars” or “weight”.
• Types:-
• Regression
• Logistic Regression
• Classification
• Naive Bayes Classifiers
• K-NN (k nearest neighbors)
• Decision Trees
• Support Vector Machine
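As a concrete illustration of this workflow, here is a minimal supervised-learning sketch, assuming scikit-learn is available; the fruit feature values are invented for illustration.

```python
# Minimal supervised-learning sketch (hypothetical fruit features; scikit-learn assumed).
from sklearn.neighbors import KNeighborsClassifier

# Features: [roundness 0-1, redness 0-1]; the labels are the "correct answers" (tags).
X_train = [[0.90, 0.80], [0.95, 0.90], [0.20, 0.10], [0.15, 0.05]]
y_train = ["apple", "apple", "banana", "banana"]

model = KNeighborsClassifier(n_neighbors=3)
model.fit(X_train, y_train)           # train the machine on labeled data

print(model.predict([[0.25, 0.10]]))  # a new fruit -> ['banana']
```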
Supervised learning
• Advantages
• Supervised learning allows collecting data and produces data output from previous experiences.
• Helps to optimize performance criteria with the help of experience.
• Supervised machine learning helps to solve various types of real-world computation problems.

• Disadvantages
• Classifying big data can be challenging.
• Training for supervised learning needs a lot of computation, so it requires a lot of time.
Unsupervised learning

• Unsupervised machine learning is the training of models on raw, unlabeled training data

• The task of the machine is to group unsorted information according to similarities, patterns, and differences without any prior training on the data

• Unlike supervised learning, no teacher is provided, which means no training will be given to the machine

• It allows the model to work on its own to discover patterns and information that were previously undetected

• It mainly deals with unlabeled data.


Unsupervised learning
• Given an image having both dogs and cats, which the machine has never seen

• The machine has no idea about the features of dogs and cats, so it can't categorize them as dogs and cats

• But it can categorize them according to their similarities, patterns, and differences

• We can categorize the picture into two parts

• The first part may contain all pics having dogs in them, and the second part may contain all pics having cats in them
Unsupervised learning

• Unsupervised learning is classified into two categories of algorithms:

• Clustering: A clustering problem is where you want to discover the


inherent groupings in the data, such as grouping customers by
purchasing behavior.
• Association: An association rule learning problem is where you want
to discover rules that describe large portions of your data, such as
people that buy X also tend to buy Y.
Unsupervised learning
• Types of Unsupervised Learning
• Clustering types
• Hierarchical clustering
• K-means clustering
• Dimensionality reduction techniques
• Principal Component Analysis
• Singular Value Decomposition
• Independent Component Analysis
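The clustering side of this can be sketched in a few lines; the snippet below assumes scikit-learn and uses made-up 2-D points, with no labels provided.

```python
# Minimal unsupervised-learning sketch: grouping unlabeled points (scikit-learn assumed).
import numpy as np
from sklearn.cluster import KMeans

# Unlabeled 2-D points; no "correct answers" are given to the algorithm.
X = np.array([[1.0, 1.1], [0.9, 1.0], [1.2, 0.8],
              [8.0, 8.2], [7.9, 8.1], [8.3, 7.9]])

kmeans = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X)
print(kmeans.labels_)  # e.g. [0 0 0 1 1 1]: two groups found from similarity alone
```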
Supervised vs. Unsupervised Machine Learning

Parameters                  Supervised machine learning           Unsupervised machine learning
Input Data                  Algorithms are trained using          Algorithms are used against data
                            labeled data                          that is not labeled
Computational Complexity    Simpler method                        Computationally complex
Accuracy                    Highly accurate                       Less accurate
Use                         New data is classified based on       Given a set of measurements,
                            the training set                      observations, etc., the aim is to
                                                                  establish the existence of classes
                                                                  or clusters in the data
Thank you
19CSCN1602 – Machine Intelligence
Unit 3 – Session 1 – Part 2

Unit 3 : Regression and Classification

Topic : Classification and Regression


Regression Versus Classification Problems

• Variables can be characterized as either quantitative or qualitative

• Quantitative variables take on numerical values.

• Example

• person’s age, height, or income

• the value of a house

• the price of a stock


Regression Versus Classification Problems

• qualitative variables take on values in one of K different classes, or


categories
• Example
• person’s gender (male or female)
• the brand of product purchased (brand A, B, or C)
• whether a person defaults on a debt (yes or no)
• cancer diagnosis (Acute Myelogenous Leukemia, Acute Lymphoblastic
Leukemia, or No Leukemia)
• Problems with a quantitative response are referred to as regression problems
• Problems with a qualitative response are often referred to as classification problems
Classification—A Two-Step Process

• Model construction: describing a set of predetermined classes

• Each tuple/sample is assumed to belong to a predefined class, as


determined by the class label attribute
• The set of tuples used for model construction is training set

• The model is represented as classification rules, decision trees, or


mathematical formulae
• Supervised Learning
Classification – A Two Step Process

• Model usage: for classifying future or unknown objects

• Estimate accuracy of the model

• The known label of test sample is compared with the classified


result from the model
• Accuracy rate is the percentage of test set samples that are
correctly classified by the model
• Test set is independent of training set, otherwise over-fitting will
occur
• If the accuracy is acceptable, use the model to classify data tuples
whose class labels are not known
Model Construction

Training Data → Classification Algorithms → Classifier (Model)

NAME   RANK             YEARS   TENURED
Mike   Assistant Prof   3       no
Mary   Assistant Prof   7       yes
Bill   Professor        2       yes
Jim    Associate Prof   7       yes
Dave   Assistant Prof   6       no
Anne   Associate Prof   3       no

Learned rule: IF rank = 'professor' OR years > 6 THEN tenured = 'yes'
Using the Model in Prediction

Testing Data → Classifier → prediction for Unseen Data

NAME     RANK             YEARS   TENURED
Tom      Assistant Prof   2       no
George   Professor        5       yes

Unseen data: (Jeff, Professor, 4) → Tenured?
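The two-step process can be reproduced on the tenure table above; this is a sketch assuming scikit-learn, with ranks encoded as integers for the classifier.

```python
# Sketch of the two-step process on the tenure table (scikit-learn assumed).
from sklearn.tree import DecisionTreeClassifier

rank_code = {"Assistant Prof": 0, "Associate Prof": 1, "Professor": 2}

# Step 1: model construction from the training set (rank, years -> tenured).
X_train = [[0, 3], [0, 7], [2, 2], [1, 7], [0, 6], [1, 3]]
y_train = ["no", "yes", "yes", "yes", "no", "no"]
clf = DecisionTreeClassifier(random_state=0).fit(X_train, y_train)

# Step 2: model usage on unseen data, e.g. (Jeff, Professor, 4).
print(clf.predict([[rank_code["Professor"], 4]]))  # expected: ['yes']
```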
Thank you
Unit III

Regression and Classification

Supervised and Unsupervised learning – Classification and Regression – Assessing model accuracy – Simple Linear regression – Multiple linear regression – Qualitative predictors – Extensions of the linear Model – Logistic regression – Naïve Bayes classifier.
Assessing Model Accuracy
Measuring the Quality of Fit

• Evaluate the performance of a statistical learning method on a given data set


• Mean squared error (MSE):

  MSE = (1/n) Σᵢ₌₁ⁿ (yᵢ − f̂(xᵢ))²

• where f̂(xᵢ) is the prediction that f̂ gives for the ith observation

• The MSE will be


• Small : predicted responses are very close to the true responses
• Large: difference between predicted responses and true responses is
high
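MSE is straightforward to compute directly from the definition; a small sketch assuming NumPy, with toy numbers.

```python
# Computing MSE for a fitted model's predictions (NumPy assumed; toy numbers).
import numpy as np

y_true = np.array([3.0, 5.0, 7.5, 9.0])   # observed responses y_i
y_pred = np.array([2.8, 5.3, 7.1, 9.4])   # model predictions f_hat(x_i)

mse = np.mean((y_true - y_pred) ** 2)
print(mse)  # small value -> predicted responses close to the true responses
```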
Measuring the Quality of Fit
• Training MSE :The MSE calculated using the training data used to fit the
model

• Test MSE : The MSE calculated using the test data

• The average squared prediction error on test data is given by

  Ave( (y₀ − f̂(x₀))² )

• where (x₀, y₀) is a previously unseen test observation

• Overfitting : training MSE is low but test MSE is high


Bias-Variance Trade-Off

• Bias is the difference between the expected value and the true value of the
predicted response.

• Variance is the squared deviation of a random variable from its mean and it
measures how far a set of numbers are spread out from their mean.

• The expected test MSE for a given value x₀ is the sum of the variance of f̂(x₀), the squared bias of f̂(x₀), and the variance of the error term ε:

  E(y₀ − f̂(x₀))² = Var(f̂(x₀)) + [Bias(f̂(x₀))]² + Var(ε)

• where E(y₀ − f̂(x₀))² is the average test MSE obtained if f were repeatedly estimated using a large number of training sets, with each estimate tested at x₀.
Bias-Variance Trade-Off
• The overall expected test MSE is computed by averaging E(y₀ − f̂(x₀))² over all possible values of x₀ in the test data set

• In order to minimize the expected test error, a learning method with low
variance and low bias needs to be achieved

• The test MSE can never go down below the variance of the irreducible error
term.

• Bias refers to the error that is introduced by approximating a complicated


real-life problem by a simple model. Ex: Linear regression

• In more flexible methods the variance increases while the bias reduces
Bias-Variance Trade-Off

• A good performing model requires low variance and low squared

bias
• This relationship between bias, variance and test MSE is called as

the bias-variance trade-off.


• This is called trade-off as it is easy to obtain a method with

extremely low bias but high variance or a method with low

variance yet high bias.


• Finding a method for which both variance and squared bias are low is quite a challenge for a data scientist.
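One way to see the trade-off concretely is to simulate it. The sketch below (NumPy assumed, fully synthetic setup) estimates the bias and variance of a linear fit at a point x₀ by drawing many training sets from a known truth f.

```python
# Sketch: estimating bias and variance of f_hat(x0) by simulating many training
# sets from a known truth f (NumPy assumed; the setup is illustrative).
import numpy as np

rng = np.random.default_rng(0)
f = lambda x: np.sin(x)            # the "true" f, known here because we simulate
x0, sigma, n, reps = 1.0, 0.3, 30, 500

preds = []
for _ in range(reps):
    x = rng.uniform(0, 3, n)
    y = f(x) + rng.normal(0, sigma, n)     # y = f(x) + eps
    b1, b0 = np.polyfit(x, y, 1)           # fit a simple linear model
    preds.append(b0 + b1 * x0)             # f_hat(x0) for this training set

preds = np.array(preds)
bias2 = (preds.mean() - f(x0)) ** 2        # squared bias of f_hat(x0)
var = preds.var()                          # variance of f_hat(x0)
print(bias2, var, bias2 + var + sigma**2)  # last value ~ expected test MSE at x0
```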


Bias-Variance Trade-Off

• In a classification problem, the most common approach for quantifying the accuracy of the prediction is the training error rate
• Formula for the training error rate:

  (1/n) Σᵢ₌₁ⁿ I(yᵢ ≠ ŷᵢ)

• where ŷᵢ is the predicted class label for the ith observation using f̂
• I(yᵢ ≠ ŷᵢ) is an indicator variable that equals 1 if yᵢ ≠ ŷᵢ and 0 if yᵢ = ŷᵢ
• Formula for the test error rate:

  Ave( I(y₀ ≠ ŷ₀) )

• where ŷ₀ is the predicted class label from the classifier applied to the test observation with predictor x₀.


Bayes Classifier
• Assign a test observation with predictor vector x0 to the class j for which

Pr(Y = j|X = x0) is largest.

• This is called conditional probability which means it’s the probability of


Y = j for a given predictor vector x0.

• This simple classifier is called as Bayes classifier

• If in a problem there are two possible response values, 'a' and 'b',
• the Bayes classifier corresponds to predicting class 'a' if Pr(Y = a | X = x₀) > 0.5, and class 'b' otherwise.

• Bayes error rate which is given by

• 1 – E(maxj Pr(Y = j|X))

• Expected value averages the probability over all the possible values of X
K-Nearest Neighbors (KNN)
• The K in the KNN is any positive integer.

• For a test observation x0, the KNN classifier first identifies the K

points in the training data that are closest to x0.

• then estimates the conditional probability for class j as the fraction of the points in N₀ whose response values equal j:

  Pr(Y = j | X = x₀) = (1/K) Σ_{i∈N₀} I(yᵢ = j)

• If K = 1, the training error rate is 0 but the test error rate may be
quite high

• Low K values give more flexible methods and the flexibility decreases
with increasing K.
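The effect of K can be checked directly; a sketch assuming scikit-learn, using a synthetic two-class data set.

```python
# KNN sketch: vary K to trade flexibility for stability (scikit-learn assumed).
from sklearn.datasets import make_moons
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

X, y = make_moons(n_samples=200, noise=0.3, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

for k in (1, 5, 25):
    knn = KNeighborsClassifier(n_neighbors=k).fit(X_tr, y_tr)
    # K=1 gives training error 0 but typically a higher test error (overfitting)
    print(k, knn.score(X_tr, y_tr), knn.score(X_te, y_te))
```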
Bias-Variance Trade-Off
• Underfitting:
• a model unable to capture the underlying pattern of the data
• happens when we have too little data to build an accurate model
• when trying to build a linear model with a nonlinear data.
• These models have high bias and low variance
• Overfitting:
• model captures the noise along with the underlying pattern in data
• happens when we train our model extensively on a noisy dataset.
• These models have low bias and high variance.
• These models are very complex like Decision trees which are prone to
overfitting.
Bias-Variance Trade-Off

• high bias and low variance

• If the model is too simple

• has very few parameters

• high variance and low bias

• if the model has large number of parameters

• So we need to find the right/good balance without overfitting and


underfitting the data.

• An algorithm can’t be more complex and less complex at the same time
Unit III

Regression and Classification

Supervised and Unsupervised learning – Classification and Regression – Assessing model accuracy – Simple Linear regression – Multiple linear regression – Qualitative predictors – Extensions of the linear Model – Logistic regression – Naïve Bayes classifier.
Simple Linear regression
Linear regression

• Linear regression is a useful tool for predicting a quantitative


response
• Advertising data :sales (in thousands of units) for a particular
product as a function of advertising budgets (in thousands of
dollars) for TV, radio, and newspaper media.
Linear regression

• Suggest a marketing plan for next year that will result in high product
sales.
• What information would be useful in order to provide such a
recommendation?
• Is there a relationship between advertising budget and sales?
• How strong is the relationship between advertising budget and
sales?
• Which media contribute to sales?
• How accurately can we estimate the effect of each medium on sales?
• How accurately can we predict future sales?
• Is the relationship linear?
• Is there synergy among the advertising media?
Simple Linear Regression
• It is a very straightforward approach for predicting a quantitative
response Y on the basis of a single predictor variable X.
• It assumes that there is approximately a linear relationship between X and
Y.
• Mathematically, we can write this linear relationship as

• Y≈β0+ β1X

• Eg: X may represent TV advertising and Y may represent sales.


• regress sales onto TV by fitting the model
• sales≈β0+ β1* TV

• β0 and β1 are two unknown constants that represent the intercept and
slope terms in the linear model

• β0 and β1 are known as the model coefficients or parameters


Estimating the Coefficients
• Predict future sales on the basis of a particular value of TV advertising by computing

  ŷ = β̂₀ + β̂₁x

• where ŷ indicates a prediction of Y on the basis of X = x
• Let ŷᵢ = β̂₀ + β̂₁xᵢ be the prediction for Y based on the ith value of X
• eᵢ = yᵢ − ŷᵢ is the ith residual: the difference between the ith observed response value and the ith response value predicted by the linear model
• The residual sum of squares (RSS) is given by

  RSS = e₁² + e₂² + ... + eₙ²

• which is equivalent to

  RSS = (y₁ − β̂₀ − β̂₁x₁)² + (y₂ − β̂₀ − β̂₁x₂)² + ... + (yₙ − β̂₀ − β̂₁xₙ)²

Contour and three-dimensional plots of the RSS on the Advertising data, using sales
as the response and TV as the predictor.
Estimating the Coefficients

• The least squares coefficient estimates for simple linear regression are given by

  β̂₁ = Σᵢ (xᵢ − x̄)(yᵢ − ȳ) / Σᵢ (xᵢ − x̄)²
  β̂₀ = ȳ − β̂₁x̄

• where x̄ and ȳ are the sample means
• For the Advertising data, β̂₀ = 7.03 and β̂₁ = 0.0475, so ŷ = 7.03 + 0.0475x
• An additional $1,000 spent on TV advertising is associated with selling approximately 47.5 additional units of the product
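These closed-form estimates are easy to compute by hand; a sketch assuming NumPy, with toy numbers standing in for the Advertising data.

```python
# Closed-form least squares for simple linear regression (NumPy assumed;
# toy data standing in for the Advertising set).
import numpy as np

x = np.array([10.0, 44.5, 17.2, 151.5, 180.8])   # e.g. TV budget (thousands $)
y = np.array([7.2, 10.4, 9.3, 18.5, 12.9])       # e.g. sales (thousands of units)

b1 = np.sum((x - x.mean()) * (y - y.mean())) / np.sum((x - x.mean()) ** 2)
b0 = y.mean() - b1 * x.mean()

print(b0, b1)              # estimated intercept and slope
print(b0 + b1 * 100.0)     # predicted sales for a budget of 100
```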
Assessing the Accuracy of the Coefficient
Estimates
• True relationship between X and Y takes the form
• Y = f(X) +ϵ for some unknown function f
• Where ϵ is a mean-zero random error term

• If f is to be approximated by a linear function the population regression line is

given by : Y = β0 + β1X + ϵ
• population regression line is the best linear approximation to the true
relationship between X and Y
• β0 is the intercept term—that is, the expected value of Y when X = 0

• β1 is the slope—the average increase in Y associated with a one-unit increase

in X
• ϵ is a catch-all for measurement error and other discrepancies; the error term is assumed to be independent of X
• The population regression line is the best linear fit to the entire population of data
• The least squares line is the line that minimizes the RSS on the observed sample

Left panel: the red line represents the true relationship f(X) = 2 + 3X, known as the population regression line; the blue line is the least squares line, the least squares estimate for f(X) based on the observed data, shown in black. Right panel: the population regression line is again shown in red and the least squares line in dark blue; in light blue, ten least squares lines are shown, each computed on the basis of a separate random set of observations.
Assessing the Accuracy of the Coefficient
Estimates

• The population mean μ of some random variable Y is estimated by

  μ̂ = ȳ = (1/n) Σᵢ₌₁ⁿ yᵢ

• where ȳ is the sample mean
• The sample mean will provide a good estimate of the population mean
• The unknown coefficients β₀ and β₁ in linear regression define the population regression line.
Assessing the Accuracy of the Coefficient
Estimates

• On the basis of one particular set of observations y₁, ..., yₙ, μ̂ might overestimate μ
• On the basis of another set of observations, μ̂ might underestimate μ
• But averaging a huge number of estimates of μ, obtained from a huge number of sets of observations, would give exactly μ
• The average of the μ̂'s over many data sets will be very close to μ
• A single estimate μ̂ may be a substantial underestimate or overestimate of μ
• How far off will that single estimate μ̂ be?
• This is answered by computing the standard error of μ̂:

  Var(μ̂) = SE(μ̂)² = σ²/n

• where σ is the standard deviation of each of the realizations yi of Y


• Standard error: the average amount that this estimate μ̂ differs from the actual value of μ
• The standard error shrinks with n: the more observations we have, the smaller the standard error of μ̂
• Formulas to compute the standard errors associated with β̂₀ and β̂₁:

  SE(β̂₀)² = σ² [ 1/n + x̄² / Σᵢ (xᵢ − x̄)² ]
  SE(β̂₁)² = σ² / Σᵢ (xᵢ − x̄)²

  where σ² = Var(ϵ)
• The estimate of σ is known as the residual standard error: RSE = √( RSS / (n − 2) )
• Standard errors can be used to compute confidence intervals
• The 95% confidence interval for β₁ is approximately β̂₁ ± 2·SE(β̂₁)
• The 95% confidence interval for β₀ is approximately β̂₀ ± 2·SE(β̂₀)

• In the case of the advertising data,


• 95 % confidence interval for β0 is [6.130, 7.935]
• 95 % confidence interval for β1 is [0.042, 0.053].
• Standard errors can also be used to perform hypothesis tests
on the hypothesis coefficients
• The most common hypothesis test involves testing the null
hypothesis of
• H0 : There is no relationship between X and Y
• Ha : There is some relationship between X and Y .
• H0 : β1 = 0 , Ha : β1 ≠0
• The t-statistic measures the number of standard deviations that β̂₁ is away from 0:

  t = (β̂₁ − 0) / SE(β̂₁)

• the p-value :probability of observing any value equal to |t| or larger, assuming
β1 = 0
• small p-value : association between the predictor and the response
• reject the null hypothesis
• Typical p-value cutoffs for rejecting the null hypothesis are 5% or 1%
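The standard error, confidence interval, and t-statistic can all be computed from the formulas above; a sketch assuming NumPy, reusing the toy data from the earlier snippet.

```python
# Sketch: standard error, approximate 95% CI, and t-statistic for the slope
# of a simple linear regression (NumPy assumed; toy data as before).
import numpy as np

x = np.array([10.0, 44.5, 17.2, 151.5, 180.8])
y = np.array([7.2, 10.4, 9.3, 18.5, 12.9])
b1 = np.sum((x - x.mean()) * (y - y.mean())) / np.sum((x - x.mean()) ** 2)
b0 = y.mean() - b1 * x.mean()

resid = y - (b0 + b1 * x)
n = len(x)
rse = np.sqrt(np.sum(resid ** 2) / (n - 2))          # estimate of sigma
se_b1 = rse / np.sqrt(np.sum((x - x.mean()) ** 2))   # SE(beta1_hat)

print(b1 - 2 * se_b1, b1 + 2 * se_b1)  # approximate 95% CI for beta1
print(b1 / se_b1)                      # t-statistic for H0: beta1 = 0
```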
Assessing the Accuracy of the Model
• The quality of a linear regression fit is typically assessed using two
related quantities:
• residual standard error (RSE)
• R2 statistic.
• Residual Standard Error(RSE)
• it is the average amount that the response will deviate from the true
regression line
• The RSE is considered a measure of the lack of fit of the model to data
• If RSE small: the model fits the data very well
• If RSE large: the model doesn’t fit the data well.
  RSE = √( (1/(n − 2)) · RSS ),   where RSS = Σᵢ₌₁ⁿ (yᵢ − ŷᵢ)²
Assessing the Accuracy of the Model
• The R² statistic is a measure of the linear relationship between X and Y:

  R² = 1 − RSS/TSS,   where RSS = Σᵢ₌₁ⁿ (yᵢ − ŷᵢ)²

• and TSS = Σᵢ (yᵢ − ȳ)² is the total sum of squares
• R² lies between 0 and 1
• How good a given R² value is will depend on the application
Assessing the Accuracy of the Model

• Correlation is also a measure of the linear relationship between X and Y


• Use r = Cor(X, Y) instead of R² in order to assess the fit of the linear model
• R² = r²: in simple linear regression, the squared correlation and the R² statistic are identical

  Cor(X, Y) = Σᵢ (xᵢ − x̄)(yᵢ − ȳ) / ( √(Σᵢ (xᵢ − x̄)²) · √(Σᵢ (yᵢ − ȳ)²) )
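The identity R² = r² can be verified numerically; a sketch assuming NumPy, on the same toy data.

```python
# Sketch: verify R^2 = r^2 for simple linear regression (NumPy assumed).
import numpy as np

x = np.array([10.0, 44.5, 17.2, 151.5, 180.8])
y = np.array([7.2, 10.4, 9.3, 18.5, 12.9])
b1 = np.sum((x - x.mean()) * (y - y.mean())) / np.sum((x - x.mean()) ** 2)
b0 = y.mean() - b1 * x.mean()

rss = np.sum((y - (b0 + b1 * x)) ** 2)
tss = np.sum((y - y.mean()) ** 2)
r = np.corrcoef(x, y)[0, 1]

print(1 - rss / tss, r ** 2)  # the two values agree
```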
Unit III

Regression and Classification

Supervised and Unsupervised learning – Classification and Regression – Assessing model accuracy – Simple Linear regression – Multiple linear regression – Qualitative predictors – Extensions of the linear Model – Logistic regression – Naïve Bayes classifier.
Motivation for multiple linear regression
Multiple Linear Regression

• With p distinct predictors, the model takes the form

  Y = β₀ + β₁X₁ + β₂X₂ + ... + βₚXₚ + ϵ

• where Xⱼ represents the jth predictor and βⱼ quantifies the association between that variable and the response
• For the Advertising data: sales = β₀ + β₁ × TV + β₂ × radio + β₃ × newspaper + ϵ


Estimating the Regression Coefficients

• Predictions take the form ŷ = β̂₀ + β̂₁x₁ + β̂₂x₂ + ... + β̂ₚxₚ
• The coefficients are chosen to minimize

  RSS = Σᵢ (yᵢ − β̂₀ − β̂₁xᵢ₁ − β̂₂xᵢ₂ − ... − β̂ₚxᵢₚ)²

(Table: for the Advertising data, least squares coefficient estimates of the multiple linear regression of number of units sold on radio, TV, and newspaper advertising budgets)
Is There a Relationship Between the Response and
Predictors?

• Test the null hypothesis

• H0 : β1 = β2 = ··· = βp = 0

• versus the alternative

• Ha : at least one βj is non-zero.

• Hypothesis test is performed by computing the F-statistic


  F = ( (TSS − RSS) / p ) / ( RSS / (n − p − 1) )

• where TSS = Σᵢ (yᵢ − ȳ)² and RSS = Σᵢ (yᵢ − ŷᵢ)²
• An F-statistic value close to 1 suggests H₀ is true
• An F-statistic much greater than 1 provides evidence for Hₐ
Is There a Relationship Between the Response
and Predictors?

• How large does the F-statistic need to be before we can reject H0 and
conclude that there is a relationship?

• the answer depends on the values of n and p

• When n is large, an F-statistic only a little larger than 1 may be enough to reject H₀
• When n is small, a larger F-statistic is needed to reject H₀
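The F-statistic is simple to compute once a multiple regression is fit; a sketch assuming NumPy and scikit-learn, on simulated data with p = 3 predictors.

```python
# Sketch: fitting a multiple regression and computing the F-statistic by hand
# (NumPy/scikit-learn assumed; random stand-in data, p = 3 predictors).
import numpy as np
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(0)
n, p = 100, 3
X = rng.normal(size=(n, p))                    # e.g. TV, radio, newspaper
y = 3.0 + 0.5 * X[:, 0] + 0.2 * X[:, 1] + rng.normal(size=n)

model = LinearRegression().fit(X, y)
y_hat = model.predict(X)

rss = np.sum((y - y_hat) ** 2)
tss = np.sum((y - y.mean()) ** 2)
F = ((tss - rss) / p) / (rss / (n - p - 1))
print(F)  # far above 1 here -> evidence against H0: all slopes zero
```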


Deciding on Important Variables
• The task of determining which predictors are associated with the response, in
order to fit a single model involving only those predictors, is referred to as
variable selection.
• Try out a lot of different models, each containing a different subset of the
predictors.
• Ex: if p = 2 : (1) a model containing no variables (2) a model containing X1 only
(3) a model containing X2 only (4) a model containing both X1 and X2
• Various statistics can be used to judge the quality of a model
• Mallow’s Cp, Akaike information criterion(AIC), Bayesian information criterion
(BIC), and adjusted R2.
• Problem: if p = 30, we would have to consider 2³⁰ = 1,073,741,824 models, which is not practical
• Automated and efficient approach to choose a smaller set of models to consider
Deciding on Important Variables
• Forward selection

• begin with the null model —a model that contains an intercept but no
predictors
• then fit p simple linear regressions and add to the null model the
variable that results in the lowest RSS
• then add to that model the variable that results in the lowest RSS for
the new two-variable model
• This approach is continued until some stopping rule is satisfied.
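Forward selection as described above can be sketched in a few lines; the snippet assumes NumPy/scikit-learn, selects by RSS on simulated data, and uses a fixed model size as the stopping rule.

```python
# Sketch of forward selection by RSS (NumPy/scikit-learn assumed; simulated data).
import numpy as np
from sklearn.linear_model import LinearRegression

def rss_of(features, X, y):
    if not features:                      # null model: intercept only
        return np.sum((y - y.mean()) ** 2)
    m = LinearRegression().fit(X[:, features], y)
    return np.sum((y - m.predict(X[:, features])) ** 2)

rng = np.random.default_rng(0)
n, p = 100, 3
X = rng.normal(size=(n, p))
y = 3.0 + 0.5 * X[:, 0] + 0.2 * X[:, 1] + rng.normal(size=n)

chosen, remaining = [], list(range(p))
for _ in range(2):                         # stopping rule: keep 2 variables
    best = min(remaining, key=lambda j: rss_of(chosen + [j], X, y))
    chosen.append(best)
    remaining.remove(best)
    print(chosen, rss_of(chosen, X, y))    # RSS drops as variables are added
```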
Deciding on Important Variables
• Backward selection
• start with all variables in the model
• remove the variable with the largest p-value
• The new (p − 1)-variable model is fit
• stop when all remaining variables have a p-value below some threshold

• Mixed selection

• combination of forward and backward selection

• start with no variables in the model and as with forward selection, add the
variable that provides the best fit

• continue to add variables one-by-one

• if at any point the p-value for one of the variables in the model rises above a
certain threshold, then remove that variable from the model
Model Fit
• Numerical measures of model fit are the RSE and R2
Predictions

• Once we have fit the multiple regression model

• Apply in order to predict the response Y on the basis of a set of values for
the predictors X1, X2,...,Xp
Unit III

Regression and Classification

Supervised and Unsupervised learning – Classification and Regression – Assessing model accuracy – Simple Linear regression – Multiple linear regression – Qualitative predictors – Extensions of the linear Model – Logistic regression – Naïve Bayes classifier.
Qualitative predictors

The Credit data set contains


information about balance,
age, cards, education,
income, limit, and rating for
a number of potential
customers.

• quantitative predictors: age,


cards, education, income, limit,
rating
• qualitative variables: gender,
student , status , ethnicity
Predictors with Only Two Levels

• Qualitative predictor: the gender variable, with values male and female
• To incorporate it into a regression model, create an indicator or dummy variable that takes on two possible numerical values
• Example:

  xᵢ = 1 if the ith person is female, 0 if the ith person is male

• Use this variable as a predictor in the regression equation

Predictors with Only Two Levels

• The resulting model is

  yᵢ = β₀ + β₁xᵢ + εᵢ = β₀ + β₁ + εᵢ if the ith person is female, and β₀ + εᵢ if male

• β₀: average credit card balance among males
• β₀ + β₁: average credit card balance among females
• β₁: average difference in credit card balance between females and males

(Table: least squares coefficient estimates associated with the regression of balance onto gender in the Credit data set)
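Dummy encoding of a two-level predictor is mechanical; a sketch assuming pandas and scikit-learn, with made-up balance data. Note how the fitted intercept recovers the male group mean.

```python
# Sketch: encoding a two-level qualitative predictor as a dummy variable
# (pandas/scikit-learn assumed; made-up balance data).
import pandas as pd
from sklearn.linear_model import LinearRegression

df = pd.DataFrame({
    "gender": ["male", "female", "female", "male", "female", "male"],
    "balance": [480.0, 530.0, 515.0, 470.0, 525.0, 495.0],
})
df["x"] = (df["gender"] == "female").astype(int)   # 1 = female, 0 = male

m = LinearRegression().fit(df[["x"]], df["balance"])
print(m.intercept_)                # beta0: average balance among males
print(m.intercept_ + m.coef_[0])   # beta0 + beta1: average among females
```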
Qualitative Predictors with More than Two
Levels
• Qualitative predictor variable ethnicity, with values Asian, African, and American
• Create two dummy variables:

  xᵢ₁ = 1 if the ith person is Asian, 0 otherwise
  xᵢ₂ = 1 if the ith person is African, 0 otherwise

• The model becomes

  yᵢ = β₀ + β₁xᵢ₁ + β₂xᵢ₂ + εᵢ

• β₀: average credit card balance for Americans (the baseline level)
• β₁: difference in the average balance between the Asian and American levels
• β₂: difference in the average balance between the African and American levels
Coefficient           Estimate   Std. error   t-statistic   p-value
Intercept             531.00     46.32        11.464        <0.0001
Ethnicity (Asian)     -18.69     65.02        -0.287        0.7740
Ethnicity (African)   -12.50     56.68        -0.221        0.8260

Table: least squares coefficient estimates associated with the regression of balance onto ethnicity in the Credit data set
Unit III

Regression and Classification

Supervised and Unsupervised learning – Classification and Regression – Assessing model accuracy – Simple Linear regression – Multiple linear regression – Qualitative predictors – Extensions of the linear Model – Logistic regression – Naïve Bayes classifier.
Extensions of the linear Model
Extensions of the linear Model

• Two of the most important assumptions state that the relationship


between the predictors and response are additive and linear.

• The additive assumption

• means that the effect of changes in a predictor Xj on the response Y is


independent of the values of the other predictors.

• The linear assumption

• states that the change in the response Y due to a one-unit change in X j is

constant, regardless of the value of Xj


  Y = β₀ + β₁X₁ + β₂X₂ + ... + βₚXₚ + ϵ

  sales = β₀ + β₁ × TV + β₂ × radio + β₃ × newspaper + ϵ


Removing the Additive Assumption
• Given a fixed budget of $100,000, spending half on radio and half on TV may
increase sales more than allocating the entire amount to either TV or to radio

• In marketing, this is known as a synergy effect, and in statistics it is referred to as


an interaction effect.

• Y = β0 + β1X1 + β2X2 +ε

• increase X1 by one unit, then Y will increase by an average of β1 units.

• regardless of the value of X2, a one-unit increase in X1 will lead to a β1-unit

increase in Y.

• One way of extending this model to allow for interaction effects is to include a
third predictor, called an interaction term.

• Y = β0 + β1X1 + β2X2 + β3X1X2 + ε


Removing the Additive Assumption

  Y = β₀ + β₁X₁ + β₂X₂ + β₃X₁X₂ + ε

can be rewritten as

  Y = β₀ + (β₁ + β₃X₂)X₁ + β₂X₂ + ε
    = β₀ + β̃₁X₁ + β₂X₂ + ε

where β̃₁ = β₁ + β₃X₂.

Since β̃₁ changes with X₂, adjusting X₂ will change the impact of X₁ on Y.

  sales = β₀ + β₁ × TV + β₂ × radio + β₃ × (radio × TV) + ε
        = β₀ + (β₁ + β₃ × radio) × TV + β₂ × radio + ε

We can interpret β₃ as the increase in the effectiveness of TV advertising for a one-unit increase in radio advertising; a sketch of fitting such a model follows.
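An interaction term is just an extra column equal to the product of two predictors; a sketch assuming NumPy/scikit-learn, on simulated TV/radio data (the generating coefficients are chosen to resemble, not reproduce, the Advertising fit).

```python
# Sketch: adding an interaction term X1*X2 as an extra column before fitting
# (NumPy/scikit-learn assumed; random stand-in data for TV and radio).
import numpy as np
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(0)
tv = rng.uniform(0, 300, 200)
radio = rng.uniform(0, 50, 200)
sales = 6.7 + 0.019 * tv + 0.029 * radio + 0.0011 * tv * radio \
        + rng.normal(0, 1, 200)

X = np.column_stack([tv, radio, tv * radio])   # interaction as a 3rd predictor
m = LinearRegression().fit(X, sales)
print(m.coef_)  # [beta1, beta2, beta3]; beta3 captures the synergy effect
```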
Removing the Additive Assumption

• sales = β₀ + (β₁ + β₃ × radio) × TV + β₂ × radio + ε
• An increase in TV advertising of $1,000 is associated with increased sales of (β̂₁ + β̂₃ × radio) × 1,000 = 19 + 1.1 × radio units
• An increase in radio advertising of $1,000 will be associated with an increase in sales of (β̂₂ + β̂₃ × TV) × 1,000 = 29 + 1.1 × TV units.
Removing the Additive Assumption

• Interaction between a qualitative variable and a quantitative variable

• Credit data set :to predict balance using the income (quantitative) and
student (qualitative) variables
Non-linear Relationships

• Extend the linear model to accommodate non-linear relationships, using polynomial regression:

  mpg = β₀ + β₁ × horsepower + β₂ × horsepower² + ε
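Polynomial regression stays linear in the coefficients, so an ordinary least squares fit on expanded features suffices; a sketch assuming NumPy/scikit-learn, on synthetic horsepower-style data.

```python
# Sketch: polynomial regression as a linear model on expanded features
# (NumPy/scikit-learn assumed; synthetic horsepower-vs-mpg-style data).
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.preprocessing import PolynomialFeatures

rng = np.random.default_rng(0)
hp = rng.uniform(50, 220, 150).reshape(-1, 1)
mpg = 55 - 0.3 * hp[:, 0] + 0.0006 * hp[:, 0] ** 2 + rng.normal(0, 2, 150)

X_poly = PolynomialFeatures(degree=2, include_bias=False).fit_transform(hp)
m = LinearRegression().fit(X_poly, mpg)   # still linear in the coefficients
print(m.intercept_, m.coef_)              # ~ beta0 and [beta1, beta2]
```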


Unit III

Regression and Classification

Supervised and Unsupervised learning – Classification and Regression – Assessing model accuracy – Simple Linear regression – Multiple linear regression – Qualitative predictors – Extensions of the linear Model – Logistic regression – Naïve Bayes classifier.
Logistic Regression
• Logistic regression is a process of modeling the probability of a discrete outcome
given an input variable.

• It models a binary outcome that can take two values such as true/false, yes/no, and
so on.

• Logistic regression is a useful analysis method for classification problems

• The primary difference between linear regression and logistic regression is that
logistic regression's range is bounded between 0 and 1

• Logistic regression is fit by maximizing the likelihood of the observed data, a procedure referred to as maximum likelihood estimation (MLE)

• If the predicted probability is greater than 0.5, the observation is classified as class 1; otherwise, class 0 is assigned.
(Figures: estimated probability of default using linear regression vs. predicted probabilities of default using logistic regression)

• Default data set: the response default falls into one of two categories, Yes or No
• Logistic regression models the probability that Y belongs to a particular category (Yes or No)
Logistic Regression

• Logistic regression models the probability of default

• The probability of default given balance can be written as

• Pr(default = yes|balance).

• The value of p(balance), will range between 0 and 1.

• For any given value of balance, a prediction can be made for default.

• Predict default = yes for any individual for whom p(balance) > 0.5
The Logistic Model
• Model p(X) using a function that gives outputs between 0 and 1 for all values of X
• In logistic regression, use the logistic function:

  p(X) = e^(β₀+β₁X) / (1 + e^(β₀+β₁X))  ----(1)

• Rearranging gives

  p(X) / (1 − p(X)) = e^(β₀+β₁X)  ----(2)

• The quantity p(X)/[1 − p(X)] is called the odds. It can take on any value between 0 and ∞.
• Taking the logarithm of both sides of (2):

  log( p(X) / (1 − p(X)) ) = β₀ + β₁X  ----(3)

• The left-hand side of (3) is called the log-odds or logit
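The logistic function is a one-liner; the sketch below (NumPy assumed) uses illustrative coefficients close to the Default-data fit to show how the probability rises with balance.

```python
# Sketch: the logistic function maps any real log-odds to (0, 1)
# (NumPy assumed; the beta values are illustrative, not an exact fit).
import numpy as np

def logistic(x, b0=-10.65, b1=0.0055):
    z = b0 + b1 * x                      # log-odds (logit), eq. (3)
    return np.exp(z) / (1 + np.exp(z))   # p(X), eq. (1)

for balance in (1000, 2000, 3000):
    print(balance, logistic(balance))    # probability rises with balance
```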


Estimating the Regression Coefficients

• Use maximum likelihood to fit a logistic regression model; the likelihood function is

  ℓ(β₀, β₁) = ∏_{i: yᵢ=1} p(xᵢ) · ∏_{i′: yᵢ′=0} (1 − p(xᵢ′))

• The estimates β̂₀ and β̂₁ are chosen to maximize this likelihood

(Table: estimated coefficients of the logistic regression model that predicts the probability of default using balance)

• A large (absolute) value of the z-statistic provides evidence against the null hypothesis H₀ : β₁ = 0


Making Predictions

  p̂(X) = e^(β̂₀+β̂₁X) / (1 + e^(β̂₀+β̂₁X))

• With the fitted coefficients β̂₀ = −10.6513 and β̂₁ = 0.0055, the predicted default probability for an individual with a balance of $1,000 is

  p̂(X) = e^(−10.6513 + 0.0055×1000) / (1 + e^(−10.6513 + 0.0055×1000)) ≈ 0.00576, i.e. below 1%
Question

• Compute the predicted probability of default for an individual with a balance of $2,000

Ans: 0.586 or 58.6%.
qualitative predictors with the logistic
regression model

Table: For the Default data, estimated coefficients of the logistic regression
model that predicts the probability of default using student status
Multiple Logistic Regression
  log( p(X) / (1 − p(X)) ) = β₀ + β₁X₁ + ... + βₚXₚ  ----(1)

where X = (X₁, ..., Xₚ) are p predictors.

Equation (1) can be rewritten as

  p(X) = e^(β₀+β₁X₁+...+βₚXₚ) / (1 + e^(β₀+β₁X₁+...+βₚXₚ))
Example

Table: For the Default data, estimated coefficients of the logistic regression model that
predicts the probability of default using balance, income, and student status

A student with a credit card balance of $1,500 and an income of $40,000 (income measured in thousands) has an estimated probability of default of

  p̂(X) = e^(β̂₀ + β̂₁×1,500 + β̂₂×40 + β̂₃×1) / (1 + e^(β̂₀ + β̂₁×1,500 + β̂₂×40 + β̂₃×1)) ≈ 0.058
Question

• Calculate the probability of default for a non-student with a balance of $1,500 and an income of $40,000

  Formula: p̂(X) = e^(β̂₀ + β̂₁×1,500 + β̂₂×40 + β̂₃×0) / (1 + e^(β̂₀ + β̂₁×1,500 + β̂₂×40 + β̂₃×0))

  Ans: ≈ 0.105 or 10.5%
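The same calculation can be carried out through a fitted model; a sketch assuming scikit-learn, on simulated Default-style data (the fitted coefficients will only roughly resemble the textbook values, so the probabilities are approximate).

```python
# Sketch: multiple logistic regression on a simulated Default-style dataset
# (scikit-learn assumed; data and generating coefficients are stand-ins).
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
n = 5000
balance = rng.uniform(0, 2700, n)
income = rng.uniform(10, 75, n)            # in thousands of dollars
student = rng.integers(0, 2, n)
logit = -10.9 + 0.0057 * balance + 0.003 * income - 0.65 * student
default = rng.random(n) < 1 / (1 + np.exp(-logit))

X = np.column_stack([balance, income, student])
clf = LogisticRegression(max_iter=1000).fit(X, default)
print(clf.predict_proba([[1500, 40, 1]])[0, 1])  # student:     roughly 0.06
print(clf.predict_proba([[1500, 40, 0]])[0, 1])  # non-student: roughly 0.10
```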
Unit III

Regression and Classification

Supervised and Unsupervised learning – Classification and Regression – Assessing model accuracy – Simple Linear regression – Multiple linear regression – Qualitative predictors – Extensions of the linear Model – Logistic regression – Naïve Bayes classifier.
Naïve Bayes classifier
Bayes Classification

• Bayesian classifiers are statistical classifiers

• They can predict class membership probabilities such as the


probability that a given tuple belongs to a particular class.

• Bayesian classification is based on Bayes’ theorem

• Bayesian classifiers have also exhibited high accuracy and speed


when applied to large databases.

• Naive Bayesian classifiers assume that the effect of an attribute value


on a given class is independent of the values of the other attributes.

• This assumption is called class conditional independence


Bayes’ Theorem

• P(H) is the prior probability of H: e.g., the probability that any given customer will buy a computer, regardless of age, income, or any other information

• P(X|H) is the posterior probability of X conditioned on H

• e.g., the probability that a customer X is 35 years old and earns $40,000, given that we know the customer will buy a computer

• P(X) is the prior probability of X.

• it is the probability that a person from our set of customers is 35 years old

and earns $40,000.


Bayes’ Theorem
• Let X be a data tuple

• Let H be some hypothesis such as that the data tuple X belongs to a

specified class C.

• determine P(H|X) the probability that the hypothesis H holds given the

“evidence” or observed data tuple X. probability that tuple X belongs to

class C

• P(H|X) is the posterior probability, or a posteriori probability, of H

conditioned on X.

• P(H|X) reflects the probability that customer X will buy a computer given

that we know the customer’s age and income.

• P(H) is the prior probability, or a priori probability, of H

• P(H|X)= P(X|H)P(H) /P(X)


Naive Bayesian Classification

• Let D be a training set of tuples and their associated class labels; each tuple is represented by an n-dimensional attribute vector X = (x₁, x₂, ..., xₙ)
• Suppose that there are m classes, C1, C2,..., Cm. Given a tuple, X, the
classifier will predict that X belongs to the class having the highest
posterior probability,
• naive Bayesian classifier predicts that tuple X belongs to the class Ci if
and only if P(Ci|X) > P(Cj|X) for 1≤j≤m,j≠i

• maximize P(Ci|X). The class Ci for which P(Ci|X) is maximized is called


the maximum posteriori hypothesis.

• P(Ci|X)= P(X|Ci)P(Ci)/ P(X)


RID   Age           Income   Student   Credit rating   Buys_computer
1     Youth         High     No        Fair            No
2     Youth         High     No        Excellent       No
3     Middle_aged   High     No        Fair            Yes
4     Senior        Medium   No        Fair            Yes
5     Senior        Low      Yes       Fair            Yes
6     Senior        Low      Yes       Excellent       No
7     Middle_aged   Low      Yes       Excellent       Yes
8     Youth         Medium   No        Fair            No
9     Youth         Low      Yes       Fair            Yes
10    Senior        Medium   Yes       Fair            Yes
11    Youth         Medium   Yes       Excellent       Yes
12    Middle_aged   Medium   No        Excellent       Yes
13    Middle_aged   High     Yes       Fair            Yes
14    Senior        Medium   No        Excellent       No
Example
• Problem
• Classify the tuple
• X=(age=youth, income=medium, student=yes, credit_rating=fair)
•Solution
•Compute P(Ci), the prior probability of each class
• P(buys_computer=yes)=9/14=0.643
P(Ci|X)= P(X|Ci)P(Ci)/ P(X)
• P(buys _computer=no) =5/14=0.357

• To compute P(X|Ci),for i=1,2,we compute the following conditional probabilities


• P(age=youth|buys_computer=yes) =2/9=0.222
• P(age=youth|buys_computer=no) =3/5=0.600
• P(income=medium|buys_computer=yes)=4/9=0.444
• P(income=medium|buys_computer=no) =2/5=0.400
• P(student=yes|buys_computer=yes) =6/9=0.667
• P(student=yes|buys_computer=no) =1/5=0.200
• P(credit rating=fair|buys_computer=yes)=6/9=0.667
• P(credit rating=fair|buys_computer=no) =2/5=0.400
Example
• P(X|buys_computer=yes) = P(age=youth|buys_computer=yes) × P(income=medium|buys_computer=yes) × P(student=yes|buys_computer=yes) × P(credit rating=fair|buys_computer=yes)
  = 0.222 × 0.444 × 0.667 × 0.667 = 0.044
• P(X|buys_computer=no) = 0.600 × 0.400 × 0.200 × 0.400 = 0.019
• P(X|buys_computer=yes)P(buys_computer=yes) = 0.044 × 0.643 = 0.028
• P(X|buys_computer=no)P(buys_computer=no) = 0.019 × 0.357 = 0.007
• Since 0.028 > 0.007, the naive Bayesian classifier predicts buys_computer = yes for tuple X.
  (Using P(Ci|X) = P(X|Ci)P(Ci)/P(X) and P(X|Ci) = ∏ⁿₖ₌₁ P(xₖ|Ci))
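The worked example maps directly onto scikit-learn's CategoricalNB; the sketch below encodes the table's categories as integers and uses near-zero smoothing so that the resulting probabilities track the hand computation.

```python
# Sketch: the hand computation above, done with scikit-learn's CategoricalNB
# (near-zero smoothing so the probabilities track the worked example).
from sklearn.naive_bayes import CategoricalNB

# Encodings: age {youth:0, middle_aged:1, senior:2}, income {low:0, medium:1, high:2},
# student {no:0, yes:1}, credit_rating {fair:0, excellent:1}
X = [[0,2,0,0],[0,2,0,1],[1,2,0,0],[2,1,0,0],[2,0,1,0],[2,0,1,1],[1,0,1,1],
     [0,1,0,0],[0,0,1,0],[2,1,1,0],[0,1,1,1],[1,1,0,1],[1,2,1,0],[2,1,0,1]]
y = ["no","no","yes","yes","yes","no","yes","no","yes","yes","yes","yes","yes","no"]

nb = CategoricalNB(alpha=1e-10).fit(X, y)
# X = (age=youth, income=medium, student=yes, credit_rating=fair)
print(nb.predict([[0, 1, 1, 0]]))        # ['yes']
print(nb.predict_proba([[0, 1, 1, 0]]))  # ~[0.20, 0.80]: 0.007 vs 0.028, normalized
```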
Reference(s):

• James G., Witten D., Hastie T. and Tibshirani R., "An Introduction to Statistical Learning with Applications in R", Springer, 2013.
(Table: correlation matrix for TV, radio, newspaper, and sales for the Advertising data)
• P(X|Ci)P(Ci) needs to be maximized

• P(X|Ci) = ∏ⁿₖ₌₁ P(xₖ|Ci) = P(x₁|Ci) × P(x₂|Ci) × ··· × P(xₙ|Ci)

• If Aₖ is categorical, then P(xₖ|Ci) is the number of tuples of class Ci in D having the value xₖ for Aₖ, divided by |Ci,D|, the number of tuples of class Ci in D

• If Aₖ is continuous-valued, P(xₖ|Ci) is typically computed from a Gaussian density:

  g(x, μ, σ) = (1 / (√(2π) σ)) · e^( −(x−μ)² / (2σ²) )

• To predict the class label of X, P(X|Ci)P(Ci) is evaluated for each class Ci

• The classifier predicts that the class label of tuple X is the class C i if and only if

• P(X|Ci)P(Ci) > P(X|Cj)P(Cj) for 1≤j≤m, j≠i
