MI - Unit 3
Unit 3 – Session 1
Prabhu K, AP(SS)/CSE
Unit III
• Disadvantages
• The machine has no prior knowledge of the features of dogs and cats, so it cannot label the two groups as "dogs" and "cats"
• It can only split the images into two clusters: one may contain all the pictures with dogs in them, and the other all the pictures with cats in them
Unsupervised learning
• Example
[Figure: classification workflow. A classification algorithm learns a classifier from the training data; the classifier is then applied to testing data and to unseen data.]

Testing data:
NAME     RANK            YEARS   TENURED
Tom      Assistant Prof  2       no
George   Professor       5       yes

Unseen data: (Jeff, Professor, 4) → Tenured?
Thank you
Unit II
• The mean squared error over the training data is
$$\mathrm{MSE} = \frac{1}{n}\sum_{i=1}^{n}\left(y_i - \hat{f}(x_i)\right)^2$$
• where $\hat{f}(x_i)$ is the prediction that $\hat{f}$ gives for the $i$th observation
• The average test MSE at a test point $x_0$ is $\mathrm{Ave}\left[\left(y_0 - \hat{f}(x_0)\right)^2\right]$
• Bias is the difference between the expected value of the predicted response and its true value.
• Variance is the expected squared deviation of a random variable from its mean; it measures how far a set of numbers is spread out from its mean.
• The expected test MSE at a given value $x_0$ is the sum of the variance of $\hat{f}(x_0)$, the squared bias of $\hat{f}(x_0)$, and the variance of the error term $\epsilon$:
$$E\left[\left(y_0 - \hat{f}(x_0)\right)^2\right] = \mathrm{Var}\left(\hat{f}(x_0)\right) + \left[\mathrm{Bias}\left(\hat{f}(x_0)\right)\right]^2 + \mathrm{Var}(\epsilon)$$
• Here $\hat{f}$ is estimated using a large number of training sets, and each estimate is tested at $x_0$.
Bias-Variance Trade-Off
• The overall expected test MSE is computed by averaging $E\left[\left(y_0 - \hat{f}(x_0)\right)^2\right]$ over all possible values of $x_0$ in the test data set
• In order to minimize the expected test error, a learning method with both low variance and low bias is needed
• The test MSE can never drop below $\mathrm{Var}(\epsilon)$, the variance of the irreducible error term
• As methods become more flexible, the variance increases while the bias decreases (a small simulation sketch follows)
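To make the decomposition concrete, here is a small simulation sketch, assuming a hypothetical true function f(x) = sin(x) and Gaussian noise (both illustrative choices, not from the source): it refits polynomials of increasing flexibility on many fresh training sets and estimates the squared bias and variance at a single test point x0.

```python
import numpy as np

rng = np.random.default_rng(0)

def f(x):
    # Hypothetical true regression function, chosen only for illustration.
    return np.sin(x)

sigma = 0.3            # std. dev. of the irreducible error term epsilon
x0 = 1.5               # test point at which bias and variance are estimated
n_train, n_sets = 30, 2000

for degree in (1, 3, 9):                      # increasing flexibility
    preds = np.empty(n_sets)
    for s in range(n_sets):
        # Draw a fresh training set from the same population each time.
        x = rng.uniform(0.0, np.pi, n_train)
        y = f(x) + rng.normal(0.0, sigma, n_train)
        coef = np.polyfit(x, y, degree)       # least squares polynomial fit
        preds[s] = np.polyval(coef, x0)       # f_hat(x0) for this training set
    bias_sq = (preds.mean() - f(x0)) ** 2     # squared bias of f_hat(x0)
    variance = preds.var()                    # variance of f_hat(x0)
    # Expected test MSE at x0 = variance + bias^2 + Var(epsilon).
    print(f"degree={degree}: bias^2={bias_sq:.4f}, variance={variance:.4f}, "
          f"expected MSE ~ {bias_sq + variance + sigma**2:.4f}")
```

Higher-degree fits typically show smaller squared bias but larger variance, matching the trade-off described above.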
Bias-Variance Trade-Off
• This relationship between bias, variance, and test MSE is referred to as the bias-variance trade-off
• The expected value here averages over all possible values of $X$
K-Nearest Neighbors (KNN)
• The K in the KNN is any positive integer.
• For a test observation $x_0$, the KNN classifier first identifies the $K$ points in the training data that are closest to $x_0$, denoted $\mathcal{N}_0$, and then estimates the conditional probability for class $j$ as the fraction of points in $\mathcal{N}_0$ whose response equals $j$:
$$\Pr(Y = j \mid X = x_0) = \frac{1}{K}\sum_{i \in \mathcal{N}_0} I(y_i = j)$$
• If K = 1, the training error rate is 0 but the test error rate may be
quite high
• Low K values give more flexible methods, and the flexibility decreases as K increases (a minimal sketch of the KNN rule follows)
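A minimal sketch of the KNN rule above, assuming a tiny hypothetical NumPy dataset; this illustrates the majority-vote estimate of Pr(Y = j | X = x0), not the implementation of any particular library.

```python
import numpy as np
from collections import Counter

def knn_predict(X_train, y_train, x0, k):
    """Classify a single test point x0 by majority vote among its K nearest neighbors."""
    # Euclidean distance from x0 to every training observation.
    dists = np.linalg.norm(X_train - x0, axis=1)
    # Indices of the K closest training points (the neighborhood N0).
    nearest = np.argsort(dists)[:k]
    # Pr(Y = j | X = x0) is estimated as the fraction of neighbors in class j;
    # predict the class with the highest estimated probability.
    return Counter(y_train[nearest]).most_common(1)[0][0]

# Hypothetical toy data: two features, two classes.
X_train = np.array([[1.0, 1.0], [1.2, 0.8], [3.0, 3.2], [3.1, 2.9]])
y_train = np.array([0, 0, 1, 1])
print(knn_predict(X_train, y_train, np.array([1.1, 0.9]), k=3))  # -> 0
```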
Bias-Variance Trade-Off
• Underfitting:
• a model unable to capture the underlying pattern of the data
• happens when we have too little data to build an accurate model, or
• when trying to fit a linear model to nonlinear data.
• These models have high bias and low variance
• Overfitting:
• model captures the noise along with the underlying pattern in data
• happens when we train the model too long on a noisy dataset.
• These models have low bias and high variance.
• These models tend to be very complex, like deep decision trees, which are prone to overfitting (see the sketch below).
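The sketch below illustrates both regimes on hypothetical noisy samples from a quadratic function: a degree-1 polynomial underfits (high bias), while a degree-12 polynomial overfits (training error falls but test error rises).

```python
import numpy as np

rng = np.random.default_rng(1)

# Hypothetical nonlinear ground truth with noise.
x = np.sort(rng.uniform(-2, 2, 40))
y = x**2 + rng.normal(0, 0.4, x.size)
x_test = np.sort(rng.uniform(-2, 2, 40))
y_test = x_test**2 + rng.normal(0, 0.4, x_test.size)

for degree in (1, 2, 12):
    coef = np.polyfit(x, y, degree)
    train_mse = np.mean((y - np.polyval(coef, x)) ** 2)
    test_mse = np.mean((y_test - np.polyval(coef, x_test)) ** 2)
    # degree 1: both errors high (underfit); degree 12: train low, test high (overfit).
    print(f"degree={degree}: train MSE={train_mse:.3f}, test MSE={test_mse:.3f}")
```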
Bias-Variance Trade-Off
• An algorithm cannot be both more complex and less complex at the same time, so lowering bias (more flexibility) and lowering variance (less flexibility) must be traded off
Unit II
• Suggest a marketing plan for next year that will result in high product
sales.
• What information would be useful in order to provide such a
recommendation?
• Is there a relationship between advertising budget and sales?
• How strong is the relationship between advertising budget and
sales?
• Which media contribute to sales?
• How accurately can we estimate the effect of each medium on sales?
• How accurately can we predict future sales?
• Is the relationship linear?
• Is there synergy among the advertising media?
Simple Linear Regression
• It is a very straightforward approach for predicting a quantitative
response Y on the basis of a single predictor variable X.
• It assumes that there is approximately a linear relationship between X and
Y.
• Mathematically, we can write this linear relationship as
• Y≈β0+ β1X
• β0 and β1 are two unknown constants that represent the intercept and
slope terms in the linear model
• Equivalently, the residual sum of squares can be written as
$$\mathrm{RSS} = (y_1 - \hat{\beta}_0 - \hat{\beta}_1 x_1)^2 + (y_2 - \hat{\beta}_0 - \hat{\beta}_1 x_2)^2 + \cdots + (y_n - \hat{\beta}_0 - \hat{\beta}_1 x_n)^2$$
Figure: contour and three-dimensional plots of the RSS on the Advertising data, using sales as the response and TV as the predictor.
Estimating the Coefficients
• The least squares coefficient estimates for simple linear regression are given by
$$\hat{\beta}_1 = \frac{\sum_{i=1}^{n}(x_i - \bar{x})(y_i - \bar{y})}{\sum_{i=1}^{n}(x_i - \bar{x})^2}, \qquad \hat{\beta}_0 = \bar{y} - \hat{\beta}_1\bar{x}$$
• where $\bar{x}$ and $\bar{y}$ are the sample means
• For the Advertising data, $\hat{\beta}_0 = 7.03$ and $\hat{\beta}_1 = 0.0475$, giving the fitted line $\hat{y} = 7.03 + 0.0475x$ (a short computational sketch follows)
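A short NumPy sketch of these estimates, using hypothetical data in place of the Advertising set (the true intercept and slope below are arbitrary illustrative values):

```python
import numpy as np

def simple_ols(x, y):
    """Least squares estimates for Y ~ beta0 + beta1 * X."""
    x_bar, y_bar = x.mean(), y.mean()
    beta1 = np.sum((x - x_bar) * (y - y_bar)) / np.sum((x - x_bar) ** 2)
    beta0 = y_bar - beta1 * x_bar
    return beta0, beta1

# Hypothetical data standing in for the Advertising example.
rng = np.random.default_rng(0)
x = rng.uniform(0, 300, 200)                      # e.g., a TV-budget-like predictor
y = 7.0 + 0.05 * x + rng.normal(0, 1.5, x.size)   # linear signal plus noise
beta0, beta1 = simple_ols(x, y)
print(f"intercept={beta0:.3f}, slope={beta1:.4f}")  # close to 7.0 and 0.05
```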
• The true relationship between X and Y is assumed to be given by $Y = \beta_0 + \beta_1 X + \epsilon$
• The population regression line is the best linear approximation to the true relationship between X and Y
• $\beta_0$ is the intercept term, that is, the expected value of Y when X = 0
• $\beta_1$ is the slope, the average increase in Y associated with a one-unit increase in X
• $\epsilon$ is the error term (e.g., measurement error); it is assumed to be independent of X
• The population regression line is the best linear line that fits all of the data; the least squares line is the line that minimizes the RSS
Figure (left): the red line represents the true relationship, f(X) = 2 + 3X, which is known as the population regression line. The blue line is the least squares line; it is the least squares estimate for f(X) based on the observed data, shown in black.
Figure (right): the population regression line is again shown in red, and the least squares line in dark blue. In light blue, ten least squares lines are shown, each computed on the basis of a separate random set of observations.
Assessing the Accuracy of the Coefficient Estimates
$$\mathrm{Var}(\hat{\mu}) = \mathrm{SE}(\hat{\mu})^2 = \frac{\sigma^2}{n}$$
• where $\sigma^2 = \mathrm{Var}(\epsilon)$.
• The estimate of $\sigma$ is known as the residual standard error:
$$\mathrm{RSE} = \sqrt{\frac{\mathrm{RSS}}{n-2}}$$
• Standard errors can be used to compute confidence intervals
• The approximate 95% confidence interval for $\beta_1$ is $\hat{\beta}_1 \pm 2 \cdot \mathrm{SE}(\hat{\beta}_1)$
• The approximate 95% confidence interval for $\beta_0$ is $\hat{\beta}_0 \pm 2 \cdot \mathrm{SE}(\hat{\beta}_0)$
• The p-value is the probability of observing any value equal to $|t|$ or larger, assuming $\beta_1 = 0$
• A small p-value indicates an association between the predictor and the response, so we reject the null hypothesis
• Typical p-value cutoffs for rejecting the null hypothesis are 5% or 1%
Assessing the Accuracy of the Model
• The quality of a linear regression fit is typically assessed using two
related quantities:
• residual standard error (RSE)
• R2 statistic.
• Residual Standard Error(RSE)
• it is the average amount that the response will deviate from the true
regression line
• The RSE is considered a measure of the lack of fit of the model to data
• If RSE small: the model fits the data very well
• If RSE large: the model doesn’t fit the data well.
$$\mathrm{RSE} = \sqrt{\frac{1}{n-2}\,\mathrm{RSS}} = \sqrt{\frac{1}{n-2}\sum_{i=1}^{n}(y_i - \hat{y}_i)^2}, \qquad \mathrm{RSS} = \sum_{i=1}^{n}(y_i - \hat{y}_i)^2$$
Assessing the Accuracy of the Model
• The $R^2$ statistic is a measure of the linear relationship between X and Y:
$$R^2 = 1 - \frac{\mathrm{RSS}}{\mathrm{TSS}}, \qquad \mathrm{RSS} = \sum_{i=1}^{n}(y_i - \hat{y}_i)^2, \qquad \mathrm{TSS} = \sum_{i=1}^{n}(y_i - \bar{y})^2$$
• In simple linear regression, $R^2 = r^2$, where
$$r = \mathrm{Cor}(X, Y) = \frac{\sum_{i=1}^{n}(x_i - \bar{x})(y_i - \bar{y})}{\sqrt{\sum_{i=1}^{n}(x_i - \bar{x})^2}\sqrt{\sum_{i=1}^{n}(y_i - \bar{y})^2}}$$
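A brief sketch computing RSE and R² from the residuals of a least squares fit, again on hypothetical data:

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.uniform(0, 300, 200)
y = 7.0 + 0.05 * x + rng.normal(0, 1.5, x.size)

# Fit Y ~ beta0 + beta1 * X by least squares (np.polyfit returns [slope, intercept]).
beta1, beta0 = np.polyfit(x, y, 1)
y_hat = beta0 + beta1 * x

rss = np.sum((y - y_hat) ** 2)        # residual sum of squares
tss = np.sum((y - y.mean()) ** 2)     # total sum of squares
rse = np.sqrt(rss / (len(y) - 2))     # residual standard error
r2 = 1 - rss / tss                    # R^2 statistic
print(f"RSE={rse:.3f} (true sigma was 1.5), R^2={r2:.3f}")
```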
Unit II
• Multiple linear regression uses $p$ distinct predictors:
$$Y = \beta_0 + \beta_1 X_1 + \beta_2 X_2 + \cdots + \beta_p X_p + \epsilon$$
• The fitted model is
$$\hat{y} = \hat{\beta}_0 + \hat{\beta}_1 x_1 + \hat{\beta}_2 x_2 + \cdots + \hat{\beta}_p x_p$$
Table: for the Advertising data, least squares coefficient estimates of the multiple linear regression of number of units sold on radio, TV, and newspaper advertising budgets.
Is There a Relationship Between the Response and Predictors?
• $H_0: \beta_1 = \beta_2 = \cdots = \beta_p = 0$ is tested with the F-statistic:
$$F = \frac{(\mathrm{TSS} - \mathrm{RSS})/p}{\mathrm{RSS}/(n - p - 1)}$$
• How large does the F-statistic need to be before we can reject $H_0$ and conclude that there is a relationship?
• Forward selection: begin with the null model, i.e., a model that contains an intercept but no predictors
• then fit p simple linear regressions and add to the null model the
variable that results in the lowest RSS
• then add to that model the variable that results in the lowest RSS for
the new two-variable model
• This approach is continued until some stopping rule is satisfied (a minimal sketch of the loop follows this list).
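A hedged sketch of the forward-selection loop, assuming a NumPy feature matrix and using the RSS of an ordinary least squares fit to pick each added variable; the fixed cap on the number of variables stands in for whatever stopping rule is chosen.

```python
import numpy as np

def rss_of_fit(X, y):
    """RSS of the least squares fit of y on X plus an intercept column."""
    Xd = np.column_stack([np.ones(len(y)), X])
    beta, *_ = np.linalg.lstsq(Xd, y, rcond=None)
    return np.sum((y - Xd @ beta) ** 2)

def forward_selection(X, y, max_vars):
    """Greedily add the predictor that lowers RSS most, starting from the null model."""
    selected, remaining = [], list(range(X.shape[1]))
    while remaining and len(selected) < max_vars:        # simple stopping rule
        best = min(remaining, key=lambda j: rss_of_fit(X[:, selected + [j]], y))
        selected.append(best)
        remaining.remove(best)
    return selected

# Hypothetical data: only predictors 0 and 2 actually matter.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 5))
y = 3 * X[:, 0] - 2 * X[:, 2] + rng.normal(0, 0.5, 200)
print(forward_selection(X, y, max_vars=2))  # expected: [0, 2]
```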
Deciding on Important Variables
• Backward selection
• start with all variables in the model
• remove the variable with the largest p-value
• The new (p − 1)-variable model is fit
• stop when all remaining variables have a p-value below some threshold
• Mixed selection
• start with no variables in the model and as with forward selection, add the
variable that provides the best fit
• if at any point the p-value for one of the variables in the model rises above a
certain threshold, then remove that variable from the model
Model Fit
• Numerical measures of model fit are the RSE and R2
Predictions
• Apply the fitted model $\hat{y} = \hat{\beta}_0 + \hat{\beta}_1 x_1 + \cdots + \hat{\beta}_p x_p$ in order to predict the response Y on the basis of a set of values for the predictors $X_1, X_2, \ldots, X_p$
Unit III
• For a qualitative predictor with two levels, create a dummy variable, e.g.
$$x_i = \begin{cases} 1 & \text{if the } i\text{th observation is in the first level} \\ 0 & \text{otherwise} \end{cases}$$
• The model $y_i = \beta_0 + \beta_1 x_i + \epsilon_i$ then gives mean response $\beta_0 + \beta_1$ for the first level and $\beta_0$ for the other
• For a qualitative predictor with three levels, create two dummy variables: $x_{i1} = 1$ if the $i$th observation is in the first level (0 otherwise), and $x_{i2} = 1$ if it is in the second level (0 otherwise)
• The standard additive model is $Y = \beta_0 + \beta_1 X_1 + \beta_2 X_2 + \epsilon$: regardless of the value of $X_2$, a one-unit increase in $X_1$ is associated with a $\beta_1$-unit increase in Y.
• One way of extending this model to allow for interaction effects is to include a third predictor, called an interaction term:
$$Y = \beta_0 + \beta_1 X_1 + \beta_2 X_2 + \beta_3 X_1 X_2 + \epsilon$$
• This can be rewritten as
$$Y = \beta_0 + \tilde{\beta}_1 X_1 + \beta_2 X_2 + \epsilon, \qquad \text{where } \tilde{\beta}_1 = \beta_1 + \beta_3 X_2$$
• We can interpret $\beta_3$ as the increase in the effectiveness of TV advertising for a one-unit increase in radio advertising
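A brief sketch of fitting the interaction model by least squares on hypothetical TV/radio-like data; the interaction column of the design matrix is simply the elementwise product of the two predictors.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 500
x1 = rng.uniform(0, 300, n)   # e.g., a TV-budget-like predictor
x2 = rng.uniform(0, 50, n)    # e.g., a radio-budget-like predictor
# Hypothetical truth with a synergy (interaction) effect.
y = 6 + 0.02 * x1 + 0.03 * x2 + 0.001 * x1 * x2 + rng.normal(0, 1, n)

# Design matrix: intercept, X1, X2, and the interaction term X1*X2.
Xd = np.column_stack([np.ones(n), x1, x2, x1 * x2])
beta, *_ = np.linalg.lstsq(Xd, y, rcond=None)
b0, b1, b2, b3 = beta
print(f"beta3 (interaction) ~= {b3:.4f}")  # close to 0.001
# The effect of one extra unit of X1 depends on X2: slope = beta1 + beta3 * X2.
```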
Removing the Additive Assumption
• Credit data set :to predict balance using the income (quantitative) and
student (qualitative) variables
Non-linear Relationships
• Logistic regression models a binary outcome that can take two values such as true/false, yes/no, and so on.
• The primary difference between linear regression and logistic regression is that the output of logistic regression is bounded between 0 and 1.
• If the predicted probability is greater than 0.5, the observation is classified as class 1; otherwise, class 0 is assigned.
Figure: estimated probability of default using linear regression (left) and predicted probabilities of default using logistic regression (right).
• Default data set. response default falls into one of two categories, Yes or No.
• Logistic regression models the probability that Y belongs to a category Yes or No
Logistic Regression
• Pr(default = yes|balance).
• For any given value of balance, a prediction can be made for default.
• Predict default = yes for any individual for whom p(balance) > 0.5
The Logistic Model
• model p(X) using a function that gives outputs between 0 and 1 for all
values of X
• In logistic regression, use the logistic function,
$$p(X) = \frac{e^{\beta_0 + \beta_1 X}}{1 + e^{\beta_0 + \beta_1 X}} \qquad (1)$$
• The quantity $p(X)/[1 - p(X)]$ is called the odds; it can take on any value between 0 and $\infty$:
$$\frac{p(X)}{1 - p(X)} = e^{\beta_0 + \beta_1 X} \qquad (2)$$
• Taking the logarithm gives the log-odds, or logit:
$$\log\left(\frac{p(X)}{1 - p(X)}\right) = \beta_0 + \beta_1 X \qquad (3)$$
• The coefficients are estimated by maximizing the likelihood function:
$$\ell(\beta_0, \beta_1) = \prod_{i: y_i = 1} p(x_i) \prod_{i': y_{i'} = 0} \left(1 - p(x_{i'})\right)$$
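In practice this likelihood is maximized numerically. Below is a minimal gradient-ascent sketch for the one-predictor model on hypothetical simulated data; a real analysis would normally use a library routine (e.g., scikit-learn's LogisticRegression) rather than this hand-rolled loop.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def fit_logistic(x, y, lr=0.5, iters=5000):
    """Maximize the log-likelihood of p(X) = sigmoid(b0 + b1*X) by gradient ascent."""
    b0 = b1 = 0.0
    for _ in range(iters):
        p = sigmoid(b0 + b1 * x)
        # Gradient of the average log-likelihood: mean of (y_i - p_i) and (y_i - p_i)*x_i.
        b0 += lr * np.mean(y - p)
        b1 += lr * np.mean((y - p) * x)
    return b0, b1

# Hypothetical one-predictor data with known coefficients.
rng = np.random.default_rng(0)
x = rng.uniform(-3, 3, 1000)
y = (rng.uniform(size=x.size) < sigmoid(-0.5 + 1.5 * x)).astype(float)
b0, b1 = fit_logistic(x, y)
print(f"b0 ~= {b0:.2f}, b1 ~= {b1:.2f}")  # close to -0.5 and 1.5
```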
Table: for the Default data, estimated coefficients of the logistic regression model that predicts the probability of default using balance.
• Once the coefficients are estimated, predictions use
$$\hat{p}(X) = \frac{e^{\hat{\beta}_0 + \hat{\beta}_1 X}}{1 + e^{\hat{\beta}_0 + \hat{\beta}_1 X}}$$
• With $\hat{\beta}_0 = -10.6513$ and $\hat{\beta}_1 = 0.0055$, the predicted default probability for an individual with a balance of $1,000 is
$$\hat{p}(1000) = \frac{e^{-10.6513 + 0.0055 \times 1000}}{1 + e^{-10.6513 + 0.0055 \times 1000}} = 0.00576$$
which is below 1%.
Question
What is the predicted probability of default for an individual with a balance of $2,000?
Ans: 0.586, or 58.6%.
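The same plug-in computation in a few lines of Python, using the coefficient estimates quoted above:

```python
import math

b0, b1 = -10.6513, 0.0055   # estimated coefficients for the Default data
for balance in (1000, 2000):
    z = b0 + b1 * balance
    p = math.exp(z) / (1 + math.exp(z))   # logistic function
    print(f"balance={balance}: p_hat = {p:.4f}")
# balance=1000 -> ~0.0058 (below 1%); balance=2000 -> ~0.586 (58.6%)
```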
Qualitative Predictors with the Logistic Regression Model
Table: For the Default data, estimated coefficients of the logistic regression
model that predicts the probability of default using student status
Multiple Logistic Regression
$$\log\left(\frac{p(X)}{1 - p(X)}\right) = \beta_0 + \beta_1 X_1 + \cdots + \beta_p X_p \qquad (1)$$
$$p(X) = \frac{e^{\beta_0 + \beta_1 X_1 + \cdots + \beta_p X_p}}{1 + e^{\beta_0 + \beta_1 X_1 + \cdots + \beta_p X_p}} \qquad (2)$$
Example
Table: For the Default data, estimated coefficients of the logistic regression model that
predicts the probability of default using balance, income, and student status
A student with a credit card balance of $1,500 and an income of $40,000 (i.e., $X_1 = 1500$, $X_2 = 40$ with income in thousands, $X_3 = 1$) has an estimated probability of default of
$$\hat{p}(X) = \frac{e^{\hat{\beta}_0 + \hat{\beta}_1 \times 1500 + \hat{\beta}_2 \times 40 + \hat{\beta}_3 \times 1}}{1 + e^{\hat{\beta}_0 + \hat{\beta}_1 \times 1500 + \hat{\beta}_2 \times 40 + \hat{\beta}_3 \times 1}} = 0.058$$
Question
Formula:
Unit II
• P(H) is the prior probability that any given customer will buy a computer, regardless of age, income, or any other information.
• P(X|H) is the probability that a customer, X, is 35 years old and earns $40,000, given that we know the customer will buy a computer.
• P(X) is the probability that a person from our set of customers is 35 years old and earns $40,000.
• We want to determine P(H|X), the probability that the hypothesis H holds given X, i.e., that X belongs to the specified class C.
• P(H|X) reflects the probability that customer X will buy a computer given that we know the customer's age and income, and it is obtained from Bayes' theorem:
$$P(H|X) = \frac{P(X|H)\,P(H)}{P(X)}$$
• P(X|buys_computer=yes) = 0.044 and P(X|buys_computer=no) = 0.600 × 0.400 × 0.200 × 0.400 = 0.019
• P(X|buys_computer=yes)P(buys_computer=yes) = 0.044 × 0.643 = 0.028
• P(X|buys_computer=no)P(buys_computer=no) = 0.019 × 0.357 = 0.007
• Since 0.028 > 0.007, the naive Bayesian classifier predicts buys_computer = yes for tuple X.
• Under the class-conditional independence assumption, $P(X|C_i) = P(x_1|C_i) \times P(x_2|C_i) \times \cdots \times P(x_n|C_i)$.
• If $A_k$ is categorical, then $P(x_k|C_i)$ is the number of tuples of class $C_i$ in D having the value $x_k$ for $A_k$, divided by the number of tuples of class $C_i$ in D.
• If $A_k$ is continuous-valued, $P(x_k|C_i)$ is typically estimated from a Gaussian distribution with mean $\mu_{C_i}$ and standard deviation $\sigma_{C_i}$.
• The classifier predicts that the class label of tuple X is $C_i$ if and only if $P(X|C_i)P(C_i) > P(X|C_j)P(C_j)$ for all $j \neq i$.
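A minimal sketch of this decision rule for categorical attributes, on a tiny hypothetical dataset; frequency counts implement P(xk|Ci) exactly as described above (no Laplace smoothing, to keep the correspondence direct).

```python
from collections import Counter, defaultdict

def nb_train(rows, labels):
    """Estimate P(Ci) and P(xk|Ci) from categorical training tuples."""
    class_counts = Counter(labels)
    # attr_counts[(class, attribute index, value)] = count
    attr_counts = defaultdict(int)
    for row, c in zip(rows, labels):
        for k, v in enumerate(row):
            attr_counts[(c, k, v)] += 1
    return class_counts, attr_counts, len(labels)

def nb_predict(x, class_counts, attr_counts, n):
    """Pick the class maximizing P(X|Ci) * P(Ci)."""
    best_class, best_score = None, -1.0
    for c, cc in class_counts.items():
        score = cc / n                      # prior P(Ci)
        for k, v in enumerate(x):
            score *= attr_counts[(c, k, v)] / cc   # P(xk|Ci) as a frequency
        if score > best_score:
            best_class, best_score = c, score
    return best_class

# Hypothetical toy tuples: (age_band, income_band) -> buys_computer.
rows = [("youth", "high"), ("youth", "low"), ("senior", "low"),
        ("senior", "high"), ("middle", "high"), ("middle", "low")]
labels = ["no", "no", "yes", "yes", "yes", "no"]
model = nb_train(rows, labels)
print(nb_predict(("senior", "high"), *model))  # -> "yes"
```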