Regression 2

The document provides an overview of regression analysis, introduced by Sir Francis Galton, focusing on the relationship between children's heights and their parents'. It covers key concepts such as central tendency, dispersion, least squares estimates, and the effects of outliers, as well as the distinction between simple and multiple regression. Additionally, it discusses the coefficient of determination (R²) and the importance of adjusted R² in evaluating model performance.


Regression

Brief Introduction
• Regression: introduced by Sir Francis Galton (1822-1911).
• Experiment based on the heights of children vs. the heights of their parents.
• Observations:
  • Tall parents have tall children.
  • Short parents have short children.
• Findings:
  • The average height of children tends to "step back", or regress, toward the average height of all men.
  • This tendency toward the average height of all men is called regression.
Galton dataset: heights of children vs. their parents (first rows, in inches)

child    parent
61.7     70.5
61.7     68.5
61.7     65.5
61.7     64.5
61.7     64.0
62.2     67.5
62.2     67.5
62.2     67.5
62.2     66.5
62.2     66.5
62.2     66.5
62.2     64.5
63.2     70.5
63.2     69.5
63.2     68.5
63.2     68.5
63.2     68.5
63.2     68.5
63.2     68.5
63.2     68.5
63.2     68.5
63.2     67.5

Tasks:
• Find the basic statistics (mean, variance, standard deviation).
• Find the outliers (if any). How to find outliers: use the five-number summary and the IQR.
• How to resolve outlier problems? An outlier can either be replaced with the mean, or the instance can be removed.
• After pre-processing, plot a histogram to see the data's behavior.
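As a quick illustration of these steps, here is a minimal Python sketch (standard library only); the input lists are the reconstructed child/parent rows above:

```python
import statistics

child  = [61.7] * 5 + [62.2] * 7 + [63.2] * 10
parent = [70.5, 68.5, 65.5, 64.5, 64.0,
          67.5, 67.5, 67.5, 66.5, 66.5, 66.5, 64.5,
          70.5, 69.5, 68.5, 68.5, 68.5, 68.5, 68.5, 68.5, 68.5, 67.5]

def summarize(name, xs):
    xs = sorted(xs)
    q1, med, q3 = statistics.quantiles(xs, n=4)   # quartiles for the five-number summary
    iqr = q3 - q1
    lo, hi = q1 - 1.5 * iqr, q3 + 1.5 * iqr       # 1.5 * IQR fences
    outliers = [x for x in xs if x < lo or x > hi]
    print(name, "mean=%.2f var=%.2f sd=%.2f" % (
        statistics.mean(xs), statistics.variance(xs), statistics.stdev(xs)))
    print("  five-number summary:", xs[0], q1, med, q3, xs[-1])
    print("  outliers (1.5*IQR rule):", outliers or "none")

summarize("child ", child)
summarize("parent", parent)
```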
Measuring the Central Tendency

• Mean (algebraic measure), sample vs. population (note: $n$ is the sample size and $N$ is the population size):

  $\bar{x} = \frac{1}{n}\sum_{i=1}^{n} x_i \qquad \mu = \frac{1}{N}\sum_{i=1}^{N} x_i$

• Weighted arithmetic mean:

  $\bar{x} = \frac{\sum_{i=1}^{n} w_i x_i}{\sum_{i=1}^{n} w_i}$

• Median:
  • Middle value if odd number of values, or average of the middle two values otherwise
  • Estimated by interpolation for grouped data (see the sketch after this list):

    $\text{median} = L_1 + \left(\frac{n/2 - (\sum \text{freq})_l}{\text{freq}_{\text{median}}}\right) \times \text{width}$

• Mode:
  • Value that occurs most frequently in the data
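The interpolation formula for grouped data is easy to misread, so here is a small sketch of it; the class boundaries and frequencies below are hypothetical, not from the slides:

```python
def grouped_median(L1, width, n, cum_freq_below, freq_median):
    """median = L1 + ((n/2 - cum_freq_below) / freq_median) * width"""
    return L1 + ((n / 2 - cum_freq_below) / freq_median) * width

# Hypothetical grouped data: the median class is 60-70 with frequency 8,
# total n = 20, and 6 observations fall below 60.
print(grouped_median(L1=60, width=10, n=20, cum_freq_below=6, freq_median=8))
# -> 65.0
```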
Symmetric vs. Skewed Data

• Median, mean, and mode of symmetric, positively skewed, and negatively skewed data

[Figure: three frequency curves (symmetric, positively skewed, negatively skewed) showing the relative positions of the mean, median, and mode]
Measuring the Dispersion of Data

• Quartiles, outliers, and boxplots
  • Quartiles: Q1 (25th percentile), Q3 (75th percentile)
  • Inter-quartile range: IQR = Q3 - Q1
  • Five-number summary: min, Q1, median, Q3, max
  • Boxplot: the ends of the box are the quartiles; the median is marked; add whiskers, and plot outliers individually
  • Outlier: usually, a value more than 1.5 x IQR above Q3 or below Q1
• Variance and standard deviation (sample: s, population: σ)
  • Variance (algebraic, scalable computation; a sketch follows below):

    $s^2 = \frac{1}{n-1}\sum_{i=1}^{n}(x_i - \bar{x})^2 = \frac{1}{n-1}\left[\sum_{i=1}^{n} x_i^2 - \frac{1}{n}\left(\sum_{i=1}^{n} x_i\right)^2\right]$

    $\sigma^2 = \frac{1}{N}\sum_{i=1}^{N}(x_i - \mu)^2 = \frac{1}{N}\sum_{i=1}^{N} x_i^2 - \mu^2$

  • Standard deviation s (or σ) is the square root of the variance s² (or σ²)
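The "scalable" form matters because it needs only running sums, not a stored dataset. A minimal sketch contrasting the two algebraically equivalent computations (using the X column from the later worked example):

```python
def sample_variance_two_pass(xs):
    # Direct definition: first compute the mean, then the squared deviations.
    n = len(xs)
    mean = sum(xs) / n
    return sum((x - mean) ** 2 for x in xs) / (n - 1)

def sample_variance_one_pass(xs):
    # s^2 = (1/(n-1)) * [sum(x^2) - (sum(x))^2 / n]; needs only running sums,
    # so it works in a single streaming pass over the data.
    n = s = s2 = 0
    for x in xs:
        n, s, s2 = n + 1, s + x, s2 + x * x
    return (s2 - s * s / n) / (n - 1)

xs = [5, 6, 8, 10, 12, 13, 15, 16, 17]
print(sample_variance_two_pass(xs), sample_variance_one_pass(xs))  # both 19.0
```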
Boxplot Analysis
• Five-number summary of a distribution
• Minimum, Q1, Median, Q3, Maximum
• Boxplot
• Data is represented with a box
• The ends of the box are at the first and third quartiles,
i.e., the height of the box is IQR
• The median is marked by a line within the box
• Whiskers: two lines outside the box extended to
Minimum and Maximum
• Outliers: points beyond a specified outlier threshold,
plotted individually
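As an illustration, a short matplotlib sketch that draws the boxplot just described for the reconstructed Galton columns (matplotlib is an assumed dependency, not something the slides use):

```python
import matplotlib.pyplot as plt

child  = [61.7] * 5 + [62.2] * 7 + [63.2] * 10
parent = [70.5, 68.5, 65.5, 64.5, 64.0, 67.5, 67.5, 67.5, 66.5, 66.5,
          66.5, 64.5, 70.5, 69.5, 68.5, 68.5, 68.5, 68.5, 68.5, 68.5,
          68.5, 67.5]

# whis=1.5 draws whiskers to the last points within 1.5 * IQR of the box;
# anything beyond that is plotted individually as an outlier.
plt.boxplot([child, parent], whis=1.5)
plt.xticks([1, 2], ["child", "parent"])
plt.ylabel("height (inches)")
plt.show()
```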
Curve Fitting Motivation
Least Squares Estimates in Simple Linear Regression (Best Fit)

$Y_i = a + bX_i + e_i$

- a and b are the least squares estimates and e is the residual.
- a is the intercept.
- b is the slope (it may be positive or negative, depending on the direction of the relationship between X and Y).
- e is the residual (the deviation of the observed value from its estimate).
- What to minimize? The error for each observation is

  $e_i = Y_i - a - bX_i$

  and we minimize the sum of squared errors $\sum_i e_i^2$.
- For this purpose, we need to find the minimum of this function using partial derivatives (you may be familiar with derivatives).
- Finally, we get the following equations.
Best Fit

- These equations are known as the normal equations; they are solved simultaneously to find the coefficients:

  $\sum Y_i = na + b\sum X_i$

  $\sum X_i Y_i = a\sum X_i + b\sum X_i^2$

- Alternatively:

  $b = \frac{n\sum XY - \sum X \sum Y}{n\sum X^2 - \left(\sum X\right)^2} = \frac{\sum (X - \bar{X})(Y - \bar{Y})}{\sum (X - \bar{X})^2}$

  $a = \frac{\sum Y \sum X^2 - \sum X \sum XY}{n\sum X^2 - \left(\sum X\right)^2}$
Example

X     Y     XY     X²     Ŷ        Y - Ŷ    (Y - Ŷ)²
5     16    80     25     15.625    0.375   0.140625
6     19    114    36     18.456    0.544   0.295936
8     23    184    64     24.118   -1.118   1.249924
10    28    280    100    29.780   -1.780   3.168400
12    36    432    144    35.442    0.558   0.311364
13    41    533    169    38.273    2.727   7.436529
15    44    660    225    43.935    0.065   0.004225
16    45    720    256    46.766   -1.766   3.118756
17    50    850    289    49.597    0.403   0.162409
Σ 102  302   3853   1308            0.008  15.888168

a = 1.47    b = 2.831    SE = 1.506565
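As a check, a minimal Python sketch (standard library only) that solves the normal equations for this data and reproduces the coefficients and standard error above:

```python
import math

X = [5, 6, 8, 10, 12, 13, 15, 16, 17]
Y = [16, 19, 23, 28, 36, 41, 44, 45, 50]
n = len(X)

Sx, Sy = sum(X), sum(Y)
Sxy = sum(x * y for x, y in zip(X, Y))
Sx2 = sum(x * x for x in X)

b = (n * Sxy - Sx * Sy) / (n * Sx2 - Sx ** 2)  # slope
a = (Sy - b * Sx) / n                          # intercept

resid = [y - (a + b * x) for x, y in zip(X, Y)]
see = math.sqrt(sum(e * e for e in resid) / (n - 1 - 1))  # k = 1 predictor

print(f"a = {a:.3f}, b = {b:.3f}, SEE = {see:.3f}")
# a = 1.469, b = 2.831, SEE = 1.507  (the slide rounds a to 1.47)
```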
Evaluation

• S.D. of regression / Standard Error of Estimate (SEE)
• The degree of scatter of the observed values about the regression line is measured by the SEE:

  $SEE\ (s) = \sqrt{\frac{\sum \left(Y - \hat{Y}\right)^2}{n - k - 1}}$

  where k is the number of independent variables.
• Inferences:
  • SEE = 0 iff all observed values fall on the regression line.
  • Approximately 68%, 95.4%, and 99.7% of observations lie within $\hat{Y} \pm SEE$, $\hat{Y} \pm 2 \cdot SEE$, and $\hat{Y} \pm 3 \cdot SEE$, respectively.
Residuals

• Residuals represent the variation left unexplained by our model. We emphasize the difference between residuals and errors.
• The errors are the unobservable deviations from the true (unknown) coefficients, while the residuals are the observable deviations from the estimated coefficients.
• In a sense, the residuals are estimates of the errors.
Coefficient of Determination (R²)

• Total variation: $Y_i - \bar{Y}$
• Unexplained variation: $Y_i - \hat{Y}_i$
• Explained variation: $\hat{Y}_i - \bar{Y}$

[Figure: scatter plot with the fitted line $\hat{Y} = a + bX$, decomposing the total deviation $Y_i - \bar{Y}$ at a point $X_i$ into explained and unexplained parts]
Coefficient of Determination (R²)

• The proportion of the variability in the dependent variable explained by the independent variable.
• It is the ratio of the explained variation to the total variation:

  $R^2 = \frac{\sum \left(\hat{Y}_i - \bar{Y}\right)^2}{\sum \left(Y_i - \bar{Y}\right)^2} = \frac{ESS}{TSS} = 1 - \frac{RSS}{TSS}$

• Total SS = Explained (regression) SS + Residual (error) SS, i.e. TSS = ESS + RSS.
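A small sketch of this decomposition on the earlier worked example; a and b are the rounded slide values, so ESS/TSS and 1 - RSS/TSS agree only approximately rather than exactly:

```python
X = [5, 6, 8, 10, 12, 13, 15, 16, 17]
Y = [16, 19, 23, 28, 36, 41, 44, 45, 50]
a, b = 1.47, 2.831  # rounded coefficients from the example slide

y_bar = sum(Y) / len(Y)
Y_hat = [a + b * x for x in X]

tss = sum((y - y_bar) ** 2 for y in Y)               # total variation
ess = sum((yh - y_bar) ** 2 for yh in Y_hat)         # explained variation
rss = sum((y - yh) ** 2 for y, yh in zip(Y, Y_hat))  # unexplained variation

print(f"TSS={tss:.2f}  ESS={ess:.2f}  RSS={rss:.2f}")
print(f"R^2 = ESS/TSS = {ess / tss:.4f}  ~=  1 - RSS/TSS = {1 - rss / tss:.4f}")
```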
Coefficient of Determination

• Multiple R:
  • The strength of the linear relationship between the actual and estimated values of the dependent variable.
  • When the regression model includes only one independent variable, the multiple R statistic is the square root of the R² statistic.
Effects of Outliers on Regression

• Change a single observation in the data into an outlier and compare the results.
• Outliers need to be handled effectively.
Multiple Regression

• Most real-world problems cannot be explained with a single variable.
• They may require more than one independent variable.
• E.g., crop yield: does it depend only on the land?
Multiple Regression cont’d
• The yield of a crop depends on many factors, e.g.:
  • Fertility of the land
  • Fertilizer applied
  • Rainfall
  • Quality of seed
  • Use of chemicals against pests, etc.
• Regression with more than one independent variable is known as multiple regression.
Multiple Regression cont’d

$Y_i = a + b_1 X_{1i} + b_2 X_{2i} + \dots + b_n X_{ni} + e_i$

• The objective is to find the optimal values of the coefficients of the above function.
• The function to be minimized (shown here for two independent variables):

  $\hat{Y}_i = a + b_1 X_{1i} + b_2 X_{2i}$

  $\sum e_i^2 = \sum \left(Y_i - \hat{Y}_i\right)^2 = \sum \left(Y_i - a - b_1 X_{1i} - b_2 X_{2i}\right)^2$

• We get the normal equations and solve them simultaneously to obtain the unknown parameter values:

  $\sum Y = na + b_1 \sum X_1 + b_2 \sum X_2$

  $\sum X_1 Y = a\sum X_1 + b_1 \sum X_1^2 + b_2 \sum X_1 X_2$

  $\sum X_2 Y = a\sum X_2 + b_1 \sum X_1 X_2 + b_2 \sum X_2^2$
Example

• As a data science expert, you want to predict the income of restaurants using two independent variables: the number of restaurant employees and the restaurant floor area.
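A minimal numpy sketch of this scenario; the income, employee-count, and floor-area figures below are invented purely for illustration, since the slides give no data:

```python
import numpy as np

X1 = np.array([4, 6, 7, 9, 12, 15])        # number of employees (hypothetical)
X2 = np.array([50, 60, 80, 90, 120, 150])  # floor area in m^2 (hypothetical)
Y  = np.array([40, 55, 65, 80, 105, 130])  # monthly income (hypothetical)

n = len(Y)
# Coefficient matrix and right-hand side of the three normal equations above
A = np.array([[n,         X1.sum(),       X2.sum()],
              [X1.sum(),  (X1**2).sum(),  (X1*X2).sum()],
              [X2.sum(),  (X1*X2).sum(),  (X2**2).sum()]])
rhs = np.array([Y.sum(), (X1*Y).sum(), (X2*Y).sum()])

a, b1, b2 = np.linalg.solve(A, rhs)  # solve the system simultaneously
print(f"Y_hat = {a:.2f} + {b1:.2f}*X1 + {b2:.2f}*X2")
```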
Inflating R²

• R² always tends to increase as you add independent variables.
• But in reality, this does not mean the model is actually better.
• Solution?
Adjusted R²

• To see the actual picture of model performance in multiple regression, you need to focus on the adjusted R²:

  $\text{Adjusted } R^2 = 1 - \frac{RSS}{TSS} \cdot \frac{n-1}{n-k-1} = 1 - \left(1 - R^2\right)\frac{n-1}{n-k-1}$
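A one-function sketch of this correction, showing how a fixed plain R² is penalized as k grows while n stays constant:

```python
def adjusted_r2(r2, n, k):
    # 1 - (RSS/TSS) * (n - 1) / (n - k - 1), where RSS/TSS = 1 - R^2
    return 1 - (1 - r2) * (n - 1) / (n - k - 1)

# With n = 10 observations and a plain R^2 of 0.90, the adjusted value
# drops as predictors are added: 0.8875, 0.8500, 0.7750, 0.5500.
for k in (1, 3, 5, 7):
    print(f"k={k}: adjusted R^2 = {adjusted_r2(r2=0.90, n=10, k=k):.4f}")
```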
Types of Regressions
