Regression 2

The document provides an overview of regression analysis, introduced by Sir Francis Galton, focusing on the relationship between children's heights and their parents'. It covers key concepts such as central tendency, dispersion, least squares estimates, and the effects of outliers, as well as the distinction between simple and multiple regression. Additionally, it discusses the coefficient of determination (R²) and the importance of adjusted R² in evaluating model performance.


Regression

Brief Introduction
• Regression: introduced by Sir Francis Galton (1822-1911).
• Experiment based on the heights of children vs. the heights of their parents.
• Observations:
  • Tall parents have tall children.
  • Short parents have short children.
• Findings:
  • The average height of children tends to "step back", or regress, toward the average height of all men.
  • This tendency toward the average height of all men is called regression.
Galton dataset: heights of children vs. their parents (first rows, in inches)

child    parent
61.7     70.5
61.7     68.5
61.7     65.5
61.7     64.5
61.7     64.0
62.2     67.5
62.2     67.5
62.2     67.5
62.2     66.5
62.2     66.5
62.2     66.5
62.2     64.5
63.2     70.5
63.2     69.5
63.2     68.5
63.2     68.5
63.2     68.5
63.2     68.5
63.2     68.5
63.2     68.5
63.2     68.5
63.2     67.5

Tasks:
• Find the basic statistics (mean, variance, standard deviation).
• Find the outliers (if any). How to find outliers: use the five-number summary and the IQR.
• How to resolve outlier problems? An outlier can either be replaced with the mean, or the instance can be removed.
• After pre-processing, plot a histogram to see the data's behavior.
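As a quick illustration of these steps, here is a minimal Python sketch (standard library only); the input lists are the reconstructed child/parent rows above:

```python
import statistics

child  = [61.7] * 5 + [62.2] * 7 + [63.2] * 10
parent = [70.5, 68.5, 65.5, 64.5, 64.0,
          67.5, 67.5, 67.5, 66.5, 66.5, 66.5, 64.5,
          70.5, 69.5, 68.5, 68.5, 68.5, 68.5, 68.5, 68.5, 68.5, 67.5]

def summarize(name, xs):
    xs = sorted(xs)
    q1, med, q3 = statistics.quantiles(xs, n=4)   # quartiles for the five-number summary
    iqr = q3 - q1
    lo, hi = q1 - 1.5 * iqr, q3 + 1.5 * iqr       # 1.5 * IQR fences
    outliers = [x for x in xs if x < lo or x > hi]
    print(name, "mean=%.2f var=%.2f sd=%.2f" % (
        statistics.mean(xs), statistics.variance(xs), statistics.stdev(xs)))
    print("  five-number summary:", xs[0], q1, med, q3, xs[-1])
    print("  outliers (1.5*IQR rule):", outliers or "none")

summarize("child ", child)
summarize("parent", parent)
```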
Measuring the Central Tendency

• Mean (algebraic measure), sample vs. population (note: $n$ is the sample size and $N$ is the population size):

  $\bar{x} = \frac{1}{n}\sum_{i=1}^{n} x_i \qquad \mu = \frac{1}{N}\sum_{i=1}^{N} x_i$

• Weighted arithmetic mean:

  $\bar{x} = \frac{\sum_{i=1}^{n} w_i x_i}{\sum_{i=1}^{n} w_i}$

• Median:
  • Middle value if odd number of values, or average of the middle two values otherwise
  • Estimated by interpolation for grouped data (see the sketch after this list):

    $\text{median} = L_1 + \left(\frac{n/2 - (\sum \text{freq})_l}{\text{freq}_{\text{median}}}\right) \times \text{width}$

• Mode:
  • Value that occurs most frequently in the data
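The interpolation formula for grouped data is easy to misread, so here is a small sketch of it; the class boundaries and frequencies below are hypothetical, not from the slides:

```python
def grouped_median(L1, width, n, cum_freq_below, freq_median):
    """median = L1 + ((n/2 - cum_freq_below) / freq_median) * width"""
    return L1 + ((n / 2 - cum_freq_below) / freq_median) * width

# Hypothetical grouped data: the median class is 60-70 with frequency 8,
# total n = 20, and 6 observations fall below 60.
print(grouped_median(L1=60, width=10, n=20, cum_freq_below=6, freq_median=8))
# -> 65.0
```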
Symmetric vs. Skewed Data

• Median, mean, and mode of symmetric, positively skewed, and negatively skewed data

[Figure: three frequency curves (symmetric, positively skewed, negatively skewed) showing the relative positions of the mean, median, and mode]
Measuring the Dispersion of Data

• Quartiles, outliers, and boxplots
  • Quartiles: Q1 (25th percentile), Q3 (75th percentile)
  • Inter-quartile range: IQR = Q3 - Q1
  • Five-number summary: min, Q1, median, Q3, max
  • Boxplot: the ends of the box are the quartiles; the median is marked; add whiskers, and plot outliers individually
  • Outlier: usually, a value more than 1.5 x IQR above Q3 or below Q1
• Variance and standard deviation (sample: s, population: σ)
  • Variance (algebraic, scalable computation; a sketch follows below):

    $s^2 = \frac{1}{n-1}\sum_{i=1}^{n}(x_i - \bar{x})^2 = \frac{1}{n-1}\left[\sum_{i=1}^{n} x_i^2 - \frac{1}{n}\left(\sum_{i=1}^{n} x_i\right)^2\right]$

    $\sigma^2 = \frac{1}{N}\sum_{i=1}^{N}(x_i - \mu)^2 = \frac{1}{N}\sum_{i=1}^{N} x_i^2 - \mu^2$

  • Standard deviation s (or σ) is the square root of the variance s² (or σ²)
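The "scalable" form matters because it needs only running sums, not a stored dataset. A minimal sketch contrasting the two algebraically equivalent computations (using the X column from the later worked example):

```python
def sample_variance_two_pass(xs):
    # Direct definition: first compute the mean, then the squared deviations.
    n = len(xs)
    mean = sum(xs) / n
    return sum((x - mean) ** 2 for x in xs) / (n - 1)

def sample_variance_one_pass(xs):
    # s^2 = (1/(n-1)) * [sum(x^2) - (sum(x))^2 / n]; needs only running sums,
    # so it works in a single streaming pass over the data.
    n = s = s2 = 0
    for x in xs:
        n, s, s2 = n + 1, s + x, s2 + x * x
    return (s2 - s * s / n) / (n - 1)

xs = [5, 6, 8, 10, 12, 13, 15, 16, 17]
print(sample_variance_two_pass(xs), sample_variance_one_pass(xs))  # both 19.0
```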
Boxplot Analysis
• Five-number summary of a distribution
• Minimum, Q1, Median, Q3, Maximum
• Boxplot
• Data is represented with a box
• The ends of the box are at the first and third quartiles,
i.e., the height of the box is IQR
• The median is marked by a line within the box
• Whiskers: two lines outside the box extended to
Minimum and Maximum
• Outliers: points beyond a specified outlier threshold,
plotted individually
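As an illustration, a short matplotlib sketch that draws the boxplot just described for the reconstructed Galton columns (matplotlib is an assumed dependency, not something the slides use):

```python
import matplotlib.pyplot as plt

child  = [61.7] * 5 + [62.2] * 7 + [63.2] * 10
parent = [70.5, 68.5, 65.5, 64.5, 64.0, 67.5, 67.5, 67.5, 66.5, 66.5,
          66.5, 64.5, 70.5, 69.5, 68.5, 68.5, 68.5, 68.5, 68.5, 68.5,
          68.5, 67.5]

# whis=1.5 draws whiskers to the last points within 1.5 * IQR of the box;
# anything beyond that is plotted individually as an outlier.
plt.boxplot([child, parent], whis=1.5)
plt.xticks([1, 2], ["child", "parent"])
plt.ylabel("height (inches)")
plt.show()
```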
Curve Fitting Motivation
Least Squares Estimates in Simple Linear Regression (Best Fit)

$Y_i = a + bX_i + e_i$

- a and b are the least squares estimates and e is the residual.
- a is the intercept.
- b is the slope (it may be positive or negative, depending on the direction of the relationship between X and Y).
- e is the residual (the deviation of the observed value from its estimate).
- What to minimize? The error for each observation is

  $e_i = Y_i - a - bX_i$

  and we minimize the sum of squared errors $\sum_i e_i^2$.
- For this purpose, we need to find the minimum of this function using partial derivatives (you may be familiar with derivatives).
- Finally, we get the following equations.
Best Fit

- These equations are known as the normal equations; they are solved simultaneously to find the coefficients:

  $\sum Y_i = na + b\sum X_i$

  $\sum X_i Y_i = a\sum X_i + b\sum X_i^2$

- Alternatively:

  $b = \frac{n\sum XY - \sum X \sum Y}{n\sum X^2 - \left(\sum X\right)^2} = \frac{\sum (X - \bar{X})(Y - \bar{Y})}{\sum (X - \bar{X})^2}$

  $a = \frac{\sum Y \sum X^2 - \sum X \sum XY}{n\sum X^2 - \left(\sum X\right)^2}$
Example

X     Y     XY     X²     Ŷ        Y - Ŷ    (Y - Ŷ)²
5     16    80     25     15.625    0.375   0.140625
6     19    114    36     18.456    0.544   0.295936
8     23    184    64     24.118   -1.118   1.249924
10    28    280    100    29.780   -1.780   3.168400
12    36    432    144    35.442    0.558   0.311364
13    41    533    169    38.273    2.727   7.436529
15    44    660    225    43.935    0.065   0.004225
16    45    720    256    46.766   -1.766   3.118756
17    50    850    289    49.597    0.403   0.162409
Σ 102  302   3853   1308            0.008  15.888168

a = 1.47    b = 2.831    SE = 1.506565
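As a check, a minimal Python sketch (standard library only) that solves the normal equations for this data and reproduces the coefficients and standard error above:

```python
import math

X = [5, 6, 8, 10, 12, 13, 15, 16, 17]
Y = [16, 19, 23, 28, 36, 41, 44, 45, 50]
n = len(X)

Sx, Sy = sum(X), sum(Y)
Sxy = sum(x * y for x, y in zip(X, Y))
Sx2 = sum(x * x for x in X)

b = (n * Sxy - Sx * Sy) / (n * Sx2 - Sx ** 2)  # slope
a = (Sy - b * Sx) / n                          # intercept

resid = [y - (a + b * x) for x, y in zip(X, Y)]
see = math.sqrt(sum(e * e for e in resid) / (n - 1 - 1))  # k = 1 predictor

print(f"a = {a:.3f}, b = {b:.3f}, SEE = {see:.3f}")
# a = 1.469, b = 2.831, SEE = 1.507  (the slide rounds a to 1.47)
```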
Evaluation

• S.D. of regression / Standard Error of Estimate (SEE)
• The degree of scatter of the observed values about the regression line is measured by the SEE:

  $SEE\ (s) = \sqrt{\frac{\sum \left(Y - \hat{Y}\right)^2}{n - k - 1}}$

  where k is the number of independent variables.
• Inferences:
  • SEE = 0 iff all observed values fall on the regression line.
  • Approximately 68%, 95.4%, and 99.7% of observations lie within $\hat{Y} \pm SEE$, $\hat{Y} \pm 2 \cdot SEE$, and $\hat{Y} \pm 3 \cdot SEE$, respectively.
Residuals

• Residuals represent the variation left unexplained by our model. We emphasize the difference between residuals and errors.
• The errors are the unobservable deviations from the true (unknown) coefficients, while the residuals are the observable deviations from the estimated coefficients.
• In a sense, the residuals are estimates of the errors.
Coefficient of Determination (R²)

• Total variation: $Y_i - \bar{Y}$
• Unexplained variation: $Y_i - \hat{Y}_i$
• Explained variation: $\hat{Y}_i - \bar{Y}$

[Figure: scatter plot with the fitted line $\hat{Y} = a + bX$, decomposing the total deviation $Y_i - \bar{Y}$ at a point $X_i$ into explained and unexplained parts]
Coefficient of Determination (R²)

• The proportion of the variability in the dependent variable explained by the independent variable.
• It is the ratio of the explained variation to the total variation:

  $R^2 = \frac{\sum \left(\hat{Y}_i - \bar{Y}\right)^2}{\sum \left(Y_i - \bar{Y}\right)^2} = \frac{ESS}{TSS} = 1 - \frac{RSS}{TSS}$

• Total SS = Explained (regression) SS + Residual (error) SS, i.e. TSS = ESS + RSS.
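A small sketch of this decomposition on the earlier worked example; a and b are the rounded slide values, so ESS/TSS and 1 - RSS/TSS agree only approximately rather than exactly:

```python
X = [5, 6, 8, 10, 12, 13, 15, 16, 17]
Y = [16, 19, 23, 28, 36, 41, 44, 45, 50]
a, b = 1.47, 2.831  # rounded coefficients from the example slide

y_bar = sum(Y) / len(Y)
Y_hat = [a + b * x for x in X]

tss = sum((y - y_bar) ** 2 for y in Y)               # total variation
ess = sum((yh - y_bar) ** 2 for yh in Y_hat)         # explained variation
rss = sum((y - yh) ** 2 for y, yh in zip(Y, Y_hat))  # unexplained variation

print(f"TSS={tss:.2f}  ESS={ess:.2f}  RSS={rss:.2f}")
print(f"R^2 = ESS/TSS = {ess / tss:.4f}  ~=  1 - RSS/TSS = {1 - rss / tss:.4f}")
```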
Coefficient of Determination

• Multiple R:
  • The strength of the linear relationship between the actual and estimated values of the dependent variable.
  • When the regression model includes only one independent variable, the multiple R statistic is the square root of the R² statistic.
Effects of Outliers on Regression

• Change a single observation in the data into an outlier and compare the results.
• Outliers need to be handled effectively.
Multiple Regression

• Most real-world problems cannot be explained with a single variable.
• They may require more than one independent variable.
• E.g., crop yield: does it depend only on the land?
Multiple Regression cont’d
• The yield of a crop depends on many factors, e.g.:
  • Fertility of the land
  • Fertilizer applied
  • Rainfall
  • Quality of seed
  • Use of chemicals against pests, etc.
• Regression with more than one independent variable is known as multiple regression.
Multiple Regression cont’d

$Y_i = a + b_1 X_{1i} + b_2 X_{2i} + \dots + b_n X_{ni} + e_i$

• The objective is to find the optimal values of the coefficients of the above function.
• The function to be minimized (shown here for two independent variables):

  $\hat{Y}_i = a + b_1 X_{1i} + b_2 X_{2i}$

  $\sum e_i^2 = \sum \left(Y_i - \hat{Y}_i\right)^2 = \sum \left(Y_i - a - b_1 X_{1i} - b_2 X_{2i}\right)^2$

• We get the normal equations and solve them simultaneously to obtain the unknown parameter values:

  $\sum Y = na + b_1 \sum X_1 + b_2 \sum X_2$

  $\sum X_1 Y = a\sum X_1 + b_1 \sum X_1^2 + b_2 \sum X_1 X_2$

  $\sum X_2 Y = a\sum X_2 + b_1 \sum X_1 X_2 + b_2 \sum X_2^2$
Example

• As a data science expert, you want to predict the income of restaurants using two independent variables: the number of restaurant employees and the restaurant floor area.
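A minimal numpy sketch of this scenario; the income, employee-count, and floor-area figures below are invented purely for illustration, since the slides give no data:

```python
import numpy as np

X1 = np.array([4, 6, 7, 9, 12, 15])        # number of employees (hypothetical)
X2 = np.array([50, 60, 80, 90, 120, 150])  # floor area in m^2 (hypothetical)
Y  = np.array([40, 55, 65, 80, 105, 130])  # monthly income (hypothetical)

n = len(Y)
# Coefficient matrix and right-hand side of the three normal equations above
A = np.array([[n,         X1.sum(),       X2.sum()],
              [X1.sum(),  (X1**2).sum(),  (X1*X2).sum()],
              [X2.sum(),  (X1*X2).sum(),  (X2**2).sum()]])
rhs = np.array([Y.sum(), (X1*Y).sum(), (X2*Y).sum()])

a, b1, b2 = np.linalg.solve(A, rhs)  # solve the system simultaneously
print(f"Y_hat = {a:.2f} + {b1:.2f}*X1 + {b2:.2f}*X2")
```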
Inflating R²

• R² always tends to increase as you add independent variables.
• But in reality, this does not mean the model is actually better.
• Solution?
Adjusted R²

• To see the actual picture of model performance in multiple regression, you need to focus on the adjusted R²:

  $\text{Adjusted } R^2 = 1 - \frac{RSS}{TSS} \cdot \frac{n-1}{n-k-1} = 1 - \left(1 - R^2\right)\frac{n-1}{n-k-1}$
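A one-function sketch of this correction, showing how a fixed plain R² is penalized as k grows while n stays constant:

```python
def adjusted_r2(r2, n, k):
    # 1 - (RSS/TSS) * (n - 1) / (n - k - 1), where RSS/TSS = 1 - R^2
    return 1 - (1 - r2) * (n - 1) / (n - k - 1)

# With n = 10 observations and a plain R^2 of 0.90, the adjusted value
# drops as predictors are added: 0.8875, 0.8500, 0.7750, 0.5500.
for k in (1, 3, 5, 7):
    print(f"k={k}: adjusted R^2 = {adjusted_r2(r2=0.90, n=10, k=k):.4f}")
```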
Types of Regressions
