
CS 5163 Intro to Data Sci
Regression Analysis
Machine learning intro
• Regression
• Classification
• Clustering
• Dimension reduction
• Feature selection
This topic
• Simple linear regression
• Multiple linear regression
• Ridge regression
• Lasso
• Logistic regression (for classification)
Recall: Pearson Correlation Coefficient
Linear regression
• In correlation, the two variables are treated as equals. In regression, one variable is considered the independent (= predictor) variable (X) and the other the dependent (= outcome) variable (Y).
• The output of a regression is a function that predicts the dependent variable based upon values of the independent variables.
• Simple regression fits a straight line to the data.


What is “Linear”?
• Remember this: Y = α + βX
[figure: a line with intercept α (value of Y at X = 0) and slope β]
• A slope of β means that every 1-unit change in X yields a β-unit change in Y.
Prediction
If you know something about X, this knowledge helps you
predict something about Y. (Sound familiar?…sound like
conditional probabilities?)
Regression equation…

Expected value of y at a given level of x:
E(yi | xi) = α + β·xi

Predicted value for an individual:
yi = α + β·xi + random error εi
(α + β·xi is fixed, i.e., exactly on the line; the error εi follows a normal distribution.)
Examples
Number of Friends vs Daily Minutes Online

What does best fit mean?
Best fit line: Y = 0.9039·X + 22.95

The standard error of Y given X (Sy/x) is the average variability around the regression line at any given value of X. It is assumed to be equal at all values of X.
Simple linear regression

[figure: scatter of the dependent variable (y) against the independent variable (x), with the fitted line]

The function will make a prediction for each observed data point.
The observation is denoted by y and the prediction is denoted by ŷ.
Regression Error
[figure: observation y, prediction ŷ, and the prediction error ε between them]

For each observation, the variation can be described as:
y = ŷ + ε
Actual = Explained + Error
Sum of squares of error (SSE)

[figure: squared prediction errors around the fitted line, dependent variable vs independent variable (x)]

A least squares regression selects the line with the lowest total sum of squared prediction errors.
This value is called the Sum of Squares of Error, or SSE.
Sum of squares of regression (SSR)

[figure: squared differences between the fitted line and the population mean ȳ]

The Sum of Squares Regression (SSR) is the sum of the squared differences between the prediction for each observation and the population mean ȳ.
SST, SSR and SSE

The Total Sum of Squares (SST) is equal to SSR + SSE.

Mathematically,

SSR = ∑(ŷ − ȳ)²   (measure of explained variation)

SSE = ∑(y − ŷ)²   (measure of unexplained variation)

SST = SSR + SSE = ∑(y − ȳ)²   (measure of total variation in y)


Least square regression
[figure: for an observation (xi, yi) and the fitted line ŷi = b·xi + a: A = yi − ȳ, B = ŷi − ȳ, C = yi − ŷi]

Least squares estimation gives us the parameters (a, b) that minimize ∑C², i.e., the SSE.

∑(yi − ȳ)² = ∑(ŷi − ȳ)² + ∑(ŷi − yi)²   (equality holds when the least squares solution is found)
    A²             B²             C²

A² — SST: total variation in y (total squared distance of the observations from the naïve mean of y)
B² — SSR: variation explained by x (distance from the regression line to the naïve mean of y)
C² — SSE: unexplained variance (variance around the regression line)
The Coefficient of Determination (aka R-squared)

The proportion of total variation (SST) that is explained by the regression (SSR) is known as the Coefficient of Determination, and is often referred to as R².

R² = B²/A² = SSR/SST = 1 − SSE/SST

The value of R² can range between 0 and 1, and the higher its value the more accurate the regression model is. It is often referred to as a percentage.
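A minimal numpy sketch of these quantities (the toy data and variable names are chosen here for illustration, not taken from the slides):

import numpy as np

# toy data and an ordinary least-squares line (np.polyfit returns slope, intercept)
x = np.array([1, 2, 3, 4, 5], dtype=float)
y = np.array([2.1, 3.9, 6.2, 8.1, 9.8])
b, a = np.polyfit(x, y, 1)              # slope b, intercept a
y_hat = a + b * x                       # predictions

sse = np.sum((y - y_hat) ** 2)          # unexplained variation
ssr = np.sum((y_hat - y.mean()) ** 2)   # explained variation
sst = np.sum((y - y.mean()) ** 2)       # total variation

print(ssr + sse, sst)                   # equal up to rounding: SST = SSR + SSE
print(ssr / sst, 1 - sse / sst)         # two equivalent ways to get R^2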
Solutions for least square fit
• In general, the fit can be found with optimization algorithms, e.g., gradient descent.
• In simple linear regression (single predictor), the solution can be calculated easily:
  β̂ = r·(sy/sx) and α̂ = ȳ − β̂·x̄
  where r is the Pearson correlation between X and Y, and sx, sy are their standard deviations.
  (Recall: in correlation the two variables are treated as equals; in regression, X is the independent (= predictor) variable and Y the dependent (= outcome) variable.)
• Also, R² = r².
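A small from-scratch sketch of this closed-form solution, again on illustrative toy data:

import numpy as np

def simple_linear_fit(x, y):
    # beta = r * s_y / s_x, alpha = mean(y) - beta * mean(x)
    r = np.corrcoef(x, y)[0, 1]
    beta = r * y.std(ddof=1) / x.std(ddof=1)
    alpha = y.mean() - beta * x.mean()
    return alpha, beta, r

x = np.array([1, 2, 3, 4, 5], dtype=float)
y = np.array([2.0, 4.1, 5.9, 8.2, 10.1])
alpha, beta, r = simple_linear_fit(x, y)
print(alpha, beta, r ** 2)   # for simple regression, r**2 equals the R^2 of the fit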
Significance test
• Is the coefficient significantly different from 0?
• Distribution of the slope estimate: β̂ ~ Tn−2(β, s.e.(β̂))
• H0: β1 = 0 (no linear relationship)
• H1: β1 ≠ 0 (a linear relationship does exist)
• Test statistic: Tn−2 = (β̂ − 0) / s.e.(β̂)
Significance test

s.e.(β̂) = sβ̂ = sqrt( [ ∑(yi − ŷi)² / (n − 2) ] / SSx ) = sy/x / sqrt(SSx)

where SSx = ∑(xi − x̄)² and ŷi = α̂ + β̂·xi
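A hedged sketch of these formulas on toy data (np.polyfit is used only to obtain the slope and intercept; all names are illustrative):

import numpy as np
from scipy import stats

x = np.array([1, 2, 3, 4, 5], dtype=float)
y = np.array([2.0, 4.1, 5.9, 8.2, 10.1])
beta, alpha = np.polyfit(x, y, 1)                     # slope and intercept

y_hat = alpha + beta * x
n = len(x)
s_yx = np.sqrt(np.sum((y - y_hat) ** 2) / (n - 2))    # residual standard error s_{y/x}
ss_x = np.sum((x - x.mean()) ** 2)                    # SS_x = sum (x_i - xbar)^2
se_beta = s_yx / np.sqrt(ss_x)                        # s.e.(beta-hat)

t = beta / se_beta                                    # T_{n-2} statistic for H0: beta = 0
pvalue = 2 * stats.t.cdf(-abs(t), n - 2)              # two-sided p-value
print(t, pvalue)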
Number of Friends vs Daily Minutes Online

Best fit line: Y = 0.9039·X + 22.95
a = 22.95
β = 0.9039
ρ = 0.574
R² = 0.329
s.e.(β) = 0.091

T201 = 0.9039 / 0.091 = 9.93

P-value
= 1 − scipy.stats.t.cdf(9.93, 201)
= scipy.stats.t.cdf(-9.93, 201)
= 1.8e-19
Sklearn example (same friends vs minutes data)

In [2629]: from sklearn import linear_model
      ...: from sklearn.metrics import mean_squared_error as mse
      ...: from sklearn.metrics import r2_score as r2
      ...:
      ...: lr = linear_model.LinearRegression()
      ...: lr.fit(x, y)
      ...: pred = lr.predict(x)
      ...:
      ...: print('beta = %.2f' % lr.coef_)
      ...: print('alpha = %.2f' % lr.intercept_)
      ...: print('r2 = %.2f' % r2(y, pred))
      ...: print('mse = %.2f' % mse(y, pred))   # mse: SSE/n
      ...:
beta = 0.90
alpha = 22.95
r2 = 0.33
mse = 65.01
Residual analysis
Residual = observed − predicted

plt.scatter(x, y - pred)
plt.scatter(x, y)
plt.plot(x, pred, '-')
Assumptions (or the fine print)
• Linear regression assumes that…
1. The relationship between X and Y is linear
2. Y is distributed normally at each value of X
3. The variance of Y at every value of X is the same
(homogeneity of variances)
4. The observations are independent
Residual Analysis for Linearity
[figure: Y-vs-x scatter plots with their residual plots; a curved residual pattern indicates a non-linear relationship, a patternless band around zero indicates a linear one]
Residual Analysis for Homoscedasticity
[figure: residual plots showing non-constant variance (residual spread changes with x) vs constant variance]
Residual Analysis for Independence
[figure: residual plots showing dependent (patterned) residuals vs independent (patternless) residuals]
Multiple linear regression…
• More than one independent variable can be used to explain variance in the dependent variable, as long as they are not linearly related.
• A multiple regression takes the form:
  y = A + β1·X1 + β2·X2 + … + βk·Xk + ε
  where k is the number of predictor variables.
• Each regression coefficient is the amount of change in the outcome variable that would be expected per one-unit change of the predictor, if all other variables in the model were held constant.
Multiple linear regression in matrix form

Y = 1·α + X·β + ε

Y: (n,1) observed data; 1: (n,1) dummy column of ones; α: (1,1) and β: (k,1) unknown parameters to be solved; X: (n,k) data; ε: (n,1) error to be minimized.
Multiple linear regression in matrix form - 2

Y = [1 X]·[α; β] + ε
(n,1)   (n,k+1)·(k+1,1)   (n,1)

• Some packages, such as scipy.linalg.lstsq(), need you to add the dummy (all-ones) column yourself; sklearn does it for you.
• k should be smaller than n. Otherwise the system may be underdetermined (an infinite number of solutions with ε = 0).
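A brief sketch contrasting the two options (synthetic data; all names are illustrative, not from the course materials):

import numpy as np
from scipy.linalg import lstsq
from sklearn import linear_model

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3))                  # n = 100, k = 3
y = 1.5 + X @ np.array([2.0, -1.0, 0.5]) + rng.normal(scale=0.1, size=100)

# Option 1: add the dummy (intercept) column yourself, then solve least squares
X1 = np.column_stack([np.ones(len(X)), X])     # shape (n, k+1)
coef, _, _, _ = lstsq(X1, y)                   # coef[0] is alpha, coef[1:] is beta

# Option 2: sklearn handles the intercept for you
lr = linear_model.LinearRegression().fit(X, y)
print(coef[0], lr.intercept_)                  # should agree
print(coef[1:], lr.coef_)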
Boston house price

In [2665]: from sklearn import datasets, linear_model
      ...: boston = datasets.load_boston()
      ...: print(boston.DESCR)

Boston House Prices dataset
===========================
Notes
------
Data Set Characteristics:
:Number of Instances: 506
:Number of Attributes: 13 numeric/categorical predictive
:Median Value (attribute 14) is usually the target
:Missing Attribute Values: None
:Attribute Information (in order):
- CRIM     per capita crime rate by town
- ZN       proportion of residential land zoned for lots over 25,000 sq.ft.
- INDUS    proportion of non-retail business acres per town
- CHAS     Charles River dummy variable (= 1 if tract bounds river; 0 otherwise)
- NOX      nitric oxides concentration (parts per 10 million)
- RM       average number of rooms per dwelling
- AGE      proportion of owner-occupied units built prior to 1940
- DIS      weighted distances to five Boston employment centres
- RAD      index of accessibility to radial highways
- TAX      full-value property-tax rate per $10,000
- PTRATIO  pupil-teacher ratio by town
- B        1000(Bk - 0.63)^2 where Bk is the proportion of blacks by town
- LSTAT    % lower status of the population
- MEDV     Median value of owner-occupied homes in $1000's

Please see this example https://www.kaggle.com/code/shreayan98c/boston-house-price-prediction
Boston house price data

In [2669]: data = DataFrame(boston.data, columns=boston.feature_names)
In [2670]: data.boxplot(rot=90)

In [2671]: def zscore(s): return (s - s.mean()) / s.std()
In [2672]: normData = data.apply(zscore)
In [2673]: normData.boxplot(rot=90)
Predicting Boston house price
In [2686]: from sklearn import linear_model
      ...: y = boston.target
      ...: x = normData
      ...: lr = linear_model.LinearRegression()
      ...: lr.fit(x, y)
      ...: pred = lr.predict(x)
      ...: coef = Series(lr.coef_, index=boston.feature_names)
      ...: coef.plot(kind='bar'); plt.ylabel('Coefficient')

In [2690]: r2(y, pred)
Out[2690]: 0.7406077428649428
Without normalization:
In [2690]: r2(y, pred)
Out[2690]: 0.7406077428649428

With normalization:
In [2693]: r2(y, pred)
Out[2693]: 0.7406077428649428

Coefficient value does not necessarily reflect significance/relevance, especially if the data is not normalized.
Overfitting
• The more variables you add, the better the fit
  • Smaller SSE and MSE
  • Larger R²
• True even if the variables are random (i.e., irrelevant)
[figure: fitted coefficients for the real predictors and the fake (random) predictors]
In [2700]: fakedata = np.random.rand(len(y), 10)
...: data2=pd.concat([normData, DataFrame(fakedata)], axis=1)
...: x = data2;
...: lr.fit(x, y)
...: pred = lr.predict(x)
...: r2(y, pred)
...:
Out[2700]: 0.74575835459198614
Bootstrapping
• A way to estimate the distribution given limited data
• Key idea: draw random samples with replacement
x = normData
y = np.array(boston.target)
n = 100
dataSize = len(y)
coef_bs = np.zeros((n, x.shape[1]))
for i in range(n):
    # sample: len(y) random ints from 0 to len(y)-1, repeats allowed
    sample = np.random.choice(np.arange(dataSize), size=dataSize, replace=True)
    newY = y[sample]
    newX = x.iloc[sample, :]
    lr.fit(newX, newY)
    coef_bs[i, :] = lr.coef_

coef_bs = DataFrame(coef_bs, columns=boston.feature_names)
coef_bs.boxplot(rot=90)
Bootstrap with real and fake predictors
• Index 0-12 are real predictors
• Index 13-22 are fake predictors
[figure: boxplots of bootstrapped coefficients, real predictors vs fake predictors]
Significance testing for coefficients using bootstrapping

t = coef / coef_bs.std(axis=0)
pvalue = 2*stats.t.cdf(-np.abs(t), dataSize - data2.shape[1] - 1)

t follows a t-distribution with n − k − 1 degrees of freedom.
[figure: p-values for real predictors vs fake predictors]
Polynomial regression
• e.g. y = A + β1·X1 + β2·X2 + β12·X1·X2 + β11·X1² + β22·X2² + ε

• Precompute the high-degree terms from x1, x2, …

• Still a linear regression (linear in the coefficients).

• Sometimes only the interaction terms are considered, i.e.
  y = A + β1·X1 + β2·X2 + β12·X1·X2 + ε

Can be solved exactly the same way as for regular multiple linear regression; a short sklearn sketch follows below.
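One convenient way to precompute those terms is sklearn's PolynomialFeatures; a minimal sketch on synthetic data (names are illustrative):

import numpy as np
from sklearn.preprocessing import PolynomialFeatures
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(1)
X = rng.uniform(-1, 1, size=(200, 2))     # two predictors x1, x2
y = 1 + 2*X[:, 0] - X[:, 1] + 0.5*X[:, 0]*X[:, 1] + rng.normal(scale=0.1, size=200)

# degree-2 terms: x1, x2, x1^2, x1*x2, x2^2 (intercept handled by LinearRegression);
# interaction_only=True would keep just x1, x2, x1*x2
poly = PolynomialFeatures(degree=2, include_bias=False)
X2 = poly.fit_transform(X)

lr = LinearRegression().fit(X2, y)        # still an ordinary linear regression
print(lr.intercept_, lr.coef_)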
Polynomial regression – overfitting vs underfitting
Multicollinearity
 Multicollinearity arises when two variables that measure the same thing or similar things (e.g., weight and BMI) are both included in a multiple regression model; they will, in effect, cancel each other out and generally destroy your model.

 Model building and diagnostics are tricky business!
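A small synthetic illustration of this effect (assumed names and data, not from the BRFSS example that follows):

import numpy as np
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(2)
x1 = rng.normal(size=500)
x2 = x1 + rng.normal(scale=0.01, size=500)   # nearly identical to x1 (think weight and BMI)
y = 3 * x1 + rng.normal(scale=0.5, size=500)

X = np.column_stack([x1, x2])
lr = LinearRegression().fit(X, y)
# only the sum of the two coefficients (about 3) is stable;
# how that sum splits between x1 and x2 is essentially arbitrary
print(lr.coef_, lr.coef_.sum())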
Predicting height from weight, age and gender

brfss = pd.read_csv('brfss.csv', index_col=0)
brfss2 = brfss.dropna(axis=0, how='any')

y = brfss2['htm3']
x = brfss2.drop(['wtyrago', 'wtkg2', 'htm3'], axis=1).apply(zscore)
lr.fit(x, y)

In [2926]: x.columns
Out[2926]: Index(['age', 'weight2', 'sex'], dtype='object')

In [2928]: pred = lr.predict(x)

In [2929]: r2(y, pred)
Out[2929]: 0.55700902357112936

In [2930]: Series(lr.coef_, index=x.columns).plot(kind='bar')
Out[2930]: <matplotlib.axes._subplots.AxesSubplot at 0x15c3ad6d8>
Including wtyrago/wtkg2 as predictor

In [2934]: r2(y, pred)
Out[2934]: 0.55701938503927018

In [2938]: r2(y, pred)
Out[2938]: 0.55741517914915306

R² barely changes when the nearly redundant weight variables are added.
Correlation between predictors
• weight2, wtyrago and wtkg2 are highly correlated
• weight2 and wtkg2 are almost but not exactly identical

In [2961]: brfss2[['wtkg2','wtyrago']].corrwith(brfss2['weight2'])
Out[2961]:
wtkg2      1.000000
wtyrago    0.936271

plt.imshow(brfss2.corr());
plt.colorbar();
[figure: correlation matrix heatmap of the brfss2 columns]
Ridge regression
• Force all coefficients to be small by adding a penalty term that is the sum of squares of all coefficients, i.e., minimize SSE + α·∑βj² (sklearn's documentation writes the coefficient vector as w rather than β).

x = brfss2.drop(['wtyrago', 'htm3'], axis=1).apply(zscore)
y = brfss2['htm3']
reg = linear_model.Ridge(alpha=1)
reg.fit(x, y)
pred = reg.predict(x)

In [2969]: r2(y, pred)
Out[2969]: 0.55700898468780091

Series(reg.coef_, index=x.columns).plot(kind='bar')
Performance evaluation
• How predictive is the model we learned?
• For regression, usually R2 or MSE
• For classification, many options (discuss later)
• Performance on the training data (data used to build models) is not a
good indicator of performance on future data
• Q: Why?
• A: Because new data will probably not be exactly the same as the training
data!
• Overfitting – fitting the training data too precisely - usually leads to
poor results on new data
Evaluation on “LARGE” data
• If many (thousands) of examples are available, then how can we
evaluate our model?
• A simple evaluation is sufficient
• Randomly split data into training and test sets (e.g. 2/3 for train, 1/3 for test)
• For classification, make sure training and testing have similar distribution of
class labels
• Build a model using the train set and evaluate it using the test set.
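A hedged sketch of this simple split with sklearn's train_test_split, assuming x and y as in the Boston examples above:

from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
from sklearn.metrics import r2_score

# assumes x (predictor DataFrame) and y (targets) from the earlier Boston example;
# for classification, passing stratify=y keeps the class distribution similar in both sets
x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=1/3, random_state=0)

lr = LinearRegression().fit(x_train, y_train)
print(r2_score(y_train, lr.predict(x_train)))   # training performance (optimistic)
print(r2_score(y_test, lr.predict(x_test)))     # test performance (honest estimate)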
Model Evaluation Step 1: Split data into train and test sets
[diagram: data with known results ("the past") is split into a training set and a testing set]
Model Evaluation Step 2: Build a model on a training set
[diagram: the training set (with known results) is fed into the model builder; the testing set is held aside]
Model Evaluation Step 3: Evaluate on test set
[diagram: the model built from the training set makes predictions (Y/N) on the testing set, which are compared with the known results to evaluate the model]
A note on parameter tuning
• It is important that the test data is not used in any way to build the model
• Some learning schemes operate in two stages:
• Stage 1: builds the basic structure
• Stage 2: optimizes parameter settings
• The test data can’t be used for parameter tuning!
• Proper procedure uses three sets: training data, validation data, and test data
• Validation data is used to optimize parameters
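A sketch of this three-way procedure (the split proportions and the Ridge alpha grid are illustrative choices, not prescribed by the slides), assuming x and y as before:

from sklearn.model_selection import train_test_split
from sklearn.linear_model import Ridge
from sklearn.metrics import r2_score

# hold out a final test set, then carve a validation set out of the remainder
x_tmp, x_test, y_tmp, y_test = train_test_split(x, y, test_size=0.2, random_state=0)
x_train, x_val, y_train, y_val = train_test_split(x_tmp, y_tmp, test_size=0.25, random_state=0)

best_alpha, best_r2 = None, -1e9
for alpha in [0.01, 0.1, 1, 10, 100]:           # parameter tuning on validation data only
    m = Ridge(alpha=alpha).fit(x_train, y_train)
    score = r2_score(y_val, m.predict(x_val))
    if score > best_r2:
        best_alpha, best_r2 = alpha, score

final = Ridge(alpha=best_alpha).fit(x_train, y_train)
print(r2_score(y_test, final.predict(x_test)))  # final evaluation, used exactly once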
Model Evaluation: Train, Validation, Test split
[diagram: data with known results is split three ways: a training set feeds the model builder, a validation set is used to evaluate predictions and tune the model, and a final test set is used once for the final evaluation of the final model]
Cross-validation
• Cross-validation is more useful in small datasets
• First step: data is split into k subsets of equal size
• Second step: each subset in turn is used for testing and the remainder for
training
• This is called k-fold cross-validation
• For classification, often the subsets are stratified before the cross-
validation is performed
• The error estimates are averaged to yield an overall error estimate
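A minimal k-fold cross-validation sketch with sklearn (k = 5 here is an illustrative choice; StratifiedKFold would be the classification analogue), again assuming x and y as in the Boston examples:

from sklearn.model_selection import KFold, cross_val_score
from sklearn.linear_model import LinearRegression

# assumes x (predictor DataFrame) and y (targets) from the earlier Boston example
kf = KFold(n_splits=5, shuffle=True, random_state=0)
scores = cross_val_score(LinearRegression(), x, y, cv=kf, scoring='r2')
print(scores, scores.mean())   # average the per-fold estimates into one overall estimate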
Cross-validation example:
— Break up data into groups of the same size
— Hold aside one group for testing and use the rest to build the model
— Repeat