Regression Analysis
Machine learning intro
• Regression
• Classification
• Clustering
• Dimension reduction
• Feature selection
This topic
• Simple linear regression
• Multiple linear regression
• Ridge regression
• Lasso
• Logistic regression (for classification)
Recall: Pearson Correlation Coefficient
Linear regression
• In correlation, the two variables are treated as equals.
In regression, one variable is considered the independent
(= predictor) variable (X) and the other the dependent
(= outcome) variable (Y).
• The output of a regression is a function that predicts the dependent
variable based upon values of the independent variables.
The regression line is y = α + β·x, where α is the intercept and β is the slope.
A slope of β means that every 1-unit change in X yields a β-unit change in Y.
Prediction
If you know something about X, this knowledge helps you
predict something about Y. (Sound familiar?…sound like
conditional probabilities?)
Regression equation…
E(yᵢ | xᵢ) = α + β·xᵢ
Predicted value for an individual…
yᵢ = α + β·xᵢ + εᵢ   (εᵢ = random error)
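To make this concrete, here is a minimal sketch (not from the slides) that computes the least squares estimates of the intercept and slope on toy data; the data and the names x, y, b0, b1 are placeholders.

import numpy as np

# Toy data (placeholder): y is roughly 2 + 3*x plus noise
rng = np.random.default_rng(0)
x = rng.uniform(0, 10, size=50)
y = 2 + 3 * x + rng.normal(0, 1, size=50)

# Closed-form least squares estimates for simple linear regression
b1 = np.sum((x - x.mean()) * (y - y.mean())) / np.sum((x - x.mean()) ** 2)  # slope
b0 = y.mean() - b1 * x.mean()                                               # intercept

y_hat = b0 + b1 * x     # predicted values
resid = y - y_hat       # random errors (residuals)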
Simple linear regression
[Figure: dependent variable (y) plotted against the independent variable (x), with the fitted line; each observation y is marked along with its prediction ŷ and the zero line.]
The function will make a prediction for each observed data point. The observation is denoted by y and the prediction is denoted by ŷ.
Regression Error
[Figure: the same plot, with the prediction error ε shown as the vertical distance between an observation y and its prediction ŷ.]
y = ŷ + ε
Actual = Explained + Error
Sum of squares of error (SSE)
[Figure: observations, predictions, and the population mean of y marked on the dependent-variable axis.]
The Sum of Squares Error (SSE) is the sum of the squared differences between each observation and its prediction.
The Sum of Squares Regression (SSR) is the sum of the squared differences between the prediction for each observation and the population mean.
SST, SSR and SSE
Mathematically,
[Figure: for one data point, A is the distance from the observation yᵢ to the mean ȳ, B the distance from the prediction ŷᵢ to ȳ, and C the distance from yᵢ to ŷᵢ.]
$$\mathrm{SST}=\sum_{i=1}^{n}(y_i-\bar{y})^2 \qquad \mathrm{SSR}=\sum_{i=1}^{n}(\hat{y}_i-\bar{y})^2 \qquad \mathrm{SSE}=\sum_{i=1}^{n}(\hat{y}_i-y_i)^2$$
A² = B² + C², i.e. SST = SSR + SSE; equality holds when the least squares solution is found.
*Least squares estimation gives us the parameters (α, β) that minimize C², and hence SSE.
SST: total variation in y (total squared distance of the observations from the naïve mean of y)
SSR: variation explained by x (distance from the regression line to the naïve mean of y)
SSE: unexplained variance (variance around the regression line)
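As a sanity check, here is a short sketch on placeholder data (using np.polyfit for the least squares fit) that computes the three sums of squares and verifies SST = SSR + SSE.

import numpy as np

rng = np.random.default_rng(0)
x = rng.uniform(0, 10, size=50)
y = 2 + 3 * x + rng.normal(0, 1, size=50)

b1, b0 = np.polyfit(x, y, 1)            # least squares slope and intercept
y_hat = b0 + b1 * x

SST = np.sum((y - y.mean()) ** 2)       # total variation in y
SSR = np.sum((y_hat - y.mean()) ** 2)   # variation explained by x
SSE = np.sum((y - y_hat) ** 2)          # unexplained variance
print(np.isclose(SST, SSR + SSE))       # True for the least squares solution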
The Coefficient of Determination (aka R-squared)
The proportion of total variation (SST) that is explained by the regression (SSR) is known as the Coefficient of Determination, and is often referred to as R².
R² = B²/A² = SSR/SST = 1 − SSE/SST
The value of R² can range between 0 and 1, and the higher its value the more accurate the regression model is. It is often referred to as a percentage.
Solutions for the least squares fit
• In general, the fit can be found with optimization algorithms
• E.g. gradient descent (a minimal sketch follows)
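A minimal gradient descent sketch for the simple linear regression case; the toy data, the learning rate, and the iteration count are arbitrary choices, not values from the slides.

import numpy as np

rng = np.random.default_rng(0)
x = rng.uniform(0, 10, size=100)
y = 2 + 3 * x + rng.normal(0, 1, size=100)

b0, b1 = 0.0, 0.0        # start from arbitrary parameters
step = 0.01              # learning rate (arbitrary choice)
for _ in range(5000):
    err = (b0 + b1 * x) - y
    # Gradients of the mean squared error with respect to b0 and b1
    b0 -= step * 2 * err.mean()
    b1 -= step * 2 * (err * x).mean()

print(b0, b1)            # approaches the least squares solution (about 2 and 3)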
Significance test
$$T_{n-2}=\frac{\hat{\beta}-0}{s.e.(\hat{\beta})} \qquad s.e.(\hat{\beta})=\frac{s_{y/x}}{\sqrt{SS_x}} \qquad s_{y/x}^2=\frac{\sum_{i=1}^{n}(y_i-\hat{y}_i)^2}{n-2} \qquad \text{where } SS_x=\sum_{i=1}^{n}(x_i-\bar{x})^2$$
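In practice this t-test is available off the shelf; for example, scipy.stats.linregress reports the slope, its standard error, and the two-sided p-value for H₀: β = 0 (a sketch on placeholder data):

import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
x = rng.uniform(0, 10, size=50)
y = 2 + 3 * x + rng.normal(0, 1, size=50)

res = stats.linregress(x, y)
t_stat = res.slope / res.stderr     # T with n-2 degrees of freedom
print(res.slope, res.stderr, t_stat, res.pvalue)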
Example: ρ = 0.574, R² = 0.329 (= ρ² for simple linear regression)
plt.scatter(x, y)          # observations
plt.plot(x, pred, '-')     # fitted regression line
plt.scatter(x, y - pred)   # residuals
Assumptions (or the fine print)
• Linear regression assumes that…
1. The relationship between X and Y is linear
2. Y is distributed normally at each value of X
3. The variance of Y at every value of X is the same
(homogeneity of variances)
4. The observations are independent
Residual Analysis for Linearity
[Figure: two y-vs-x panels with their residual plots. Left: not linear, the residuals show a curved pattern. Right: linear, the residuals scatter randomly around zero.]
Residual Analysis for Homoscedasticity
[Figure: two y-vs-x panels with their residual plots. Left: non-constant variance, the spread of the residuals changes with x. Right: constant variance.]
Residual Analysis for Independence
[Figure: residual plots against X. Not independent: the residuals follow a systematic pattern over X. Independent: the residuals show no pattern.]
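A minimal matplotlib sketch of such a residual plot (placeholder data; with real data, plot the residuals of your own fit against x and look for curvature, changing spread, or trends):

import numpy as np
import matplotlib.pyplot as plt

rng = np.random.default_rng(0)
x = rng.uniform(0, 10, size=100)
y = 2 + 3 * x + rng.normal(0, 1, size=100)

b1, b0 = np.polyfit(x, y, 1)
resid = y - (b0 + b1 * x)

plt.scatter(x, resid)         # residuals vs x
plt.axhline(0, color='gray')  # residuals should scatter randomly around zero
plt.xlabel('x')
plt.ylabel('residuals')
plt.show()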
Multiple linear regression…
• More than one independent variable can be used to explain variance
in the dependent variable, as long as they are not linearly related.
Y = 1·α + X·β + ε, where α is the intercept (shape (1,1)) and β is the vector of coefficients (shape (k,1)).
Coefficient values do not necessarily reflect significance/relevance, especially if the data is not normalized. (A sketch of solving this matrix form follows.)
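As an illustration on placeholder data (not the slides' own example): prepending a column of ones to X lets np.linalg.lstsq return the intercept and the coefficient vector together.

import numpy as np

rng = np.random.default_rng(0)
n, k = 200, 3
X = rng.normal(size=(n, k))                      # n observations, k predictors
y = 4.0 + X @ np.array([1.5, -2.0, 0.5]) + rng.normal(0, 1, n)

X1 = np.column_stack([np.ones(n), X])            # column of ones -> intercept
coef, *_ = np.linalg.lstsq(X1, y, rcond=None)
print(coef)                                      # roughly [4.0, 1.5, -2.0, 0.5]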
[Figure: fitted coefficient values for the real predictors vs the fake (random) predictors.]
Overfitting
• The more variables you add, the better the fit
  • Smaller SSE and MSE
  • Larger R²
• True even if the added variables are random (i.e., irrelevant)
In [2700]: fakedata = np.random.rand(len(y), 10)    # 10 random, irrelevant predictors
      ...: data2 = pd.concat([normData, pd.DataFrame(fakedata)], axis=1)
      ...: x = data2
      ...: lr.fit(x, y)
      ...: pred = lr.predict(x)
      ...: r2(y, pred)                              # R² on the same data used for fitting
Out[2700]: 0.74575835459198614
Bootstrapping
• A way to estimate the distribution given limited data
• Key idea: draw random samples with replacement
x = normData
y = np.array(boston.target)
n = 100                                 # number of bootstrap samples
dataSize = len(y)
coef_bs = np.zeros((n, x.shape[1]))
for i in range(n):
    # sample: len(y) random ints from 0 to len(y)-1, repeats allowed
    sample = np.random.choice(np.arange(dataSize), size=dataSize, replace=True)
    newY = y[sample]
    newX = x.iloc[sample, :]
    lr.fit(newX, newY)
    coef_bs[i, :] = lr.coef_
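One common use of coef_bs (not shown on the slide) is a percentile confidence interval for each coefficient; this continues the loop above and assumes normData is a DataFrame so that x.columns exists.

# 95% percentile interval for each coefficient across the bootstrap replicates
lower = np.percentile(coef_bs, 2.5, axis=0)
upper = np.percentile(coef_bs, 97.5, axis=0)
for name, lo, hi in zip(x.columns, lower, upper):
    print(f'{name}: [{lo:.3f}, {hi:.3f}]')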
Polynomial regression – overfitting vs underfitting
Polynomial regression can be solved exactly the same way as regular multiple linear regression: the powers of x are simply treated as additional predictors.
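A minimal sketch (placeholder data, arbitrary degrees) contrasting an underfit, a reasonable fit, and an overfit polynomial on the same noisy sample:

import numpy as np
import matplotlib.pyplot as plt

rng = np.random.default_rng(0)
x = np.linspace(0, 1, 30)
y = np.sin(2 * np.pi * x) + rng.normal(0, 0.2, size=x.size)   # noisy sine

xs = np.linspace(0, 1, 200)
for degree in (1, 3, 15):                 # underfit, reasonable fit, overfit
    coefs = np.polyfit(x, y, degree)
    plt.plot(xs, np.polyval(coefs, xs), label=f'degree {degree}')
plt.scatter(x, y, color='black')
plt.legend()
plt.show()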
Multicollinearity
Multicollinearity arises when two variables that measure the same thing or similar things (e.g., weight and BMI) are both included in a multiple regression model; they will, in effect, cancel each other out and generally destroy your model.
# htm3 is the target; the redundant weight columns are dropped and the rest standardized
y = brfss2['htm3']
x = brfss2.drop(['wtyrago', 'wtkg2', 'htm3'], axis=1).apply(zscore)
lr.fit(x, y)

In [2926]: x.columns
Out[2926]: Index(['age', 'weight2', 'sex'], dtype='object')

Correlation of wtkg2 with the other weight variable:
Out[2961]:
wtkg2      1.000000
wtyrago    0.936271
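One simple way to spot such pairs before fitting is the correlation matrix of the predictors; a sketch, assuming the columns of brfss2 are numeric:

# |r| close to 1 between two predictors signals multicollinearity
corr = brfss2.corr()
print(corr['wtkg2'].sort_values(ascending=False))
# wtkg2 and wtyrago are almost perfectly correlated (~0.94), so only one is kept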
[Diagram: the data is split into a training set and a testing set.]
Model Evaluation Step 2: Build a model on a training set
[Diagram: the training set, with results known ("the past"), is fed to the model builder; the testing set is held aside.]
Model Evaluation Step 3: Evaluate on test set
[Diagram: the model built on the training set makes predictions for the testing set; the predictions are compared against the known results to evaluate the model.]
A note on parameter tuning
• It is important that the test data is not used in any way to build the model
• Some learning schemes operate in two stages:
• Stage 1: builds the basic structure
• Stage 2: optimizes parameter settings
• The test data can’t be used for parameter tuning!
• Proper procedure uses three sets: training data, validation data, and test data
• Validation data is used to optimize parameters (a sketch of the split follows)
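A sketch of the three-way split with scikit-learn's train_test_split (placeholder data; the split proportions are arbitrary):

import numpy as np
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
X = rng.normal(size=(500, 4))
y = X @ np.array([1.0, -1.0, 0.5, 0.0]) + rng.normal(0, 1, 500)

# Carve off the final test set first, then split the rest into train/validation
X_trainval, X_test, y_trainval, y_test = train_test_split(X, y, test_size=0.2, random_state=0)
X_train, X_val, y_train, y_val = train_test_split(X_trainval, y_trainval, test_size=0.25, random_state=0)
# Tune parameters on X_val/y_val only; touch X_test/y_test once, at the very end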
Model Evaluation: Train, Validation, Test split
[Diagram: the training set (results known) feeds the model builder; predictions on the validation set are evaluated to tune parameters; the final model is evaluated once on the final test set.]
Cross-validation
• Cross-validation is more useful for small datasets
• First step: data is split into k subsets of equal size
• Second step: each subset in turn is used for testing and the remainder for
training
• This is called k-fold cross-validation
• For classification, often the subsets are stratified before the cross-
validation is performed
• The error estimates are averaged to yield an overall error estimate
Cross-validation example:
— Break up the data into groups of the same size
— Hold aside one group for testing and use the rest to build the model
— Repeat so that each group is held out once (a sketch follows)
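A sketch of k-fold cross-validation with scikit-learn (placeholder data; 5 folds is an arbitrary choice):

import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 3))
y = X @ np.array([2.0, -1.0, 0.5]) + rng.normal(0, 1, 200)

# Each fold is held out once for testing; the scores are averaged for an overall estimate
scores = cross_val_score(LinearRegression(), X, y, cv=5, scoring='r2')
print(scores, scores.mean())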