Linear Models - Numeric Prediction
Linear Models
Connectionist and Statistical Language Processing
Frank Keller
[email protected]
- linear regression
- least square estimation
- evaluating a numeric model, correlation
- selecting a regression model
- linear regression for classification
- regression trees, model trees
Literature: Witten and Frank (2000: ch. 4, 6), Howell (2002: ch. 15).
Numeric Prediction
An instance in the data set has the following general form:

\langle a_{1,i}, a_{2,i}, \ldots, a_{k,i}, x_i \rangle

where a_{1,i}, \ldots, a_{k,i} are attribute values, and x_i is the target value, for the i-th instance in the data set.

So far we have only seen classification tasks, where the target value x_i is categorical (represents a class). Techniques such as decision trees and Naive Bayes are not (directly) applicable if the target is numeric. Instead, algorithms for numeric prediction can be used, e.g., linear models.

Example

Predict CPU performance from configuration data:

cycle time (ns)   memory min (kB)   memory max (kB)   cache (kB)   chan min   chan max   performance
      125               256               6000            256          16        128           198
       29              8000              32000             32           8         32           269
       29              8000              32000             32           8         32           220
       29              8000              32000             32           8         32           172
       29              8000              16000             32           8         16           132
      ...
      125              2000               8000              0           2         14            52
      480               512               4000             32           0          0            67
      480              1000               4000              0           0          0            45
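To make the representation concrete: a minimal Python sketch (ours, not part of the original slides) storing instances as pairs of an attribute vector and a numeric target, using two rows of the CPU data set above:

# Each instance: (attribute values a_1, ..., a_k, numeric target x).
instances = [
    ([125, 256, 6000, 256, 16, 128], 198),
    ([29, 8000, 32000, 32, 8, 32], 269),
]

for attributes, target in instances:
    print(attributes, "->", target)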
Linear Regression
Linear regression is a technique for numeric prediction that is widely used in psychology, medical research, etc.

Key idea: find a linear equation that predicts the target value x from the attribute values a_1, \ldots, a_k:

(1)  x = w_0 + w_1 a_1 + w_2 a_2 + \ldots + w_k a_k

Here, w_1, \ldots, w_k are the regression coefficients, and w_0 is called the intercept. These are the model parameters that need to be induced from the data set.

Linear Regression

The regression equation computes the following predicted value \hat{x}_i for the i-th instance in the data set:

(2)  \hat{x}_i = w_0 + \sum_{j=1}^{k} w_j a_{j,i}

Key idea: to determine the coefficients w_0, \ldots, w_k, minimize e, the squared difference between the predicted and the actual values, summed over all n instances in the data set:

(3)  e = \sum_{i=1}^{n} (x_i - \hat{x}_i)^2 = \sum_{i=1}^{n} \Big( x_i - w_0 - \sum_{j=1}^{k} w_j a_{j,i} \Big)^2
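As an illustration of equation (2), a minimal Python sketch; the function name and the coefficients are invented for this example:

def predict(w0, weights, attributes):
    # Equation (2): x_hat = w0 + sum_j w_j * a_j
    return w0 + sum(w * a for w, a in zip(weights, attributes))

# Two attributes with made-up coefficients:
print(predict(1.0, [0.5, 2.0], [4.0, 3.0]))  # 1.0 + 0.5*4.0 + 2.0*3.0 = 9.0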
Least Square Estimation

For simplicity, assume a regression equation with a single attribute and no intercept: \hat{x} = w a. Then the error is:

(4)  e = \sum_i (x_i - w a_i)^2

To minimize the error, take the derivative of e with respect to w:

(5)  \frac{\partial e}{\partial w} = \sum_i (-2 a_i x_i + 2 w a_i^2) = -2 \sum_i a_i x_i + 2 w \sum_i a_i^2

The derivative is the slope of the error function. The slope is zero at all points at which the function has a minimum:

(6)  -2 \sum_i a_i x_i + 2 w \sum_i a_i^2 = 0

By solving this equation for w, we obtain a formula for computing the value of w that minimizes the error:

(7)  w = \frac{\sum_i a_i x_i}{\sum_i a_i^2}

This formula can be generalized to regression equations with more than one coefficient.
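Equation (7) translates directly into code. A minimal Python sketch (function name ours), applied to the sample data set of the example that follows:

def lse_single_coefficient(a, x):
    # Equation (7): w = sum_i a_i x_i / sum_i a_i^2
    return sum(ai * xi for ai, xi in zip(a, x)) / sum(ai ** 2 for ai in a)

a = [1, 2, 1, 5]
x = [2, 5, 2, 8]
print(round(lse_single_coefficient(a, x), 2))  # 54/31 = 1.74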
Example
Sample data set:

a   x
1   2
2   5
1   2
5   8

[Figure: scatter plot of the sample data with the fitted regression line; x-axis: attribute value a, y-axis: target value x.]

Applying equation (7) to this data set yields the regression coefficient:

(8)  w = \frac{\sum_i a_i x_i}{\sum_i a_i^2} = \frac{54}{31} = 1.74
Example
Compute the mean squared error for the sample data set and the regression equation x = 1.74a:
(9)  \mathrm{MSE} = \frac{1}{n} \sum_{i=1}^{n} (x_i - \hat{x}_i)^2

a   x   x̂      (x - x̂)^2
1   2   1.74    0.068
2   5   3.48    2.310
1   2   1.74    0.068
5   8   8.70    0.490

MSE = 0.734

Intuitively, the MSE represents how much the predicted values diverge from the actual values on average. Note that the MSE is the quantity that the LSE algorithm minimizes.
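The table can be checked with a short Python sketch (variable names ours):

a = [1, 2, 1, 5]
x = [2, 5, 2, 8]
x_hat = [1.74 * ai for ai in a]  # regression equation x = 1.74a

# Equation (9): MSE = (1/n) sum_i (x_i - x_hat_i)^2
mse = sum((xi - xh) ** 2 for xi, xh in zip(x, x_hat)) / len(x)
print(round(mse, 3))  # 0.734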
Correlation Coefficient
The correlation coefficient r measures the degree of linear association between the predicted and the actual values:

(10)  r = \frac{S_{PA}}{S_P S_A}

(11)  S_{PA} = \frac{\sum_i (\hat{x}_i - \bar{\hat{x}})(x_i - \bar{x})}{n - 1}

(12)  S_P = \sqrt{\frac{\sum_i (\hat{x}_i - \bar{\hat{x}})^2}{n - 1}}, \qquad S_A = \sqrt{\frac{\sum_i (x_i - \bar{x})^2}{n - 1}}

Here \bar{x} and \bar{\hat{x}} are the means of the actual and the predicted values, S_A and S_P the corresponding standard deviations, and S_{PA} the covariance of the actual and predicted values.

Example

Compute the correlation coefficient for the example data set:
\bar{x} = (2 + 5 + 2 + 8)/4 = 4.25
\bar{\hat{x}} = (1.74 + 3.48 + 1.74 + 8.70)/4 = 3.92

S_{PA} = ((1.74 - 3.92)(2 - 4.25) + (3.48 - 3.92)(5 - 4.25)
        + (1.74 - 3.92)(2 - 4.25) + (8.70 - 3.92)(8 - 4.25))/3 = 9.14

S_P^2 = ((1.74 - 3.92)^2 + (3.48 - 3.92)^2
       + (1.74 - 3.92)^2 + (8.70 - 3.92)^2)/3 = 10.85

S_A^2 = ((2 - 4.25)^2 + (5 - 4.25)^2 + (2 - 4.25)^2 + (8 - 4.25)^2)/3 = 8.25

r = 9.14 / (\sqrt{10.85} \cdot \sqrt{8.25}) = 0.97

r^2 = 0.93
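A short Python sketch that reproduces this computation, with n - 1 = 3 in the denominators as in the formulas above:

x = [2, 5, 2, 8]                  # actual values
x_hat = [1.74, 3.48, 1.74, 8.70]  # predicted values

n = len(x)
mean_a = sum(x) / n        # 4.25
mean_p = sum(x_hat) / n    # 3.915

s_pa = sum((p - mean_p) * (xi - mean_a) for p, xi in zip(x_hat, x)) / (n - 1)
s_p2 = sum((p - mean_p) ** 2 for p in x_hat) / (n - 1)
s_a2 = sum((xi - mean_a) ** 2 for xi in x) / (n - 1)

r = s_pa / (s_p2 * s_a2) ** 0.5
print(round(r, 2), round(r ** 2, 2))  # 0.97 0.93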
Correlation Coefficient
Some important properties:

- The correlation coefficient r ranges from 1.0 (perfect correlation) through 0 (no correlation) to -1.0 (perfect negative correlation).
- Intuitively, r expresses how well the data points fit on the straight line described by the regression model.
- We can test if r is significant. Null hypothesis: there is no linear relationship between the actual and the predicted values.

Partial Correlations

We can compute the multiple correlation coefficient, which tells us how well the full regression model (with all attributes) fits the target values.

We can also compute the correlation between the values of a single attribute and the target values. However, this is not very useful, as attributes can be intercorrelated, i.e., they correlate with each other (collinearity).

We therefore need to compute the partial correlation coefficient, which tells us how much variance is uniquely accounted for by an attribute once the other attributes are partialled out.

Selecting a Regression Model

- Backward elimination: compute a model consisting of all attributes, then eliminate the attribute with the lowest partial r. Iterate until the multiple r deteriorates.
- Forward selection: compute a model consisting only of the attribute with the highest partial r. Then add the next best attribute. Stop when the multiple r doesn't improve.

Different model selection algorithms can yield different models; a simplified sketch of forward selection is given below.
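A simplified Python sketch of forward selection. It is illustrative only: instead of computing partial correlations it greedily adds the attribute that most improves the multiple r of a least-squares fit, which captures the spirit of the procedure but not its exact statistics:

import numpy as np

def multiple_r(A, x):
    # Least-squares fit with intercept; the multiple r is the correlation
    # between fitted and actual target values.
    X = np.column_stack([np.ones(len(x)), A])
    w, *_ = np.linalg.lstsq(X, x, rcond=None)
    return np.corrcoef(X @ w, x)[0, 1]

def forward_selection(A, x, tol=1e-3):
    # A: n-by-k attribute matrix, x: target vector.
    selected, best_r = [], 0.0
    remaining = list(range(A.shape[1]))
    while remaining:
        r, j = max((multiple_r(A[:, selected + [j]], x), j) for j in remaining)
        if r - best_r < tol:  # stop when the multiple r doesn't improve
            break
        selected.append(j)
        remaining.remove(j)
        best_r = r
    return selected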
Leave-one-out cross-validation: set k, the number of folds, to the number of instances in the data set, i.e., test on each instance separately.
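A minimal Python sketch of leave-one-out evaluation for the single-coefficient model x = wa, reusing lse_single_coefficient from the earlier sketch:

def leave_one_out_mse(a, x):
    errors = []
    for i in range(len(a)):
        # Train on all instances except the i-th; test on the held-out one.
        a_train, x_train = a[:i] + a[i + 1:], x[:i] + x[i + 1:]
        w = lse_single_coefficient(a_train, x_train)
        errors.append((x[i] - w * a[i]) ** 2)
    return sum(errors) / len(errors)

print(round(leave_one_out_mse([1, 2, 1, 5], [2, 5, 2, 8]), 3))  # 4.154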
Linear Regression for Classification

Linear regression can also be used for classification: perform one regression for each class, with the target set to 1 for training instances that belong to the class and 0 otherwise (the class's membership function). To classify a new instance, compute the value of each membership function, and assign the new instance the class with the highest value.

This procedure is called multiresponse linear regression.
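A minimal Python sketch of multiresponse linear regression, assuming a 0/1 encoding of the membership functions and an ordinary least-squares fit per class (names ours):

import numpy as np

def train_multiresponse(A, labels, classes):
    X = np.column_stack([np.ones(len(labels)), A])
    models = {}
    for c in classes:
        # Membership function: target is 1 for instances of class c, else 0.
        target = np.array([1.0 if label == c else 0.0 for label in labels])
        models[c], *_ = np.linalg.lstsq(X, target, rcond=None)
    return models

def classify(models, attributes):
    x = np.concatenate(([1.0], attributes))
    # Assign the class whose membership function yields the highest value.
    return max(models, key=lambda c: float(models[c] @ x))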
Linear Separability

Linear regression approximates a linear function. This means that the classes have to be linearly separable.

[Figure: two scatter plots over attributes a1 and a2, illustrating linearly separable and non-separable classes.]
Regression Trees
Regression trees are decision trees for numeric attributes. The leaves are not labeled with classes, but with the mean of the target values of the instances classified by a given branch.

To construct a regression tree, choose splitting attributes to minimize the intra-subset variation for each branch, i.e., maximize the standard deviation reduction (instead of the information gain):

(13)  \mathrm{SDR} = \sigma(T) - \sum_i \frac{|T_i|}{|T|} \, \sigma(T_i)

where T is the set of instances classified at a given node, T_1, T_2, \ldots are the subsets that T is split into, and \sigma is the standard deviation.
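Equation (13) as a short Python sketch. The split shown is a hypothetical partition of the performance values from the CPU example, not one chosen by an actual tree inducer:

import numpy as np

def sdr(targets, subsets):
    # Equation (13): SDR = sigma(T) - sum_i |T_i| / |T| * sigma(T_i)
    t = np.asarray(targets, dtype=float)
    return t.std() - sum(
        len(s) / len(t) * np.asarray(s, dtype=float).std() for s in subsets
    )

performance = [198, 269, 220, 172, 132, 52, 67, 45]
left, right = performance[:5], performance[5:]
print(round(sdr(performance, [left, right]), 2))  # approx. 46.45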
Model Trees
Model trees are regression trees that have linear regression models at their leaves, not just numeric values.
Induction algorithm:

- Induce a regression tree using standard deviation reduction as the splitting criterion.
- At each leaf, build a linear regression model for the instances classified by this leaf.

Model trees get around the problem of linear separability by combining several regression models.
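A minimal Python sketch of the idea, restricted to a single split with one linear model per leaf; a real model tree chooses splits by SDR, recurses, and may smooth the leaf models:

import numpy as np

def fit_leaf(a, x):
    # Linear regression model (with intercept) for the instances at a leaf.
    X = np.column_stack([np.ones(len(a)), a])
    w, *_ = np.linalg.lstsq(X, x, rcond=None)
    return w

def depth_one_model_tree(a, x, split):
    # Assumes both sides of the split are non-empty.
    a, x = np.asarray(a, dtype=float), np.asarray(x, dtype=float)
    w_left = fit_leaf(a[a <= split], x[a <= split])
    w_right = fit_leaf(a[a > split], x[a > split])
    def predict(ai):
        w = w_left if ai <= split else w_right
        return w[0] + w[1] * ai
    return predict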
Summary
Linear regression models are used for numeric prediction. They fit a linear equation that combines attribute values to predict a numeric target attribute.
Least square estimation can be used to determine the coefficients in the regression equation so that the difference between predicted and actual values is minimal.

A numeric model can be evaluated using the mean squared error or the correlation coefficient.

Regression models can be used for classification, either directly in multiresponse regression or in combination with decision trees: regression trees, model trees.

References

Howell, David C. 2002. Statistical Methods for Psychology. Pacific Grove, CA: Duxbury, 5th edn.

Witten, Ian H., and Eibe Frank. 2000. Data Mining: Practical Machine Learning Tools and Techniques with Java Implementations. San Francisco: Morgan Kaufmann.