CHAPTER 6

Multiple Linear Regression

In this chapter we introduce linear regression models for the purpose of prediction. We discuss the differences between fitting and using regression models for the purpose of inference (as in classical statistics) and for the purpose of prediction. A predictive goal calls for evaluating model performance on a validation set and for using predictive metrics. We then raise the challenges of using many predictors and describe variable selection algorithms that are often implemented in linear regression procedures.

6.1 INTRODUCTION

The most popular model for making predictions is the multiple linear regression model encountered in most introductory statistics courses and textbooks. This model is used to fit a relationship between a quantitative dependent variable Y (also called the outcome, target, or response variable) and a set of predictors X1, X2, ..., Xp (also referred to as independent variables, input variables, regressors, or covariates). The assumption is that the following function approximates the relationship between the input and outcome variables:

Y = β0 + β1 x1 + β2 x2 + ... + βp xp + ε,    (6.1)

where β0, ..., βp are coefficients and ε is the noise or unexplained part. Data are then used to estimate the coefficients and to quantify the noise. In predictive modeling, the data are also used to evaluate model performance.
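To make equation (6.1) concrete, here is a minimal simulation sketch of the data-generating process it describes. This illustration is ours, not part of the book (which works in XLMiner rather than code); the sample size, coefficient values, and noise level below are arbitrary assumptions.

```python
# Minimal sketch (not from the book): simulate records from the linear model in
# equation (6.1), Y = beta_0 + beta_1*x1 + ... + beta_p*xp + epsilon.
# All numbers here (n, p, the betas, the noise level) are made-up assumptions.
import numpy as np

rng = np.random.default_rng(1)

n, p = 200, 3                          # number of records and predictors
X = rng.uniform(0, 10, size=(n, p))    # predictor values x1, ..., xp

beta0 = 2.0                            # intercept beta_0
betas = np.array([1.5, -0.7, 0.3])     # coefficients beta_1, ..., beta_p
noise = rng.normal(0, 1.0, size=n)     # epsilon, the unexplained part

Y = beta0 + X @ betas + noise          # the outcome variable
```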
Regression modeling means not only estimating the coefficients but also choosing which input variables to include and in what form. For example, a numerical input can be included as-is, or in logarithmic form (log X), or in a binned form (e.g., age group). Choosing the right form depends on domain knowledge, data availability, and needed predictive power.

Multiple linear regression is applicable to numerous predictive modeling situations. Examples are predicting customer activity on credit cards from their demographics and historical activity patterns, predicting the time to failure of equipment based on utilization and environmental conditions, predicting staffing requirements at help desks based on historical data and product and sales information, predicting sales from cross-selling of products from historical information, predicting expenditures on vacation travel based on historical frequent flyer data, and predicting the impact of discounts on sales in retail outlets.

6.2 EXPLANATORY VS. PREDICTIVE MODELING

Before introducing the use of linear regression for prediction, we must clarify an important distinction that often escapes those with earlier familiarity with linear regression from courses in statistics. In particular, there are two popular but different objectives behind fitting a regression model:

1. Explaining or quantifying the average effect of inputs on an output (explanatory or descriptive task, respectively)
2. Predicting the outcome value for new records, given their input values (predictive task)

The classic statistical approach is focused on the first objective. In that scenario, the data are treated as a random sample from a larger population of interest. The regression model estimated from this sample is an attempt to capture the average relationship in the larger population. This model is then used in decision making to generate statements such as "a unit increase in service speed (X1) is associated with an average increase of 5 points in customer satisfaction (Y), all other predictors (X2, X3, ..., Xp) being equal." If X1 is known to cause Y, then such a statement indicates actionable policy changes; this is called explanatory modeling. When the causal structure is unknown, the model quantifies the degree of association between the inputs and the output, and the approach is called descriptive modeling.

In predictive analytics, however, the focus is typically on the second goal: predicting new individual records. Here we are not interested in the coefficients themselves, nor in the "average record," but rather in the predictions that this model can generate for new records. In this scenario, the model is used for micro-decision-making at the record level. In our previous example, we would use the regression model to predict customer satisfaction for each new customer.

Both explanatory and predictive modeling involve using a dataset to fit a model (i.e., to estimate coefficients), checking model validity, assessing its performance, and comparing to other models. However, the modeling steps and performance assessment differ in the two cases, usually leading to different final models. Therefore the choice of model is closely tied to whether the goal is explanatory or predictive.

In explanatory and descriptive modeling, where the focus is on modeling the average record, we try to fit the best model to the data in order to learn about the underlying relationship in the population. In contrast, in predictive modeling (data mining), the goal is to find a regression model that best predicts new individual records. A regression model that fits the existing data too well is not likely to perform well with new data. Hence we look for a model that has the highest predictive power by evaluating it on a holdout set and using predictive metrics (see Chapter 5).

Let us summarize the main differences in using a linear regression model in the two scenarios:

1. A good explanatory model is one that fits the data closely, whereas a good predictive model is one that predicts new cases accurately. Choices of input variables and their form can therefore differ.

2. In explanatory models, the entire dataset is used for estimating the best-fit model, to maximize the amount of information that we have about the hypothesized relationship in the population. When the goal is to predict outcomes of new individual cases, the data are typically split into a training set and a validation set: the training set is used to estimate the model, and the validation or holdout set is used to assess this model's predictive performance on new, unobserved data.

3. Performance measures for explanatory models measure how close the data fit the model (how well the model approximates the data) and how strong the average relationship is, whereas in predictive models performance is measured by predictive accuracy (how well the model predicts new individual cases).

4. In explanatory models the focus is on the coefficients (β), whereas in predictive models the focus is on the predictions (Ŷ).

For these reasons it is extremely important to know the goal of the analysis before beginning the modeling process. A good predictive model can have a looser fit to the data on which it is based, and a good explanatory model can have low prediction accuracy. In the remainder of this chapter we focus on predictive models, because these are more popular in data mining and because most statistics textbooks focus on explanatory modeling.

6.3 ESTIMATING THE REGRESSION EQUATION AND PREDICTION

Once we determine the input variables to include and their form, we estimate the coefficients of the regression formula from the data using a method called ordinary least squares (OLS).
This method finds values β̂0, β̂1, ..., β̂p that minimize the sum of squared deviations between the actual outcome values (Y) and their predicted values based on the model (Ŷ). To predict the value of the output variable for a record with input values x1, x2, ..., xp, we use the equation

Ŷ = β̂0 + β̂1 x1 + β̂2 x2 + ... + β̂p xp.    (6.2)

Predictions based on this equation are the best possible predictions in the sense that they will be unbiased (equal to the true values, on average) and will have the smallest average squared error compared to other unbiased estimates, if we make the following assumptions:

1. The noise ε (or equivalently, Y) follows a normal distribution.
2. The choice of variables and their form is correct (linearity).
3. The cases are independent of each other.
4. The variability in the output values for a given set of predictors is the same regardless of the values of the predictors (homoskedasticity).

Even if we drop the first assumption and allow the noise to follow an arbitrary distribution, these estimates are still very good for prediction, in the sense that among all linear models, as defined by equation (6.1), the model using the least squares estimates β̂0, β̂1, ..., β̂p will have the smallest average squared errors. The assumption of a normal distribution is required in explanatory modeling, where it is used for constructing confidence intervals and statistical tests for the model parameters.

Even if the other assumptions are violated, it is still possible that the resulting predictions are sufficiently accurate and precise for the purpose they are intended for. The key is to evaluate predictive performance of the model, which is the main priority. Satisfying assumptions is of secondary interest, and residual analysis can give clues to potential improved models to examine.
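As a concrete illustration of how the least squares coefficients and the predictions of equation (6.2) can be computed, here is a minimal Python sketch. It is ours, not the book's XLMiner workflow: it simulates data as in the earlier sketch, splits it into training and holdout sets, estimates β̂0, ..., β̂p on the training set, and evaluates the predictions on the holdout set. The data, split proportions, and printed metrics are illustrative assumptions.

```python
# Minimal sketch (not from the book): ordinary least squares estimation of the
# coefficients and prediction via equation (6.2), evaluated on a holdout set.
import numpy as np

rng = np.random.default_rng(1)
n, p = 200, 3
X = rng.uniform(0, 10, size=(n, p))                        # simulated predictors
Y = 2.0 + X @ np.array([1.5, -0.7, 0.3]) + rng.normal(0, 1.0, size=n)

# Partition into a training set (60%) and a validation/holdout set (40%)
n_train = int(0.6 * n)
X_train, X_valid = X[:n_train], X[n_train:]
Y_train, Y_valid = Y[:n_train], Y[n_train:]

# OLS: find beta_hat minimizing the sum of squared deviations between the
# actual training outcomes and the fitted values.
A_train = np.column_stack([np.ones(n_train), X_train])     # prepend intercept column
beta_hat, *_ = np.linalg.lstsq(A_train, Y_train, rcond=None)

# Predictions for new records, as in equation (6.2)
A_valid = np.column_stack([np.ones(len(X_valid)), X_valid])
Y_pred = A_valid @ beta_hat

# Predictive metrics on the holdout set (see Chapter 5)
errors = Y_valid - Y_pred
print("average error:", errors.mean())
print("RMSE:", np.sqrt((errors ** 2).mean()))
```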
TABLE 6.1  VARIABLES IN THE TOYOTA COROLLA EXAMPLE

Variable      Description
Price         Offer price in euros
Age           Age in months as of August 2006
Kilometers    Accumulated kilometers on odometer
Fuel type     Fuel type (Petrol, Diesel, CNG)
HP            Horsepower
Metallic      Metallic color? (Yes = 1, No = 0)
Automatic     Automatic transmission (Yes = 1, No = 0)
CC            Cylinder volume in cubic centimeters
Doors         Number of doors
Quart tax     Quarterly road tax
Weight        Weight in kilograms

Example: Predicting the Price of Used Toyota Corolla Cars

A large Toyota car dealership offers purchasers of new Toyota cars the option to buy their used car as part of a trade-in. In particular, a new promotion promises to pay high prices for used Toyota Corollas for purchasers of a new car. The dealer then sells the used cars for a small profit. To ensure a reasonable profit, the dealer needs to be able to predict the price that the dealership will get for the used cars. For that reason, data were collected on all previous sales of used Toyota Corollas at the dealership. The data include the sales price and other information on the car, such as its age, mileage, fuel type, and engine size. A description of each of these variables is given in Table 6.1. A sample of this dataset is shown in Table 6.2.

TABLE 6.2  PRICES AND ATTRIBUTES FOR USED TOYOTA COROLLA CARS (SELECTED ROWS AND COLUMNS ONLY)
[Sample of records showing Price, Age, Kilometers, Fuel Type, HP, Metallic, Automatic, CC, Doors, Quart Tax, and Weight]

The total number of records in the dataset is 1000 cars (we use the first 1000 rows in ToyotaCorolla.xls). After partitioning the data into training (60%) and validation (40%) sets, we fit a multiple linear regression model between price (the output variable) and the other variables (as predictors) using only the training set. Figure 6.1 shows the estimated coefficients, as computed by XLMiner.¹ Notice that the Fuel Type predictor has three categories (Petrol, Diesel, and CNG). We therefore have two dummy variables in the model: Petrol (0/1) and Diesel (0/1); the third, CNG (0/1), is redundant given the information in the first two dummies. Inclusion of this redundant variable would cause typical regression software to fail due to a multicollinearity error, since the redundant variable is a perfect linear combination of the other two (see Section 4.5).

¹ In some versions of XLMiner, the intercept in the coefficients table is called "constant term."

FIGURE 6.1  ESTIMATED COEFFICIENTS FOR REGRESSION MODEL OF PRICE VS. CAR ATTRIBUTES
[XLMiner output: coefficient, standard error, t-statistic, and p-value for the intercept and each predictor]

The regression coefficients are then used to predict prices of individual used Toyota Corolla cars based on their age, mileage, and so on. Figure 6.2 shows a sample of predicted prices for 20 cars in the validation set, using the estimated model. It gives the predictions and their errors (relative to the actual prices) for these 20 cars.

FIGURE 6.2  (A) PREDICTED PRICES (AND ERRORS) FOR 20 CARS IN VALIDATION SET, AND (B) SUMMARY PREDICTIVE MEASURES FOR ENTIRE VALIDATION SET

On the right of Figure 6.2 we get overall measures of predictive accuracy. Note that the average error is $111. A boxplot of the residuals (Figure 6.3) shows that about half of the errors fall within approximately ±$850. This error magnitude might be small relative to the car price, but it should be taken into account when estimating the profit. Another observation of interest is the large positive residuals (underpredictions), which may or may not be a concern, depending on the application. Measures such as the average error and error percentiles are used to assess the predictive performance of a model and to compare models. We discuss such measures in the next section.
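For readers who want to reproduce the flavor of this analysis outside XLMiner, the sketch below shows, in Python, the two ingredients just described: encoding Fuel Type as two dummy variables (with the redundant third dummy dropped) and computing holdout error measures after a 60%/40% partition. The data frame is synthetic and only mimics a few columns of Table 6.1; it is not the ToyotaCorolla.xls data, so the printed numbers will not match Figure 6.2.

```python
# Sketch only (not the book's XLMiner output): dummy-coding Fuel Type and
# computing holdout error measures for a price model. The data are synthetic
# stand-ins for a few of the Table 6.1 columns, not ToyotaCorolla.xls records.
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
n = 300
cars = pd.DataFrame({
    "Age":        rng.integers(4, 80, n),                  # months
    "Kilometers": rng.integers(1_000, 240_000, n),
    "HP":         rng.choice([69, 90, 110, 192], n),
    "Fuel_Type":  rng.choice(["Petrol", "Diesel", "CNG"], n),
})
# Made-up price relationship, just so there is an outcome to model
cars["Price"] = (20_000 - 120 * cars["Age"] - 0.02 * cars["Kilometers"]
                 + 25 * cars["HP"] + rng.normal(0, 800, n))

# drop_first=True keeps two of the three fuel-type dummies and omits the
# redundant third one, avoiding the multicollinearity problem described above.
X = pd.get_dummies(cars.drop(columns="Price"), columns=["Fuel_Type"], drop_first=True)
y = cars["Price"]

# 60% training / 40% validation partition, as in the example
X_train, X_valid, y_train, y_valid = train_test_split(X, y, train_size=0.6, random_state=1)

model = LinearRegression().fit(X_train, y_train)
errors = y_valid - model.predict(X_valid)

# Holdout measures of predictive accuracy and the spread of the residuals,
# analogous to the summaries in Figures 6.2 and 6.3
print("average error:", round(errors.mean(), 1))
print("RMSE:", round(float(np.sqrt((errors ** 2).mean())), 1))
print("middle 50% of residuals:", errors.quantile([0.25, 0.75]).round(1).tolist())
```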
This example also illustrates the point about the relaxation of the normality assumption. A histogram or probability plot of prices shows a right-skewed distribution. In a descriptive/explanatory modeling case, where the goal is to obtain a good fit to the data, the output variable would be transformed (e.g., by taking a logarithm) to achieve a more "normal" variable. Although the fit of such a model to the training data is expected to be better, it will not necessarily improve predictive performance. In this example the average error in a model of log(price) is -$160, compared to $111 in the original model for price.

FIGURE 6.3  BOXPLOT OF THE MODEL RESIDUALS FOR THE VALIDATION SET (vertical axis: Residual)

6.4 VARIABLE SELECTION IN LINEAR REGRESSION

Reducing the Number of Predictors

A frequent problem in data mining is that of using a regression equation to predict the value of a dependent variable when we have many variables available to choose as predictors in our model. Given the high speed of modern algorithms for multiple linear regression calculations, it is tempting in such a situation to take a kitchen-sink approach: Why bother to select a subset? Just use all the variables in the model. Another consideration favoring the inclusion of numerous variables is the hope that a previously hidden relationship will emerge. For example, a company found that customers who had purchased anti-scuff protectors for chair and table legs had lower credit risks. However, there are several reasons for exercising caution before throwing all possible variables into the model:

- It may be expensive or not feasible to collect a full complement of predictors for future predictions.
- We may be able to measure fewer predictors more accurately (e.g., in surveys).
- The more predictors there are, the higher the chance of missing values in the data. If we delete or impute cases with missing values, multiple predictors will lead to a higher rate of case deletion or imputation.
- Parsimony is an important property of good models. We obtain more insight into the influence of predictors in models with few parameters.
- Estimates of regression coefficients are likely to be unstable, due to multicollinearity in models with many variables. (Multicollinearity is the presence of two or more predictors sharing the same linear relationship with the outcome variable.) Regression coefficients are more stable for parsimonious models. One very rough rule of thumb is to have a number of cases n larger than 5(p + 2), where p is the number of predictors.
- It can be shown that using predictors that are uncorrelated with the dependent variable increases the variance of predictions.
- It can be shown that dropping predictors that are actually correlated with the dependent variable can increase the average error (bias) of predictions.

The last two points mean that there is a trade-off between too few and too many predictors. In general, accepting some bias can reduce the variance in predictions. This bias-variance trade-off is particularly important for large numbers of predictors, since in that case it is very likely that there are variables in the model that have small coefficients relative to the standard deviation of the noise and that also exhibit at least moderate correlation with other variables. Dropping such variables will improve the predictions, as it reduces the prediction variance. This type of bias-variance trade-off is a basic aspect of most data mining procedures for prediction and classification. Because of this, methods for reducing the number of predictors are often used.

The first step in trying to reduce the number of predictors should always be to use domain knowledge. It is important to understand what the various predictors are measuring and why they are relevant. With this knowledge, the set of predictors should be reduced to a sensible set that reflects the problem at hand. Some practical reasons for predictor elimination are the expense of collecting this information in the future, inaccuracy, high correlation with another predictor, many missing values, or simply irrelevance. Also helpful in examining potential predictors are summary statistics and graphs, such as frequency and correlation tables, predictor-specific summary statistics and plots, and missing value counts.

The next step makes use of computational power and statistical significance. In general, there are two types of methods for reducing the number of predictors in a model. The first is an exhaustive search for the "best" subset of predictors by fitting regression models with all the possible combinations of predictors. The second is to search through a partial set of models. We describe these two approaches next.

Exhaustive Search

The idea here is to evaluate all subsets of predictors. Since the number of subsets for even moderate values of p is very large, after the algorithm creates the subsets and runs all the models, we need some way to examine the most promising subsets and to select from them. Criteria for evaluating and comparing models are based on metrics computed from the training data. One popular criterion is the adjusted R², which is defined as

R²adj = 1 - [(n - 1)/(n - p - 1)](1 - R²).

Higher values of adjusted R² indicate better fit. Unlike R², which does not account for the number of predictors used, adjusted R² uses a penalty on the number of predictors. This avoids the artificial increase in R² that can result from simply increasing the number of predictors but not the amount of information. It can be shown that using R²adj to choose a subset is equivalent to picking the subset that minimizes σ̂².

Another criterion that is often used for subset selection is known as Mallow's Cp (see the formula below). This criterion assumes that the full model (with all predictors) is unbiased, although it may contain predictors that, if dropped, would reduce prediction variability. With this assumption, if a subset model is unbiased, the average Cp value equals the number of parameters p + 1 (= number of predictors + 1), the size of the subset. So a reasonable approach to identifying subset models with small bias is to examine those with values of Cp that are near p + 1. Cp is also an estimate of the error² for predictions at the x-values observed in the training set. Thus good models are those that have values of Cp near p + 1 and that have small p (i.e., are of small size). Cp is computed from the formula

Cp = SSE / σ̂²_full + 2(p + 1) - n,    (6.3)

where σ̂²_full is the estimated value of σ² in the full model that includes all predictors. It is important to remember that the usefulness of this approach depends heavily on the reliability of the estimate of σ² for the full model. This requires that the training set contain a large number of observations relative to the number of predictors. Finally, a useful point to note is that for a fixed size of subset, R², R²adj, and Cp all select the same subset; there is in fact no difference among them in the order of merit that they ascribe to subsets of a fixed size. This is good to know when comparing models with the same number of predictors, but often we want to compare models with different numbers of predictors.

² In particular, it is the sum of the MSE standardized by dividing by σ².

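To make the exhaustive search concrete, here is a minimal Python sketch (ours, not the book's XLMiner procedure) that fits a regression for every subset of predictors and scores each subset by adjusted R² and by Mallow's Cp as in equation (6.3). The data frame, column names, and coefficients are synthetic placeholders.

```python
# Sketch only (not the book's procedure): exhaustive subset search scored by
# adjusted R-squared and Mallow's Cp from equation (6.3), on synthetic data.
from itertools import combinations

import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(0)
n = 200
X = pd.DataFrame(rng.normal(size=(n, 5)), columns=["x1", "x2", "x3", "x4", "x5"])
y = 3 + 2 * X["x1"] - 1.5 * X["x3"] + rng.normal(0, 1, n)   # x2, x4, x5 are irrelevant

def sse(cols):
    """Training-set sum of squared errors for a model using the given predictors."""
    model = LinearRegression().fit(X[list(cols)], y)
    resid = y - model.predict(X[list(cols)])
    return float((resid ** 2).sum())

p_full = X.shape[1]
sigma2_full = sse(X.columns) / (n - p_full - 1)   # sigma^2 estimated from the full model
sst = float(((y - y.mean()) ** 2).sum())

results = []
for p in range(1, p_full + 1):
    for cols in combinations(X.columns, p):
        e = sse(cols)
        r2 = 1 - e / sst
        adj_r2 = 1 - (n - 1) / (n - p - 1) * (1 - r2)
        cp = e / sigma2_full + 2 * (p + 1) - n            # equation (6.3)
        results.append({"predictors": cols, "p": p, "adj_R2": adj_r2, "Cp": cp})

# Promising subsets: high adjusted R^2, Cp near p + 1, and small p
print(pd.DataFrame(results).sort_values("adj_R2", ascending=False).head())
```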