Linear Regression

Linear Regression estimates the coefficients of the linear equation, involving one or more independent variables, that best predict the value of the dependent variable. For example, you can predict a salesperson's total yearly sales (the dependent variable) from independent variables such as age, education, and years of experience.

Example. Is the number of games won by a basketball team in a season related to the average number of points the team scores per game? A scatterplot indicates that these variables are linearly related. The number of games won and the average number of points scored by the opponent are also linearly related. These variables have a negative relationship: as the number of games won increases, the average number of points scored by the opponent decreases. With linear regression, you can model the relationship of these variables. A good model can be used to predict how many games teams will win.
The linear regression model assumes that there is a linear, or "straight line," relationship between the dependent variable and each predictor. This relationship is described in the following formula:

y_i = b_0 + b_1*x_i1 + ... + b_p*x_ip + e_i

where
y_i is the value of the ith case of the dependent scale variable
p is the number of predictors
b_j is the value of the jth coefficient, j = 0, ..., p
x_ij is the value of the ith case of the jth predictor
e_i is the error in the observed value for the ith case

The model is linear because increasing the value of the jth predictor by 1 unit increases the value of the dependent variable by b_j units. Note that b_0 is the intercept, the model-predicted value of the dependent variable when the value of every predictor is equal to 0.
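The formula above can be sketched numerically. The following is a minimal illustration, assuming NumPy is available; the data are synthetic (not the Nambe Mills data), with a known intercept and slope so the fit can be checked against the truth.

```python
import numpy as np

# Synthetic data for illustration: one predictor,
# true intercept b0 = 2.0 and true slope b1 = 3.0.
rng = np.random.default_rng(0)
x = rng.uniform(0, 10, size=200)
y = 2.0 + 3.0 * x + rng.normal(0, 0.5, size=200)

# Fit y = b0 + b1*x by ordinary least squares.
X = np.column_stack([np.ones_like(x), x])  # design matrix with intercept column
b, *_ = np.linalg.lstsq(X, y, rcond=None)
b0, b1 = b

# Increasing x by 1 unit changes the model-predicted value by b1 units.
print(b0, b1)
```

The estimated coefficients should land close to the true values 2.0 and 3.0, which is the sense in which the fitted line "best predicts" the dependent variable.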
For the purpose of testing hypotheses about the values of model parameters, the linear regression model also assumes the following:
The error term has a normal distribution with a mean of 0.
The variance of the error term is constant across cases and independent of the variables in the model. An error term with non-constant variance is said to be heteroscedastic.
The value of the error term for a given case is independent of the values of the variables in the model and of the values of the error term for other cases.
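The constant-variance assumption can be probed with a rough numeric check: correlate the absolute residuals with the fitted values. This is a simplified sketch on synthetic data, assuming NumPy; it is not a formal heteroscedasticity test such as Breusch-Pagan.

```python
import numpy as np

rng = np.random.default_rng(1)
x = rng.uniform(1, 10, 500)

def abs_resid_corr(y):
    # Fit a line, then correlate |residual| with the fitted value.
    X = np.column_stack([np.ones_like(x), x])
    b, *_ = np.linalg.lstsq(X, y, rcond=None)
    fitted = X @ b
    resid = y - fitted
    return np.corrcoef(fitted, np.abs(resid))[0, 1]

# Constant error variance (homoscedastic): correlation near 0.
y_const = 1.0 + 2.0 * x + rng.normal(0, 1.0, x.size)

# Error variance growing with x (heteroscedastic): clear positive correlation.
y_grow = 1.0 + 2.0 * x + rng.normal(0, 0.3 * x)

print(abs_resid_corr(y_const), abs_resid_corr(y_grow))
```

A correlation well away from 0 suggests the error variance depends on the predicted value, violating the assumption.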
Example The Nambe Mills company has a line of metal tableware products that require a polishing step in the manufacturing process. To help plan the production schedule, the polishing times for 59 products were recorded, along with the product type and the relative sizes of these products, measured in terms of their diameters. We can use linear regression to determine whether the polishing time can be predicted by product size. Before running the regression, we should examine a scatterplot of polishing time by product size to determine whether a linear model is reasonable for these variables.
To produce a scatterplot of time by diam, from the menus choose: Graphs > Scatter/Dot...
Click Define.
Select time as the y variable and diam as the x variable. Click OK. These selections produce the scatterplot.
To see a best-fit line overlaid on the points in the scatterplot, activate the graph by double-clicking on it. Select a point in the Chart Editor. Click the Add fit line tool, then close the Chart Editor.
The resulting scatterplot appears to be suitable for linear regression, with two possible causes for concern.
To run a linear regression analysis, from the menus choose: Analyze > Regression > Linear...
Select time as the dependent variable. Select diam as the independent variable. Select type as the case labeling variable. Click Plots.
Select *SDRESID as the y variable and *ZPRED as the x variable. Select Histogram and Normal probability plot. Click Continue. Click Save in the Linear Regression dialog box.
Select Standardized in the Predicted Values group. Select Standardized in the Residuals group. Click Continue. Click OK in the Linear Regression dialog box.
These selections produce a linear regression model for polishing time based on diameter. Diagnostic plots of the Studentized residuals by the model-predicted values are requested, and various values are saved for further diagnostic testing.
Coefficients
This table shows the coefficients of the regression line.
It states that the expected polishing time is equal to 3.457 * DIAM - 1.955. If Nambe Mills plans to manufacture a 15-inch casserole, the predicted polishing time would be 3.457 * 15 - 1.955 = 49.9, or about 50 minutes.
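The arithmetic above can be reproduced directly. The coefficients here are the ones reported in the text; the helper function name is just for illustration.

```python
# Coefficients from the coefficients table: slope 3.457, intercept -1.955.
slope, intercept = 3.457, -1.955

def predicted_polishing_time(diam):
    """Model-predicted polishing time (minutes) for a product of the given diameter (inches)."""
    return slope * diam + intercept

# A 15-inch casserole: 3.457 * 15 - 1.955 = 49.9, or about 50 minutes.
print(round(predicted_polishing_time(15), 1))
```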
The regression and residual sums of squares are approximately equal, which indicates that about half of the variation in polishing time is explained by the model. The significance value of the F statistic is less than 0.05, which means that the variation explained by the model is not due to chance. While the ANOVA table is a useful test of the model's ability to explain any variation in the dependent variable, it does not directly address the strength of that relationship.
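The quantities in the ANOVA table fit together as follows: the total sum of squares splits into a regression part and a residual part, and the F statistic is the ratio of their mean squares. This is a sketch on synthetic data assuming NumPy, mimicking the example's 59 cases; the actual table values come from the SPSS output.

```python
import numpy as np

# Synthetic stand-in for the polishing-time data: 59 cases, one predictor.
rng = np.random.default_rng(2)
x = rng.uniform(5, 20, 59)
y = 3.5 * x - 2.0 + rng.normal(0, 13, x.size)

X = np.column_stack([np.ones_like(x), x])
b, *_ = np.linalg.lstsq(X, y, rcond=None)
fitted = X @ b
n, p = len(y), 1  # cases, predictors

ss_total = np.sum((y - y.mean()) ** 2)   # total variation in y
ss_resid = np.sum((y - fitted) ** 2)     # variation left unexplained
ss_regr = ss_total - ss_resid            # variation explained by the model

# F = (regression mean square) / (residual mean square)
F = (ss_regr / p) / (ss_resid / (n - p - 1))
print(ss_regr, ss_resid, F)
```

A large F (with a small significance value) says the explained variation is unlikely to be due to chance, which is exactly what the ANOVA table tests.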
The model summary table reports the strength of the relationship between the model and the dependent variable. R, the multiple correlation coefficient, is the linear correlation between the observed and model-predicted values of the dependent variable. Its large value indicates a strong relationship.
R Square, the coefficient of determination, is the squared value of the multiple correlation coefficient. It shows that about half the variation in time is explained by the model.
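The relationship between R and R Square can be verified numerically: for a least-squares fit with an intercept, the squared correlation between observed and predicted values equals the proportion of variance explained. A small sketch on synthetic data, assuming NumPy:

```python
import numpy as np

rng = np.random.default_rng(3)
x = rng.uniform(0, 10, 100)
y = 1.0 + 2.0 * x + rng.normal(0, 3, x.size)

X = np.column_stack([np.ones_like(x), x])
b, *_ = np.linalg.lstsq(X, y, rcond=None)
fitted = X @ b

# R: linear correlation between observed and model-predicted values.
R = np.corrcoef(y, fitted)[0, 1]

# R Square: proportion of the variation in y explained by the model.
r2 = 1 - np.sum((y - fitted) ** 2) / np.sum((y - y.mean()) ** 2)

print(R, r2)
```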
As a further measure of the strength of the model fit, compare the standard error of the estimate in the model summary table to the standard deviation of time reported in the descriptive statistics table. Without prior knowledge of the diameter of a new product, our best guess for the polishing time would be about 35.8 minutes, with a standard deviation of 19.0. With the linear regression model, the error of your estimate is considerably lower, about 13.7.
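The two quantities being compared can be computed side by side: the standard deviation of the dependent variable (the typical error when guessing without the model) and the standard error of the estimate (the typical error when using the model). A sketch on synthetic data, assuming NumPy; the 35.8/19.0/13.7 figures come from the SPSS tables.

```python
import numpy as np

# Synthetic stand-in for the polishing-time data.
rng = np.random.default_rng(4)
x = rng.uniform(5, 20, 59)
y = 3.5 * x - 2.0 + rng.normal(0, 13, x.size)

X = np.column_stack([np.ones_like(x), x])
b, *_ = np.linalg.lstsq(X, y, rcond=None)
resid = y - X @ b
n, p = len(y), 1

sd_y = y.std(ddof=1)                             # typical error without the model
see = np.sqrt(np.sum(resid ** 2) / (n - p - 1))  # standard error of the estimate
print(sd_y, see)
```

Whenever the predictor carries real information, the standard error of the estimate comes out smaller than the raw standard deviation, which is the point of the comparison.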
A residual is the difference between the observed and model-predicted values of the dependent variable. The residual for a given product is the observed value of the error term for that product. A histogram or P-P plot of the residuals will help you to check the assumption of normality of the error term. The shape of the histogram should approximately follow the shape of the normal curve. This histogram is acceptably close to the normal curve.
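Alongside the visual check, a crude numeric check of residual normality is possible: standardize the residuals and compare the share falling within two standard deviations to the roughly 95% expected under normality. A sketch on synthetic data with normal errors by construction, assuming NumPy:

```python
import numpy as np

rng = np.random.default_rng(5)
x = rng.uniform(0, 10, 1000)
y = 1.0 + 2.0 * x + rng.normal(0, 1.0, x.size)  # normal errors by construction

X = np.column_stack([np.ones_like(x), x])
b, *_ = np.linalg.lstsq(X, y, rcond=None)
resid = y - X @ b

# Standardize the residuals.
z = (resid - resid.mean()) / resid.std(ddof=1)

# Under normality, about 95% of standardized residuals lie within +/- 2.
share_within_2 = np.mean(np.abs(z) < 2)
print(share_within_2)
```

A share far from 0.95 would be a warning sign, just as a histogram that departs visibly from the normal curve would be.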