Module V

Uploaded by Deekonda Anusha
MULTIPLE REGRESSION AND PREDICTION

So far in this chapter we have discussed regression prediction based on only two variables, one dependent and one independent. In other words, we have made use of linear correlation for deriving two regression equations: one for predicting Y from X scores, and the other for predicting X from Y scores. However, in practical situations in studies made in education and psychology, we quite often find that the dependent variable is jointly influenced by more than two variables. For example, academic performance is jointly influenced by variables like intelligence, hours devoted per week to study, quality of teachers and facilities available in the school, parental education, and socio-economic status. In such a situation, we have to compute a multiple correlation coefficient (R) rather than a mere linear correlation coefficient (r). Accordingly, the line of regression must also be set up in accordance with the concept of multiple R. Here, the resulting regression equation is called the multiple regression equation. Let us now discuss the setting up of a multiple regression equation and its use in predicting the values of the dependent variable on the basis of the values of two or more independent variables.

Setting up of a Multiple Regression Equation

Suppose there is a dependent variable X1 (say, academic achievement) which is controlled by or dependent upon two variables designated X2 and X3 (say, intelligence and number of hours studied per week). The multiple regression equation helps us predict the value of X1 (i.e., it gives X̂1 as the predicted value) from known values of X2 and X3. The equation used for this is as follows:

x1 = b12.3 x2 + b13.2 x3   (in deviation form)

where x1 = X1 - M1, x2 = X2 - M2 and x3 = X3 - M3. Hence the equation becomes

X1 - M1 = b12.3 (X2 - M2) + b13.2 (X3 - M3)

or

X1 = b12.3 X2 + b13.2 X3 + M1 - b12.3 M2 - b13.2 M3
   = b12.3 X2 + b13.2 X3 + K   (in the score form)

where K is a constant equal to M1 - b12.3 M2 - b13.2 M3.
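To make the score-form equation concrete, here is a minimal Python sketch; the b weights and means below are made-up illustrative values, not taken from any example in this chapter:

```python
# Score-form multiple regression: X1_hat = b12_3 * X2 + b13_2 * X3 + K.
# The b weights and means below are made-up values for illustration.
b12_3, b13_2 = 1.5, 3.0
M1, M2, M3 = 100.0, 10.0, 3.0

# Constant term K = M1 - b12_3 * M2 - b13_2 * M3
K = M1 - b12_3 * M2 - b13_2 * M3

def predict_x1(x2, x3):
    """Predicted X1 for given X2 and X3 scores."""
    return b12_3 * x2 + b13_2 * x3 + K

# When X2 and X3 equal their own means, the prediction equals M1.
print(predict_x1(M2, M3))  # -> 100.0
```

Substituting the means M2 and M3 returns M1 exactly, which is a quick sanity check on the value of K.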
In this equation,

X̂1 = predicted value of the dependent variable
b12.3 = multiplying constant or weight for the X2 value
b13.2 = multiplying constant or weight for the X3 value

Both b12.3 and b13.2 are generally named partial regression coefficients. The partial regression coefficient b12.3 tells us how many units X1 increases for every unit increase in X2 while X3 is held constant; b13.2 tells us how many units X1 increases for every unit increase in X3 while X2 is held constant.

The b coefficients (partial regression coefficients b12.3 and b13.2) are computed as

b12.3 = (σ1/σ2) β12.3   and   b13.2 = (σ1/σ3) β13.2

Here, σ1, σ2 and σ3 are the standard deviations of the distributions of the variables X1, X2 and X3, and β12.3 and β13.2 are called β coefficients (beta coefficients). These β coefficients are also called standard partial regression coefficients and are computed by using the following formulae:

β12.3 = (r12 - r13 r23) / (1 - r23²)

β13.2 = (r13 - r12 r23) / (1 - r23²)

Steps to Formulate a Regression Equation

The steps for framing a multiple regression equation for predicting the dependent variable X1 with the help of the given values of the independent variables X2 and X3 can be summarized as follows:

Step 1. Write the multiple regression equation as

X1 = b12.3 X2 + b13.2 X3 + K, where K = M1 - b12.3 M2 - b13.2 M3

Step 2. Write the formulae for the calculation of the partial regression coefficients:

b12.3 = (σ1/σ2) β12.3   and   b13.2 = (σ1/σ3) β13.2

Step 3. Compute the values of the standard partial regression coefficients β12.3 and β13.2:

β12.3 = (r12 - r13 r23) / (1 - r23²)   and   β13.2 = (r13 - r12 r23) / (1 - r23²)

Step 4. Put the values of β12.3 and β13.2 in the formulae for computing the values of b12.3 and b13.2, along with the values of σ1, σ2 and σ3.

Step 5. Compute the value of K by putting the values of M1, M2 and M3 (the means of the distributions X1, X2 and X3) and the computed values of b12.3 and b13.2 in the equation K = M1 - b12.3 M2 - b13.2 M3.

Step 6.
Put the values of b12.3, X2 (the given value of independent variable X2), b13.2, X3 (the given value of independent variable X3) and the value of the constant K in the multiple regression equation given in Step 1.

Now, the task of formulating regression equations can be illustrated with the help of a few examples.

Example 14.4: Given the following data for a group of students:

X1 = scores on an achievement test
X2 = scores on an intelligence test
X3 = study hours per week

M1 = 101.71, M2 = 10.06, M3 = 3.35
σ1 = 13.65, σ2 = 3.06, σ3 = 2.02
r12 = 0.41, r13 = 0.50, r23 = 0.16

(a) Make out a multiple regression equation involving the dependent variable X1 and the independent variables X2 and X3.
(b) If a student scores 12 on the intelligence test (X2) and 4 on study hours per week (X3), what will be his estimated score on the achievement test (X1)?

Solution.

Step 1. Write the multiple regression equation

X1 = b12.3 X2 + b13.2 X3 + K, where K = M1 - b12.3 M2 - b13.2 M3

Step 2. Write the formulae for the partial regression coefficients:

b12.3 = (σ1/σ2) β12.3   and   b13.2 = (σ1/σ3) β13.2

Step 3. Find the values of the standard partial regression coefficients:

β12.3 = (0.41 - 0.50 × 0.16) / (1 - 0.16²) = (0.41 - 0.08) / (1 - 0.0256) = 0.3300 / 0.9744 = 0.338

β13.2 = (0.50 - 0.41 × 0.16) / (1 - 0.16²) = (0.50 - 0.0656) / 0.9744 = 0.4344 / 0.9744 = 0.446

Step 4. Substituting the values of β12.3 and β13.2 in the relations in Step 2, we obtain

b12.3 = (13.65 / 3.06) × 0.338 = 1.507
b13.2 = (13.65 / 2.02) × 0.446 = 3.014

Step 5. Compute the value of the constant K:

K = M1 - b12.3 M2 - b13.2 M3
  = 101.71 - 1.507(10.06) - 3.014(3.35)
  = 101.710 - 15.160 - 10.097
  = 101.710 - 25.257 = 76.453

Step 6. The multiple regression equation, as laid down in Step 1, is

X1 = b12.3 X2 + b13.2 X3 + K
Putting the values of b12.3, b13.2 and K in the above equation, we get

X1 = 1.507 X2 + 3.014 X3 + 76.453

The required multiple regression equation is therefore

X1 = 1.507 X2 + 3.014 X3 + 76.453

Here, X2 = 12 and X3 = 4. Hence the predicted value of the X1 variable is

X1 = 1.507(12) + 3.014(4) + 76.453
   = 18.084 + 12.056 + 76.453
   = 106.593 = 107 (nearest whole number)

Example 14.5: Given the following data for a group of students:

X1 = scores on an intelligence test
X2 = scores on a memory sub-test
X3 = scores on a reasoning sub-test

M1 = 78.00, M2 = 87.20, M3 = 32.80
σ1 = 10.21, σ2 = 6.02, σ3 = 10.35
r12 = 0.67, r13 = 0.75, r23 = 0.63

(a) Establish a multiple regression equation involving the dependent variable X1 and the two independent variables X2 and X3.
(b) If a student obtains a score of 80 on the memory sub-test and a score of 40 on the reasoning sub-test, what can be his expected score on the total intelligence test?

Solution.

Step 1. Write the multiple regression equation

X1 = b12.3 X2 + b13.2 X3 + K, where K = M1 - b12.3 M2 - b13.2 M3

Step 2. Write the formulae for the partial regression coefficients:

b12.3 = (σ1/σ2) β12.3 = (10.21/6.02) β12.3
b13.2 = (σ1/σ3) β13.2 = (10.21/10.35) β13.2

Step 3. Calculate the standard partial regression coefficients:

β12.3 = (0.67 - 0.75 × 0.63) / (1 - 0.63²) = (0.67 - 0.4725) / 0.6031 = 0.1975 / 0.6031 = 0.327

β13.2 = (0.75 - 0.67 × 0.63) / (1 - 0.63²) = (0.75 - 0.4221) / 0.6031 = 0.3279 / 0.6031 = 0.543

Step 4. Put the values of β12.3 and β13.2 in the relations given in Step 2 and obtain

b12.3 = (10.21/6.02) × 0.327 = 0.556 approximately
b13.2 = (10.21/10.35) × 0.543 = 0.986 × 0.543 = 0.535 approximately

Step 5. Compute the value of the constant K:

K = M1 - b12.3 M2 - b13.2 M3
  = 78.00 - (0.556 × 87.2) - (0.535 × 32.8)
  = 78.00 - 48.483 - 17.548
  = 78.00 - 66.031 = 11.969 = 12 approximately

Step 6. The multiple regression equation, as laid down in Step 1, is

X1 = b12.3 X2 + b13.2 X3 + K = 0.556 X2 + 0.535 X3 + 12

Step 7.
The predicted value of the X1 variable is

X1 = 0.556 × 80 + 0.535 × 40 + 12
   = 44.480 + 21.400 + 12
   = 77.88 = 78 (approximately)

Standard Error of Estimate

With the help of the multiple regression equation, we try to predict or estimate the value of X1 (the dependent variable) when the values of the independent variables X2, X3, ... are given. The difference between the actual value of X1 and the predicted or estimated value X̂1 is known as the error of the estimate. The standard error (SE) of the estimate can be computed by the formula

σ(estimated X1), or σ1.23 = σ1 √(1 - R²1.23)

Here, σ1 is the standard deviation of X1, the dependent variable, and R1.23 is the multiple correlation coefficient (the correlation between X1 and X2 + X3). It can be computed by using the formula

R²1.23 = (r12² + r13² - 2 r12 r13 r23) / (1 - r23²)

As discussed in Chapter 13, it can also be computed with the help of the βs (beta coefficients), which are computed during the course of establishing the multiple regression equation. The formula for computing the multiple correlation coefficient with the help of the betas is

R²1.23 = β12.3 r12 + β13.2 r13

The SE of the estimate can then be computed by using the formula given above. These formulae can be illustrated with the help of examples.

Example 14.6: In Example 14.4, X1 represents the scores on the achievement test (dependent variable), and X2 and X3 indicate scores on intelligence and study hours (independent variables). The other related values needed are:

(i) σ1 = 13.65 (SD of the scores of the dependent variable)
(ii) r12 = 0.41, r13 = 0.50
(iii) β12.3 = 0.338, β13.2 = 0.446

Let us now compute the value of R²1.23:

R²1.23 = β12.3 r12 + β13.2 r13 = 0.338 × 0.41 + 0.446 × 0.50 = 0.13858 + 0.22300 = 0.36158

Then, σ(estimated X1), or

σ1.23 = σ1 √(1 - R²1.23) = 13.65 √(1 - 0.36158) = 13.65 √0.63842 = 13.65 × 0.799 = 10.9

The related given and computed data in Example
14.5 are as follows:

(i) σ1 = 10.21
(ii) r12 = 0.67, r13 = 0.75
(iii) β12.3 = 0.327, β13.2 = 0.543

Let us first compute the value of R²1.23:

R²1.23 = β12.3 r12 + β13.2 r13 = (0.327 × 0.67) + (0.543 × 0.75) = 0.219 + 0.407 = 0.626

Then, σ(estimated X1), or

σ1.23 = σ1 √(1 - R²1.23) = 10.21 √(1 - 0.626) = 10.21 √0.374 = 10.21 × 0.611 = 6.24

SUMMARY

1. The coefficient of correlation helps us find the degree and direction of association between two variables. The concepts of regression lines and regression equations, however, help us predict the value of one variable when the values of a correlated variable or variables are known to us.

2. In simple regression based on r, there are two regression equations:

(i) Y - My = r (σy/σx)(X - Mx)

(This equation helps us predict the score or value of the Y variable corresponding to any value of the X variable.)
This deviation of X, (the predicted score) from X, (the actual score) is called error in prediction This error can be computed in the form of SE of the estimate by using the following formulae: (i) The SE of the estimate for pre dicting Y from X is Scanned with CamScanner (ii) The SE of the estimate for predicting X and Y is On = oft, mate for predicting dependent variahte rhe SE of th P from the given values of independent v Xq is @ (estimated X}) or where gj is the SD of the correlation coeffi EXERCISES What are the regression lines in a scatter diagram? How would you use them for the prediction of variables? Explain with the help of an example. Given the following data for two tests: History (X) Civics (Y) Mean = 25 Mean = 30 SD =1.7 SD = 1.6 Coefficient of correlation Try = 0.95 (a) Determine both the regression equations. (b) Predict the probable score in Civics of a student whose score in History is 40. (c) Predict the probable score in History of a student whose score in Civics is 50. From the Scatter diagram in Figure 7.8, (a) Calculate both the regression equations. (b) Predict the probable score on X when Y = 100. A group of five students obtained the following scores on two achievement tests X and ¥: Students A B iG D E Scores in 10 11 12 9 8 X test Scores in 12 18 20 10 (10 Y test (a) Determine both the regression equations. Scanned with CamScanner cm (b) Ifa student scores 15 in test X, predict his probable score in test Y. (c) Ifa student scores 5 in test Y, predict his probable score in test X. What is a multiple regression equation? How is it used for predicting the value of a dependent variable? Illustrate with the help of an example. A researcher collected the following data during the course of his study. 
Dependent Independent Independent variable X, variable Xo variable Xx M, = 78 Mz = 55 a = 16 o;,= 10 nN = .70 73 = .80 To3 = .50 (a) Set up the multiple regression equation for predicting the value of dependent variable for the given values of both the independent variables. (b) If X» = 60 and Xz = 40, predict the value of X, A researcher on psychology wanted to study the relationship of physical efficiency and hours per week devoted to practice with the performance in athletics. He obtained the following results during the course of his study: Performance Physical Hours practised in athletics (X) efficiency test (Xo) _ per week (Xs) M, = 73.8 My = 19.7 Mz = 49.5 go = 9.1 03 = 17.0 Tg = 465 N13 To, = .562 (a) Set up the multiple regression equation for predicting performance in athletics on the basis of scores on physical efficiency test and hours per week practice. (b) If X; = 20 and Xy = 42, predict the value of X}. Scanned with CamScanner Multivariate ANOVA (MANOVA) Multivariate ANOVA (MANOVA) extends the capabilities of analysis of variance (ANOVA) by assessing multiple dependent variables simultaneously. ANOVA statistically tests the differences between three or more group means. For example, if we have three different teaching methods and we want to evaluate the average scores for these groups, we can use ANOVA. However, ANOVA does have a drawback, It can assess only one dependent variable at a time. This limitation can be an enormous problem in certain circumstances because it can prevent we from detecting the effects that actually exist. MANOVA provides a solution for some studies. This statistical procedure tests multiple dependent variables at the same time. By doing so, MANOVA can offer several advantages over ANOVA. ANOVA limitations Regular ANOVA tests can assess only one dependent variable at a time in our model. Even when we fit a general linear model with multiple independent variables, the model only considers one dependent variable. 
The problem is that these models can't identify patterns in multiple dependent variables. This restriction can be very problematic in certain cases where a typical ANOVA won't be able to produce statistically significant results.

Comparison of MANOVA to ANOVA Using an Example

MANOVA can detect patterns between multiple dependent variables. It sounds complex, but graphs make it easy to understand. Here is an example that compares ANOVA to MANOVA.

Suppose we are studying three different teaching methods for a course. This variable is our independent variable. We also have student satisfaction scores and test scores. These variables are our dependent variables. We want to determine whether the mean scores for satisfaction and tests differ between the three teaching methods.

The graphs below display the scores by teaching method. One chart shows the test scores and the other shows the satisfaction scores. These plots represent how one-way ANOVA tests the data: one dependent variable at a time.
Scanned with CamScanner Scatterplot of Test vs Satisfaction 307 ‘Method eo et 3.06 a2 3 3.05 . 3.084 i % 303 y 3.024 301 . 3.00 299 . 28 299° «300 «30130230338 3.05 306 The graph displays a positive correlation between Test scores and Satisfaction. As student satisfaction increases, test scores tend to increase as well. Moreover, for any given satisfaction score, teaching method 3 tends to have higher test scores than methods 1 and 2. In other words, students who are equally satisfied with the course tend to have higher scores with method 3. MANOVA can test this pattern statistically to help ensure that it’s not present by chance. In our preferred statistical software, fit the MANOVA model so that Method is the independent variable and Satisfaction and Test are the dependent variables. The MANOVA results are below. General Linear Model: Test, Satisfaction versus Method MANOVA for Method ge2 me-05 n= 21.0 Test OF Criterion Statistic F Mun Denom Wilks’ 0.51094 8.778 «4 eB Lawley-Horelling 0.95877 10.275 486 Pillai's 0.40977 7.297 4 = 90 Roy's o.9ses1 Even though the one-way ANOVA results and graphs seem to indicate that there is nothing of interest, MANOVA produces statistically significant results—as signified by the minuscule P- values, We can conclude that there is an association between the teaching method and the relationship between the dependent variables. Scanned with CamScanner Benefits of MANOVA: Use multivariate ANOVA when our dependent variables are correlated. 
The correlation structure between the dependent variables provides additional information to the model which gives MANOVA the following enhanced capabilities: © Greater statistical power: When the dependent variables are correlated, MANOVA can identify effects that are smaller than those that regular ANOVA, can find, © Assess patterns between multiple dependent variables: The factors in the model can affect the relationship between dependent variables instead of influencing a single dependent variable. As the example in this post shows, ANOVA tests with a single dependent variable can fail completely to detect these patterns Limits the joint error rate: When we perform a series of ANOVA tests because wwe have multiple dependent variables, the joint probability of rejecting a true null hypothesis increases with each additional test. Instead, if we perform one MANOVA test, the error rate equals the significance level. > Discriminant Analysis (Wilk’s Lambda) Wilks’ lambda (A) is a test statistic that’s reported in results from MANOVA, discriminant analysis, and other multivariate procedures. Other similar test statistics include Pillai’s trace criterion and Roy’s Ger criterion, + InMANOVA, A tests, if there are differences between group means for a particular combination of dependent variables. It is similar to the F-test statistic in ANOVA. Lambda is a measure of the percent variance in dependent variables not explained by differences in levels of the independent variable. A value of zero means that there isn’t any variance not explained by the independent variable (which is ideal). In other words, the closer to zero the statistic is, the more the variable in question contributes to the model. We would reject the null hypothesis when Wilk’s lambda is close to zero, although this should be done in combination with a small p-value. ‘© Indiscriminant analysis, Wilk’s lambda tests how well each level of independent variable contributes to the model. 
The scale ranges from 0 to 1, where 0 means total discrimination and 1 means no discrimination. Each independent variable is tested by putting it into the model and then taking it out, generating a Λ statistic. The significance of the change in Λ is measured with an F-test; if the F-value is greater than the critical value, the variable is kept in the model. This stepwise procedure is usually performed using software like Minitab, R, or SPSS. The SPSS output below shows which variables (from a list of a dozen or more) were kept using this procedure.

[SPSS output: stepwise statistics table. At each step, the variable that minimizes the overall Wilks' lambda is entered; here the variables HEIGHT, CANOPY and CONCOW were entered in steps 1 to 3, each with an exact F statistic significant at .000.]

Formula

Λ = |E| / |H + E|

where H is the hypothesis (between-groups) SSCP matrix and E is the error (within-groups) SSCP matrix. The quantity 1 - Λ is the proportion of variance in the dependent variables explained by the model's effect. Caution should be used in interpreting results, as this statistic tends to be biased, especially for small samples.

Output Components

Wilks' lambda output has several components, including:

• "Sig" or significance (p-value): if this is small (i.e., under .05), reject the null hypothesis.
• "Value": the value of Wilks' lambda itself.
• "Statistic": the F-statistic associated with the listed degrees of freedom. It would be reported in APA format as F(df1, df2) = value.
For example, if we had an F-value of 36.612 with 1 and 2 degrees of freedom, we would report that as F(1, 2) = 36.612.

> Factor analysis

Factor analysis is a way to take a mass of data and shrink it to a smaller data set that is more manageable and more understandable. It is a way to find hidden patterns, show how those patterns overlap, and show what characteristics are seen in multiple patterns. It is also used to create sets of variables for similar items (these sets of variables are called dimensions). It can be a very useful tool for complex sets of data involving psychological studies, socio-economic status and other involved concepts.

A "factor" is a set of observed variables that have similar response patterns; they are associated with a hidden variable (a latent variable) that isn't directly measured. Factors are listed according to their factor loadings, or how much variation in the data they can explain.

There are two types: exploratory and confirmatory.

• Exploratory factor analysis is used if we don't have any idea about the structure of our data or how many dimensions are in a set of variables.
• Confirmatory factor analysis is used for verification, provided we have a specific idea about the structure of our data or how many dimensions are in a set of variables.

Factor Loadings
A factor loading of zero would indicate no effect. Multiple Factor Analys This subset of Factor Analysis is used when our variables are structured in variable groups. For example, we might have a student health questionnaire with several items like sleep pattern addictions, psychological health, or leaming disabilities The two steps performed in Multiple Factor Analysis are: 1. Principal Component Analysis is performed on each set of data. This gives an eigenvalue, which is used to normalize the data sets. 2. The new data sets are merged into a unique matrix and a second, global PCA is performed. Performing Factor Analysis Factor Analysis is an extremely complex mathematical procedure and is performed with software. ‘+ Instructions for Stata, = Minitab. + SPSS. Scanned with CamScanner Confirmatory Factor Analysis allows we to figure out if a relationship between a set of observed variables (also known as manifest variables) and their underlying constructs exists. It is similar to Exploratory Factor Analysis, The main difference between the two is: ‘+ If we want to explore patterns, use EFA. ‘* If we want to perform hypothesis testing, use CFA. EFA provides information about the optimal number of factors required to represent the data set. With Confirmatory Factor Analysis we can specify the number of factors required. For example, CFA can answer questions like “Does my ten-question survey accurately measure one specific factor?”. Although it is technically applicable to any discipline, it is typically used in the social sciences. Exploratory Factor Analysis (EFA) is used to find the underlying structure of a large set of variables. “It reduces data to a much smaller set of summary variables. EFA is almost identical to Confirmatory Factor Analysis(CFA). Both techniques can (perhaps surprisingly) be used to confirm or explore. Similarities are: + Assess the internal reliability of a measure. + Examine factors or theoretical constructs represented by item sets. 
They assume the factors aren’t correlated. «Investigate quality for individual items. There are, however, some differences, mostly concerning how factors are treated/used. EFA is basically a data-driven approach, allowing all items to load on all factors, while with CFA we must specify which factors to load. EFA is a good choice if we don’t have any idea about what common factors might exists. EFA can generate a large number of possible models for our data, something that may not be possible if a researcher has to specify factors. If we do have an idea about what the models look like, and we want to test our hypotheses about the data structure, CFA is a better approach. Scanned with CamScanner
