This document provides an overview of simple linear regression. It discusses how the term "regression" was originally coined by Sir Francis Galton to describe relationships between variables. It defines key regression terminology such as explanatory variables, populations, samples, and regression equations. It also outlines the important assumptions of linear regression: linearity, independence, homoskedasticity, and normality of errors. The document is intended to explain the basic concepts and assumptions underlying simple linear regression analysis.
Prof. Dr. Moustapha Ibrahim Salem
[email protected] 01005857099

Simple Linear Regression

Why "Regression"?
• The term was coined by Sir Francis Galton in his 1885 address to the BAAS, in describing his findings about the relationship between the heights of children and of their parents.
• He found that the children's heights tended to be less extreme than their parents', i.e. closer to the mean: tall parents had tall children, but less extremely tall, while short couples had short children, but less extremely short. (The gradient of the children-on-parents line was 0.61.)
• He called this "Regression Towards Mediocrity in Hereditary Stature" (Galton, F., 1886).

Galton's data
• Inverting this logic might predict that parents are more extreme than their children.
• This is not so: the gradient of the parents-on-children line was 0.29, even less than the 1.0 expected from equality.
• Remember: the best-fit line does not simply transpose when X and Y are swapped!

Types of Regression Models
• Simple regression: one explanatory variable (linear or non-linear)
• Multiple regression: two or more explanatory variables (linear or non-linear)
Equation for a line - getting notation straight
• In order to use regression analysis effectively, it is important that you understand the concepts of slopes and intercepts and how to determine them from data values.
• In linear algebra, the equation of a straight line was often written y = mx + b, where m is the slope and b is the intercept.
• In some popular spreadsheet programs, the authors decided to write the equation of a line as y = a + bx; now a is the intercept and b is the slope.
• Statisticians, for good reasons, have rationalized this notation and usually write the equation of a line as y = β0 + β1x or as Y = b0 + b1X (the distinction between β0 and b0 will be made clear shortly).
• Using the subscript 0 for the intercept and the subscript 1 for the coefficient of the X variable then extends readily to more complex cases.

Populations and samples
• All of statistics is about detecting signals in the face of noise and estimating population parameters from samples. Regression is no different.
• First consider the population. A correct definition of the population is an important part of any study.
• Conceptually, we can think of the large set of all units of interest. On each unit there are, conceptually, both an X and a Y value. We wish to summarize the relationship between Y and X, and furthermore to predict the Y value for future X values that may be observed from this population.
• If this were physics, we might conceive of a physical law between X and Y, e.g. F = ma or density ρ = w/v (a deterministic model).
• However, in chemical engineering, the relationship between Y and X is often much more tenuous (a probabilistic model).
• If you could draw a scatterplot of Y against X for ALL elements of the population, the points would NOT fall exactly on a straight line. Rather, the value of Y would fluctuate above or below a straight line at any given X value.
• We denote this relationship as Y = β0 + β1X + ε, where β0 and β1 are now the POPULATION intercept and slope, respectively.
• The term ε represents the random variation of individual units in the population above and below the expected value. It is assumed to have constant standard deviation over the entire regression line (i.e. the spread of data points in the population is constant over the entire regression line).
• Of course, we can never measure all units of the population, so a sample must be taken in order to estimate the population slope, population intercept, and population standard deviation.
• It is NOT necessary to select a simple random sample from the entire population. The bare minimum is that, for any individual X value found in the sample, the units in the population that share this X value must have been selected at random.
• This is quite a relaxed assumption! For example, it allows us to deliberately choose X values from the extremes and then, only at those X values, randomly select from the relevant subset of the population, rather than having to select X values at random from the population as a whole.
• In other words, we can for example select the X values to be 3, 5, 9, and 12, and then measure the Y values corresponding to each X.
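• As a small illustration of the probabilistic model Y = β0 + β1X + ε and of the relaxed sampling scheme just described (deliberately chosen X values, with units selected at random at each X), here is a minimal Python sketch. The parameter values, the number of units per X level, and the random seed are illustrative assumptions, not values from these slides.

import numpy as np

rng = np.random.default_rng(1)

# Assumed population parameters (illustrative only)
beta0, beta1, sigma = 2.0, 1.5, 0.8

# Deliberately chosen X values, as in the example above
x_levels = [3, 5, 9, 12]

# At each chosen X, draw units at random from the subpopulation sharing
# that X value: Y = beta0 + beta1*X + eps, with eps ~ Normal(0, sigma^2)
x = np.repeat(x_levels, 5).astype(float)   # 5 units per X level
y = beta0 + beta1 * x + rng.normal(0.0, sigma, size=x.size)

for xi, yi in zip(x, y):
    print(f"X = {xi:4.1f}   Y = {yi:6.2f}")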
Population & Sample Regression Models

Obtaining Estimates
• To distinguish between population parameters and sample estimates, we denote the sample intercept by b0 and the sample slope by b1. The equation fitted to a particular sample of points is expressed as ŷ = b0 + b1x.
• How is the best-fitting line found when the points are scattered? Many methods have been proposed (and used) for curve fitting. Some of these methods are:
• least squares
• least absolute deviations
• least median-of-squares
• least maximum absolute deviation
• Bayesian regression
• fuzzy regression
• In least squares, we minimize the SQUARED deviations in the VERTICAL direction.
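• The least-squares slope and intercept have simple closed forms: b1 = Σ(xi - x̄)(yi - ȳ) / Σ(xi - x̄)² and b0 = ȳ - b1·x̄. A minimal Python sketch with illustrative data (np.polyfit is used only as a cross-check):

import numpy as np

# Illustrative (x, y) data
x = np.array([3.0, 5.0, 9.0, 12.0, 15.0, 18.0])
y = np.array([5.1, 7.9, 12.2, 15.3, 18.8, 22.1])

# Closed-form least-squares estimates
xbar, ybar = x.mean(), y.mean()
b1 = np.sum((x - xbar) * (y - ybar)) / np.sum((x - xbar) ** 2)
b0 = ybar - b1 * xbar
print(f"b0 = {b0:.4f}, b1 = {b1:.4f}")

# Cross-check with numpy's degree-1 polynomial fit
b1_np, b0_np = np.polyfit(x, y, 1)
print(f"polyfit: b0 = {b0_np:.4f}, b1 = {b1_np:.4f}")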
Obtaining Estimates
• The estimated intercept (b0) is the estimated value of Y when X = 0.
• In some cases, it is meaningless to talk about values of Y when X = 0 because X = 0 is nonsensical. For example, in a plot of income vs year, it seems rather silly to investigate income in year 0.
• In these cases, there is no clear interpretation of the intercept, and it merely serves as a placeholder for the line.
• The estimated slope (b1) is the estimated change in Y per unit change in X. For every unit change in the horizontal direction, the fitted line rises by b1 units.
• If b1 is negative, the fitted line points downwards, and the increase in the line is negative, i.e., actually a decrease.
• As with all estimates, a measure of precision can be obtained. As before, this is the standard error of each of the estimates. Again, there are computational formulae, but in this age of computers these are not important.
• As before, approximate 95% confidence intervals for the corresponding population parameters are found as estimate ± 2 × se.
• Formal tests of hypotheses can also be done. Usually these are only done on the slope parameter, as this is typically of most interest.
• The null hypothesis is that the population slope is 0, i.e. there is no linear relationship between Y and X (can you draw a scatterplot showing such a relationship?). More formally, the null hypothesis is H0: β1 = 0.
• Notice that the null hypothesis is ALWAYS in terms of a population parameter and not in terms of a sample statistic.

Inference about the Slope: t Test
• t test for a population slope: is there a linear relationship between x and y?
• Null and alternative hypotheses: H0: β1 = 0 (no linear relationship); H1: β1 ≠ 0 (a linear relationship does exist).
• Test statistic: t = (b1 - 0) / se(b1), with n - 2 degrees of freedom.

Regression Analysis for Description

Obtaining Estimates
• Confidence and prediction intervals: the prediction interval predicts the range within which a future individual observation will fall.
• The confidence interval, in contrast, is for the line itself: we are 95% confident that the true best-fit line for the population lies within the confidence band.

Assumptions
• Linearity: regression analysis assumes that the relationship between Y and X is linear. Make a scatterplot of Y against X to assess this assumption.
• Correct sampling scheme: the Y values must be a random sample from the population of Y values for every X value in the sample. Fortunately, it is not necessary to have a completely random sample from the population, as the regression line is valid even if the X values are deliberately chosen. However, for a given X, the values from the population must be a simple random sample.
• No outliers or influential points: all the points must belong to the relationship; there should be no unusual points. The scatterplot of Y vs X should be examined. If in doubt, fit the model with the points in and out of the fit and see whether this makes a difference to the fit.
• Equal variation along the line (homoskedasticity): the variability about the regression line is similar for all values of X, i.e. the scatter of the points above and below the fitted line should be roughly constant over the entire line.
• Independence: each value of Y is independent of any other value of Y. The most common case where this fails is time-series data, where X is a time measurement. In these cases, time-series analysis should be used.
• Normality of errors: the residuals (the differences between the values of Y and the points on the line) must be normally distributed.
• X measured without error: in regression, it can turn out that the X value is not known exactly. This general problem is called the "errors in variables" problem and has a long history in statistics. It turns out that there are two important cases.
• (1) If the value reported for X is a nominal value and the actual value of X varies randomly around this nominal value, then there is no bias in the estimates.
• (2) However, if the value used for X is an actual measurement of the true underlying X, then there is uncertainty in both the X and Y directions. In this case, estimates of the slope are attenuated towards zero (i.e. positive slopes are biased downwards, negative slopes biased upwards). More alarmingly, the estimates are no longer consistent, i.e. as the sample size increases, the estimates no longer tend to the true population values!
• This latter case of "errors in variables" is very difficult to analyze properly, and there are no universally accepted solutions.

F-Test of Model Significance

Residual plots
• After the curve is fit, it is important to examine whether the fitted curve is reasonable. This is done using residuals.
• The residual for a point is the difference between the observed value and the predicted value; i.e., the residual from fitting a straight line is found as ei = yi - (b0 + b1·xi).
• Typical residual plots are illustrated below, but note that with small datasets the patterns will not be as clear-cut.
• Therefore, with small datasets, don't over-analyze the plots; only gross deviations from the ideal plots are of interest.
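• To make the residual check concrete, here is a minimal Python sketch that fits the line, computes the residuals ei = yi - ŷi, and plots them against the fitted values; a flat, patternless scatter around zero is the ideal. The data values are illustrative assumptions only.

import numpy as np
import matplotlib.pyplot as plt

# Illustrative data
x = np.array([3.0, 5.0, 9.0, 12.0, 15.0, 18.0, 21.0, 24.0])
y = np.array([5.0, 8.1, 12.0, 15.6, 18.4, 22.3, 24.9, 28.8])

# Fit the least-squares line and compute the residuals
b1, b0 = np.polyfit(x, y, 1)
yhat = b0 + b1 * x
resid = y - yhat

# Residuals vs fitted values: look for trends, funnels, or outliers
plt.scatter(yhat, resid)
plt.axhline(0.0, linestyle="--")
plt.xlabel("Fitted value")
plt.ylabel("Residual")
plt.title("Residuals vs fitted values")
plt.show()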
Residual Plots
• Let's have another look at the major warning signs in residual plots.

Quiz ... Comment on the plot

Probability Plots
• The probability plot is a graphical technique for assessing whether or not a data set follows a given distribution, such as the normal distribution.
• The data are plotted against a theoretical normal distribution in such a way that the points should form approximately a straight line. Departures from this straight line indicate departures from the specified distribution.
• If the residuals are not normally distributed, the t-tests on regression coefficients, the F-tests, and the interval estimates are not valid. This is a critical assumption to check.
• Stragglers at either end of the normal probability plot indicate outliers.
• Curvature at both ends of the plot indicates long or short distributional tails.
• Convex, or concave, curvature indicates a lack of symmetry.
• Gaps, plateaus, or segmentation indicate clustering and may require a closer examination of the data or model.
• Of course, use of this graphical tool with very small sample sizes (less than 100 points) is unwise.

Example - Yield and fertilizer
• We wish to investigate the relationship between yield (liters) and fertilizer (kg/ha) for tomato plants.
• Interest also lies in predicting the yield when 16 kg/ha is applied. The levels of fertilizer were randomly assigned to the plots. At the end of the experiment, the yields were measured and the following data were obtained.
• In this study, it is quite clear that the fertilizer is the predictor (X) variable, while the response variable (Y) is the yield.

Example - Yield and fertilizer - KYPLOT Analysis
• The ordering of the rows is NOT important; however, it is often easier to find individual data points if the data are sorted by the X value, and the rows for future predictions are placed at the end of the dataset.
• Use the Statistics -> Regression Analysis -> Simple Regression platform to start the analysis. Specify the Y and X variables as needed.
• A new window will pop up with the regression results.
• At this stage, it is also useful to draw a scatterplot of the data (refer to previous KYPLOT tutorials):
• the relationship looks approximately linear;
• there do not appear to be any outliers or influential points;
• the scatter appears to be roughly equal across the entire regression line.
• Residual plots will be used later to check these assumptions in more detail.
• The Fit menu item allows you to fit the least-squares line.
• The actual fitted line is drawn on the scatter plot, and the coefficients of the straight-line equation (here called A1 for the intercept and A2 for the slope) are printed below the fit spreadsheet.
• The estimated slope is the estimated change in yield when the amount of fertilizer is increased by one unit. In this case, the yield is expected to increase (why?) by 1.10137 L when the fertilizer amount is increased by 1 kg/ha.
• The estimated standard error for b1 (the estimated slope) is 0.132 L/kg. This is an estimate of the standard deviation of b1 over all possible experiments. A standard error can also be found for it, as shown in the table above.
• We interpret this interval as "being 95% confident that the true increase in yield when the amount of fertilizer is increased by one unit is somewhere between 0.837 and 1.365 L/kg."
• Be sure to carefully distinguish between β1 and b1. Note that the confidence interval is computed using b1, but it is a confidence interval for β1, the unknown population parameter.
• In linear regression problems, one hypothesis of interest is whether the true slope is zero. This would correspond to no linear relationship between the response and predictor variables (why?). In many cases, a confidence interval tells the entire story.
• KYPLOT produces a test of the hypothesis that each of the parameters (the slope and the intercept in the population) is zero. The output is reproduced below.
• The hypothesis testing proceeds as follows. First, specify the null and alternative hypotheses: H0: β1 = 0 (no linear relationship); H1: β1 ≠ 0 (a linear relationship does exist). Notice that the null hypothesis is in terms of the population parameter β1. This is a two-sided test, as we are interested in detecting differences from zero in either direction.

Example - Yield and fertilizer - KYPLOT Analysis
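• The interval quoted above can be reproduced from the reported estimate and standard error with the estimate ± 2 × se rule from earlier, and the same two numbers give the t statistic for H0: β1 = 0. A small Python check (1.10137 and 0.132 are the values reported above; the t value is simply their implied ratio, not a number quoted on the slides):

# Reported estimates for the fertilizer example
b1 = 1.10137      # estimated slope (L per kg/ha)
se_b1 = 0.132     # estimated standard error of the slope

# Approximate 95% confidence interval: estimate +/- 2 * se
lower, upper = b1 - 2 * se_b1, b1 + 2 * se_b1
print(f"95% CI for beta1: ({lower:.3f}, {upper:.3f})")   # about (0.837, 1.365)

# t statistic for H0: beta1 = 0
t = (b1 - 0) / se_b1
print(f"t = {t:.2f}")                                    # about 8.34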
• The residuals are simply the differences between the actual data points and the corresponding spots on the line, measured in the vertical direction. The residual plot shows no trend in the scatter around the value of zero.

Example - Yield and fertilizer - KYPLOT Analysis
• The normal probability plot of residuals shown above was created with NCSS software.

Transformations
• In some cases, the plot of Y vs X is obviously non-linear, and a transformation of X or Y may be used to establish linearity.
• Often a visual inspection of a plot will identify the appropriate transformation. There is no theoretical difficulty in fitting a linear regression using transformed variables, other than understanding the implicit assumption about the error structure. The model for a fit on transformed data is of the same linear form as before, with the transformed variable(s) in place of Y and/or X.
• Note that the error is assumed to act additively on the transformed scale. All of the assumptions of linear regression are assumed to hold on the transformed scale; in particular, the population standard deviation around the regression line is constant on the transformed scale.
• The most common transformation is the logarithmic transform. It doesn't matter whether the natural logarithm (often called the ln function) or the common logarithm (often called the log10 transformation) is used.

Transformations - Some common transformations
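• As an illustration of fitting after a transformation (here a log10 transform of Y, the most common case mentioned above), a minimal Python sketch with made-up data; the fit itself lives entirely on the transformed scale, and the back-transformed quantities are shown only for interpretation.

import numpy as np

# Illustrative data in which y changes by a roughly constant percentage
x = np.array([0.0, 1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0])
y = np.array([100.0, 88.0, 80.0, 70.5, 63.0, 56.5, 50.0, 45.0])

# Fit the straight line on the transformed scale: log10(y) = b0 + b1*x
b1, b0 = np.polyfit(x, np.log10(y), 1)
print(f"log10(y-hat) = {b0:.4f} + {b1:.4f} * x")

# Interpretation on the original scale: each unit increase in x
# multiplies y by 10**b1 (a multiplicative, not additive, change)
print(f"estimated multiplicative change per unit x: {10**b1:.3f}")

# Back-transformed prediction at x = 8 (illustration only)
print(f"predicted y at x = 8: {10**(b0 + b1 * 8):.1f}")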
• After the regression model is fit, remember to interpret the estimates of slope and intercept on the transformed scale.

Transformations - Example: Monitoring Dioxins
• An unfortunate byproduct of pulp-and-paper production used to be dioxins, a very hazardous material. This material was discharged into waterways with the pulp-and-paper effluent, where it bioaccumulated in living organisms such as crabs. Newer processes have eliminated this byproduct, but the dioxins in the organisms take a long time to degrade.
• Government environmental protection agencies take samples of crabs from affected areas each year and measure the amount of dioxins in the tissue. The following example is based on a real study.
• Each year, four crabs are captured from a monitoring station. The livers are excised, and the livers from all four crabs are composited together into a single sample. The dioxin level in this composite sample is measured. As there are many different forms of dioxins with different toxicities, a summary measure called the Total Equivalent Dose (TEQ) is computed from the sample.
• Here is the raw data.
• As with all analyses, start with a preliminary plot of the data, obtained using the Graph -> Create Graph platform.
• The preliminary plot of the data shows a decline in levels over time, but it is clearly non-linear. Why is this so? In many cases, a fixed fraction of the dioxins degrades per year, e.g. a 10% decline per year. This can be expressed as a non-linear relationship: TEQ = C (1 - r)^t,
• where C is the initial concentration, r is the rate of reduction per year, and t is the elapsed time.
• If this is plotted over time, it leads to the non-linear pattern seen in the plot.

Transformations - Example: Monitoring Dioxins
• If you want a detailed regression analysis, you should perform the regression separately from the Statistics -> Regression Analysis -> Simple Regression menu.
• Have a look at the npp (normal probability plot). It looks even better than the one obtained in the fertilizer-yield example, but it looks "tailed". However, r2 for the npp is > 0.98, i.e. the residuals are accepted as normal for this sample size.
• Possible remedies for the failure of these tests include using a transformation of Y such as the log or square root, correcting data-recording errors found by looking into outliers, adding additional independent variables, using robust regression, or using bootstrap methods. We will discuss these methods briefly in class.
• The fitted line is: log(TEQ) = 80.5 - 0.0394 (year). The intercept (80.5) would be the log(TEQ) in year 0, which is clearly nonsensical.
• The slope (-0.0394) is the estimated log(ratio) from one year to the next. It means that the TEQ in one year is only 91.3% of the TEQ in the previous year (since 10^(-0.0394) ≈ 0.913), or roughly a 9% decline per year.

Explained and Unexplained Variation
• SST = total sum of squares: measures the variation of the yi values around their mean ȳ. SST = Σ(yi - ȳ)².
• SSE = error sum of squares: variation attributable to factors other than the relationship between x and y. SSE = Σ(yi - ŷi)².
• SSR = regression sum of squares: explained variation attributable to the relationship between x and y. SSR = Σ(ŷi - ȳ)².
• These satisfy SST = SSR + SSE.
• Coefficient of Determination, R²: the coefficient of determination is the portion of the total variation in the dependent variable that is explained by variation in the independent variable.
• The coefficient of determination is also called R-squared and is denoted R²; R² = SSR / SST, so 0 ≤ R² ≤ 1.

End of Chapter 5
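• To close, a short Python sketch pulling two of the chapter's pieces together: a log10 fit of TEQ against year, as in the dioxin example, and the variance decomposition SST = SSR + SSE with R² = SSR/SST. The year and TEQ values below are illustrative placeholders, not the monitoring-study data.

import numpy as np

# Illustrative year/TEQ values (placeholders only)
year = np.array([1990.0, 1991.0, 1992.0, 1993.0, 1994.0, 1995.0, 1996.0, 1997.0])
teq = np.array([100.0, 92.0, 83.0, 77.0, 69.5, 64.0, 58.0, 53.5])

# Fit the straight line on the log10 scale, as in the dioxin example
logy = np.log10(teq)
b1, b0 = np.polyfit(year, logy, 1)
print(f"log10(TEQ-hat) = {b0:.3f} + {b1:.4f} * year")
print(f"estimated yearly ratio: {10**b1:.3f} (about a {100 * (1 - 10**b1):.1f}% decline per year)")

# Variance decomposition on the scale of the fit
yhat = b0 + b1 * year
sst = np.sum((logy - logy.mean()) ** 2)
ssr = np.sum((yhat - logy.mean()) ** 2)
sse = np.sum((logy - yhat) ** 2)
print(f"SST = {sst:.4f}, SSR = {ssr:.4f}, SSE = {sse:.4f}, R^2 = {ssr / sst:.3f}")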