An example of statistical data analysis using the R environment for statistical computing

D G Rossiter

Version 1.2; May 5, 2010

Copyright © D G Rossiter 2008-2010. All rights reserved. Reproduction and dissemination of the work as a whole (not parts) freely permitted if this original copyright notice is included. Sale or placement on a web site where payment must be made to access this document is strictly prohibited. To adapt or translate please contact the author (http://www.itc.nl/personal/rossiter).

Contents

1 Introduction
2 Example Data Set
  2.1 Loading the dataset
  2.2 A normalized database structure*
3 Research questions
4 Univariate Analysis
  4.1 Univariate Exploratory Data Analysis
  4.2 Point estimation; inference of the mean
  4.3 Answers
5 Bivariate correlation and regression
  5.1 Conceptual issues in correlation and regression
  5.2 Bivariate Exploratory Data Analysis
  5.3 Bivariate Correlation Analysis
  5.4 Fitting a regression line
  5.5 Bivariate Regression Analysis
  5.6 Bivariate Regression Analysis from scratch*
  5.7 Regression diagnostics
    5.7.1 Fit to observed data
    5.7.2 Large residuals
    5.7.3 Distribution of residuals
    5.7.4 Leverage*
  5.8 Prediction
  5.9 Robust regression*
  5.10 Structural Analysis*
  5.11 Structural Analysis by Principal Components*
  5.13 Non-parametric correlation
  5.14 Answers
6 One-way Analysis of Variance (ANOVA)
  6.1 Exploratory Data Analysis
  6.2 One-way ANOVA
  6.3 ANOVA as a linear model*
  6.4 Means separation*
  6.5 One-way ANOVA from scratch
  6.6 Answers
7 Multivariate correlation and regression
  7.1 Multiple Correlation Analysis
    7.1.1 Pairwise simple correlations
    7.1.2 Pairwise partial correlations
  7.2 Multiple Regression Analysis
  7.3 Comparing regression models
    7.3.1 Comparing regression models with the adjusted R²
    7.3.2 Comparing regression models with the AIC
    7.3.3 Comparing regression models with ANOVA
  7.4 Stepwise multiple regression*
  7.5 Combining discrete and continuous predictors
  7.6 Diagnosing multi-collinearity
  7.7 Visualising parallel regression*
  7.8 Interactions*
  7.9 Analysis of covariance*
  7.10 Design matrices for combined models*
  7.11 Answers
8 Factor analysis
  8.1 Principal components analysis
    8.1.1 The synthetic variables*
    8.1.2 Residuals*
    8.1.3 Biplots*
    8.1.4 Screeplots*
  8.2 Factor analysis*
  8.3 Answers
9 Geostatistics
  9.1 Postplots
  9.2 Trend surfaces
  9.3 Higher-order trend surfaces
  9.4 Local spatial dependence and Ordinary Kriging
    9.4.1 Spatially-explicit objects
    9.4.2 Analysis of local spatial structure
    9.4.3 Interpolation by Ordinary Kriging
  9.5 Answers
10 Going further
References
Index of R concepts
A Derivation of the hat matrix

1 Introduction

This tutorial presents a data analysis sequence which may be applied to environmental datasets, using a small but typical data set of multivariate point observations. It is aimed at students in geo-information application fields who have some experience with basic statistics, but not necessarily with statistical computing. Five aspects are emphasised:

1. Placing statistical analysis in the framework of research questions;
2. Moving from simple to complex methods: first exploration, then selection of promising modelling approaches;
3. Visualising as well as computing;
4. Making correct inferences;
5.
Statistical computation and visualisation.

The analysis is carried out in the R environment for statistical computing and visualisation [15], which is an open-source dialect of the S statistical computing language. It is free, runs on most computing platforms, and contains contributions from top computational statisticians. If you are unfamiliar with R, see the monograph "Introduction to the R Project for Statistical Computing for use at ITC" [29], the R Project's introduction to R [27], or one of the many tutorials available via the R web page (http://www.r-project.org/). On-line help is available for all R methods using the ? syntax at the command prompt; for example, ?lm opens a window with help for the lm (fit linear models) method.

Note: These notes use R rather than one of the many commercial statistics programs because R is a complete statistical computing environment, based on a modern computing language (accessible to the user), and with packages contributed by leading computational statisticians. R allows unlimited flexibility and sophistication. "Press the button and fill in the box" is certainly faster, but as with Windows word processors, "what you see is all you get". With R it may be a bit harder at first to do simple things, but you are not limited. R is completely free, can be freely distributed, runs on all desktop computing platforms, is regularly updated, is well-documented both by the developers and users, is the subject of several good statistical computing texts, and has an active user group.

An introductory textbook with similar intent to these notes, but with a wider set of examples, is by Dalgaard [7]. A more advanced text, with many interesting applications, is by Venables and Ripley [34]. Fox [12] deals extensively with regression using R, mostly with social sciences datasets.

The tutorial follows a data analysis problem typical of earth sciences, natural and water resources, and agriculture, proceeding from visualisation and exploration, through univariate point estimation, bivariate correlation and regression analysis, multivariate factor analysis, analysis of variance, and finally some geostatistics.

In each section there are some tasks, for which a possible solution is shown as some R code to be typed at the console (or cut-and-pasted from the PDF version of this document, or loaded from the accompanying .R code files). Then there are some questions to answer, based on the output of the task. Sample answers are found at the end of each section.

Some readers may want to skip the more advanced sections, or those that explain the mathematics behind the methods in more detail; these are marked with an asterisk '*' in the section title and in the table of contents.

These notes only scratch the surface of R's capabilities. In particular, the reader is encouraged to consult the on-line help as necessary to understand all the options of the methods used. Neither do these notes pretend to teach statistical inference; the reader should refer to a statistics reference as necessary. Some good choices, depending on your background and the application, are Brownlee [3], Bulmer [1], Dalgaard [7] (general); Davis [8] (geology); Wilks [38] (meteorology); Snedecor and Cochran [30], Steel et al. [33] (agriculture); Legendre and Legendre [16] (ecology); and Webster and Oliver [37] (soil science).
See also §10, "Going further", at the end of the tutorial.

2 Example Data Set

This data set, fully described in Yemefack [39] and summarized in Yemefack et al. [40], contains 147 soil profile observations from the research area of the Tropenbos Cameroon Programme (TCP), representative of the humid forest region of southwestern Cameroon and adjacent areas of Equatorial Guinea and Gabon. Three fixed soil layers (0-10 cm, 10-20 cm, and 30-50 cm) were sampled. The data set is from two sources. First, 45 representative soil profiles were described and sampled by genetic horizon; soil characteristics for each of the three fixed layers were then computed as weighted averages using genetic horizon thickness (a sketch of such a computation is given at the end of this introduction). Second, 102 plots from various land use/land cover types were sampled at the three fixed depths. Each of these samples was a bulked composite of five sub-samples taken with an auger on a plot diagonal basis. For both data sets, samples were located purposively and subjectively to represent soil and land use types. Laboratory analysis was by standard local methods [22].

For this exercise, we have selected three soil properties:

1. Clay content (code Clay), weight % of the mineral fine earth (< 2 mm);
2. Cation exchange capacity (code CEC), cmol+ (kg soil)^-1;
3. Organic carbon (code OC), volume % of the fine earth.

These three variables are related; in particular we know from theory and many detailed studies that the CEC of a soil depends on reactive sites, either on clay colloids or on organic complexes such as humus, where cations (such as K+ and Ca2+) can be easily adsorbed and desorbed [21, 31]. The CEC is important for soil management, since it controls how much added artificial or natural fertiliser or liming material will be retained by the soil for a long-lasting effect on crop growth. Heavy doses of fertiliser on soils with low CEC will be wasted, since the extra nutrients will leach.

In addition, for each observation the following site information was recorded:

* East and North coordinates, UTM Zone 32N, WGS84 datum, in meters (codes e and n);
* Elevation in meters above sea level (code elev);
* Agro-ecological zone, arbitrary code (code zone);
* Reference soil group, arbitrary code (code wrb1);
* Land cover type (code LC).

The soil group codes refer to Reference Groups of the World Reference Base for Soil Resources (WRB), the international soil classification system [11]. These are presented in the text file as integer codes which correspond to three of the 31 Reference Groups identified worldwide, and which differ substantially in their properties and response to management [10]:

1. Acrisols (from the Haplic, Ferralic, and Plinthic subgroups)
2. Cambisols (from the Ferralic subgroup)
3. Ferralsols (from the Acri-ferric and Xanthic subgroups)
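As an aside, here is a minimal sketch of the weighted-average computation mentioned above; the horizon depths and clay contents are invented for illustration, and the object names are hypothetical:

> # three hypothetical genetic horizons overlapping the fixed 0-10 cm layer
> hor.top <- c(0, 4, 18)     # horizon upper depths, cm
> hor.bot <- c(4, 18, 35)    # horizon lower depths, cm
> hor.clay <- c(42, 48, 55)  # clay %, one value per genetic horizon
> # thickness of each horizon that falls within the fixed 0-10 cm layer
> thick <- pmax(0, pmin(hor.bot, 10) - pmax(hor.top, 0))
> # thickness-weighted average clay content for the fixed layer
> weighted.mean(hor.clay, w = thick)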
2.1 Loading the dataset

Note: The code in these exercises was tested with Sweave [17, 18] on R version 2.10.1 (2009-12-14), sp package version 0.9-52, gstat package version 0.9-66, and lattice package version 0.17-26, running on Mac OS X 10.6.3. So, the text and graphical output you see here was automatically generated and incorporated into LaTeX by running actual code through R and its packages. Then the LaTeX document was compiled into the PDF version you are now reading. Your output may be slightly different on different versions and on different platforms.

The dataset was originally prepared in a spreadsheet and exported as a text "comma-separated values" (CSV) file named obs.csv. This is a typical spreadsheet product, with several inadequacies for processing in R, which we will fix up as we go along. This is a tedious but necessary step for almost every dataset, so the techniques shown here should be useful in your own projects.

Task 1: Start the R program and switch to the directory where the dataset is stored.

Task 2: Examine the contents of the CSV file.

You can do this with a plain-text editor (not a spreadsheet) such as (in Windows) Notepad or Wordpad or (on Mac OS) TextEdit. We can also examine a file from within R, with the file.show method:

> file.show("obs.csv")

"e","n","elev","zone","wrb1","LC","Clay1","Clay2","Clay5","CEC1","CEC2","CEC5","OC1","OC2","OC5"
"1",702638,326959,657,2,3,"FV",72,74,78,13.6,10.1,7.1,...
"2",701659,326772,628,2,3,"FV",71,75,80,12.6,8.2,7.4,...
...
"146",686554,339816,485,...
"147",688608,339578,435,...

Q1: What is the format of the first line? What does it represent? Jump to A1

Q2: What is the format of the following lines? What do they represent? Jump to A2

Task 3: Load the dataset into R using the read.csv method (a wrapper for the very general read.table method) and examine its structure. Identify each variable from the list above. Note its data type and (if applicable) numerical precision.

> obs <- read.csv("obs.csv")
> str(obs)

'data.frame':   147 obs. of  15 variables:
 $ e    : int  702638 701659 703488 703421 703358 ...
 $ n    : int  326959 326772 327133 ...
 $ elev : int  657 628 840 707 670 780 720 657 600 720 ...
 $ zone : int  2 2 1 1 2 1 1 2 2 1 ...
 $ wrb1 : int  3 3 3 3 3 3 3 3 3 3 ...
 $ LC   : Factor w/ 8 levels "BF",..: 3 3 4 4 4 4 3 3 4 4 ...
 $ Clay1: int  72 71 61 55 47 49 63 59 46 62 ...
 $ Clay2: int  74 75 59 62 56 53 66 66 56 63 ...
 $ Clay5: int  78 80 66 61 53 57 70 72 70 62 ...
 $ CEC1 : num  13.6 12.6 21.7 ...
 $ CEC2 : num  10.1 8.2 10.2 8.4 9.2 ...
 $ CEC5 : num  7.1 7.4 6.6 ...
 $ OC1  : num  5.5 5.2 6.98 ...
 $ OC2  : num  ...
 $ OC5  : num  ...

Each variable has a name, which the import method read.csv reads from the first line of the CSV file; by default the first field (here, the observation number) is used as the row name (which can be accessed with the row.names method) and is not listed as a variable. The suffixes 1, 2, and 5 on the variable name roots Clay, CEC, and OC refer to the lower boundary of the three depths, in dm; e.g. OC5 is the organic C content of the 30-50 cm (3-5 dm) layer.

Each variable also has a data type. The import method attempts to infer the data type from the format of the data. In this case it correctly found that LC is a factor, i.e. has a fixed set of codes. But it identified zone and wrb1 as integers, when in fact these are coded factors. That is, the 'numbers' 1, 2, ... are just codes. R should be informed of their correct data type, which is important in linear models (§5.5) and analysis of variance (§6). In the case of the soils, we can also change the uninformative integers to more meaningful abbreviations, namely the first letter of the Reference Group name:

> obs$zone <- as.factor(obs$zone)
> obs$wrb1 <- factor(obs$wrb1, labels = c("a", "c", "f"))

Q3: What are the names, data types and numerical precision of the clay contents at the three depths? Jump to A3

Q4: What are the names, data types and numerical precision of the cation exchange capacities at the three depths? Jump to A4
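The conversion can also be done at import time, with the colClasses argument of read.csv. A minimal sketch, assuming the 15-field layout shown above (note that colClasses must also cover the unnamed row-name column, and that wrb1 would still keep its "1".."3" level labels, so the relabelling above is still needed):

> # a sketch: declare each column's class explicitly at import
> obs2 <- read.csv("obs.csv",
+     colClasses = c("character",           # the unnamed row-name column
+         "integer", "integer", "integer",  # e, n, elev
+         "factor", "factor", "factor",     # zone, wrb1, LC as factors
+         rep("integer", 3),                # the three clay contents
+         rep("numeric", 6)))               # the three CECs and three OCs
> str(obs2$zone)   # already a factor, no as.factor() needed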
You can save this as an R data object, so that it can be read directly by R (not imported) with the load method; this will preserve the corrected data types:

> save(obs, file = "obs.RData")

You can recover this dataset in another R session with the command:

> load(file = "obs.RData")

2.2 A normalized database structure*

If you are familiar with relational database theory, the structure of our dataset may have bothered you, because it mixes the sample depth with the variable. For example, there are three fields for clay content (Clay1, Clay2, and Clay5), and similarly for organic C and CEC. How could we plot, for example, clay against CEC for all the horizons together? There are several shortcuts, but the most general solution is to change the database structure into a normalized set of relational tables:

1. The observation points, with a primary key that uniquely identifies the observation, with attributes that apply to the whole observation, namely:
   (a) the coordinates e and n
   (b) the elevation elev
   (c) the agro-ecological zone zone
   (d) the soil group wrb1
   (e) the land cover class LC
2. The layers, with a primary key made up of the primary key from the first table and the layer identification (1, 2, or 5), with attributes that apply to the horizon, namely:
   (a) Clay
   (b) CEC
   (c) OC

Note that the first field of this primary key is also the foreign key into the first table. For convenience we will also keep the original database structure for many of the analyses in this note.

There are several ways to do this in R; we will use the very flexible reshape method. However, we first need to assign an observation ID to each record in the original table, to use as the primary key. Here we can just use the row number:

> plot.id <- 1:dim(obs)[1]

Now we make the first table from those attributes of the observations that do not vary with layer:

> t.obs <- cbind(plot.id, obs[, 1:6])
> str(t.obs)

'data.frame':   147 obs. of  7 variables:
 $ plot.id: int  1 2 3 4 5 6 7 8 9 10
 $ e      : int  702638 701659 703488 703421 703358 ...
 $ n      : int  326959 326772 327133 ...
 $ elev   : int  657 628 840 707 670 780 720 657 600 720
 $ zone   : Factor w/ 4 levels "1","2",..: 2 2 1 1 2 1 1 2 2 1
 $ wrb1   : Factor w/ 3 levels "a","c",..: 3 3 3 3 3 3 3 3 3 3
 $ LC     : Factor w/ 8 levels "BF",..: 3 3 4 4 4 4 3 3 4 4

Now we reshape the remainder of the fields into "long" format, beginning with the plot ID (which is repeated three times, once for each layer), and dropping the fields that do not apply to the layers:

> t.layers <- cbind(plot.id = rep(plot.id, 3),
+     reshape(obs, direction = "long",
+         drop = c("e", "n", "elev", "zone", "wrb1", "LC"),
+         varying = list(c("Clay1", "Clay2", "Clay5"),
+             c("CEC1", "CEC2", "CEC5"),
+             c("OC1", "OC2", "OC5")),
+         times = c("1", "2", "5")))
> names(t.layers)[2:5] <- c("layer", "Clay", "CEC", "OC")
> t.layers$layer <- as.factor(t.layers$layer)
> str(t.layers)

'data.frame':   441 obs. of  6 variables:
 $ plot.id: int  1 2 3 4 5 6 7 8 9 10 ...
 $ layer  : Factor w/ 3 levels "1","2","5": 1 1 1 1 1 1 1 1 1 1 ...
 $ Clay   : int  72 71 61 55 47 49 63 59 46 62 ...
 $ CEC    : num  13.6 12.6 21.7 ...
 $ OC     : num  5.5 5.2 6.98 ...
 $ id     : int  1 2 3 4 5 6 7 8 9 10 ...

The reshape method automatically created a new field id to uniquely identify the sample in the "long" format; there are 441 of these. It also created a field time to identify the vector from which each sample originated; this name is due to reshape's primary use with time series data. We renamed this field layer. We now have a relational database structure, from which we can build temporary dataframes for a variety of queries.

Finally, we remove the temporary variable, and save the normalized tables to a file as an R object:

> rm(plot.id)
> save(t.obs, t.layers, file = "...")
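As a taste of what the normalized structure buys us, here is a small sketch answering the motivating question above: join the two tables on their primary/foreign key and plot clay against CEC for all three layers together (the object names are those just built):

> # join site and layer tables on the plot.id key
> obs.long <- merge(t.obs, t.layers, by = "plot.id")
> # one scatterplot of CEC vs. Clay for all 441 layer samples,
> # distinguishing the three layers by plotting colour
> plot(CEC ~ Clay, data = obs.long, pch = 20, col = as.numeric(layer))
> legend("topleft", legend = levels(obs.long$layer),
+     pch = 20, col = 1:nlevels(obs.long$layer))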
Answers

A1: The first line is a list of quoted field (variable) names, separated by commas. For example, the first field is named "e". There are 15 field names. Return to Q1

A2: The other lines are the observations (1...147); each observation is a list of values, one per field. There are 16 fields; the first is the observation ID, which has no name on the first line. Return to Q2

A3: The clay contents are Clay1, Clay2, and Clay5; these are integers (type int); their precision is 1%, i.e. they are specified to the nearest integer percent. Return to Q3

A4: The cation exchange capacities are CEC1, CEC2, and CEC5; these are floating-point numbers (type num); their precision is 0.1 cmol+ (kg soil)^-1. Return to Q4

3 Research questions

A statistical analysis may be descriptive, simply visualizing and summarizing a data set, but usually it is also inferential; that is, statistical procedures are used as evidence to answer research questions. The most important of these are generally formulated by the researcher before data collection; indeed, the sampling plan (number, location, strata, physical size) and data items should be motivated by the research questions. Of course, during field work or analysis other questions may suggest themselves from the data.

The data set for this case study was intended to answer at least the following questions:

1. What are the values of soil properties important for agricultural production and soil ecology in the study area? In particular, the organic matter content (OM), the proportion of clay vs. sand and silt (Clay), and the cation exchange capacity (CEC) in the upper 50 cm of the soil. (Note that the original data set included many more soil properties.)
   * OM promotes good soil structure, easy tillage, rapid infiltration and reduced runoff (hence less soil loss by surface water erosion); it also adsorbs nutrient cations and is a direct source of Nitrogen;
   * The proportion of clay has a major influence on soil structure, hardness, and infiltration vs. runoff; almost all the nutrient cations not adsorbed on the OM are exchanged via the clay;
   * CEC is a direct measure of how well the soil can adsorb added cations from burned ash, natural animal and green manures, and artificial fertilizers.
2. What is the inter-relation (association, correlation) between these three variables? How much total information do they provide?
3. How well can CEC be predicted by OM, Clay, or both?
4. What is the depth profile of these variables? Are they constant over the first 50 cm depth; if not, how do they vary with depth?
5. Four agro-ecological zones and three major soil groups have been identified by previous mapping. Do the soil properties differ among these? If so, how much? Can the zones or soil groups be grouped, or are they all different?
6. Each observation is located geographically. Is there a trend in any of the properties across the region? If so, how much variation does it explain, in which direction is it, and how rapidly does the property vary with distance?
7. Before or after taking any trend into account, is there any local spatial dependence in any of the variables?

These statistical questions can then be used with knowledge of processes and causes to answer another set of research questions, more closely related to practical concerns or scientific knowledge:

8. Is it necessary to do the (expensive) laboratory procedure for CEC, or can it be predicted satisfactorily from the cheaper determinations for Clay and OM (or just one of these)?
9. Is it necessary to sample at depth, or can the values at depth be calculated from the values in the surface layer? If so, the cost of soil sampling could be greatly reduced.
10. Are the agro-ecological zones and/or soil maps a useful basis for predicting soil behaviour, and therefore a useful stratification for recommendations?
11. What soil-forming factor explains any regional trend?
12. What soil-forming factor explains any local spatial dependence?

Finally, the statistical questions can be used to predict:

13. How well can CEC be predicted by OM, Clay, or both?
14. What are the expected values of the soil properties, and the uncertainties of these predictions, at unvisited locations in the study area?

The last question can be answered by a predictive map.

4 Univariate Analysis

Here we consider each variable separately.

4.1 Univariate Exploratory Data Analysis

Task 4: Summarise the clay contents at the three depths.

To save typing, we first attach the obs data frame; this makes the field names in the data frame visible in the outer R environment, e.g. when we type Clay1, this field of the attached frame is accessed; otherwise we would have had to type obs$Clay1:

> attach(obs)
> summary(Clay1)

   Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
   10.0     ...     ...    31.3     ...    72.0 

> summary(Clay2)

   Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
    ...    27.0    36.0    36.7    47.0    78.0 

> summary(Clay5)

   Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
   16.0    36.5    46.0    46.7    54.0    80.0 

Q5: What does the summary say about the trend of clay content with depth? Jump to A5

Q6: What evidence does the summary give that the distribution is somewhat symmetric? Jump to A6

Task 5: Visualise the distribution of the topsoil clay content with a stem-and-leaf plot and a histogram.

> stem(Clay1)

(stem-and-leaf display: stems from 1 to 7, i.e. values from about 10 to 72; the leaves pile up in the 10s through 30s and thin out towards the high values, a right-skewed shape)

> hist(Clay1)

(Figure: default histogram of Clay1)

Q7: Does the distribution look symmetric? skewed? peaked? Jump to A7

It's easy to produce a much nicer and more informative histogram. All R graphics, including histograms, can be enhanced. Here we change the break points with the breaks argument, the colour of the bars with the col argument, and supply a title with the main argument; we then add a rug plot (with, what else?, the rug method) along the x-axis to show the actual observations:

> hist(Clay1, breaks = seq(0, 96, by = 8), col = "darkgray",
+     main = "Clay proportion in surface soil, weight %")
> rug(Clay1)

(Figure: histogram with rug plot of the observations)

Note the use of the seq ("sequence") method to make a list of break points. The main= argument is used to specify the main title; there is also a sub= argument for a subtitle.

Note: To see the list of named colours, use the colors command with no argument: colors(). There are many other ways to specify colours; see Rossiter [29, §5.5] and ?colors.

Finally, we display a histogram with the actual counts. We first compute the histogram but don't plot it (plot=F argument), then draw it with the plot command, specifying a colour ramp, which uses the computed counts, and a title. Then the text command adds text to the plot at (x, y) positions computed from the class mid-points and counts; the pos=3 argument puts the text on top of the bar.
> h <- hist(Clay1, breaks = seq(0, 96, by = 8), plot = F)
> str(h)

List of 7
 $ breaks     : num [1:13] 0 8 16 24 32 40 48 56 64 72 ...
 $ counts     : int [1:12] 0 22 31 31 28 17 10 5 3 0 ...
 $ intensities: num [1:12] 0 0.0187 0.0264 0.0264 0.0238 ...
 $ density    : num [1:12] 0 0.0187 0.0264 0.0264 0.0238 ...
 $ mids       : num [1:12] 4 12 20 28 36 44 52 60 68 76 ...
 $ xname      : chr "Clay1"
 $ equidist   : logi TRUE
 - attr(*, "class")= chr "histogram"

> plot(h, col = heat.colors(length(h$mids))[length(h$counts) - rank(h$counts) + 1],
+     ylim = c(0, max(h$counts) + 5),
+     main = "Clay proportion in surface soil, weight %",
+     sub = "Counts shown above bar, actual values shown with rug plot")
> rug(Clay1)
> text(h$mids, h$counts, h$counts, pos = 3)
> rm(h)

(Figure: histogram with counts printed above the bars)

We can see that there are a few unusually high values. The records for these should be examined, to see if there is anything unusual about them.

Task 6: Display the entire record for observations with clay content of the topsoil over 65%.

There are (at least) two ways to do this. First we show the easy way, with a condition to select rows:

> obs[Clay1 > 65, ]

         e      n elev zone wrb1 LC Clay1 Clay2 Clay5 CEC1 CEC2 CEC5 ...
1   702638 326959  657    2    f FV    72    74    78 13.6 10.1  7.1 ...
2   701659 326772  628    2    f FV    71    75    80 12.6  8.2  7.4 ...
106 696707 327780  623    2    f FV    67    70    75 22.0 19.0 11.0 ...

We can get the same effect by identifying the rows and then using these as row indices:

> (ix <- which(Clay1 > 65))

[1]   1   2 106

> obs[ix, ]

(same output as above)

Q8: Which are the unusual observations? Is there any evidence of errors in data entry? Why or why not? Jump to A8

Other exploratory graphics. There are several other ways to view the distribution of a single variable in a histogram:

1. We can specify the number of histogram bins and their limits.
2. We can view the histogram as a probability density rather than a frequency (actual number of cases); this makes it easier to compare histograms with different numbers of observations.
3. We can compare the actual distribution of a variable to a theoretical distribution with a quantile-quantile plot.
4. We can fit empirical kernel density estimator curves, which give a more-or-less smoothed continuous approximation to the histogram.

Task 7: Show the distribution as a boxplot. Plot the histogram with bins of 5% clay, with kernel density estimators. Make a quantile-quantile plot for both the normal and lognormal distributions.

> par(mfrow = c(2, 2))
> boxplot(Clay1, notch = T, horizontal = T, main = "Boxplot of Clay 0-10cm")
> hist(Clay1, freq = F, breaks = seq(0, 100, 5),
+     main = "Probability density for Clay 0-10cm")
> lines(density(Clay1), lwd = 2)
> lines(density(Clay1, adj = 0.5), lwd = 1)
> lines(density(Clay1, adj = 2), lwd = 1.5)
> qqnorm(Clay1, main = "QQ plot for Clay 0-10cm vs Normal distribution",
+     ylab = "Clay %, 0-10cm")
> qqline(Clay1, col = 4)
> qqnorm(log(Clay1), main = "QQ plot for Clay 0-10cm vs lognormal distribution",
+     ylab = "log(Clay %), 0-10cm")
> qqline(log(Clay1), col = 4)
> par(mfrow = c(1, 1))

(Figure: four panels: boxplot, density histogram with three kernel smooths, and the two Q-Q plots)

The boxplot (upper-left) matches the histogram: the distribution is right-skewed. The three largest observations are shown as boxplot outliers, i.e. they are more than 1.5 times the inter-quartile range (width of the box) beyond the 3rd quartile.
This is just a technical measure: they are boxplot outliers, but this does not necessarily mean that they are part of a different population. In particular, a few boxplot outliers are expected in the direction of skew.

Q9: Does the distribution look normal or lognormal? What does this imply for the underlying natural process? Jump to A9

Exercise 1: Repeat the analysis with the clay content of the 10-20 cm or 30-50 cm layers; comment on any differences with the distribution in the 0-10 cm layer.

4.2 Point estimation; inference of the mean

When computing summary statistics (§4.1), we calculated a sample mean; this is simply a descriptive statistic for that sample. If we go one step further, we can ask what is the best estimate of the population mean, given this sample from that population; this is an example of point estimation. We may also make inferences about the true (but unknown) mean of the population: is it equal to a known value, or perhaps higher or lower?

For small samples, inference must be based on the t-distribution. The null hypothesis can be a value known from theory, literature, or previous studies.

Task 8: Compute the best estimate of the population mean of topsoil clay content from this sample, its 99% confidence interval, and the probability that it is not equal to 30% clay.

> t.test(Clay1, mu = 30, conf.level = 0.99)

	One Sample t-test

data:  Clay1
t = 1.1067, df = 146, p-value = 0.2702
alternative hypothesis: true mean is not equal to 30
99 percent confidence interval:
 28.272 34.272
sample estimates:
mean of x 
   31.272 

Q10: What is the estimated population mean and its 99% confidence interval? Express this in plain language. What is the probability that we would commit a Type I error if we reject the null hypothesis that the population mean is 30% clay? Jump to A10

Sometimes we are interested in the mean in relation to a set threshold value; this usually comes from external considerations such as regulations or an existing classification system.

Q11: What is the probability that the true population mean is less than 35% clay? (Hint: use the alternative="less" argument to the t.test method.) Jump to A11
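A minimal sketch of how the hint can be applied (the interpretation of the output is left for the answer at the end of the section):

> # one-sided test: H0 is mean >= 35, H1 is mean < 35
> t.test(Clay1, mu = 35, alternative = "less")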
4.3 Answers

A5: Clay content increases with depth, as evidenced by the mean, the quartiles including the median, and the maximum. Return to Q5

A6: The mean and median are almost equal. Return to Q6

A7: Both the stem-and-leaf plot and the histogram show that, compared to a normal distribution, this is skewed towards positive values and has a lower peak. Return to Q7

A8: Observations 1, 2, and 106 in the list of 147 observations have surface clay contents over 65%. These seem consistent with the clay contents of the other layers, so there is no evidence of a data entry error. Return to Q8

A9: It is not normal, especially at the lower tail of the normal distribution, where values are too high. This implies that clay content of the topsoil does not simply reflect an addition of many small errors. It is not lognormal, especially at the upper tail of the lognormal distribution, where values are too low. This implies that clay content of the topsoil does not simply reflect a multiplication of many small errors. So, there should be some underlying process, i.e. an identifiable cause of the variation across the sample set. Return to Q9

A10: The best estimate of the mean is 31.3% clay. With only 1% chance of being wrong, we assert that the true mean is between about 28.3 and 34.3% clay. We can not reject the null hypothesis; if we do, there is about a 0.27 probability (more than 1 in 4) that we are wrong. Return to Q10

A11: p = 0.00073, i.e. it is almost certain that the mean is below this threshold. Return to Q11

5 Bivariate correlation and regression

Now we consider the relation between two variables. This is the cause of much confusion and incorrect analysis.

5.1 Conceptual issues in correlation and regression

Correlation and various kinds of regression are often misused. There are several good journal articles that explain the situation, with examples from earth science applications [20, 35]. A particularly understandable introduction to the proper use of regression is by Webster [36], whose notation we will use.

Bivariate correlation and regression both compare two variables that refer to the same observations, that is, they are paired. This is the natural order in a data frame: each row represents one observation on which several variables were measured; in the present case, the coordinates, clay contents, organic matter contents, and CEC at three depths. So we can use the sample to ask about the relation between these variables in the whole population.

First we discuss the key issues; then we resume the analysis in §5.2.

Description vs. prediction, relation vs. causation. Regression analysis can be used for two main purposes:

1. To describe a relation between two or more variables;
2. To predict the value of a variable (the predictand, sometimes called the dependent variable or response), based on one or more other variables (the predictors, sometimes called independent variables).
Webster [36] calls the first type a “Gauss linear model”, because only the predictand has error, whereas the predictor is a ‘mathematical variable, as opposed to a random variable which is measured with error. The regression goes in one direction only, from the mathematical predictor to the random response, and is modelled by a linear model with error: Yi BX +e of which the simplest case is a line with intercept: Yi = Bot Bix + & Note that there is no error associated with the predictors x;, only with the predictand yj. Thus the predictors are assumed to be known without error, or at least the error is quite small in comparison to the error in the model. An exaraple of this type is a designed agricultural experiment where the quantity of fertiliser 7 added (the predictor) is specified by the design and the crop yield is measured (the predictand); there is random error ¢; in this response. ‘An example of the second type is where the crop vield is the predictand, but the predictor is the measured nutrient content of the soil, Here we are modelling the relation as a bivariate normal distribution of two random variables, X and ¥ with (unknown) population means jx and ty, (unknown) population variances ox and Gy, and an (unknown) correlation pxy which is computed as the standardised (unknown) covariance Cov(X.Y) X ~ Nibx.ox) Y ~ Nuy,oy) pxy = CoviX,Y)/oxoy In practice, the distinction between the two models is not always clear. The predictor, even if specified by the experimenter, can also have some measurement error. In the fertiliser experiment, even though we specify the amount per plot, there is error in measuring, transporting, and spreading it. In that sense it can bbe considered a random variable. But, since wo have some control over it, the experimental error can be limited by careful procedures. We can not limit the error in the response by the same techniques. 5.2 Bivariarte Exploratory Data Analysis ‘The first question in the analysis is the relation between clay content in the three layers. We could have several specific questions: 1. Are the clay contents between layers positively, negatively, or not related? Exg. if the topsoil is high in clay, does that imply that the subsoil is high also? low? or that we can’t tell 2. How can we explain this relation? Le., what does it imply for soil formation in this area? 3. How well can we predict the subsoil clay from the topsoil? If we can do this, it would save fieldwork (having to auger half a meter from the surface) and laboratory work (having to analyse another sample). 4, What is the predictive equation? Note that the second question, requiring a conceptual model and support from other information, is much harder to answer than the first, requiring only a ‘mathematical manipulation. ‘Task 9: View the relation between layers. : Here are two ways to show the same scatterplot; in the second we specify the plotting limits of each axis, We also show the inverted plots. In all graphs we show the 1:1 line and the means of both variables. We plot these all in the same figure, using the mfrow argument to the par (“graphics parameters”) method on the open graphies device 1s (Note: to see all the possible printing characters (the pe run the command example plot.) par(atrov = £(2, 2)) plet(Clay!, ciay2) abline(o, 1, Ity = 2) title(*derault axes, subsoil vs. copscil ablinece ablane(e plee(Clay?, Clay2, x1im = ¢(0, 100), ylin = (0, 100), eh = 20) abtine(o, 1, 1ty = 2) titte(*axes 0,100, subsoil va. 
tepsei1") abline(h = nean(clay abline(y = nean(clayt)) plot (Clay2, Clayt) ablane(o, 1, 1ey = 2) title("default axes, topsoil ve. subsoil") abline(h = mean(Clayt)) Setine(e = mean(Clay2)? plot (Chaya, Clayt, xlin ~ 6(0, 190), ylia = <0, 109), abiine(o, 1, Ity = 2) title(Maxes 0.100, cepseil vs abline(h = nean(clayt)) sblane(e = nean(Clay2)) par(etrow = c(t, 1)) argument to plot), QI2: Describe the relation in words. Jump to Al2 « 19 Task 10: Optional but interesting: View the relation between layers, showing, whether it is the same for each of the three soil classes (code wrb1) . We can show the soil class either by symbol or by colour (or both); here we compare the three methods on one plot: par(atrou = £(2, 2)) plot (Clayt, Clay2, xlim = e(0, 80), ylim = e(0, 80) eh = a8.nuseric(wrbt)) abtine(o, 1, Tey = 2) abline(h = nean(clay2)) Setine(e = nean(Clayt)? legend (60, 20, legend = Iavels(erbi), peh = t:nlevelsCerb!), plot (Clayt, Clay2, xLim = e(0, 80), yi poh = 20, col’ = funlevels(wrbt)) egend(50, 20, Legend = Levels (arb!) bey = a) abtine(o, 1, Iey = 2) abline(h = nean(Clay2)) Sbline(e = wean(Clay1)) PlotiClayt, Clay2, xim = €(0, $0), ylim = (0, 60), sbuine(o, Tey abline(o = mean(Clayt)) Legend(60, 20, Levels(erb1), ped col = Innlevela(erbs), bey par(atrow = c(t, 1)) 20 Note the use of the Levels method to extract the soil codes for use in the legend, and the use of the as.aumeric method to convert use with the col= and peh» graphics parameters. ¢ soil code to an integer for QIS: Is there any difference in the relation between soil classes? Jump to A13, 5.3 Bivariarte Correlation Analysis If the two variables to be correlated are numeric and relatively symmetric, we use the standard Pearson's product-moment correlation, ‘The sample covariance is computed as ony 1+ ») Cov(X,¥) = war 2a dvi FM) and then the sample Pearson's correlation coeffictent is computed as: rxy = CoviX,Y)isx + sy ‘Task 11: Compute the Pearson's correlation between the clay contents of the topsoil and subsoil. Test whether this correlation is significant. . First we compute the correlation from the sample covariance (computed with the cov method) and standard deviations (computed with the sd method), to show how the definition works, then we use the cox te confidence interval. method to compute a > sun (cotayd = mean(Ciay2)) » (Chayt = menn(Clayl)))/Clongth(ciay2) [11 190.76 > cov(Clayt, Clay2) [1 190.76 > ad¢ctayt) 2) 13.936 (1) 14,626 > cov(Clay!, Clay2)/(ea(Clayt) + 24(ctay2) [1 0.9958 > cor(Clayt, Clay2) Li) 0.9388 a > cor-tese(Clayt, 292) Penrvoats prmiuct-aoamat correlation daca: Clayl and Chay? = 31.964, af = 145, prvalue < 2.20718 srnative hypothesie: true correlation is not equal te 0 96 percent confidence anterval 0.91208 0.95227 sample estinates: Q14: According to this test, what is the probability that the two clay contents, in the population from which this sample was taken, are in fact not correlated? Jump to AI « Q15: What is the best estimate for the population correlation coefficient? With only 5% probability of being wrong, what are the lowest and highest values this coefficient could in fact have? Jump to AlS « 5.4 Fitting a regression line When we decide to consider one of the variables as as a response and the other as a predictor, we attempt to fit a line that best describes this relation, There three types of lines we can fit, usually in this order: 1. Explorator ; non-parametric, 2. Parametric 3. Robust relation. 
The second fits ‘The first kind just gives a “smooth” impr according to some optimality criterion; the classic least-squares estimate is in this class. ‘The third is also parametzie but optimises some criterion that protects against a few unusual data values in favour of the majority of the data. A common non-parametric fit is the LOWESS (“locally weighted regression and smoothing scatterplots”) [34], computed by R method levess. This has a user- adjustable parameter, the smoother’s “span”, which is the proportion of points in the plot which influence the smooth at each value; larger values result in a smoother plot, ‘This allows us to visualise the relation either up close (low value of parameter) or more generally (high). The default is 2/3. ‘the default smooth line. ‘Task 12: Plot subsoil ys. surface soil clay witl the soil type of each point by its colour, For comparison, plot the least-squares fit with a thin dashed line (2.b. this is not explained until 5.5). . > plot(Clay2 ~ Clay3, pch = 20, col = as.numeric(erbt)) > legend(60, 20, legend ~ Ieveis(erbi), pen = 20, col = tonlevele(srb1), 2 + by = ey > Lanes (lovess(Clay1, Clay2), ud = 2, cel = *blue") > abline(in(Clay2 * Clay), ity = 2) Q16 : What is the difference botween the “best” line and the smooth fit? Does the smooth fit provide evidence that the different soil types might have different relations between subsoil and surface soil clay content? Jump to AI6 + Task 13; Plot subsoil vs. surface soil clay with the default smooth line and with 1/10, 1/2, and all the points contributing to the ft . > ploe(Clayt, clay2, ped = 20, cel ~ as.nuserte(urbt)) > Legend (60, 20, legend = levels (erbt), peh = 20, cel = as.nuneric(urbt) + bey = Ma) > for (fin (0.1, 0.5, 2/3, 1)) £ Lines(levess(Clay1, Clay2, f= 1), Id 2 ablineCin(ciay2 ~ clayt), 1ty = 2 23 QI7: What happens as the smoothness parameter changes? Which value gives the best visualisation in this case? Jump to Al7 + 5.5 Bivariarte Regression Analysis Both subsoil and topsoil clay are measured with the same error, so the bivariate normal model is appropriate. That means we can compute the regression in both directions. Subsoil as predicted by topsoil We may want to predict subsoil clay from topsoil clay. If we can establish this relation, we wouldn’t have to sample the subsoil, just the topsoil; this is easier and also saves laboratory analysis, Task 14; Compute the least-squares regression of subsoil clay on sur clay . ‘The Im method by default computes the least-squares solution: > inat < In(clay2 ~ Clayt) > suumary(1n2t) An(formula = Clay2 ~ cleyt) Reesduals Yan 19 Median 39 Max. Estimate Std. Error & value Pr(>ltl) CEntercept) 6.0354 1.0514 5.74 5.3er08 ow 24 0.9821 0.0307 31.96 < 26-16 ve8 Sagniz. codes: 0 ‘ese! 0.001 ter! 0.01 124 0,08 SF OL Residual standard error: 6.17 on 145 degrees of freedon Multiple A-squared: 0,876, Adjusted Resquared: 0.875, Festatistie: 1026003 on 1 and 145 DF, pevalue: <2e-16 QI8: What is the best predictive equation for subsoil clay, given topsoil clay? Jump to AIS « QI: Express this in plain language: (1) How much subsoil clay is predicted for 4 soil with no topsoil clay? (2) How much does subsoil clay increase for a given increase in topsoil clay? Jump to A19 « 20; How much of the total variation in subsoil clay among the 147 samples is explained by topsoil clay? Jump to A20« ‘Visualising the regression Here we show what the regression line looks like, and visualise the sense it which it is the “best” possible line. 
Task 15 : Plot the least-squares regression line on the scatterplot of subsoil topsoil clay, showing the residuals (distance to best-fit line) . > plot(Clayt, clay2, ped = 20) > tatle(*Ordinary least-squares regression, subsoil vs. topsoil clay") > abline(in21) > segnenta(Clay!, Clay2, Clayl, #4teed(in21), 2ey = 2) Q21: What would happen to the total longth of the residual lines ifthe “best-fit” regression line were moved up or down (changed intercept) or if its slope were changed? Jump to A21 « Reversing the regression: topsoil as predicted by subsoil As explained above, mathematically we can compute the regression in either direction ‘Task 16 Compute the regression of topsoil on subsoil clay . ‘This is the inverse of the previous regression, > am12 < im(Chay? ~ clay2) An(formila = Clay clay2) Min 1G Median 3g “ATANS 2.8940 -0.0967 2.7950. 18.4450 cootticionts Estinate Sté, Error ¢ value Pr(>It!) (intercept) 1.4969 1.1028 1.36 0.18 cay? 0.8817 0.0278 34.96 ax © var(Clay)) > s2x La) 194.21 > aay < var(clay2) > aay (a) 213.92 > any < var(Clayt, > say ay?) 1] 190.76 Note that the variances are of similar magnitude. We can compute the variances and co-variances directly from their definitions, just to check that R is doing the right thing. This is a nice illustration of R’s implicit vector ari > sua((Clay! ~ mean(Clayt))-2)/Clengen(Clayt) 1) a1 194.21 > sun((Chay2 ~ mean(Clay2))“2)/(lengea(Clay2) = 1) 1] 213.92 > sun((Clay2 ~ mean(Clay2)) + (Clayt ~ mean(Clayt)))/Clengen(Clays) ~ oD [a 190.76 Task 19 : Compute the slopes and intercepts (y on x and x on y) . For the regression of 37 on x, these are estimated by from the sample covariance and variances as: Byx = sxv/sp Bye = I~ Byxk For the inverse regression, i. x on 9, the estimates are: Bxy = swvis} xy = %- Bxyd Note that in b by the ceases the regression line passes through the centroid estimated means, ic. (%3). We compute these in R from ¢ the sample variances and covariances calculated > tyx < axy/azx > ty 1) 0.98212 > aye < mean(Ciay2) ~ byx * nean(Clayt? > ae (a) 8.0356 > tay < axy/a2y > tay 1] 0.89166 > any < mean(Clayi) ~ bay * mean(Clay2) > ay 28 La) -1.29¢9 ‘These are the same coefficients we got from the Im method, 5.7 Regression diagnostics 571 ‘The Im method will usually compute a fit, ic. give us the mathematical answer to the question “What is the best linear model to explain the observations?". The model's adjusted R? tells us how well it fits the observations overall; this is the highest-possible R? with the given predictor. However, the model may not be statistically adequate: ¢ The fit may not be equally-good over the whole range of observations, ie. the error may not be independent of the predictor; © The assumptions underlying least-squares regression may not be met, in particular, that the residuals are normally-distributed. So for any regression, we should examine some diagnostics of its success and validity. Here we look at (1) the fit to observed data; (2) untistally-large residuals; (8) distribution of residuals; (4) points with unusual leverage. Fit to observed data ‘The first diagnostic is how well the model fits the data; this is the suecess of the model conditional on the sample data; this does not yet say how well the model is expected to fit the entize population, ‘The fitted method applied to a linear model object returns values predicted by a model at the observation values of the predictor. This is applied to an abject, saved from the Im method. 
5.7 Regression diagnostics

The lm method will usually compute a fit, i.e. give us the mathematical answer to the question "What is the best linear model to explain the observations?". The model's adjusted R² tells us how well it fits the observations overall; this is the highest-possible R² with the given predictor. However, the model may not be statistically adequate:

* The fit may not be equally good over the whole range of observations, i.e. the error may not be independent of the predictor;
* The assumptions underlying least-squares regression may not be met, in particular, that the residuals are normally-distributed.

So for any regression, we should examine some diagnostics of its success and validity. Here we look at (1) the fit to observed data; (2) unusually large residuals; (3) the distribution of residuals; (4) points with unusual leverage.

5.7.1 Fit to observed data

The first diagnostic is how well the model fits the data; this is the success of the model conditional on the sample data; it does not yet say how well the model is expected to fit the entire population.

The fitted method applied to a linear model object returns the values predicted by the model at the observation values of the predictor; it is applied to an object saved from the lm method. Similarly, the resid method returns the residuals, defined as the actual values less the fitted values.

Task 20: Compute the predicted subsoil clay content for each observation, and compare it graphically with the observed subsoil clay content.

> plot(fitted(lm21), Clay2, pch = 20, xlab = "Fitted", ylab = "Observed",
+     xlim = c(5, 85), ylim = c(5, 85),
+     main = "Observed vs. Fitted Clay %, 10-20cm")
> abline(0, 1)
> segments(fitted(lm21), Clay2, fitted(lm21), fitted(lm21))

(Figure: observed vs. fitted subsoil clay, with the 1:1 line)

Note: The segments method draws segments from the (x, y) coordinates given by the first two arguments, to the (x, y) coordinates given by the second pair. In this example, all four are equal-length vectors (one for each observation), so the method acts on each pair in turn. The beginning point of each segment is at (fitted, observed), while the second is at (fitted, fitted), so that the line is the vertical residual.

Q27: What should be the relation between the predicted and observed? Do you see any discrepancies? Jump to A27
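One numeric companion to this plot: for a simple linear regression with intercept, the squared correlation between fitted and observed values reproduces the multiple R² reported by summary. A one-line sketch:

> # squared correlation of fitted vs. observed equals R^2 from summary(lm21)
> cor(fitted(lm21), Clay2)^2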
A simple way to see the distribution of residuals is with a stem plot or histogram, using the stem function: Task 22: Make a stem plot of the residuals, : > sten(residuals(im2t), scale = 2) Tao decimal point is at the | aris -15 | =5 | “4 | 1312 -2 | =u -10 | a “8 | ea <7 | 9881 os | 98732 “5 | 8887 “4 | seeezesani -3 | 9877778855 -2 | 988a87754a¢72110 “1 | ses7esaa2 -0 | assessaa2 © | 13s0ssssseseo 1 | s3tasssssse77s 2 | 03seaasss777777 3 | 2asa6e 4 | o2easacore 5148 6 | 456 71 458 aia o14 101 n1 2i7 13 | 337 ais a | a1 wis Note that without the scale-2 optional argument to stem, the wide spread of residuals causes the stem plot to have one bin for each two integers, which is hard 32 to interpret. Q80: Are the residuals symmetrically-distributed? Do they appear to have a normal distribution (“bell-shaped curve”)? Jump to A30« ‘The most useful graphical tool to examine the normality of the residuals is the normal quantile-quantile (“Q-Q") plot of the regression residuals; this shows how quantiles of the residuals match to what they would be if they were taken from 4 normal distribution with mean 0 (by definition of “sesidual”) and standard deviation calculated fr sample residuals ‘Task 23: Make a normal quantile-quantile (“Q-Q") plot of the regression resid uals. . > qaline(residuale(2n21)) Normal a-a Plt Q31: Do the residuals fit the normal distribution? Where are the discrepancies, if any? Jump to Al « ‘The hypothesis of normality can be tested with a variety of methods; these are not too useful because they often show that the null hypothesis of normality can be rejected, but the departures from normality may not be too severe ‘Task 24: Use the Shapiro-Wilk normality test of the hypothesis that the resid- uals are normnally-distributed, . ‘The R function is named shapire. test: > shapire. test (residual (n21)) ‘Shap: Wilk normality teat residuals(1=21) 0.9689, p-value = 0.002052 ‘This test computes a statistic (*W") and then compares it against a theoretical value for a normal distribution. The item in the output that can be interpreted is the p-value that rejecting the null hypothesis of normality is a Type I error. Q82: With what probability can we assert that these residuals are not normally distributed? Jump to A32 5.7.4 Leverage” Observations that are far from the centroid of effect on the estimated slope; they are said to have high leverage, by analogy with a physical lever. They are not necessarily in error, but they should be identified and verified; in particular, it is instructive to compare the esti line with and without the high-leverage observations. ‘the regression line can have a large fed regression ‘The leverage is measured by the hal value, which measures the overall influence of a single observation on the predictions. Appendix A explains how this is derived. Computing the leverage with R- We can find the hat values for any model with the hatvalues method. Values more than about three times h, which is the average leverage (k-+1)/n, where k is the number of coefficionts in the model, are quite influential. Task 25; Find the high-leverage observations for the regression of subsoil on topsoil clay. Compare these against the highest and lowest values of the predictor Plot the hat values against predictor value. Re-fit the model without the high- loverage observations and compare the two model coefficients. . ‘To compute a model with only some of the observations, use the optional subset. argument to the Im method. The subset can either be inclusive (e.g. 
seql1:20] to fit the model from the first twenty observations) or exclusive (e.g. ~¢(1,5) t0 use all the observations except the first and fifth). > par(ofrov = (1, 2)) > Aitev < which(h > 3 + mean(h)) 12 > clayl(ai.tev) La) 12 1 63 67 > (nore list (Clayl)) [(lengeh(Clayl) ~ §):Lengeh¢clayi)} ls 10 1108 2 4 > (gore. tier (Chay3)) (1:5) (a) 94 99 294 92 116 > plot(clayt, ») > abline(y = nean(Clayt), Ity = 2) > abline(h = 2eq(1:3) * mean(h), col ~ (1, 4, 20) > points(Clayi(e1 tev), hatvalues(in21) [as tev], pcb = 20, > text (Clay! (ha. lev), hatvalues(1n2t)(hilev), paste(eb + biter), adj 1.5) > plot(Clayt, clay2) spline nai) points(Clay!{hi.Tev), clay2{hi ev], peh text (Clay fbi. tev], Clay2[hi tev], paste(*ebe", hi-tev), aaj > 18) anzi.2 < Im(Cley2 ~ Clay!, subset = -bi-tev) ound (coefficients (1n21), Caarercept) clay 6.035 0,982 > round(coetfictenta(In21.2), 3) Cintercept) clay > par(atrow = e(1, 2) Q33; What is the relation between predictor value and its leverage? ASS Jump to Q34 + In this particular case, do the high-leverage predictor values appear to influence the regression line? Jump to A3d « ‘Task 26: Compare the fits of the two models, both as RMSE and R®, The RMSI can be computed directly from the model residuals; the R? as 1 — (RSS/TSS), where RSS is the residual sum of squares (after the model fit), and TSS is the total sum of squares of the predictand (here, Clay2), before the model fit. > age (eun(residuats(1n21)~2)/Tength(residuals(i=21))) (1) 5.1386, > sqrt (sun (residuals (1n21.2) 2)/lengeh (residuals (2e21.2))) in) 5.1973, > 1 ~ (sun(residuats(1m2t) 2)/sun¢(Clay2 ~ mean(Clay2) La) 0.87872 > 1 ~ (sun(residuats(1n2t.2)°2)/sun((Chay2I-hi.1ev} ~ mean(Clay2{-hi.tev])}"2)) Li) 0.85305 Q35 : Does removing the high-leverage points from the dataset improve or worsen the fit in this case? Jump to A35 « 5.8 Prediction As we saw above, the best predictive equation of subsoil clay, given topsoil clay was Clay2 = 6.04 + 0.98 - Clay, and the proportion of the variation in subsoil clay not explained by this equation was 1 - 0.8749 = 0.1251. But what does ‘that mean for a given prediction? ‘There are two sources of prediction error: 1. The uncertainty of fitting the best regression line from the available data; 2. The uncertainty in the prediction, even with a perfect regression line, be- cause of uncertainty in the process which is revealed by the regression (i. the inherent noise in the process) ‘These correspond to the confidence interval and the prediction inteveral, respec tively. Clearly, the second must be wider than the first, ‘The estimation variance depends on the variance of the regression s$. but also ‘on the distance of the predictand from the centroid of the regression, (%, 5). The further from the centroid, the more any error in estimating the slope of the line will affect the prediction: a = a ehe geP) Note that at the centroid, this reduces to s$[(m + 1)/m], which for large n is very close to 5.5. 36 ‘The variance of the regression is computed from the deviations of actual and estimated values: 12 Faz D O-w Shx ‘Task 27: Compute the subsoil clay predicted by the model if surface soil day is measured as 55%, along with the confidence interval for this prediction. Q36 : Low much subsoil clay is expected if surface soil clay is measured as 55%? What is the confidence interval, based on the standard error of the regression line? Give a verbal description of the confidence interval. 
Jump to A36 « We can calculate this dizectly from the regression equati > round(6.0954 + 0.9821 « 58, 0) 1) 60 ‘To compute the confidence interval, we could use the regression equations directly. But it is easier to use the predict method on the fitted model object, because this method can also compute the standard error of a fit, which can then be used to construct a confidence interval for that fit using the ¢ distribution: > pred < predict(Ina!, data.frane(Clay! = §5), s0.fi¢ = TD Last of 4 Naned nun 60 cranes") = chr 2 ‘un 0.865 145 sar $ residual scale: num 5.17 > round(predstie + gt(e(0.025, 0.975), predédt) + predéee. te a) Ui) 58.4 61.7 ‘To predict many values (or even one), we call the predict method on the fitted ‘model object with a list of values of the predictor at which to predict in a data frame with a predictor variable named the same as in the model. This method also computes the confidence interval for the specific prediction (using the standard error of the fit and the t value computed with the model ogroes of freedom), as well as the prediction interval, both to any confidence (default 0.95) ‘Task 28: Using the data. frame method, make a prediction data frame from ‘the minimum to the maximum of the data set, at 1% increments, Using the predict method on the predict values and the 95% confidence interval of nn data frame, compute the predicted he best regression, for all clay contents from the minimum to the maximum of the data set, at 1% increments. Examine the structure of the resulting object Using the predict method on the prediction data frame, compute the predicted values and the 95% prediction interval of the best regression, for all clay contents from the minimum to the maximum of the data set, at 1% increments, Examine the structure of the resulting object. . > plrane < dave trane(Clay! = seq(in(Clayt), max(Clay1) +) by =D) > pred.c < predice(in21, ptrane, interval 3 Level = 0.95) > ser(pradc) ‘contidence" nus (1:63, 1:5) 16.9 16.8 17.8 18.8 19.8 r(e, "dinnanes")=List of 2 $: ebr (1:63), $5 eh (1:3) “se* er > pred.p < predict (in21, prrane, interval = *predietion® 9 even = 0.95) > ser(pred.p) nus (1:63, 1:3] 16.9 16.8 17.8 18.8 19.8 ‘dimnanes")=List of 2 $5 ebr (1:63), var $5 chr (1:3) “4 Q37: What are the predicted subsoil clay content, the confidence limits, and the prediction limits, for a topsoil with 55% clay? Explain in words the difference between confidence and prediction limits Jump to A37 « > pred. cIS5 +1 ~ nin(clayt), J tit lw ape 60,082 98.382 61.722 > pred.plss +1 ~ min(Clayt), sie lee ope 60,052 49.680 70.433 Note: A note on this R code: the first prediction, conneeponding to stray position 11s formin(CLay1), so that the prediction for 55% is at postion 6§#1-min(Clayl), ive. 46 in the prediction array ‘Task 29: Graph the best-fit line predicted values and the 95% confidence interval the best regression for the prediction frame. Also show the 95% prediction interval, ie. the band in which 95% of the predicted values are expected to be. m also plot the observed values, . 
> plot(ptranesclayt, type = ‘a, predel, "ete +) Ylab = "Clay 20-30ea*, x1im = e0, 80), ylim + 80)) > Lunes(pfraneSClays, pred.c(, "fat"], wt 20, lay 0-10e0", ) > Lanes ptraesclay!, pred.cl, "Ier"), col = 2, 1ud = 1.5) > Lanes(ptraneSClay1, pred.c(, ‘upr*)| cel = 2, Iu = 1.5) > Lanes(ptranesClay1, pred.p(, "Ier*)| col = 4, ud = 1.5) > Lines pfrareSlay1, pred.p(, "upr*l, col = 4, Tud = 1.5) points(Clay1, Clay2) 8 8 a z ‘A note on this R code: The predict method returns an matrix with rows as the cases (prediction points) and columns as named dimensions, which we then access by constructions like prea, "apr" Q38 : Why did we not compute the prediction for higher or lower values? Jump to AB Q39 : Explain the meaning of the confidence intervals. For what parameter of the regression are they giving the confidence? Why are these lines curved? Jump 10 A396 Q40 : Explain the meaning of the prediction intervals. As a check, how many and what proportion of the actual data points in our sample fall outside the prediction interval? Give that there are 147 samples, how many of these would you expect to find outside the 95% prediction interval? Jump to Ad0« 5.9 Robust regression* Many of the problems with parametric regression can be avoided by fitting a so-called “robust” regression line, There are many variants of this, well-explained by Birkes and Dodge [2] and illustrated with $ code by Venables and Ripley [34] Here we just explore one method: 19s in the MASS package; this fits a regression 39 to the “good” points in the datasct (as defined by some criterion), to produce 1 regression estimator with a high “breakdown” point. This method has several ‘tuneable parameters; we will just use the defaults. ‘Task 30: Load the MASS package and compute a robust regression of subsoil on surface soil clay content. Compare the fitted lines and the coofficient of determi- nation (R2) of this with those from the least-squares . > requirestass) > indir < Iga(Clay? ~ clayt) > amat < In(Clay2 ~ Clay1) > elasa(in2t.r) (1) "igs" > elass(an2i) nm > sunmary(1n24.2) Length Node sng 1 enaracter coetticients 2 sumer Ddestone 2 nunerse fitted. values 147 nunerie call 2 call xlevele ° lise nodal 2 data.trane List > sunmary(in21) cana In(formsla = Clay? ~ Chayt) Min 19 Medien Qa “17.498 3.483 0.443 2.662 17.269 couttscients Estimate Sta. Error ¢ value Pr(>lt]) CEntercept) 6.0354 1.0514 5,74 5,30-08 ane clay 0.9821 0.0207 31.95 < 26-16 v4 Signit. codes: 0 ‘ese! 0.001 ter! 0.01 'e! 0.08 1 0.1 tt Residual standard error: 6.17 on 145 degrees of treedon Multiple A-squared: 0.876, Aajusted Resquared: 0.875, Festatiscic! 1020-03 on 1 and 149 DF, prvalue: <2e-16 > coatticiente(1n21.r) intercept) clay 40 > coorticients(in21) (intercept) clay > 1 ~ sun(residuats(In21.r)°2)/sun (Clay? ~ nean(Clay2))“2) (] 0.84802 > 4 ~ sun(restduate(In21)"2)/sun( (Clay? ~ nean(Clay2))°2) (1) 0.87572 Notice how the two summary methods produce very different output; this illus- trates R's object-oriented methods, where objects of different classes can use the same generic methods Task $1: Plot the least-squares and robust lines on the seatterplot of subsoil vs. topsoil clay . > plot(Clay2 ~ Clayl, data = obs, x1ab 4°" ytab = *Subeosi clay 2°, main > abline(in2t, ty = 2) > abline(in2i.r, ey = 1) > Legend (50, 30, legend = (Rob dey = 1:2) Topas) clay 1" wo regressions”) |) "Least-squares Two regressions: Q41: What is the difference between the two fitted lines? Which model has the Better internal fit? Why is this? 
What scoms to be the advantage of the robust line in this case? Jump to Adl « 4 5.10 Structural Analysis® In §5.5 we saw that the regression of two variables on each other depends on which variables is considered the predictor and which the predictand, If we are predicting, this makes sense: we get the best possible prediction. But sometimes wwe are interested not in prediction, but in understanding a relation between two variables. In the present example, we may ask what is the true relation between topsoil and subsoil clay? Here we assume that this relation has a common cause, ic. that soil formation processes affected both the topsoil and subsoil in some systematic way, so that there is a consistent relation between the two clay contents. This so-called structural analysis is explained in detail by Sprent [32] and more briefly by Webster [36] and Davis ( (9, pp. 214-220] and [8, pp. 218- 219) In structural analysis we are trying to establish the best estimate for a structural or law-léke relation, ie. where we hypothesise that y = ot + Bx, where both x and y are mathematical variables. This is appropriate when there is no need to predict, but rather to understand, ‘This depends on the prior assumption of a ‘true linear relation, of which we have a noisy sample, x Y x+E yen ‘That is, we want to observe X and Y, but instead we observe x with random error E and Y with random error m, These errors have (unknown) variances of and 0}, respectively; the ratio of these is crucial to the analysis, and is symbolised as A A = apioz a ‘Then the maximum-likelthood estimator of the slope, taking Y as the predictand for convention, is: 5 1 S$)? + 4AsRy (sf — AS¥) + VisF — (2) Equation 2 is only valid if we can assume that the errors in the two variables are uncorrelated. In the present example, it means that a large random deviation for a particular sample of the observed subsoil clay content from its “ideal” value does not imply anything about the random deviation of the observed topsoil clay content from its “ideal” value. ‘The problem is that we don't have any way of knowing the true error variance ratio A, just as we have no way of knowing the true population variances, co- variance, or parameters of the structural relation. We estimate the population variances 0, of and covariance oxy from the sample variances s2, #3, and co- variauce sy, but there is nothing we've measured from which we can estimate the error variances or their ratio. However, there ate soveral plausible methods to estimate the ratio 42 ‘¢ If we can assume that the two error variances are equal, A = 1. This may be a reasonable assumption if the variables measure the same property (e.g both measure clay content), use the same method for sampling and analysis, and there is an a prior’ reason to expect them to have similar variability (hoterogencity among samples). ‘# The two error variances may be estimated by the ratio of the sample vari- ances: A = s}/s2. ‘That is, we assume that the ratio of variability in the measured variable is also the ratio of variability im their errors. For exam- ple, ifthe set of topsoil clay values in a sample is twice as variable as the set of subsoil clay values in the same sample, we would infer that the error vvatiance is also twice as mich in the subsoil, so that A= 2. But, these are two completely different concepts! One is a sample variance and the cother the variance of the error in some randora process. Using this value of A computes the Reduced Major Azis (RMA) [9, pp. 
214-218), which is popular in biometrics. @ The variance ratio may be known from previous studies. In the present example, we notice that s/s ~ 1.10; that is, the set of subsoil samples is about 10% more variable than those from the surface soil. We could take this as the error variance ratio as well. This is not so far from 1, which is also reasonable, since both variables measure the same property. Task $2; Write an R function to compute By.x, given the structural predctand, the structural predictor, and the ratio of tho enor vaianees A. Apply this to the structural relation betieen subsoil and topsoil clay, assuming equal variances, tnd then estimating the error variance ratio frou the sample, : eqni8 < function(y, &, Jamba) ¢ 2 var(y) ~ lambda + vars) var, y) Gat aqee(a'? +4 lambda # €°29)/2 4 > feqnts(ciay2, Clayt, 1) La) 1.0880 ogni (Ciay2, Clay, var Ciay2)/var(Clayt)) a) 1.0895, ‘The first estimate, with A . is the orthogonal regression. Note that it is numerically between the slope of the regression of 7y on x and the inverse of the slope of the regression of x on ¥ ° amziscoett 2} cuayt 99212 s/an2t$esest (21) Clayt 1 0182 4a > eqats(clay2, Clayt, 1) 11 1.0830 We can plot all of these on one graph. First, we compute the coefficients directly, then we plot the graphs, and finally put some text with the computed coefficients on them, 2x < var(Clayl) 22y < var(Clay2) sxy < var(Glayt, Chay2) bye eo any/ade ayx < nean(Clay2) ~ byx * nean(Clay!) brys < 1/(axy/s2y) any & mean(Clay2) ~ bxyi + nean(Clay!) b & eget (22y/224) a © mean(Clay2) ~ b + nean(Clayt) bp < eqni8(clay2, Clayt, 1) ap © sean(Clay2) ~ bp + ean (Clay!) plot(clay!, clay2, xlim = c(5, 80), ylam par(adj = 0.5) title("Correlation and regression", "Clay 10-20 en vs. Clay 0-10 cn") par(aaj = 0) abline(ayx, byx, col = 4, Tey = 1) vexe(40, 8, paste(*Ciay? on Clay!; be", round(bye, 4, a", round(ayx, 2), sep ="), col = 1) ablane(axyi, bxyt, eal = 2, ty = 2) text (40, 12, paste(*Clayt on Clay2; 4," a=", round(axyi, 2), sep = abline(a, b, col = 3, Ity = 9) tere(40, 16, paste ("Proportional variance; b=", round, 4), 8) ae", rounds, 2), sep =", col = 3) ablinetap, bp, col = 4, ity = @) vexe(40, 20, paste (Equal variance; bo", round(bp, 4), sy act, rounaCap, 2), sep ="), col = 4) (5, 70, paste("e = ", roundCcor(Clay!, Chay2), 4), rend =", round((cor(Glayt, Clay2)"2) * 100, 1), sep =") (8, 800) round(bays, D, col = 2) 44 5.11 Structural Analysis by Principal Components* ‘This optional section presents another way to caleulate structural equations in the case that error variances are equal. We use principal components analysis, which is the basic multivariate data reduction technique, discussed in more detail in 8.2. Here it is just used to compute the orthogonal axes of the two variables for which we want the structural analysis First, we compute the principal components for the two variables, which are put, in a temporary data frame, in either order. Then we examine their loadings in the synthetic variables, which are the coefficients by which original observations are multiplied to produce the synthetic observations. The loadings for the first component, ot principal azis of the new space, give the the slope of the structural regression, in this case of Clay2 on Clayt; if we wanted to describe the structural relation of Clay1 on Clay2 we would simply invert the ratio. The intercept is computed from this slope and the centroid of the regression, as before. 
We also compute the proportion of the variation explained by the first axis, from the standard deviations of the two synthetic variables. > pe & princemp(cbind(Clay!, Clay2)) > pestoadings Leadings: Comp. 1 Comp.2 Clayt 0.589 -0.725, Clay? 0.725 0.688 comp. Conp.2 SS loadings 1.01.0 Proportion Var 0.80.8 Cumlative Ver 0.8 1.0 > b & poSloadingel*CIay2", “Conp.1"I/pc$londings(°Clay!", 45 + Hemp, De [1) 1.0880 > b < pofloadingsl2, 1)/peSioadings (1, 17 >a La) 1.0880 > a © mean(iay2) ~ 8 + mean(Clay) [11 6196 > pessder Comp. Comp.2 19.8084 3.6029 > as.numeric(round(pe$sdev(1]/sum(pcSsdev) ¥ 100, 1)) U1) s6 ‘The best structural equation, Clay2 = 3.82 + 1.053 - Clayl, is the same as that computed in §5.10 for the case A= 1, Le. equal error variances. 5.12 A more difficult case ‘The example of §5.2 ~ §5.11 was fairly easy to analyse. There is indeed a strong linear relation between the two variables which is only slightly affected by soil type. In this section, by contrast, we examine a “messicr” bivariate relation, to which ‘we will return for a more satisfactory multivariate analysis in §7.2. As explained in that section, we know from theory and many detailed studies that the cation exchange capacity (CEC) of the soil depends on reactive sites where cations (such as K+ and Ca*t) can be easily adsorbed and desorbed, ‘There are such sites both, con clay particles and humus; here we examine only the bivariate relation with clay. ‘Task $3; Exazine the selation between topsoil clay (as the predictor) and topsoil cation exchange capacity (as the predictand); first using all points and thon showing the soil type. . par(atrou = c(t, 2)) plet(Glayt, Ceci) plot(Clayl, CECI, pch = az.muneric(erbi) + 14, col > aunerie(inbt), ex = 1.5) title(*CBe vs clay, topscil") Legend(10, 26, Levels(arb1), peh = as.nuneric(levels(urbt)) + 14, col ~ az-nuneric(levela(erbi))) par(afrow = c(t, 10) 46 Q42: Is there an apparent relation between clay and CEC? Is this consistent across clay contents? Is it linear? Do there seem to be differences between soil types? Jump to Ad2« ‘Task $4: Compute the bivariate correlation (95.3) of topsoil CEC and clay content. . > cor. test (CECT, Clay!) Pearson's product-moment correlation CEC: and clay 18.0962, af = 145, p-value = 2.107e-13, alternative hypothesis: true correlation is net equal to 0 9% percent confidence interval sample estinates: 0.86786 Q43 : How strong is the linear correlation? Compare the confidence interval with that from the correlation between topsoil and subsoil clay. Jump to Ad3 « ‘Task 35: Compute and plot the bivariate regression (§5.5) of topsoil CEC on clay content, . > model.cect <= In(CkCt ~ Clay!) can An(formala = CEC ~ clayt) Min 19 Medion 3g Max ~6.706 3.381 -0.645 2.201 14.196 cootticients Estinate Std. Error t value Pr(>lt]) ntercept) 4.8262 0.8620 5.8. 1.08-07 #4 Clayt 0.2039 0.0252 8.1 2.16-19 vee Signif. codes; 0 less! 0.001 tee! 0.01 'et 0,05 1.1 tt Residual standard error: 4.24 on 145 degrees of freeden Multiple Resquared: 0.311, Adjusted Resquared: 0.307 Featatiacie! 65. on 1 and 145 DF, prvalue: 2.13e-13 > ploe(clayt, rei) > abline(nedet cect) Q44 + What is the equation of the least-squares linear rogression? How much of the variability in CEC does it explain? What is the sproad of the residuals (unexplained variation)? Jump to Add « Task 36: Compute and plot the regression diagnostics (§5 plot (rieted(sodel.cect), CHCI, peh = 20, xlab = “Pieced lab = "Observed", XLim = range(CECI), ylis = range( tatde("Observed va. 
Fitted topeodl CEC") abline(, 1) segnents(fitted(nodol. cect), CECI, fitted(sedel. cect), fitted medal. cect)) 1)) adres < 2d(reaiduale(model cect)? plot (fitted(nodel.ceci), resid(model.cect), pch = 20, xlab = "Fitted", ylab = "Residual", main ~ "Regression Reeiduala ve. Fitted Values, topeosl for () 1m -3:3) abline(h = j + sqrt (var(eeeid(nodel .cect))), eel = abaj) + 1) 3x © which (abs(resid (model. cect)) > 2 © sdres) vexe(fitted(nodel. cect) [ix], resid(nodel. cect) [ix], ax, 48 + pore > ebind(ebs(ix, e(*Clay", *lay2")}, #1 = reund(titted (node) cect) (12) 4D, resid = round(resid(aodel. cect) (4x), 1)) Glayt clay? tiv resid 134 8014.6 mo aL om 6 75 TR % 35 9.8 19, 7 4 5013.8 > m(sdres, ix) > qqnorn(residuale(nedel.cect)) > qgline residuals node! cect)) Observed vs. Fitted topsoil CEC isual residuals? Do the res Q45: Are there uni iduals suggest unequal variance in the predictand throughout the range of the predictor (“heteroscedascity")? Jump to AS 6 ‘We will rotum to this example for a more satisfactory multivariate analysis in §7.2. It tums out that much of the violation of the assumption of normal regression residuals can be accounted for by other variables, 5.13 Non-parametric correlation Clearly, the relation between topsoil CEC and clay is not bivariate normal, so the parametric (Pearson's) correlation computed above is not a valid measure of their association. So, the Pearson’s coefficient should not be reported. 49 ‘The alternative is a non-parametric measure of bivariate association. The most common is a rank correlation, and the most common of these is Spearman's p, where the ranks of ‘0 variables are correlated as with the Pearson’s test. ‘The rank fan uns the ranks of each observation: > head(cBet, n= 10) 1) 13.6 12.6 21.7 11.6 14.9 18.2 14.9 148.6 7.9 18.9 > head(rank(OECt), m = 10) 1] 118.0 98.0 140.0. 87.8 125.0 152.0 123.0 121.0 40.0 123.0 > head(layt, n = 10) 1] 72°11 61 95 47 49 53 59 46 62 > head(rank(Clayt), 2 = 1 (1) 147.0 146.0 142.0 197.5 128.0 120.5 144.0 141.0 126.6 143.0 The first paired observation is (CEC, clay) = (13.6,72); these have ranks (115,147) respectively of the 147 observations ‘The Peatson’s correlation of the ranks is the Spearman's p: > cor rank ect), rank(Ctayt)) (1) 0.57298 > cor(OBCt, clayl, aethod = *epearaen") 11 o.sras8 Note that in this case p > 17 > cor cect, cayt) (1) 0.55796 5.14 Answers ‘AI2® The relation is strong, linear, and positive. An inerease in clay in the surface is accompanied by a proportional increase in clay in the upper subsoil. Return to QI2 AIS: The relation looks similar; Soll type 1 (circles, black) has lower values in both layers, Soil type 3 (crosses, green) has all the high values in both layers, but it does not appear that a different line would be fitted through each class separately. There are only three samples of soil type 2 (diamonds, red). Return to QI3 + ‘AIA: The correlation is almost certainly different from zero, since the p-value for the alternative hypothesis of no relation is almost zero. Thus there is almost certainly a 1 to QU + relation Ret AIS; If this sample is representative of the population, the most likely value for the correlation of clay in the 0-10 and 10-20 em layers, over the whole population, is estimated from this sample as 0.936. If we repeated the sampling operation (147 samples) large number of times, in 95% of the eases we would expect to find the sample correlation coefficient between 0.912 and 0.953. This is a very high positive linar correlation, since the maximum is 1. 
Return to Qi5 « 16: The smooth line has « considerably steeper slope than the “best” Ine at low clay values (till about 22%) and then a slightly shallower slope at high values, The first part of the line fits soil type I and the second soll type 3, Return to QI6 + ‘AIT: At low values of the parameter the line is very erratic; at high values it misses the ‘sepazate “pieces” of the relation, The default value in this case gives a good visualisation, Return to QI7 « (0 QIS + ‘AIS: Clay? = 6.04 + 0.98 » Clay: Ret A19 + Subsoil clay is abou there is 6% higher than topsoil clay on average, so that i no topsoil clay, we predict 6% subsoil clay. For cach 1% extra topsoil clay, there is an increase in 0.98% in subsoil clay. Return to QI9 + N20: This is given by the adjusted R-squared: 0.8749, Note that this is a bit smaller than the value computed by simply squaring the correlation coefficient: > eor(cay2, Chayi)"2 (1) 0.87572 Return to Q206 A21: The total squared length of the lines would increase; the shortrest possible squared lengths are shown here. Return to Q2I « ‘N22; Clayi = 1.49 + 0.89 + Clay2 Return to Q22 + ‘N23; Topsoil clay is about 15% lower than subsoil clay on average, s0 if there is no subsoil clay, we predict less than zero topsoil clay, which is not physically-possible. For each 1% extra subsoil clay, there is an increase in 0.89% in subsoil clay. Return to Q23 A24: This is given by the adjusted R-squared: 0.8749; it is exactly the same as the reverso regression Return to Q24 + ‘N25; The errors are different in the different directions, leading to different east-square estimates of the best Bt Return to Q25 + ‘A260 : The errors are different in the different directions, leading to different least-square ‘estimates of the best ft Return to Q26 « 27: They should be identical, ie. fall on the 1:1 line. OF course they are not because of error. In any case they should be symmetric about a 1-1 line (ie. the length of the zesidual segments should be approximately equal above and below the line) throughout the range In this case, low predicted values are consistently too high (Le. the observed was lower than predicted, below the line). In addition, there are several points that are quite oorly-ftted. Return to Q27 + ‘A28 > The residuals should ideally all fall on the 0 horizontal line; of course they do znot because of error. However, in any case they should be symmetric about this line throughout the range, and have the same degree of spread. Jn this case, the low predicted values all have negative residuals (as we saw above). Also, values in the 30-40% predicted range genorally have positive residuals. The spread seems about the same. There are two residuals more than 3 standard deviations from the mean, one positive and ‘one negative. There are five positive and one negative residual between 2 and 3 standard deviations from the mean, Return to Q28 « ‘N29: Observation 128 has a residual of +17-3: from a topsoil content of 17% we predict 22.7% but 40% was measured; this is much higher than predicted, Observation 145 has ‘residual of -17.3: from a topsoil content of 30% we predict 35.5% but only 18% was ‘measured; this is much lower than predicted, and indeed one of the few cases where subsoil clay is substantially lower than topsoil. Absolute residuals above about 8% clay ‘content are indeed significant for management. 
Return (0 Q29 + AGO: The residuals are more or less symmetric about 0 but there are large “tails” in both directions with some extreme values: ~17.5,-13.2 and +13.7,14.3,17.3. Even eer 0 the distribution does not appear regular: there are too many residuals in the ~2 and +2 bine compared to the adjacent ~1 and +1 bins Return to Q30 + AST: For the most part, the residuals follow the theoretical normal distribution well they are near the normal line and thicker near the middle. However, two low values are tunder-predicted (i.e. their value is below the line: these values should only be found at lower quantile), and about five are over-predicted (i.e. their value is above the line values this great should only be found at a higher quantile). S do not fit the normal distribution, but the bulk of the residuals do. Re tails of the residuals nto Q31 + ‘Shapiro-Wilk test shows that almost certainly (p ~ 0.002) we would not commit a Type I error if we reject the null hypothesis; ie. the residuals are most likely y-distributed. However, this is not so serious since they are symmetrically- ind the positive and negative departures from normality are somewhat bal- Return to Qa2 ‘A33; The highest leverages are the furthest from the mean of the pred to 83 + ctor. Return ‘AS4 = No, the line that would be fitted without them seems to go right through these points Return (0 Q34 + ‘ASS: The high-loverage points are consistent with the others, so removing thei ‘4 model with poorer fit: higher RMSE and lower R2. Return (0 Q35 + ABO : The prodicted value is 6.0354 + 0.9821 = 55 = 60.0509 which rounds to 60 (remember the precision of measurement). The confidence limits are round (predStit-q¢ (.975,predSaf)*pred8se.tit,1) — 58.4 and round (pred$tit+ge (.975,predSat)*predSse.tit, 1) = 61.7. Return to Q36 + A37 + If the topsoil clay content is measured as 55%, the predicted subsoil clay content (rounded to the nearest %) is 60%; the 95% confidence limits are 58.4% and 61.7% (same as previous auswer). The 95% prediction limits are 49.7% and 70.4%, The confidence limits are the 95% confidence of the expected value of the predictand (subsoil clay) at a predictor value of 55% topsoil clay. The prediction limits are the limits within which we expect 95% of all measured values of the predictand (subsoil clay) at a predictor value of 55% topsoil clay. That is, if we aneasured 100 locations with 55% topsoil clay, we expect that 95 of the subsoil clay contents would be between 49.7% and 70.4% Return to are ‘ASS; The equation is only calibrated in the range of the predictor. Return to Q8S + G9; The confidence bands refer to the best-fit regression line. nt Q99 « A40 : The prediction bands refer to the predicted values at each value of the predictor. In our sample, 8 of the 147 points, ie. 5.4%, are outside these bands. This agrees vory well with the 95% confidence interval (0.05 x 147 = 7.35) Return to QUO « 6 Oneway A41: The robust line fits the soils with 10-50% topsoil clay better than the least-squares Jine. The high end of the range (about 10% of the dataset) has relatively lower subsoil clay than predicted by the robust line. The goodness-of-fit as measured by the explained sum of squares is by definition best for the least-squares fit. However, the robust fit is only a bit poorer in this sense vet fits the bulk of the data better. Return to Ql + ‘Ad2: There is clearly a positive relation: in general, higher clay soils have higher CEC. 
However this is far from linear, in particular both high-clay and high-CEC soils show large diserepancies: some high-CEC soils have low clay and some high-clay soils have relatively low CEC. Soil type J has lower values overall but also (it appears) a stocper relation: CEC seems to increase more per unit clay increase than for soil 8, Return to Qa ‘AB The correlation of r = 0.558 is moderate (R® = 0.311) but far less than for the ‘lose relation between topsoil and subsoil clay (r = 0.936, R® = 0.876); the confidence interval is also quite wide (0.435 ,..0,660) compared to the two clays (0.912...0.953), showing the weaker relation Return to Q43 « A442 The leastosquares line is CECL = 4.826 + 0.204 = Clayi, This implies that even with zoro clay there would be some CEC, suggesting that there is another source of CEC than just clay. Only about 31% of the variability of CEC is explained by clay; this also suggosts another source. Residuals range from ~6.7...+ 14.2 which is a substantial spread, given that the range of CEC itself is only 3...29. Return to Qi « ‘Aa5 Not only is the fit poor, but the regression residuals are not eveniy-distributed There are many more extreme positive than negative residuals, showing that high-CBC soils are poorly-modelled from clay alone; eight residuals (numbered in the residual plot) fare more than two standard deviations from zero. The residuals are far from normally distributed (see the normal Q-Q plot). However, there does not appear to be any het- ‘eroscedascity; the spread of residuals seems consistent over the range of the fitted values. ‘Thus the linear mode! is not adequate and should not be used for prediction, — Return to Q45 + Analysis of Variance (ANOVA) Analysis of Variance (ANOVA) is used to determine how much of the variability in some property is explained by one or more categorical factors. As with regression, this does not necessarily inaply a causal relation from the factors to the response. However, it does supply evidence to support a theory of causation that can be Justified with a conceptual model. ‘The simplest ANOVA is one-way, where the fotal variance of the data set is compared to the residual variance after each observation's value is adjusted for the mean for the one factor. In the eurrent data set, we ean ask how much the clay content varies among the four zones (variable zone). Clearly, the zone itself doesn't cause clay to vary, but it is certainly reasonable that some other factor associated with the zone could bbe at least a partial cause. Here the zones are defined by elevation and relief, with limits corresponding to differences in parent rock and also with orographic rainfall.’ Because the surface soil is more subject to human disturbance and local erosion, ‘we will model the clay in the lower subsoil (30-50 cm) 6.1, Exploratory Data Analysis ‘Task 37: Visualise the clay content of the lower subsoil by zone. . jemselves, and then mntent It is always a good idea to look first at the data values ¢ summarise or graph them. So first we sort the observations by their elay c land sce which zones scom to be associated with lower or higher values. The sort method sorts one vector. 
The order method lists the indices (row or observation numbers in the data frame) that would produce this order, and these indices can be used as subscripts for another variable or the € [} 16 16 39 20 20 20 25 24 25 25 25 25 27 27 27 27 28 30 31 31 32 22] 52 92 32 33 35 33 33 53 95 54 36 35 35 35 36 36 37 37 37 38 38 43] 38 38 38 39 G0 40 40 40 40 40 40 40 a1 41 41 42 42 42 42 43 43 54] 43 45 43 43 44 a4 44 44 06 44 44 45 a5 45 45 45 45 45 45 45 66 85] 46 46 46 47 47 AY 47 47 47 48 48 48 48 48 48 49 50 50 51 51 52 [108] 52 63 53 69 64 64 64 96 55 55 5b 65 56 58 57 GT 57 S7 ST 57 ST [127] 57 8 68 $8 58 60 61 62 62 65 65 66 66 65 70 70 70 72 73 78 80 > erder(ciays) 132 32 8 16 1) 125 194 39 91 106 196 145 114 64 90 124 193, IT] M1 63 38 82 36 46 6S 65 37 41 43 502 53] 14 18 35 67 130 20118 131 28 47 6B 116 fag] St 7 74 101 122 141 21 SS 58 115 87 100 65] 71 75 107 58 69 75 B1 95 102 108 25 59 (81) 103 110 148 30. 44 92 147 8B 95 98 128 138, 97] 42 72119 94 87136 11 27 8S a 5 45 (1s) 77 12 28 39 97 12 24 6 si 62 78 79 [129] 113 320 129 83 10121 64192 3:497 139 ras) 10s 22 > ClayS order (c1ayS)) [1] 16 16 19 20 20 20 25 24 25 25 25 25 27 27 27 27 28 30 31 31 32 22] 32 32 32 33 33 33 33 38 33 34 34 36 36 35 36 36 37 37 57 38 38 3] 38 38 38 30 40 40 40 40 40 40 40 40 a1 21 41 42 42 42 42 43:43 64) 43 43 43 13 44 4 44 44 44 44 40 45 05 05 45 45 45 45 45 45 06 (5) 46 46 46 47 47 47 47 47 47 48 48 48 48 48 48 49 50 50 51 51 52 [106] 52 63 53 59 54 54 54 54 55 55 55 55 56 58 57 S7 57 57 57 57 67 [127] 57 58 58 58 58 60 61 62 62 65 65 66 66 65 70 70 70 72 75 78 80 > zone [order (Clay5)) 11) 444464426666644200444454044655463 SS] 114553333533524433334358553353233 65] 35533333433353953533555333222422 caused by the mountain forcing moist warm air to rise into sphere, resulting in precipitation (or) 424532252532231423252225122533222 [io] 2222112221221222222 Levels: 1234 ‘We can use the by method to compute any statistic for each level of a factor. Here ‘we do it for the range. The second example illustrates the use of an anonymous function to compute the width of the range. The function (x) will be called with the vector of clay contents for each zone in turn, so that max(x) and min(x) will operate on this vector. > ey(clays, zone, range) zone: [1] 35 70 zone: 2 [1] 23 80 zone: 3 a) 32 87 zone: 4 (1) 16 58 > by(clayS, zene, functien(s) max(x) ~ minx) Li) 36 la) 87 zone: 3 11 25 zone: 4 [a] 38 Q46 : Do there appear to be differences between zones in the data values and their ranges? If'so, describe them in words. Jump to Ad6 « Second, we can visualise this with the boxplot method, dividing the response variable (here, ClayS) by a factor (here, zone), using the same syntax as im. In addition, if we select the notch=T option, this method will show whether the class medians are significantly different > bexplot(ClayS ~ zone, notch = T, horizontal = 7, xlab = “Clay J, 30-S0cn", + jab = *zone") Q47: Do there appear to be differences between zones in the data values and their ranges? If so, describe them in words. Jump to Ad7 « Q48 : Are there an equal number of samples in each zone? (Hint: use the by method to divide by zones, with the length method to summarise.) Jump to Als Q49; Does there appear to he a difference between zones in the data distribu- tion? If'so, describe it in words. Jump to A49 ‘Task 38 : Compare the data summaries for clay contents of the lower subsoil by zone; this is the numerical confirraation of the boxplo > by(Clays, zene, summary) Min. ist gu Median Mean 3rd gu. Max. 
35.0 49.2 58.0 85.0 83.0 70.9 Min, tet Qa. Median Mean 3rd gu, Max Min. ist Qu. Median Mean Sed Qu. Max. 32.0 40.0 44.0 43.8 46.0 87.0 zone: 4 Min. ist Qu. Median Mean Grd Qu. Max 16.0 25.0839. 6.2 Oneway ANOVA Based on the boxplots and descriptive statistics, it seems that the zone “explains” some of the variation in clay content over the study arca. The technique for deter- mining how much is explained by a factor is the Analysis of Variance (ANOVA) Rs lm method is used for ANOVA as woll as for regression; in fact it is just another form of the same linear modelling. Task 39: Calculate the one-way ANOVA of subsoil clay on zone, and display the ANOVA table. . > sunmary (132) In(formila = ClayS ~ zone) Mon 19 Median 39 Max -92.960 -5.396 0.159 3.159 24.050 Cootticionts Estimate Sta. Error ¢ value Pr(>ltl) (intercept) $5.00 3.21. 17.14 < 26-18 one zoned “23.67 3.88 6.67 .20-10 oe Signit. codes: 0 ‘ese! 0.001 few! 0.01 910,08 SF OLE Residual standard error: 9,08 on 143 degrees of freedon Multiple R-squared: 0,512, Adjusted R-squared: 0.502 Festatistic: 60,1 on 2 and 143 DF, prvalue: <2e-16 Q50; How much of the total variation is explained by zones? Jump to A50« ‘The summary for a categorical model shows the class means: # The estimate on the first line, here labelled (Intercept) with value 55.0, is the mean for the first-listed class, here zone 1 # The estimate on the second line, here labelled zone2 with value 0.95, is the difference between the mean of the second-listed class; in this example the mean for zone 2 is 55.0 + 0.95 = 55.95. @ The remaining classes are computed as for the second-listed class. We can also see this result as a classical ANOVA table, by using the aev method: > sunmary(aov(lnz)? Dr Sun Sq Mean Sq F value PGF) zone 3199804130 60.1 coetticients(aov(inz)) intercept) zone? zoneS_zoned Q51: How likely is it that this much of the variability could have been explained by chance? Jump to 51 « 52 : Can you determine the zone means from the ANOVA table? Jump to A526 From what was the reported probability value computed? Jump to A53 Q54: How were the variances of the model terms estimated? Jump to A54 « 6.3 ANOVA as a linear model* We can sce that ANOVA is just a special case of linear models by examining the design matrices of a regression and an ANOVA model; these are also called the model matrices, We'll look at observations 1 fh 22, since they come from different zones. Note for experts: the second matrix shown here is produced with the contrasts options set to contr treatment; other choices of contrasts may be preferred; sce the help for ?contrasts and 7options. R code: podel.natyax(ln21) (15:22,) seateax(lmz)(15:22,] R console output: (intercept) clay! "7 120 se 1 3 8 1 on 20 1 on (intercept) zone? zoned zoned 2 2 0 0 1 °° 1 0 0 1 0 1 0 1 0 4 0 ‘These are just the predictor value, for each observation, of the linear equation which Im is trying to fit by least-squares; each is matched with the value of the predictand for the same observation. In the first dosign matrix, there is ax in- tercept and the topsoil clay content, In the second design matrix, there is an intercept (which will be fitted by the mean of the first level of the factor, ic zone 1; mean(ClayS{zone ==1))) and three dummy variables, representing the deviations of the means of the remaining three zones from the mean of zone Observations from zone 1 (e4f. observation 15) have 0's for all of these; observa- tions from other zones have a single 1, corresponding to the zone (0. 
observation 6 has a 1 for dummy vatiable zone?) ‘The design matrix X is then used in the unweighted least-squares estimate, given the sample values as the response vector y: B- atx ixty which we can compute directly as: R code: X & nodel matrix(n21) Goeta & solveCe (LHI tel eek clays) K & model matrix nz) Geta < solveCcOL4I) Heh UO Kah Clays) peck, beta) Note the use of ‘Us, operator for matrix multiplication. 60 R console output: Cineercept) 18.7586 Caneercept) 95.000 zone? 0.950 zoned 11.159 zones 23.687 ‘These are the same coefficients as we found using 1m for bivariate regression (§5.5) and one-way ANOVA (§6 2). 6.4 Means separation* ‘ow we infer that the model is highly significant, but are all the means different? For any two zones, we could use an unpaired t-test. For the entire ANOVA table, ‘we could use Tukey’s HSD if the sample was approximately balanced (which it isn't here). Another method is to do all the pairwise (tests, but adjust for the number of comparisons; the pairvise.t.test method performs these tests. It niay be used with different assumptions about the individual class variances (ic. whether it is legitimate to pool them or not) and different degrees of correction. Here are examples of the use of pairwise.t.test, first with no correction, then with a correction, then without the assumption of equal class variances; output has been edited for brevity: > pairvise.t.test(ClayS, zone, p.adj = "none" Pairwise comparisons using t tests uith pooled SD ava: ClayS and zone 1 2 3 zor - P value adjustnent method: none > paireize.t.test(ClayS, zene, p.adj = "holm" Pasywice comparisons using t tests vith pooled SD ava: ClayS and zone 4 2.80-08 < 26-16 3.06-08 P value adjustnent method: hole > patrvise.t.teet(ClayS, Zone, p.adj ~ "none", pooled = FD Pairwise comparisons using t tests vith non-peoled SD 6. ava: ClayS and zone 2 23 2 0.8548 - - 30.0897 1.50-07 - P value adjustment method: none pairvise.t.test(ClayS, zene, p.adj = "holm", pool.sd = F) Pairwise comparisons using t vests with non-peoled SD daca: ClayS and zone value adjustaent mechod: hele Q55: What is the probability that the observed difference between zones 1 and 2 is due to chance, for all four methods? Jump to ASS « Q56 : How do the p values change if we adjust for multiple comparisons? How do they change if variances can not be pooled? Jump to A566 Q57: In the prosent case, is it reasonable to assume that class variances ean be pooled? Jump to AST « 6.5 One-way ANOVA from seratch* Iv is instructive to sco exactly how ANOVA works, The idea is to partition the total variance in 2 variable into the part attributable to some group (here the zone) and a residual ‘The unadjusted R? is directly computed as: R? = 1 ~ (Residual$s/Totalss) First we compute the grand mean and group means, to see that they're different. ‘Then, we compute the total sum of squares and the residual sum of squares after subtracting the appropriate group mean. We also compute the group sum of squares, ie. how much the groups explain, and check that this and the within group sum of squares equals the total sum of squares. Finally, we can compute the proportion of variation explained. 
> mean(clay5) Ua) as.68 62 > (means < by(ClayS, zone, mean)) Tn) 55 zone: 2 La) $5.95 zone: 3 (1) 43.241 1] 31.333, > (eas sum((Glays ~ nean(Clay5))"2)) ta) 2a > (ras < sun((Clays ~ neans(zone])-2)) (a) 11782 > (ges < sun(((neans ~ mean(Clay5))“2) + by(Clays, zone, + vengee))) 11 12380 > (gas + ras - +85) 1] 3.6380-12 > 4 - (res/tss) La) 0.51258 ‘These coraputations show quite well how R operates on vectors. For example, in the computation of gss, the group means (aeans) is a vector of length 4; since the grand mean mean (Clay6) is a scalar, itis subtracted from each element of the vector, resulting in another vector of longth 4; each clement is squared: the by method also results in a vector of length 4 (the number of abscrvations in cach class), the multiplication is clement-wise. Finally, the sux method sums the ‘relement vector Note that the R? = 0.513, exactly as reported by aov 6.6 Answers ‘N46: Zone thas most of the low clay coutents (with a fow in zone 2), zoue 3 is mediuan, while zones 1 and 2 have the highest clay contents. Zone 2 has the widest spread. Return to Q46 + ‘M47: There is a marked difference overall. Zones 1 and 2 are similarly high, zone 3 nto Q47 + lower and zone 4 lowest. Rett Ad8: No, zone I is soverely under-represented, and zone 3 has about half again as 63 imeny samples as zones 2 and 4. no Q48 + 49: Zone 2 has the widest range. It is mostly positively skewed, but has two boxplot outliers at the low end of the range. Zone I has very short boxplot tails, ic. the box with 50% of the values covers most of the range (but this is probably an effect of the small sample size). Zone 3 has a somewhat narrower range than the others and is symmetric. Zone 4 is slightly positively skewed. Return to Q49 + Return to Q50 + ‘ABI Return fo QSL + N52: Yes, The intercept is the mean of the first level of the factor, here zone 1 It is 55% clay. The factor estimates for the other levels are added to this to got the corresponding zone mean. For example, zone 2 is 35 + (~11,159) = 43.84 (compare to the descriptive statistics, above). Return (0 Q52 + ‘ASS; From the F ratio of the variances (zone and residual) and the two degrees of freedom associated with these model terms. n (0 Q53 + Abd: By the mean squared errors. n to Q54 « ASS: For the case where variances are pooled: 0.0013 and 0.0026; that is, in all cases the two means are significantly different at p=0.01. For the case where variances are not pooled: 0.0497 and 0.099%. That is, if we don't adjust for the number of comparisons, the difference is significant at p=0.05, and the Holm correction still shows a significant difference but only at p=0.10. Return to Q55 + ‘N56 In both cases the p-values increase, Le, it is more likely that the observed difference is just due to chance, Return to Q56 « A57: No. From the boxplot it is clear that 2 others, so the pool.2d=F option should be used. Ret 10 1 is much more variable than the 10 Q57 + 7 Multivariate correlation and regression In many datasets we measure several variables. We may ask, first, how are they inter-related? This is multiple correlation analysts, We may also be interested in predicting one variable from several others; this is multiple regression analysis 7.1 Multiple Correlation Analysis ‘The aim here is to see how a set of variables are inter-related. 
This will be dealt with in a more sophisticated manner in Principal Components Analysis (§8.1) 64 maa and factor analysis (§8.2) Pairwise simple correlations For two variables, we used bivariate correlation analysis (§5.3). For more vari- ables, a natural extension is to compute their pairwise correlations of all variables. As explained in the next section, we expect correlations between soil cation ex change capacity (CEC), clay content, and organic carbon content. ‘Task 40: Display all the bivariate relations between the three variables CEC, clay content, and organic carbon content of the 0-10em (topsoil) layer . > paira(-Clay! + Ci + CECt, data = obs) clays cect . Q58: Describe the relations between the three variables. Jump to A58 « ‘The numeric strength of association is computed as for any pair of variables with a correlation coefficient such as Pearson's. Since these only consider two variables ne, they are called simple coefficients. all pairs of variables CEC, clay, and OC in the topsoil. . We first must find the index number of the variables we want to plot, then we present these as a list of indices to the cov method: > nazes(obs) lsley" "zone cee" “CEC veri" ster soci *ocz" clay: We see the target variables at positions 10, 7 and 13, so > cov(obsleCia, 7, 1391) cect clayt oct cect 25.9479 39.608 5.6783 Clay! $8.6092 194.213 12.5021 oct 6.6783 12.502 2.2620 > eor(obsleCta, 7, 1891) cee: clayt oct ‘cect 1.00000 0.85786 0.74294 Chay! 0.85798 1.00000 0.59780 oct 0.74284 0.89780 1.00000 Q59: Explain these in words. Jump to A596 7.1.2 Pairwise partial correlations ‘The simple correlations show how two variables are related, but this leaves open the question as to whether there are any underlying relations between the entire set. For example, could an observed strong simple correlation between variables X and Y be because both are in fact correlated to some underlying variable Z? One way to examine this is by partial correlations, which show the correlation between two variables after correcting for all others. What do we mean by “correcting for the others"? This is just the correlation between the residuals of linear regressions between the two variables to be corre- lated and all the other variables. If the residuals left over after the regrossion are correlated, this can't be explained by the variables considered so far, so must be 1a true correlation between the two variables of interest. For example, consider the relation between Clay1 and C8¢1as shown in the scat terplot and by the correlation coefficient (7° = 0.55). These show a moderate positive correlation. But, both of these are positively correlated to O¢t (r = 0.56 and 0.74, respectively). Is some of the apparent correlation between clay and CEC actually duo to the fact that soils with higher clay tend (in this sample) to have higher OC, and that this higher OC also contributes to CEC? This is an swered by the partial correlation between clay and CEC, in both cases correcting, for OC. ‘We can compute partial correlations directly from the definition, which is easy in this case with only three variables. We also recompute the simple correlations, computed above but repeated here for comparison. It's not logical (although mathematically possible) to compute the partial correlation of Clay and OC, 66 since the “lurking” variable CEC is a result of these two, not a cause of either. 
So, we only consider the correlation of CEC with OC and Clay separately, > cor(residuale(in(CBCi ~ Clay!)), residuale(in(oct ~ clayt))) (1) 0.81538 > cor residuals(in(CECt ~ 0C1)), residuals(in(Clayt ~ 0¢1))) (11 0.21214 > coroner, oct) 11 0.74286 > cortonct, ctayt) (2) 0.55795 ‘This shows that CEC is only weakly positively correlated (r = 0.21) to Clay after controlling for OC; compare this to the much higher simple correlation (r = 0.56). In other words, much of the apparent correlation between Clay and CEC can be explained by their mutual positive correlation with OC. ‘We can visualize the reduction in correlation by comparing the scatterplots be- tween Clay and CEC with and without correction for OC: > par(efrov = e(1, 22) > par(adj ~ 0.5) > plot (CECt ~ Clayt, peb = 20, cex = 1.5, xlim = e(0, 100), + adab = "clay 2%, ylab = CEC, enol? (kg se41)-1") > abline(h = aean(CEC!), ley = 2) > sbline(y = nean(Clayl), Ity = 2) > title(*Sinple Correlation, Clay ve, CEC 0-10 ca") > text($0, 4, cox = 1.5, paste(*r =", round(cor(Clayt, + esc), 3))) > art © residuals(in(onct ~ 01) > ar? © residuals(in(Clayt ~ Oct) plet(mr.1 ~ mr.2, peh = 20, cex = 1.5, xu 50), slab = "Residuals, Clay ve. 0C, 7 ablineh = nean(ar.1), 1ey'= 2) ablane(e = nean(nr 2), 1ey = 2) title("Partial Correlaticn, Clay ve. CEC, correcting fer OC 0-10 ct eere(25, 6, cox = 1.5, paste(*r =", round(cor(ar. 1, r.2), 3))) par(aaj = 0) maar. 1, ar.2) par(atrow = c(t, 1)) = €¢-50, lab = “Reaidua cee ve. OC, emo? (he se8l)~ 67 ‘The two scattorplots show that much of the apparont pattem in the simple cor- relation plot (left) has been removed in the partial correlation plot (right); the points form a more diffuse cloud around the centroid By contrast, CEC is highly positively correlated (r = 0,62) to OC, even after controlling for Clay (the simple correlation was a bit higher, r= 0.74). ‘This suggests that OC should be the best single predictor of CEC in the topsoil; we will verify this in the next section ‘The partial correlations are all smaller than the simple ones; this is because all three variables are inter-correlated. Note especially that the correlation between OG and clay remains the highest while the others are considerably diminished; this relation will be highlighted in the principal components analysis. Simultancous computation of partial correlations Computing partial correla- tions from regression residuals gets tedious for a large number of variables. For- tunately, the partial correlation can also be obtained from either the variance~ covariance or simple correlation matrix of all the variables by inverting it and then standardising this inverse so that the diagonals are all 1; the off-diagonal are then the negative of the partial correlation coefficients. Tiere is a small R function to do t applied to the three topsoil variables: (and give the off-diagonals the correct sign), picor < function (x) € (var(x)) 54h < diag(1/sqrt (diag 300))) + picorsmat <= ~(6di Heh inv Jee sai) aingip.cor mat) <1 rounanes(p-cor.mat) <- colnanes(p.cor.nat) < colnanes(x) Feturn(p.cor.aat) ? peearebale(to, 7, 15)]) cre: clayt oct 68 7.2. Multiple Regression Analysis ‘The aim here is to develop the best predictive equation for some predictand, given several possible predictors, In the present example, we know that the CEC depends on reactive sites on clay colloids and humus. 
So it should be possible to establish a good predictive relation for CEC (the predictand) from one or both of clay and organic carbon (the predictors); we could then use this relation at sites where CEC itself has not been measured Note that the type of clay mineral and, in some cases, the soil reaction are also important in modelling soil CEC; but these are similar in the sample set, so we wil not consider them further. First, we visualise the relation between these to see if the theory seems plausible in this case. This was alzeady done in the previous section, §7.1. We saw that both predictors do indeed have some positive relation with the predictand. ‘To develop a predictive regression equation, we have three choices of predictors: © Clay content © Organic matter content © Both Clay content and Organic matter content ‘The simple regressions are computed as before; the multiple regression with more than one predictor also uses the 1m method, with both predictors named in the formula, ‘Task 42: Compute the two simple regressions and the one multiple regression, and display the summaries. Compare these with the null regression, ie. where every value is predicted by the mean, . > amcec.aull <= In(CECi ~ 1) > sunmazy(Incec. ull) cant An(gormila = cect ~ 1) Residuals Min 19 Median 3Q Max “8.20 -3.70 -1.10 1.90 37.80 Cootsicients Estimate Std. Error & value Pr(>ltl) CEntercept) 11.20 0.42 25.7 imcec.ee < In(cset ~ oct) > sunmary(Iacec. 0) 69 can an(formula = cect ~ 0¢1) Residuals Min 19 Median 3g Max Estinate Std. Error t value Pr(>Itl) (intercept) 3.570.830 5.82 3.60708 axe oct 2522 0.189 18.7 € 26-16 oe Signit, codes: 0 ‘eve! 0.001 fer! 0,01 'e! 0.08 ht ott a Residual standard error: 3.42 on 145 degr Multiple Reaquared: 0.552, ‘Adjusted R-equared: 0.849, > amcec.clay <= In(CECI ~ Clayi) > sunnary(Incec.elay) can An(Gormila = CECi ~ clayt) Min 19 Median 39. Max cootticients Estinate Std, Error t value Pr(>It!) CEntercept) 4.8262 0.8520 5.6 1.08-07 +x cayt 0.2039 0.0252 «8.1 2.46-19 vee Signit. codes; 0 tess! 0.001 tee! 0,01 'e! 0.05 10.1 tt Reaidual standard error: 4.24 on 145 degrees of freeden Multiple A-squared: 0.311, Adjusted Resquared: 0.307 Festatiscie: 65.5 on 1 and 145 DF, prvalue: 2-11-13 > amcec.ce-el < In(cEC! ~ get + Cray) > summary(Incec.oe-cl) cana An(formila = CE ~ oct + e1ayi) Min 1Q Median 3g Max “1.708 -2.016 -0.377 1.289 18.118, cootticiente Estinate Sté, Error t value Pr(olt!) (intercept) 2.7196 0.7179 3,79 0,00022 exe chayt 0.0647 0.0248 2.60 0.01015 + Signit. codes: 0 ‘eee! 0.001 tes! 0.01 16! 0.08 1.1 Lt a Residual standard error: 3.36 on 144 degrees ef treedon Multaple A-squared: 0.672, Adjusted Resquared: 0.556 Festatiatie! 96,3 on 2 and 148 DF, prvalue: <2e-16 Q60: How much of the total variability of the predictand (CEC) is explained by each of the models? Give the three predictive equations, rounded to two decimals. Jump to A60 « Q61: How much does adding clay to the predictive equation using only organic carbon change the equation? How much more explanation is gained? Does the model summary show this as a statistically-significant increase? Jump to A61 7.3 Comparing regression models Which of these models is “best”? ‘The sim is to explain as much of the varia- tion in the dataset as possible with as fow predictive factors as possible, ie. a parsimonious mode! 7.3.1 Comparing regression models with the adjusted R® Compare R. 
One measure which applies to the standar which decreases the apparent R2, computed for the number of predictive factors: linear model is the “adjusted” R2 2m the ANOVA table, to account where 1 is the number of observation and p is the number of coefficients Q62: What are the adjusted R® in the above models? Which one is highest? Jump to A62 « We can sce these in the model summaries (above); they can also be extracted from the model summary > sunmary(Incec, null fadj.r. squared io > sunmary(Incec.oc)$adj.r. squared 1] 0.54887 > sunmary(Incec.clay)$ad).r-squared (1) 0.30687 > suamary(Incec. oc. c1)S2dj.r. squared 1) 9.56618 7.3.2 Comparing regression models with the AIC Compare AIC A more general measure, which can be applied to almost any model type, is Akaike’s Information Criterion, abbreviated AIC. The lower value is better. > ATCCincee.nu11) (2) 898.61 > Atc(incee. 2¢) a) 182.79 > ATCCincee. clay) [11 865.98 > arcCimcee. 9¢.e1) (1) 178.09 Q63 : Which model is favoured by the AIC? Jump to A63 « 7.3.3 Comparing regression models with ANOVA ANOVA, F- test A traditional way to evaluate nested models (where one is a more complex version, of the other) is to compare them in an ANOVA table, normally with the more complex model listed first. We also compute the proportional reduction in the Residual Sum of Squares (RSS) > (a & anova(inces.0c.c1, Inces.ctay)) Analysis of Variance Table Mede1 1: cect ~ aci + clay Medel 2: cECi ~ Clay ResDE RSS DE Sum of Sq F PrO>F) 2 146 4622 215 2609-1988 87.8 <2e-16 sue Signit, codes: 0 ‘eee! 0.001 fer! 0,01 194 0,05 11 0.11 a > dats (2$RS5)/a$855(2) U1) 0.3787 ‘The ANOVA table shows that the second mode! (clay only) has one more degree of freedom (ie. one fewer predictor), but a much higher RSS (jvc. the variability not explained by the model); the reduction is about 38% compared to the simpler model. These two estimates of residual variance can be compared with an F-test. In this case the probability that they are equal is approximately zero, so it’s clear the more complex model is justified (adds information), However, when we compare the combined model with the prediction from organic matter only, we see a different result > (a & aneva(incec.ee.c1, Incee.0c)) 2 Analysis of Variance Table Medel 1: ceca ~ 0ca + clay Medel 2: CECi ~ 0c2 Res.DE RSS DE Sum of Sq F Prt Signit. codes: 0 ‘eee! 0.001 tes! 0.01 16! 0.08 1 0.11 a > dise(s$Rs5)/a$n5 (2) La) 0.045008 Q64: Which model has a lower RSS? What is the absolute and proportional difference in RSS between the combined and simple model? What is the prob: ability that this difference is due to chance, i.e. that the extra information from the clay content does not really improve the model? Jump to Ab « Regression diagnostics Before accepting a model, we should review its diagnostics (95.7). This provides insight into how well the model fits, and where any lack of fit comes from. Task 43; Display two diagnostic plots for the best model: (1) a normal quantile quantile (°Q-Q") plot of the residuals, Iden fitted observations and examine the relevant fields in the dataset, (1) predicted vs. actual topsoil CEC par(atrey = c(t, 2)) tmp < qquerm(residuals(Incec.ec.cl), pch = 20, main aqline(reiduale (incec. 9c. cl)) Sift < Cenpsx ~ empsy) exe enpSx, tapfy, ifelze((abe(dift) > 3), manesC4ise), "9, poe = 2) amtemp, dies) plot(het ~ t2eted(Incec.oc.cl), poh = 20, alia = (0, 30), ylim = (0, 30), xlab > "Fitved", ylab = "Observed", sain > "Observed vs, Fitted GEC, 0-10cs") abtine(o, 1) grid(col = "black"? 
7.3.3 Comparing regression models with ANOVA

A traditional way to evaluate nested models (where one is a more complex version of the other) is to compare them in an ANOVA table, normally with the more complex model listed first. We also compute the proportional reduction in the Residual Sum of Squares (RSS):

> (a <- anova(lmcec.oc.cl, lmcec.clay))
Analysis of Variance Table

Model 1: CEC1 ~ OC1 + Clay1
Model 2: CEC1 ~ Clay1
  Res.Df  RSS Df Sum of Sq    F Pr(>F)    
1    144 1621                             
2    145 2609 -1      -988 87.8 <2e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1 

> diff(a$RSS)/a$RSS[2]
[1] 0.3787

The ANOVA table shows that the second model (clay only) has one more degree of freedom (i.e. one fewer predictor), but a much higher RSS (i.e. the variability not explained by the model); the reduction is about 38% compared to the simpler model. These two estimates of residual variance can be compared with an F-test. In this case the probability that they are equal is approximately zero, so it's clear the more complex model is justified (adds information). However, when we compare the combined model with the prediction from organic matter only, we see a different result:

> (a <- anova(lmcec.oc.cl, lmcec.oc))
Analysis of Variance Table

Model 1: CEC1 ~ OC1 + Clay1
Model 2: CEC1 ~ OC1
  Res.Df  RSS Df Sum of Sq    F Pr(>F)  
1    144 1621                           
2    145 1698 -1     -76.4 6.78  0.010 *
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1 

> diff(a$RSS)/a$RSS[2]
[1] 0.045008

Q64 : Which model has a lower RSS? What is the absolute and proportional difference in RSS between the combined and simple model? What is the probability that this difference is due to chance, i.e. that the extra information from the clay content does not really improve the model? Jump to A64 •

Regression diagnostics

Before accepting a model, we should review its diagnostics (§5.7). This provides insight into how well the model fits, and where any lack of fit comes from.

Task 43 : Display two diagnostic plots for the best model: (1) a normal quantile-quantile ("Q-Q") plot of the residuals; (2) predicted vs. actual topsoil CEC. Identify the poorly-fitted observations and examine the relevant fields in the dataset. •

> par(mfrow = c(1, 2))
> tmp <- qqnorm(residuals(lmcec.oc.cl), pch = 20,
+   main = "Normal Q-Q plot, residuals from lm(CEC1 ~ OC1 + Clay1)")
> qqline(residuals(lmcec.oc.cl))
> diff <- (tmp$x - tmp$y)
> text(tmp$x, tmp$y, ifelse((abs(diff) > 3), names(diff), ""), pos = 2)
> rm(tmp, diff)
> plot(CEC1 ~ fitted(lmcec.oc.cl), pch = 20, xlim = c(0, 30), ylim = c(0, 30),
+   xlab = "Fitted", ylab = "Observed", main = "Observed vs. fitted CEC, 0-10cm")
> abline(0, 1)
> grid(col = "black")
> par(mfrow = c(1, 1))

[Figure: normal Q-Q plot of the residuals (left) and observed vs. fitted topsoil CEC, 0-10 cm (right).]

Q65 : Are the residuals normally distributed? Is there any apparent explanation for the poorly-modelled observations? Jump to A65 •

7.4 Stepwise multiple regression*

In the previous section, we examined several models individually, using our expert judgement to decide which predictors to use, and in which order. Another approach is to let R try out a large number of possible equations and select the best according to some criterion. One method for this is stepwise regression, using the step method.

The basic idea of step is to specify an initial model object, as with lm, and then a scope which specifies how variables in the full model should be added or subtracted; in the simplest case we do not specify a scope, and step tries to eliminate all variables, one at a time, until no more can be eliminated without increasing the AIC, explained above.

We will illustrate this with the problem of predicting subsoil clay (difficult to sample) from the three topsoil parameters.

Task 44 : Set up a model to predict subsoil clay from all three topsoil variables (clay, OM, and CEC) and use step to see if all three are needed. •

> lms <- step(lm(Clay2 ~ Clay1 + CEC1 + OC1))
Start:  AIC=461.91
Clay2 ~ Clay1 + CEC1 + OC1

         Df Sum of Sq   RSS AIC
<none>                 3226 462
- OC1     1        81  3308 464
- CEC1    1       179  3405 468
- Clay1   1     21075 24301 757

In this case we see that the full model has the best AIC (461.91) and removing any of the factors increases the AIC, i.e. the model is not as good. However, removing either OC1 or CEC1 doesn't increase the AIC very much (only to 464 and 468, respectively), so although statistically valid they are not so useful.

An example with more predictors shows how variables are eliminated.

Task 45 : Set up a model to predict clay in the 30-50 cm layer from all three variables (clay, OM, and CEC) for the two shallower layers, and use step to see if all six are needed. •

Note: this model could be applied if only the first two soil layers were sampled, and we wanted to predict the clay content of the third layer.

> lms <- step(lm(Clay5 ~ Clay1 + CEC1 + OC1 + Clay2 + CEC2 + OC2, data = obs))
Start:  AIC=420.7
Clay5 ~ Clay1 + CEC1 + OC1 + Clay2 + CEC2 + OC2

         Df Sum of Sq   RSS AIC
- CEC1    1        11  2339 419
- OC1     1        27  2355 420
<none>                 2328 421
- CEC2    1        59  2387 422
- Clay2   1      1774  4102 501

Step:  AIC=418.75
Clay5 ~ Clay1 + OC1 + Clay2 + CEC2 + OC2

         Df Sum of Sq   RSS AIC
- OC1     1        11  2350 417
- OC2     1        12  2350 417
<none>                 2339 418
- Clay1   1        31  2370 419
- CEC2    1        76  2415 421
- Clay2   1      1966  4305 505

Step:  AIC=417.43
Clay5 ~ Clay1 + Clay2 + CEC2 + OC2

         Df Sum of Sq   RSS AIC
- OC2     1         5  2355 416
<none>                 2350 417
- Clay1   1        32  2382 419
- CEC2    1        85  2435 421
- Clay2   1      2311  4661 514

Step:  AIC=415.77
Clay5 ~ Clay1 + Clay2 + CEC2

         Df Sum of Sq   RSS AIC
<none>                 2355 416
- Clay1   1        36  2391 417
- CEC2    1        90  2445 421
- Clay2   1      2311  4666 514

The original AIC (with all six predictors) is 420.7; step examines all the variables and decides that by eliminating CEC1 (a topsoil property) the AIC is most improved. The AIC is now 418.75; step examines all the remaining variables and decides that by eliminating OC1 the AIC is most improved; again a topsoil property is considered unimportant. The AIC is now 417.43; step examines all the remaining variables and decides that by eliminating OC2 the AIC is most improved. The AIC is now 415.77 and all three remaining variables must be retained, otherwise the AIC increases. The final selection includes both clay measurements (0-10 and 10-20 cm) and the CEC of the second layer.
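The object returned by step is an ordinary fitted model, so the selected predictors and their coefficients can be inspected directly. A minimal sketch, assuming lms as computed above:

> # the final model chosen by stepwise elimination
> formula(lms)       # which predictors were retained
> coefficients(lms)  # their fitted coefficients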
Notice from the final output that Clay1 could still be eliminated with very little loss of information, which would leave a model with two properties from the second layer to predict the clay in the subsoil; or CEC2 could be eliminated with a little more loss of information; this would leave the two overlying clay contents to predict subsoil clay. Either of these alternatives would be more parsimonious in terms of interpretation, although statistically just a bit weaker than the final model discovered by step.

7.5 Combining discrete and continuous predictors

In many datasets, including this one, we have both discrete factors (e.g. soil type, agro-ecological zone) and continuous variables (e.g. topsoil clay) which we have shown, in one-way ANOVA and univariate regression respectively, to be useful predictors of some continuous variable (e.g. subsoil clay). The discussion of the design matrix and linear models (§6.3) showed that both one-way ANOVA on a factor and univariate regression on a continuous predictor are just special cases of linear modelling. Thus, they can be combined in a multiple regression.

Task 46 : Model the clay content of the 30-50 cm layer from the agro-ecological zone and measured clay in the topsoil (0-10 cm layer), first separately and then as an additive model. •

> lm5z <- lm(Clay5 ~ zone)
> summary(lm5z)

Call:
lm(formula = Clay5 ~ zone)

Residuals:
     Min       1Q   Median       3Q      Max 
 -22.960   -5.396    0.188    3.159   24.050 

Coefficients:
            Estimate Std. Error t value Pr(>|t|)    
(Intercept)    55.00       3.21   17.14  < 2e-16 ***
zone2          -1.15       4.22   -0.27   0.7874    
zone3         -11.15       3.40   -3.28   0.0013 ** 
zone4         -23.67       3.55   -6.67  5.2e-10 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1 

Residual standard error: 9.08 on 143 degrees of freedom
Multiple R-squared: 0.513,  Adjusted R-squared: 0.502 
F-statistic: 50.1 on 3 and 143 DF,  p-value: <2e-16 

> lm51 <- lm(Clay5 ~ Clay1)
> summary(lm51)

Call:
lm(formula = Clay5 ~ Clay1)

Residuals:
       Min         1Q     Median         3Q        Max 
 -20.62581   -3.29070    0.00546    3.38746   14.16000 

Coefficients:
            Estimate Std. Error t value Pr(>|t|)    
(Intercept)  18.7585     1.1556    16.2   <2e-16 ***
Clay1         0.8289     0.0344    24.1   <2e-16 ***

> lm5z1 <- lm(Clay5 ~ zone + Clay1)
> summary(lm5z1)

Call:
lm(formula = Clay5 ~ zone + Clay1)

Residuals:
     Min       1Q   Median       3Q      Max 
 -14.086   -2.994    0.380    3.138   13.888 

Coefficients:
            Estimate Std. Error t value Pr(>|t|)    
(Intercept)  19.9244     2.5054    7.95  6.5e-13 ***
zone2         5.6945     2.1060    2.70   0.0077 ** 
zone3         2.2510     2.1831    1.03   0.3043    
zone4        -0.6594     2.5365   -0.26   0.7953    
Clay1         0.7388     0.0454   16.28   <2e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1 

Residual standard error: 5.39 on 142 degrees of freedom
Multiple R-squared: 0.83,  Adjusted R-squared: 0.825 
F-statistic: 173 on 4 and 142 DF,  p-value: <2e-16 

Note the use of the + in the model specification. This specifies an additive model, where there is one regression line (for the continuous predictor) which is displaced vertically according to the mean value of the discrete predictor. This is sometimes called parallel regression. It hypothesizes that the only effect of the discrete predictor is to adjust the mean, but that the relation between the continuous predictor and the predictand is then the same for all classes of the discrete predictor. Below (§7.8) we will investigate the case where we can not assume parallel slopes.
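The "parallel lines" can be read directly off the coefficient vector: the intercept applies to the first (reference) zone, and the zone coefficients shift it up or down. A minimal sketch, assuming lm5z1 as fitted above:

> # per-zone intercepts of the parallel regression: the reference intercept
> # plus the vertical displacement for zones 2..4; one common slope for Clay1
> b <- coefficients(lm5z1)
> b["(Intercept)"] + c(zone1 = 0, b["zone2"], b["zone3"], b["zone4"])
> b["Clay1"]   # the single slope shared by all four zones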
Q66 : How much of the variation in subsoil clay is explained by the zone? By the topsoil clay? By both together? Is the combined model better than the individual models? How much so? Jump to A66 •

Q67 : In the parallel regression model (topsoil clay and zone as predictors), what are the differences in the means between zones? What is the slope of the linear regression, after accounting for the zones? How does this compare with the slope of the linear regression not considering zones? Jump to A67 •

Q68 : Are all predictors in the combined model (topsoil clay and zone as predictors) significant? (Hint: look at the probability of the t-tests.) Jump to A68 •

Diagnostics We examine the residuals to see if any points were especially badly-predicted and if the residuals fit the hypothesis of normality.

Task 47 : Make a stem plot of the residuals. •

> stem(residuals(lm5z1))

[Stem plot of the residuals: roughly symmetric around zero, with a single value near -14 and a heavier positive tail reaching about +13.9.]

Q69 : Are the residuals normally-distributed? Are there any particularly bad values? Jump to A69 •

Clearly there are some points that are less well-modelled.

Task 48 : Display the records for these poorly-modelled points, comparing their subsoil clay to the prediction. •

> res.lo <- which(residuals(lm5z1) < -12)
> res.hi <- which(residuals(lm5z1) > 9)
> obs[res.lo, ]
> predict(lm5z1)[res.lo]
> obs[res.hi, ]
> predict(lm5z1)[res.hi]

[The records are printed with their coordinates, elevation, zone, soil and land-cover classes, and the clay, CEC and OC values of the three layers: observation 145 is the single large negative residual, and the large positive residuals include observations 27, 38, 42 and 119.]

Q70 : What are the predicted and actual subsoil clay contents for the highest and lowest residuals? What is unusual about these observations? Jump to A70 •

7.6 Diagnosing multi-colinearity

Another approach to reducing a regression equation to its most parsimonious form is to examine the relation between the predictor variables and the predictand for multi-collinearity, that is, the degree to which they are themselves linearly related in the multiple regression. In the extreme, clearly if two variables are perfectly related, one can be eliminated, as it can not add information as a predictor. This was discussed to some extent in §7.1 "Multiple correlation", but there it was not clear which of the correlated variables to discard, because the predictand was not included in the analysis.

For this we use the Variance Inflation Factor (VIF), which measures the effect of a set of explanatory variables (predictors) on the variance of the coefficient of another predictor, in the multiple regression equation including all predictors, i.e. how much the variance of an estimated regression coefficient is increased because of collinearity. The square root of the VIF gives the increase in the standard error of the coefficient in the full model, compared with what it would be if the target predictor were uncorrelated with the other predictors. Fox [13] has a good discussion, including a visualization.

In the standard multivariate regression:

$$Y = \sum_{j=0}^{k} \beta_j X_j + \varepsilon, \qquad X_0 = 1 \tag{3}$$

solved by ordinary least-squares, the sampling variance of an estimated regression coefficient $\hat{\beta}_j$ can be expressed as:

$$\widehat{\mathrm{var}}(\hat{\beta}_j) = \frac{s^2}{(n-1)\,s_j^2} \cdot \frac{1}{1 - R_j^2} \tag{4}$$

where:

* $s^2$ is the estimated error variance of the residuals of the multiple regression;
* $s_j^2$ is the sample variance of the target predictor $X_j$;
* $R_j^2$ is the multiple coefficient of determination for the regression of the target predictor $X_j$ on the other predictors.

The left-hand multiplicand applies also in a single-predictor regression: it measures the imprecision of the fit compared to that of the predictor. A larger overall error variance of the regression, $s^2$, will, of course, always lead to a higher variance in the regression coefficient, while a larger number of observations $n$ and a larger variance $s_j^2$ of the target predictor will both lower the variance in the regression coefficient.

The right-hand multiplicand, $1/(1 - R_j^2)$, applies only in multiple regression. This is the VIF: it multiplies the variance of the regression coefficient by a factor that will be larger as the multiple correlation of a target predictor with the other predictors increases. Thus the VIF increases as the target predictor adds less information to the regression.
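Since the VIF is just $1/(1-R_j^2)$, it can be computed directly by regressing one predictor on the others. A minimal sketch for Clay1, assuming the obs data frame as used throughout:

> # VIF of Clay1 by its definition: regress Clay1 on the other predictors,
> # then invert (1 - R^2) of that auxiliary regression
> r2.clay1 <- summary(lm(Clay1 ~ CEC1 + OC1 + Clay2 + CEC2 + OC2,
+   data = obs))$r.squared
> 1/(1 - r2.clay1)   # should agree with the value reported by vif()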
The VIF is computed with the vif function of John Fox's car package [12].

Task 49 : Load the car package and compute the VIF of the six predictors. •

> require(car)
> vif(lm(Clay5 ~ Clay1 + CEC1 + OC1 + Clay2 + CEC2 + OC2, data = obs))
  Clay1    CEC1     OC1   Clay2    CEC2     OC2 
12.8391  4.7712  4.0944 10.3882  3.5831  3.0349 

There is no test of significance or hard-and-fast rule for the VIF; however, many authors consider VIF > 3 a caution and VIF >= 10 a definite indication of multicollinearity. Note that this test does not tell which variables, of the set, each variable with a high VIF is correlated with. It could be with just one or with several taken together.

Q71 : According to the VIF >= 10 criterion, which variables are highly correlated with the others? Jump to A71 •

Task 50 : Re-compute the VIF for the multiple regression without these variables, each taken out separately. •

> vif(lm(Clay5 ~ Clay1 + CEC1 + OC1 + CEC2 + OC2, data = obs))
> vif(lm(Clay5 ~ Clay2 + CEC1 + OC1 + CEC2 + OC2, data = obs))
  Clay2    CEC1     OC1    CEC2     OC2 
 2.0978  4.9094  4.0277  3.5256  2.9037 

Q72 : According to the VIF >= 10 criterion, which variables in these reduced equations are highly correlated with the others? What do you conclude about the set of variables? Jump to A72 •

Since either Clay1 or Clay2 can be taken out of the equation, we compare the models, starting from a reduced model with each one taken out, both as full models and as models reduced by backwards stepwise elimination. First, eliminating Clay2:

> AIC(lm(Clay5 ~ Clay1 + CEC1 + OC1 + CEC2 + OC2, data = obs))
[1] 920.5
> AIC(step(lm(Clay5 ~ Clay1 + CEC1 + OC1 + CEC2 + OC2, data = obs), trace = 0))
[1] 916.16

Second, eliminating Clay1:

> AIC(lm(Clay5 ~ Clay2 + CEC1 + OC1 + CEC2 + OC2, data = obs))
[1] 839.57
> AIC(step(lm(Clay5 ~ Clay2 + CEC1 + OC1 + CEC2 + OC2, data = obs), trace = 0))
[1] 835.2

Q73 : Which of the two variables with high VIF in the full model should be eliminated? Jump to A73 •
Task 51 : Compute a reduced model by backwards stepwise elimination, starting from a full model with this variable eliminated. •

> (lms.2 <- step(lm(Clay5 ~ Clay2 + CEC1 + OC1 + CEC2 + OC2, data = obs)))

[Stepwise trace: step eliminates CEC1 (AIC 428.69), then OC1 (AIC 426.75), then OC2 (AIC 426.03), leaving the final model Clay5 ~ Clay2 + CEC2, with intercept 14.519 and CEC2 coefficient -0.199.]

Q74 : What is the final model? What is its AIC? How do these compare with the model found by stepwise regression, not considering the VIF criterion? Jump to A74 •

Another approach is to compute the stepwise model starting from a full model, and then see the VIF of the variables retained in that model.

Task 52 : Compute the VIF for the full stepwise model. •

The vif function can be applied to a model object; in this case lms, computed above:

> vif(lms)

Q75 : What is the multi-colinearity in this model? Jump to A75 •

This again indicates that the two "clay" variables are highly redundant, and that eliminating one of them results in a more parsimonious model. Which to eliminate is evaluated by computing both reduced models and comparing their AIC.

Task 53 : Compute the AIC of this model, with each of the highly-correlated variables removed. •

We specify the new model with the very useful update function. This takes a model object and adjusts it according to a new formula, where the existing terms are indicated by a period ("."):

> AIC(lms)
[1] 834.96
> AIC(update(lms, . ~ . - Clay1))
[1] 835.2
> AIC(update(lms, . ~ . - Clay2))
[1] 938.48

Q76 : Which of the two "clay" variables should be eliminated? How much does this change the AIC? Jump to A76 •

7.7 Visualising parallel regression*

In parallel regression (additive effects of a continuous and discrete predictor) there is only one regression line, which is displaced up or down for each class of the discrete predictor. Even though there are two predictors, we can visualize this in a 2D plot by showing the displaced lines.

Task 54 : Plot subsoil vs. topsoil clay, with the observations coloured by zone. Add the parallel regression lines from the combined model, in the appropriate colours, and the univariate regression line. •

> plot(Clay5 ~ Clay1, col = as.numeric(zone), pch = 20)
> abline(coefficients(lm5z1)["(Intercept)"], coefficients(lm5z1)["Clay1"])
> for (iz in 2:4) {
+   abline(coefficients(lm5z1)["(Intercept)"] + coefficients(lm5z1)[iz],
+     coefficients(lm5z1)["Clay1"], col = iz)
+ }
> abline(lm51, lty = 2, lwd = 1.5)
> text(70, 30, pos = 2, paste("Slopes: parallel:",
+   round(coefficients(lm5z1)["Clay1"], 3),
+   "; univariate:", round(coefficients(lm51)["Clay1"], 3)))
> text(70, 26, pos = 2, paste("AIC: parallel:", floor(AIC(lm5z1)),
+   "; univariate:", floor(AIC(lm51))))
> text(70, 22, pos = 2, paste("Pr(>F) parallel is not better:",
+   round(anova(lm5z1, lm51)$"Pr(>F)"[2], 4)))
> for (iz in 1:4) {
+   text(65, 50 - (3 * iz), paste("zone", iz), col = iz)
+ }

[Figure: subsoil vs. topsoil clay coloured by zone, with the four parallel regression lines and the dashed univariate line. Slopes: parallel 0.739; univariate 0.829. AIC: parallel 919, univariate 932. Pr(>F) that the parallel model is not better: 0.]

Note the use of the coefficients method to extract the vector of fitted coefficients, which can be indexed by name or position.

Q77 : How well do the four parallel lines appear to fit the corresponding points, i.e. the points from the corresponding zone? Jump to A77 •
7.8 Interactions*

Both topsoil clay and agro-ecological zone can predict subsoil clay to some extent. Combined as an additive model, they do better than each separately. But this leaves the question of whether they are completely independent. In this case, we may ask if the slope of the regression of subsoil on topsoil clay is different in the different zones.

Task 55 : Model the clay content of the 30-50 cm layer from the agro-ecological zone and measured clay in the topsoil (0-10 cm layer) as an additive model with interactions. •

To express an interaction between model terms, we use * instead of + in the model formula:

> lm51.z <- lm(Clay5 ~ Clay1 * zone)
> summary(lm51.z)

Call:
lm(formula = Clay5 ~ Clay1 * zone)

Coefficients:
            Estimate Std. Error t value Pr(>|t|)    
(Intercept)  14.5362     6.4093    2.27    0.025 *  
Clay1         0.8943     0.1265    6.59  8.2e-10 ***
zone2        10.3477     6.9758    1.48    0.140    
zone3        12.2531     6.9285    1.77    0.079 .  
zone4        -1.8272     6.8954   -0.26    0.791    
Clay1:zone2  -0.0955     0.1411   -0.68    0.500    
Clay1:zone3  -0.2703     0.1513   -1.79    0.076 .  
Clay1:zone4   0.1867     0.1414    1.32    0.190    
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1 

Residual standard error: 5.24 on 139 degrees of freedom
Multiple R-squared: 0.842,  Adjusted R-squared: 0.834 
F-statistic: 106 on 7 and 139 DF,  p-value: <2e-16 

Q78 : How much of the variation in subsoil clay is explained by this model? Is it better than the additive model? Jump to A78 •

Q79 : Are all predictors in the combined model (topsoil clay and zone as predictors) significant? For the predictors also present in the additive model (i.e. zone and clay separately, not their interaction), are the same predictors significant, and to the same degree? Jump to A79 •

Of most interest are the interaction terms in the model summary. In this model, these tell us if the relation between topsoil and subsoil clay is the same in all zones. This was the assumption of parallel (additive) regression (§7.5); but if there are interactions, there is not only one slope, but different slopes for each level of the classified predictor (here, zone).

Q80 : Is there evidence that the relation between topsoil and subsoil clay is different in some of the zones? If so, at what significance level (i.e. probability of Type I error)? Jump to A80 •
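The per-zone slopes implied by the interaction model can be assembled from the coefficient vector: the reference slope plus the zone-specific adjustment. A minimal sketch, assuming lm51.z as fitted above:

> # slopes implied by the interaction model: the Clay1 coefficient is the
> # slope for zone 1; the Clay1:zone* terms are adjustments for zones 2..4
> b <- coefficients(lm51.z)
> b["Clay1"] + c(zone1 = 0, b["Clay1:zone2"], b["Clay1:zone3"], b["Clay1:zone4"])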
> text(88, 40, "Slope of regression") > tor @ in 114) € + we In(Clays * Clayt, eudset = (zone + text (65, 40 - (3 # 2), paste zone + 3)), eat = 3) round (coaf#icients(a) (21, mo & locas (GiayS ~ Clayi, subser san = 100) Linea(y = c(win(e.18titted), maxim ifticted)), x * ‘max(e.19x)), ol = 2) e(ain(e180), > a < In(ClayS ~ Clayt) > abline(n, cel = 5, Ivd = 1.5, Ity = 2) > text(65, 25, paste(Yoverali’", round(coefticients(n) (21, +B) eat = 8) > mo, 21, 2) 86 Subsoil vs. topsoil clay, by zone 4 Slope of regression one 1 0.834 zone 2: 0.738 e4 Zone 3 0.588 zone 4: 1.081 ‘overall 0.828 ee a a ee) Topsoil lay % With the lines covering only part of the data, and the obviously different slopes, a black-and-white graph with different point symbols may be a cleaner viswaliza- tion: > plac(clayt, clays, xlim = e(t0, 60), ylin = e(10, 80) +0 ped = de-nunerie(zone), xlab = "Topsoil clay 2, + Ylab = *Subsodt clay 2) > tatte(*Subsoal vs. topsoil clay, by zone") > Legend(t0, 75, pch = 1:4, Ity = 1°4, legend = 1:4) > texe(65, 40, *Slopes:) > for (@ in 114) € + me AniCrayS ~ Chayt, subset ~ (zone == 2)) + texe(6s, 40 (3 #2), pasteCtzone", 2, YM, roundCcoeftietante(a) (2), + 300) + m1 & Leess(Clays ~ Clayt, subset = (zone == 2), + span = 100) + Lines(y = c(win(n.16sicted), maxin.isticted)), x = efain(a.ite), + max(a.18%)), ley =z, 10d = 1.5) 2 > m < In(clays ~ clayt) > text (65, 25, paste(Yoverali:", round(coetficienta(n) [2] +a) 7.9 Analysis Subsoil vs. topsoil clay, by zone i s 2 e4 Slopes: a zone 10.834 zone 2: 0.738 e4 Zone 3: 0.588 Zone 4 1.081 x a ‘overall 0.828 ee a a ee) Topsoil lay % Q81: Do the different regressions appear different? How different are the slopes? Referring back to the combined model summary, can we reject the null hypothesis that these slopes are in fact the same? Jump to ASI « Q82; What are the reasons why an apparent difference that is readil not statistically-significant? Jump to A82 « of covariance* In the paralleLlines model (§7.5) there was only one regression line between the continuous predictor and predictand, which could be moved up and down accord: ing to different class mans; this is an addifwve model. In the interactions mod (§7.8) there was both an overall line and deviations from it according to class, allowing different slopes, as well as differences in class moans. Another way look at this is to abandon the idea of a single regression altogether, and fit soparate line for each class. This is a nested model: the continuous predictor is measured only within each level of the classified predictor. It is specified with the / formula operator: > InSt.2.m <= An(ClayS ~ zone/Clayt) > sunmazy(In52.2.2) An(formula = Clays ~ zone/Clayl) ANCOVA, Residuals: Mon 19 Median 39 Mae 24.068 2,883 0.515 2.989 13.293, Estimate Std. Error © value Pr(>lel) (intercept) 14.9362 6.4093 2.27 0.025 zoned 10.3477 «6.9758 $48 0.140 zones wzzsa1 68s 1.77 zoned “1272 6.8954 -0.28 Zonel:Clay 0.8943 0.1265 «6.58. 8,2e-10 vee zone2:Clayi 0.7388 0.0525 11.83. < 2e-16 ve¥ zone3:Clayi 0.5660 0.0828 6.80 2.8e-10 eee Signit. codes: 0 ‘eee! 0.001 tes! 0.01 '6! 0.08 1 0.1 1 Residual atandard error: 6.24 on 130 degrees of treedon Multiple R-squared: 0.842, ‘Adjusted Resquared: 0.834 Feetatistie! 105 on 7 and 198 DF, prvalue: <2e-16 Note that th each zone, eg. Zonet separately is different cis no ontry for Clay! by itself; r yi for gone 1. The ‘om 0. ther there is a separate slope for is then whether each slope Q83: How much of the variation in subsoil clay is explained by this model? 
How does this compare with the additive (parallel) model (§7.5) and the interactions model (§7.8)? Are all terms significant? Jump to A83 •

Q84 : Compare the slopes for zones 1 and 4 in the nested model with the zone slopes (i.e. combined plus zone-specific) for these zones in the interaction model. Are they the same? Jump to A84 •

> coefficients(lm51.z.n)["zone4:Clay1"] - (coefficients(lm51.z)["Clay1"] +
+   coefficients(lm51.z)["Clay1:zone4"])
 zone4:Clay1 
 -2.2204e-16 

This model is also called the Analysis of Covariance (ANCOVA) when the aim is to detect differences in the classified predictor (here, zone), controlling for the effect of a continuous covariate, here the topsoil clay, when the covariate is considered a "nuisance" parameter, not an object of the study. In this case topsoil clay is not a nuisance parameter, but we can still see if controlling for it changes our perception of the differences between zones for subsoil clay.

Q85 : Are the coefficients and significance levels for subsoil clay contents in the four zones different in the nested and additive models, and also in the model which did not consider the covariate at all? Jump to A85 •

> summary(lm5z)

Call:
lm(formula = Clay5 ~ zone)

Coefficients:
            Estimate Std. Error t value Pr(>|t|)    
(Intercept)    55.00       3.21   17.14  < 2e-16 ***
zone2          -1.15       4.22   -0.27   0.7874    
zone3         -11.15       3.40   -3.28   0.0013 ** 
zone4         -23.67       3.55   -6.67  5.2e-10 ***

Residual standard error: 9.08 on 143 degrees of freedom
Multiple R-squared: 0.513,  Adjusted R-squared: 0.502 
F-statistic: 50.1 on 3 and 143 DF,  p-value: <2e-16 

> summary(lm51.z.n)

Call:
lm(formula = Clay5 ~ zone/Clay1)

Coefficients:
            Estimate Std. Error t value Pr(>|t|)    
(Intercept)  14.5362     6.4093    2.27    0.025 *  
zone2        10.3477     6.9758    1.48    0.140    
zone3        12.2531     6.9285    1.77    0.079 .  
zone4        -1.8272     6.8954   -0.26    0.791    
zone1:Clay1   0.8943     0.1265    6.58  8.2e-10 ***
zone2:Clay1   0.7388     0.0525   11.83  < 2e-16 ***
zone3:Clay1   0.5640     0.0828    6.80  2.8e-10 ***

Residual standard error: 5.24 on 139 degrees of freedom
Multiple R-squared: 0.842,  Adjusted R-squared: 0.834 

> summary(lm51.z)

Call:
lm(formula = Clay5 ~ Clay1 * zone)

Coefficients:
            Estimate Std. Error t value Pr(>|t|)    
(Intercept)  14.5362     6.4093    2.27    0.025 *  
Clay1         0.8943     0.1265    6.59  8.2e-10 ***
zone2        10.3477     6.9758    1.48    0.140    
zone3        12.2531     6.9285    1.77    0.079 .  
zone4        -1.8272     6.8954   -0.26    0.791    
Clay1:zone2  -0.0955     0.1411   -0.68    0.500    
Clay1:zone3  -0.2703     0.1513   -1.79    0.076 .  
Clay1:zone4   0.1867     0.1414    1.32    0.190    

Residual standard error: 5.24 on 139 degrees of freedom
Multiple R-squared: 0.842,  Adjusted R-squared: 0.834 

7.10 Design matrices for combined models*

In §6.3 we examined the design matrix for ANOVA and regression models. It is instructive to see this matrix for the combined models: additive (parallel), interactive, and nested. As in §6.3 we'll look at the matrix for observations 15 through 22, since they come from different zones.

> model.matrix(lm5z1)[15:22, ]
   (Intercept) zone2 zone3 zone4 Clay1
15           1     0     0     0    21
16           1     1     0     0    52
17           1     1     0     0    20
18           1     1     0     0    33
19           1     0     1     0    21
20           1     0     1     0    25
21           1     0     1     0    26
22           1     0     0     1    34

> model.matrix(lm51.z)[15:22, ]
   (Intercept) Clay1 zone2 zone3 zone4 Clay1:zone2 Clay1:zone3 Clay1:zone4
15           1    21     0     0     0           0           0           0
16           1    52     1     0     0          52           0           0
17           1    20     1     0     0          20           0           0
18           1    33     1     0     0          33           0           0
19           1    21     0     1     0           0          21           0
20           1    25     0     1     0           0          25           0
21           1    26     0     1     0           0          26           0
22           1    34     0     0     1           0           0          34

> model.matrix(lm51.z.n)[15:22, ]
   (Intercept) zone2 zone3 zone4 zone1:Clay1 zone2:Clay1 zone3:Clay1 zone4:Clay1
15           1     0     0     0          21           0           0           0
16           1     1     0     0           0          52           0           0
17           1     1     0     0           0          20           0           0
18           1     1     0     0           0          33           0           0
19           1     0     1     0           0           0          21           0
20           1     0     1     0           0           0          25           0
21           1     0     1     0           0           0          26           0
22           1     0     0     1           0           0           0          34

Observation 15 is in zone 1, so it only has an entry for the intercept and topsoil clay. Observation 16 is in zone 2, so it has an entry for the intercept, topsoil clay, zone 2, and (in the second case) the interaction between topsoil clay and its zone. Note that for the parallel regression model lm5z1 there is only one column for the continuous predictor, whereas in the interaction model lm51.z there is a separate column for the continuous predictor in each zone. This is how the model can fit a separate slope for each zone. In the nested model lm51.z.n there is no column for the overall slope nor for slope differences, but rather one slope column per zone, each with only the topsoil clay observations for that zone.
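The design matrix is exactly what lm uses in fitting: post-multiplying it by the fitted coefficient vector reproduces the fitted values. A minimal check, assuming the additive model lm5z1:

> # design matrix times coefficient vector gives the fitted values;
> # the same identity holds for the interaction and nested forms
> X <- model.matrix(lm5z1)
> all.equal(as.numeric(X %*% coefficients(lm5z1)),
+           as.numeric(fitted(lm5z1)))

This should return TRUE (up to numerical precision).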
7.11 Answers

A58 : CEC is positively correlated with both clay and organic matter; however, there is more spread in the CEC-vs-clay relation. The two possible predictors (clay and organic matter) are also positively correlated. Return to Q58 •

A59 : The covariances depend on the measurement scales, whereas the correlations are standardised to the range [-1, 1]. CEC is highly correlated (r = 0.74) with organic carbon and somewhat less so (r = 0.56) with clay content. The two predictors are also moderately correlated (r = 0.60). Return to Q59 •

A60 : These are given by the adjusted R²: 0.3069 using only clay as a predictor (CEC = 4.83 + 0.20·Clay), 0.5489 using only organic carbon as a predictor (CEC = 3.67 + 2.52·OC), and 0.5662 using both together (CEC = 2.72 + 2.16·OC + 0.06·Clay). Return to Q60 •

A61 : The predictive equation is only a little affected: the slope associated with OC decreases from 2.52 to 2.16, while the intercept (associated with no clay or organic carbon) decreases by 0.95. Adding clay increases R² by only 0.5662 - 0.5489 = 0.0173, i.e. 1.7%. This is significant (p = 0.010152) at the α = 0.05 but not the α = 0.01 level. Return to Q61 •

A62 : OC only: 0.549; clay only: 0.307; both: 0.566. The model with both is slightly better than the single-predictor model from OC. Return to Q62 •

A63 : The AIC favours the model with both OC and clay, but this is only slightly better than the single-predictor model from OC. Return to Q63 •

A64 : The combined model has the lowest RSS (necessarily); the difference is only 76.4, i.e. about 4.5% lower. There is a 1% probability that this reduction is due to chance. Return to Q64 •

A65 : The residuals are not normally-distributed: both tails are too long, and there are about six serious under-predictions (observations 73, 60, 68, 140, 77, 124). The two observations with the most negative residuals (over-predictions), i.e. 1 and 10, are the only two with very high clay and OC (> which((Clay1 > 60) & (OC1 > 6.5))). This suggests an interaction at high levels: "the whole is more than the sum of the parts". There seems to be no comparable explanation for the four observations with the most positive residuals (under-predictions). Return to Q65 •

A66 : The model explains 50% (zone); 80% (topsoil clay); 82.5% (both) of the variation in subsoil clay; the combined model is only a bit better than the model using only measured topsoil clay. Return to Q66 •

A67 : The regression lines for zones 2, 3, and 4 are adjusted by +5.69, +2.25, and -0.66, respectively, compared to zone 1. These are the mean differences. The slope is 0.739, which is somewhat flatter than the slope estimated without considering zones, 0.829.
That is, some of the apparently steep slope in the univariate model is accounted for by the differences between zones. In particular zone 2, which has the higher clay values in both layers, has a higher mean, so that once this is accounted for, the regression line is not "pulled" to the higher values. Return to Q67 •

A68 : Topsoil clay is very highly significant (p ≈ 0 that it isn't) and so is the intercept (0% clay and zone 1). Zone 2 is significantly different (p < 0.008 that it isn't) but the others are not. Note that in the one-way ANOVA by zone, zones 3 and 4 are both significantly different from zones 1 and 2, which form a group. Here we see that the inclusion of topsoil clay in the model has completely changed the relation to zone, since much of the zone effect was in fact a clay effect, i.e. zones had different average topsoil clay contents. The two predictors were confounded. Return to Q68 •

A69 : The residuals are more or less normally distributed around 0, except for one very large negative residual (over-prediction) and seven large positive residuals (heavy tail). Return to Q69 •

A70 : At point 145, the prediction is 47% while the actual is 28%; this is a severe over-prediction. This is an unusual observation: topsoil clay is 7% higher than both underlying layers. There are only two observations where topsoil clay exceeds subsoil clay (> which(Clay1 > Clay5)), 145 and 81, and for observation 81 the difference is only 2%.
For the other throe the parallel lines soem OK. Return to Q77 + ‘ATS: The model explaine 83.4% of the variation in subsoil clay; this is slightly better than the additive model (82.5%). Return to Q78 + A79: Additive terms for topsoil clay, the intercept (zone 1 at zero clay) and zone 3 are significant. This differs from the additive model, where zone 2 was the only zone on significantly different from the intercept. Return to Q79 + ‘A80: The most significant interaction is Clayt zone but the probability that rejecting the null hypothesis of no difference in slopes is fairly high, 0.076, s0 we can’t reject the null hypothesis at the conventional 95% confidence level. Return to Q80 « ASI: They certainly appear different, ranging from 0.364 in zone 3 (green points and line) to 1.081 (bine points and line), almost double. Yet the t-tests for the interaction terms are not significant at the 95% confidence level, so these four slopes could all be different just because of sampling error. Return to Q8! + ‘N82: The fundamental problems aze: (1) small sample size in each zone; (2) a spread of points (“cloud” or “noise") within each zone, These two factors make it difficult to ‘establish statistical significance. ‘Return to Q52 + ‘ASS: The nested model explains 63.4% of the variation in subsoil clay; this is slightly botter than the additive model (82.5%) and the same as the interactions model. It is quite unlikely that the mean for zone 4 is different from zone 1 Return (0 Q83 + ‘A84: Yes, they are the same. For zone 1, the interaction model has the default slope (coefficient for Clayt) which is the same as the nested model slope for zone 1 (coofliciont for zonet :Clay1). For zone 4, adding the slope difference in the interaction ‘model (coefficient for Clay! zones) to the default slope (coefficient for Clay1) gives the ‘same value as the nested model slope for zone 4 (coefficient for zone4:Clayt). Return 10 Q84+ ‘AB5 + There is a big difference between the model coefficients and their significance. Without considering the covariate at al, the difference from zone 1 is (zone 4 > zone 3 > zone 2), the latter is not significantly different. In the nested model the differences are (sone 3 > zone 2 > zone 4), the latter coefficient not significant; this 1s because the diflerence between zone I and 4 subsoil clay can be almost entirely explained if one knows the topsoil clay and allows separate regression lines for each zone. In the additive (parallel) model the differences are (zone 2 > zone 3 > zone 4). The parallel regression ne for zone 2is significantly above that for zone 1, the others not significantly different. Return to Q85 + 8 Factor analysis: Sometimes we are interested in the inter-rclations between a set of variables, not just their individual (partial) correlations (87.1). That is, we want to investigate the structure of the multivariate feature space covered by a sct of variables.; this is factor analysis, This can also be used to diagnose multi-collinearity and select representative variables (sce also §7.6) ‘The basic idea is that the vector space made up of the original variables may be projected onto another space, where the new synthetic variables are orthogonal to each other, i.e. completely uncorrelated. ‘These synthetic variables can often 95 be interpreted by the analyst, that is, they represent some composite attribute of the objects of study. 8.1 Principal components analysis ‘The first such technique is Principal components analysis. 
This is a multivariate data reduction technique. It finds a new set of variables, equal in number to the original sot, where these synthetic variables arc uncorrelated (j.c. orthogonal to each other in the space formed by the principal components). In addition, the first synthetic variable represents as much of the common variation of the origi- nal variables as possible, the second variable represents as much of the residual variation as possible, and so forth. [Note: ‘This is a common image-processing technique and ie explained and illus ‘tated in many textbooks on remote sensing eg. 1, 1). In the present example, we investigate the structure of the feature space defined by the three variables (CEC, Clay, and OC) in a single horizon. A summary of the components reveals how much redundaney there is in this space. Task 57 ; Compute the unstandardized principal components of three variables: topsoil clay, CEC, and organic carbon, . ‘To compute the PCs we use the preomp method; this produces an object of class rconp which contains information about the components. The relevant columns, are extracted from the data frame. > pe < preemp(obel, c(MCBCI", *Clayi", "Oct > clase(pe) 2 ie comp" > stripe) $sdev —: num [1:5] 14.282 4.192 0.933 $ rotation: aus (1:3, 1:3] -0.2187 -0.9735 -0.0868 -0.9589 0.2271, = ater(+, “dimanea")=List of $: chr (1:3) “cect” *cayt* “oct” $0: chr [1:3] "Pei" "Pca" "Poa S center: Noned mun [1:3] 11.2 31,27 2.99 = ater(s, *nanes")= ehr [1:3] "CECI" *Clayt* “oct” $scale : logs FALSE $e nue [1:147, 1:3] ~40.3 -39 -31.8 -23.2 16.2 ~ ater(, “dimenes")=List of 2 $2 eh [Lsn47] #1" m2e 8" nae $5 chr (1:3) "Pei" "Poa" “PCa ~acte(e, "elase")= che "preonp' > summary(pe) Taportance of components: Per pea Pea Standard deviation 14.28 4.182 0.9330 Proportion of Variance 0.82 0.079 0.0038 Cumulative Prepertion 0.82 0.995 1.0000 96 Q86: What proportion of the total variance is explained by the first component alone? By the first and second? Jump to A86 + ‘The numbers here are misleading, because the variables are on different scales In these cases it is better to compute the standardised components, using th correlation instead of covariance matrix; this standardises all the variables zero mean and unit standard deviation before computing the components, ‘Task 58; Compute the standardized principal co topsoil clay, CEC, and organic carbon. . ‘This option is specified by setting the scale optional argument to TRUE, > pes & preemp(obsle(10, 7, 15)], scale = 1) > summary (pe.s) Standard deviacs 1.81 0.69 0,508 Proportion of Variance 0.76 0.16 0.085, Conulative Preportaon 0.76 0.82 1.000, Q87: What is the difference between the variance proportions in the standard ized vs. unstandardized principal components? Which gives a better idea of the proportion of variance explained? In what circumstances would you prefer to use unstandardized components? Jump to A87 « Q88 : What proportion of the total standardised variance is explained by the first component alone? By the first and second? Jump to A88 « ‘We can see which original variables are associated with which synthetic variables by examining the loadings, also called the factor rotations. These are the eigen- vectors (in the columns) which multiply the original variables to produce the synthetic variables (principal components) > pe-sfrevation Pet poz Pas cec1 0.58810 0.45705 -0.566386 Clayt -0,54146 ~0.83542 -0,094322 ‘These show the amount that each original (standardised) original variable con- tributes to each synthetic variable. 
Here, the first PC is an almost equal mixture of CEG, Clay, and OC; this can be interpreted as an overall intensity of soil ac- tivity; we've seen that C y and OC aro generally all positively-correlated and this strong relation comes out in the first PC. This represents 76% of the overall variability. The second PC has a large contribution from Clay oppose the two other variables; this component can be interpreted as high CEC without high Clay, ic. high CEC due mostly to OC. This represents 16% of the overall variability. The third PC represents CEC that is higher than expected by the OG content. The interpretation here is more difficult. It could just represent lack of precision in the laboratory (ie. experimental error). Or it could represent a different compostion of the organic matter. This represents 8% of the overall variability. 8.1.1 The synthetic variables* If the retx argument to the preomp method is specified as TRUE (this is the default), R computes the numeric value of each observation for each PC; these are the scores and are the values of the new variables, substituting for the original variables. They are stored in the x field of the preomp object. It's instructive to see what the observations look like in the space spanned by the PCs. (These are also displayed as part of the biplot, sce §8.1.3.) Task 59: Compute the standardized PCs, along with the scores for each obser- vation, Plot these in the space spanned by the first two PCs and highlight the observations that are not well-explained by these, . > pe.s < preemp(obs[e(10, 7, 13)], scale = T, ret > plottpe.s#xC, 1], pe-stxl, 21, pch +0 yap = *Standardised Po 2") > abline(h = 0) > abline(y = 0) > abline(h = 2, col = "red", 1ty = 2) abline(s = 3, col ~ "red", ley = 2) ablineh = =2, col = ted", Tey = 2) abline(e = -3, col = trea", ty = 2) prs < which(abstpe.s8x0, 1)) > 3) | (abs(pe.stxl, 21) >= 2)? 23 10 13 78 a1 106 2 3 10 13 78 a1 106 points(pe.stelpts, 13, pe.sSx{pts, 21, peh = 21, col bg = “elue") vere(pe.stripts, 1], pe-sfelpte, 2), pts, pes ~ 4, col = "rea pevstxlpts, et, 2)) Pes Poo 3 “319647 -0.028187 10 -2.612 0.625002 13. 2.2530 0.685328 7a 5.0854 1 986i 1 5.673. 2.733634 106 -3.9612 0.804364 > ebsipts, e(10, 7, 13)) cect clayt oct 2 126 71 3.20 27 61 6.98 0s 62 78 98 132.2 48 6.00 Te 28.0 83 9.40 $1 28.0 46 10.90 106 22.0 67 4.80 Be e outside of In the displayed graph, we can identify unusual points (towards the plot) with the identity method and then display their values of the original variables. In this example, points furthest from the eentroid are indentified by r distance and plotted in a different colour; the identify method wasn't used because it is interactive Q89: Which are the most unusual observations in the space spanned by the first two PC's? What do these represent, in terms of the original variables? — Jump 10 ASD « 8.1.2. Residuals* ‘The PC scores, along with the loadings (rotations), contain the complete infor- mation of the original observations. ‘the results of ‘Task 60 : Confirm this: reproduce the original observations from the PCA and compare to the original observations, . By default, the preomp fumetion centres each variable on zero (by subtracting the mean), but docs not by defaults seale them (by dividing by the standard deviation). ‘This is done simply to avoid problems with the numerie solution. I's easier to see that the multiplication of the score matrix by the rotation matrix reproduces the original values with the non-centred and non-scaled PCA. 
So, specify argument center to be FALSE. 99 First, we compute the PCs without any centring or scaling, and confirm that the rotations are used to produce the synthetic variables: > pe < preonpCobsl, e(*CECI", "Clay", *OCI")], retx = TRUE, a center = FALSE > summary as.aatrix(ebsl, e(C8CI", "Clayi", *001")]) Ken + pebretation ~ ped) Pct Pez Pea st gu.:0 1st gu.:0 tee Qu.:0 Median :0 Median 10 Median :0 Mean :0 Mean 0 Mean 0 Sed Qu.:0 3rd gu.:0 Sra Qu.:0 Max, 10 Max. 0 Max, 0 ‘The observations multiplied by eigenvectors indeed are the synthetic variables. Now we invert the process, to reproduce the orginal values from the synthetic ‘ones. Since we have the relation: OE (5) where O are the original observations (147 x 3), E are the eigenvectors (3 Xx 3), and X are the values of the synthetic variables (147 x 3), we then must have: oO se" 6) We find the inverse of the rotations matrix with the selve function with only fone argument (the matrix to be inverted); this then post-multiplies the scores matrix: > obs reconstruct <- pee Jal, solve(petrotation) > suapary(obs.recoustruct ~ ebsf, e("CECI", "Clay! cect cays oct Min, 3.88018 Min, -2.MMete Min, 8 st Gu.: 0.00800 at Quis-3.55e-15 tet Qu.: 2 Median eaian : 0,006+00 Median : 4. Moan Mean :-1.636-15 Mean: 5. ara gu 3rd Qu.: 0.00e-00 rd Gu.: 8. ax. Max: Tte-15 Max, 2 ‘The only difference between the reconstructed observations and the originals is, due to limited computational precision; mathematically they are identical If fewer than the maximum PCs are used, they will not exactly reproduce the original observations. By omitting the higher PCs, we are sacrificing some infor- mation for increased parsimony. The question is, how much? ‘Task 61: Compute the orginal value of the observations, using only the first, standardized PC, and then with the first two. Compute the residuals. . Here we use only the first eigenvector, and then the first two. In both cases we have to use only the first scores and the first rows of the inverted rotation matrix. 100 Note the use of the drop argument when selecting only one row or column of a ‘matrix with the [ “extract” method. By default this is TRUE (and invisible); any extra dimensions are dropped, and so selecting only one row or column results in vector, not a matrix, and so can not be used in matrix operations, When FALSE the dimensions aro retained. For completeness, we show the long form of the fall reconstru residuals are defined as (observed - modelled) > din(selve(peSrotation) (1, 1) va > din(solve(peSrotation) (1, , drop = TI) su > din(solve(peSrotatien) (1, , arep iis > obs reconstruc + drop = FT > sumaryCebel, ¢¢ 1 & petal, drop = F] ah solvetpctrotacien) [t, saci", *Clayt", "oct" ct cayt ca det u.: -2.303 Let gu.:-0.956 et Qu.:-0.5964 Mean: 0.545 Mean :-0.195 Mean: 0.1079 Sed Qu.: 2.548 3rd Qu. 0.800 3rd Qu.: 0.7975 Mar. : 12.071 Max, 1 3.721 Max. : 6.2981 > ebs.xeconstruct.2 < pebsL, 1:2] Kat solve(pefrotation) (1:2, > sumary(obs(, ¢(°C8C1 ') *Olay1", "001")] ~ obs. reconstruct.2) chayt et 79864 Min, 1-0.118803 Min, :-9.2482 09584 tat gu.:-0.01445 xt gu. :-0.54a7 Sed qu.: 0.10186 Sed gu.: 0.015344 Sra gu.: 0.5130 Max. : 0.60799 Max. : 0.091628 Max. : 4.2873 > ebs.reconstruct < peel, 1:3) Ke solve pesrotation) [1:5 > summaryCobs[, c(°CECI", *Clayi*, *001")] ~ obs. reconstruct) cect ayt oct Min 7a Man iTtet6 Min, 1-2. ast gu. st Qu. 0.00000 ist Qu.:-8 Median = Neaian : 0.006+00 Median :-4.440-16 Sed Qu.: 0.006400 Sra Qui: 3.SSe-18 Sed Qu.:-2.22e-15, Mar. : 3.88e-18 Max. 2840-14 Max. 
8880-18 Q80 What happens to the accuracy of the reconstruction as the number of components is increased? Jump to 490 « 10 ‘Task 62: Create a matrix of the residuals that result from using only one and two PCs to represent the three variables. . ‘The previous code has done this, but not returned it as a separate object. > resid reconstruct. < obsl, <(*OECI", “Clay!’, "oct")] + peel, 1, drop = F) Za solve(pesrotation)[1, , deep = F) > summary (resid. reconstruct. 1) cect ciayt oct Min, 10.187 Min, 6.808 Min. #2. 9565 tet Gui: -2.308 Ler Qu.:+0.936 1st Qu.:-0.5964 Median : 0.203 Median :-0.129 Median : 0.0123 Mean: 0.545 Mean :-0.195 Maan: 0.1079 Sed Qu.: 2.548 3ré Qu. 0.800 3rd Qu. 0.7975 Mar, 19.071 Max, 1 3.721 Max. + 6.2081 > head(sort (resid. reconstruct 1, *0C1")) [1] -2.9565 -2.2845 -2.1700 -2.0075 ~1.8831 ~1.6532 > head(sort (resid. rt mnstruct.1[, "Oct" + decreasing = 1)) a) 8.2981 4.2815 2.2299 2.241 2.2481 1.9575 > resid.reconstruce.? < obal, <("oBI", “Clayl", "0c!" + peel, 1:2] Hal zolve(pefrotacion) [1:2, ] clayt et et Gu.:-0.09584 tet Gu.:-0.014452 tat Ou Mean :-0.0020 Mean :-0.000204 Mean Sed qu.: 0.10186 Sra gu.: 0.015344 Sra gu Mar. : 0.60758 Max. : 0.093628 Max. > head(zort (resid. reconstruct. 2[, *O6t"I)) (a) -9.2491 -2.2609 -2.0974 -1.9392 -1.4185 -1. 9874 mistruct 26, *Ot"], decreasing = 1)) [1] a.arva 2.7109 2.9981 1.7508 1.7186 1.8820 Task 63: Plot the residuals vs. original values of organie carbon for the one and two-PC cases, using the same scale. . > par(afrov = c(1, 2)) > cund(nax(resid.reconstruct.1(, "OCt*], resid.reconstruct.2[, > ret resid.reconstrct.2[, > plot (resid. recenstruct.t{, "201"] ~ obs{, "OCt"), mein = ‘Residuals, 1 PC reconstruction", +) xlab = "Topsoil organic carbon, 2°, jab = ‘Residual, 1 00" + yim = eGmia, yma) > abline(h = 0, Ity'= 2) > plot (resid. reconstruct 2[, "Oct" ~ ebs{, "0C1"T, main 1 + xlab = "Topsoil oxgante carbon, 2", ylab © *Mesidual, 1 00% + yim = emia, ymax)? > abline(h = 0, Ity'= 2) > par(efrov = C1, 2) Residuals, 1 PC reconstruction Residuals, 2 PC reconstruction gt ey co 6 é 2 4 6 8 Ww 2 4 6 8 Topscl organic carbon, % Topsoil organic carbon Q91: What is the pattern of the reconstruction residuals? Try to explain. (Hint Jook at the loadings, pe$retation.) Are two PCs satisfactory for representing topsoil carbon? Jump to A9l « ‘Task 64 : Repeat the analysis for CEC . > par(afrov = e(1, 2) > ymax < round(nax(reaid.reconstruct.1[, "CECI“J, resid recenstruct.2l, + ec1"})) > ymin < round(win(resid.reconstruct1[, "CECI*), resid reconstruct 2, . 1"})) > plot(resid. reconstruct -1[, "CECI"] ~ obs, *CECI"], main = "Residuals, 1 PC reconstruction +0 stab = “Topsoil GC, cnol* kg-t soil", ylab = "Resadual, % CEC", + yim = e¢ymin, > abiineth = 0, Tey a ab = "Topseit CEC, cme ylab = "Residual, 1 C80", + ylim = e abline(h = 0, Ity'= 2) > par(atrow = (1, 12) Residuals, 1 PC reconstruction. Residuals, 2 PC reconstruction ed oes. ed Bod get go] = fer god : g a 5 Fo cemamamemnse aa09 3, Eg {ORE a4 Q92: What is the pattern of the reconstruction residuals? Are two PCs satis- factory for representing topsoil CEC? Jump to A92 Tt may help to answer this question if you compute the range of residuals in the two cases using the range function: > range(resid reconstruct a) -10.187 19,072 vorci")) > range (resid. reconstruct. 2 [1] -0.78864 0.60759 Standardized residuals can also be computed. 
The relation 0 = SE~ is not valid here, because the E matrix refers to eigenvectors from the standardized variables, However, we can standardize the original variables ourselves and then ‘use these to compute residuals; iv. back-transformation can be used. S becomes standardized Sy and the same ‘Compute the standarized residuals of organic carbon and CEC, for ‘Task 65 the two-PC case. Pl he residuals vs s. original standardized values. ‘We use the scale function to scale the columans of a matrix: + 8901)]) = pevstel, 1:2) fen del > summary (resid. reconstract.2) in. Min. in1.Ste-01 ist Gu tet Qu.:-2 940-02 Median Median | 1.700-03 Mean Mean: 6426-18 cvoscr®, “ctayt™, (pe-sfrotation) [1:2 Min. :-1.606 ist gu.s1.97 Median :-1,326 Mean: 5.326 Sed Qu.: 1.7Te-0L Sra Quis 2.51e-02 Sed gu.: 2.50e-01 Mar. : LAde#00 Max. 2.08e-01 Max, 1 > par(efrov = e(1, 2)) 1, "0ct"] ~ seateobel, + Ylab = "Residual, { 00 standardized”) > abline(h = 0, Ity = 2) > plot (resid. reconstruct 2{, "OECI") ~ scaleCebsl, *CEC1")), +0 sain = *Kesiduals, 2 PC reconstruction", x1ab = “Topsei? CEC, standardized", + ylab = *Residual, CEC standardized") > ablineth = 0, Ity = 2) > par(efroe = (1, 12) Residuals, 2 PC reconstruction Residuals, 2 PC reconstruction 24 ao] 2 ie of ae 3 a3 e+. é 24 ° ° sores 45 so 4 203 “opel organic carbon, tancarzed Topsoil CEC, tandarczes Q93 | How do these reconstructions compare to those using unstandardized PCs? Jump to A93 « 8.1.3 Biplots* ‘The relation between variables and observations in the new space can be visualised with a biplot (14, Lo]. This has two kinds of information on one plot, leading to four interpretations: 1. The plot shows the observations as points, labeled by the observation num- ber (i.e. the row in data frame), in the plane formed by two principal com- ponents (synthetic variables). Any two PC's c are the first two, n be used; most, common ‘The coordinates of the points are shown in the lower (PC1) and left (PC2) nnargins; they are the transformed variables, with the origin (0,0) defined by the mean of the data in this space, and scaled to have similar ranges (this can be changed by an optional parameter). ‘These points are interepreted like any scatterplot: you can find clusters of observations or outliers in this space. Note that the scaling can affect your visualisation, Topaosl organic carbon, standardized’, 2. The plot shows the original variables from which the PC’s were computed (ie. the original feature space) as vectors. They begin at the origin and extend to coordinates shown by the upper (PC1) and right (PC2) margins. ‘These have a different scale (and scaling) than the observations, so they must be interpreted separately. ‘These can be interpreted in three ways: (a) The orientation (dixection) of the vector, with respect to the PC spaco, in particular its angle with the PC axes: the mare parallel o @ PC axis is a vector, the more it contributes only to that PC. The contribution of an original variable to a PC can be estimated from the projection of the vector onto the PC. (b) The length in the space defined by the displayed PCs; the longer the vector, the more variability of this variable is represented by the two displayed PCs; short vectors are thus better represented in other di- mensions (i.e. 
they have a component that is orthogonal to the plane formed by the two displayed PCs, which you can visualize as coming, out of the plane) (o) The angles between vectors of different variables show their correla tion in this space: small angles represent high postive correlation, right angles represent lack of correlation, opposite angles represent high negative correlation, We can also prodtuce versions of the biplot emphasizing only the vectors or only the points. These require some tricks with the arguments to the biplot method. In any event wo exaggerate a bit the text size of the variables with the cex= argument. > par(afrow = (2, 2)) > biploeipe.s, main = “Biplat, Standardized PCe 1 and 2°, 4" pe.bipoe = T, eax = €(0.9, 1-2) > biptotipe-e, main’* “Variables only, Standardized Pox 1 and 2", +" pecbipler = 7, cex = e(0.3, 1.2), xlabe = rapier, + aus(pe- 88%) [1))) > baplot(pe.s, aia = “Observations only, Standardized PCs 1 and 2°, + pecbapler = 7, var.azes =F, cex = e(1, 0.1)) > par(atroy = c(t, 2) 106 ‘The argument pe.biplot=T produces a so-called “principal component biplot”, where the observations are scaled up by 71 and variables scaled down by the same factor. With this scaling, inner products between variables (as shown by the vectors) approximate their correlations and distances between observations (as shown by the points) approximate Mahalanobis distance in PC space. These lead to easier visual interpretation.) First, we can look for groups of points (“clusters”) and unusual points (“outliers”) in the new space. Q94: Do there appear to be any clusters formed by the observations? If so, where? Jump to A94 « Q95 : Which observations are unusual in the space spanned by the first two standardized principal components? Can you explain them from the original observations? Jump to A95« > ebsle(t, 2, 78, 81), €(7, 10, 139) Chayt cect oct 1 Rane 6.5 2 126 3 Te 5229.0 9 146 28.0 30. Second, we can look at the vectors rep esonting the original variables. Q96: Which variables are better explained in this space? (Hint: look at the Iength of the vectors.) Jump to A96 « Q97: Which variables contribute most to PCL? to PC2? (Hint: look at the projection of the vectors onto the axes.) Jump to A97 « Q98 | Which variables are highly-correlated in this space? (Tint: look at the angles between the vectors.) What does this imply for modelling? Jump to A9S © 8.14 Screoplots* A useful graphical representation of the proportion of the total variance explained by each component is the sereeplot. This is named for a “scree slope”, which is the zone at the base of a steep cliff where debris (“scree”) accumulates. We look for the “breaks” in the slope to decide how many PCs can be meaningfully interpreted. Task 66: Repeat this analysis, but with the three continuous variables fron 20 layers, ie. a total of nine variables. Show the proportional variance with a screeplot . > ped < preomplobs{7:15], scale = 1) > summary (pe Inportance of components Pct Pc? PCd Pca Pcs oS pcr Pee Standard deviation 2.38 1.09 1,00 0.694 0.548 0.392 0.30 0.2668 jon of Variance 0.63 0.13 0.11 0.053 0.033 0.017 0.01 0.0078 ve Proportion 0.63 0.76 0.87 0.928 0.961 0.978 0.99 0.9968 > screeplet (ped, main = “Screeplot, 9 principal components") Screeplot, 9 principal components Q99: How much of the variance is explained by the first PC? At what compo- nents does the “scree slope” change substantially? How many PCs out of the 9 computed are meaningful? 
Task 67: Show the three faces of the cube defined by PCs 1, 2 and 3 as three 2-D biplots.

To show combinations of PCs other than the first two, we must use the choices= argument:

> par(mfrow = c(2, 2))
> biplot(pc9, pc.biplot = T, main = "Standardized PCs 1 and 2")
> biplot(pc9, choices = 2:3, pc.biplot = T, main = "Standardized PCs 2 and 3")
> biplot(pc9, choices = c(1, 3), pc.biplot = T, main = "Standardized PCs 1 and 3")
> par(mfrow = c(1, 1))

[Three biplots: PCs 1-2, 2-3 and 1-3 of the nine-variable standardized PCA]

8.2 Factor analysis*

PCA is a data reduction technique, but the resulting synthetic variables may not be interpretable. A related technique is factor analysis in the sense used in the social sciences [34, §11.3]. Here we hypothesize that the set of observed variables is a measurable expression of some (smaller) number of latent variables that can't themselves be measured, but which influence a number of the observed variables. This has an obvious interpretation in psychology, where concepts such as "math ability" or "ability to think abstractly" can't be directly measured; instead these variables are assumed to exist (based on external evidence) and are measured with various clever tests. In the natural sciences the concept of latent variables is not so easy to justify; still, the techniques are useful because they (1) give a method to rotate axes to line up observed with synthetic variables and (2) allow us to determine how many latent variables there might be.

Suppose there are k original variables, to be explained by p < k factors. Factor analysis decomposes the k × k variance-covariance matrix Σ of the original variables into a p × k loadings matrix Λ (the k columns are the original variables, the p rows are the factors) and a k × k diagonal matrix of unexplained variances per original variable (its uniqueness) Ψ, such that:

    Σ = Λᵀ Λ + Ψ

In PCA p = k, there is no Ψ, and all variance is explained by the synthetic variables; there is only one way to do this. In factor analysis the loadings matrix Λ is not unique: it can be multiplied by any orthogonal matrix, known as a rotation. The factor analysis algorithm finds a rotation to satisfy user-specified conditions; one common condition is known as varimax, which is the default in R.

Task 68: Compute a factor analysis assuming three latent variables, over the three continuous variables from all three layers (i.e. nine original variables).

> (fa <- factanal(obs[7:15], 3))

Call:
factanal(x = obs[7:15], factors = 3)

Uniquenesses:
Clay1 Clay2 Clay5  CEC1  CEC2  CEC5   OC1   OC2   OC5
0.087 0.016 0.085 0.180 0.005 0.508 0.094 0.335 0.320

Loadings:
      Factor1 Factor2 Factor3
Clay1  0.838   0.383   0.277
Clay2  0.928   0.200   0.289
Clay5  0.910   0.186   0.227
CEC1   0.144   0.787   0.404
CEC2   0.265   0.283   0.918
CEC5   0.280           0.640
OC1    0.217   0.922   0.096
OC2    0.478   0.183   0.393
OC5    0.653   0.283   0.410

               Factor1 Factor2 Factor3
SS loadings      3.312   2.118   1.985
Proportion Var   0.368   0.235   0.217
Cumulative Var   0.368   0.604   0.821

Test of the hypothesis that 3 factors are sufficient.
The chi square statistic is 94 on 12 degrees of freedom.
The p-value is 7.07e-15

Interpretation: First, the uniqueness of each original variable is the "noise" left over after the factors are fitted. Here CEC5 is far and away the most poorly explained, followed by OC2 and OC5. Second, the loadings are just like PCA: the contribution of each original variable to the synthetic variable. Factor 1 is clearly built up mainly from all three clay contents; Factor 2 is clearly built up from topsoil CEC and OC.
Third, the proportional variances are like PCA: how much of the total variance in the original set is explained by the factor. These are generally lower than for the corresponding PCs.

Fourth, a test of the hypothesis that the number of factors is sufficient to explain the data set. Here we see it is not, so we add another factor:

> (fa <- update(fa, factors = 4))

Call:
factanal(x = obs[7:15], factors = 4)

Uniquenesses:
Clay1 Clay2 Clay5  CEC1  CEC2  CEC5   OC1   OC2   OC5
0.088 0.013 0.088 0.030 0.008 0.482 0.224 0.005 0.242

Loadings:
      Factor1 Factor2 Factor3 Factor4
Clay1  0.820   0.363   0.261   0.285
Clay2  0.914   0.178   0.289   0.188
Clay5  0.894   0.172   0.274   0.185
CEC1   0.125   0.882   0.375   0.137
CEC2   0.228   0.262   0.881   0.315
CEC5   0.282           0.676
OC1    0.225   0.781   0.236
OC2    0.408   0.400   0.201   0.788
OC5    0.611   0.183   0.331   0.482

               Factor1 Factor2 Factor3 Factor4
SS loadings      3.097   1.861   1.737   1.168
Proportion Var   0.344   0.207   0.193   0.130
Cumulative Var   0.344   0.551   0.744   0.874

Test of the hypothesis that 4 factors are sufficient.
The chi square statistic is 20.06 on 6 degrees of freedom.
The p-value is 0.0027

Now the four factors explain the data set. Notice how the uniqueness values have all decreased. The first three factors have changed somewhat.

We can visualise the meaning of the axes:

> par(mfrow = c(2, 2))
> plot(loadings(fa), xlim = c(-0.1, 1.1), ylim = c(-0.1,
+     1.1), type = "n", main = "Loadings 1 and 2, 4-factor model")
> text(loadings(fa), dimnames(loadings(fa))[[1]])
> plot(loadings(fa)[, c(1, 3)], xlim = c(-0.1, 1.1), ylim = c(-0.1,
+     1.1), type = "n", main = "Loadings 1 and 3, 4-factor model")
> text(loadings(fa)[, c(1, 3)], dimnames(loadings(fa))[[1]])
> plot(loadings(fa)[, 2:3], xlim = c(-0.1, 1.1), ylim = c(-0.1,
+     1.1), type = "n", main = "Loadings 2 and 3, 4-factor model")
> text(loadings(fa)[, 2:3], dimnames(loadings(fa))[[1]])
> plot(loadings(fa)[, 3:4], xlim = c(-0.1, 1.1), ylim = c(-0.1,
+     1.1), type = "n", main = "Loadings 3 and 4, 4-factor model")
> text(loadings(fa)[, 3:4], dimnames(loadings(fa))[[1]])
> par(mfrow = c(1, 1))

[Four plots of the loadings: factor pairs 1-2, 1-3, 2-3 and 3-4]

In this example, the clay contents clearly align with factor 1, high CEC and OC in the topsoil with factor 2, subsoil CEC with factor 3, and subsoil OC with factor 4.
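The rotation idea is not limited to factanal: the stats package provides varimax (and promax), which can also be applied to a matrix of PCA loadings. As a sketch (this is our own illustration, not a step from the analysis above), we can rotate the first four loadings of the nine-variable standardized PCA and compare them with the four-factor model:

vm <- varimax(pc9$rotation[, 1:4])  # orthogonal rotation of the PCA loadings
print(loadings(vm), cutoff = 0.3)   # suppress small loadings for readability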
8.3 Answers

A86: 91.7%, 99.6%. So the three-dimensional space is effectively two-dimensional; and even reducing to one dimension only discards about 8% of the total information. Return to Q86 •

A87: The first component explains much less of the overall variance in the standardized PCs, 75.6% vs. 91.7%. This is because, in the non-standardized case, one variable (clay) has much larger numbers (in range 10-70, mean about 30) than the others (CEC in the teens, OC below 5). Further, the units of measurement are different. If the variables had the same units of measurement and were roughly comparable (e.g. clay proportion at different depths, or analyzed by different lab methods), the unstandardized PCA would give better insight into the relative magnitudes of variability. Return to Q87 •

A88: 75.6%, 91.5%. All three dimensions contain significant variability in the new space. Return to Q88 •

A89: Observations 78 and 81 score very low on PC1 and very high on PC2; they have unusually high CEC and OC. Observation 2 scores very low on PC2 and moderately low on PC1; it has an unusually low CEC for its high clay content, because of the low OC. Observations 3, 10, 13 and 106 score quite low on PC1 but are not exceptional for PC2. Return to Q89 •

A90: The accuracy increases with the number of components used in the reconstruction. With all three, the reconstruction is exact (within numerical precision). With two, there are some fairly large errors; e.g. for OC1, as large as 4.22%. Using only one component, this increases to 6.8%. Of course, with all three there is no error. Return to Q90 •

A91: The residuals are somewhat smaller (tighter distribution about the zero-line) with the two-PC reconstruction; however in both cases they are quite related to the original values. High OC are systematically under-predicted and vice-versa. This shows that the first two PCs do not capture all the variability in OC and systematically ignore extremes. The rotations matrix shows that OC is highly related to the third component, while the first two components are related to clay and CEC. The systematic bias is because some high-CEC, high-clay soils have low OC and vice-versa. Return to Q91 •

A92: As with OC, the representation with one PC is poor, and has a systematic bias towards average values. But with two PCs the reconstruction is excellent. The maximum absolute-value residual is only 1.44 cmol⁺ (kg soil)⁻¹, a fairly small error, although it is about 1/4 of the minimum observation. Return to Q92 •

A93: Of course the units of measure are different (here, standardized; in the previous analysis, original units of measure). The pattern of residuals vs. original values is very similar for OC in the two cases. However, for CEC the original values are reproduced very accurately and with no bias in the unstandardized case, whereas the standardized CEC is less well reproduced. Return to Q93 •

A94: There are no large clusters; indeed the data is remarkably well-distributed around (0,0); however the unusual observations (see next question) seem to form two small groups. Return to Q94 •

A95: Observations 81 and 78 have unusually high values of both PCs; observations 1 and 2 are nearer the centre of the range of PC1 but are unusually low for PC2. The first two have unusually high OC and CEC, with low clay; no other observations have this combination. The second two have high clay, moderate OC, but lower-than-expected CEC. Return to Q95 •

A96: All three have about the same length vectors, so are equally-well represented in this space. Return to Q96 •

A97: CEC1 and OC1 are nearly parallel to the axis for PC1; Clay1 contributes about equally to the two axes. All three variables contribute in the same direction. So PC1 represents the general trend of all the soils in this dataset from low CEC, OC and clay towards (generally correlated) high CEC, OC and clay. PC2 represents the variability in clay, and to a lesser extent CEC, not associated with this general trend. Return to Q97 •

A98: CEC and OC are almost identical, implying that OC would be a good single predictor of CEC (as we saw when comparing various regressions, §7.2). Clay is about halfway from correlated to uncorrelated with these two (angle about π/4 with OC). Return to Q98 •

A99: The first PC explains almost two-thirds of the variance. The slope changes dramatically after the first PC and somewhat less after the third PC. Three PCs can be interpreted.
Return to Q99 •

9 Geostatistics

These observations were made at known locations, which allows us to examine them for their spatial structure. First we look at the spatial distribution of the points and the data values, then we see if there is a regional trend and/or a local structure.

Note: This dataset is not ideal for spatial analysis; the sample size is not great and the sample locations are clustered. It is included here to introduce some techniques of spatial analysis.

9.1 Postplots

Task 69: Display a map of the sample locations, coloured by agro-ecological zone, and with the symbol size proportional to the value of subsoil clay at each location (a postplot).

Note on this code: the asp argument, with value 1, to the plot method ensures that the horizontal and vertical axes have the same expansion, as in a map; the grid method draws a grid on the map.

> plot(e, n, asp = 1, cex = Clay5 * 3/max(Clay5), pch = 20,
+     col = as.numeric(zone))
> grid(lty = 1)
> title("Postplot of subsoil clay %, by agro-ecological zone")

Figure 1: Postplot of clay%, 30-50 cm layer, by agro-ecological zone

Q100: Does there appear to be any regional trend in subsoil clay content? How consistent is this? Jump to A100 •

9.2 Trend surfaces

The regional trend can be modelled by a trend surface, using the grid coordinates as independent linear predictors. However, there is a problem with the naive approach using ordinary least squares (OLS), e.g. lm(Clay5 ~ e + n), if the observations are clustered and, even worse, spatially-correlated, as seems to be the case here. If the sample points are clustered in some parts of the map (as in this case), there is a danger of mis-estimating the regression coefficients. In particular, a large number of close-by points with similar values will "pull" a trend surface towards them. Furthermore, the OLS R² may be over-optimistic.

The solution is to use Generalized Least Squares (GLS) to estimate the trend surface. This allows a covariance structure between residuals to be included directly in the least-squares solution of the regression equation. GLS is a special case of Weighted Least Squares (WLS). The GLS estimate of the regression coefficients is [6]:

    β̂_GLS = (Xᵀ C⁻¹ X)⁻¹ Xᵀ C⁻¹ y

where X is the design matrix, C the covariance matrix of the (spatially-correlated) residuals, and y the observations. If there is no spatial dependence among the errors, C reduces to Iσ² and the estimate to OLS:

    β̂_OLS = (Xᵀ X)⁻¹ Xᵀ y

This leads us to a further difficulty: the covariance structure refers to the residuals, but we can't compute these until we fit the trend ... but we need the covariance structure to fit the trend ... which is a classic "chicken or the egg?" problem. In practice, it is usually sufficient to (1) make a first estimate of the trend surface with OLS; (2) compute the residuals; (3) model the covariance structure of the OLS residuals as a function of their separation; (4) use this covariance structure to determine the weights to compute the GLS trend surface. The GLS residuals could again be modelled to see if their covariance structure differs from that estimated from the OLS residuals; in practice, unless the dataset is large, it is not possible to see any such difference.
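Written out with R's matrix operators, the GLS estimate is only a few lines. The following is a sketch of the algebra itself, not of how the spatial package computes it internally (that uses a numerically more stable decomposition); X, C and y stand for the design matrix, residual covariance matrix and observation vector:

gls.beta <- function(X, C, y) {
    Ci <- solve(C)  # C^{-1}
    solve(t(X) %*% Ci %*% X) %*% t(X) %*% Ci %*% y
}
## With C proportional to the identity this reduces to the OLS estimate:
## gls.beta(X, diag(nrow(X)), y) equals solve(t(X) %*% X) %*% t(X) %*% y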
GLS surfaces and spatial correlation structures can both be analyzed in the spatial package; this follows the procedures explained by Ripley [28, 34].

Task 70: Load the spatial package. Use its surf.ls method to compute the OLS trend surface; display its analysis of variance and coefficients.

> require(spatial)
> clay5.ls <- surf.ls(1, e, n, Clay5)
> summary(clay5.ls)
Analysis of Variance Table
Model: surf.ls(np = 1, x = e, y = n, z = Clay5)
            Sum Sq  Df Mean Sq F value Pr(>F)
Regression   12228   2 6114.24  73.718 <2e-16
Deviation    11944 144   82.94
Total        24172 146
Multiple R-Squared: 0.506,  Adjusted R-squared: 0.499
AIC: (df = 4) 652.44
Fitted:
  Min   1Q Median   3Q  Max
 27.4 40.4   44.7 52.7 62.8
Residuals:
    Min      1Q  Median      3Q     Max
-31.601  -9.106  -0.363   3.607  20.487

> clay5.ls$beta
[1] 46.4288 14.3561 -7.0893

A note on trend surface coefficients computed by the spatial package: they do not refer to the original coordinates (e and n) but rather to offsets in e and n from the centre of the trend surface area, defined by the extreme values of the coordinates. This is to make computations more stable. The first coefficient is thus the value at the centre of the area:

> (predict(clay5.ls, diff(range(e))/2 + min(e), diff(range(n))/2 +
+     min(n)))
[1] 46.429
> clay5.ls$beta[1]
[1] 46.429

Q101: How much of the variation in subsoil clay is explained by the OLS trend surface? Jump to A101 •

Q102: What is the equation of the 1st-order OLS trend surface? Jump to A102 •

Task 71: Use the correlogram method to compute the spatial auto-correlation of subsoil clay. Examine this correlation.

The correlogram method automatically computes the correlation for the residuals, once a trend surface is fit, as it was for object clay5.ls, above:

> r <- correlogram(clay5.ls, 50, plotit = F)
> str(r)
List of 3
 $ x  : num [1:48] 0 949 1898 2847 3796 ...
 $ y  : num [1:48] 0.4069 0.1299 0.0646 0.0351 0.1015 ...
 $ cnt: int [1:48] 429 291 299 273 348 199 116 143 127 163 ...
> plot(r, ylim = c(-0.2, 0.6), xlim = c(0, 12000), pch = 20)
> text(r$x, r$y, round(r$y, 2), pos = 3)

Figure 2: Auto-correlogram, clay%, 30-50 cm layer

Q103: What is the autocorrelation at the shortest lag? What distance range between point-pairs is in this bin? How many point-pairs contributed to this estimate? What happens to the auto-correlation as the distance between pairs increases? Jump to A103 •

The observations are indeed spatially-correlated at short ranges; we now model this.

Task 72: Fit an autocorrelation function to the correlogram and use this to fit the GLS trend surface. Display its analysis of variance and coefficients.

This structure is not very strong and is difficult to model, so we use an estimate just to show how the procedure works. An exponential model with an effective range of 1800 m seems to fit. We first re-plot the correlogram, then the fit:

> plot(r, ylim = c(-0.2, 0.6), xlim = c(0, 12000), pch = 20,
+     col = "blue")
> abline(h = 0)
> d <- seq(100, 12000, by = 100)
> lines(d, expcov(d, d = 600, alpha = 0.4), col = "blue")

[Figure: correlogram with the fitted exponential covariance function superimposed]

We now fit a GLS trend surface, using this covariance function to model the spatial autocorrelation:

> clay5.gls <- surf.gls(1, expcov, d = 600, alpha = 0.4,
+     e, n, Clay5)
> summary(clay5.gls)
Analysis of Variance Table
Model: surf.gls(np = 1, covmod = expcov, x = e, y = n, z = Clay5,
    d = 600, alpha = 0.4)
            Sum Sq  Df Mean Sq F value Pr(>F)
Regression   12150   2 6075.20  72.772 <2e-16

> clay5.gls$beta
[1] 46.8215 14.2960 -6.3662

Q104: How do the R² and coefficients compare to the OLS surface? Jump to A104 •
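Note that the distance parameter d = 600 is not itself the range: the exponential model decays as exp(-h/d) and approaches zero only asymptotically, conventionally reaching a negligible level (about 5%) at three times d, the "effective range" of 1800 m quoted above. A quick sketch of this, using only base R:

d <- 600
exp(-(3 * d)/d)            # about 0.05 of the correlation remains at h = 3d
curve(exp(-x/d), 0, 5000,  # shape of the exponential correlation function
    xlab = "separation (m)", ylab = "correlation")
abline(v = 3 * d, lty = 2) # effective range, 1800 m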
Task 73: Plot the OLS and GLS trend surfaces, with the sample points superimposed.

We use the eqscplot method of the MASS package, as well as the contourplot method of the lattice package; both of these must be loaded. MASS was loaded above, so here we just load lattice:

> require(lattice)

> xmin <- min(e)
> xmax <- max(e)
> ymin <- min(n)
> ymax <- max(n)
> res <- 40
> clay5.ts <- trmat(clay5.ls, xmin, xmax, ymin, ymax, res)
> clay5.gts <- trmat(clay5.gls, xmin, xmax, ymin, ymax, res)
> eqscplot(clay5.gts, type = "n",
+     main = "OLS and GLS trend surfaces, subsoil clay %",
+     xlab = "E", ylab = "N")
> contour(clay5.gts, levels = seq(20, 80, 4), add = T)
> contour(clay5.ts, levels = seq(20, 80, 4), add = T,
+     col = "blue")
> grid(lty = 1)
> points(e, n, cex = Clay5 * 2.5/max(Clay5), pch = 23, bg = 3)
> rm(clay5.ts, clay5.gts, xmin, xmax, ymin, ymax, res)

Figure 3: First-order OLS and GLS trend surfaces for clay%, 30-50 cm layer

Task 74: Make a postplot of the residuals, with the symbol coloured according to whether it is positive or negative.

> xmin <- min(e)
> xmax <- max(e)
> ymin <- min(n)
> ymax <- max(n)
> res <- 40
> clay5.gls <- surf.gls(1, expcov, d = 600, alpha = 0.4,
+     e, n, Clay5)
> clay5.gls.resid <- resid(clay5.gls)
> clay5.gts <- trmat(clay5.gls, xmin, xmax, ymin, ymax, res)
> eqscplot(clay5.gts, type = "n",
+     main = "Residuals from GLS 1st-order trend surface, subsoil clay %",
+     sub = "Red: negative; Green: positive", xlab = "E", ylab = "N")
> contour(clay5.gts, levels = seq(20, 80, 4), add = T)
> grid(lty = 1)
> points(e, n, cex = abs(clay5.gls.resid) * 2.5/max(abs(clay5.gls.resid)),
+     pch = 23, bg = ifelse(clay5.gls.resid < 0, 3, 2))
> rm(clay5.gts, clay5.gls.resid, xmin, xmax, ymin, ymax, res)

Figure 4: Residuals from first-order GLS trend surface for clay%, 30-50 cm layer

Q105: Does there appear to be any spatial pattern to the residuals? Jump to A105 •

9.3 Higher-order trend surfaces

Evidently the first-order trend surface (a plane) captured a regional trend, but the fit was not very good (R² = 0.496). This suggests that a higher-order surface may be more satisfactory, both mathematically and in explaining the trend. That is, the trend may not be a plane, but rather a second-order surface such as a dome or saddle. In this case the residuals did not show an obvious pattern to suggest this, but still we will try.

Task 75: Compute and plot a 2nd-order trend surface and summarize its goodness-of-fit.

> clay5.gls2 <- surf.gls(2, expcov, d = 600, alpha = 0.4,
+     e, n, Clay5)
> summary(clay5.gls2)
Analysis of Variance Table
Model: surf.gls(np = 2, covmod = expcov, x = e, y = n, z = Clay5,
    d = 600, alpha = 0.4)
            Sum Sq  Df  Mean Sq F value Pr(>F)
Regression   12832   5 2566.460  21.912 <2e-16

> clay5.gls2$beta
[1] 42.2565 15.9599 -6.8285 -8.7301  5.0456  8.9743

Figure 5: Second-order GLS trend surface for clay%, 30-50 cm layer

Q106: How well does this trend surface fit the observations? Is this an improvement over the 1st-order surface? Jump to A106 •

Q107: Describe the form of the 2nd-order surface. Does its main 1st-order axis match the 1st-order surface? Jump to A107 •
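For comparison, the same full quadratic surface can also be fit by OLS with lm, writing out the polynomial terms explicitly. This is a sketch only, subject to the same clustering caveats as any OLS fit here, and assumes the data frame is still attached so that e, n and Clay5 are visible:

ts2.ols <- lm(Clay5 ~ e + n + I(e^2) + I(n^2) + I(e * n))
summary(ts2.ols)$adj.r.squared  # compare with the GLS goodness-of-fit above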
9.4 Local spatial dependence and Ordinary Kriging

In the previous two sections we've considered a regional ("global") trend in subsoil clay. However, it is evident that there is also local spatial autocorrelation, that is, nearby points tend to be similar. The correlogram of Figure 2 shows this spatial autocorrelation. This is best analyzed with variograms; the results can be used for mapping by "optimal" interpolation: Ordinary Kriging. R has several packages for this; we will use gstat [24, 25, 26], which is part of the spatial data initiative and uses the sp "spatial classes" package.

Task 76: Load the gstat and sp packages with the library method and confirm the search path with the search method.

> require(sp)
> require(gstat)
> search()
 [1] ".GlobalEnv"        "package:gstat"     "package:sp"
 [4] "package:spatial"   "package:lattice"   "package:MASS"
 [7] "package:car"       "package:stats"     "package:graphics"
[10] "package:grDevices" "package:utils"     "package:datasets"
[13] "package:methods"   "Autoloads"         "package:base"

9.4.1 Spatially-explicit objects

Task 77: Review the structure of the obs object.

> str(obs)
'data.frame':   147 obs. of  15 variables:
 $ e    : int  702638 701659 703488 703421 703358 707334 681328 681508 681750 683989 ...
 $ n    : int  326959 326772 322133 322508 322846 324551 311602 311295 311053 311685 ...
 $ elev : int  657 628 840 707 670 780 720 657 600 720 ...
 $ zone : Factor w/ 4 levels "1","2","3","4": 2 2 1 1 2 1 1 2 2 1 ...
 $ wrb1 : Factor w/ 3 levels "a","c","f": 3 3 3 3 3 3 3 3 3 3 ...
 $ LC   : Factor w/ 8 levels "BF",..: 3 3 4 4 4 4 3 4 8 4 ...
 $ Clay1: int  72 71 61 55 47 49 63 59 46 62 ...
 $ Clay2: int  74 75 59 62 56 53 66 66 56 63 ...
 $ Clay5: int  78 80 66 61 53 57 70 72 70 62 ...
 $ CEC1 : num  13.6 12.6 21.7 11.6 14.9 18.2 14.9 14.6 7.9 14.9 ...
 $ CEC2 : num  10.1 8.2 10.2 8.4 9.2 11.6 7.4 7.1 5.7 6.8 ...
 $ CEC5 : num  7.1 7.4 5.6 8.2 8.5 6.2 5.6 7 4.5 6 ...
 $ OC1  : num  5.5 3.2 5.98 3.19 4.4 5.31 4.55 4.5 2.3 7.34 ...
 $ OC2  : num  3.1 1.7 2.4 1.5 1.2 3.2 2.15 1.42 1.36 2.64 ...
 $ OC5  : num  1.5 1 1.3 1.26 0.8 ...

Q108: What is the data type of this object? Which of the fields refer to spatial information? What is their data type? Jump to A108 •

The data types for the e and n fields in the data frame are int, i.e. integers. These are indeed numbers, but of a special kind: they are coordinates in geographic space. It is possible to do some visualization and analysis in R with the plain data frame, but it is more elegant, and gives many more possibilities, if geographic data is explicitly recognized as such. This was the motivation behind the R Spatial Project, which resulted in the sp package [23]; this provides classes (data types and methods for these) for spatial data.

The sp package adds a number of spatial data types, i.e. new object classes; these are then recognized by methods in other packages that are built on top of sp, most notably (for our purposes) the gstat package. To take advantage of the power of an explicit spatial representation, we must convert the data frame to the appropriate sp class.
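Because the sp classes are S4 ("formal") classes, their definitions can be inspected directly, which is a useful way to preview what the conversion will add; a sketch:

getSlots("SpatialPointsDataFrame")  # the slots this class defines
showMethods("coordinates")          # methods attached to the generic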
Task 78: Create a new object of class SpatialPointsDataFrame named obs.sp, from the obs data frame.

We do this by adding the computed coordinates to the data frame with the coordinates method; this automatically converts to the spatial data type defined by the sp package:

> class(obs)
[1] "data.frame"
> obs.sp <- obs
> coordinates(obs.sp) <- ~e + n
> class(obs.sp)
[1] "SpatialPointsDataFrame"
attr(,"package")
[1] "sp"

Note the syntax we haven't seen before: the coordinates method can appear either on the right or the left of the assignment operator, and the formula operators ~ and + are used to name the coordinate fields.

Q109: What is the data type of the obs.sp object? Jump to A109 •

Task 79: View the structure and data summary of the spatial object.

As usual, the structure is displayed by the str method; we then summarize the object with the generic summary method and view the first few records in the data frame with the head method:

> str(obs.sp)
Formal class 'SpatialPointsDataFrame' [package "sp"] with 5 slots
  ..@ data       :'data.frame': 147 obs. of 13 variables:
  .. ..$ elev : int [1:147] 657 628 840 707 670 780 720 657 600 720 ...
  .. ..$ zone : Factor w/ 4 levels "1","2","3","4": 2 2 1 1 2 1 1 2 2 1 ...
  .. ..$ wrb1 : Factor w/ 3 levels "a","c","f": 3 3 3 3 3 3 3 3 3 3 ...
  .. ..$ LC   : Factor w/ 8 levels "BF",..: 3 3 4 4 4 4 3 4 8 4 ...
  .. ..$ Clay1: int [1:147] 72 71 61 55 47 49 63 59 46 62 ...
  .. ..$ Clay2: int [1:147] 74 75 59 62 56 53 66 66 56 63 ...
  .. ..$ Clay5: int [1:147] 78 80 66 61 53 57 70 72 70 62 ...
  .. ..$ CEC1 : num [1:147] 13.6 12.6 21.7 11.6 14.9 18.2 14.9 14.6 7.9 14.9 ...
  .. ..$ CEC2 : num [1:147] 10.1 8.2 10.2 8.4 9.2 11.6 7.4 7.1 5.7 6.8 ...
  .. ..$ CEC5 : num [1:147] 7.1 7.4 5.6 8.2 8.5 6.2 5.6 7 4.5 6 ...
  .. ..$ OC1  : num [1:147] 5.5 3.2 5.98 3.19 4.4 5.31 4.55 4.5 2.3 7.34 ...
  .. ..$ OC2  : num [1:147] 3.1 1.7 2.4 1.5 1.2 3.2 2.15 1.42 1.36 2.64 ...
  .. ..$ OC5  : num [1:147] 1.5 1 1.3 1.26 0.8 ...
  ..@ coords.nrs : int [1:2] 1 2
  ..@ coords     : num [1:147, 1:2] 702638 701659 703488 703421 703358 ...
  ..@ bbox       : num [1:2, 1:2] 659401 310897 703488 342379
  ..@ proj4string:Formal class 'CRS' [package "sp"] with 1 slot
  .. .. ..@ projargs: chr NA

> summary(obs.sp)
> head(obs.sp@data)

The spatial object now has several slots, marked with the @ symbol in the structure listing.

Q110: What spatial information does this summary give that a plain data frame summary would not? Jump to A110 •

Q111: What non-spatial (feature-space) information does the summary give? Jump to A111 •

9.4.2 Analysis of local spatial structure

With this preparation, we can now use the gstat package to analyze local spatial structure.

Task 80: Compute and plot the empirical variogram of subsoil clay.

> v <- variogram(Clay5 ~ 1, obs.sp)
> print(plot(v, pl = T, pch = 20, col = "blue", cex = 1.5))

[Figure: empirical variogram of Clay5; pl = T labels each point with the number of point-pairs]

Q112: Does the empirical variogram give evidence of local spatial dependence? Jump to A112 •

Q113: Describe the shape of the variogram: its nugget, sill and range. What model might be suitable? Jump to A113 •

We estimate a variogram model by eye and then let gstat adjust it to fit the empirical variogram:

> m <- vgm(100, "Exp", 5000, 50)
> (m.f <- fit.variogram(v, m))
  model  psill  range
1   Nug 47.148    0.0
2   Exp 62.348 2626.5
> str(m.f)
Classes 'variogramModel' and 'data.frame': 2 obs. of 9 variables:
 $ model: Factor w/ 19 levels "Nug",..: 1 5
 $ psill: num  47.2 62.3
 $ range: num  0 2626
 $ kappa: num  0 0.5
 $ ang1 : num  0 0
 $ ang2 : num  0 0
 $ ang3 : num  0 0
 $ anis1: num  1 1
 $ anis2: num  1 1
> attr(m.f, "SSErr")
[1] 0.088754

Figure 7: Empirical variogram, with fitted exponential model, for clay%, 30-50 cm layer

Q114: How do the automatically-fitted model parameters compare to the estimates made by eye? Jump to A114 •
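Before using the fitted model for mapping, it can be checked by leave-one-out cross-validation with gstat's krige.cv method; a sketch (the summary statistics will of course depend on the data and model):

cv <- krige.cv(Clay5 ~ 1, obs.sp, model = m.f)  # leave-one-out CV
summary(cv$residual)                            # observed minus predicted
sqrt(mean(cv$residual^2))                       # cross-validation RMSE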
9.4.3 Interpolation by Ordinary Kriging

With a model of local spatial dependence we can interpolate onto a regular grid by Ordinary Kriging. First, how large is the study area?

> diff(range(e))/1000
[1] 44.087
> diff(range(n))/1000
[1] 31.482
> diff(range(e)) * diff(range(n))/10^6
[1] 1387.8

So a 1 × 1 km grid would require about 1388 cells; we'll use double that resolution, i.e. 500 × 500 m. We first use the expand.grid method to make the grid, then coordinates to make it a spatial object, and finally gridded to specify that this is a regular spatial grid, not just a collection of points:

> res <- 500
> g500 <- expand.grid(e = seq(min(e), max(e), by = res),
+     n = seq(min(n), max(n), by = res))
> coordinates(g500) <- ~e + n
> gridded(g500) <- T
> str(g500)
Formal class 'SpatialPixels' [package "sp"] with 5 slots
  ..@ grid       :Formal class 'GridTopology' [package "sp"] with 3 slots
  .. .. ..@ cellcentre.offset: Named num [1:2] 659401 310897
  .. .. ..@ cellsize         : Named num [1:2] 500 500
  .. .. ..@ cells.dim        : Named int [1:2] 89 63
  ..@ grid.index : int [1:5607] 5519 5520 5521 5522 5523 5524 5525 5526 5527 5528 ...
  ..@ coords     : num [1:5607, 1:2] 659401 659901 660401 660901 661401 ...
  ..@ bbox       : num [1:2, 1:2] 659151 310647 703651 342147
  ..@ proj4string:Formal class 'CRS' [package "sp"] with 1 slot
  .. .. ..@ projargs: chr NA
> rm(res)

Now we krige onto this grid, and display the prediction and variance maps:

> k.o <- krige(Clay5 ~ 1, obs.sp, g500, m.f)
[using ordinary kriging]
> str(k.o)
Formal class 'SpatialPixelsDataFrame' [package "sp"] with 7 slots
  ..@ data:'data.frame': 5607 obs. of 2 variables:
  .. ..$ var1.pred: num [1:5607] 46.5 46.3 45.3 45.3 45.3 ...
  .. ..$ var1.var : num [1:5607] 113 113 113 113 113 ...
  ..@ grid       :Formal class 'GridTopology' [package "sp"] with 3 slots
  .. .. ..@ cellcentre.offset: Named num [1:2] 659401 310897
  .. .. ..@ cellsize         : Named num [1:2] 500 500
  .. .. ..@ cells.dim        : Named int [1:2] 89 63
  ..@ grid.index : int [1:5607] 5519 5520 5521 5522 5523 ...
  ..@ coords     : num [1:5607, 1:2] 659401 659901 660401 660901 661401 ...
  ..@ bbox       : num [1:2, 1:2] 659151 310647 703651 342147
  ..@ proj4string:Formal class 'CRS' [package "sp"] with 1 slot

The maps are shown in Figure 8: predictions left and kriging prediction variances right.

> plot.1 <- spplot(k.o, zcol = "var1.pred",
+     main = "OK prediction of Clay %, 30-50 cm",
+     col.regions = bpy.colors(128), pretty = T)
> plot.2 <- spplot(k.o, zcol = "var1.var",
+     main = "OK prediction variance of Clay %, 30-50 cm",
+     col.regions = cm.colors(128), pretty = T)
> print(plot.1, split = c(1, 1, 2, 1), more = T)
> print(plot.2, split = c(2, 1, 2, 1), more = F)

Figure 8: Predictions and variance of the prediction, clay%, 30-50 cm layer

Q115: Does this map seem realistic? Jump to A115 •

Q116: Explain the pattern of prediction variances. Jump to A116 •
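If we are willing to assume approximately normally-distributed prediction errors, the kriging variance can be turned into an approximate 95% prediction interval. This is a sketch; the column names sd, lower and upper are our own:

k.o$sd <- sqrt(k.o$var1.var)                # kriging standard deviation
k.o$lower <- k.o$var1.pred - 1.96 * k.o$sd  # approximate 95% limits
k.o$upper <- k.o$var1.pred + 1.96 * k.o$sd
spplot(k.o, zcol = c("lower", "var1.pred", "upper"),
    col.regions = bpy.colors(64))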
Block OK: Although we predicted on a 500 × 500 m grid, the reported prediction variances refer to a plot of the same size as the original sample, also called its support. Here that is not specified exactly, but we know it is a small plot, about 0.5 ha (≈ 70 × 70 m); this is smaller than the grid size. Note that samples were bulked from the whole field; although each sample is just one auger core, if they are taken from different places in the field and mixed, it is as if the entire soil were mixed and then subsampled. Thus the reported kriging prediction variance is for a 0.5 ha plot on 500 m centres.

If we are satisfied with average values over a larger area, we should use block kriging at that size; this predicts an average, rather than a point, value, and reduces the kriging prediction variance, because all variation smaller than the block is ignored. This is easy to do with the krige method, simply by specifying the block size.

Task 84: Predict by ordinary kriging in 500 m by 500 m blocks, and compute the difference in predicted values and variances.

> k.o.500 <- krige(Clay5 ~ 1, obs.sp, g500, m.f, block = c(500, 500))
[using ordinary kriging]
> str(k.o.500)
Formal class 'SpatialPixelsDataFrame' [package "sp"] with 7 slots
  ..@ data:'data.frame': 5607 obs. of 2 variables:
  .. ..$ var1.pred: num [1:5607] 46.3 46.3 46.3 46.3 46.3 ...
  .. ..$ var1.var : num [1:5607] 60.2 60.2 60.2 60.2 60.2 ...

> summary(k.o$var1.pred - k.o.500$var1.pred)
     Min.   1st Qu.    Median      Mean   3rd Qu.      Max.
-6.13e-01 -1.24e-03  7.17e-05  8.70e-08  1.35e-03  7.60e-01
> summary(k.o.500$var1.var/k.o$var1.var)
  Min. 1st Qu. Median   Mean 3rd Qu.   Max.
 0.103   0.486  0.500  0.488   0.519  0.532

The predictions are almost the same, but the prediction variances are much smaller: from about 1/10 to 1/2 of those for punctual ordinary kriging.
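The effect of the support can be explored further by repeating the prediction over a range of block sizes; a sketch (the sizes are chosen arbitrarily, and debug.level = 0 suppresses gstat's progress messages):

for (b in c(100, 250, 500, 1000)) {
    k.b <- krige(Clay5 ~ 1, obs.sp, g500, m.f,
        block = c(b, b), debug.level = 0)
    cat(b, "m blocks: mean prediction variance",
        round(mean(k.b$var1.var), 1), "\n")
}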
9.5 Answers

A100: See Figure 1. Subsoil clay appears to increase from the NW to the SE. There are local anomalies to this trend, especially in zone 2 (NW corner). Return to Q100 •

A101: Almost half of the variation is explained by the trend surface. Residuals are symmetric, with most within 5% clay, but a few are quite large, both positive and negative. This means that the surface does not account for some local variability. Return to Q101 •

A102: z = 46.4288 + 14.3561·x − 7.0893·y, where x and y are offsets from the centre of the area. Return to Q102 •

A103: See Figure 2. There were 429 point-pairs with a separation from 0 to 949/2 = 475 m. Their autocorrelation was 0.41. At the next bin, the autocorrelation has decreased to near zero, and it stays there, fluctuating irregularly around zero as distance increases. Return to Q103 •

A104: See Figure 3. The R² is slightly lower (more realistic, accounting for spatial correlation); the coefficients changed slightly (OLS: 46.43, 14.36, −7.09; GLS: 46.82, 14.30, −6.37): a bit more N-S trend and a bit less E-W; i.e. the direction of the trend surface rotates slightly, from N107.2°E to N107.7°E. In this case the GLS and OLS trend surfaces are almost the same, which can be appreciated in the plot. Return to Q104 •

A105: See Figure 4. There is no clear pattern; some large positive and negative residuals are close to each other, so the fit can't be improved. There are some small clusters of large residuals with the same sign, but with no obvious trend. Return to Q105 •

A106: The 2nd-order R² = 0.514, which is only a bit higher than the 1st-order R² = 0.496; i.e. 1.8% more variance was explained. Return to Q106 •

A107: See Figure 5. The 1st-order plane, trending to the ESE, is preserved; the 2nd-order structure is domed in the middle of this trend and falls off to the NNE and SSW. Thus the 2nd-order trend improves but does not fundamentally change the 1st-order trend. Note that the linear coefficients of the 2nd-order trend (15.9599, −6.8285) are similar to those of the 1st-order trend (14.2960, −6.3662). Return to Q107 •

A108: The object is a data frame. Fields e and n give the UTM coordinates; field elev gives the third (elevation) coordinate. These three are all integers (i.e. whole numbers). The other fields are measured attributes. Return to Q108 •

A109: The data type is now SpatialPointsDataFrame; this is defined in the sp package. Return to Q109 •

A110: Spatial information includes (1) a bounding box, i.e. the limits of the coordinates; (2) the geographic projection, in this case marked as NA ("not applicable") because we haven't informed sp about it. Return to Q110 •

A111: Non-spatial information is the data frame less the coordinates, i.e. all the feature (attribute) space information: the categorical variables and the measured soil properties. Return to Q111 •

A112: Yes; the evidence is that semivariances at close separation distances between point-pairs are generally lower than those at longer distances. Return to Q112 •

A113: This variogram is erratic and difficult to model, mainly because of the low number of point-pairs at some separations. It does not seem to reach a clear sill, so an exponential model (which approaches a sill asymptotically) may be most appropriate. The nugget is about 50 %²; the sill somewhere near 150 %²; and the effective range about 15 km. Return to Q113 •

A114: The estimated parameters of the exponential model were: partial sill 100, nugget 50, range 5 000 (n.b. this implies an effective range of 15 000 m). The fitting algorithm emphasises close-separation points and large numbers of point-pairs, and thus lowered the partial sill and shortened the range considerably: partial sill 62.3, nugget 47.2, range 2 627 (effective range 7 881 m). Return to Q114 •

A115: The map respects the general clusters of higher or lower values (also seen in the 2nd-order trend surface) but seems to have small patches that are artefacts of the sampling. Away from the sample clusters, the spatial mean is predicted, with no detail. Return to Q115 •

A116: Prediction variances are low near the point clusters; in areas further than the effective range (about 8 km) from any point, the variance is maximal. Return to Q116 •

10 Going further

The techniques introduced in this note do not even begin to exhaust the possibilities offered by the R environment. There are several good R-specific texts which you can consult:

- Dalgaard [7] is a simple introduction to elementary statistics using R for the examples. If you found these notes too advanced, this is a good book to get you started.

- Fox [13] is a thorough treatment of applied regression and linear models, with examples from the social and political sciences. It is rigorous but provides many aids for understanding. It is accompanied by a text with all the techniques illustrated by R code [12].

- Venables and Ripley [34] is the well-known Modern applied statistics with S, which gave rise to the MASS package. This has a wealth of advanced techniques, but also some well worked-out examples of statistical modelling. Some topics covered are linear modelling, multivariate analysis, spatial statistics, and temporal statistics.
I highly recommend this book for serious S users.

And remember: each R package has its own documentation and demonstrations. There are a large number of specialised packages; one may be exactly what you need. Browse through CRAN and the R help and package archives.

Task Views: A useful R resource is the set of Task Views, on-line at http://cran.r-project.org/src/contrib/Views/. These are a summary by a task maintainer of the facilities in R to accomplish certain tasks, with links to all relevant packages. In particular there is a task view for "Multivariate Statistics" at http://cran.r-project.org/src/contrib/Views/Multivariate.html.

The best resource is a good statistics textbook at your level; anything explained in these is either already available in R or can be directly programmed. Above all, experiment and keep thinking!

References

[1] F C Barrett and L F Curtis. Introduction to environmental remote sensing. Stanley Thornes Publishers, Cheltenham, Glos., UK, 4th edition, 1999.

[2] D Birkes and Y Dodge. Alternative methods of regression. Wiley Series in Probability and Mathematical Statistics. John Wiley & Sons, Inc., New York, 1993.

[3] K A Brownlee. Statistical theory and methodology in science and engineering. John Wiley & Sons, New York, 2nd edition, 1965.

[4] M G Bulmer. Principles of statistics. Dover Publications, New York, 1979.

[5] R D Cook and S Weisberg. Residuals and influence in regression. Chapman and Hall, New York, 1982.

[6] N Cressie. Statistics for spatial data. John Wiley & Sons, revised edition, 1993.

[7] P Dalgaard. Introductory statistics with R. Springer Verlag, 2002.

[8] J C Davis. Statistics and data analysis in geology. John Wiley & Sons, New York, 3rd edition, 2002.

[9] J C Davis. Statistics and data analysis in geology. Wiley, New York, 1986.

[10] P Driessen, J Deckers, O Spaargaren, and F Nachtergaele, editors. Lecture notes on the major soils of the world. World Soil Resources Report 94. FAO, Rome, 2001.

[11] FAO. World Reference Base for Soil Resources. World Soil Resources Report 84. FAO; ISRIC, Rome; Wageningen, 1998.

[12] J Fox. An R and S-PLUS companion to applied regression. Sage, Newbury Park, 2002.

[13] J Fox. Applied regression, linear models, and related methods. Sage, Newbury Park, 1997.

[14] J C Gower and D J Hand. Biplots. Monographs on Statistics and Applied Probability 54. Chapman & Hall, London; New York, 1996.

[15] R Ihaka and R Gentleman. R: A language for data analysis and graphics. Journal of Computational and Graphical Statistics, 5(3):299-314, 1996.

[16] P Legendre and L Legendre. Numerical ecology. Elsevier Science, Oxford, 2nd English edition, 1998.

[17] F Leisch. Sweave User's Manual. TU Wien, Vienna (A), 2.1 edition, 2006. URL http://www.ci.tuwien.ac.at/~leisch/Sweave.

[18] F Leisch. Sweave, part I: Mixing R and TeX. R News, 2(3):28-31, December 2002. URL http://CRAN.R-project.org/doc/Rnews/.

[19] Thomas M Lillesand, Ralph W Kiefer, and Jonathan W Chipman. Remote sensing and image interpretation. John Wiley & Sons, Hoboken, NJ, 6th edition, 2007.

[20] D M Mark and M Church. On the misuse of regression in earth science. Mathematical Geology, 9(1):63-77, 1977.

[21] M B McBride. Environmental chemistry of soils. Oxford University Press, New York, 1994.
[22] J M Pauwels, E Van Ranst, M Verloo, and A Mvondo Ze. Manuel de Laboratoire de Pédologie: Méthodes d'Analyses de Sols et de Plantes, Équipement, Gestion de stocks de Verrerie et de Produits chimiques. Publications Agricoles 28. AGCD et Centre Universitaire de Dschang, Bruxelles, 1992.

[23] Edzer J Pebesma and Roger S Bivand. Classes and methods for spatial data in R. R News, 5(2):9-13, 2005.

[24] E J Pebesma. gstat User's Manual. Dept. of Physical Geography, Utrecht University, Utrecht, version 2.3.3 edition, 2001.

[25] E J Pebesma. Multivariable geostatistics in S: the gstat package. Computers & Geosciences, 30(7):683-691, 2004.

[26] E J Pebesma and C G Wesseling. Gstat: a program for geostatistical modelling, prediction and simulation. Computers & Geosciences, 24(1):17-31, 1998.

[27] R Development Core Team. An Introduction to R. The R Foundation for Statistical Computing, Vienna, version 2.6.2 (2008-02-08) edition, 2008.

[28] B D Ripley. Spatial statistics. John Wiley and Sons, New York, 1981.

[29] D G Rossiter. Introduction to the R Project for Statistical Computing for use at ITC. International Institute for Geo-information Science & Earth Observation (ITC), Enschede (NL), 3rd edition, 2007.

[30] G W Snedecor and W G Cochran. Statistical methods. Iowa State University Press, Ames, Iowa, 7th edition, 1980.

[31] D L Sparks. Environmental soil chemistry. Academic Press, San Diego, 1995.

[32] P Sprent. Models in regression and related topics. Methuen's Monographs on Applied Probability and Statistics. Methuen, London, 1969.

[33] R G D Steel, J H Torrie, and D A Dickey. Principles and procedures of statistics: a biometrical approach. McGraw-Hill Series in Probability and Statistics. McGraw-Hill, New York, 3rd edition, 1997.

[34] W N Venables and B D Ripley. Modern applied statistics with S. Springer-Verlag, New York, fourth edition, 2002.

[35] R Webster. Is regression what you really want? Soil Use & Management, 5(2):47-53, 1989.

[36] R Webster. Regression and functional relations. European Journal of Soil Science, 48(3):557-566, 1997.

[37] R Webster and M A Oliver. Statistical methods in soil and land resource survey. Oxford University Press, Oxford, 1990.

[38] D S Wilks. Statistical methods in the atmospheric sciences: an introduction. International Geophysics Series 59. Academic Press, New York, 1995.

[39] M Yemefack. Modelling and monitoring soil and land use dynamics within shifting agricultural mosaic systems. ITC Dissertation 121. ITC Enschede and Utrecht University, Enschede and Utrecht, the Netherlands, 2005.

[40] M Yemefack, D G Rossiter, and R Njomgang. Multi-scale characterization of soil variability within an agricultural landscape mosaic system in southern Cameroon. Geoderma, 125(1-2):117-143, 2005.
Index of R concepts

~ and + formula operators, abline, abs, aov, as.numeric, asp graphics argument, attach, biplot, border graphics argument, boxplot, breaks graphics argument, by, car package, center argument, coefficients, col graphics argument, colors, contourplot (package lattice), coordinates (package sp), cor.test, correlogram (package spatial), cov, data.frame, drop argument, eqscplot (package MASS), expand.grid, file.show, fit.variogram (package gstat), fitted, grid, gridded (package sp), gstat package, hatvalues, head, identify, krige (package gstat), lattice package, length, levels, library, lines, lm, load, loess, lowess, lqs, main graphics argument, MASS package, order, pairwise.t.test, par, plot, plot (package lattice), prcomp, predict, read.table, reshape, resid, retx argument, row.names, rug, scale, scale argument, search, segments, seq, shapiro.test, solve, sort, sp package, span argument, spatial package, stem, step, str, subset argument, summary, surf.ls (package spatial), t.test, text, update, vif, which

A Derivation of the hat matrix

The "hat" matrix is derived from an interesting way to look at the linear model:

    y = Xβ + ε    (A1)

where ε ~ N(0, σ²I) are the (unobservable) identically-distributed errors with (unknown) variance σ².

Recall that a vector of fitted values ŷ is computed from the design matrix X and the vector of fitted coefficients b:

    ŷ = Xb    (A2)

The "hat" notation, e.g. ŷ, is used to indicate a fitted (not observed) value; the observed value has no hat, e.g. y. This can be written separately for each of the n fitted values as:

    ŷᵢ = b₀ + b₁xᵢ₁ + ⋯ + bₖxᵢₖ,  i = 1 … n    (A3)

Equation A3 shows the expansion of the matrix multiplication of equation A2 by rows. There is one bⱼ for each of the predictors, including the intercept b₀. So, if we know the coefficients b, we can predict at any value of the predictors, i.e. at a given row of X. In least squares regression, the coefficients b are solved by least squares estimation from the vector of sample observations y:

    y = Xb    (A4)
    Xᵀy = XᵀXb    (A5)
    (XᵀX)⁻¹Xᵀy = (XᵀX)⁻¹XᵀXb    (A6)
    (XᵀX)⁻¹Xᵀy = Ib    (A7)
    b = (XᵀX)⁻¹Xᵀy    (A8)

and substituting into equation A2:

    ŷ = X(XᵀX)⁻¹Xᵀy    (A9)

so that the "hat" matrix, which is what multiplies the observations to get the fits, can be defined as:

    V = X(XᵀX)⁻¹Xᵀ    (A10)

that is,

    ŷ = Vy    (A11)
    ŷᵢ = vᵢ₁y₁ + vᵢ₂y₂ + ⋯ + vᵢₙyₙ,  i = 1 … n    (A12)

The "hat" matrix is so-named because it "puts the hats on" the fitted values: from observed y to best-fit ŷ.

The V matrix gives the weights by which each original observation is multiplied when fitting. This means that if a high-leverage observation were changed, the fit would change substantially. In terms of vector spaces, the V matrix gives the orthogonal projection of the observation vector y into the column space of the design (model) matrix X.

The overall leverage of each observation yᵢ is given by its hat value, which is the sum of the squared entries of the hat matrix associated with the observation.
This is also the diagonal of the hat matrix, because of the fact that V = VᵀV = V²:

    vᵢᵢ = Σⱼ₌₁ⁿ vᵢⱼ²    (A13)

Details of this and many other interesting aspects of the hat matrix are given by Cook and Weisberg [5].
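The derivation can be checked numerically in a few lines of R, comparing the diagonal of V computed from equation A10 against R's built-in hatvalues; a sketch with simulated data:

set.seed(123)
x <- runif(20)
y <- 2 + 3 * x + rnorm(20)
X <- model.matrix(~x)                  # design matrix with intercept column
V <- X %*% solve(t(X) %*% X) %*% t(X)  # the hat matrix, equation A10
all.equal(as.numeric(diag(V)),
    as.numeric(hatvalues(lm(y ~ x))))  # TRUE: the diagonal is the hat values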
