ComputerLabNotes 2024


Dpto. de Estadística e I.O. y Didáctica de la Matemática

STATISTICS
Material for the computer lessons

February 2024
Contents

1 Descriptive Statistics
1.1 R-Commander
1.1.1 Installation
1.1.2 Structure
1.2 Data
1.2.1 Dataset Steel
1.2.2 Types of statistical variables
1.2.3 Other types of datasets
1.3 Frequencies
1.4 Graphs
1.4.1 Bar chart
1.4.2 Pie chart
1.4.3 Histogram
1.4.4 Box plots
1.5 Measures of central tendency and dispersion
1.6 Generation of new variables
1.6.1 Computing a new variable
1.6.2 Recoding variables
1.7 Filters
1.8 Exercises
1.9 Appendix

2 Distribution models
2.1 Continuous distributions
2.2 Discrete distributions
2.3 Exercises
2.4 Solutions to the exercises

3 One sample tests
3.1 Introduction to hypothesis testing
3.2 Tests for the mean
3.3 Population proportion
3.4 Confidence intervals
3.5 Exercises
3.6 Solutions to the exercises

4 Two sample tests
4.1 Comparison of proportions
4.2 Comparison of variances
4.3 Comparison of averages
4.3.1 Independent samples with normality
4.3.2 Paired samples with normality
4.3.3 Independent samples without normality
4.3.4 Paired samples without normality
4.4 Exercises
4.5 Solutions to the exercises

5 χ²-test of independence and linear correlation
5.1 Independence
5.2 Pearson's correlation test
5.3 Exercises
5.4 Solutions

6 Linear regression
6.1 Step 1: Search for a model
6.2 Step 2: Model estimation
6.3 Step 3: Diagnosis
6.4 Step 4: Prediction
6.5 Exercises
6.6 Solutions

A Summary of the main hypothesis tests

Statistics E.P.I. Gijón


Session 1

Introduction to R-Commander and Descriptive Statistics

1.1 R-Commander
Let us introduce the software we shall use in the computer classes. It is called R-Commander,
and it is a graphical interface to the statistical computing environment R.

1.1.1 Installation
With an internet connection, the installation of R-Commander (Rcmdr) in the usual operating
systems is simple.

Windows operating system


Option 1:
The most common way of installing R-Commander in Windows is detailed in the following
steps:

a) Go to https://cran.r-project.org/ and choose Download R for Windows. In the next
window, select the link that says install R for the first time and, after that, download the
latest version of the program available (unless instructed otherwise). Install R on your
computer.

b) Open R and write


install.packages("Rcmdr")
in the console and press Enter. It will ask you to select an internet repository. Choose
one close to your location (for example, Spain (Madrid)).

c) Load the package by writing in the console


library(Rcmdr)
and press Enter. The first time you run this code the program will ask you whether you
want to install the additional packages. Say yes. This step may take some time; do not
close the program until it is over. At the end, if everything goes well, the R-Commander
window should open.


The next time you need to open the program, it suffices to open R and write library(Rcmdr)
in the console.
Note: If you have some trouble installing the R-Commander package, check your antivirus:
it may be blocking the process.
Option 2:
Alternatively, you may install R-Commander following these steps:

a) From the website http://knuth.uca.es/R/doku.php, open the link Versión X.Y.Z Paquete
R-UCA para windows, similar to what is shown in Figure 1.1 (the version number
of R may change; download the newest version unless instructed otherwise).

Figure 1.1: Website of the project R-UCA.

b) Once you have downloaded the package, execute it to proceed with the installation, which
shall open a window similar to that in Figure 1.2.

c) Once the installation is completed, look for the link Rterm in the list of programs of the
start menu. When you execute it, you should obtain a DOS window that will also start
R-Commander (Figure 1.3).
Note that by default, R uses the language from the Windows version of your computer.
However, it is possible to change the language to English (en), French (fr), Italian (it),
etc.
In order to change the language to English, we should type in the R Console window:

Sys.setenv(LANGUAGE="en")

Press return.

In this manner you can install the latest R version available at R-UCA. This version may differ
from the one installed on the computers at the School of Engineering.


Figure 1.2: Initial window of installation of R-UCA.

Ubuntu operating system


In the menu Aplicaciones (Applications), start the Centro de Software de Ubuntu (Ubuntu
Software Center), select R-Commander and press Instalar (Install). The program will appear
in the menu Aplicaciones, submenu Ciencia (Science).

Macintosh operating system


The steps for the installation of the R-Commander package in macOS are very similar to the
ones explained for the first Option of the Windows operating system, with one main difference:
in order to launch the R-Commander package, the installation of the XQuartz program is
compulsory.
a) Go to http://cran.r-project.org/ and choose Download R for (Mac) OS X.
In the next window we shall download the file with extension .pkg (the newest version,
unless instructed otherwise). Then we proceed to install it, by clicking on the file.
b) Install the XQuartz program. You may download the latest version of this program from
https://www.xquartz.org/.
c) Reboot your computer after installing XQuartz. Most of the problems that occur when
trying to install R-Commander in macOS come from skipping these last two steps.
d) Now we have to download the package Rcmdr, which is the interface we shall use in this
course. For this, we open R and write
install.packages("Rcmdr", dependencies=TRUE)
The installation takes some time, so we should not close the program until it is over.
e) To load R-Commander we must write
library(Rcmdr)
in the console and press Enter.


Figure 1.3: Windows Rterm and R-Commander.

The next time one wants to use this program it suffices to repeat this last step.
Note: A list of problems that may appear during the installation process is collected here,
along with their possible solutions:

• tar: Failed to set default locale.


If the error message tar: Failed to set default locale appears while trying to install
the Rcmdr package (step d), write
system("defaults write org.R-project.R force.LANG en_US.UTF-8")
in the console, and then restart R.

• Problem with the data.table package.


If after trying to load the Rcmdr package a message appears announcing that the data.table
package is missing, write
install.packages("data.table")
in the console. If no error messages were returned, try the library(Rcmdr) command
again.
If the message ERROR: compilation failed for package data.table appears, you
may have to install some other computational tools. A window will open (not from R)
asking whether you want to install the make tools. Say yes, install those tools and then
try to install the package data.table again.

• Nothing is downloaded.


If nothing happens when executing the line install.packages("Rcmdr"), close R, open
it again and choose a different server.

• R-Commander runs slowly.


Sometimes R-Commander runs slowly in macOS. A possible solution is to launch the
program directly from the Terminal of the computer.


In order to do that, open the Terminal (usually located in Applications → Utilities), write
R and press Enter. This will open the R program in the Terminal window. To launch the
Rcmdr package, write library(Rcmdr) and press Enter.

1.1.2 Structure
Once installed, R-Commander can also be launched from the R menu bar:

• Packages → Load package.

• From the drop-down menu, choose Rcmdr.

Figure 1.4: Loading R-Commander.

The R-Commander window has the following parts: menu, active parts (data and models),
instructions, results, and messages (Fig. 1.5).

Figure 1.5: R-Commander.


1.2 Data
1.2.1 Dataset Steel
In order to analyze the energy consumption of a steel company, we have inspected the production
of the company. The inspection consists of recording the most relevant values during several
randomly selected working hours.

Example 1.1. Open the database Steel.RData.

Answer: To open a database, we must go to the menu Data; if we want to use a file in the
R format (.rda or .RData), we must select Load data set.

Data → Load data set

We select the database Steel.RData

Example 1.2. Identify the number of variables and the number of observations in the database
Steel.

Answer: There are several ways to proceed. The easiest is to view the data set.

Data set → We select Steel (if there were several) → View data set


We obtain a window with the available data. By moving the cursor to the right or down we
can scan the whole database.

In all we have 117 observations of the following variables:


consumption Energy consumption of the company (megawatts/hour).
pr.hbt Production of the hot belt train (steel tons).
pr.cc Production of continuous casting (steel tons).
pr.sc Production of the steel converter (steel tons).
pr.galv1 Production of type I galvanized steel (steel tons).
pr.galv2 Production of type II galvanized steel (steel tons).
pr.paint Production of painted panels (steel tons).
line Line of production used (A or B).
shift Shift when data were collected: morning (M), afternoon (A), night (N).
temperature Temperature of the system: High, Medium and Low.
breakdowns Existence of breakdowns (Yes, No).
nbreakdowns Number of breakdowns detected.
system Activation of the overheating detection system: ON, OFF.
TotalProd Total production per hour (steel tons).
NOx Nitrogen oxide emissions per hour (tons/h).
CO Carbon monoxide emissions per hour (tons/h).
COV Volatile organic component emissions per hour (tons/h).
SO2 Sulphur dioxide emissions per hour (tons/h).
CO2 Carbon dioxide emissions per hour (tons/h).
N2O Nitrous oxide emissions per hour (tons/h).


1.2.2 Types of statistical variables

The values taken by a statistical variable are called modalities. If they are numerical quantities,
the variable is called quantitative (for instance the speed, age, time, etc); if they are names
(labels, categories, levels, etc), the variable is called qualitative or a factor.
When working with a statistical variable within R-Commander, it is important to know if
it is a quantitative or a qualitative variable, because some procedures can only be applied to
one of the two types. For instance, we can only make a bar chart with qualitative variables; in
case the variable is quantitative and discrete, we should first turn it into a factor.
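In R terms, the distinction can be checked directly: quantitative variables are stored as numeric vectors and qualitative ones as factors. A minimal sketch with made-up values (not taken from the Steel data):

```r
# Quantitative variable: numeric modalities
speed <- c(10.2, 12.5, 9.8)

# Qualitative variable (factor): modalities are labels
shift <- factor(c("M", "A", "N", "M"))

is.numeric(speed)  # quantitative
is.factor(shift)   # qualitative
levels(shift)      # its categories

# A discrete quantitative variable must be turned into a factor
# before, e.g., drawing a bar chart:
nbreak <- c(0, 0, 2, 1)
nbreak.f <- factor(nbreak)
```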

1.2.3 Other types of datasets

It is also possible to create our own data set directly in R-Commander. In order to do this, we
must go to Data → New data set. As in our previous example with the dataset Steel, in a data
set each column represents a variable and each row represents an element in the sample. We
can enter these values directly by filling in the elements of the grid.

Data → New data set → enter the name of the data set

Once we have done so, we can view the dataset and edit it further using the options
View data set and Edit data set from the menu.

In addition, we may also import data sets with a similar grid structure from other programs,
such as Excel and SPSS, into R-Commander:


Data → Import data → Select the appropriate type

Finally, we can save our changes by following the path Data -> Active data set -> Save
active data set.

1.3 Frequencies
Let us see how to obtain the frequencies of the different values of a statistical variable.
Example 1.3. Determine the frequencies of the statistical variable breakdowns.
Answer: We proceed in the following way:

Statistics → Summaries → Frequency distributions → Select variable breakdowns → Accept

This produces the following output:


counts: breakdowns
No Yes
89 28

percentages: breakdowns
No Yes
76.07 23.93

Hence, we have obtained the absolute and relative frequencies for the different values of the
variable within the sample.
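R-Commander builds these tables with ordinary R commands (the table() and round(100*.Table/sum(.Table), 2) calls it prints in the instructions window). A sketch of the equivalent base-R code, using a toy vector since the Steel data set is not reproduced in these notes:

```r
# Toy vector standing in for Steel$breakdowns (the real data are not
# reproduced in these notes)
breakdowns <- c("No", "No", "Yes", "No", "Yes", "No", "No", "No")

counts <- table(breakdowns)                          # absolute frequencies
percentages <- round(100 * counts / sum(counts), 2)  # relative frequencies (%)

counts
percentages
```

With the actual data set loaded, the same two lines applied to Steel$breakdowns reproduce the output above.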
Example 1.4. Obtain the frequency distribution of the statistical variable nbreakdowns.
Answer: In this case, because it is a quantitative statistical variable, R considers it by default
to be continuous, and it does not provide the absolute and relative frequencies (for a continuous
statistical variable, the absolute frequency of each of its values will usually be 1). To be able to
determine the frequencies, we must create a new qualitative variable with these data.


Data → Manage variables in the active data set... → Convert numeric variables to factors

There are two possibilities. When turning a quantitative variable into a qualitative one, the
most convenient one is to use the same values as categories:

Select variable nbreakdowns → Use numbers → New variable name: numbreakdowns → OK → Accept

On the other hand, take into account that if for New variable name we use the default option
<same as variables>, the variable loses its quantitative character. Hence, if we later want to
obtain descriptive measures of the data (for instance the mean), we must give the new variable
a name different from the original one (in this case, different from nbreakdowns).
Once this is done, we proceed as in the previous case. We obtain the following output:

counts: nbreakdowns
0 1 2 3 4
89 2 9 9 8

percentages: nbreakdowns
0 1 2 3 4
76.07 1.71 7.69 7.69 6.84

1.4 Graphs
1.4.1 Bar chart
Example 1.5. Represent variable breakdowns by means of a bar chart.

Answer: Since this is a qualitative statistical variable, bar charts are an adequate graphical
representation. They can be obtained with the menu Graphs; specifically,


Graphs → Bar graph → Select variable breakdowns → Accept

Note that we can also modify the labels of the two axes by filling in the options under Plot
labels: in this way, we can give a different label to the x-axis, the y-axis, and the graph.
With this procedure we obtain the following bar chart:

[Bar chart of breakdowns; axis labels: avería (breakdown), Frecuencia (frequency).]

Example 1.6. Obtain the bar chart of variable nbreakdowns.

Answer: Recall that R considers all quantitative statistical variables to be continuous and, as
a consequence, does not allow us to make this representation. We must first create a new
qualitative variable with the same values, as we did in Example 1.4.
Once this is done, we obtain the bar chart in the same way as in Example 1.5:

Graphs → Bar graph

and we obtain a graph similar to this one:


[Bar chart of the number of breakdowns; axis labels: Número de averías (number of breakdowns), Frecuencia (frequency).]

1.4.2 Pie chart


Example 1.7. Represent variable breakdowns by means of a pie chart.

Answer: Pie charts are one of the options of the menu Graphs; specifically,
Graphs → Pie chart → Select variable breakdowns → Accept

1.4.3 Histogram
Example 1.8. Obtain the histogram of variable consumption.

Answer: To obtain a histogram, we must do the following:

Graphs → Histogram...


Select variable consumption → Accept

This produces the following output:

[Histogram of consumption (acero$consumo): Frequency on the vertical axis, consumption from 0 to 300 on the horizontal axis.]

We must make a couple of observations:

• By default, the number of bars of the histogram is determined by R. We may specify a
fixed number in the option Number of bins; however, R does not necessarily follow this
request (it determines a number of bins that makes the borders of the intervals round
values).

• By default, we obtain the histogram of (absolute) frequencies. We may instead obtain
the histogram of percentages or densities, by modifying the option in Axis scaling.
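The bin-adjustment behaviour described above can also be seen in plain R, where hist() treats the requested number of bins as a suggestion. A sketch with simulated consumption values (the real data are not reproduced here):

```r
set.seed(1)
consumption <- runif(117, min = 17.5, max = 290.7)  # stand-in for Steel$consumption

# Ask for 10 bins; R may use a different number so that the
# interval borders are round ("pretty") values
h <- hist(consumption, breaks = 10, plot = FALSE)

length(h$breaks) - 1  # number of bins actually used
h$breaks              # the rounded interval borders
```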

1.4.4 Box plots


Box plots are useful for representing quantitative statistical variables, and specifically to:

• detect outliers;

• compare the distribution of the same variable in different samples.

Example 1.9. Obtain the box-plot of variable consumption.

Solution: The steps to follow are:


Graphs → Box plot... → Select variable consumption → Accept

The output is:


[Box plot of consumption, with the vertical axis ranging from 50 to 300.]

From this diagram we observe, for instance, that there are no outliers for the variable
consumption in this sample.

Example 1.10. Obtain the box-plots of variable consumption for each level of temperature.

Solution: The steps to follow are:


Graphs → Box plot...
→ Select variable consumption
→ Press Plot by groups...
→ Select temperature as group variable
→ Accept (the button changes its name to Plot by: temperature)
→ Accept

The output is:



[Box plots of consumption for each temperature level (High, Low, Medium). Several outliers are labelled with their observation numbers: 54, 60, 71, 79, 84, 86, 88 and 106.]

From this plot we can clearly see that the consumption decreases with extreme temperatures.
There are also some outliers. To detect them we can click on the corresponding point, if before
making the diagram we have activated the option Identify outliers with mouse.
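The grouped box plot corresponds to R's formula interface boxplot(y ~ group). A minimal sketch with made-up numbers (the Steel data are not reproduced here):

```r
# Toy data standing in for Steel
consumption <- c(120, 90, 200, 210, 150, 140)
temperature <- factor(c("Low", "Low", "High", "High", "Medium", "Medium"))

# One box per temperature level, as with Plot by groups...
boxplot(consumption ~ temperature, xlab = "temperature", ylab = "consumption")

# The medians behind each box
tapply(consumption, temperature, median)
```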

1.5 Measures of central tendency and dispersion


As examples of quantitative statistical variables we may consider in our database the variables
nbreakdowns and consumption. To describe these variables, we are interested in knowing their
means, standard deviations and some of their percentiles (usually the quartiles).
Example 1.11. Determine the mean, standard deviation and quartiles of the variable nbreakdowns.
Answer: These values can be obtained in the following way:

Statistics → Summaries → Numerical summaries → Select variable nbreakdowns → Accept


The output of this procedure is:

mean sd IQR 0% 25% 50% 75% 100% n
0.6752137 1.292078 0 0 0 0 0 4 117

This output indicates that the mean is 0.675 breakdowns/hour, with a standard deviation
of 1.29. The number of breakdowns varies from 0 to 4, and at least 75% of the observations do
not present breakdowns. In all we have 117 observations.
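The same summary can be computed in base R with mean(), sd() and quantile(). A sketch on a toy sample (the 117 real observations are not reproduced in these notes, so the numbers differ from the output above):

```r
# Toy sample standing in for Steel$nbreakdowns
nbreakdowns <- c(0, 0, 0, 2, 4, 0, 1, 0)

summary_vec <- c(mean = mean(nbreakdowns),
                 sd   = sd(nbreakdowns),
                 quantile(nbreakdowns, probs = c(0, 0.25, 0.5, 0.75, 1)),
                 n    = length(nbreakdowns))
summary_vec
```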

Example 1.12. Calculate the main descriptive measures of variable consumption.

Answer: These values can be determined using the following procedure:

Statistics → Summaries → Numerical summaries

which gives the following output:

mean sd IQR 0% 25% 50% 75% 100% n
135.6771 56.90756 83.39 17.5 99.09 135.1 182.48 290.72 117

With this information we conclude that the mean consumption is around 135.68 MWh, with
a standard deviation of 56.91 MWh. The minimum consumption is 17.5 MWh and the
maximum is 290.72 MWh. In 25% of the data the consumption is no greater than 99.09 MWh;
in 50%, no greater than 135.1 MWh; and 25% of the data consume more than 182.48 MWh.

1.6 Generation of new variables


A quantitative variable can be used to generate new variables, both quantitative and qualitative.
Let us show how this can be done:

1.6.1 Computing a new variable


The variables in the data set can be manipulated by means of the option

Data → Manage variables in active data set → Compute new variable...

The notation for the elementary operations within the cell Expression to compute is the
usual one: +, -, *, / and ^.
For instance, to generate a variable cost from consumption when the relationship between
them is cost = 2.34 · consumption, we would use the expression 2.34*consumption.


The name of the variable we are transforming can be either typed or transferred to the cell by
double-clicking on its name in the list of Current variables.
It is important to note that if we need to introduce decimals in Expression to compute, as
in the example above, we must use a point instead of a comma.
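Behind the dialog, R-Commander simply assigns the computed expression to a new column of the active data frame. A sketch with a minimal stand-in data frame (the real Steel data are not reproduced here):

```r
# Minimal stand-in for the Steel data set
Steel <- data.frame(consumption = c(100.5, 200, 150.25))

# Equivalent of typing 2.34*consumption in "Expression to compute"
# with cost as the new variable name
Steel$cost <- 2.34 * Steel$consumption
Steel$cost
```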

1.6.2 Recoding variables


The option

Data → Manage variables in active data set → Recode variables...

allows us to create a new variable. This is often useful when we want to create a discrete
variable from a continuous one.
The expressions we may type inside the cell Enter recode directives are:
• a single value: "High"=1
• several values separated by commas: 7,8,9="high"
• a range of values separated by a colon: 7:9="high". The special values lo (lowest) and
hi (highest) are admitted for unbounded ranges.
• the command else, which applies when none of the previous ones is applicable (including
if the cell is blank).
For instance, if we want to recode variable consumption into variable Groupconsumption
as:

Groupconsumption = Low if consumption ≤ 100; Medium if 100 < consumption ≤ 200; High if consumption > 200

we should type:
lo:100="Low"
100:200="Medium"
200:hi="High"

We can also recode several variables simultaneously, simply by selecting all of them together.
One interesting instance is the dichotomization of a numerical variable. Let us give an
example:
Example 1.13. Define a new variable called Production that takes the value Failure when the
total production is of at most 10000 tons and Success when it is greater.
Answer: We should simply follow the steps of the previous example, but typing now
lo:10000="Failure"
10000:hi="Success"

in Enter recode directives, and Production in New variable name or prefix for multiple
recodes inside the Recode variables window.
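In plain R, this kind of recoding of a numeric variable into ordered classes can be reproduced with cut(); the directives lo:100, 100:200 and 200:hi correspond to right-closed intervals (a value listed in two directives, such as 100, is caught by the first one). A sketch with made-up values:

```r
consumption <- c(80, 100, 150, 200, 250)

# Right-closed intervals: (-Inf,100], (100,200], (200,Inf)
Groupconsumption <- cut(consumption,
                        breaks = c(-Inf, 100, 200, Inf),
                        labels = c("Low", "Medium", "High"))
as.character(Groupconsumption)  # "Low" "Low" "Medium" "Medium" "High"
```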


1.7 Filters
Similarly to the case of box-plots, we can obtain descriptive measures of a quantitative statistical
variable for each value of a qualitative statistical variable. However, in some cases we only want
to analyze part of the data: those satisfying some condition. Let us see how to make this filtering
of the data.
Example 1.14. Determine the frequency distribution of the variable breakdowns for those cases
where the temperature is high.
Answer: To filter the data with the given condition, we must follow:
Data → Active data set → Subset active data set...
We see a new window called Subset data set. In its upper part we can select a few columns
(variables); usually we shall consider the default option: Include all variables.
Subset expression. In our case, we consider the following:
temperature=="High"
Name of the new data set. The default option is <same as the active data set>. It is
advisable to change it, because otherwise the new data set replaces the one we had. In our
case, we give the new name Steel.temp_high. Note that in the new name we can use letters,
numbers, dots and underscores, but not spaces or hyphens.
After making sure that the active data set is
Data set: Steel.temp_high
we proceed as in Example 1.3 to obtain
counts: breakdowns
No Yes
38 8

percentages: breakdowns
No Yes
82.61 17.39

Observe that the logical condition for equality is == instead of =. The following table
shows how to write the different logical conditions in R-Commander:

equal (=): ==
not equal (≠): !=
smaller or equal (≤): <=
greater or equal (≥): >=
conjunction (and): &
disjunction (or): |
We may also use parentheses to group conditions.
On the other hand, when using


• a numerical value, use a point (not a comma);

• a text, put it between quotation marks (") or apostrophes (').
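The Subset expression cell accepts any R logical expression; the whole operation corresponds to the base-R subset() function. A sketch with a minimal stand-in data frame (the real Steel data are not reproduced here):

```r
# Minimal stand-in for the Steel data set
Steel <- data.frame(
  consumption = c(120, 250, 90, 180),
  temperature = c("High", "High", "Low", "Medium")
)

# Equivalent of the Subset expression temperature=="High"
Steel.temp_high <- subset(Steel, temperature == "High")
nrow(Steel.temp_high)  # 2
```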

Note: If after making a filter we want to work again with our initial data set, we shall
change the active set in the menu Data set.


1.8 Exercises
Exercise 1.1. Represent variable nbreakdowns in a pie chart. Is this an adequate graphical
representation?

Answer: We must convert the variable into a factor first. In this case a bar chart would be
more adequate, because the values represent quantities, and the relationships between them
would not be represented in a pie chart.

Exercise 1.2. Which graphical representation is the most adequate for the hot belt train? Make
this representation using percentages in the Y axis.

Answer: Histogram.

Exercise 1.3. How many of the 117 observations correspond to a high temperature?

Answer: We calculate the frequency table of the temperature and obtain the value 46.

Exercise 1.4. Observe the distribution of the variable production of continuous casting. How
much is the mean production? How much is the median?

Solution: We make a histogram to observe the distribution of this variable graphically, and
calculate the main descriptive measures of pr.cc by selecting the menu option Numerical
summaries. The output is
mean sd IQR 0% 25% 50% 75% 100% n
433.9316 276.8536 406 33 201 380 607 1204 117

Exercise 1.5. What is the percentage of hours with a high temperature?

Solution: We calculate the frequency distribution of the temperature:

High Medium Low
39.31624 28.20513 32.47863

In 39.32% of the hours the temperature is high.

Exercise 1.6. Which production type has the highest production?

Solution: We calculate the descriptive measures of the different productions:

mean sd IQR 0% 25% 50% 75% 100% n
pr.sc 244.9231 167.5311 234 13 99 225 333 677 117
pr.cc 433.9316 276.8536 406 33 201 380 607 1204 117
pr.galv1 440.4530 235.8312 392 19 245 432 637 982 117
pr.galv2 1173.2222 511.7398 654 13 902 1333 1556 1963 117
pr.paint 349.6923 245.1241 423 20 135 270 558 908 117
pr.hbt 6916.6667 3017.5123 4451 22 4882 8062 9333 10955 117


The hot belt train has the highest production (10 955 tons).

Exercise 1.7. Which graph is more appropriate to detect outliers in the emissions of SO2?
Represent it. Which outliers do you observe?

Solution: Box plots. To identify the outliers with the mouse, we have to click on the corresponding points in the box-plot.
There are several outliers: clearly observations 68 and 85 are outliers, but the others cannot
be seen clearly. By looking at the output window we observe that the outliers correspond to
data 85, 87, 106 and 68. Their respective emissions are 0.014, 0.002, 0.001 and 0.127.

Exercise 1.8. a) How many data correspond to a medium temperature? And to a high temperature?

b) What percentage were not taken at a high temperature?

c) How many of the 117 data were taken at a medium temperature?

d) Which is the most frequent temperature in the sample? And the least frequent?

e) Can we represent the temperature data in a bar chart? And in a pie chart? And in a
histogram?

f) Make a pie chart for the variable temperature, with the title “System temperature”.

Solution. In order to answer the first four questions we obtain the frequency distribution of the
temperature:

> .Table # counts for temperature
High Low Medium
46 38 33

> round(100*.Table/sum(.Table), 2) # percentages for temperature
High Low Medium
39.32 32.48 28.21

a) 33. 46.

b) 60.68%

c) 33.

d) The most frequent temperature is High and the least frequent is Medium.

e) Yes, because it is qualitative. For the same reason, we can also use a pie chart. We should
not use a histogram, because the variable is not quantitative.

f) It suffices to write the label ‘System temperature’ in the cell we see when selecting the
option Graph -> Pie chart.


Exercise 1.9. Make a graphical representation of the data about the energy consumption of the
company, so that the Y axis shows percentages, with the labels "Consumption" and "%".
Using this graph, answer the following questions:

a) If this study has been made to determine whether the company is fulfilling the goal of not
exceeding a consumption of 400 megawatts/hour, do these data support the hypothesis?
What if the goal was a consumption below 200 megawatts/hour?

b) According to these data, can we assume that approximately 40% of the time the consumption is greater than 250 megawatts/hour?

c) Assuming these data come from a stable process, and that they represent the usual
behaviour of the company, around which values does the energy consumption of the
company usually lie?

Solution. To make this graph we must write the following instructions:

with(Steel, Hist(consumption, scale="percent", breaks="Sturges",
col="darkgray", xlab="Consumption", ylab="%"))

a) Yes. No.

b) No.

c) Between 100 and 200.

Exercise 1.10. Make a bar chart of the variable consumption and comment its properties.

Solution. The program does not allow us to make this graph directly, because consumption is a
quantitative variable. We can turn it into a factor (Data → Manage variables in the active
dataset → Convert numeric variables to factors) and then make the corresponding bar chart.
The resulting graph is not useful: since the variable is continuous, most of the absolute
frequencies are equal to 1. A bar chart is not an adequate graph for this variable.

Exercise 1.11. The experiment design suggested to have approximately the same number of data
with the overheating detection system on and with the system off. Does the sample correspond
to this design? How many data do we have with the system off? What percentage of the sample
size does this represent?

Solution. Yes: we have roughly the same amount of data with the system on and off (50.43%
off and 49.57% on). With the system off we have 59 observations out of 117, which is 50.43%
of the data.

Exercise 1.12. Make a graphical representation of the energy consumption of the company with
two box plots, one for the data with the overheating detection system on and another one for
when it is off. Analyze this graph. What is the mean consumption in each of these two cases?

Solution. Box-plot: it seems that there is less consumption when the system is on. The mean
consumption in this case is 124.24 and when the system is off, 146.92.
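A sketch of the underlying R commands, assuming the on/off factor in Steel is called system (the actual column name in the dataset may differ):

```r
# One box plot of consumption per level of the detection system factor
boxplot(consumption ~ system, data = Steel,
        xlab = "Detection system", ylab = "Consumption")

# Mean consumption within each group
with(Steel, tapply(consumption, system, mean))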


Exercise 1.13. If the production of a new product X can be obtained as the difference between
the total production and the sum of the six productions given (hot belt train, continuous casting,
steel converter, type I and type II galvanized steel and painted panels), how much is the average
production of X?

Solution. First of all we must compute the new variable prodX and then we must compute its
average. We obtain that the mean production of X is 2282.761 tons.

Exercise 1.14. We define the variable Grouprod as:

    Grouprod = Moderate   if TotalProd ≤ 8000
               Adequate   otherwise

What is the percentage of times when the production has been moderate?

Solution. 17.09%
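The recoding can also be done in the script window; a sketch assuming the total production column of Steel is called TotalProd (an illustrative name, to be replaced by the actual one):

```r
# Recode the total production into the two levels of Grouprod
Steel$Grouprod <- factor(ifelse(Steel$TotalProd <= 8000,
                                "Moderate", "Adequate"))

# Percentage of each level
round(100 * table(Steel$Grouprod) / nrow(Steel), 2)
```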

Exercise 1.15. To analyze the behaviour of the overheating detection system, we consider only
the data when this system is on. For those data, determine:

a) The mean number of breakdowns.

b) The number of breakdowns that has half of the observations below it and half of them
above it.

c) The most frequent number of breakdowns.

d) The percentage of data coming from line A.

e) Make a graphical representation of the line of production used and analyze it.

f) Make a graphical representation of the number of breakdowns and analyze it.

g) Make a graphical representation of the production of painted panels and analyze it.

Solution. a) Mean= 0.637931

b) Median=0

c) Mode=0 (79.31% of the data; 46 times)

d) 51.72%

e) We may use the bar chart or the pie chart (it is a qualitative variable). We observe
approximately the same number of data in each of the lines.

f) We may use the bar chart or the pie chart (it is a discrete quantitative variable). We have
0 breakdowns almost all of the time, and 1, 2, 3 and 4 breakdowns in similar proportions.

g) We may use a histogram or a box plot (it is a continuous quantitative variable). We
observe a large number of data with little production, and a fairly uniform distribution
between 200 and 900.


Exercise 1.16. From the sample, can we assure that in less than 25% of the data the consumption
is greater than 150?

Solution. No, because the 75th percentile is 182: since 25% of the data exceed 182, the percentage
of data greater than 150 is at least 25%.

Exercise 1.17. Calculate the mean, median, mode, range, standard deviation and variance of
the following variables whenever it is possible:

a) Existence of breakdowns.

b) Number of breakdowns.

c) Production of the steel converter.

Solution. a) Since it is a categorical variable, it only makes sense to calculate the mode,
which is No.

b) Since it is a discrete quantitative variable, we can calculate all of them: x̄ = 0.675,
Me = 0, Mo = 0, R = 4, s = 1.29 and s² = 1.29².

c) Since it is a continuous quantitative variable, it does not make much sense to calculate
the mode. The other measures are: x̄ = 244.92, Me = 225, R = 664, s = 167.53 and
s² = 167.53².


1.9 Appendix
Bar graph options
Bar graph with percentages instead of absolute frequencies
barplot(100*table(Steel$temperature)/sum(table(Steel$temperature)),
xlab="temperature", ylab="Percentage")

Change the color of the columns


barplot(table(Steel$temperature), xlab="temperature",
ylab="Frequency", col="green")

Adding a title
barplot(table(Steel$temperature), xlab="temperature",
ylab="Frequency", main="Working title")

Labelling the bars


drawing <- barplot(table(Steel$temperature), xlab="temperature",
ylab="Frequency") # we save the output graph
numbers <- table(Steel$temperature) # numbers to draw

text(drawing, numbers + 2, format(numbers, digits=2), xpd = TRUE)

Labelling bars with percentages


drawing <- barplot(table(Steel$temperature), xlab="Temperature",
ylab="Percentage") # we save the output graph
numbers <- 100*table(Steel$temperature)/sum(table(Steel$temperature))
# percentages to draw
text(drawing, numbers + 2, format(numbers, digits=4), xpd = TRUE)

Options in a pie chart


Pie chart with labels
pie(table(Steel$temperature),
labels=paste(levels(Steel$temperature), table(Steel$temperature)),
main="temperature", col=rainbow_hcl(length(levels(Steel$temperature))))

Pie chart with percentages instead of absolute frequencies and with labels
tabla <- round(100*table(Steel$temperature)/sum(table(Steel$temperature)), 2)
pie(table(Steel$temperature), labels=paste(levels(Steel$temperature), tabla, "%"),
main="temperature", col=rainbow_hcl(length(levels(Steel$temperature))))


Histogram options
Hist(Steel$consumption, scale="frequency",
breaks="Sturges",col="darkgray",
main="Title",xlab="Consumption",ylab="Frequency")

See all the output graphs

In order to be able to move forward and backward through the graphs with Page Up and
Page Down, it is necessary, when making the first graph, to select History → Recording in the
menu of the graphics window. We must repeat this operation every time we open R.


Session 2

Distribution models

The menu Distributions of R-Commander offers a large set of probability distributions, classified
into discrete and continuous ones. For each of them, there are five options:

Quantiles. For a given probability p, it returns the smallest value c such that Pr[X ≤ c] ≥ p
(lower tail) or Pr[X > c] ≤ p (upper tail).

Probabilities. In a discrete random variable, this gives the values of the probability mass func-
tion, that is, Pr[X = k] for a given k. For continuous random variables, it gives the tail
probabilities (see next entry).

Tail probabilities. For a given k, it gives the probability Pr[X ≤ k] (lower tail) or Pr[X > k]
(upper tail).

Plot. It makes a graphical representation of the density function (for continuous variables)
or the probability mass function (for discrete random variables), and also that of the
distribution function.

Sample. It generates a random data set of a given size following a certain distribution.

a) In some versions of R-Commander, there is by default a blank space in Enter name
for data set. However, R-Commander does not allow blank spaces in the object
names, so we must change this in order to avoid errors.

b) The rows correspond to samples and the columns to observations.

2.1 Continuous distributions


Example 2.1. Let Z be a random variable with normal distribution N (0, 1). Calculate Pr[Z ≤ −0.2].

Solution: Follow these instructions:


,→ Distributions
,→ Continuous distributions
,→ Normal distribution
,→ Normal probabilities
,→ Variable value(s): -0.2
,→ Mean: 0 (default)
,→ Standard deviation: 1 (by default)
,→ Lower tail
,→ OK

The output must be 0.4207403.
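The menu entry generates a call to pnorm, so the same value can be obtained by typing in the script window:

```r
pnorm(-0.2, mean = 0, sd = 1)   # 0.4207403
```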


Example 2.2. Let X be a random variable with normal distribution N (5, 2). Compute:
(a) Pr[X ≤ 7].
(b) Pr(X = 7).
(c) Pr(X < 7).
(d) Pr(X > 7).
(e) Pr(X ≥ 7).
(f ) Pr(4 < X < 7).
(g) Pr(4 ≤ X ≤ 7).
Solution: (a) Follow these instructions:
,→ Distributions
,→ Continuous distributions
,→ Normal distribution
,→ Normal probabilities
,→ Variable value(s): 7
,→ Mean: 5
,→ Standard deviation: 2
,→ Lower tail
,→ OK
The output must be 0.8413447.
(b) In a continuous distribution each single outcome has probability 0, so we can conclude
without using the software that P (X = 7) = 0.
(c) The answer is the same as in (a), because in a continuous distribution the probability of
a set does not change if we add or remove a point, as we just said.
(d) The only difference with respect to (a) is that we should choose now Upper tail. The
answer is 0.1586553. We can also obtain this by taking into account that P (X > 7) =
1 − P (X ≤ 7).


(e) It coincides with (d), because since X follows a continuous distribution we have P (X >
7) = P (X ≥ 7).
(f) In this case there are several ways to proceed. One would be taking into account that
P (4 < X < 7) = P (X < 7) − P (X ≤ 4).
The first value was obtained in (a). In a similar manner (changing 7 by 4) we can obtain
P (X ≤ 4) = 0.3085375, whence
P (4 < X < 7) = 0.8413447 − 0.3085375 = 0.5328072.
This last calculation can be made by typing it in RScript.
(g) Again, since we have a continuous distribution it holds that
P (4 ≤ X ≤ 7) = P (4 < X < 7),
so the answer is the same as in (f).
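All the items of this example can also be computed directly with pnorm:

```r
pnorm(7, mean = 5, sd = 2)                       # (a), (c): 0.8413447
pnorm(7, mean = 5, sd = 2, lower.tail = FALSE)   # (d), (e): 0.1586553
pnorm(7, 5, 2) - pnorm(4, 5, 2)                  # (f), (g): 0.5328072
```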
Example 2.3. Draw a histogram of a random sample of 10 000 values following a normal
distribution with mean µ = −3 and standard deviation σ = 2. Use approximately 50 bars in
the histogram.
Solution: Follow these instructions:
,→ Distributions
,→ Continuous distributions
,→ Normal distribution
,→ Sample of a normal distribution
,→ Enter name for data set: NormalSamples
,→ Mean: -3
,→ Standard deviation: 2
,→ Number of samples (rows): 10000
,→ Number of observations (columns): 1
,→ Sample means: disable
,→ Sample sums: disable
,→ Sample standard deviations: disable
,→ OK
Now we only have to make the histogram:

,→ Graphs
,→ Histogram...
,→ Variable (pick one): obs
,→ Number of bins: 50
,→ OK
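Equivalently, in the script window (the exact figures change on each run, since the sample is random):

```r
# Draw 10000 values from N(-3, 2) and plot a histogram with about 50 bars
x <- rnorm(10000, mean = -3, sd = 2)
hist(x, breaks = 50, xlab = "obs", main = "Sample from N(-3, 2)")
```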


We should obtain a bell-shaped histogram centred around −3, but not exactly the same one on
each run, because we are using random observations.

Example 2.4. Generate 100 random values of an exponential distribution with mean 2.

Solution: Follow this route:


,→ Distributions
,→ Continuous distributions
,→ Exponential distribution
,→ Sample from exponential distribution
,→ Enter name for data set: ExponentialSamples
,→ Rate: 0.5
,→ Number of samples (rows): 100
,→ Number of observations (columns): 1
,→ Sample means: disable
,→ Sample sums: disable
,→ Sample standard deviations: disable
,→ OK
If we go now to View data set we obtain something similar to the following:


(but not exactly this, because the observations are random).
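The same sample can be generated directly with rexp (recall that a mean of 2 corresponds to rate 0.5):

```r
# 100 random values from an exponential distribution with mean 2
ExponentialSamples <- data.frame(obs = rexp(100, rate = 0.5))
head(ExponentialSamples)   # first rows; the values change on each run
```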

Example 2.5. Let X follow an exponential distribution exp(0.1). Compute:


(a) Pr[X ≤ 7].

(b) Pr(X = 7).

(c) Pr(X < 7).

(d) Pr(X > 7).

(e) Pr(X ≥ 7).

(f ) Pr(4 < X < 7).

(g) Pr(4 ≤ X ≤ 7).


Solution: (a) Follow these instructions:
,→ Distributions
,→ Continuous distributions
,→ Exponential distribution
,→ Exponential probabilities
,→ Variable value(s): 7
,→ Rate: 0.1
,→ Lower tail
,→ OK
The output is 0.5034147.

(b) Since the exponential is a continuous distribution each single outcome has probability 0,
so P (X = 7) = 0.

(c) The answer is the same as in (a), because in a continuous distribution the probability of
a set does not change if we add or remove a point.

(d) The only difference with respect to (a) is that we should choose now Upper tail. The
answer is 0.4965853. We can also obtain this by taking into account that P (X > 7) =
1 − P (X ≤ 7).

(e) It coincides with (d), because since X follows a continuous distribution we have P (X >
7) = P (X ≥ 7).

(f) We have that P (4 < X < 7) = P (X < 7) − P (X ≤ 4), that P (X < 7) = 0.5034147 and
in a similar way we can obtain that P (X ≤ 4) = 0.32968, so P (4 < X < 7) = 0.1737347.

(g) Again, since we have a continuous distribution it holds that

P (4 ≤ X ≤ 7) = P (4 < X < 7),

so the answer is the same as in (f).
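With pexp, the computations of this example read:

```r
pexp(7, rate = 0.1)                       # (a), (c): 0.5034147
pexp(7, rate = 0.1, lower.tail = FALSE)   # (d), (e): 0.4965853
pexp(7, 0.1) - pexp(4, 0.1)               # (f), (g): 0.1737347
```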


Example 2.6. A similar procedure could be made with a Weibull distribution W (2, 3). We
obtain:
P (W (2, 3) ≤ 7) = P (W (2, 3) < 7) = 0.9956798
P (W (2, 3) > 7) = P (W (2, 3) ≥ 7) = 0.0043202
P (4 < W (2, 3) < 7) = P (4 ≤ W (2, 3) ≤ 7) = P (4 ≤ W (2, 3) < 7) = P (4 < W (2, 3) ≤ 7) = 0.1646931
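These Weibull probabilities correspond to the following pweibull calls:

```r
pweibull(7, shape = 2, scale = 3)                       # 0.9956798
pweibull(7, shape = 2, scale = 3, lower.tail = FALSE)   # 0.0043202
pweibull(7, 2, 3) - pweibull(4, 2, 3)                   # 0.1646931
```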

2.2 Discrete distributions


Example 2.7. Let X follow a binomial distribution with parameters n = 10 and p = 0.4.
Compute:
(a) Pr[X ≤ 7].
(b) Pr(X = 7).
(c) Pr(X < 7).
(d) Pr(X > 7).
(e) Pr(X ≥ 7).
(f ) Pr(4 < X < 7).
(g) Pr(4 ≤ X ≤ 7).
(h) P (X = 2.3).
(i) P (X = 25).
Solution: (a) Follow these instructions:
,→ Distributions
,→ Discrete distributions
,→ Binomial distribution
,→ Binomial tail probabilities
,→ Variable value(s): 7
,→ Binomial trials: 10
,→ Probability of success: .4
,→ Lower tail
,→ OK
The output must be 0.9877054.
(b) In this case it is not a continuous distribution, so the probability of a single outcome may
not be zero. A possible manner would be the following:
,→ Distributions
,→ Discrete distributions
,→ Binomial distribution
,→ Binomial probabilities
,→ Binomial trials: 10
,→ Probability of success: .4
,→ OK

We observe that P (X = 7) = 0.0424673280.

(c) As we just said, we do not have that P (X < 7) = P (X ≤ 7). To compute P (X < 7),
we must take into account that in the case of the binomial distribution we have P (X <
7) = P (X ≤ 6). Proceeding as in (a), but with the value 6 instead of 7, we obtain
P (X < 7) = 0.9452381.

(d) To obtain P (X > 7) we proceed as in (a) but choose Upper tail instead of Lower tail.
We obtain P (X > 7) = 0.01229455.

(e) Again we do not have the equality P (X ≥ 7) = P (X > 7) that we had in the continuous
case. One way of solving this would be using that in the case of a binomial distribution
P (X ≥ 7) = P (X > 6). Proceeding as in (d), we obtain P (X > 6) = 0.05476188.

(f) We can compute P (4 < X < 7) as P (X ≤ 6) − P (X ≤ 4) = 0.3121348.

(g) Similarly, in this case we have P (4 ≤ X ≤ 7) = P (X ≤ 7) − P (X ≤ 3) = 0.6054248.

(h) In the case of a binomial distribution B(10, 0.4) the possible values are 0, 1, 2, ..., 10, so
P (X = 2.3) = 0.

(i) The same reasoning implies that P (X = 25) = 0.
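The whole example can be reproduced with dbinom (mass function) and pbinom (distribution function):

```r
pbinom(7, size = 10, prob = 0.4)           # (a): 0.9877054
dbinom(7, size = 10, prob = 0.4)           # (b): 0.04246733
pbinom(6, 10, 0.4)                         # (c): 0.9452381
pbinom(7, 10, 0.4, lower.tail = FALSE)     # (d): 0.01229455
pbinom(6, 10, 0.4, lower.tail = FALSE)     # (e): 0.05476188
pbinom(6, 10, 0.4) - pbinom(4, 10, 0.4)    # (f): 0.3121348
pbinom(7, 10, 0.4) - pbinom(3, 10, 0.4)    # (g): 0.6054248
```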

Example 2.8. Determine the 95-th percentile of a Poisson distribution with parameter λ = 3.5.
Solution: Follow these instructions:

,→ Distributions
,→ Discrete distributions
,→ Poisson distribution
,→ Poisson quantiles
,→ Probabilities 0.95
,→ Mean 3.5
,→ Lower tail
,→ OK

The output is 7.
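The same quantile is returned by qpois:

```r
qpois(0.95, lambda = 3.5)   # 7
```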

2.3 Exercises
Exercise 2.1. The measurement errors in a machine follow a normal distribution N (0, 2). Com-
pute:
a) The probability that the error is smaller than 1.

b) The probability that it lies between -2 and 2.


c) The value k such that 30% of the times the error is smaller than k.

d) The value k ′ such that 20% of the times the error is greater than k ′ .

Exercise 2.2. If the lifetime of a component follows an exponential distribution with mean 2
years, determine:

a) The probability that a component lasts longer than 5 years.

b) The probability that it lasts less than 6 years.

c) The probability that it lasts between 5 and 6 years.

d) The guarantee we should give so that at most 40% of the components to be repaired are
in the guarantee period.

Exercise 2.3. If the lifetime of a component follows a Weibull distribution with shape parameter
2 and scale parameter 3, determine the probability that it lasts more than 5 years.

Exercise 2.4. A study with roller bearings has determined that their lifetime (in hundreds of
hours) follows a Weibull distribution with parameters k = 0.4 and λ = 4.

(a) What is the probability that they fail before 160 hours?

(b) Given a batch of 10 roller bearings selected at random, what is the probability that none
of them fails before 160 hours? And the probability that at most one of them fails before
160 hours?

Exercise 2.5. The number of breakdowns in a factory during an 8 hour shift follows a Poisson
distribution with parameter λ = 16.

(a) What is the probability that there are more than 20 breakdowns during a given shift?

(b) What is the probability that the time between two consecutive breakdowns is longer than
1 hour?

2.4 Solutions to the exercises


Exercise 2.1. If X denotes the random variable “measurement error”, it follows that X ∼
N (0, 2). Thus, we obtain the following solutions:

a) P (X < 1) = 0.6914625.

b) P (−2 < X < 2) = 0.6826894.

c) −1.048801.

d) 1.683242.

Exercise 2.2. If X denotes the random variable “lifetime of the component in years”, it follows
that X ∼ exp(0.5), because we are told that E(X) = 2. With this, we obtain the following
solutions:


a) P (X > 5) = 0.082085.

b) P (X < 6) = 0.9502129.

c) P (5 < X < 6) = P (X < 6) − P (X ≤ 5) = P (X < 6) − (1 − P (X > 5)) =
0.9502129 − (1 − 0.082085) = 0.0322979.

d) The value k such that P (X ≤ k) = 0.4 is k = 1.021651.

Exercise 2.3. If X denotes the random variable “lifetime of the component in years”, it follows
that X ∼ W (2, 3) and P (X > 5) = 0.06217652.

Exercise 2.4. If X denotes the random variable “lifetime of a roller bearing in hundreds of
hours”, then X ∼ W (0.4, 4).

(a) P (X < 1.6) = P (X ≤ 1.6) = 0.4999988.

(b) The random variable Y :=“number of roller bearings that fail before 160 hours” follows a
binomial distribution B(10, 0.4999988). Then the requested probabilities are:

– P (Y = 0) = 0.0009765859.
– P (Y ≤ 1) = 0.0107424.

Exercise 2.5. (a) If the number of breakdowns per shift follows a Poisson distribution with
parameter λ = 16 and we denote this variable X, then we must compute P (X > 20). We
can do this using the option Upper tail in the menu Poisson tail probabilities.
We obtain P (X > 20) = 0.131832.

(b) If the number of breakdowns in 8 hours follows a Poisson P(16), then the time in hours
between consecutive breakdowns follows an exponential exp(2). Thus, we must compute
P (exp(2) > 1) = 0.1353353.


Session 3

One sample tests

3.1 Introduction to hypothesis testing


Descriptive methods provide information about how a sample is distributed. To draw con-
clusions about the population, we need to use techniques of statistical inference (hypothesis
testing). We call a hypothesis any statement about the statistical features of a process. For
instance, if a technician observes the consumption for several hours, he can determine the mean
consumption during those hours. He may conjecture that the mean consumption during all the
working hours of the factory is equal to some given value, for instance 120. The scientific
process then consists in testing this hypothesis against an alternative.

Null hypothesis H0 : mean consumption = 120


Alternative hypothesis H1 : mean consumption ̸= 120

A test is a statistical procedure used to determine whether a hypothesis (the null one) is
supported by the data. If the sample data are not very plausible when this hypothesis is true,
then we reject it. Otherwise, we say that there is not enough significant evidence to reject the
hypothesis, and we accept it.
To present the result of a hypothesis test, we use the p-value, which is the smallest
significance level for which we reject the null hypothesis H0 . Once this is known, we compare the
p-value with a particular significance level (that may be pre-fixed or decided at that moment).

DECISION RULE
p-value < α =⇒ Reject H0
p-value ≥ α =⇒ Accept H0
We usually take α = 0.05.


The most important hypothesis tests we shall make are:

• Mean of a population: Is the mean consumption equal to 120?

• Proportion within a population: Is the percentage of hours with high consumption (>
200) less than 1%?

• Comparison of means: Is the mean consumption the same irrespective of whether there
are breakdowns?

• Comparison of proportions: Is the percentage of hours with high consumption the same
irrespective of whether there are breakdowns?

• Comparison of standard deviations: Is the variability the same irrespective of whether


there are breakdowns?

3.2 Tests for the mean


In order to test a hypothesis we follow these steps:

1. Select an adequate test for the sample;

2. Determine H0 and H1 for this test; and

3. Interpret the p-value.

In particular, if we want to test a hypothesis about the mean, we must first determine
whether the data follow a normal distribution; depending on the answer, we use a different
test (see Table 3.1)1 .
To test if the data come from a normal distribution, we shall use in this course the Shapiro-
Wilk test. For this kind of test, the hypotheses are:
1
The Wilcoxon one sample test is only useful when the distribution of the data is symmetrical. Otherwise,
we may use different tests which lie outside the scope of this course.


Table 3.1: Tests about the mean/median.

Test about the    Approx. normal distrib.?    Test type
Mean (µ)          YES                         t one sample test
Median (Me)       NO                          Wilcoxon one sample test

GOODNESS OF FIT TEST TO A NORMAL DISTRIBUTION


H0 : the data come from a normal distribution
H1 : the data do NOT come from a normal distribution

DECISION RULE
p-value < α =⇒ Reject H0 (the distribution is not normal)
p-value ≥ α =⇒ Accept H0
We usually take α = 0.05 .

Example 3.1. Study the normality of the distribution of the variable consumption.

Solution: We use the Shapiro-Wilk test:

,→ Statistics
,→ Summaries
,→ Test of normality...
,→ Select consumption
,→ OK

Shapiro-Wilk normality test

data:  Steel$consumption
W = 0.9924, p-value = 0.4207

Since the p-value (0.4207) is greater than α (α = 0.05 by default) we do not reject the null
hypothesis, and conclude that the data follow a normal distribution.
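R-Commander generates a call to shapiro.test; typing it directly in the script window gives the same output:

```r
shapiro.test(Steel$consumption)
# W = 0.9924, p-value = 0.4207
```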

Since the consumption follows a normal distribution, we can apply a test about the mean.
We must use the t one sample test, whose hypotheses can be of three types:


H0 : µ = 120      H0 : µ ≥ 120      H0 : µ ≤ 120
H1 : µ ̸= 120     H1 : µ < 120      H1 : µ > 120
Example 3.2. Is the mean consumption equal to 120?
Solution: In this case we have:
H0 : the mean consumption is 120
H1 : the mean consumption is not 120

,→ Statistics
,→ Means
,→ Single-sample t-test
,→ Select variable consumption
,→ Put 120 in the null hypothesis
,→ OK

One Sample t-test

data:  consumption
t = 2.9798, df = 116, p-value = 0.003514
alternative hypothesis: true mean is not equal to 120
95 percent confidence interval:
 125.2568 146.0974
sample estimates:
mean of x
 135.6771

With the decision rule:


p-value < α =⇒ Reject H0 (mean consumption ̸= 120)
p-value ≥ α =⇒ Accept H0 (mean consumption = 120)


the p-value (0.003514) is smaller than α and we reject the null hypothesis (H0 ); we thus
conclude that the mean is different from 120.
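The menu generates a call to t.test; the same test can be run from the script window (the alternative argument selects the type of alternative hypothesis):

```r
t.test(Steel$consumption, mu = 120, alternative = "two.sided")
# t = 2.9798, df = 116, p-value = 0.003514
```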

Example 3.3. Is the mean consumption smaller than 140?

Solution: Since we are dealing again with the variable consumption, we already know that it
follows a normal distribution, and we can use the t one sample test. Hence, an adequate test
for this statement is:
H0 : the mean consumption is not smaller than 140
H1 : the mean consumption is smaller than 140

,→ Statistics
,→ Means
,→ Single-sample t-test
,→ Select variable consumption
,→ Put 140 in the null hypothesis
,→ Select Population mean < mu0
,→ OK

One Sample t-test

data:  consumption
t = -0.8217, df = 116, p-value = 0.2065
alternative hypothesis: true mean is less than 140
95 percent confidence interval:
     -Inf 144.4005
sample estimates:
mean of x
 135.6771

Since the p-value (0.2065) is greater than α, we do not reject the null hypothesis. Hence,
there is not enough evidence in the sample to assume that the mean is smaller than 140.


Example 3.4. We want to make a test about the mean production of type I galvanized steel.
In order to select the adequate test, we must answer the following question: Does the variable
pr.galv1 follow a normal distribution?

Solution: We test the normality of pr.galv1.

,→ Statistics
,→ Summaries
,→ Test of normality...
,→ Select pr.galv1
,→ OK

We get the following output:

Shapiro-Wilk normality test

data:  pr.galv1
W = 0.9697, p-value = 0.00957

Since the p-value (0.00957) is smaller than α, we reject the null hypothesis; therefore, we
conclude that the variable does not follow a normal distribution.

Example 3.5. Is the mean production of type I galvanized steel less than 400?

Solution: Since there is no normality we must make the Wilcoxon one sample test. The different
types of hypotheses for the median are:
H0 : Me = 400      H0 : Me ≥ 400      H0 : Me ≤ 400
H1 : Me ̸= 400     H1 : Me < 400      H1 : Me > 400
   two.sided           less               greater
We are interested in the following hypothesis:

Is the average production smaller than 400?


H0 : M e ≥ 400 (the average production is big)
H1 : M e < 400 (the average production is small)

We must follow these steps:


,→ Statistics
,→ Non-parametric tests
,→ Single-sample Wilcoxon test
,→ Select variable pr.galv1
,→ Put 400 in Null hypothesis: mu =
,→ Select mu < 0 in Alternative hypothesis
,→ OK

We obtain the following output:


Wilcoxon signed rank test with continuity correction

data:  pr.galv1
V = 4003.5, p-value = 0.9538
alternative hypothesis: true location is less than 400

Since the p-value (0.9538) is greater than α, there is no significant evidence to reject the
null hypothesis; thus, we can assume that the average production is at least 400.
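The underlying command is wilcox.test:

```r
wilcox.test(Steel$pr.galv1, mu = 400, alternative = "less")
# V = 4003.5, p-value = 0.9538
```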

3.3 Population proportion


Frequently we are interested in knowing the proportion of elements of a population with a
certain feature: for instance, if the percentage of hours with breakdown is excessive (greater
than 10%) or not.

Example 3.6. In our example, is the percentage of hours with breakdowns significantly greater
than 10%?

Solution: We follow the usual steps when testing a hypothesis:


Select an adequate test for this sample
In this problem we must make a one sample proportion test. R-Commander allows us to
apply this test to dichotomous variables (factors with exactly two levels).
Determine H0 and H1 for this test
The different hypotheses we can make about a proportion are:
H0 : p = 0.1 H0 : p ≥ 0.1 H0 : p ≤ 0.1
H1 : p ̸= 0.1 H1 : p < 0.1 H1 : p > 0.1
two.sided less greater
Note that R-Commander considers by default that p is the proportion of the first class in
alphabetical order, unless the classes have been previously ordered by the researcher (we can
do this with the menu option Data −→ Manage variables in active data set −→ Reorder factor
levels..., as we shall see later); in our case, p refers to the proportion of the class No. Hence,
we consider the following hypotheses:


H0 : p ≥ 0.9 (reasonable proportion of breakdowns)


H1 : p < 0.9 (excessive proportion of breakdowns)
Now it suffices to make

,→ Statistics
,→ Proportions
,→ Single-sample proportion test
,→ Select variable breakdowns
,→ Write 0.9 as the null hypothesis
,→ OK

The output is:


1-sample proportions test without continuity correction

data:  rbind(xtabs(~breakdowns, data = Steel)), null probability 0.9
X-squared = 25.2317, df = 1, p-value = 2.542e-07
alternative hypothesis: true p is less than 0.9
95 percent confidence interval:
 0.0000000 0.8192062
sample estimates:
        p
0.7606838

Interpretation of the p-value


Since the p-value (2.542 · 10−7 ) is smaller than α we reject the null hypothesis, and conclude
that the proportion of breakdowns is excessive.

OTHER POSSIBILITIES

Test using the binomial distribution. We have performed the previous test using the
default option Normal approximation in Type of test. When the sample size is small, it is
better to use the option Exact binomial. In our example the differences are minimal, and we
obtain again a very small p-value:


Exact binomial test

data: rbind(.Table)
number of successes = 89, number of trials = 117, p-value = 1.002e-05
alternative hypothesis: true probability of success is less than 0.9
95 percent confidence interval:
0.0000000 0.8242563
sample estimates:
probability of success
0.7606838
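The exact test can also be run directly with binom.test, using the counts shown in the output (89 hours without breakdowns out of 117):

```r
binom.test(89, 117, p = 0.9, alternative = "less")
# number of successes = 89, number of trials = 117, p-value = 1.002e-05
```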

Reordering the factor levels. Another approach to the problem is to reorder the factor
levels and to put YES as the first one.

,→ Data
,→ Manage variables in active data set
,→ Reorder factor levels
,→ Select variable breakdowns
,→ OK
,→ Reorder the levels as we please
,→ OK

In this way, the hypotheses would be

H0 : p ≤ 0.1 (reasonable percentage of breakdowns)


H1 : p > 0.1 (excessive percentage of breakdowns)

To perform the test, we do the following:


,→ Statistics
,→ Proportions
,→ Single-sample proportion test
,→ Select variable breakdowns
,→ Write 0.1 as the null hypothesis
,→ OK

1-sample proportions test without continuity correction

data:  rbind(xtabs(~breakdowns, data = Steel)), null probability 0.1
X-squared = 25.2317, df = 1, p-value = 2.542e-07
alternative hypothesis: true p is greater than 0.1
95 percent confidence interval:
 0.1807938 1.0000000
sample estimates:
        p
0.2393162
We observe that we obtain the same p-value as before and reach as a consequence the same
conclusion.

3.4 Confidence intervals


Finally, note that in the output of some of these tests we are also obtaining a confidence interval
for the parameter (the population mean in the case of the single sample t test and the population
proportion in the case of the single sample proportion test). By default, we obtain a confidence
interval at the confidence level of 95%; we may change it using the options of the test.
Note also that, in order to obtain a bounded interval of the type we have seen in class, we
must select the two-sided alternative in the options of the test, which is represented in these
two tests with the symbol != .
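For instance, a 99% confidence interval for the mean consumption can be obtained by changing conf.level in the t.test call:

```r
t.test(Steel$consumption, conf.level = 0.99)$conf.int
# 121.8989 149.4553
```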

3.5 Exercises
Exercise 3.1. a) Obtain a confidence interval for the mean consumption at the confidence
levels 1 − α = 90%, 1 − α = 95% and 1 − α = 99%, respectively.
b) Obtain a confidence interval for the proportion of times that line A is used, at the confi-
dence levels 1 − α = 90%, 1 − α = 95% and 1 − α = 99%, respectively.


Exercise 3.2. Give a reasoned answer to the following questions:

a) What is the mean consumption of the 117 data? And its standard deviation?

b) Make the histogram of the consumption. Does this graph suggest that the data follow a
normal distribution?

c) Test the normality of the distribution of the consumption. What is the p-value? Do we
admit the normality of the data?

d) Using the result of the previous item and the mean and standard deviation of the first one,

• What is the percentage of hours where we expect to have consumption greater than
265 megawatts/hour? And smaller than 99 megawatts/hour? And between 99 and
265 megawatts/hour?
• What is the estimation of the consumption which is only exceeded 2% of the times?

Exercise 3.3. Do these data support the hypothesis that the average consumption is smaller than
130 megawatts/hour?

Exercise 3.4. Do these data support the hypothesis that the average consumption is smaller than
130 megawatts/hour during those hours with high temperature? Draw a box-plot of the variable
consumption for each of the temperatures considered and comment the results.

Exercise 3.5. Can we conclude that the average production of the steel converter is smaller than
260 tons? And different than 250? And different from 240? And different from 180? Calculate
the mean value of the production of the steel converter and comment the results.

Exercise 3.6. Do these data support the hypothesis that the percentage of times line A is used
is greater than 20%?

Exercise 3.7. a) Represent graphically the number of hours in the sample where the overheat-
ing detection system is on and the number of hours when it is off. What is the percentage
of hours when it is on?

b) The acquisition of the overheating detection system is not profitable if, in general, it is used
less than 40% of the time. Using the sample data, can we conclude that the acquisition is
not profitable?

c) A study about this system consists in choosing at random 25 hours of the monthly production
and determining if the system was on or off during each of them. Assuming that the
population proportion coincides with the estimation obtained in b), what is the probability
that exactly in 9 of the 25 hours the system was on? And no more than 12 hours? And at
least 10 hours? And between 10 and 12 hours (both values included)? And between
9.5 and 12.5 hours? And more than 9 and less than 13 hours?


3.6 Solutions to the exercises


Exercise 3.1. a) Use one-sample t-test. Confidence intervals for the mean consumption:

Confidence level    Interval
1 − α = 90%         (126.9537, 144.4005)
1 − α = 95%         (125.2568, 146.0974)
1 − α = 99%         (121.8989, 149.4553)

b) Use 1-sample proportions test. Confidence intervals for the proportion of times line A is
used:
Confidence level Interval
1 − α = 90% (0.4457825, 0.5959867)
1 − α = 95% (0.4316194, 0.6097571)
1 − α = 99% (0.4044922, 0.6359495)
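These intervals can also be reproduced by typing the commands in the R Script window; the sketch below assumes that Steel, with the variables consumption and line, is the active data set (the option correct = FALSE mimics the proportions test without continuity correction):

```r
# Confidence intervals for the mean consumption (one-sample t test)
with(Steel, t.test(consumption, conf.level = 0.90)$conf.int)
with(Steel, t.test(consumption, conf.level = 0.99)$conf.int)

# Confidence interval for the proportion of times line A is used
with(Steel, prop.test(sum(line == "A"), length(line),
                      conf.level = 0.95, correct = FALSE)$conf.int)
```

Changing conf.level produces the remaining confidence levels.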

Exercise 3.2. a) From the menu option Statistics → Summaries → Numerical summaries:

    mean        sd   IQR    0%    25%   50%    75%   100%    n
135.6771  56.90756 83.39  17.5  99.09 135.1 182.48 290.72  117

the mean is 135.6771 megawatts/hour and the standard deviation is 56.90756 megawatts/hour.

b) It seems so, because the histogram looks like the density function of the normal distribution
(bell shaped).

c) The p-value of the Shapiro-Wilk normality test applied to these data is 0.4207. Since it is
clearly greater than the significance level (the usual value is α = 0.05 and the maximum
value is usually α = 0.1), we do not reject the null hypothesis, so there is not enough
evidence against the normality of the random variable “consumption”.

d) From the previous item, we can assume that the variable follows a normal distribution,
and we estimate its mean and standard deviation with the values of the sample mean
and standard deviation. Hence, we assume that X=“consumption”≡ N (135.677, 56.908).
Taking this into account and using Distributions → Continuous distributions → Normal
distribution we can answer the different questions:

• Since P (X > 265) = 0.01152839, the expected percentage of working hours when the
consumption is greater than 265 megawatts/hour is 1.15%. Similarly, since P (X <
99) = 0.2596268, we estimate the percentage of working hours with less than 99
megawatts/hour is around 25.96%.
From this we deduce that the expected percentage of times when the consumption is
between 99 and 265 megawatts/hour is (100 − 1.15 − 25.96)% = 72.89%.²
• We are looking for the value c such that P (X > c) = 0.02. With the function Normal
quantiles we obtain c = 252.5517.
² Note here the difference between the probability (expected proportion) and the sample proportion. If we
look at part (a) we note that the first quartile in the sample is 99.09, while the cumulative probability of this
number is slightly greater than 0.25.
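The probabilities in this item can also be obtained directly with pnorm and qnorm in the R Script window; a sketch using the estimates above:

```r
mu <- 135.677; sigma <- 56.908                          # sample mean and sd
pnorm(265, mean = mu, sd = sigma, lower.tail = FALSE)   # P(X > 265)
pnorm(99, mean = mu, sd = sigma)                        # P(X < 99)
qnorm(0.02, mean = mu, sd = sigma, lower.tail = FALSE)  # c with P(X > c) = 0.02
```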


Exercise 3.3. We perform the test H0 : µ ≥ 130 against H1 : µ < 130. Since we showed in
the previous exercise that we may assume the normality of the variable “consumption”, we can
apply the one-sample t test. The output in R is

One Sample t-test

data:  consumption
t = 1.0791, df = 116, p-value = 0.8586
alternative hypothesis: true mean is less than 130
95 percent confidence interval:
     -Inf 144.4005
sample estimates:
mean of x
 135.6771

and since the p-value is 0.8586, we do not reject the null hypothesis: there is no evidence
against assuming that the mean consumption is greater than or equal to 130 megawatts/hour.
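The same test can be run from the R Script window; a sketch assuming Steel is the active data set:

```r
# One-sample t test of H0: mu >= 130 against H1: mu < 130
with(Steel, t.test(consumption, mu = 130, alternative = "less"))
```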

Exercise 3.4. In this case we begin by filtering the data and creating a new dataset that we shall
call Steel.temp_high. We already did this in the first session (Section 1.7). Once this is
done, we test the normality of the consumption when the temperature is high. The output of
the Shapiro-Wilk normality test is:

Shapiro-Wilk normality test

data:  consumption
W = 0.9448, p-value = 0.02965

so we cannot assume normality, and we therefore use the one-sample Wilcoxon test. The output is:

Wilcoxon signed rank test

data:  consumption
V = 245, p-value = 0.000464
alternative hypothesis: true location is less than 130

so we reject the null hypothesis and conclude that the consumption when the temperature is high
is smaller than 130 megawatts/hour.
To draw a box-plot, we must first of all consider again the dataset Steel. By doing this and
determining the box-plot for the variable consumption grouped by temperature we obtain:


which clearly shows that the consumption with high temperatures is smaller. We see that the
overall mean consumption is 135.6771 megawatts/hour, while in the case of high temperatures it
is 103.5239 megawatts/hour. This explains why in general we cannot conclude that the mean
consumption is smaller than 130, but we can if we restrict ourselves to those hours with high
temperature.
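The whole analysis can be sketched in the R Script window; the names temperature and "high" below are assumptions about how the filtering variable is coded in Steel:

```r
# Keep only the hours with high temperature (variable and level names assumed)
Steel.temp_high <- subset(Steel, temperature == "high")
shapiro.test(Steel.temp_high$consumption)        # normality is rejected here
wilcox.test(Steel.temp_high$consumption,
            mu = 130, alternative = "less")      # one-sample Wilcoxon test
boxplot(consumption ~ temperature, data = Steel) # box-plots by temperature
```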

Exercise 3.5. We begin by testing the normality of the data in pr.sc. The output of the Shapiro-
Wilk normality test is:

Shapiro-Wilk normality test

data:  pr.sc
W = 0.9349, p-value = 2.49e-05

so we reject the normality of the data. In order to make a test about the average production,
we use the Wilcoxon one sample test. Using the option Statistics → Non-parametric tests →
Wilcoxon single sample test we get the following output:

Wilcoxon signed rank test with continuity correction

data:  pr.sc
V = 3071.5, p-value = 0.151
alternative hypothesis: true location is less than 250

so we do not reject H0 and therefore there is no significant evidence against the median being
greater than or equal to 250 tons.
To test if the median is different from 250, we use the menu Statistics → Non-parametric
tests → Wilcoxon single sample test and select (Alternative hypothesis: two-sided,
Null hypothesis: mu = 250). We get a p-value of 0.302, so again we conclude that it is
admissible that the median is 250 tons.
If we test H0 : M e = 240 against H1 : M e ̸= 240, the p-value is 0.6244, so again we accept
H0 .


However, if we test H0 : M e = 180 against H1 : M e ̸= 180, the p-value is 0.001068, so we
conclude that the median is different from 180.
If we compute the sample median of the 117 data, we obtain the value 225. This has led us
to accept the hypothesis of a population median of 250, of 240, or greater than or equal to 250,
but to reject the hypothesis of a population median of 180.

          mean        sd  IQR  0%  25%  50%  75%  100%    n
pr.sc 244.9231  167.5311  234  13   99  225  333   677  117
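A script version of these tests, assuming Steel is the active data set:

```r
median(Steel$pr.sc)                              # sample median of the production
with(Steel, wilcox.test(pr.sc, mu = 250, alternative = "less"))
with(Steel, wilcox.test(pr.sc, mu = 250))        # two-sided test at 250
with(Steel, wilcox.test(pr.sc, mu = 240))        # two-sided test at 240
with(Steel, wilcox.test(pr.sc, mu = 180))        # two-sided test at 180
```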

Exercise 3.6. Yes, because the p-value of the test H0 : p ≤ 0.2 against H1 : p > 0.2, where p
represents the proportion of times line A is used, is smaller than 2.2 · 10−16 . Thus, we reject
H0 and conclude that there is significant evidence that the percentage of times line A is used
is greater than 20%.

Exercise 3.7. a) An adequate graph for the number of hours when the overheating detection
system is on is a bar chart of system. In the frequency table we observe that 49.57265%
of the hours it was on.

b) We test H0 : p ≤ 0.6 against H1 : p > 0.6, because OFF precedes ON in alphabetical order
and thus p = P(OFF). The p-value is 0.9827, so there is no evidence that P(OFF) > 0.6
or, equivalently, that P(ON) < 0.4.

c) If we consider the variable X = “number of hours where the system is on among the 25
chosen at random” and assume that the population proportion coincides with the point
estimation from item a), we obtain that X ≡ B(25, 0.4957). Then,

• P (X = 9) = 6.460361 · 10−2 = 0.06460361, so we estimate that the probability that
the system is on exactly 9 of the 25 hours is 0.06460361.
• P (X ≤ 12) = 0.5173218, so the probability that at most 12 hours out of the 25 the
system is on is 0.5173218.
• P (X ≥ 10) = P (X > 9) = 0.8766434, so we estimate that the probability that the
system is on at least 10 hours is 0.8766434.
• P (10 ≤ X ≤ 12) = P (X ≤ 12) − P (X < 10) = P (X ≤ 12) − (1 − P (X ≥ 10)) =
0.5173218 − (1 − 0.8766434) = 0.3939652, so we estimate that the probability that the
system is on between 10 and 12 hours (both included) is 0.3939652.
• P (9.5 ≤ X ≤ 12.5) = P (10 ≤ X ≤ 12) = 0.3939652, so we estimate that the
probability that the system is on between 9.5 and 12.5 hours is 0.3939652.
• P (9 < X < 13) = P (10 ≤ X ≤ 12) = 0.3939652, so again the probability that the
system is on more than 9 hours and less than 13 is 0.3939652.
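All these binomial probabilities can be computed in the R Script window with dbinom and pbinom:

```r
n <- 25; p <- 0.4957                                # X follows B(25, 0.4957)
dbinom(9, size = n, prob = p)                       # P(X = 9)
pbinom(12, size = n, prob = p)                      # P(X <= 12)
pbinom(9, size = n, prob = p, lower.tail = FALSE)   # P(X >= 10)
pbinom(12, n, p) - pbinom(9, n, p)                  # P(10 <= X <= 12)
```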



Session 4

Two sample tests

In the previous part we learned how to make one sample tests. As a summary, we saw how to
perform the following:

• Tests about the proportion:

– Single-sample proportion test

• Tests about the population mean:

a) Make the Shapiro-Wilk normality test.


b) Is there normality?
Yes −→ Single-sample t test
No −→ Single-sample Wilcoxon test

In this part and the next one we shall compare two samples. The outline is the following:

• Tests about proportions: Two-sample proportions test

• Tests about variances: F-test

• Tests about the mean:

a) Make the Shapiro-Wilk normality test;


b) Is there normality?
– Yes; How are the samples?
Independent −→ independent samples t test
Paired −→ paired t test
– No; How are the samples?
Independent −→ Wilcoxon two sample test
Paired −→ Wilcoxon paired test


4.1 Comparison of proportions


In many cases it is interesting to compare two populations by means of the proportion of
elements with a certain feature in each population (for instance, the proportion of male and
female smokers), asking if the proportion is the same in both. The two sample proportions
test allows us to decide if the differences between the sample proportions are due to random
fluctuations or to a difference in the populations. In order to make this test, the sample sizes
should be relatively large.

Example 4.1. Is the proportion of hours without breakdowns smaller in line A than in line B?

Solution: Let us follow the usual steps for the resolution of a hypothesis test.
Identify the adequate test for the problem
In this problem we are comparing two proportions, so the adequate test is the two sample
test for equality of proportions.
Determine H0 and H1 for this test
The different types of hypotheses we can consider when comparing proportions are:

H0 : pA = pB H0 : pA ≥ pB H0 : pA ≤ pB
H1 : pA ̸= pB H1 : pA < pB H1 : pA > pB
two.sided less greater

where pA and pB are the proportions in populations A and B, respectively. In our case, sample
A will be given by those data where line=="A" and B, by those with line=="B".
By default, R-Commander assumes that the proportions pA and pB are associated with the first
class in alphabetical order, in this case the value A of the variable line. Hence, our
hypotheses are:

H0 : pA ≥ pB (better in line A)
H1 : pA < pB (worse in line A)

To make this test, we must go to:

Statistics
→ Proportions
→ Two-sample proportions test


We select variables line and breakdowns
→ Press: Difference < 0
→ OK

We obtain:
2-sample test for equality of proportions without continuity correction

data:  .Table
X-squared = 0.0673, df = 1, p-value = 0.6024
alternative hypothesis: less
95 percent confidence interval:
 -1.0000000  0.1504991
sample estimates:
   prop 1    prop 2
0.7704918 0.7500000

Interpret the p-value


Since the p-value (0.6024) is greater than α, we do not reject the null hypothesis; there is
no significant evidence that there are more breakdowns in line A than in line B.

When applying the two sample proportions test, we should be careful with three things:

• Identify correctly the response variable and the group variable; in this respect, note that
the response variable is the object of our study, and we compare its behaviour on the
groups determined by the other variable. Note that the values of these groups usually
appear as subindices in H0 , H1 .

• Verify that the proportion is representing what we are interested in. By default, R
considers that p is the proportion of the first category in alphabetical order. If we are
interested in the other category, we should either reorder the factor levels or express
H0 , H1 in terms of the other category.

• The order of the categories of the group variable when choosing the appropriate alternative
hypothesis.
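As a sketch, the menu steps of Example 4.1 correspond to the following script commands; the orientation of the table assumes that the levels of breakdowns are NO and YES in alphabetical order, so that the first column (hours without breakdowns) plays the role of the successes:

```r
# Two-sample test for equality of proportions, H1: pA < pB
.Table <- with(Steel, table(line, breakdowns))
prop.test(.Table, alternative = "less", correct = FALSE)
```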

4.2 Comparison of variances


A first step when testing the equality of means in two populations is to determine if the variances
are equal.¹ R-Commander has three tests for this: the two-variances F test, the Bartlett test and

¹ We talk of homoscedasticity when the variances are equal and of heteroscedasticity when they are not.


the Levene test. Here we shall use the first one: the two variances F test, because we shall
consider this problem only after verifying that the distributions are normal.

Example 4.2. Are the variances of the consumption the same in lines A and B? (assuming
normality)

Solution: The hypotheses of this test are

H0 : σA² = σB² (homoscedasticity)
H1 : σA² ̸= σB² (heteroscedasticity)

so we must make

Statistics
→ Variances
→ Two-variances F test

Select variables line and consumption
→ OK

The output is:

       A        B
1431.355 2034.651

F test to compare two variances

data:  consumption by line
F = 0.7035, num df = 60, denom df = 55, p-value = 0.1834
alternative hypothesis: true ratio of variances is not equal to 1
95 percent confidence interval:
 0.4158963 1.1828332
sample estimates:
ratio of variances
         0.7034893


Since the p-value (0.1834) is greater than α, we do not reject the null hypothesis. Hence, we
assume that there are no significant differences between the variances in the two populations
(line A and line B).
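The same F test is available from the script through var.test:

```r
# F test for the equality of the variances of consumption in lines A and B
var.test(consumption ~ line, data = Steel)
```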

This test is usually employed as an auxiliary test for the independent samples t test, that
we shall see later. However, in the context of engineering it is interesting by itself, because a
basic strategy for the improvement of quality is determining the causes of variability, in order
to reduce it. For this reason, it is not uncommon to perform tests of homoscedasticity (usually
with an alternative hypothesis of the type H1 : σ1² > σ2² or H1 : σ1² < σ2²) to see if the strategies
that have been adopted to reduce the variability are effective or not.

4.3 Comparison of averages


To compare the averages in two samples we have to take into account:

a) The relationship between the samples, which may be:

independent: they are two samples corresponding to different elements of the popula-
tion. In R-Commander, two independent samples within the same data set have a
dichotomous factor (that is, a factor with two levels) that distinguishes them; the
quantitative variable under study is in one column only. Assume for instance that
we are studying the consumption with and without breakdowns; the values of the
consumption are all in the same column (consumption) and each observation belongs
to one sample or the other depending on the value of the variable breakdowns.
paired: in this case, each individual has a value in each of the two samples. In R-
Commander, the data appear in two different columns.

b) If we may assume normality or not.

The following table summarizes the different tests of comparison of means that we shall see in
this subject:
Comparison                 Approx. normal distributions?   Independent?   Test type
Difference of means        Yes                             Yes            Independent samples t test
Mean of the difference     Yes                             No             Paired t test
Difference of medians      No                              Yes            Two sample Wilcoxon test
Median of the difference   No                              No             Wilcoxon paired test

In the following two subsections we will consider the case where the underlying distributions
are normal, and in the two subsequent ones, the case of arbitrary distributions. In case of
normality, we compare the means of both groups or, equivalently, the mean of the difference.


4.3.1 Independent samples with normality


Example 4.3. Is the average consumption smaller in line A than in line B?

Solution: We must determine first of all if the data come from a normal distribution. This can
be done in a number of ways.
The fastest procedure is to test the normality by groups (note that this feature is not
available in old versions of R-Commander):

Statistics
→ Summaries
→ Test of normality

Here, we have to pick the option test by groups.

Pick test by groups
→ Select variable line
→ OK

Another possibility, valid in old versions of R-Commander, is to type the following in the
R Script window:


with(Steel,by(consumption,line,shapiro.test))

Finally, we could also make two filters and create two data sets, corresponding to line A
and B, respectively, and then apply to each of them the Shapiro-Wilk normality test.
Whatever the procedure, we obtain the following output:

line: A

Shapiro-Wilk normality test

data:  dd[x, ]
W = 0.9708, p-value = 0.1534

-----------------------------------------------------------------------------
line: B

Shapiro-Wilk normality test

data:  dd[x, ]
W = 0.9746, p-value = 0.2841

The p-value for the consumption in line A is 0.1534 and in line B it is 0.2841. In both
cases it is large enough so as not to reject the null hypothesis (we can accept the normality of
the data). The two samples are independent, and as we saw in Example 4.2 the variances are
equal. Taking all this into account, we can apply the Independent samples t test, assuming
equal variances.
Determine H0 and H1 for this test
For the independent samples t test the possibilities for the null and alternative hypotheses
are:

H0 : µ1 = µ2 H0 : µ1 ≥ µ2 H0 : µ1 ≤ µ2
H1 : µ1 ̸= µ2 H1 : µ1 < µ2 H1 : µ1 > µ2

In this example, we have:

H0 : µA ≥ µB (mean consumption not smaller in line A)
H1 : µA < µB (mean consumption smaller in line A)

and we proceed as follows:


Statistics
→ Means
→ Independent samples t test

Select variables line and consumption
→ Select: Difference < 0
→ Equal variances
→ OK

The output is:

Two Sample t-test

data:  consumption by line
t = -10.1697, df = 115, p-value < 2.2e-16
alternative hypothesis: true difference in means is less than 0
95 percent confidence interval:
     -Inf -65.32647
sample estimates:
mean in group A mean in group B
        98.3182        176.3716

Interpret the p-value


Since the p-value (< 2.2 · 10−16 ) is smaller than α, we reject the null hypothesis, so there is
significant evidence that the mean consumption is smaller in line A than in line B.
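From the script, the equivalent call is:

```r
# Independent samples t test with equal variances, H1: mu_A < mu_B
t.test(consumption ~ line, alternative = "less",
       var.equal = TRUE, data = Steel)
```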

Let us now make a two-sided test of equality of means:


Example 4.4. Is the average consumption equal in line A and in line B?

Solution: The hypotheses in this case are:

H0 : µA = µB (the mean consumption is the same in lines A and B)


H1 : µA ̸= µB (the mean consumption is different)

Statistics
→ Means
→ Independent samples t test

Select variables line and consumption
→ Select: Two-sided
→ Equal variances
→ OK

We obtain:

Two Sample t-test

data:  consumption by line
t = -10.1697, df = 115, p-value < 2.2e-16
alternative hypothesis: true difference in means is not equal to 0
95 percent confidence interval:
 -93.25631 -62.85051
sample estimates:
mean in group A mean in group B
        98.3182        176.3716


Since the p-value (< 2.2 · 10−16 ) is again smaller than α, we reject the null hypothesis
and conclude that there are significant differences between the mean consumptions in lines A and B.
This was to be expected: if we have significant evidence that the mean consumption
is strictly smaller in line A than in line B, then in particular there is significant evidence that the
two consumptions are different.

4.3.2 Paired samples with normality


Example 4.5. Compare the average productions of continuous casting and of painted panels.

We start by computing the variable difference with the option Data → Manage variables in
active data set → Compute new variable, entering as the expression to compute the difference
pr.cc - pr.pint of the two variables involved.

This creates a new variable, to which we apply the Shapiro-Wilk normality test in the usual
way: Statistics → Summaries → Shapiro-Wilk normality test. There we select the variable
difference and obtain the following result:

Shapiro-Wilk normality test

data:  Steel$difference
W = 0.988, p-value = 0.3948

Since the p-value is 0.3948, we do not reject the hypothesis of normality. Hence, we can
apply the paired t test to the hypotheses

H0 : µcc−pint = 0
H1 : µcc−pint ̸= 0


In order to do this, we apply the following:

Statistics
→ Means
→ Paired t test

First variable (pick one)
→ Choose pr.cc
Second variable (pick one)
→ Choose pr.pint
Alternative hypothesis
→ Two-sided

The output is

Paired t-test

data:  pr.cc and pr.pint
t = 2.5405, df = 116, p-value = 0.01239
alternative hypothesis: true difference in means is not equal to 0
95 percent confidence interval:
  18.56348 149.91515
sample estimates:
mean of the differences
               84.23932

and we see that the p-value is 0.01239; this allows us to conclude that there is a significant
difference between the means of pr.cc and pr.pint.
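The paired t test can also be invoked directly from the script:

```r
# Paired t test comparing continuous casting and painted panels
with(Steel, t.test(pr.cc, pr.pint, paired = TRUE))
```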


4.3.3 Independent samples without normality


Assume we are interested in comparing the averages of two random variables (representing two
different features of the elements of a population or the same feature measured in two samples
from different populations), and that we cannot assume normality. In that case, we may use
the Wilcoxon test.

Example 4.6. Let us study the average production of type II galvanized steel depending on the
production line.

Solution: We apply the Shapiro-Wilk normality test to each sample of variable pr.galv2 de-
pending on the value of line. To do this, we select first of all those values corresponding to
line A (Figure 4.1).

Figure 4.1: Filter to select the data from line A.

Next, we apply the Shapiro-Wilk normality test, with the following results:
Shapiro-Wilk normality test

data:  pr.galv2
W = 0.8955, p-value = 7.985e-05


Hence, we cannot assume the normality of the data of the production of type II galvanized steel
in line A (there is significant evidence against it).
Since in order to apply the independent samples t test both variables must follow a normal
distribution, we must then apply the Two-sample Wilcoxon test. Before doing this, we return
to the data set Steel, using Data → Active data set → Select active data set (Figure 4.2).

Figure 4.2: Returning to the initial data set.


Note also that a faster way to make these normality tests, instead of making all the filters,
would be to type the following instruction in the R Script window:
with(Steel,by(pr.galv2,line,shapiro.test))

Finally, an even faster procedure would be to test the normality of variable pr.galv2 per
group, similarly to what we did in Section 4.3.1.
Now, if we want to compare the averages in both samples, we may consider the following
hypotheses:

H0 : Me A − Me B = 0 (the average production is the same)
H1 : Me A − Me B ̸= 0 (the average production is different)

We apply then the Wilcoxon two samples test, in the menu (Statistics → Nonparametric
tests → Two-sample Wilcoxon test, Figure 4.3).

Figure 4.3: Two-sample Wilcoxon test.

The output is
Wilcoxon rank sum test with continuity correction

data:  pr.galv2 by line
W = 1431, p-value = 0.1314
alternative hypothesis: true location shift is not equal to 0


We conclude that there is no significant evidence that the average production of type II
galvanized steel varies depending on the line, because the p-value (0.1314) is greater than any
reasonable significance level.
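The script equivalent of this menu option is:

```r
# Two-sample Wilcoxon (rank sum) test of pr.galv2 between the two lines
wilcox.test(pr.galv2 ~ line, data = Steel)
```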

4.3.4 Paired samples without normality


Example 4.7. Compare the average productions of type I and type II galvanized steel.

Solution: First of all, we obtain the difference of these two variables with the menu Data →
Manage variables in the active data set → Compute new variable. We shall denote the new
variable dif.
The output of the Shapiro-Wilk normality test on dif is:


Shapiro-Wilk normality test

data:  dif
W = 0.9671, p-value = 0.005665

We reject the normality at the significance level α = 0.05. Hence, instead of making the
paired t test we are going to apply the paired Wilcoxon test. The null and alternative hypotheses
we consider are:
we consider are:
H0 : Me X1 −X2 = 0 (the average production is the same in both cases)
H1 : Me X1 −X2 ̸= 0 (the average production is different for both variables)
We use the menu Statistics → Nonparametric tests → Paired samples Wilcoxon test and
select the variables as described in Figure 4.4.

Figure 4.4: Wilcoxon paired test.

The output of this procedure is:

Wilcoxon signed rank test with continuity correction

data:  pr.galv1 and pr.galv2
V = 249, p-value < 2.2e-16

alternative hypothesis: true location shift is not equal to 0

The p-value is < 2.2 · 10−16 ≈ 0, smaller than any reasonable significance level α, so we
conclude that the production of the two types of galvanized steel is different on average.
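From the script:

```r
# Paired Wilcoxon (signed rank) test for the two types of galvanized steel
with(Steel, wilcox.test(pr.galv1, pr.galv2, paired = TRUE))
```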

4.4 Exercises
Exercise 4.1. Give a reasoned answer to the following questions:

a) Make a test to determine if the percentage of hours when the overheating detection system
is off is greater in line A than in line B. How much is the p-value? What do we conclude?


b) In line A, what is the percentage of hours when the system was off?

c) In line B, what is the percentage of hours when the system was off?

d) Which of these two percentages is greater? Discuss the results.

Exercise 4.2. Are there significant evidences that the proportion of times we use line B is greater
when there are breakdowns?

Exercise 4.3. We want to compare the average consumption when the overheating detection
system is on and when it is off.

a) What would be an adequate test for this?

b) If the alternative hypothesis is that the mean consumption is greater when the system is
off, what is the p-value? What do we conclude?

c) Make a graphical representation that illustrates the conclusions of the previous point.

Exercise 4.4. We want to compare the average production of continuous casting and of the steel
converter.

a) Which test is the most adequate to compare the two averages?

b) For this sample, what is the mean production of continuous casting? And of the steel
converter?

c) If the alternative hypothesis is that the mean production of continuous casting is greater
than that of the steel converter, what is the associated p-value? What do we conclude?

Exercise 4.5. a) Is the production of the steel converter greater, on average, when the over-
heating detection system is on than when it is off?

b) How much is the sample median of the production of the steel converter when the system
is off ? And when it is on?

c) Make a graph where we can compare the production of the steel converter when the over-
heating detection system is off and on.

Exercise 4.6. a) Is the production of the steel converter smaller, on average, than that of the
hot belt train?

b) How much is the sample median of the production of the steel converter? And that of the
hot belt train?

4.5 Solutions to the exercises


Exercise 4.1. a) We make a two-sample proportions test (H0 : pA ≤ pB against H1 : pA >
pB ) and obtain:


system
line OFF ON Total Count
A 50.8 49.2 100 61
B 50.0 50.0 100 56

2-sample test for equality of proportions without continuity correction

data:  .Table
X-squared = 0.0078, df = 1, p-value = 0.4647
alternative hypothesis: greater
95 percent confidence interval:
 -0.1439993  1.0000000
sample estimates:
   prop 1    prop 2
0.5081967 0.5000000

The p-value is 0.4647, so we cannot conclude that the percentage is significantly greater in
line A; the sample differences may be due to randomness.

b) From the table in the previous item we see that the percentage of times the system was off
in line A was 50.8%.

c) Similarly, the percentage of times the system was off in line B is 50%.

d) The % of times the system is off is greater for line A than for line B. Nevertheless,
the differences are not large enough to conclude that this is what happens in the whole
population.

Exercise 4.2. We should apply the two-sample proportions test with H0 : pYES ≤ pNO against
H1 : pYES > pNO , where p represents the proportion of times we use line B. Since this is the
second category of the variable line, we reorder the factor levels. Moreover, we choose < in
the alternative hypothesis (the difference NO − YES should be negative).
Since we obtain a p-value of 0.3976, we do not reject the null hypothesis, and thus conclude
that there is no significant evidence that the proportion of times we use line B is greater
when there are breakdowns.

Exercise 4.3. a) • They are independent samples.


• Normality?
> with(Steel, by(consumption, system, shapiro.test))
system: OFF

Shapiro-Wilk normality test

data:  dd[x, ]
W = 0.9798, p-value = 0.4319

-------------------------------------------------------------------
system: ON

Shapiro-Wilk normality test

data:  dd[x, ]
W = 0.9757, p-value = 0.2958


We may then assume that both samples follow a normal distribution.
• Homoscedasticity?

F test to compare two variances

data:  consumption by system
F = 0.9125, num df = 58, denom df = 57, p-value = 0.7292
alternative hypothesis: true ratio of variances is not equal to 1
95 percent confidence interval:
 0.5410492 1.5372658
sample estimates:
ratio of variances
         0.9125423

We may assume that the variances are equal.

From all this we deduce that the adequate test for this problem is the t test for independent
samples, with the option of equal variances.

b) The output is:

Two Sample t-test

data:  consumption by system
t = 2.1912, df = 115, p-value = 0.01523
alternative hypothesis: true difference in means is greater than 0
95 percent confidence interval:
 5.518213      Inf
sample estimates:
mean in group OFF mean in group ON
         146.9241         124.2362

The p-value is 0.01523, so we reject the null hypothesis at the significance level α = 0.05,
and conclude that the mean consumption is significantly greater when the system is off.

c) Although other graphical representations may also be adequate, we are going to make the
box-plot for OFF and ON, since it is the easiest one among those we saw in the first
session.

Exercise 4.4. a) They are paired samples, so we begin by obtaining the difference variable,
which we shall call dif_cc_sc, and apply a normality test to it.
Creation of the difference variable:

Steel$dif_cc_sc <- with(Steel, pr.cc-pr.sc)


Normality test:

Shapiro-Wilk normality test

data:  dif_cc_sc
W = 0.979, p-value = 0.06339

Since we can accept the normality (p-value = 0.06339), we use the paired t test.
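As a console alternative to the R-Commander menus, the paired t test can be run with a call like the following. This is a sketch on simulated paired data; with the Steel data loaded one would use something like `t.test(Steel$pr.cc, Steel$pr.sc, paired = TRUE, alternative = "greater")`.

```r
# Hedged sketch: paired t test on simulated paired measurements
# (stand-ins for pr.cc and pr.sc; names and numbers are illustrative)
set.seed(2)
pr_cc <- rnorm(117, mean = 434, sd = 250)
pr_sc <- pr_cc - rnorm(117, mean = 189, sd = 150)  # paired with pr_cc
res <- t.test(pr_cc, pr_sc, paired = TRUE, alternative = "greater")
res
```

The paired test works on the 117 differences, so its degrees of freedom are n − 1 = 116.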
b) The mean production of continuous casting is 433.93 tons and that of the steel converter
is 244.92 tons.
c) We obtain:

        Paired t-test

data:  pr.cc and pr.sc
t = 5.6404, df = 116, p-value = 6.079e-08
alternative hypothesis: true difference in means is greater than 0
95 percent confidence interval:
 133.4459      Inf
sample estimates:
mean of the differences
               189.0085

With a p-value of 6.079 · 10−8 , we conclude that there is significant evidence that the
mean production of continuous casting is greater than that of the steel converter.
Exercise 4.5. a) The p-value of the Shapiro-Wilk normality test on the data of the production
of the steel converter when the system is off is 0.002512, so we reject the normality of one
of the variables. Since we can only apply the two-sample t test when there is normality,
we must use the two-sample Wilcoxon test instead.
If we consider the hypotheses:

H0 : the production when the system is off is greater than or equal to the production
when it is on, on average
H1 : the production when the system is off is smaller than when it is on, on average

The associated p-value is 0.07351, so in this case there is no significant evidence that
the production, on average, is greater when the system is on than when it is off.
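For reference, the two-sample Wilcoxon (Mann-Whitney) test can also be run at the console. The sketch below uses simulated skewed data standing in for the production under the two states of the system; in the lab the call would be something like `wilcox.test(... ~ system, data = Steel, alternative = "less")` on the corresponding Steel variable.

```r
# Hedged sketch: two-sample Wilcoxon test, used when normality is
# rejected.  Simulated skewed (exponential) data stand in for the
# production with the system OFF / ON.
set.seed(3)
prod_off <- rexp(59, rate = 1/180)  # hypothetical production, OFF
prod_on  <- rexp(58, rate = 1/240)  # hypothetical production, ON
res <- wilcox.test(prod_off, prod_on, alternative = "less")
res$p.value
```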
b) When it is off, 179 and when it is on, 241.
c) We may for instance make a box-plot of the variable pr.ca grouping on the variable
system.
Exercise 4.6. a) The p-value of the normality test for the difference (pr.sc-pr.hbt) is 1.892·
10−7 , so we reject normality. The p-value of the paired Wilcoxon test is smaller than 2.2 ·
10−16 , so there is significant evidence that the average production of the steel converter
is smaller than that of the hot belt train.
b) The sample median of the production of the steel converter is 225 tons and that of the hot
belt train is 8062 tons.



Session 5

χ2 -test of independence and linear correlation

On many occasions we are interested in determining whether two variables are related or not.
We may wonder, for instance, if the salary depends on the type of studies, or if the percentage
of defective components depends on the production line considered. There are quite a few tests
that help us answer these questions. In this session we shall see two of them:

• The χ2 -independence test: it studies the existence of a dependence relationship between
two variables. We shall normally use it when the two variables are qualitative.

• Pearson’s correlation test: it tests the existence or not of a linear relationship between
the two variables. We shall use it on quantitative variables.

In particular, we may be interested in applying the χ2 -independence test also to quantitative
discrete variables, and in that case we shall convert them into factors. On the other hand, if we
want to apply the χ2 -independence test to quantitative continuous variables it is necessary
to bin them first; this, however, will be much rarer in practice.

If we want to study the relationship between one qualitative and one quantitative variable,
we must take into account the number of categories (factor levels) of the qualitative variable. If
there are only two, then we must apply a test for independent samples: the null hypothesis will
be that the mean of the difference is zero, which is related to the absence of a relationship;
the alternative hypothesis will be that the mean of the difference is non-zero, which points
towards the existence of a relationship.
When the qualitative variable takes more than two levels, we must use statistical methods
that lie outside the scope of this course, such as the analysis of variance (ANOVA) or Kruskal-
Wallis test.

Indeed, many of the tests we have seen so far can be considered as tests for the existence
of a relationship between two variables. If we call response variable the one whose behaviour
we want to understand and group variable the one that we assume may have some influence
on the response variable, we can summarize in Table 5.1 the main tests:


GROUP VARIABLE: QUALITATIVE WITH TWO LEVELS — RESPONSE VARIABLE: QUANTITATIVE
Test for the mean of the difference (t and Wilcoxon)
H0 : the mean of the difference is zero. We can regard this as equivalent to the absence
of a relationship between the variables.
Example: Study of whether the mean weight is the same for men and for women.

GROUP VARIABLE: QUALITATIVE WITH MORE THAN TWO LEVELS — RESPONSE VARIABLE: QUANTITATIVE
One-way ANOVA or Kruskal-Wallis
H0 : all groups behave the same way, on average. We can regard this as equivalent to the
absence of a relationship between the variables.
Example: Study of whether the mean weight is the same in Spain, USA and Japan.

GROUP VARIABLE: QUALITATIVE WITH TWO LEVELS — RESPONSE VARIABLE: QUALITATIVE WITH TWO LEVELS
Two-sample proportions test
H0 : the proportions of the response variable are the same for the two levels of the group
variable. We can regard this as equivalent to the absence of a relationship between the
variables.
Example: Study of whether the percentage of smokers is the same for men and women.

GROUP VARIABLE: QUALITATIVE WITH SEVERAL LEVELS — RESPONSE VARIABLE: QUALITATIVE WITH SEVERAL LEVELS
χ2 -independence test
H0 : the probability distribution of the response variable is the same for each level of the
group variable. We can regard this as equivalent to the absence of a relationship between
the variables.
Example: Study of whether the percentage of smokers is the same in Spain, USA and Japan.

GROUP VARIABLE: QUANTITATIVE — RESPONSE VARIABLE: QUANTITATIVE
Pearson’s correlation test
H0 : there is no linear relationship between the variables.
Example: Study about the existence of a linear relationship between height and weight.

Table 5.1: Some tests about the relationship between two variables.

As we have seen before, we can establish a three-step protocol when testing a hypothesis.
In the particular case of the tests we consider in this session, the steps to take are:

1. Select the adequate test for the sample.


If we test the existence or not of independence between the variables, we select the χ2 -
independence test.
If we want to know if there is a linear relationship between two quantitative variables,
then we select Pearson’s correlation test.

2. Establish H0 and H1 for this test.

In both cases, the null and alternative hypotheses are:

H0 : there is no relationship between the variables
H1 : there is a relationship between the variables

Specifically, in the case of the χ2 -independence test, these hypotheses can be written as:


H0 : there is statistical independence between the variables
H1 : there is statistical dependence between the variables

In the case of Pearson’s correlation test, the hypotheses are:

H0 : there is linear independence between the variables (the correlation coefficient is zero)
H1 : there is linear dependence between the variables (the correlation coefficient is non-zero)

Nevertheless, R-Commander also allows us to take as the alternative hypothesis that the
correlation coefficient is positive (large values of one variable go with large values of the
other one) or negative (large values of one variable go with small values of the other one).

3. Interpret the p-value.

A p-value smaller than the significance level indicates the existence of a relationship
between the variables. Otherwise we say that the data do not provide significant evidence
of a relationship.

5.1 Independence
The χ2 independence test allows us to determine if there is a statistical relationship between
two qualitative variables. Note that this test does not indicate the type of relationship, nor
which of the variables influences the other one.
We can see it as a generalization of the two-sample proportions test to several samples. Let
us explain how it works by means of an example:

Example 5.1. Is there a relationship between the existence of breakdowns and the temperature?

Solution: Both variables are qualitative. Temperature has three levels (High, Medium, Low)
while breakdowns has two levels (Yes, No). Since they are qualitative, we use the χ2 -independence
test. We go to

Statistics
→ Contingency tables
→ Two-way table. . .


Select variables breakdowns and temperature
→ Pick Chi-squared test of independence
→ OK

We obtain the following output:

Frequency table:
           temperature
breakdowns High Low Medium
       No    38  24     27
       Yes    8  14      6

        Pearson’s Chi-squared test

data:  .Table
X-squared = 5.1595, df = 2, p-value = 0.07579
The first part of the output is called a contingency table. This is a table in matrix format
that displays the multivariate frequency distribution of the variables. It is often used (as in
this case) to record and analyze the relation between two or more categorical variables. The
numbers of rows and columns correspond to the numbers of levels of each of the variables.
Since the p-value (0.07579) is greater than all the usual significance levels α, we do not reject
the null hypothesis. Hence, there is no significant evidence that the temperature influences
the existence of breakdowns.
The p-value of this test follows from an approximation of the distribution of the statistic to
a chi-squared distribution. In order for this approximation to be good, certain conditions must
hold: specifically, the expected frequencies when H0 is true must not be too small. Usually it is
required that if there are expected frequencies smaller than 5, these should not represent more
than 20% of the cells in the table.
When the condition is not satisfied, the convention is to group categories of the variables
until the problem is solved, and draw our conclusions with the corresponding p-value. If we
don’t do this, and use the current p-value instead, the conclusions are not too reliable.
We can verify this condition by clicking on Print expected frequencies in the Two-way
table.
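The same test can be reproduced at the R console by typing the observed table directly; the frequencies below are copied from the output above, so this sketch is self-contained:

```r
# Chi-squared independence test on the breakdowns x temperature table
# of Example 5.1 (frequencies taken from the output above)
tab <- matrix(c(38, 8, 24, 14, 27, 6), nrow = 2,
              dimnames = list(breakdowns  = c("No", "Yes"),
                              temperature = c("High", "Low", "Medium")))
res <- chisq.test(tab)
res            # X-squared = 5.1595, df = 2, p-value = 0.07579
res$expected   # expected counts under H0, to check the rule of thumb
```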


Example 5.2. In the previous example, we obtain

Expected Counts:
           temperature
breakdowns     High       Low    Medium
       No  34.99145 28.905983 25.102564
       Yes 11.00855  9.094017  7.897436

We see that there are no cells with an expected frequency smaller than 5, and therefore we can
use the p-value to draw our conclusions.
Another possibility when some of the expected frequencies are smaller than 5 is to use the
option Fisher’s exact test in menu Two-way table. In most cases the p-values won’t be too
different, but if they are we should work with the one of the exact test.
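Fisher's exact test is also available at the console. A sketch on the same table follows (here the expected frequencies are all above 5, so the exact and chi-squared p-values should be similar):

```r
# Hedged sketch: Fisher's exact test on the breakdowns x temperature
# table of Example 5.1; fisher.test accepts general r x c tables
tab <- matrix(c(38, 8, 24, 14, 27, 6), nrow = 2,
              dimnames = list(breakdowns  = c("No", "Yes"),
                              temperature = c("High", "Low", "Medium")))
res <- fisher.test(tab)
res$p.value
```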
In addition to the expected frequencies, we can also obtain the row, column or total per-
centages. These tell us the percentage of data in each category relative to the elements of
the row, the column or the whole data set.
Example 5.3. If we select in the previous example the option Percentages of total we obtain,
in addition to the results in the previous example, the following table:
Total percentages:
      High  Low Medium Total
No    32.5 20.5   23.1  76.1
Yes    6.8 12.0    5.1  23.9
Total 39.3 32.5   28.2 100.0

This table tells us for instance that:

• In 32.5% of the hours in the sample there were no breakdowns and the temperature was
high;
• In 12% of the hours in the sample there were breakdowns and the temperature was low;
• In 76.1% of the hours in the sample there were no breakdowns; and
• In 39.3% of the hours in the sample the temperature was high.
If instead we pick the option Row percentages, we obtain:

Row percentages:
           temperature
breakdowns High Low Medium Total Count
       No  42.7  27   30.3   100    89
       Yes 28.6  50   21.4   100    28
which tells us, for instance, that out of the 89 hours in the sample with no breakdowns, in
42.7% of them the temperature was high, in 27% it was low and in the remaining 30.3% it was
medium.
If instead we pick Column percentages we obtain the percentage of hours with and without
breakdowns for each temperature level.


Entering the frequencies directly. If instead of having the data set and selecting the variables
of interest we are already given the contingency table, we can also apply the χ2 test of inde-
pendence to these data. For this, we should consider the option Enter and analyze a two way
table.

Statistics
→ Contingency tables
→ Enter and analyze a two way table

Then, we determine the size of the table, by choosing the number of rows (=number of
different values observed for the first variable) and the number of columns (=number of different
values observed for the second variable), and enter the joint frequencies inside the table. The
rest of the options of the test are the same as before.
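At the console, entering and analyzing a table reduces to building a matrix of joint frequencies and testing it directly; the numbers below are made up for illustration:

```r
# Hedged sketch of "Enter and analyze a two-way table": the joint
# frequencies are typed directly into a matrix (made-up numbers)
counts <- matrix(c(20, 30,
                   25, 15), nrow = 2, byrow = TRUE,
                 dimnames = list(group  = c("A", "B"),
                                 result = c("Pass", "Fail")))
chisq.test(counts)  # for a 2x2 table R applies Yates' continuity correction
```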

5.2 Pearson’s correlation test


The χ2 -independence test is used to verify the independence of two statistical variables. Un-
der certain normality assumptions (and only in those cases), the condition of independence is
equivalent to the lack of correlation: two variables are independent if and only if Pearson’s
correlation coefficient (ρ) is equal to zero. However, in general this coefficient only measures
the linear relationship between the two variables. It is possible then that two variables are
related by means of a non-linear function, even if their correlation coefficient is small.
Thus, we can quantify the strength of the linear relationship between two variables by means
of Pearson’s correlation coefficient. Recall that this coefficient is defined as the ratio between
the covariance of the two variables and the product of their standard deviations, and it takes
values between -1 and 1. The value ±1 indicates a perfect linear relationship (positive when
ρ = 1 and negative when ρ = −1). A correlation close to 0 indicates the absence of a linear
relationship between the two variables.
The problem is then to decide if the observed correlation between two variables is significant
or not. This is done by means of Pearson’s correlation test.
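A console sketch of Pearson's correlation test follows, on simulated data with a strong positive linear relationship; with the Steel data loaded, the test of the next example would be something like `with(Steel, cor.test(CO2, consumption))`.

```r
# Hedged sketch: Pearson's correlation test on simulated data
# (names are illustrative, not the Steel variables)
set.seed(4)
x <- rnorm(117)
y <- 2 * x + rnorm(117, sd = 0.5)  # strong positive linear relationship
res <- cor.test(x, y)              # default: Pearson, two-sided
res$estimate                       # sample correlation coefficient r
res$p.value                        # very small -> reject H0: rho = 0
```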
Example 5.4. In the data set Steel.Rdata we have observations about the emissions of some
gases, as well as about the total production:
TotalProd Total production per hour (steel tons).
NOx Nitrogen mixture emissions per hour (t NO2/h).
CO Carbon monoxide emissions per hour (t/h).
COV Volatile organic component emissions per hour (t/h).
SO2 Sulphur dioxide emissions per hour (t/h).
CO2 Carbon dioxide emissions per hour (t/h).
N2O Nitrogen oxide emissions per hour (t/h).


According to these data, is there a linear relationship between the energy consumption and the
emissions of carbon dioxide?

Solution: Since both variables are quantitative and continuous, we may use Pearson’s correla-
tion test:

Statistics
→ Summaries. . .
→ Correlation test

Select variables CO2 and consumption
→ OK

The output is:

        Pearson’s product-moment correlation

data:  CO2 and consumption
t = 35.1003, df = 115, p-value < 2.2e-16
alternative hypothesis: true correlation is not equal to 0
95 percent confidence interval:
 0.9376074 0.9695667
sample estimates:
      cor
0.9563613

Since the p-value is smaller than 2.2 · 10−16 , it is in particular smaller than all the usual
significance levels α, so we reject the null hypothesis. Thus, we can conclude that there is
significant evidence of a linear relationship between the consumption and the emissions of
CO2.
We also observe that the point estimate of the correlation coefficient between both vari-
ables is 0.9563613, and that with 95% confidence the true value of the correlation coefficient
lies between 0.9376074 and 0.9695667. This also helps us deduce that there is evidence
against the absence of a linear relationship, because the latter means that the correlation
coefficient is zero, and this is incompatible with the confidence interval above.


Since the estimate of the correlation coefficient, 0.9563613, is very close to one and the
p-value is very small, we conclude that the degree of linear relationship between both variables
is very high. Since moreover the estimate is positive and the p-value of the test H0 : ρ ≤ 0
against H1 : ρ > 0 is again smaller than 2.2 · 10−16 (we can obtain it by selecting Correlation
> 0 in the alternative hypothesis), we see that the linear relationship between both variables
is positive, so large consumption goes with large emissions of CO2, and small consumption
with small emissions.

The calculation of the p-value of this test assumes that the joint distribution of both variables
is a bivariate normal distribution 1 . There is no completely satisfactory way of verifying this.
One partial verification can be made by testing the normality of each of the variables separately:
if we reject it in either of the two cases, then we should not apply Pearson’s correlation test when
the sample size is small. In that case we can use a non-parametric correlation coefficient, such
as Spearman’s coefficient.
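A console sketch of Spearman's alternative: only the method argument changes with respect to Pearson's test. The simulated data below are deliberately non-normal but monotonically related (names are illustrative):

```r
# Hedged sketch: Spearman's rank correlation for non-normal data
set.seed(5)
x <- rexp(117)                    # clearly non-normal sample
y <- x^2 + rexp(117, rate = 10)   # monotone but non-linear relationship
res <- cor.test(x, y, method = "spearman")
res$estimate                      # Spearman's rho
```

Because it works on ranks, Spearman's coefficient detects any monotone relationship, not only linear ones.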

Example 5.5. In the previous example the conclusions obtained are valid, because the outcome
of the Shapiro-Wilk normality test on each of the variables is:

Shapiro-Wilk normality test


data: Steel$consumption
W = 0.9884, p-value = 0.4207

Shapiro-Wilk normality test

data: CO2
W = 0.9924, p-value = 0.771

so in both cases we accept the normality.

Quite often we are interested in determining which of the variables in a given set has the
strongest linear relationship with another one. To analyze this, it is usual to represent all the
correlation coefficients in the correlation matrix.

Example 5.6. In the previous example we may be interested in analyzing whether the energy
consumption is linearly related to the emission of CO, CO2 and SO2. In order to obtain the
correlation coefficients and the p-values, we must do the following:

Statistics
→ Summaries
→ Correlation matrix

1 The details of this distribution lie outside the scope of this course.


Pick CO, CO2, consumption and SO2.
→ Pearson product-moment
→ Pairwise p-values for Pearson or Spearman correlations
→ OK

We obtain the following:

Pearson correlations:
                CO     CO2 consumption     SO2
CO          1.0000  0.9442      0.9198  0.0444
CO2         0.9442  1.0000      0.9564 -0.0286
consumption 0.9198  0.9564      1.0000 -0.0076
SO2         0.0444 -0.0286     -0.0076  1.0000

Number of observations = 117

Pairwise two-sided p-values:
                CO     CO2 consumption    SO2
CO                  <.0001      <.0001 0.6347
CO2         <.0001              <.0001 0.7599
consumption <.0001  <.0001             0.9352
SO2         0.6347  0.7599      0.9352

Adjusted p-values (Holm’s method)
                CO     CO2 consumption SO2
CO                  <.0001      <.0001   1
CO2         <.0001              <.0001   1
consumption <.0001  <.0001               1
SO2              1       1           1
The second matrix tells us that there is significant evidence of a linear relationship between
the energy consumption and the emissions of CO and CO2, but not with SO2.2
2 Note that we do not reject the hypothesis of normality for any of these four variables, because the p-
values of the Shapiro-Wilk normality tests are p-value(CO)=0.1485, p-value(CO2)=0.771, p-value(SO2)=0.2773
and p-value(consumption)=0.4207. Thus, it makes sense to interpret the p-values of the Pearson’s correlation
tests.


with a significant linear relationship, the stronger one is that of CO2, because its correlation
coefficient (see the first matrix) is 0.9564 instead of 0.9198.

The study of the correlation is usually a first step when determining the relationship between
the variables, and, from it, predicting the value of one of them given the value of the other. The
corresponding techniques, known as regression models, will be the subject of the next session.

5.3 Exercises
Give a reasoned answer to the following questions, using the database Steel.Rdata.

Exercise 5.1. a) Is there a relationship between the temperature and the state of the over-
heating detection system?

b) In the sample, how many times has the system been OFF and the temperature high? And
how many times has it been ON and the temperature medium?

c) If there were statistical independence between the two variables, how many times would
we expect the system to be OFF and the temperature to be high?

d) Out of all the hours with medium temperature, what is the percentage of hours with the
overheating detection system ON?

Exercise 5.2. a) Without making calculations, what test may we use to analyze the relation-
ship between the consumption and the total production, the χ2 test of independence or
Pearson’s correlation test?

b) Are the requirements necessary to apply the test satisfied?

c) According to the p-value, what do we conclude about the linear relationship between the
consumption and the total production?

d) What is the point estimation of Pearson’s correlation coefficient (ρ)? According to this
value, what do we expect to happen when we increase the total production, that the con-
sumption increases or decreases?

e) According to the confidence interval for ρ we have obtained, is it admissible to consider
that the true correlation coefficient is 0.75?

f) Out of all the gas emissions, how many of them have a significant relationship with the
energy consumption? Which of them has the strongest relationship?

Exercise 5.3. a) Is the existence of breakdowns independent of the shift?

b) What is the proportion of times that there are breakdowns in the night shift? And in the
morning and the afternoon shifts?

Exercise 5.4. Is there a linear relationship between the productions of type 1 and type 2 galva-
nized steel?


5.4 Solutions
Exercise 5.1. a) The p-value of the χ2 -independence test is 0.9471, so there is no statisti-
cal evidence that the system is ON more frequently depending on the temperature. The
conclusions of this test are reliable because the expected frequencies are:

> .Test$expected # Expected Counts


system
temperature OFF ON
High 23.19658 22.80342
Low 19.16239 18.83761
Medium 16.64103 16.35897

and all of them are greater than or equal to 5.

b) In the sample, on 24 occasions the temperature was high and the system was OFF, and on
17 occasions the system was ON and the temperature was medium.

c) If there were statistical independence between both variables, we would expect to have
23.19658 hours with high temperature and the system OFF out of the 117 in the sam-
ple.

d) Out of the hours with medium temperature, in 51.5% the overheating detection system
was ON.
Exercise 5.2. a) Pearson’s correlation test, because they are quantitative and continuous.

b) The p-value of Shapiro-Wilk normality test for consumption is 0.4207 and for TotalProd
is 0.8543, so in both cases we accept the normality. These are the minimum requirements
to be able to apply Pearson’s correlation test.

c) The p-value of Pearson’s correlation test is smaller than 2.2·10−16 , so there is significant
evidence that ρ ̸= 0, i.e., of the existence of a linear relationship between the consumption
and the total production.

d) The point estimate of Pearson’s correlation coefficient is r = 0.9496154. Since this value
is positive, we expect the consumption to increase as the total production increases.

e) The 95% confidence interval for ρ is (0.9280690, 0.9648255). Thus, a value of ρ equal
to 0.75 is not admissible.

f) All of them except SO2 have a significant relationship, because the matrix of p-values of
Pearson’s correlation tests is

                CO     CO2 consumption     COV    N2O    NOx    SO2
CO                  <.0001      <.0001  <.0001 <.0001 <.0001 0.6347
CO2         <.0001              <.0001  <.0001 <.0001 <.0001 0.7599
consumption <.0001  <.0001              <.0001 <.0001 <.0001 0.9352
COV         <.0001  <.0001      <.0001         <.0001 <.0001 0.7414
N2O         <.0001  <.0001      <.0001  <.0001        <.0001 0.9398
NOx         <.0001  <.0001      <.0001  <.0001 <.0001        0.1753
SO2         0.6347  0.7599      0.9352  0.7414 0.9398 0.1753

and we can use these values because the p-values provided by Shapiro-Wilk normality test
are:

Shapiro-Wilk normality test


data: CO
W = 0.9831, p-value = 0.1485

Shapiro-Wilk normality test


data: CO2
W = 0.9924, p-value = 0.771

Shapiro-Wilk normality test


data: consumption
W = 0.9884, p-value = 0.4207

Shapiro-Wilk normality test


data: COV
W = 0.9944, p-value = 0.9229

Shapiro-Wilk normality test


data: N2O
W = 0.9922, p-value = 0.7518

Shapiro-Wilk normality test


data: NOx
W = 0.9797, p-value = 0.07302

Shapiro-Wilk normality test


data: SO2
W = 0.9862, p-value = 0.2773

so we can accept that all the variables under study follow a normal distribution. In fact,
the large sample size (n=117) already allows us to draw reliable conclusions by means of
Pearson’s correlation test.
The one with the strongest relationship is CO2, because it is the one with the greatest cor-
relation coefficient in absolute value. The correlation matrix is:


                CO     CO2 consumption    COV    N2O     NOx     SO2
CO          1.0000  0.9442      0.9198 0.9950 0.8196  0.5195  0.0444
CO2         0.9442  1.0000      0.9564 0.9650 0.8540  0.5685 -0.0286
consumption 0.9198  0.9564      1.0000 0.9334 0.8274  0.5384 -0.0076
COV         0.9950  0.9650      0.9334 1.0000 0.8359  0.5344  0.0308
N2O         0.8196  0.8540      0.8274 0.8359 1.0000  0.5317  0.0071
NOx         0.5195  0.5685      0.5384 0.5344 0.5317  1.0000 -0.1262
SO2         0.0444 -0.0286     -0.0076 0.0308 0.0071 -0.1262  1.0000

Exercise 5.3. a) Since both breakdowns and shift are qualitative variables, we apply the χ2
test of independence. We obtain a p-value of 1.527e−15, so there is significant evidence
that the two variables are NOT independent.

b) If the variable breakdowns is in the row, we select row percentages in the options of
the χ2 test of independence. We observe that 100% of the breakdowns happened in the
night shift, and as a consequence 0% of the breakdowns occurred in the morning and
afternoon shifts.

Exercise 5.4. Since both pr.galv1 and pr.galv2 are quantitative variables, we must apply a
correlation test. In order to check whether Pearson’s correlation test is applicable, we first test
the normality of these two variables. We obtain the p-values 0.00957 for pr.galv1 and 0.0000003397
for pr.galv2, so we reject normality. Thus, we cannot apply Pearson’s correlation test, be-
cause this test requires both variables to be normally distributed. If we apply Spearman’s test
we obtain a p-value of 0.127. Since this p-value is greater than 0.05, we conclude that there is
no significant evidence that the two variables are correlated.



Session 6

Linear regression

In many applications in engineering, it is of interest to model the relationships between sets
of variables. For instance, we may want to model the performance of a process as a function
of the temperature and the pressure conditions, the demand as a function of the number of
customers, etc.
In the previous session, we saw how to analyze these relationships, by means of Pearson’s
correlation test. Now, we shall go further, and develop a prediction method: a procedure to
estimate the value of one of the variables as a function of the others, using the experimental
data. The statistical problem is then to obtain the best estimation of the relationship between
the variables.
Thus, in some situations we shall determine which changes should be made in the controlled
variables in order to optimize the results in another variable which is non-controlled. For
instance, we may determine at which temperature and pressure we obtain the best performance,
etc.
Linear regression analysis is the statistical technique used to study and model the relation-
ship between quantitative variables. In most cases we do not know the true prediction equation
and we must find an adequate approximation. We say adequate here because usually we will have
no means of determining the optimality of the approximation. In addition to determining a
well-fitting model, we shall also look for a model which is as simple as possible.
In any of the models we distinguish the variables according to their role. Often there is only
one response variable, which cannot be controlled in the experiment. This response variable
depends on one or more explanatory variables. The case where we have only one response
variable and one explanatory variable is called simple linear regression, and is the one we shall
consider in this session. More complex models, with several explanatory variables (multiple
linear regression) are outside the scope of this course.
The process in a regression model can always be summarized in the following steps:

a) Make an assumption about the type of relationship between the response and the ex-
planatory variables.

b) Use the data to estimate the parameters in the model.

c) Diagnose the adequacy of the model.

d) Use the model to make estimations and predictions, in case it has been deemed adequate.


We shall focus in this session on linear regression. It refers to a model where the conditional
mean of Y given the value of X is a linear function of X.
Linear regression was the first type of regression analysis to be studied rigorously, and to
be used extensively in practice. The reason is that this type of model is easier to treat
mathematically, and moreover the statistical properties of the estimators involved are easier to
determine.
Linear regression can be used to fit a predictive model to an observed data set of y and x
values. After developing such a model, if an additional value of X is then given without its
accompanying value of Y , the fitted model can be used to make a prediction of the value of Y .
Given a variable Y and a number of variables X1 , . . . , Xp that may be related to Y , linear
regression analysis can also be applied to quantify the strength of the relationship between Y
and the Xj , to assess which Xj may have no relationship with Y at all, and to identify which
subsets of the Xj contain redundant information about Y .
Linear regression models are often fitted using the least squares approach; there are, nonethe-
less, other possibilities that optimize the fit with respect to other criteria. Thus, while the
terms “least squares” and “linear model” are closely linked, they are not equivalent.
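As a preview of what follows, a least-squares fit can be sketched at the console as below. The data are simulated for illustration; in this session's examples one would instead fit something like `lm(N2O ~ CO2, data = Steel)`.

```r
# Hedged sketch: simple linear regression fitted by least squares
# on simulated data (names and coefficients are made up)
set.seed(6)
x <- rnorm(117, mean = 10)
y <- 1.5 + 0.8 * x + rnorm(117, sd = 0.3)  # true line plus noise
fit <- lm(y ~ x)
coef(fit)               # estimated intercept and slope
summary(fit)$r.squared  # proportion of variance explained
```

With little noise around the true line, the estimated slope should land close to 0.8 and the R-squared close to 1.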

6.1 Step 1: Search for a model


The easiest model of the relationship between two variables is a linear function. Out of all the
possible ones, we must select the optimal one according to some criterion. In order to do this, we
usually study the existence of a linear relationship between the response variable and all the
possible explanatory variables, and estimate the linear correlation coefficients between them.
A graphical representation that is usually made within this study is the scatterplot.

Example 6.1. Assume we want to predict the value of variable N2O as a function of the other
emissions of the factory (CO, CO2, NOx and SO2). In order to do this, we are going to study
which of them is the best explanatory variable.

Solution: We begin by obtaining the correlation matrix:

Statistics
→ Summaries
→ Correlation matrix

Select CO, CO2, N2O, NOx and SO2.
→ Pearson product-moment
→ Pairwise p-values for Pearson or Spearman correlations
→ OK


The output is:


Pearson correlations:
         CO      CO2     N2O     NOx     SO2
CO   1.0000  0.9442  0.8196  0.5195  0.0444
CO2  0.9442  1.0000  0.8540  0.5685 -0.0286
N2O  0.8196  0.8540  1.0000  0.5317  0.0071
NOx  0.5195  0.5685  0.5317  1.0000 -0.1262
SO2  0.0444 -0.0286  0.0071 -0.1262  1.0000

Number of observations: 117

Pairwise two-sided p-values:
        CO     CO2    N2O    NOx    SO2
CO            <.0001 <.0001 <.0001 0.6347
CO2   <.0001         <.0001 <.0001 0.7599
N2O   <.0001  <.0001        <.0001 0.9398
NOx   <.0001  <.0001 <.0001        0.1753
SO2   0.6347  0.7599 0.9398 0.1753

Adjusted p-values (Holm’s method)
        CO     CO2    N2O    NOx    SO2
CO            <.0001 <.0001 <.0001 1.0000
CO2   <.0001         <.0001 <.0001 1.0000
N2O   <.0001  <.0001        <.0001 1.0000
NOx   <.0001  <.0001 <.0001        0.7011
SO2   1.0000  1.0000 1.0000 0.7011


The third row of the first table gives us the correlation coefficients of N2O with the other
variables. The next table gives us the p-values of Pearson’s correlation test. Its third row
shows that all these p-values are almost equal to 0, except for the last one, which is greater than
all the usual significance levels, so we conclude that N2O has a significant linear relationship
with CO, CO2 and NOx, but not with SO2.¹
¹Note that we do not reject the hypothesis of normality for any of the five variables under study, because the
p-values of the Shapiro-Wilk normality test are: p-value(CO)=0.1485, p-value(CO2)=0.771, p-value(N2O)=0.7518,
p-value(NOx)=0.07302 and p-value(SO2)=0.2773. As a consequence, it makes sense to apply Pearson’s correlation
test.

Out of the three variables it has a relationship with, the one with the greatest correlation
coefficient in absolute value is CO2 (0.8540), so we choose this one as an explanatory variable for
N2O.
We can support these conclusions graphically by means of the scatterplot matrix, which we
can obtain as follows:

Graphs → Scatterplot matrix...
Select CO, CO2, N2O, NOx and SO2.
OK

We obtain the following figure:

Out of all these graphs, the interesting ones for our problem are those in the third row,
because the response variable (in our case N2O) is usually plotted on the Y axis while the
explanatory variables appear on the X axis.


Which scatterplot shows the strongest relationship between N2O and the other variables? We
see that there does not seem to be any relationship with SO2, that the relationship with NOx is
weaker and not clearly linear, and that there are strong linear relationships with CO and CO2.

Once the response and explanatory variables have been determined, we make a scatterplot
of these two variables, in order to see if the linear regression model seems adequate in this case.

Example 6.2. Plot the scatterplot of N2O on the Y axis and CO2 on the X axis.

Solution: Follow the instructions below:

Graphs → Scatterplot...
Select CO2 and N2O.
OK

We obtain the following.


The abscissa axis shows the emission of CO2 and the ordinate axis displays the variable N2O.
There are two different lines in this graph. One is the least-squares linear regression line of y
on x; the other is a nonparametric (smoothed) regression line. When both lines are very
similar, the linear model will provide a good fit.

6.2 Step 2: Model estimation


In simple linear regression, the goal is to determine the linear model that best explains the
response variable Y as a function of the explanatory variable X. Hence, we consider a model
of the form
Y = β0 + β1 X + ϵ

To determine the best such model, we use the least squares estimation of the regression
parameters β0 and β1 . The estimates are the values of the parameters minimizing the residual
sum of squares (RSS)

RSS = Σᵢ₌₁ⁿ (yᵢ − (β0 + β1 xᵢ))²,

where n is the sample size (the number of pairs of observations in our sample). Let us see an
example.

Example 6.3. Estimate the emission of N2O as a function of CO2.

Solution: Our goal is to determine the coefficients β0 and β1 in

N2O = β0 + β1 · CO2 + ϵ

where ϵ represents the random error. Follow these instructions:


Statistics → Fit models → Linear regression...
Response variable: N2O
Explanatory variable: CO2
OK

We may give this model a particular name or leave it with the default name. The output
is:

Call: lm(formula = N2O ~ CO2, data = Steel)

Residuals:
    Min      1Q  Median      3Q     Max
-2.2585 -0.7287  0.0404  0.6511  2.9353

Coefficients:
            Estimate Std. Error t value Pr(>|t|)
(Intercept) 1.526865   0.280149    5.45 2.91e-07 ***
CO2         0.043850   0.002491   17.60  < 2e-16 ***
---
Signif. codes: 0 ’***’ 0.001 ’**’ 0.01 ’*’ 0.05 ’.’ 0.1 ’ ’ 1

Residual standard error: 1.111 on 115 degrees of freedom
Multiple R-squared: 0.7293, Adjusted R-squared: 0.7269
F-statistic: 309.8 on 1 and 115 DF, p-value: < 2.2e-16

The Estimate column presents the estimates of the coefficients. We deduce that the
equation of the linear model is

N2O = 1.526865 + 0.043850 · CO2 (6.1)

Thus, β0 is the Intercept and equals 1.526865. This coefficient represents the quantity of
N2O when there is no CO2. As its p-value is 2.91 · 10⁻⁷, this coefficient is significantly
different from zero (H0 : β0 = 0, H1 : β0 ̸= 0).
The estimate of β1 is 0.043850. It is significant, since its p-value is less than 2 · 10⁻¹⁶
(H0 : β1 = 0, H1 : β1 ̸= 0). The interpretation of this value is that, on average, for each unit of
CO2, the N2O increases by 0.043850 units.
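As a check on what lm() computes, the least-squares estimates in simple regression have closed forms: the slope is Sxy/Sxx and the intercept is ȳ − b1·x̄. The sketch below verifies on synthetic data (illustrative values, not the Steel dataset) that lm() returns exactly these values:

```r
# Check: lm() coefficients equal the closed-form least-squares estimates
set.seed(42)
x <- runif(100, 50, 150)                  # synthetic "CO2" values
y <- 1.5 + 0.04 * x + rnorm(100, sd = 1)  # synthetic "N2O" values

fit <- lm(y ~ x)

b1 <- cov(x, y) / var(x)                  # slope:     Sxy / Sxx
b0 <- mean(y) - b1 * mean(x)              # intercept: ybar - b1 * xbar

stopifnot(isTRUE(all.equal(unname(coef(fit)), c(b0, b1))))
```

These are exactly the values that minimize the RSS defined in the previous section.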


6.3 Step 3: Diagnosis


The validity of the inference depends on the assumptions concerning the linear model. It is
required that the errors have zero mean and constant variance and are uncorrelated. Moreover,
the hypothesis tests we have seen and the confidence intervals we shall see require that the
errors follow a normal distribution. Finally, the fit of the regression line should be good.
Before using the model to make predictions and estimations, we must make sure that it is
adequate. This can be done in several ways:

• By means of the adjusted R-squared. This coefficient should be close to 1.

• By means of the p-values of the hypothesis tests on the coefficients, and in particular of
the test associated to β1 . This coefficient should be significantly different from 0.

• By means of the residual analysis. As we said, the residuals should be normal,
uncorrelated, with mean 0 and constant variance. Although a thorough study
of these conditions lies outside the scope of this course, we shall see how to verify whether
they hold. This can be done in R Commander by means of Models → Graphs →
Basic diagnostic plots. Out of the four plots we obtain here, the most important
one is Residuals vs. Fitted. It is a graph where the X axis shows the values fitted
by the regression model and the Y axis shows the residuals of these fits. To verify the
aforementioned conditions by means of this graph, we have that:

– ZERO MEAN: the line of the means should coincide with the X axis, which represents
residuals equal to 0.
– HOMOSCEDASTICITY: this hypothesis is violated when the size of the residuals
increases or decreases systematically with the fitted values.
– LINEARITY: we should not observe any non-linear pattern.

In addition, the residuals should follow a normal distribution, in order to be able to
compute the confidence intervals for the regression coefficients or the predictions. This is
particularly important when the sample size is small. We can analyze this by means of
the Normal Q-Q plot:

– NORMALITY: if the distribution is normal, the points in the Normal Q-Q plot lie
on a line.

Finally, the graph Residuals vs Leverage lets us detect outliers, which may
have a strong influence on the estimation of the parameters.
Example 6.4. Diagnose the model of Example 6.3.
Solution:

• The determination coefficient R² = 72.93% represents the percentage of variation
of the response variable explained by the model. In this case, 72.93% of the variation
of N2O is explained by the emission of CO2.
Together with this value, we also have the adjusted R-squared, which in this case is
Ra² = 0.7269. Since it is not too small, we can consider that the model is adequate, for
the time being.


• The p-values of the coefficients (2.91 · 10⁻⁷ and less than 2 · 10⁻¹⁶) are smaller than the
significance level 0.05, so the coefficients are significantly different from zero.
The ANOVA p-value (last line of the output) coincides, in the case of simple linear
regression, with the p-value of the explanatory variable, so it does not need a separate
analysis in this case.

• To make the diagnostic plots, we use the following instructions:

Models → Graphs → Basic diagnostic plots

We see in the Residuals vs. Fitted plot that the hypothesis of zero-mean residuals
seems admissible, there is no evidence of heteroscedasticity and the hypothesis of
linearity does not seem inadequate.
In the Normal Q-Q plot, the points lie on the straight line y = x. Therefore, the errors
follow a normal distribution.
The Residuals vs. Leverage plot does not show any outlier.
We conclude that this model seems adequate to predict the emission of N2O from the emission
of CO2.

As we said before, the above graphical procedure is only preliminary. In practice, we should
complement it with some hypothesis tests, such as the Breusch-Pagan test to check homo-
scedasticity, the RESET test to verify the hypothesis of linearity and the Bonferroni test to
check for outliers. All these tests can be found under the menu Models → Numerical
diagnostics. We can also verify the normality of the residuals by means of the
Shapiro-Wilk test of normality. In any case, the study of these tests lies outside the scope of
this course, and here we shall only carry out a graphical diagnosis.
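Of these, the Shapiro-Wilk test is available directly in base R as shapiro.test(), applied to the residuals of the fitted model. A self-contained sketch on synthetic data (the variables are illustrative, not the Steel dataset):

```r
# Shapiro-Wilk normality test on the residuals of a fitted linear model
set.seed(3)
x <- runif(80, 0, 10)
y <- 1 + 2 * x + rnorm(80)   # errors generated as normal
fit <- lm(y ~ x)

shapiro.test(residuals(fit)) # a large p-value: do not reject normality
```

Since the errors here are simulated from a normal distribution, we expect a large p-value; a small one would cast doubt on the normality assumption behind the confidence intervals.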

6.4 Step 4: Prediction


The values of the regression line for given values of the explanatory variable can be seen as
predictions of the value of the response variable, or as estimations of its mean.
For both these predictions and estimations, we can also give confidence intervals at a fixed
confidence level (usually 95%).
Example 6.5. Use the model of Example 6.3 to estimate the emission of N2O when the
emission of CO2 is 110 t/h.
Solution: The fastest procedure is to type the following in the script window:

predict(RegModel.1, data.frame(CO2=c(110)))

where RegModel.1 is the name given to the model fitted by lm (which we can see at the top
right of the screen) and where we specify the value of the explanatory variable we want to use
in our prediction.
The output tells us that the point estimate, by means of this linear regression model, of
the emission of N2O in an hour in which the emission of CO2 is 110 tons is 6.350341.
If in addition we want to obtain a prediction interval, we must type

predict(RegModel.1, data.frame(CO2=c(110)), interval="prediction")

The output is

  fit      lwr      upr
1 6.350341 4.140992 8.55969

This means that, with a confidence level of 95%, the emission of N2O lies between
4.140992 t/h and 8.55969 t/h when the emission of CO2 is 110 t/h. The value 6.350341 is the
point estimate of N2O for CO2 = 110 in the linear equation.
In addition, we can add an option of the type level=0.99 in order to modify the confidence
level of the interval, which is 95% by default. We would obtain:

predict(RegModel.1, data.frame(CO2=c(110)), interval="prediction", level=0.99)

With the option interval="prediction" we indicate that we want a confidence interval for
the prediction of the response variable for a certain value of the explanatory variable. If instead
we want to obtain a confidence interval for the mean of the response variable, we must replace
this by interval="confidence".
If we compute

predict(RegModel.1, data.frame(CO2=c(110,100)), interval="confidence")

then


  fit      lwr      upr
1 6.350341 6.145248 6.555434
2 5.911843 5.707193 6.116494

In the first case, the mean of N2O when the emission of CO2 is 110 t/h will lie between 6.145248
and 6.555434 units, with 95% confidence. In the second case we obtain a confidence interval
for the mean when CO2 is 100.
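The difference between the two kinds of interval can be checked numerically: for the same value of the explanatory variable, the prediction interval is always wider than the confidence interval for the mean, because it also accounts for the variability of an individual observation. A sketch on synthetic data (illustrative names, not the Steel dataset):

```r
# Prediction intervals are wider than confidence intervals for the mean
set.seed(11)
x <- runif(100, 50, 150)
y <- 1.5 + 0.04 * x + rnorm(100)
fit <- lm(y ~ x)
new <- data.frame(x = 110)

conf_int <- predict(fit, new, interval = "confidence")  # mean response at x = 110
pred_int <- predict(fit, new, interval = "prediction")  # new observation at x = 110

conf_int[, "upr"] - conf_int[, "lwr"]   # narrower interval
pred_int[, "upr"] - pred_int[, "lwr"]   # wider interval

stopifnot(pred_int[, "upr"] - pred_int[, "lwr"] >
          conf_int[, "upr"] - conf_int[, "lwr"])
```

Both intervals are centered at the same fitted value; only their widths differ.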

6.5 Exercises
Exercise 6.1. Open the database Steel.Rdata.

a) Is there any linear relationship between consumption and SO2? And between consump-
tion and N2O?

b) In order to predict the value of consumption, which of the following explanatory variables
is the most adequate: N2O, SO2 or NOx?

c) Plot the scatterplots of the variable consumption versus the explanatory variable obtained
in the previous point.

d) Fit a linear model to predict consumption from N2O.

e) Estimate and explain the determination coefficient of this model.

f) Is the regression coefficient (the coefficient of the explanatory variable) significantly dif-
ferent from zero?

g) Diagnose the adequacy of the model.

h) How many units does the variable consumption increase per unit of N2O?

i) Predict the energy consumption when the emission of N2O is 6t/h, by means of a point
estimation and a confidence interval with a confidence level of 95%.

j) Predict the average energy consumption when the emission of N2O is 6t/h, by means of a
confidence interval at a 95% confidence level.

Exercise 6.2. a) Compute a new variable, named Y, with the following equation

Y = TotalProd + 20 * TotalProd^2

b) Make a scatterplot of Y and TotalProd.

c) Fit a linear model with Y as the response and TotalProd as the predictor.

d) Check the adequacy of the model.


6.6 Solutions
Exercise 6.1. a) We do not reject normality for any of the variables, because the p-values
of the Shapiro-Wilk test are:

data: Steel$consumption
W = 0.9984, p-value = 0.4207

data: Steel$N2O
W = 0.9922, p-value = 0.7518

data: Steel$NOx
W = 0.9797, p-value = 0.07302

data: Steel$SO2
W = 0.9862, p-value = 0.2772

The correlation matrix is

Pearson correlations:
             consumption     N2O     NOx     SO2
consumption       1.0000  0.8274  0.5384 -0.0076
N2O               0.8274  1.0000  0.5317  0.0071
NOx               0.5384  0.5317  1.0000 -0.1262
SO2              -0.0076  0.0071 -0.1262  1.0000

and the p-values:

             consumption     N2O     NOx     SO2
consumption              <0.0001 <0.0001  0.9352
N2O              <0.0001         <0.0001  0.9398
NOx              <0.0001 <0.0001          0.1753
SO2               0.9352  0.9398  0.1753

It is clear that there is no linear relationship between consumption and the emission
of SO2. There are linear relationships between consumption and both N2O and NOx.

b) The strongest correlation is with N2O.

c) The scatterplot of both variables shows a linear relationship.


d) The linear model is:

consumption = 0.2003 + 22.1556 · N2O.

e) The determination coefficient is R² = 68.46%. Therefore, about 68.46% of the variation
of the energy consumption is explained by the emission of N2O.

f) Yes: in the test H0 : β1 = 0 versus H1 : β1 ̸= 0, the p-value is less than 2 · 10⁻¹⁶, so this
coefficient is significantly different from zero.

g) The determination coefficient (R² = 0.6846; the adjusted R² is Ra² = 0.6819) is not too
close to zero, and the p-values of the tests on the coefficients suggest that this is not a bad
model. Next, we check the residuals:


The Residuals vs Fitted plot shows that the errors are homoscedastic (equal variance).
In the Normal Q-Q plot, the errors seem to follow a normal distribution.

In conclusion, the model seems to fit the data well.

h) We estimate that the consumption grows 22.1556 units per unit of N2O.

i) The point estimate when N2O=6 is 133.1339 tons and the confidence interval at 95% is
(69.28326, 196.9846).

j) At a confidence level of 95%, the average consumption will lie between 127.2474 and
139.0204 tons.

Exercise 6.2. a) Go to: Data → Manage variables in the active data set → Compute new
variable. Then, in the Compute new variable dialog, we fill in:

• New variable name: Y

• Expression to compute: TotalProd + 20 * TotalProd^2

b) The scatterplot of Y and TotalProd is:

c) The model is: Y = −2.396 · 10⁻⁹ + 4.669 · 10⁵ · TotalProd .

d) The adjusted R-squared is 0.9452 (very high).

The p-value for the slope is < 2e-16 (***). As a consequence, β1 is not zero.

Residual plots:


The residuals are not normal, and the Residuals vs Fitted plot shows a parabolic
curve. We see then that, although R² is very high and the test on the coefficients
shows that β1 ̸= 0, we should not use this linear model to predict Y.



Appendix A

Summary of the main hypothesis tests

• Goodness of fit test.

– Shapiro-Wilk normality test, page 41


Statistics → Summaries → Shapiro-Wilk normality test...

• Average of a population.

– Single-sample t-test (normal population), page 41


Statistics → Means → Single-sample t test...
– Single-sample Wilcoxon test (NO normal population), page 44
Statistics → Non-parametric tests → Single sample Wilcoxon test...

• Population proportion, page 45.


Statistics → Proportions → Single-sample proportion test...

• Comparison of proportions, page 56.


Statistics → Proportions → Two-sample proportions test...

• Comparison of variances.

– Two variances F test (normal population), page 58


Statistics → Variances → Two-variances F test...

• Comparison of averages.

– Independent samples t test (normal population, independent samples), page 61


Statistics → Means → Independent samples t-test...
∗ Equal variances
∗ Not equal variances
– Paired t-test (normal population, paired data), page 64
Statistics → Means → Paired t test...
– Two-sample Wilcoxon test (NO normal populations, independent samples), page 67
Statistics → Nonparametric tests → Two-sample Wilcoxon test...


– Wilcoxon paired test (NO normal population, paired data), page 68


Statistics → Nonparametric tests → Paired samples Wilcoxon test...
• Independence and correlation test.

– Chi-Square Test for Independence, page 75


Statistics → Contingency tables → Two-way table... → Chi-squared test of
independence
– Pearson’s correlation test, page 78
Statistics → Summaries → Correlation test...



Appendix B

Information on the Data Sets

B.1 Steel company


In order to analyze the energy consumption of a steel company, we have inspected the
production of the company at randomly chosen times. The inspection consists in registering
the most relevant values. In all, we have 117 observations (117 working hours) of the following
variables:

consumption Energy consumption of the company (megawatts/hour).

pr.hbt Production of the hot belt train (steel tons).

pr.cc Production of continuous casting (steel tons).

pr.sc Production of the steel converter (steel tons).

pr.galv1 Production of type I galvanized steel (steel tons).

pr.galv2 Production of type II galvanized steel (steel tons).

pr.paint Production of painted panels (steel tons).

line Line of production used (A or B).

shift Shift when data were collected: morning (M), afternoon (A), night (N).

temperature Temperature of the system: High, Medium and Low.

breakdowns Existence of breakdowns (Yes, No).

nbreakdowns Number of breakdowns detected.

system Activation of the overheating detection system: ON, OFF.

TotalProd Total production per hour (steel tons).

NOx Nitrogen mixture emissions per hour (tons/h).

CO Carbon monoxide emissions per hour (tons/h).

COV Volatile organic component emissions per hour (tons/h).


SO2 Sulphur dioxide emissions per hour (tons/h).

CO2 Carbon dioxide emissions per hour (tons/h).

N2O Nitrous oxide emissions per hour (tons/h).

B.2 Data on 1st year students of Faculty of Engineering


The dataset provides the results of a questionnaire filled in by the students of the first year of
the Faculty of Engineering. The variables in this dataset are:

1. Group: group of the student

2. Sex: sex of the student (Female or Male)

3. Age: age of the student (in years)

4. Siblings: Number of siblings of the student (excluding himself or herself)

5. DrivingLicense: variable indicating if the student has a driving license (Yes or No)

6. PublicTransport: variable indicating if the student uses public transport regularly to
come to the Campus (Yes or No)

7. Distance: Distance between the campus and his/her home

8. TimeArriving: Time (in minutes) the student takes to reach the Campus from his/her
home.

9. TimeinCampus: Time (in hours) the student spends at the Campus per week

10. Networks: Time (in hours) the student spends on social networks (Facebook, Twitter, etc.)
and on messenger programs (MSN, Yahoo Messenger, etc.), in a regular week

11. TV: Time (in hours) the student spends watching television or playing computer games,
in a regular week

12. StudyMontoFri: Time (in hours) the student spends studying between Monday and Fri-
day, in a regular week

13. StudyWeekend: Time (in hours) the student spends studying during the weekend, in a
regular week

14. CallReceived: Duration (in minutes) of the last call received on his/her mobile phone

15. CallMade: Duration (in minutes) of the last call made from his/her mobile phone

16. Smoking: Are you a smoker?

• Nonsmoker (No)
• Only occasionally (Casual)
• Regularly (Regular)




17. Sports: Do you play sports?

• Never (Never)
• Yes, but not every week (Casual)
• Yes, once or twice a week (1or2)
• Yes, at least three times a week (3ormore)

18. Access method to the University:

• University Entrance Exam (PAU)


• Higher-Level Vocational Training Cycles (Module)
• Graduate (Graduate)
• Other (Other)

19. Mark: University entry mark of the student
