ComputerLabNotes 2024
STATISTICS
Material for the computer lessons
February 2024
Contents
1 Descriptive Statistics 3
1.1 R-Commander . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 3
1.1.1 Installation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 3
1.1.2 Structure . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 7
1.2 Data . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 8
1.2.1 Dataset Steel . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 8
1.2.2 Types of statistical variables . . . . . . . . . . . . . . . . . . . . . . . . . 10
1.2.3 Other types of datasets . . . . . . . . . . . . . . . . . . . . . . . . . . . . 10
1.3 Frequencies . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 11
1.4 Graphs . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 12
1.4.1 Bar chart . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 12
1.4.2 Pie chart . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 14
1.4.3 Histogram . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 14
1.4.4 Box plots . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 15
1.5 Measures of central tendency and dispersion . . . . . . . . . . . . . . . . . . . . 17
1.6 Generation of new variables . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 18
1.6.1 Computing a new variable . . . . . . . . . . . . . . . . . . . . . . . . . . 18
1.6.2 Recoding variables . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 19
1.7 Filters . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 20
1.8 Exercises . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 22
1.9 Appendix . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 27
2 Distribution models 29
2.1 Continuous distributions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 29
2.2 Discrete distributions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 34
2.3 Exercises . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 35
2.4 Solutions to the exercises . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 36
6 Linear regression 87
6.1 Step 1: Search for a model . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 88
6.2 Step 2: Model estimation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 92
6.3 Step 3: Diagnosis . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 94
6.4 Step 4: Prediction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 96
6.5 Exercises . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 97
6.6 Solutions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 98
1.1 R-Commander
Let us introduce the software we shall use in the computer classes. It is called R-Commander,
and it is a graphical interface of the statistical computation environment R.
1.1.1 Installation
With an internet connection, the installation of R-Commander (Rcmdr) in the usual operating
systems is simple.
The next time you need to open the program, it suffices to open R and write library(Rcmdr)
in the console.
Note: If you have some trouble installing the R-Commander package, check your antivirus:
it may be blocking the process.
Option 2:
Alternatively, you may download R-Commander following these steps:
a) From the website http://knuth.uca.es/R/doku.php, open the link Versión X.Y.Z Paquete R-UCA para windows, similarly to what is shown in Figure 1.1 (the version number of R may change; download the newest version unless instructed otherwise).
b) Once you have downloaded the package, execute it to proceed with the installation, which will open a window similar to that in Figure 1.2.
c) Once the installation is completed, look for the link Rterm in the list of programs of the start menu. When you execute it, you should obtain a DOS window that will also start R-Commander (Figure 1.3).
Note that by default, R uses the language from the Windows version of your computer.
However, it is possible to change the language to English (en), French (fr), Italian (it),
etc.
In order to change the language to English, we should type in the R Console window:
Sys.setenv(LANGUAGE="en")
Press return.
In this manner you can install the latest R version available at R-UCA. This version may differ from the one installed on the computers at the School of Engineering.
The next time you want to use the program, it suffices to repeat this last step.
Note: A list of problems that may appear during the installation process is collected here,
along with their possible solutions:
In order to do that, open the Terminal (usually located in Applications → Utilities), write R and press Enter. This will open the R program in the Terminal window. To launch the Rcmdr package, write library(Rcmdr) and press Enter.
1.1.2 Structure
The R-Commander window has the following parts: menu, active parts (data and models), instructions, results, and messages (Fig. 1.5).
• Packages → Load package.
1.2 Data
1.2.1 Dataset Steel
In order to analyze the energy consumption of a steel company, we have inspected the production
of the company. The inspection consists in registering the most relevant values during several
working hours selected randomly.
Answer: To open a database, we must go to the submenu Data; and if we want to use a file
with the R format (.rda, or .RData), we must select Load data set.
Data
↪ Load data set
Example 1.2. Identify the number of variables and the number of observations in the database
Steel.
Answer: There are several ways to proceed. The easiest consists in viewing the database.
Data set
↪ We select Steel (if there were many)
↪ View data set
The values taken by a statistical variable are called modalities. If they are numerical quantities,
the variable is called quantitative (for instance the speed, age, time, etc); if they are names
(labels, categories, levels, etc), the variable is called qualitative or a factor.
When working with a statistical variable within R-Commander, it is important to know if
it is a quantitative or a qualitative variable, because some procedures can only be applied to
one of the two types. For instance, we can only make a bar chart with qualitative variables; in
case the variable is quantitative and discrete, we should first turn it into a factor.
It is also possible to create our own data set directly in R-Commander. In order to do this, we must go to Data → New data set. As in our previous example with the dataset Steel, in a data
set each column represents a variable and each row represents an element in the sample. We
can enter these values directly by filling in the elements of the grid.
Data
↪ We select New data set
↪ and enter the name of the data set
In addition, we may also import data sets with a similar grid structure from other programs, such as Excel and SPSS, into R-Commander:
Data
↪ Import data
↪ Select the appropriate type
Finally, we can save our changes by following the path Data → Active data set → Save active data set.
1.3 Frequencies
Let us see how to obtain the frequencies of the different values of a statistical variable.
Example 1.3. Determine the frequencies of the statistical variable breakdowns.
Answer: We proceed in the following way:
Statistics
↪ Summaries
↪ Frequency distributions
counts: breakdowns
No Yes
89 28
percentages: breakdowns
No Yes
76.07 23.93
Hence, we have obtained the absolute and relative frequencies for the different values of the
variable within the sample.
Example 1.4. Obtain the frequency distribution of the statistical variable nbreakdowns.
Answer: In this case, because it is a quantitative statistical variable, R considers it by default to be continuous, and it does not provide the absolute and relative frequencies.¹ To be able to determine the frequencies, we must create a new qualitative variable with these data.
¹ Because for a continuous statistical variable the absolute frequency of each of its values will usually be 1.
Data
↪ Manage variables in the active data set...
↪ Convert numeric variables to factors
There are two possibilities. When turning a quantitative variable into a qualitative one, the
most convenient one is to use the same values as categories:
On the other hand, take into account that if for New name we use the default option <same as variables>, we lose the quantitative character of the variable. This means that if we later want to obtain descriptive measures of the data (for instance the mean), we must specify a new name different from the original one (in this case, nbreakdowns).
Once this is done, we proceed as in the previous case. We obtain the following output:
counts: nbreakdowns
0 1 2 3 4
89 2 9 9 8
percentages: nbreakdowns
0 1 2 3 4
76.07 1.71 7.69 7.69 6.84
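R-Commander builds these tables with the base functions table() and prop.table(). As a sketch, reconstructing nbreakdowns from the counts above (the actual values live in the Steel file, which is not reproduced here) yields the same output:

```r
# Rebuild the 117 observations of nbreakdowns from the frequency table above
nbreakdowns <- rep(c(0, 1, 2, 3, 4), times = c(89, 2, 9, 9, 8))

counts <- table(nbreakdowns)             # absolute frequencies
percentages <- 100 * prop.table(counts)  # relative frequencies (in %)
round(percentages, 2)                    # 76.07 1.71 7.69 7.69 6.84
```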
1.4 Graphs
1.4.1 Bar chart
Example 1.5. Represent variable breakdowns by means of a bar chart.
Answer: Since this is a qualitative statistical variable, bar charts are an adequate graphical
representation. They can be obtained with the menu Graphs; specifically,
Graphs
↪ Bar graph
Note that we can also modify the labels of the two axes by filling in the options under Plot labels: in this way, we can give a different label to the x-axis, the y-axis, and the graph.
With this procedure we obtain the following bar chart:
[Bar chart of the variable breakdowns (avería): categories No and Sí (Yes) against their frequencies (Frecuencia, 0–80).]
Answer: Recall that R considers that all quantitative statistical variables are continuous and, as a consequence, it does not allow us to make this representation. We must therefore first create a new qualitative variable with the same values, as we did in Example 1.4.
Once this is done, we obtain the bar chart in the same way as in Example 1.5:
Graphs
↪ Bar graph
[Bar chart of the number of breakdowns (Número de averías, values 0–4) against their frequencies (Frecuencia, 0–80).]
Answer: Pie charts are one of the options of the menu Graphs; specifically,
Graphs
↪ Pie chart
↪ Select variable breakdowns
↪ Accept
1.4.3 Histogram
Example 1.8. Obtain the histogram of variable consumption.
Graphs
↪ Histogram...
[Histogram of acero$consumo (consumption), with frequencies up to 30 on the vertical axis.]
• detect outliers;
Graphs
↪ Box plot...
[Box plot of the variable consumption.]
From this diagram we observe, for instance, that there are no outliers for the variable
consumption in this sample.
Example 1.10. Obtain the box-plots of variable consumption for each level of temperature.
[Box plots of consumption (vertical axis, roughly 50–300) for each level of temperature; the labelled outliers are observations 54, 60, 71, 79, 84, 86, 88 and 106.]
From this plot we can clearly see that the consumption decreases with extreme temperatures. There are also some outliers. To identify them we can click on the corresponding points, provided we activated the option Identify outliers with mouse before drawing the diagram.
Statistics
↪ Summaries
↪ Numerical summaries
This output indicates that the mean is 0.675 breakdowns/hour, with a standard deviation
of 1.29. The number of breakdowns varies from 0 to 4, and at least 75% of the observations do
not present breakdowns. In all we have 117 observations.
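These summaries correspond to base-R functions such as mean(), sd() and quantile(). As a check, rebuilding nbreakdowns from the frequency table of Section 1.3 (a sketch, not the actual Steel file) reproduces the figures quoted above:

```r
# 89 hours with 0 breakdowns, 2 with 1, 9 with 2, 9 with 3, 8 with 4
nbreakdowns <- rep(0:4, times = c(89, 2, 9, 9, 8))

mean(nbreakdowns)            # 0.675 breakdowns/hour (rounded)
sd(nbreakdowns)              # 1.29 (rounded)
range(nbreakdowns)           # 0 4
quantile(nbreakdowns, 0.75)  # 0: at least 75% of the hours have no breakdowns
length(nbreakdowns)          # 117 observations
```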
Statistics
↪ Summaries
↪ Numerical summaries
With this information we conclude that the mean consumption is around 135.68 MWh, with a standard deviation of 56.91 MWh. The minimum consumption is 17.5 MWh and the maximum is 290.72 MWh. In 25% of the data the consumption is no greater than 99.09 MWh; in 50%, no greater than 135.1 MWh; and 25% of the hours consume more than 182.48 MWh.
Data
↪ Manage variables in active data set
↪ Compute new variable...
The symbols for the elementary operations within the cell Expression to compute are the usual ones: +, −, *, / and ^.
For instance, if we want to generate the variable cost from consumption, and the relationship between them is cost = 2.34 · consumption, we should use the expression 2.34*consumption. The name of the variable we are transforming can be either typed or transferred to the cell by double-clicking on its name in the list of Current variables.
It is important to note that if we need to introduce decimals in Expression to compute, as in the example above, we must use a decimal point instead of a comma.
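In the R console the same transformation is a single vectorized expression. The consumption values below are invented for illustration; only the relationship cost = 2.34 · consumption comes from the text:

```r
consumption <- c(17.5, 99.09, 135.1, 182.48, 290.72)  # illustrative values only
cost <- 2.34 * consumption  # decimal point, never a comma
cost[1]                     # 40.95
```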
we should type:
lo:100="Low"
100:200="Medium"
200:hi="High"
We can also recode several variables simultaneously, simply by selecting all of them together.
One interesting instance is that of the dichotomy of a numerical variable. Let us give an
example:
Example 1.13. Define a new variable called Production that takes the value Failure when the
total production is of at most 10000 tons and Success when it is greater.
Answer: We should simply follow the steps of the previous example, but typing now
lo:10000="Failure"
10000:hi="Success"
in Enter recode directives and Production in New variable name or prefix for multiple recodes: inside Recoding variables.
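The same dichotomy can be sketched in plain R with cut(); note that, like the directive lo:10000, the interval (-Inf, 10000] includes 10000 itself. The production totals below are hypothetical:

```r
production <- c(8200, 10000, 10955, 15000)  # hypothetical totals in tons
Production <- cut(production,
                  breaks = c(-Inf, 10000, Inf),  # right-closed: 10000 counts as Failure
                  labels = c("Failure", "Success"))
as.character(Production)  # "Failure" "Failure" "Success" "Success"
```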
1.7 Filters
Similarly to the case of box-plots, we can obtain descriptive measures of a quantitative statistical
variable for each value of a qualitative statistical variable. However, in some cases we only want
to analyze part of the data: those satisfying some condition. Let us see how to make this filtering
of the data.
Example 1.14. Determine the frequency distribution of the variable breakdowns for those cases
where the temperature is high.
Answer: To filter the data with the given condition, we must follow:
Data
↪ Active data set
↪ Subset active data set...
We see a new window called Subset data set. In its upper part we can select a few columns
(variables); usually we shall consider the default option: Include all variables.
Subset expression. In our case, we consider the following:
temperature=="High"
Name of the new data set. The default option is <same as the active data set>. It is
advisable to change it into another one, because otherwise the new data set replaces the one
we had. In our case, we give the new name Steel.temp_high. Note that in the new name we
can use letters, numbers, periods and underscores, but not spaces or hyphens.
After making sure that the active data set is
Data set: Steel.temp_high
we proceed as in Example 1.3 to obtain
counts: breakdowns
No Yes
38 8
percentages: breakdowns
No Yes
82.61 17.39
Observe that the logical condition for equality is == (not =). The following table shows how to write the different logical conditions in R-Commander:
equal (=) ==
different (̸=) !=
smaller or equal (≤) <=
greater or equal (≥) >=
conjunction (and) &
disjunction (or) |
We may also use parentheses to group conditions.
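The menu action corresponds to R's subset() function. A minimal sketch with a made-up miniature of the Steel data set:

```r
# Made-up miniature of the Steel data set, for illustration only
Steel <- data.frame(temperature = c("High", "High", "Low", "Medium"),
                    breakdowns  = c("Yes", "No", "No", "No"))

# Keep only the rows satisfying the logical condition (==, not =)
Steel.temp_high <- subset(Steel, temperature == "High")
nrow(Steel.temp_high)  # 2
```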
On the other hand, when using
Note: If after making a filter we want to work again with our initial data set, we shall
change the active set in the menu Data set.
1.8 Exercises
Exercise 1.1. Represent variable nbreakdowns in a pie chart. Is this an adequate graphical
representation?
Answer: We must convert the variable into a factor first. In this case a bar chart would be
more adequate, because the values represent quantities, and the relationships between them
would not be represented in a pie chart.
Exercise 1.2. Which graphical representation is the most adequate for the hot belt train? Make
this representation using percentages in the Y axis.
Answer: Histogram.
Exercise 1.3. How many of the 117 observations correspond to a high temperature?
Answer: We calculate the frequency table of the temperature and obtain the value 46.
Exercise 1.4. Observe the distribution of the variable production of continuous casting. What is the mean production? And the median?
Solution: We make a histogram to observe graphically the distribution of this variable, and calculate the main descriptive measures of pr.cc by selecting the menu option Numerical summaries. The output is
mean sd IQR 0% 25% 50% 75% 100% n
433.9316 276.8536 406 33 201 380 607 1204 117
The hot belt train has the highest production (10 955 tons).
Exercise 1.7. Which graph is more appropriate to detect outliers in the emissions of SO2?
Represent it. Which outliers do you observe?
Solution: Box plots. To identify the outliers with the mouse, we have to click on the corre-
sponding points in the box-plot.
There are several outliers: clearly observations 68 and 85 are outliers, but the others cannot
be seen clearly. By looking at the output window we observe that the outliers correspond to
data 85, 87, 106 and 68. Their respective emissions are 0.014, 0.002, 0.001 and 0.127.
Exercise 1.8. a) How many data correspond to a medium temperature? And to a high tem-
perature?
d) Which is the most frequent temperature in the sample? And the least frequent?
e) Can we represent the temperature data in a bar chart? And in a pie chart? And in a histogram?
f) Make a pie chart for the variable temperature, with the title “System temperature”.
Solution. In order to answer the first four questions we obtain the frequency distribution of the
temperature:
a) 33. 46.
b) 60.68%
c) 33.
d) The most frequent temperature is High and the least frequent is Medium.
e) Yes, because it is qualitative. For the same reason, we can also use a pie chart. We should not use a histogram, because it is not a quantitative variable.
f) It suffices to write the label ‘System temperature’ in the cell we see when selecting the
option Graph -> Pie chart.
Exercise 1.9. Make a graphical representation of the data about the energy consumption of the
company, so that in the Y axis we have the percentages and with the labels “Consumption” and
“%”. Using this graph, answer the following questions:
a) If this study has been made to determine whether the company is fulfilling the goal of not exceeding a consumption of 400 megawatts/hour, do these data support that hypothesis? What if the goal were a consumption below 200 megawatts/hour?
b) According to these data, can we assume that approximately 40% of the time the consump-
tion is greater than 250 megawatts/hour?
c) Assuming these data come from a stable process, and that they represent the usual behaviour of the company, around which values does the energy consumption of the company usually lie?
a) Yes. No.
b) No.
Exercise 1.10. Make a bar chart of the variable consumption and comment on its properties.
Solution. The program does not allow us to make this graph because it is a quantitative variable. We turn it into a qualitative one (Data → Manage variables in the active dataset → Convert numeric variables to factors) and make the corresponding bar chart. The result is not a useful graph: since the variable is continuous and quantitative, most of the absolute frequencies are equal to 1. It is not an adequate graph for this variable.
Exercise 1.11. The experiment design suggested to have approximately the same number of data
with the overheating detection system on and with the system off. Does the sample correspond
to this design? How many data do we have with the system off? Which percentage of the sample
size does this represent?
Solution. Yes, because more or less we have the same amount of data with the system on and
off (50.43% off and 49.57% on). With the system off we have 59 observations out of 117, which
is 50.43% of the data.
Exercise 1.12. Make a graphical representation of the energy consumption of the company with
two box-plots, one with the data with the overheating detection system on and another one when
it is off. Analyze this graph. How much is the mean consumption in each of these two cases?
Solution. Box-plot: it seems that there is less consumption when the system is on. The mean
consumption in this case is 124.24 and when the system is off, 146.92.
Exercise 1.13. If the production of a new product X can be obtained as the difference between
the total production and the sum of the six productions given (hot belt train, continuous casting,
steel converter, type I and type II galvanized steel and painted panels), how much is the average
production of X?
Solution. First of all we must compute the new variable prodX and then we must compute its
average. We obtain that the mean production of X is 2282.761 tons.
What is the percentage of times when the production has been moderate?
Solution. 17.09%
Exercise 1.15. To analyze the behaviour of the overheating detection system, we consider only
the data when this system is on. For those data, determine:
b) The number of breakdowns that has half of the observations below it and half of them
above it.
e) Make a graphical representation of the line of production used and analyze it.
g) Make a graphical representation of the production of painted panels and analyze it.
b) Median=0
d) 51.72%
e) We may use the bar chart or the pie chart (it is a qualitative variable). We observe
approximately the same number of data in each of the lines.
f) We may use the bar chart or the pie chart (it is a discrete quantitative variable). We have
0 breakdowns almost all of the time, and 1, 2, 3 and 4 breakdowns in similar proportions.
Exercise 1.16. From the sample, can we assure that in less than 25% of the data the consumption is greater than 150?
Solution. No, because the 75th percentile is 182, so the percentage of data greater than 150 is at least 25%.
Exercise 1.17. Calculate the mean, median, mode, range, standard deviation and variance of
the following variables whenever it is possible:
a) Existence of breakdowns.
b) Number of breakdowns.
Solution. a) Since it is a categorical variable, it only makes sense to calculate the mode, which is No.
c) Since it is a continuous quantitative variable, it does not make much sense to calculate the mode. The other measures are: x̄ = 244.92, Me = 225, R = 664, s = 167.53 and s² = 167.53².
1.9 Appendix
Bar graph options
Bar graph with percentages instead of absolute frequencies
barplot(100*table(Steel$temperature)/sum(table(Steel$temperature)),
xlab="temperature", ylab="Percentage")
Adding a title
barplot(table(Steel$temperature), xlab="temperature",
ylab="Frequency", main="Working title")
Pie chart with percentages instead of absolute frequencies and with labels
tabla<-round(table(Steel$temperature)*100/sum(table(Steel$temperature))
*100+0.5)/100
pie(table, labels=paste(levels(Steel$temperature),tabla,"%"), main="temperature",
col=rainbow_hcl(length(levels(Steel$temperature))))
Histogram options
Hist(Steel$consumption, scale="frequency",
breaks="Sturges",col="darkgray",
main="Title",xlab="Consumption",ylab="Frequency")
Distribution models
Quantiles. For a given probability p, it gives the smallest value c such that Pr[X ≤ c] ≥ p (lower tail) or Pr[X > c] ≤ p (upper tail).
Probabilities. In a discrete random variable, this gives the values of the probability mass func-
tion, that is, Pr[X = k] for a given k. For continuous random variables, it gives the tail
probabilities (see next entry).
Tail probabilities. For a given k, it gives the probability Pr[X ≤ k] (lower tail) or Pr[X > k]
(upper tail).
Plot. It makes a graphical representation of the density function (for continuous variables)
or the probability mass function (for discrete random variables), and also that of the
distribution function.
Sample. It generates a random data set of a given size following a certain distribution.
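In the R console these menu operations correspond to the d/p/q/r families of functions. A sketch with the standard normal distribution:

```r
dnorm(0)                            # density at 0 (probability mass function for discrete models)
pnorm(1.96)                         # lower-tail probability P(X <= 1.96)
pnorm(1.96, lower.tail = FALSE)     # upper-tail probability P(X > 1.96)
qnorm(0.975)                        # quantile: 1.959964
rnorm(5)                            # random sample of size 5
curve(dnorm(x), from = -4, to = 4)  # plot of the density function
```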
29
30 SESSION 2. DISTRIBUTION MODELS
↪ Distributions
↪ Continuous distributions
↪ Normal distribution
↪ Normal probabilities
↪ Variable value(s): -0.2
↪ Mean: 0 (default)
↪ Standard deviation: 1 (default)
↪ Lower tail
↪ OK
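The same result can be obtained in the R console, since the menu entry simply calls pnorm():

```r
pnorm(-0.2, mean = 0, sd = 1)  # 0.4207403
```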
(e) It coincides with (d), because since X follows a continuous distribution we have P (X >
7) = P (X ≥ 7).
(f) In this case there are several ways to proceed. One is to take into account that
P (4 < X < 7) = P (X < 7) − P (X ≤ 4).
The first value was obtained in (a). In a similar manner (replacing 7 with 4) we can obtain
P (X ≤ 4) = 0.3085375, whence
P (4 < X < 7) = 0.8413447 − 0.3085375 = 0.5328072.
This last calculation can be made by typing it in the R Script window.
(g) Again, since we have a continuous distribution it holds that
P (4 ≤ X ≤ 7) = P (4 < X < 7),
so the answer is the same as in (f).
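Judging from the tail probabilities quoted above (0.8413447 and 0.3085375), the example's variable appears to be X ∼ N(µ = 5, σ = 2); under that assumption, the whole computation fits in one console line:

```r
# Assumed parameters: mean 5, sd 2 (inferred from the quoted tail probabilities)
pnorm(7, mean = 5, sd = 2) - pnorm(4, mean = 5, sd = 2)  # 0.5328072
```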
Example 2.3. Draw a histogram of a random sample of 10 000 values following a normal distribution with mean µ = −3 and standard deviation σ = 2. Use approximately 50 bars in the histogram.
Solution: Follow these instructions:
↪ Distributions
↪ Continuous distributions
↪ Normal distribution
↪ Sample of a normal distribution
↪ Enter name for data set: NormalSamples
↪ Mean: -3
↪ Standard deviation: 2
↪ Number of samples (rows): 10000
↪ Number of observations (columns): 1
↪ Sample means: disable
↪ Sample sums: disable
↪ Sample standard deviations: disable
↪ OK
Now we only have to make the histogram:
↪ Graphs
↪ Histogram...
↪ Variable (pick one): obs
↪ Number of bins: 50
↪ OK
The shape of the histogram should be similar to that of the normal density, but not exactly the same, because we are using random observations.
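The same sample and histogram can be produced directly in the console; set.seed() is an addition here (not part of the menu dialogue) so that the random sample is reproducible:

```r
set.seed(1)  # added for reproducibility; not in the original instructions
obs <- rnorm(10000, mean = -3, sd = 2)  # 10000 random values from N(-3, 2)
hist(obs, breaks = 50, main = "Histogram of obs")
```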
Example 2.4. Generate 100 random values of an exponential distribution with mean 2.
(b) Since the exponential is a continuous distribution each single outcome has probability 0,
so P (X = 7) = 0.
(c) The answer is the same as in (a), because in a continuous distribution the probability of
a set does not change if we add or remove a point.
(d) The only difference with respect to (a) is that we should choose now Upper tail. The
answer is 0.4965853. We can also obtain this by taking into account that P (X > 7) =
1 − P (X ≤ 7).
(e) It coincides with (d), because since X follows a continuous distribution we have P (X >
7) = P (X ≥ 7).
(f) We have that P (4 < X < 7) = P (X < 7) − P (X ≤ 4), that P (X < 7) = 0.5034147 and
in a similar way we can obtain that P (X ≤ 4) = 0.32968, so P (4 < X < 7) = 0.1737347.
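The tail probabilities quoted in (b)–(f) match an exponential distribution with rate λ = 0.1 (mean 10); assuming that distribution, they can be reproduced with pexp():

```r
# Assumed rate 0.1, inferred from the quoted probabilities
pexp(7, rate = 0.1)                        # P(X < 7)  = 0.5034147
pexp(7, rate = 0.1, lower.tail = FALSE)    # P(X > 7)  = 0.4965853
pexp(7, rate = 0.1) - pexp(4, rate = 0.1)  # P(4 < X < 7) = 0.1737347
```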
Example 2.6. A similar procedure could be made with a Weibull distribution W (2, 3). We
obtain:
P (W (2, 3) ≤ 7) = P (W (2, 3) < 7) = 0.9956798
P (W (2, 3) > 7) = P (W (2, 3) ≥ 7) = 0.0043202
P (4 < W (2, 3) < 7) = P (4 ≤ W (2, 3) ≤ 7) = P (4 ≤ W (2, 3) < 7) = P (4 < W (2, 3) ≤ 7) = 0.1646931
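These Weibull probabilities follow from pweibull() with shape 2 and scale 3:

```r
pweibull(7, shape = 2, scale = 3)                                      # 0.9956798
pweibull(7, shape = 2, scale = 3, lower.tail = FALSE)                  # 0.0043202
pweibull(7, shape = 2, scale = 3) - pweibull(4, shape = 2, scale = 3)  # 0.1646931
```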
(c) As we just said, we do not have that P (X < 7) = P (X ≤ 7). To compute P (X < 7),
we must take into account that in the case of the binomial distribution we have P (X <
7) = P (X ≤ 6). Proceeding as in (a), but with the value 6 instead of 7, we obtain
P (X < 7) = 0.9452381.
(d) To obtain P (X > 7) we proceed as in (a) but choose Upper tail instead of Lower tail.
We obtain P (X > 7) = 0.01229455.
(e) Again we do not have the equality P (X ≥ 7) = P (X > 7) that we had in the continuous
case. One way of solving this would be using that in the case of a binomial distribution
P (X ≥ 7) = P (X > 6). Proceeding as in (d), we obtain P (X > 6) = 0.05476188.
(h) In the case of a binomial distribution B(10, 0.4) the possible values are 0, 1, 2, ..., 10, so
P (X = 2.3) = 0.
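For the binomial B(10, 0.4) of this example, the discrete corrections above translate into pbinom() and dbinom() calls:

```r
pbinom(6, size = 10, prob = 0.4)                      # P(X < 7) = P(X <= 6) = 0.9452381
pbinom(7, size = 10, prob = 0.4, lower.tail = FALSE)  # P(X > 7) = 0.01229455
pbinom(6, size = 10, prob = 0.4, lower.tail = FALSE)  # P(X >= 7) = P(X > 6) = 0.05476188
dbinom(2, size = 10, prob = 0.4)                      # P(X = 2); note P(X = 2.3) = 0
```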
Example 2.8. Determine the 95-th percentile of a Poisson distribution with parameter λ = 3.5.
Solution: Follow these instructions:
↪ Distributions
↪ Discrete distributions
↪ Poisson distribution
↪ Poisson quantiles
↪ Probabilities: 0.95
↪ Mean: 3.5
↪ Lower tail
↪ OK
The output is 7.
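Equivalently, the menu entry calls qpois(); the two adjacent cumulative probabilities show why the answer is 7:

```r
qpois(0.95, lambda = 3.5)  # 7
ppois(6, lambda = 3.5)     # approx. 0.9347 < 0.95
ppois(7, lambda = 3.5)     # approx. 0.9733 >= 0.95, so the 95th percentile is 7
```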
2.3 Exercises
Exercise 2.1. The measurement errors in a machine follow a normal distribution N (0, 2). Com-
pute:
a) The probability that the error is smaller than 1.
c) The value k such that 30% of the times the error is smaller than k.
d) The value k ′ such that 20% of the times the error is greater than k ′ .
Exercise 2.2. If the lifetime of a component follows an exponential distribution with mean 2
years, determine:
d) The guarantee we should give so that at most 40% of the components to be repaired are
in the guarantee period.
Exercise 2.3. If the lifetime of a component follows a Weibull distribution with shape parameter
2 and scale parameter 3, determine the probability that it lasts more than 5 years.
Exercise 2.4. A study with roller bearings has determined that their lifetime (in hundreds of
hours) follows a Weibull distribution with parameters k = 0.4 and λ = 4.
(a) What is the probability that they fail before 160 hours?
(b) Given a batch of 10 roller bearings selected at random, what is the probability that none
of them fails before 160 hours? And the probability that at most one of them fails before
160 hours?
Exercise 2.5. The number of breakdowns in a factory during an 8 hour shift follows a Poisson
distribution with parameter λ = 16.
(a) What is the probability that there are more than 20 breakdowns during a given shift?
(b) What is the probability that the time between two consecutive breakdowns is longer than
1 hour?
a) P (X < 1) = 0.6914625.
c) −1.048801.
d) 1.683242.
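These three answers come from pnorm() and qnorm() with mean 0 and standard deviation 2:

```r
pnorm(1, mean = 0, sd = 2)     # a) P(X < 1) = 0.6914625
qnorm(0.30, mean = 0, sd = 2)  # c) k  = -1.048801
qnorm(0.80, mean = 0, sd = 2)  # d) k' =  1.683242 (20% of the errors are greater)
```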
Exercise 2.2. If X denotes the random variable “lifetime of the component in years”, it follows that X ∼ exp(0.5), because we are told that E(X) = 2. With this, we obtain the following solutions:
a) P (X > 5) = 0.082085.
b) P (X < 6) = 0.9502129.
Exercise 2.3. If X denotes the random variable “lifetime of the component in years”, it follows
that X ∼ W (2, 3) and P (X > 5) = 0.06217652.
Exercise 2.4. If X denotes the random variable “lifetime of a roller bearing in hundreds of
hours”, then X ∼ W (0.4, 4).
(b) The random variable Y :=“number of roller bearings that fail before 160 hours” follows a
binomial distribution B(10, 0.4999988). Then the requested probabilities are:
– P (Y = 0) = 0.0009765859.
– P (Y ≤ 1) = 0.0107424.
Exercise 2.5. (a) If the number of breakdowns per shift follows a Poisson distribution with parameter λ = 16 and we denote this variable X, then we must compute P (X > 20). We can do this using the option Upper tail in the menu Poisson tail probabilities.
We obtain P (X > 20) = 0.131832.
(b) If the number of breakdowns in 8 hours follows a Poisson P(16), then the time in hours
between consecutive breakdowns follows an exponential exp(2). Thus, we must compute
P (exp(2) > 1) = 0.1353353.
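All the numerical answers of Exercises 2.2–2.5 can be reproduced in the console with the tail-probability functions:

```r
pexp(5, rate = 0.5, lower.tail = FALSE)                # Ex. 2.2 a) 0.082085
pexp(6, rate = 0.5)                                    # Ex. 2.2 b) 0.9502129
pweibull(5, shape = 2, scale = 3, lower.tail = FALSE)  # Ex. 2.3:   0.06217652
p <- pweibull(1.6, shape = 0.4, scale = 4)             # Ex. 2.4 a) 0.4999988 (160 h = 1.6 hundreds)
dbinom(0, size = 10, prob = p)                         # Ex. 2.4 b) 0.0009765859
pbinom(1, size = 10, prob = p)                         #            0.0107424
ppois(20, lambda = 16, lower.tail = FALSE)             # Ex. 2.5 a) 0.131832
pexp(1, rate = 2, lower.tail = FALSE)                  # Ex. 2.5 b) 0.1353353
```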
A test is a statistical process used to determine the truth of a hypothesis (the null one). If the sample data are not very plausible when this hypothesis is true, then we reject it. Otherwise, we say that there is not enough significant evidence to reject the hypothesis, and we accept it.
To present the result of a hypothesis test, we use the p-value, which is the smallest significance level for which we reject the null hypothesis H0. Once this is known, we compare the p-value with a particular significance level α (which may be pre-fixed or decided at that moment).
DECISION RULE
p-value < α =⇒ Reject H0
p-value ≥ α =⇒ Accept H0
We usually take α = 0.05.
SESSION 3. ONE SAMPLE TESTS
• Proportion within a population: Is the percentage of hours with high consumption (>
200) less than 1%?
• Comparison of means: Is the mean consumption the same irrespective of whether there
are breakdowns?
• Comparison of proportions: Is the percentage of hours with high consumption the same
irrespective of whether there are breakdowns?
In particular, if we want to test a hypothesis about the mean, we must first of all determine whether the data follow a normal distribution or not; depending on the answer, we use a different test (see Table 3.1).¹
To test if the data come from a normal distribution, we shall use in this course the Shapiro-
Wilk test. For this kind of test, the hypotheses are:
¹ The Wilcoxon one-sample test is only useful when the distribution of the data is symmetrical. Otherwise, we may use different tests which lie outside the scope of this course.
DECISION RULE
p-value < α =⇒ Reject H0 (the distribution is not normal)
p-value ≥ α =⇒ Accept H0
We usually take α = 0.05 .
Example 3.1. Study the normality of the distribution of the variable consumption.
Statistics
↪ Summaries
↪ Test of normality...
↪ Select consumption
↪ OK
Since the p-value (0.4207) is greater than α (α = 0.05 by default) we do not reject the null
hypothesis, and conclude that the data follow a normal distribution.
Since the consumption follows a normal distribution, we can apply a test about the mean. We must use the one-sample t test, whose hypotheses can be of three types:
Statistics
↪ Means
↪ Single-sample t test
Since the p-value (0.0002210) is smaller than α, we reject the null hypothesis (H0); we thus conclude that the mean is different from 120.
Solution: Since we are dealing again with the variable consumption, we already know that it
follows a normal distribution, and we can use the t one sample test. Hence, an adequate test
for this statement is:
H0 : the mean consumption is not smaller than 140
H1 : the mean consumption is smaller than 140
Statistics → Means → Single-sample t test
Since the p-value (0.2065) is greater than α, we do not reject the null hypothesis. Hence,
there is not enough evidence in the sample to assume that the mean is smaller than 140.
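Both of the t tests above are calls to `t.test` in the R Script; a sketch on simulated data (the real consumption values are not reproduced here, so the numbers are hypothetical):

```r
# Hypothetical sample standing in for Steel$consumption
set.seed(1)
consumption <- rnorm(117, mean = 135.7, sd = 56.9)

# Two-sided test of H0: mu = 120 against H1: mu != 120
t.two <- t.test(consumption, mu = 120)

# One-sided test of H0: mu >= 140 against H1: mu < 140
t.less <- t.test(consumption, mu = 140, alternative = "less")
```

The `alternative` argument (`"two.sided"`, `"less"`, `"greater"`) matches the three types of hypotheses listed above.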
Example 3.4. We want to make a test about the mean production of type I galvanized steel.
In order to select the adequate test, we must answer the following question: Does the variable
pr.galv1 follow a normal distribution?
Statistics → Summaries → Test of normality... (select pr.galv1 → OK)
Since the p-value (0.00957) is smaller than α, we reject the null hypothesis; therefore, we
conclude that the variable does not follow a normal distribution.
Example 3.5. Is the mean production of type I galvanized steel less than 400?
Solution: Since there is no normality, we must use the Wilcoxon one sample test. The different
types of hypotheses for the median are:
H0 : Me = 400    H0 : Me ≥ 400    H0 : Me ≤ 400
H1 : Me ≠ 400    H1 : Me < 400    H1 : Me > 400
(two.sided)      (less)           (greater)
We are interested in the following hypothesis:
Statistics → Non-parametric tests → Single-sample Wilcoxon test
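In script form, this menu corresponds to `wilcox.test` with `mu = 400` and `alternative = "less"`. A sketch with hypothetical (skewed, hence non-normal) production data, since the real pr.galv1 values are not reproduced here:

```r
# Hypothetical skewed stand-in for Steel$pr.galv1
set.seed(7)
pr.galv1 <- rexp(117, rate = 1/400)

# Wilcoxon one sample test of H0: Me >= 400 against H1: Me < 400
w <- wilcox.test(pr.galv1, mu = 400, alternative = "less")
w$p.value
```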
Example 3.6. In our example, is the percentage of hours with breakdowns significantly greater
than 10%?
Statistics → Proportions → Single-sample proportion test
OTHER POSSIBILITIES
Test using the binomial distribution. We have performed the previous test using the
default option Normal approximation in Type of test. When the sample size is small, it is
better to use the option Exact binomial. In our example the differences are minimal, and we
obtain again a very small p-value:
data: rbind(.Table)
number of successes = 89, number of trials = 117, p-value = 1.002e-05
alternative hypothesis: true probability of success is less than 0.9
95 percent confidence interval:
0.0000000 0.8242563
sample estimates:
probability of success
0.7606838
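The exact binomial output above can be reproduced in the R Script with `binom.test`, using the counts shown in the output (89 successes in 117 trials); `prop.test` gives the normal-approximation version used by default in R-Commander:

```r
# Counts taken from the output above: 89 "successes" out of 117 hours
# H0: p >= 0.9 against H1: p < 0.9
b <- binom.test(89, 117, p = 0.9, alternative = "less")
b$p.value     # about 1.0e-05, as in the output above

# Normal-approximation counterpart (the R-Commander default)
pn <- prop.test(89, 117, p = 0.9, alternative = "less")
```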
Reordering the factor levels. Another approach to the problem is to reorder the factor
levels and to put YES as the first one.
Data → Manage variables in active data set → Reorder factor levels (we reorder as we please → OK)
Statistics → Proportions → Single-sample proportion test
3.5 Exercises
Exercise 3.1. a) Obtain a confidence interval for the mean consumption at the confidence
levels 1 − α = 90%, 1 − α = 95% and 1 − α = 99%, respectively.
b) Obtain a confidence interval for the proportion of times that line A is used, at the confi-
dence levels 1 − α = 90%, 1 − α = 95% and 1 − α = 99%, respectively.
Exercise 3.2. a) What is the mean consumption of the 117 observations? And its standard deviation?
b) Draw the histogram of the consumption. Does this graph suggest that the data
follow a normal distribution?
c) Test the normality of the distribution of the consumption. What is the p-value? Do we
admit the normality of the data?
d) Using the result of the previous item and the mean and standard deviation of the first one,
• What is the percentage of hours where we expect to have consumption greater than
265 megawatts/hour? And smaller than 99 megawatts/hour? And between 99 and
265 megawatts/hour?
• What is the estimate of the consumption that is exceeded only 2% of the time?
Exercise 3.3. Do these data support the hypothesis that the average consumption is smaller than
130 megawatts/hour?
Exercise 3.4. Do these data support the hypothesis that the average consumption is smaller than
130 megawatts/hour during those hours with high temperature? Draw a box-plot of the variable
consumption for each of the temperatures considered and comment the results.
Exercise 3.5. Can we conclude that the average production of the steel converter is smaller than
260 tons? And different than 250? And different from 240? And different from 180? Calculate
the mean value of the production of the steel converter and comment the results.
Exercise 3.6. Do these data support the hypothesis that the percentage of times line A is used
is greater than 20%?
Exercise 3.7. a) Represent graphically the number of hours in the sample where the overheat-
ing detection system is on and the number of hours when it is off. What is the percentage
of hours when it is on?
b) The acquisition of the overheating detection system is not profitable if, in general, it is used
less than 40% of the time. Using the sample data, can we conclude that the acquisition is
not profitable?
c) A study about this system consists in choosing at random 25 hours of the monthly produc-
tion and determine if the system was on or off during each of them. Assuming that the
population proportion coincides with the estimation obtained in b), what is the probability
that the system was on in exactly 9 of the 25 hours? And in no more than 12 hours? And in
at least 10 hours? And between 10 and 12 hours (both values included)? And between
9.5 and 12.5 hours? And more than 9 and less than 13 hours?
b) Use the single-sample proportions test. Confidence intervals for the proportion of times line A is
used:
Confidence level Interval
1 − α = 90% (0.4457825, 0.5959867)
1 − α = 95% (0.4316194, 0.6097571)
1 − α = 99% (0.4044922, 0.6359495)
Exercise 3.2. a) From the menu option Statistics → Summaries → Numerical summaries:
the mean is 135.6771 megawatts/hour and the standard deviation is 56.90756 megawatts/hour.
b) It seems so, because the histogram looks like the density function of the normal distribution
(bell shaped).
c) The p-value of the Shapiro-Wilk normality test applied to these data is 0.4207. Since it is
clearly greater than the significance level (the usual value is α = 0.05 and the maximum
value is usually α = 0.1), we do not reject the null hypothesis, so there is not enough
evidence against the normality of the random variable “consumption”.
d) From the previous item, we can assume that the variable follows a normal distribution,
and we estimate its mean and standard deviation with the values of the sample mean
and standard deviation. Hence, we assume that X=“consumption”≡ N (135.677, 56.908).
Taking this into account and using Distributions → Continuous distributions → Normal
distribution we can answer the different questions:
• Since P (X > 265) = 0.01152839, the expected percentage of working hours when the
consumption is greater than 265 megawatts/hour is 1.15%. Similarly, since P (X <
99) = 0.2596268, we estimate the percentage of working hours with less than 99
megawatts/hour is around 25.96%.
From this we deduce that the expected percentage of times when the consumption is
between 99 and 265 megawatts/hour is 100 − 1.15 − 25.96 = 72.89%.²
• We are looking for the value c such that P (X > c) = 0.02. With the function Normal
quantiles we obtain c = 252.5517.
² Note here the difference between the probability (expected proportion) and the sample proportion: from
part (a), the first quartile in the sample is 99.09, while the cumulative probability of this value under the
fitted normal distribution is greater than 0.25.
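All of these quantities come directly from `pnorm` and `qnorm` applied to the fitted normal distribution N(135.677, 56.908):

```r
mu <- 135.677; sigma <- 56.908   # sample estimates from part (a)

1 - pnorm(265, mean = mu, sd = sigma)   # P(X > 265), about 0.0115
pnorm(99, mean = mu, sd = sigma)        # P(X < 99), about 0.2596
qnorm(0.98, mean = mu, sd = sigma)      # c with P(X > c) = 0.02, about 252.55
```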
Exercise 3.3. We perform the test H0 : µ ≥ 130 against H1 : µ < 130, because we have shown
in the previous exercise that we may assume the normality of the variable “consumption” and
can, therefore, apply the t one sample test. The output with R is
and since the p-value is 0.8586, we do not reject the null hypothesis: there is no evidence
against assuming that the mean consumption is greater than or equal to 130 megawatts/hour.
Exercise 3.4. In this case we begin by filtering the data and creating a new dataset that we shall
call Steel.temp_high. We already did this in the first session (Section 1.7). Once this is
done, we test the normality of the consumption when the temperature is high. The output of
the Shapiro-Wilk normality test is:
so we cannot assume normality, and we use instead the Wilcoxon one sample test. The output is:
so we reject the null hypothesis and conclude that the consumption when the temperature is high
is smaller than 130 megawatts/hour.
To draw a box-plot, we must first of all consider again the dataset Steel. By doing this and
determining the box-plot for the variable consumption grouped by temperature we obtain:
which clearly shows that the consumption at high temperatures is smaller. We see that the
overall mean consumption is 135.6771 megawatts/hour, while in the case of high temperatures it is
103.5239 megawatts/hour. This explains why in general we cannot conclude that the mean
consumption is smaller than 130, but we can if we restrict ourselves to those hours with high
temperature.
Exercise 3.5. We begin by testing the normality of the data in pr.sc. The output of the Shapiro-
Wilk normality test is:
so we reject the normality of the data. In order to make a test about the average production,
we use the Wilcoxon one sample test. Using the option Statistics → Non-parametric tests →
Wilcoxon single sample test we get the following output:
so we do not reject H0 , and therefore there is no significant evidence against the median being
greater than or equal to 260 tons.
To test if the median is different from 250, we use the menu Statistics → Non-parametric
tests → Wilcoxon single sample test and select (Alternative hypothesis: two-sided,
Null hypothesis: mu = 250). We get a p-value of 0.302, so again we conclude that it is
admissible that the median is 250 tons.
If we test H0 : Me = 240 against H1 : Me ≠ 240, the p-value is 0.6244, so again we accept
H0 .
Exercise 3.6. Yes, because the p-value of the test H0 : p ≤ 0.2 against H1 : p > 0.2, where p
represents the proportion of times line A is used, is smaller than 2.2 · 10−16 . Thus, we reject
H0 and conclude that there is significant evidence that the percentage of times line A is used
is greater than 20%.
Exercise 3.7. a) An adequate graph for the number of hours when the overheating detection
system is on is a bar chart of system. In the frequency table we observe that 49.57265%
of the hours it was on.
b) We test H0 : p ≤ 0.6 against H1 : p > 0.6, because OFF < ON alphabetically and thus p = P(OFF).
The p-value is 0.9827, so there is no evidence that P(OFF) > 0.6 or, equivalently,
that P(ON) < 0.4.
c) If we consider the variable X = “number of hours, among the 25 chosen at random, where
the system is on” and assume that the population proportion coincides with the point
estimation from item a), we obtain that X ≡ B(25, 0.4957). Then,
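The binomial probabilities asked for can be obtained with `dbinom` and `pbinom`, assuming X ≡ B(25, 0.4957):

```r
n <- 25
p <- 0.4957   # point estimate of the population proportion

dbinom(9, n, p)                      # P(X = 9): on in exactly 9 hours
pbinom(12, n, p)                     # P(X <= 12): no more than 12 hours
1 - pbinom(9, n, p)                  # P(X >= 10): at least 10 hours
pbinom(12, n, p) - pbinom(9, n, p)   # P(10 <= X <= 12); since X is integer,
                                     # this same value answers the questions
                                     # about 9.5-12.5 hours and 9 < X < 13
```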
In the previous part we learned how to perform one sample tests. As a summary, we saw how to
carry out the following:
In this part and the next one we shall compare two samples. The outline is the following:
SESSION 4. TWO SAMPLE TESTS
Example 4.1. Is the proportion of hours without breakdowns smaller in line A than in line B?
Solution: Let us follow the usual steps for solving a hypothesis test.
Identify the adequate test for the problem
In this problem we are comparing two proportions, so the adequate test is the two sample
test for equality of proportions.
Determine H0 and H1 for this test
The different types of hypotheses we can consider when comparing proportions are:
H0 : pA = pB    H0 : pA ≥ pB    H0 : pA ≤ pB
H1 : pA ≠ pB    H1 : pA < pB    H1 : pA > pB
(two.sided)     (less)          (greater)
where pA and pB are the proportions in populations A and B, respectively. In our case, sample
A will be given by those data where line=="A" and B, by those with line=="B".
By default, R-Commander assumes that the proportions pA and pB are associated with the first
class in alphabetical order, in this case the value A of the variable line. Hence, our
hypotheses are:
H0 : pA ≥ pB (better in line A)
H1 : pA < pB (worse in line A)
Statistics → Proportions → Two-sample proportions test
We obtain:
2-sample test for equality of proportions without continuity correction
When applying the two sample proportions test, we should be careful with three things:
• Identify correctly the response variable and the group variable; in this respect, note that
the response variable is the object of our study, and we compare its behaviour on the
groups determined by the other variable. Note that the values of these groups usually
appear as subindices in H0 , H1 .
• Verify that the proportion is representing what we are interested in. By default, R
considers that p is the proportion of the first category in alphabetical order. If we are
interested in the other category, we should either reorder the factor levels or express
H0 , H1 in terms of the other category.
• Take into account the order of the categories of the group variable when choosing the
appropriate alternative hypothesis.
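In script form, the two sample proportions test is a call to `prop.test` on the two counts. A sketch with hypothetical counts (the real frequencies of line and breakdowns are not reproduced here):

```r
# Hypothetical counts: hours without breakdowns ("successes") and
# total hours observed in each production line
ok <- c(A = 40, B = 50)
n  <- c(A = 61, B = 56)

# H0: pA >= pB against H1: pA < pB
pt <- prop.test(ok, n, alternative = "less")
pt$p.value
pt$estimate   # sample proportions in lines A and B
```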
the Levene test. Here we shall use the first one: the two variances F test, because we shall
consider this problem only after verifying that the distributions are normal.
Example 4.2. Are the variances of the consumption the same in lines A and B? (assuming
normality)
so we must use
Statistics → Variances → Two-variances F test
A B
1431.355 2034.651
Since the p-value (0.1834) is greater than α we do not reject the null hypothesis. Hence, we
assume that there are no significant differences between the variances in the two populations
(line A and line B).
This test is usually employed as an auxiliary test for the independent samples t test, that
we shall see later. However, in the context of engineering it is interesting by itself, because a
basic strategy for the improvement of quality is determining the causes of variability, in order
to reduce it. For this reason, it is not uncommon to perform tests of homoscedasticity (usually
with alternative hypothesis of the type H1 : σ12 > σ22 or H1 : σ12 < σ22 ) to see if the strategies
that have been adopted to reduce the variability are effective or not.
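The underlying script call is `var.test`, which compares the variances of two normal samples through their ratio; a sketch with simulated data standing in for the consumptions of lines A and B:

```r
# Hypothetical consumption samples for lines A and B
set.seed(3)
cons.A <- rnorm(61, mean = 120, sd = 40)
cons.B <- rnorm(56, mean = 150, sd = 45)

# F test of H0: var_A = var_B against H1: var_A != var_B
ft <- var.test(cons.A, cons.B)
ft$p.value    # if >= alpha, we assume equal variances
ft$estimate   # ratio of the two sample variances
```

For the one-sided alternatives mentioned above (H1: σ1² > σ2² or H1: σ1² < σ2²), add `alternative = "greater"` or `alternative = "less"`.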
independent: they are two samples corresponding to different elements of the popula-
tion. In R-Commander, two independent samples within the same data set have a
dichotomous factor (that is, a factor with two levels) that distinguishes them; the
quantitative variable under study is in one column only. Assume for instance that
we are studying the consumption with and without breakdowns; the values of the
consumption are all in the same column (consumption) and each observation belongs to
one sample or the other depending on the value of the variable breakdowns.
paired: in this case, each individual has a value in each of the two samples. In R-
Commander, the data appear in two different columns.
The following table summarizes the different tests of comparison of means that we shall see in
this subject:
Comparison                 Approximately normal distributions?   Independent?   Test type
Difference of means        Yes                                   Yes            Independent samples t test
Mean of the difference     Yes                                   No             Paired t test
Difference of medians      No                                    Yes            Two sample Wilcoxon test
Median of the difference   No                                    No             Wilcoxon paired test
In the following two subsections we will consider the case where the underlying distributions
are normal, and in the two subsequent ones, the case of arbitrary distributions. In case of
normality, we compare the means of both groups or, equivalently, the mean of the difference.
Solution: We must determine first of all if the data come from a normal distribution. This can
be done in a number of ways.
The fastest procedure is to test the normality by groups (note that this feature is not
available in old versions of R-Commander):
Statistics → Summaries → Test of normality
Another possibility, valid in old versions of R-Commander, is to type the following in the
R Script window:
with(Steel,by(consumption,line,shapiro.test))
Finally, we could also make two filters and create two data sets, corresponding to line A
and B, respectively, and then apply to each of them the Shapiro-Wilk normality test.
Whatever the procedure, we obtain the following output:
line: A
-----------------------------------------------------------------------------
line: B
The p-value for the consumption in line A is 0.1534 and in line B it is 0.2841. In both
cases it is large enough so as not to reject the null hypothesis (we can accept the normality of
the data). The two samples are independent, and as we saw in Example 4.2 the variances are
equal. Taking all this into account, we can apply the Independent samples t test, assuming
equal variances.
Determine H0 and H1 for this test
For the independent samples t test the possibilities for the null and alternative hypotheses
are:
H0 : µ1 = µ2 H0 : µ1 ≥ µ2 H0 : µ1 ≤ µ2
H1 : µ1 ̸= µ2 H1 : µ1 < µ2 H1 : µ1 > µ2
Statistics → Means → Independent samples t test
We obtain:
Since the p-value (< 2.2 · 10−16 ) is again smaller than α, we reject the null hypothesis
and conclude that there are significant differences in the mean consumption between lines A and B.
This was to be expected: if we have significant evidence that the mean consumption
is strictly smaller in line A than in line B, then in particular there is significant evidence that the
two consumptions are different.
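The independent samples t test corresponds to `t.test` with a grouping formula and `var.equal = TRUE`; a sketch on simulated data (the real Steel data are not reproduced here):

```r
# Hypothetical data: consumption measured in two production lines
set.seed(5)
consumption <- c(rnorm(61, mean = 120, sd = 40),
                 rnorm(56, mean = 150, sd = 40))
line <- factor(rep(c("A", "B"), times = c(61, 56)))

# H0: muA = muB against H1: muA != muB, assuming equal variances
tt <- t.test(consumption ~ line, var.equal = TRUE)
tt$p.value
tt$estimate   # the two sample means
```

With `var.equal = FALSE` (the default) R applies the Welch correction for unequal variances instead.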
We start by calculating the variable difference with the option Data → Manage variables in
active data set → Compute new variable with the following input:
This creates a new variable to which we apply the Shapiro-Wilk normality test, in the usual
way: Statistics → Summaries → Shapiro-Wilk normality test. Once we are in Shapiro-Wilk
normality test we select variable difference and obtain the following result:
Since the p-value is 0.3948, we do not reject the hypothesis of normality. Hence, we can
apply the paired t test to the hypotheses
H0 : µcc−pint = 0
H1 : µcc−pint ̸= 0
Statistics → Means → Paired t test
The output is
Paired t-test
and we see that the p-value is 0.0139; this allows us to conclude that there is a significant
difference in the means of pr.cc and pr.pint.
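The same analysis can be scripted with `t.test` and `paired = TRUE`, which is equivalent to a one sample t test on the differences; a sketch with hypothetical paired productions (the real pr.cc and pr.pint values are not reproduced here):

```r
# Hypothetical paired productions measured on the same 117 hours
set.seed(11)
pr.cc   <- rnorm(117, mean = 430, sd = 60)
pr.pint <- pr.cc + rnorm(117, mean = 10, sd = 25)

# Paired t test of H0: mu_diff = 0 against H1: mu_diff != 0
pt1 <- t.test(pr.cc, pr.pint, paired = TRUE)

# Identical to a one sample t test on the differences
pt2 <- t.test(pr.cc - pr.pint, mu = 0)
all.equal(pt1$p.value, pt2$p.value)   # TRUE
```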
Example 4.6. Let us study the average production of type II galvanized steel depending on the
production line.
Solution: We apply the Shapiro-Wilk normality test to each sample of variable pr.galv2 de-
pending on the value of line. To do this, we select first of all those values corresponding to
line A (Figure 4.1).
Next, we apply the Shapiro-Wilk normality test, with the following results:
Shapiro-Wilk normality test
Note also that a faster way to perform this test, instead of making all the filters, would be to
type the following instruction in the R Script:
with(Steel,by(pr.galv2,line,shapiro.test))
Finally, an even faster procedure would be to test the normality of variable pr.galv2 per
group, similarly to what we did in Section 4.3.1.
Now, if we want to compare the averages in both samples, we may consider the following
hypotheses:
We apply then the Wilcoxon two samples test, in the menu (Statistics → Nonparametric
tests → Two-sample Wilcoxon test, Figure 4.3).
The output is
Wilcoxon rank sum test with continuity correction
Solution: First of all, we obtain the difference of these two variables with the menu Data →
Manage variables in the active data set → Compute new variable. We shall denote the new
variable dif.
The output of the Shapiro-Wilk normality test on dif is:
We reject the normality at the significance level α = 0.05. Hence, instead of the paired
t test we are going to apply the Wilcoxon paired test. The null and alternative hypotheses
we consider are:
H0 : Me(X1 − X2 ) = 0 (the average production is the same in both cases)
H1 : Me(X1 − X2 ) ≠ 0 (the average production is different for both variables)
We use the menu Statistics → Nonparametric tests → Paired samples Wilcoxon test and
select the variables as described in Figure 4.4.
The p-value is < 2.2 · 10−16 ≈ 0, smaller than any reasonable significance level α, so we
conclude that the production of the two types of galvanized steel is different on average.
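Both nonparametric comparisons are calls to `wilcox.test`; a sketch with hypothetical data showing the two-sample (independent) and paired forms:

```r
set.seed(13)
# Independent samples: one quantitative column plus a two-level factor
pr.galv2 <- rexp(117, rate = 1/300)
line <- factor(rep(c("A", "B"), length.out = 117))
w.ind <- wilcox.test(pr.galv2 ~ line)         # two sample Wilcoxon test

# Paired samples: two columns measured on the same hours
x1 <- rexp(117, rate = 1/300)
x2 <- rexp(117, rate = 1/350)
w.pair <- wilcox.test(x1, x2, paired = TRUE)  # Wilcoxon paired test
```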
4.4 Exercises
Exercise 4.1. Give a reasoned answer to the following questions:
a) Make a test to determine if the percentage of hours when the overheating detection system
is off is greater in line A than in line B. How much is the p-value? What do we conclude?
b) In line A, what is the percentage of hours when the system was off?
c) In line B, what is the percentage of hours when the system was off?
Exercise 4.2. Is there significant evidence that the proportion of times we use line B is greater
when there are breakdowns?
Exercise 4.3. We want to compare the average consumption when the overheating detection
system is on and when it is off.
b) If the alternative hypothesis is that the mean consumption is greater when the system is
off, what is the p-value? What do we conclude?
c) Make a graphical representation that illustrates the conclusions of the previous point.
Exercise 4.4. We want to compare the average production of continuous casting and of the steel
converter.
b) For this sample, what is the mean production of continuous casting? And of the steel
converter?
c) If the alternative hypothesis is that the mean production of continuous casting is greater
than that of the steel converter, what is the associated p-value? What do we conclude?
Exercise 4.5. a) Is the production of the steel converter greater, on average, when the over-
heating detection system is on than when it is off?
b) What is the sample median of the production of the steel converter when the system
is off? And when it is on?
c) Make a graph where we can compare the production of the steel converter when the over-
heating detection system is off and on.
Exercise 4.6. a) Is the production of the steel converter smaller, on average, than that of the
hot belt train?
b) What is the sample median of the production of the steel converter? And that of the
hot belt train?
system
line OFF ON Total Count
A 50.8 49.2 100 61
B 50.0 50.0 100 56
The p-value is 0.4647, so we cannot conclude that the percentage is significantly greater in line A;
the sample differences may be due to randomness.
b) From the table in the previous item we see that the percentage of times the system was off
in line A was 50.8%.
c) Similarly, the percentage of times the system was off in line B is 50%.
d) The percentage of times the system is off is greater for line A than for line B. Nevertheless,
the differences are not large enough to conclude that this is what happens in the whole
population.
Exercise 4.2. We should apply the two sample proportions test with H0 : pYES ≤ pNO against
H1 : pYES > pNO , where p represents the proportion of times we use line B. Since this is the
second category in the variable line, we reorder the factor levels. Moreover, we choose < in
the alternative hypothesis (the difference NO − YES should be negative).
Since we obtain a p-value of 0.3976, we ACCEPT the null hypothesis, and thus conclude
that there is NO significant evidence that the proportion of times we use line B is greater
when there are breakdowns.
From all this we deduce that the adequate test for this problem is the t test for independent
samples, with the option of equal variances.
sample estimates:
mean in group OFF mean in group ON
146.9241 124.2362
The p-value is 0.01523, so we reject the null hypothesis at the significance level α = 0.05,
and conclude that the mean consumption is significantly greater when the system is off.
c) Although other graphical representations may also be adequate, we are going to make the
box-plot for OFF and ON, since it is the easiest one among those we saw in the first
session.
Exercise 4.4. a) They are paired samples, so we begin by obtaining the difference variable,
which we shall call dif_cc_sc. To this variable we apply a normality test.
Creation of the variable difference:
Normality test:
Since we can accept the normality (p-value = 0.06339), we use the paired t test.
b) The mean production of continuous casting is 433.93 tons and that of the steel converter
is 244.92 tons.
c) We obtain:
Paired t-test
With a p-value of 6.079 · 10−8 , we conclude that there are significant evidences that the
mean production of continuous casting is greater than that of the steel converter.
Exercise 4.5. a) The p-value of the Shapiro-Wilk normality test on the data of the production
of the steel converter when the system is off is 0.002512, so we reject the normality of one
of the variables. Since we can only apply the t two sample test when there is normality,
we must use the Wilcoxon two sample test instead.
If we consider the hypotheses:
The associated p-value is 0.07351, so in this case there is no significant evidence that
the production is, on average, greater when the system is on than when it is off.
b) When it is off, the median is 179 tons, and when it is on, 241 tons.
c) We may for instance make a box-plot of the variable pr.ca grouping on the variable
system.
Exercise 4.6. a) The p-value of the normality test for the difference (pr.sc − pr.hbt) is 1.892 ·
10−7 , so we reject normality. The p-value of the paired Wilcoxon test is smaller than 2.2 ·
10−16 , so there is significant evidence that the average production of the steel converter
is smaller than that of the hot belt train.
b) The sample median of the production of the steel converter is 225 tons and that of the hot
belt train is 8062 tons.
In many occasions we are interested in determining whether two variables are related or not.
We may wonder for instance if the salary depends on the type of studies, or if the percentage
of defective components depends on the production line considered. There are quite a few tests
that help us answer these questions. In this session we shall see two of them:
• Pearson’s correlation test: it tests the existence or not of a linear relationship between
the two variables. We shall use it on quantitative variables.
If we want to study the relationship between one qualitative and one quantitative variable,
we must take into account the number of categories (factor levels) of the qualitative variable. If
there are only two, then we must apply a test for independent samples: the null hypothesis will
be that the difference of the means is zero, which corresponds to the absence of a relationship;
the alternative hypothesis will be that the difference of the means is non-zero, which points
towards the existence of a relationship.
When the qualitative variable takes more than two levels, we must use statistical methods
that lie outside the scope of this course, such as the analysis of variance (ANOVA) or Kruskal-
Wallis test.
Indeed, many of the tests we have seen so far can be considered as tests for the existence
of a relationship between two variables. If we call response variable the one whose behaviour
we want to understand and group variable the one that we assume may have some influence
on the response variable, we can summarize in Table 5.1 the main tests:
SESSION 5. χ2 -TEST OF INDEPENDENCE AND LINEAR CORRELATION
Example: Study of whether the mean weight is the same for men and for women.
Example: Study of whether the mean weight is the same in Spain, USA and Japan.
Example: Study of whether the percentage of smokers is the same for men and women.
Response variable: QUALITATIVE WITH SEVERAL LEVELS. Group variable: QUALITATIVE
WITH SEVERAL LEVELS. Test: χ2 -independence test. H0 : the probability distribution of the
response variable is the same for each level of the group variable; we can regard this as
equivalent to the absence of a relationship between the variables.
Example: Study of whether the percentage of smokers is the same in Spain, USA and Japan.
Table 5.1: Some tests about the relationship between two variables.
As we have seen before, we can establish a three step protocol when testing a hypothesis.
In the particular case of the tests we consider in this session, the steps to take are:
Specifically, in the case of the χ2 -independence test, these hypotheses can be written as:
Nevertheless, R-Commander also allows us to put as the alternative hypothesis that the
correlation coefficient is positive (large values of one variable imply large values of the
other one) or negative (large values of one variable imply small values of the other one).
A p-value smaller than the significance level indicates the existence of a relationship
between the variables. Otherwise, we say that the data do not provide significant evidence
of a relationship.
5.1 Independence
The χ2 independence test allows us to determine if there is a statistical relationship between
two qualitative variables. Note that this test does not indicate the type of relationship, nor
which of the variables influences the other one.
We can see it as a generalization of the two-sample proportions test to several samples. Let
us explain how it works by means of an example:
Example 5.1. Is there a relationship between the existence of breakdowns and the temperature?
Solution: Both variables are qualitative. Temperature has three levels (High, Medium, Low)
while breakdowns has two levels (Yes, No). Since they are qualitative, we use the χ2 -independence
test. We go to
Statistics → Contingency tables → Two-way table...
Entering the frequencies directly. If, instead of having the data set and selecting the variables
of interest, we are given the contingency table directly, we can also apply the χ2 test of inde-
pendence to these data. For this, we should consider the option Enter and analyze a two way
table.
Statistics → Contingency tables → Enter and analyze a two way table
Then, we determine the size of the table, by choosing the number of rows (=number of
different values observed for the first variable) and the number of columns (=number of different
values observed for the second variable), and enter the joint frequencies inside the table. The
rest of the options of the test are the same as before.
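Entering a table by hand is equivalent to calling `chisq.test` on a matrix of joint frequencies in the R Script; a sketch with a hypothetical 3×2 table (the real Steel frequencies are not reproduced here):

```r
# Hypothetical joint frequencies: temperature (rows) vs breakdowns (columns)
freq <- matrix(c(20, 15,
                 30, 12,
                 25, 15),
               nrow = 3, byrow = TRUE,
               dimnames = list(temperature = c("High", "Medium", "Low"),
                               breakdowns  = c("Yes", "No")))

# Chi-squared test of H0: the two variables are independent
ct <- chisq.test(freq)
ct$p.value
ct$expected   # expected frequencies under independence
```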
According to these data, is there a linear relationship between the energy consumption and the
emissions of carbon dioxide?
Solution: Since both variables are quantitative and continuous, we may use Pearson’s correla-
tion test:
Statistics → Summaries → Correlation test
Since the p-value is smaller than 2.2 · 10−16 , it is in particular smaller than all the usual
significance levels α, so we reject the null hypothesis. Thus, we can conclude that there is
significant evidence of a linear relationship between the consumption and the emissions of
CO2.
We also observe that the point estimate of the correlation coefficient between both vari-
ables is 0.9563613, and that the 95% confidence interval for its true value is (0.9376074,
0.9695667), shown in the output as 95 percent confidence interval: 0.9376074
0.9695667. This also lets us deduce that there is evidence against the absence of a
linear relationship, because the latter would mean that the correlation coefficient is zero, which
is incompatible with the confidence interval above.
Since the estimate of the correlation coefficient, 0.9563613, is very close to one and the
p-value is very small, we conclude that the degree of linear relationship between both variables
is very high. Moreover, since the estimate is positive and the p-value of the test H0 : ρ ≤ 0
against H1 : ρ > 0 is again smaller than 2.2 · 10−16 (we can obtain it by selecting Correlation
> 0 in the alternative hypothesis), we see that the linear relationship between both variables
is positive: large consumptions imply large emissions of CO2, and small consumptions imply
small emissions.
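Pearson's correlation test corresponds to `cor.test` in the R Script; a sketch with simulated linearly related data (the real consumption and CO2 values are not reproduced here):

```r
# Hypothetical data: CO2 emissions roughly proportional to consumption
set.seed(17)
consumption <- rnorm(117, mean = 135.7, sd = 56.9)
CO2 <- 0.8 * consumption + rnorm(117, sd = 10)

# Pearson's test of H0: rho = 0 against H1: rho != 0
ct <- cor.test(consumption, CO2)
ct$estimate   # close to 1: strong positive linear relationship
ct$conf.int   # 95% confidence interval for rho

# One-sided version, H1: rho > 0
cor.test(consumption, CO2, alternative = "greater")$p.value
```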
The calculation of the p-value of this test assumes that the joint distribution of both variables
is a bivariate normal distribution¹. There is not a completely satisfactory way of verifying this.
One partial verification can be made by testing the normality of each of the variables separately:
if we reject it in any of the two cases, then we should not apply Pearson’s correlation test when
the sample size is small. In that case we can use a non-parametric correlation coefficient, such
as Spearman’s coefficient.
Example 5.5. In the previous example the conclusions obtained are valid, because the outcome
of the Shapiro-Wilk normality test on each of the variables is:
data: CO2
W = 0.9924, p-value = 0.771
Quite often we are interested in determining which of the variables in a given set has the
strongest linear relationship with another one. To analyze this, it is usual to represent all the
correlation coefficients in the correlation matrix.
Example 5.6. In the previous example we may be interested in analyzing whether the energy
consumption is linearly related to the emissions of CO, CO2 and SO2. In order to obtain the
correlation coefficients and the p-values, we must do the following:
Statistics → Summaries → Correlation matrix.
1
The details of this distribution lie outside the scope of this course.
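The same kind of matrix can be obtained in the R console with the function cor. This is a sketch with simulated data; in the course it is produced through the menu above on the Steel dataset:

```r
# Simulated emissions: CO2 strongly linked to consumption, SO2 unrelated
set.seed(1)
consumption <- rnorm(50, mean = 100, sd = 10)
CO2 <- 2 * consumption + rnorm(50, sd = 3)
SO2 <- rnorm(50)

emissions <- data.frame(consumption, CO2, SO2)
round(cor(emissions), 4)   # matrix of Pearson correlation coefficients
```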
The second matrix tells us that there is significant evidence of a linear relationship between
the energy consumption and the emissions of CO and CO2, but not with SO2 2 . Out of the variables
2
Note that we do not reject the hypothesis of normality for any of these four variables, because the p-values
of the Shapiro-Wilk normality tests are p-value(CO)=0.1485, p-value(CO2)=0.771, p-value(SO2)=0.2773
and p-value(consumption)=0.4207. Thus, it makes sense to interpret the p-values of Pearson’s correlation
tests.
with a significant linear relationship, the strongest one is that of CO2, because its correlation
coefficient (see the first matrix) is 0.9564 instead of 0.9198.
The study of the correlation is usually a first step when determining the relationship between
the variables, and, from it, predicting the value of one of them given the value of the other. The
corresponding techniques, known as regression models, will be the subject of the next session.
5.3 Exercises
Give a reasoned answer to the following questions, using the database Steel.Rdata.
Exercise 5.1. a) Is there a relationship between the temperature and the state of the overheating
detection system?
b) In the sample, how many times has the system been OFF and the temperature high? And
how many times has it been ON and the temperature medium?
c) If there were statistical independence between the two variables, how many times would we
expect the system to be OFF and the temperature to be high?
d) Out of all the hours with medium temperature, what is the percentage of hours with the
overheating detection system ON?
Exercise 5.2. a) Without making calculations, which test should we use to analyze the relationship
between the consumption and the total production: the χ2 test of independence or
Pearson’s correlation test?
b) May we apply Pearson’s correlation test? Check the normality of the two variables.
c) According to the p-value, what do we conclude about the linear relationship between the
consumption and the total production?
d) What is the point estimate of Pearson’s correlation coefficient (ρ)? According to this
value, what do we expect to happen when we increase the total production: does the
consumption increase or decrease?
e) Is a value of ρ equal to 0.75 admissible, according to the confidence interval at 95%?
f) Out of all the gas emissions, how many have a significant relationship with the
energy consumption? Which of them has the strongest relationship?
Exercise 5.3. a) Is there a relationship between the breakdowns and the shift?
b) What is the proportion of times that there are breakdowns in the night shift? And in the
morning and the afternoon shifts?
Exercise 5.4. Is there a linear relationship between the productions of type 1 and type 2 galva-
nized steel?
5.4 Solutions
Exercise 5.1. a) The p-value of the χ2 test of independence is 0.9471, so there is no statistical
evidence that the system is ON more or less frequently depending on the temperature. The
conclusions of this test are reliable because the expected frequencies are:
b) In the sample, on 24 occasions the temperature was high and the system was OFF, and on
17 occasions the system was ON and the temperature was medium.
c) If there were statistical independence between both variables, we would expect to have
23.19658 hours with high temperature and the system OFF out of the 117 in the sample.
d) Out of the hours with medium temperature, in 51.5% the overheating detection system
was ON.
Exercise 5.2. a) Pearson’s correlation test, because both variables are quantitative and continuous.
b) The p-value of the Shapiro-Wilk normality test for consumption is 0.4207 and for TotalProd
is 0.8543, so in both cases we accept normality. These are the minimum requirements
for applying Pearson’s correlation test.
c) The p-value of Pearson’s correlation test is smaller than 2.2 · 10−16, so there is significant
evidence that ρ ̸= 0, i.e., of the existence of a linear relationship between the consumption
and the total production.
d) The point estimate of Pearson’s correlation coefficient is R = 0.9496154. Since this value is
positive, we expect the consumption to increase as the total production increases.
e) The confidence interval at 95% for ρ is (0.9280690, 0.9648255). Thus, a value of ρ equal
to 0.75 is not admissible.
f) All of them have a significant relationship except SO2, because the matrix of p-values of
Pearson’s correlation test is
and we can use these values because the p-values provided by the Shapiro-Wilk normality test
are:
so we can accept that all variables under study follow a normal distribution. In fact,
the large sample size (n=117) already allows us to draw reliable conclusions by means of
Pearson’s correlation test.
The one with the strongest relationship is CO2, because it is the one with the greatest
correlation coefficient in absolute value. The correlation matrix is:
Exercise 5.3. a) Since both breakdowns and shift are qualitative variables, we apply the χ2
test of independence. We obtain a p-value of 1.527 · 10−15, so there is significant evidence
that the two variables are NOT independent.
b) If the variable breakdowns is in the row, we select row percentages in the options of
the χ2 test of independence. We observe that 100% of the breakdowns happened in the
night shift, and as a consequence 0% of the breakdowns occurred in the morning and
afternoon shifts.
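The χ2 test of independence used in parts a) and b) can be reproduced in the R console with chisq.test. This is a sketch on a made-up contingency table, not the actual Steel frequencies:

```r
# Hypothetical table: breakdowns (rows) by shift (columns)
tab <- matrix(c(2, 2, 20,     # hours with breakdowns per shift
                40, 40, 20),  # hours without breakdowns per shift
              nrow = 2, byrow = TRUE,
              dimnames = list(breakdowns = c("Yes", "No"),
                              shift = c("M", "A", "N")))

res <- chisq.test(tab)
res$p.value    # small p-value: evidence against independence
res$expected   # expected frequencies under independence
round(prop.table(tab, margin = 1) * 100, 1)  # row percentages
```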
Exercise 5.4. Since both pr.galv1 and pr.galv2 are quantitative variables, we must apply a
correlation test. In order to check whether Pearson’s correlation test is applicable, we first test
the normality of these two variables. We obtain the p-values 0.00957 for pr.galv1 and 0.0000003397
for pr.galv2, so we reject normality. Thus, we cannot apply Pearson’s correlation test, because
this test requires both variables to be normally distributed. If we apply Spearman’s test
we obtain a p-value of 0.127. Since this p-value is greater than 0.05, we conclude that there is
no significant evidence that the two variables are correlated.
Linear regression
a) Make an assumption about the type of relationship between the response and the ex-
planatory variables.
d) Use the model to make estimations and predictions, in case it has been deemed adequate.
In this session we shall focus on linear regression. This refers to a model where the conditional
mean of Y given the value of X is a linear function of X.
Linear regression was the first type of regression analysis to be studied rigorously, and to
be used extensively in practice. The reason is that these models are easier to treat
mathematically, and moreover the statistical properties of the estimators involved are easier to
determine.
Linear regression can be used to fit a predictive model to an observed data set of y and x
values. After developing such a model, if an additional value of X is then given without its
accompanying value of Y , the fitted model can be used to make a prediction of the value of Y .
Given a variable Y and a number of variables X1 , . . . , Xp that may be related to Y , linear
regression analysis can also be applied to quantify the strength of the relationship between Y
and the Xj , to assess which Xj may have no relationship with Y at all, and to identify which
subsets of the Xj contain redundant information about Y .
Linear regression models are often fitted using the least squares approach; there are, nonetheless,
other possibilities that optimize the fit with respect to other criteria. Thus, while the
terms “least squares” and “linear model” are closely linked, they are not equivalent.
Example 6.1. Assume we want to predict the value of variable N2O as a function of the other
emissions of the factory (CO, CO2, NOx and SO2). In order to do this, we are going to study
which of them is the best explanatory variable.
Statistics → Summaries → Correlation matrix.
Out of the three variables it has a relationship with, the one with the greatest correlation
coefficient in absolute value is CO2 (0.8540), so we choose this one as an explanatory variable for
N2O.
We can support these conclusions graphically by means of the scatterplot matrix, which we
can obtain as follows:
Graphs → Scatterplot matrix
Out of all these graphs, the interesting ones for our problem are those in the third row,
because the response variable (in our case N2O) is usually plotted in the Y axis while the
explanatory variables appear in the X axis.
Which scatterplot shows a stronger relationship between N2O and the other variables? We
see that there does not seem to be any relationship with SO2, nor a linear relationship with NOx,
and that there are strong linear relationships with CO and CO2.
Once the response and explanatory variables have been determined, we make a scatterplot
of these two variables, in order to see if the linear regression model seems adequate in this case.
Example 6.2. Plot the scatterplot of N2O in the Y axis and CO2 in the X axis.
Graphs → Scatterplot
The abscissa axis is the emission of CO2 and the ordinate axis displays the variable N2O.
There are two different lines in this graph. The first one is the linear regression line of y on
x (the best fit obtained by least squares regression), and the other line is a nonparametric
regression fit. When both lines are very similar, the linear model will provide a good
fit.
To determine the best such model, we use the least squares estimation of the regression
parameters β0 and β1 . The estimates are the values of the parameters minimizing the residual
sum of squares (RSS)

RSS = Σ_{i=1}^{n} (yi − (β0 + β1 xi))²,

where n is the sample size (the number of pairs of observations in our sample). Let us see an
example.
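As a quick check with made-up data, one can verify in the R console that the line returned by lm has an RSS no larger than that of any perturbed line:

```r
# Made-up sample
set.seed(1)
x <- runif(30, 0, 10)
y <- 1.5 + 0.04 * x + rnorm(30, sd = 0.5)

fit <- lm(y ~ x)                   # least squares estimates of beta0, beta1
rss_fit <- sum(residuals(fit)^2)   # residual sum of squares of the fit

# Perturbing the estimated intercept can only increase the RSS
b <- coef(fit)
rss_perturbed <- sum((y - (b[1] + 0.1) - b[2] * x)^2)
rss_fit <= rss_perturbed           # TRUE
```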
N2O = β0 + β1 · CO2 + ϵ
Statistics → Fit models → Linear regression
We may give this model a particular name or leave it with the default name. The output
is:
Residuals:
Min 1Q Median 3Q Max
-2.2585 -0.7287 0.0404 0.6511 2.9353
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 1.526865 0.280149 5.45 2.91e-07 ***
Signif. codes: 0 ’***’ 0.001 ’**’ 0.01 ’*’ 0.05 ’.’ 0.1 ’ ’ 1
The Estimate column presents the estimates of the coefficients. We deduce that the
equation of the linear model is

N2O = 1.526865 + 0.043850 · CO2

Thus, β0 is the Intercept and equals 1.526865. This coefficient represents the quantity of
N2O when there is no CO2. As its p-value = 2.91 · 10−7, this coefficient is significantly
different from zero (H0 : β0 = 0, H1 : β0 ̸= 0).
The estimate of β1 is 0.043850. It is significant, since its p-value is less than 2 · 10−16
(H0 : β1 = 0, H1 : β1 ̸= 0). The interpretation of this value is that, on average, for each unit of
CO2, the N2O increases by 0.043850 units.
• The p-values of the coefficients (2.91 · 10−7 , 2 · 10−16 ) are smaller than the significance
level 0.05, so the linear model is not inadequate.
The ANOVA p-value (last line in the output) coincides in the case of simple linear regres-
sion with the p-value of the explanatory variable, so it doesn’t need a separate analysis
in this case.
We see in the Residuals vs. Fitted plot that the hypothesis of zero mean of the residuals
seems admissible, there is no evidence of heteroscedasticity, and the hypothesis of
linearity does not seem inadequate.
In the Normal Q-Q plot, the points lie close to the straight line y = x. Therefore, the errors
appear to follow a normal distribution.
The Residuals vs. Leverage plot does not show any outlier.
We conclude that this model seems adequate to predict the emission of N2O from the emission
of CO2.
As we said before, the above graphical procedure is only preliminary. In practice, we should
complement it with some hypothesis tests, such as the Breusch-Pagan test to check homoscedasticity,
the RESET test to verify the hypothesis of linearity, and the Bonferroni test to
check for the existence of outliers. All these tests can be found under the menu Models →
Numerical diagnostics. We can also verify the normality of the residuals by means of the
Shapiro-Wilk test of normality. In any case, the study of these tests lies outside the scope of
this course, and here we shall only perform a graphical diagnosis.
To obtain a prediction with this model, we type in the R console
predict(RegModel.1, data.frame(CO2=c(110)))
where RegModel.1 is the name that was given to the model by the function lm (which we can see
on the top right of the screen) and where we specify the value of the explanatory variable
we want to use in our prediction.
The output tells us that the point estimate, by means of this linear regression model, of
the emission of N2O in an hour where the emission of CO2 is 110 tons is 6.350341.
If in addition we want to obtain a prediction interval, we must type
predict(RegModel.1,data.frame(CO2=c(110)),interval="prediction")
The output is
fit lwr upr
1 6.350341 4.140992 8.55969
This means that, with 95% confidence, the emission of N2O lies between 4.140992 t/h and
8.55969 t/h when the emission of CO2 is 110 t/h. The value 6.350341 is the point estimate of
N2O when CO2 = 110 in the linear equation.
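The whole prediction step can be reproduced outside R-Commander on a model fitted with lm. This is a sketch with simulated data; RegModel.1 itself only exists inside the R-Commander session:

```r
# Simulated data roughly mimicking the example
set.seed(1)
CO2 <- runif(100, 50, 150)
N2O <- 1.5 + 0.044 * CO2 + rnorm(100, sd = 1)
mod <- lm(N2O ~ CO2)

# Point estimate and 95% prediction interval at CO2 = 110
predict(mod, data.frame(CO2 = c(110)), interval = "prediction")

# Same interval at a higher confidence level
predict(mod, data.frame(CO2 = c(110)), interval = "prediction", level = 0.99)
```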
In addition, we can add an option of the type level=0.99 in order to modify the confidence
level of the interval, which by default is 95%. We would obtain:
predict(RegModel.1,data.frame(CO2=c(110)),interval="prediction",level=0.99)
With the option interval="prediction" we indicate that we want a confidence interval for the
prediction of the response variable for a certain value of the explanatory variable. If instead
we want to obtain a confidence interval for the mean of the response variable, we must replace
this by interval="confidence".
If we compute
predict(RegModel.1,data.frame(CO2=c(110,100)),interval="confidence")
then
In the first case, the mean of N2O when CO2 is 110 t/h will be between 6.145248 and
6.555434 units. In the second case we obtain a confidence interval for the mean when CO2 is
100 t/h.
6.5 Exercises
Exercise 6.1. Open the database Steel.Rdata.
a) Is there any linear relationship between consumption and SO2? And between consump-
tion and N2O?
b) In order to predict the value of consumption, which of the following explanatory variables
is the most adequate: N2O, SO2 or NOx?
c) Plot the scatterplots of the variable consumption versus the explanatory variable obtained
in the previous point.
f) Is the regression coefficient (the coefficient of the explanatory variable) significantly different
from zero?
g) Is the fitted model adequate, according to the determination coefficient and the residual
plots?
h) How many units does the variable consumption increase per unit of N2O?
i) Predict the energy consumption when the emission of N2O is 6t/h, by means of a point
estimation and a confidence interval with a confidence level of 95%.
j) Predict the average energy consumption when the emission of N2O is 6t/h, by means of a
confidence interval at a 95% confidence level.
Exercise 6.2. a) Compute a new variable, named Y, with the following equation
Y = TotalProd + 20 * TotalProd^2
c) Fit a linear model with Y as the response and TotalProd as predictor.
6.6 Solutions
Exercise 6.1. a) We do not reject normality for any of the variables, because the p-values
of the Shapiro-Wilk test are:
data: Steel$consumption
W = 0.9984, p-value = 0.4207
data: Steel$N2O
W = 0.9922, p-value = 0.7518
data: Steel$NOx
W = 0.9797, p-value = 0.07302
data: Steel$SO2
W = 0.9862, p-value = 0.2772
Pearson correlations:
consumption N2O NOx SO2
consumption 1.0000 0.8274 0.5384 -0.0076
It is clear that there is no linear relationship between the consumption and the emission
of SO2. There are linear relationships between consumption and N2O, and between consumption
and NOx.
f) Yes, in the test H0 : β1 = 0 versus H1 : β1 ̸= 0, the p-value is less than 2 · 10−16 , so this
coefficient is not zero.
g) The coefficient of determination (R2 = 0.6846; the adjusted R2 is Ra2 = 0.6819) is not too
close to zero, and the p-values of the tests on the coefficients suggest that this is not a bad
model. Now we check the residuals:
The Residuals vs Fitted plot shows that the errors are homoscedastic (equal variance).
In the plot Normal Q-Q, the errors seem to follow a normal distribution.
h) We estimate that the consumption grows by 22.1556 units per unit of N2O.
i) The point estimate when N2O=6 is 133.1339 tons and the 95% prediction interval is
(69.28326, 196.9846).
j) At a confidence level of 95%, the average consumption will lie between 127.2474 and
139.0204 tons.
Exercise 6.2. a) Go to Data → Manage variables in the active data set → Compute new
variable. Then, in the Compute new variable menu, we fill it in with:
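The menu action is equivalent to a single assignment in the R console, sketched here on a toy data frame (in the course, Steel is the loaded dataset):

```r
# Toy stand-in for the Steel dataset
Steel <- data.frame(TotalProd = c(10, 20, 30))

# Compute the new variable Y = TotalProd + 20 * TotalProd^2
Steel$Y <- Steel$TotalProd + 20 * Steel$TotalProd^2
Steel$Y   # 2010 8020 18030
```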
Residual plots:

Appendix A. Summary of the main hypothesis tests
• Average of a population.
• Comparison of variances.
• Comparison of averages.

Appendix B. Information on the data sets
shift Shift when data were collected: morning (M), afternoon (A), night (N).
5. DrivingLicense: variable indicating if the student has driving license (Yes or No)
6. PublicTransport: variable indicating if the student uses the public transport regularly to
come to the Campus (Yes or No)
8. TimeArriving: Time (in minutes) the student takes to reach the Campus from his/her
home.
9. TimeinCampus: Time (in hours) the student spends at the Campus per week
10. Networks: Time (in hours) the student spends on social networks (facebook, twitter, etc)
and on messenger programs (MSN, Yahoo Messenger, etc), in a regular week
11. TV: Time (in hours) the student spends watching television or playing computer games,
in a regular week
12. StudyMontoFri: Time (in hours) the student spends studying between Monday and Fri-
day, in a regular week
13. StudyWeekend: Time (in hours) the student spends studying during the weekend, in a
regular week
14. CallReceived: Duration (in minutes) of the last call received on his/her mobile phone
15. CallMade: Duration (in minutes) of the last call made from his/her mobile phone
• Nonsmoker (No)
• Only occasionally (Casual)
• Regularly (Regular)
• Never (Never)
• Yes, but not every week (Casual)
• Yes, once or twice a week (1or2)
• Yes, at least three times a week (3ormore)