Sta238 Wks - Week1+2

25/01/2024, 12:30 Week 1: Review and Introduction to Data Analysis

Review of basic R
In this section, we revisit basic concepts and usage of R. You may skip the section if you are comfortable
using R.

With R, you can


Compute simple arithmetic operations:

2 / 5 + (5 - 4) * 1

## [1] 1.4

Evaluate logical statements:

5 > 2 && 3 != 5 && 2 == 2

## [1] TRUE

Simulate and plot data:

plot(rnorm(25))

In this class, we will use R to study data.

Variables and data types


You can store values in variables, such as
numbers:

my_number <- 5
my_number

## [1] 5

characters:

my_characters <- "this is a single sentence."


my_characters

## [1] "this is a single sentence."

logicals:

my_logical <- TRUE


my_logical

## [1] TRUE

sequences of a single data type called vectors:


my_vector <- c(2, 3, 7)


my_vector

## [1] 2 3 7

To extract elements from a vector, you can use


the positional indices inside [ starting with 1:

my_vector[1]

## [1] 2

my_vector[c(1, 3)]

## [1] 2 7

a logical vector of the same length to extract elements at TRUE positions:

my_vector[c(FALSE, FALSE, TRUE)]

## [1] 7

my_vector[c(T, F, T)]

## [1] 2 7
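In practice, the logical vector is usually produced by a comparison rather than typed by hand. The following is a minimal sketch of that idea; the names `is_large` and `large_values` are made up for illustration.

```r
my_vector <- c(2, 3, 7)

# a comparison on a vector returns a logical vector of the same length
is_large <- my_vector > 2

# use the logical vector to keep only the elements at TRUE positions
large_values <- my_vector[is_large]
large_values
```

The same result can be written in one step as `my_vector[my_vector > 2]`.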

You can create empty vectors of length n with:

n <- 5
numeric(n) # numbers

## [1] 0 0 0 0 0

character(n) # characters

## [1] "" "" "" "" ""

logical(n) # logicals

## [1] FALSE FALSE FALSE FALSE FALSE

Exercise 1
Create a vector that consists of the day of your birthday and your first name. Print the vector and comment on its data
type.

birth_name <- c(26, "Aditya")
birth_name

[1] "26" "Aditya"
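The quotation marks around 26 show that R converted the number to a character so that the vector holds a single data type. A short sketch of how you might check this with `class()`; the variable names are illustrative only.

```r
mixed <- c(26, "Aditya")

# mixing a number and a character yields a character vector
mixed_class <- class(mixed)
mixed_class

# mixing a logical and a number yields a numeric vector (TRUE becomes 1)
class(c(TRUE, 2))

# convert back explicitly when you need the number
day_plus_one <- as.numeric(mixed[1]) + 1
day_plus_one
```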

Functions
In R, a function call takes the following form:

<function name>(<arg1>, <arg2>, ...)

You can store the result of a function with result <- function_x(...)

sum_of_237 <- sum(c(2, 3, 7))
sum_of_237


[1] 12

You can look at the help page of each function with ?<function name> . Whenever you need help using a function,
try the help page first. A quick search on the Internet will often provide many examples as well.

?sample

There are also plenty of resources and examples online for most R functions.

You can also create custom functions.

# your function definition may look like ...
your_function <- function(arg1, arg2) {
  # some task using the inputs;
  # the variable names must match the argument names
  # given in the function definition
  output <- arg1 + arg2
  # return() tells R which object the function outputs and
  # marks the point at which the function stops
  return(output)
}
# you can then call the function as ...
your_function(5, 5)

[1] 10

Exercise 2
Create a function named diff_in_absolute_diff() that takes 4 numbers (say, w , x , y , and z ) and computes

|w − x| − |y − z|

diff_in_absolute_diff <- function(w, x, y, z) {
  out <- abs(w - x) - abs(y - z)
  return(out)
}
diff_in_absolute_diff(20, 10, 15, 10)

[1] 5

Loops and vectorized calls


To loop through each element in a vector, you may use a “for loop” in R:

x <- c(10, 20, 30, 40, 50)


y <- c(5, 10, 15, 20, 25)
z <- numeric(5)
for (i in c(1, 2, 3, 4, 5)) {
z[i] <- x[i] + y[i]
}
z

## [1] 15 30 45 60 75

Many R functions and operators are vectorized: they automatically apply the operation to each element, resulting in a vector of the same size.

x + y

## [1] 15 30 45 60 75
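To convince yourself that the loop and the vectorized call agree, you can compare the two results directly. This is a small sketch; `seq_along()` is used instead of a hand-typed index vector so that the loop adapts to the length of `x`, and the names `z_loop` and `z_vec` are illustrative.

```r
x <- c(10, 20, 30, 40, 50)
y <- c(5, 10, 15, 20, 25)

# loop version: seq_along(x) generates the indices 1, 2, ..., length(x)
z_loop <- numeric(length(x))
for (i in seq_along(x)) {
  z_loop[i] <- x[i] + y[i]
}

# vectorized version
z_vec <- x + y

# check that both approaches give exactly the same vector
identical(z_loop, z_vec)
```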

Exercise 3

Use diff_in_absolute_diff() from Exercise 2 in a loop to compute

|w − x| − |y − z|

for each element of the following vectors.

w <- c(1, 2, 3, 4, 5)
x <- c(10, 20, 30, 40, 50)
y <- c(5, 10, 15, 20, 25)
z <- c(50, 40, 30, 20, 10)

The vectors w , x , y , z , and the function diff_in_absolute_diff() have been predefined for your use in the
following chunk. Use the for-loop to compute and save the resulting values to the vector u .

# use a loop
u <- numeric(5)
for (i in 1:5) {
  u[i] <- diff_in_absolute_diff(w[i], x[i], y[i], z[i])
}
u

[1] -36 -12 12 36 30

In the code chunk below, the vectors are passed directly as inputs. Compare the result with that of the loop
version above.

# diff_in_absolute_diff() is vectorized because abs() and - work element-wise
u <- diff_in_absolute_diff(w, x, y, z)
u

[1] -36 -12 12 36 30


Data in R
Matrix
It is often useful to work in two dimensions when working with data. In R, a matrix is a 2-dimensional object consisting of
values of a single data type.

You can create a matrix using matrix(<vector of values>, <number of rows>) .

matrix(1:9, 3)

##      [,1] [,2] [,3]
## [1,]    1    4    7
## [2,]    2    5    8
## [3,]    3    6    9

matrix(LETTERS, 2) # `LETTERS` is a built-in vector of the capital letters A to Z

##      [,1] [,2] [,3] [,4] [,5] [,6] [,7] [,8] [,9] [,10] [,11] [,12] [,13]
## [1,] "A"  "C"  "E"  "G"  "I"  "K"  "M"  "O"  "Q"  "S"   "U"   "W"   "Y"
## [2,] "B"  "D"  "F"  "H"  "J"  "L"  "N"  "P"  "R"  "T"   "V"   "X"   "Z"

You can access the individual elements using [ as you do with vectors. If you supply a single value or a single vector, R treats the matrix
as one long vector formed by concatenating the columns in order.

mat_ex <- matrix(1:12, 3)
mat_ex

##      [,1] [,2] [,3] [,4]
## [1,]    1    4    7   10
## [2,]    2    5    8   11
## [3,]    3    6    9   12

mat_ex[5] # 5th value

## [1] 5

mat_ex[c(2, 11)] # 2nd and 11th values

## [1] 2 11
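Because the columns are concatenated in order, a single index `k` can be translated into a row and a column by hand. The sketch below works this out for `k = 11` and compares it with base R's `arrayInd()`; the names `k`, `row_k`, and `col_k` are illustrative.

```r
mat_ex <- matrix(1:12, 3)
k <- 11

# position within the column gives the row; number of full columns passed gives the column
row_k <- (k - 1) %% nrow(mat_ex) + 1
col_k <- (k - 1) %/% nrow(mat_ex) + 1
c(row_k, col_k)

# the single index and the 2-dimensional index point at the same element
mat_ex[k] == mat_ex[row_k, col_k]

# arrayInd() performs the same conversion
arrayInd(k, dim(mat_ex))
```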

More intuitively, you can use 2-dimensional coordinates to extract the elements separated by a comma. The first value
corresponds to the row and the second value to the column. The index starts at 1 like a single value index.

mat_ex[2, 3] # second row, third column

## [1] 8

mat_ex[c(1, 3), 3] # first and third row, third column

## [1] 7 9

mat_ex[c(2, 3), c(1, 4)] # second and third row, first and fourth column

##      [,1] [,2]
## [1,]    2   11
## [2,]    3   12

If you leave one dimension blank, you can extract across the whole rows or columns.

mat_ex[1, ] # all values from the first row

## [1] 1 4 7 10

mat_ex[ , 4] # all values from the fourth column

## [1] 10 11 12
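Note that extracting a single row or column returns a plain vector, not a matrix. If you need to keep the 2-dimensional shape, `[` accepts a `drop = FALSE` argument. A minimal sketch, with illustrative variable names:

```r
mat_ex <- matrix(1:12, 3)

# default: the result is simplified to a vector, which has no dimensions
row_vec <- mat_ex[1, ]
dim(row_vec)

# drop = FALSE keeps the result as a 1-by-4 matrix
row_mat <- mat_ex[1, , drop = FALSE]
dim(row_mat)
```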


There are a few convenient functions provided for working with multidimensional objects. For example, rowMeans()
computes mean values for each row. See documentation for details and other similar functions.

?rowMeans

colSums                 package:base                 R Documentation

Form Row and Column Sums and Means

Description:

     Form row and column sums and means for numeric arrays (or data
     frames).

Usage:

     colSums (x, na.rm = FALSE, dims = 1)
     rowSums (x, na.rm = FALSE, dims = 1)
     colMeans(x, na.rm = FALSE, dims = 1)
     rowMeans(x, na.rm = FALSE, dims = 1)

     .colSums(x, m, n, na.rm = FALSE)
     .rowSums(x, m, n, na.rm = FALSE)
     .colMeans(x, m, n, na.rm = FALSE)
     .rowMeans(x, m, n, na.rm = FALSE)

Arguments:

       x: an array of two or more dimensions, containing numeric,
          complex, integer or logical values, or a numeric data frame.
          For '.colSums()' etc, a numeric, integer or logical matrix
          (or vector of length 'm * n')

Exercise 4
mat is a 6 by 4 numeric matrix in the following code chunk. Use R code to

1. find the row numbers where the row mean is smaller than 15
2. find the column numbers where the column sum is larger than 75
3. check whether all values in the second row are larger than the corresponding values in the fourth row AND the value 5
appears in the last column

mat

(1:6)[rowMeans(mat) < 15] # fill in the bracket with a logical statement
(1:4)[colSums(mat) > 75] # fill in the bracket with a logical statement

all(mat[2, ] > mat[4, ]) && (5 %in% mat[ , 4])

     [,1] [,2] [,3] [,4]
[1,]   21   20   24    2
[2,]    5   22   23    9
[3,]   13    7   10   16
[4,]   11   14   12   18
[5,]   19   17    3    6
[6,]    1   15    8    4

[1] 2 3 4 5 6

[1] 2 3

[1] FALSE
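An alternative to indexing `1:6` or `1:4` with a logical vector is base R's `which()`, which returns the positions of the TRUE values directly. The sketch below rebuilds `mat` from the values printed above (the tutorial predefines it, so this reconstruction is only for illustration), and the names `low_rows` and `high_cols` are made up.

```r
# rebuild the 6-by-4 matrix shown above; matrix() fills column by column
mat <- matrix(c(21, 5, 13, 11, 19, 1,
                20, 22, 7, 14, 17, 15,
                24, 23, 10, 12, 3, 8,
                2, 9, 16, 18, 6, 4), nrow = 6)

# which() returns the positions at which the condition holds
low_rows <- which(rowMeans(mat) < 15)
high_cols <- which(colSums(mat) > 75)

low_rows
high_cols
```

`which()` avoids hard-coding the number of rows or columns, so the same code works if the matrix changes size.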

Data frames
Vectors and matrices can only store values of a single data type. We can create tables whose columns have different
data types via data.frame() in R.

For example, the data frame below has columns of integers, characters, and logicals.

set.seed(238)

some_table <- data.frame(
  numbers = 1:5,
  alphabets = letters[1:5],
  l = sample(c(T, F), 5, replace = TRUE),
  y = rpois(5, 100),
  w = rgeom(5, 1/100)
)
some_table

numbers  alphabets  l      y      w
<int>    <chr>      <lgl>  <int>  <int>
1        a          TRUE   106    14
2        b          TRUE   100    98
3        c          FALSE  89     1
4        d          FALSE  100    141
5        e          TRUE   92     282

5 rows

Many functions in R are designed to work with data frames. For example, ggplot() (see footnote 1) expects a data frame as its first
input argument. You can then define the mappings between columns and aesthetics (e.g., the x-axis).

library(ggplot2)
ggplot(some_table,                      # provide the table
       aes(x = numbers)) +              # map the column `numbers` to x-axis
  theme_minimal() +                     # use minimal theme
  geom_point(aes(y = ifelse(l, y, w),   # draw points using `y` or `w` based on the value of `l`
                 color = l))            # change color based on the value of `l`

Specifying aesthetics such as color , size , linetype , shape , etc. inside aes() changes the specified aesthetics
based on the mapped values.
Most multivariate data come in a table format and R makes it easy to work with them.

You can access elements of a data frame using 2D indexes in the same way you use 2D indexes with matrices.

some_table[1, ] # extracting the first row

   numbers  alphabets  l      y      w
   <int>    <chr>      <lgl>  <int>  <int>
1  1        a          TRUE   106    14

1 row

some_table[1:3, ] # extracting the first three rows

   numbers  alphabets  l      y      w
   <int>    <chr>      <lgl>  <int>  <int>
1  1        a          TRUE   106    14
2  2        b          TRUE   100    98
3  3        c          FALSE  89     1

3 rows

some_table[ , 3] # extracting the third column

## [1] TRUE TRUE FALSE FALSE TRUE

some_table[1, 3:4] # extracting the elements in the first row, third and fourth columns

   l      y
   <lgl>  <int>
1  TRUE   106

1 row

When you extract multiple columns, the result is still a data frame.

On the other hand, using a single positional index behaves differently from a matrix and returns a column.

some_table[3]

l
<lgl>
TRUE
TRUE
FALSE
FALSE
TRUE

5 rows

You can also extract a column using $ followed by the column name.

some_table$l

## [1] TRUE TRUE FALSE FALSE TRUE

Using a 2D index to extract a column (e.g., some_table[ , 3] ) or the $ extractor returns a vector, whereas using a 1D index
(e.g., some_table[3] ) returns a single-column data frame. The difference rarely affects your code, but it is a good
thing to check when you face an unexpected error.
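You can check which kind of object you received with `is.data.frame()`. The sketch below uses a small made-up data frame `df` rather than `some_table`; `[[` and `drop = FALSE` are standard base R features.

```r
# a small illustrative data frame
df <- data.frame(id = 1:3, flag = c(TRUE, FALSE, TRUE))

# 1D index: still a data frame
is.data.frame(df[2])

# 2D index and $: plain vectors
is.data.frame(df[, 2])
is.data.frame(df$flag)

# [[ ]] also returns the column as a vector
identical(df[[2]], df$flag)

# drop = FALSE keeps a data frame even with a 2D index
is.data.frame(df[, 2, drop = FALSE])
```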

Exercise 5
In the code chunk below, some_table is available for you.

i. Extract values of y from rows where l is TRUE .
ii. Extract values of w from rows where l is FALSE .
iii. Add up the five extracted values.

some_table
sum(some_table$y[some_table$l]) + sum(some_table$w[!some_table$l])

numbers  alphabets  l      y      w
<int>    <chr>      <lgl>  <int>  <int>
1        a          TRUE   106    14
2        b          TRUE   100    98
3        c          FALSE  89     1
4        d          FALSE  100    141
5        e          TRUE   92     282

5 rows

[1] 440

1. ggplot() is a plotting function available from the ggplot2 package (https://ggplot2.tidyverse.org).


Introduction to data analysis using R


Data
For the following exercises, we will use weather_forecasts , a sample of predicted temperatures from weather
forecasts and the corresponding observed temperatures across the United States (see footnote 2). The data set includes 100 records from a summer and
100 records from a winter. Each row includes the following columns:

season : “Summer” or “Winter”


forecast_temp : predicted temperature (°C)
observed_temp : actual observed temperature (°C)

Below is the data set weather_forecasts .

season  forecast_temp  observed_temp
<chr>   <dbl>          <dbl>
Winter   19.4444444     20.5555556
Winter   19.4444444     18.8888889
Winter    7.7777778      7.7777778
Winter   19.4444444     18.8888889
Winter  -17.2222222    -11.6666667
Winter   26.1111111     26.1111111
Winter    1.6666667     -0.5555556
Winter  -10.0000000    -12.7777778
Winter   -6.1111111     -6.1111111
Winter    1.6666667     -0.5555556

1-10 of 200 rows

dplyr package
dplyr (https://dplyr.tidyverse.org) is a package that provides a set of functions for manipulating data. They are useful
when working with tabular data.

library(dplyr)

For example, mutate() allows you to create one or more new columns as function(s) of existing columns. The code
below computes the difference forecast_temp - observed_temp for each row and saves them to a new column named
forecast_err . We can save the resulting data frame to weather_forecasts again for future use.

weather_forecasts <- weather_forecasts |>
  mutate(forecast_err = forecast_temp - observed_temp)

|> is called the forward pipe operator. It feeds the object on the left-hand side as the first input argument to the function
on the right-hand side. Here, weather_forecasts is passed to the function mutate() as the first argument.
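The pipe is only a different way of writing a nested function call. A minimal base R sketch of the equivalence follows; the vector `values` and the result names are made up, and the native pipe requires R 4.1 or later.

```r
values <- c(4, 9, 16)

# piped version: read left to right
piped <- values |> sqrt() |> sum()

# nested version: read inside out
nested <- sum(sqrt(values))

identical(piped, nested)

# extra arguments go after the piped-in first argument
rounded <- c(1.234, 5.678) |> round(1)  # same as round(c(1.234, 5.678), 1)
rounded
```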

Another useful function is filter() . As the name suggests, the function filters the given data set based on one or more
logical statements. For example, the statement below outputs the subset of the data set where the season is summer and the
absolute ( abs() in R) discrepancy between the forecast temperature and the observed temperature is greater than 3°C.

weather_forecasts |>
filter(season == "Summer", abs(forecast_err) > 3)

season  forecast_temp  observed_temp  forecast_err
<chr>   <dbl>          <dbl>          <dbl>
Summer  22.22222       25.55556       -3.333333
Summer  16.11111       19.44444       -3.333333
Summer  30.00000       26.11111        3.888889
Summer  28.33333       23.88889        4.444444

4 rows

You can supply multiple logical statements separated by , to request records that satisfy all of the conditions. To
request records that satisfy at least one of several conditions, combine them with the “OR” operator, | .
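The difference between the two forms is easiest to see on a tiny table. The sketch below uses a made-up four-row data frame `toy` with the same column names as `weather_forecasts`; it assumes dplyr is installed.

```r
library(dplyr)

# a small illustrative table, not the real data
toy <- data.frame(
  season = c("Summer", "Summer", "Winter", "Winter"),
  forecast_err = c(-4, 1, 5, 0)
)

# comma: both conditions must hold
both <- toy |>
  filter(season == "Summer", abs(forecast_err) > 3)

# |: at least one condition must hold
either <- toy |>
  filter(season == "Summer" | abs(forecast_err) > 3)

nrow(both)
nrow(either)
```

Only the first row satisfies both conditions, while the first three rows satisfy at least one of them.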

You can find more about the package on this cheat sheet (https://rstudio.github.io/cheatsheets/html/data-transformation.html).

Numerical summaries
You can compute individual numerical summaries using built-in R functions that take numeric vectors as inputs. These
include:

mean() for sample mean
median() for sample median
min() for minimum, max() for maximum, and range() for both
quantile() for sample quantile(s); you can specify the desired probabilities, and by default it returns the quantiles that
correspond to probabilities 0, 0.25, 0.5, 0.75, and 1
var() for sample variance and sd() for sample standard deviation
summary() for the five-number summary together with the sample mean

For example, you can compute the 0.3 and 0.7 quantiles of the forecast errors as below.

quantile(weather_forecasts$forecast_err, c(.3, .7))

## 30% 70%
## -1.1111111 0.5555556
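The interquartile range is the difference between the 0.75 and 0.25 quantiles, and base R also provides `IQR()` for it. A small sketch on a made-up vector `x`:

```r
x <- c(1, 2, 3, 4, 5, 6, 7, 8, 9)

# lower and upper quartiles
q <- quantile(x, c(.25, .75))
q

# interquartile range computed by hand; unname() drops the "75%" label
iqr_by_hand <- unname(q[2] - q[1])
iqr_by_hand

# the built-in IQR() gives the same value
IQR(x)
```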

Alternatively, you can use summarise() from dplyr to compute multiple numerical summaries at once. Similar to
mutate() , summarise() takes a data frame as the first input followed by one or more summarising functions of existing
columns. For example, the code below computes the five-number summary and the sample mean of the observed
temperatures.

weather_forecasts |>
  summarise(
    min = min(observed_temp),
    lower_quartile = quantile(observed_temp, .25),
    mean = mean(observed_temp),
    median = median(observed_temp),
    upper_quartile = quantile(observed_temp, .75),
    max = max(observed_temp)
  )

min        lower_quartile  mean      median    upper_quartile  max
<dbl>      <dbl>           <dbl>     <dbl>     <dbl>           <dbl>
-29.44444  0               13.74167  17.22222  26.11111        38.88889

1 row

Exercise 6
Compute the mean, the median, the interquartile range, and the sample variance of the following quantities in
weather_forecasts for summer and winter separately:

i. predicted temperatures
ii. observed temperatures
iii. difference between the predicted and the observed temperatures

The data frame weather_forecasts with the computed column forecast_err is available in the code chunk
below.

weather_forecasts

# using dplyr --- advanced using `.by` argument in `summarise()`
weather_forecasts |>
  summarise(
    .by = season,
    predicted_mean = mean(forecast_temp),
    predicted_median = median(forecast_temp),
    predicted_iqr = quantile(forecast_temp, .75) - quantile(forecast_temp, .25),
    predicted_var = var(forecast_temp),
    observed_mean = mean(observed_temp),
    observed_median = median(observed_temp),
    observed_iqr = quantile(observed_temp, .75) - quantile(observed_temp, .25),
    observed_var = var(observed_temp),
    error_mean = mean(forecast_err),
    error_median = median(forecast_err),
    error_iqr = quantile(forecast_err, .75) - quantile(forecast_err, .25),
    error_var = var(forecast_err)
  )

season  forecast_temp  observed_temp  forecast_err
<chr>   <dbl>          <dbl>          <dbl>
Winter   19.4444444     20.5555556    -1.1111111
Winter   19.4444444     18.8888889     0.5555556
Winter    7.7777778      7.7777778     0.0000000
Winter   19.4444444     18.8888889     0.5555556
Winter  -17.2222222    -11.6666667    -5.5555556
Winter   26.1111111     26.1111111     0.0000000
Winter    1.6666667     -0.5555556     2.2222222
Winter  -10.0000000    -12.7777778     2.7777778
Winter   -6.1111111     -6.1111111     0.0000000
Winter    1.6666667     -0.5555556     2.2222222

1-10 of 200 rows

season  predicted_mean  predicted_median  predicted_iqr  predicted_var  observed_mean
<chr>   <dbl>           <dbl>             <dbl>          <dbl>          <dbl>
Winter   1.05000         0.5555556        15.13889       132.96480       1.438889
Summer  25.93889        25.8333333        11.80556        53.82651      26.044444

2 rows | 1-6 of 13 columns

Graphical summaries
We will use the ggplot2 (https://ggplot2.tidyverse.org) package to create plots in R. You can find the cheat sheet for the
package here (https://rstudio.github.io/cheatsheets/html/data-visualization.html).

You can create a canvas by passing a data frame to the ggplot() function, then add graphical objects to create plots using
the column names. You can further customize the plot by adding additional layers. For example, you can create a scatter
plot of the observed temperatures vs. predicted temperatures coloured by season in the following steps:

1. create a canvas with weather_forecasts

ggplot(weather_forecasts)


2. add points with observed_temp mapped to x axis, forecast_temp to y axis, and season to colour

ggplot(weather_forecasts) +
  geom_point(aes(x = observed_temp, y = forecast_temp, colour = season))

3. label axes and add a title

ggplot(weather_forecasts) +
  geom_point(aes(x = observed_temp, y = forecast_temp, colour = season)) +
  labs(title = "Weather forecast isn't perfect.",
       x = "Observed temperature",
       y = "Predicted temperature")

4. customize the look of the plot

ggplot(weather_forecasts) +
  geom_point(aes(x = observed_temp, y = forecast_temp, colour = season)) +
  labs(title = "Weather forecast isn't perfect.",
       x = "Observed temperature",
       y = "Predicted temperature") +
  scale_colour_discrete(name = NULL) + # remove the legend title
  theme_minimal() +                    # apply the minimal theme
  theme(legend.position = "top")       # place the legend at the top

5. with facet_wrap() , you can separate the plot by season

ggplot(weather_forecasts) +
  geom_point(aes(x = observed_temp, y = forecast_temp, colour = season)) +
  labs(title = "Weather forecast isn't perfect.",
       x = "Observed temperature",
       y = "Predicted temperature") +
  scale_colour_discrete(name = NULL) +
  theme_minimal() +
  theme(legend.position = "none") +    # remove the legend
  facet_wrap(vars(season), nrow = 2)   # separate by season in 2 rows

Below are examples of other types of plots using ggplot() .

Box plot using geom_boxplot()

ggplot(weather_forecasts) +
  geom_boxplot(aes(x = observed_temp, y = season)) +
  labs(title = "Temperatures varied more in the winter across the U.S.",
       x = NULL, y = NULL) + # remove axis titles
  theme_minimal()

By mapping the categorical column season to the y-axis, you can easily compare the distributions between the two
seasons using box plots.
Mapping the quantity of interest to the x-axis results in horizontal box plots.

Histogram using geom_histogram()

# using normal reference method for bin width (MIPS Remark 15.1)
B <- (24 * sqrt(pi))^(1 / 3) *
  sd(weather_forecasts$observed_temp) *
  nrow(weather_forecasts)^(-1 / 3)
ggplot(weather_forecasts) +
  geom_histogram(aes(x = observed_temp), binwidth = B) +
  labs(title = "Temperatures varied more in the winter across the U.S.",
       x = "Observed temperature", y = "Frequency") +
  theme_minimal() +
  facet_wrap(vars(season), nrow = 2)

The binwidth argument controls the width of each bin.
Using facet_wrap() allows comparison because the facets share a common x-axis.
geom_histogram() plots the counts by default rather than the relative frequencies for each bin.
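The normal reference bin width used above can be wrapped in a small helper so that it is easy to reuse. This is only a sketch: the function name `normal_reference_binwidth()` is made up here, and the simulated vector stands in for a real data column.

```r
# normal reference rule: (24 * sqrt(pi))^(1/3) * s * n^(-1/3),
# where s is the sample standard deviation and n is the sample size
normal_reference_binwidth <- function(x) {
  (24 * sqrt(pi))^(1 / 3) * sd(x) * length(x)^(-1 / 3)
}

# the leading constant is roughly 3.49
leading_constant <- (24 * sqrt(pi))^(1 / 3)
leading_constant

# illustrative data: 200 simulated standard normal values
set.seed(1)
x <- rnorm(200)
bin_width <- normal_reference_binwidth(x)
bin_width
```

Because the width shrinks at the rate n^(-1/3), a larger sample leads to narrower bins for the same spread.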

Gaussian kernel density curve using geom_density() with histogram

ggplot(weather_forecasts, aes(observed_temp)) +
  geom_histogram(aes(y = after_stat(density)),
                 bins = 15, alpha = .5) +
  geom_density(colour = "forestgreen") +
  labs(title = "Temperatures varied more in the winter across the U.S.",
       x = "Observed temperature", y = "Density") +
  theme_minimal() +
  facet_wrap(vars(season), nrow = 2)

Adding multiple geom_*() layers to ggplot() results in multiple layers of plot objects in a single plot.
You can define the aes() mapping inside ggplot() instead of the geom_*() functions. These mappings apply to all
geoms created.
after_stat(density) inside geom_histogram() rescales the bar heights to densities, so that the bar areas sum to 1.
The bins argument controls the number of bins.
alpha controls the transparency of the plotting objects.
Use the colour argument outside aes() to set the colour of the plotting objects directly rather than mapping it to
data.
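A density-scaled histogram divides each bin's relative frequency by the bin width, which is why it can share an axis with a density curve. The sketch below checks this relationship with base R's `hist()` on simulated data; it does not use ggplot2, and the variable names are illustrative.

```r
set.seed(1)
x <- rnorm(100)

# compute the histogram without drawing it
h <- hist(x, breaks = 10, plot = FALSE)
widths <- diff(h$breaks)

# density = relative frequency / bin width
relative_freq <- h$counts / sum(h$counts)
all.equal(h$density, relative_freq / widths)

# the bar areas (height times width) sum to 1
total_area <- sum(h$density * widths)
total_area
```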

Empirical CDF curve using geom_step(stat = "ecdf")

weather_forecasts |>
  ggplot(aes(x = observed_temp, colour = season)) +
  geom_step(stat = "ecdf") +
  labs(title = "Temperatures varied more in the winter across the U.S.",
       x = "Observed temperature", y = "Empirical CDF") +
  theme_minimal() +
  scale_colour_discrete(name = NULL) +
  theme(legend.position = "top")

stat = "ecdf" computes the empirical CDF of the mapped column and geom_step() plots them as a step
function.
To plot an empirical CDF curve without the vertical lines with ggplot() , you can use geom_segment() but you
need to compute the empirical CDF separately. Below is a custom function that achieves this.

plot_ecdf <- function(x, point_size = 1) {
  x <- sort(x)
  ecdf_x <- ecdf(x) # ecdf(x) returns the eCDF of x, which is also a function
  r <- diff(range(x))
  x1 <- c(min(x) - r * .05, x) # left end of each horizontal segment
  x2 <- c(x, max(x) + r * .05) # right end of each horizontal segment
  y <- ecdf_x(x1)              # eCDF value along each segment

  ggplot() +
    geom_segment(aes(x = x1, y = y, xend = x2, yend = y)) +
    geom_point(aes(x = x2[-length(x2)], y = y[-length(y)]),
               shape = 21, fill = "white", size = point_size) +
    geom_point(aes(x = x1[-1], y = y[-1]), shape = 19, size = point_size)
}
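To see what `ecdf()` returns, it helps to evaluate it on a tiny sample. The sketch below uses a made-up vector with a repeated value; the empirical CDF at a point is the proportion of observations less than or equal to that point.

```r
x <- c(1, 2, 2, 4)

# ecdf() returns a function
ecdf_x <- ecdf(x)

# evaluate the eCDF at several points; it jumps by 2/4 at the repeated value 2
ecdf_values <- ecdf_x(c(0, 1, 2, 3, 4))
ecdf_values

# knots() lists the distinct points at which the eCDF jumps
jump_points <- knots(ecdf_x)
jump_points
```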

Exercise 7
Use the code chunk below to create appropriate plots to answer the following questions. The data frame
weather_forecasts with the computed column forecast_err is available in the code chunk below.

weather_forecasts

# box plot of errors across the two seasons
weather_forecasts |>
  ggplot(aes(x = forecast_err)) +
  theme_minimal() +
  labs(x = "Error in predicted temperature",
       y = NULL) +
  geom_boxplot()
# box plot of errors for each season
weather_forecasts |>
  ggplot(aes(x = forecast_err, y = season)) +
  theme_minimal() +
  labs(x = "Error in predicted temperature",
       y = NULL) +
  geom_boxplot()
# histograms of errors for each season
weather_forecasts |>
  ggplot(aes(x = forecast_err)) +
  theme_minimal() +
  labs(x = "Error in predicted temperature",
       y = "Frequency") +
  geom_histogram(bins = 18) +
  facet_wrap(vars(season), nrow = 2)
# histogram and KDE of errors across the two seasons
weather_forecasts |>
  ggplot() +
  theme_minimal() +
  labs(x = "Error in predicted temperature",
       y = "Density") +
  geom_histogram(aes(x = forecast_err, y = after_stat(density)),
                 alpha = .5, bins = 25) +
  geom_density(aes(x = forecast_err), colour = "forestgreen")
# empirical CDF of the errors for each season
weather_forecasts |>
  ggplot(aes(x = forecast_err, colour = season)) +
  geom_step(stat = "ecdf") +
  labs(x = "Error in predicted temperature", y = "Empirical CDF") +
  theme_minimal() +
  scale_colour_discrete(name = NULL) +
  theme(legend.position = "top")
# scatter plot between observed temperatures and errors
weather_forecasts |>
  ggplot(aes(x = observed_temp, y = forecast_err, colour = season)) +
  theme_minimal() +
  geom_point() +
  labs(x = "Observed temperature", y = "Error in predicted temperature") +
  scale_colour_discrete(name = NULL) +
  theme(legend.position = "top")


Compare the median error in predicted temperatures between winter and summer. Which of the following
statements is correct?

✓ Box plots show that the median error was further from 0 in the winter.
✗ Empirical CDFs show that the median errors were 0 both in the winter and in the summer.

✗ Histograms show smaller median error in the summer.

Correct!


Inspect the distribution of error in predicted temperatures across the two seasons. With which of the following
statements do you agree the least?

✓ Box plot reveals that all errors are concentrated within fourfold of the interquartile range.
✗ Histogram shows that the errors are approximately centred around 0.

✗ KDE curve shows that the distribution is approximately bell-shaped.

Correct!

The two whisker ends span at most four times the IQR: the box covers one IQR and each whisker extends at most 1.5
times the IQR beyond it. The box plot of the errors reveals 4 points outside the whiskers.
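The arithmetic behind this explanation can be checked by hand. The sketch below assumes the usual 1.5 × IQR rule for whiskers and uses a made-up vector with one extreme value; the names `lower_fence` and `upper_fence` are illustrative.

```r
x <- c(1, 2, 3, 4, 5, 6, 7, 8, 9, 30)

# quartiles and interquartile range
q <- unname(quantile(x, c(.25, .75)))
iqr <- q[2] - q[1]

# whiskers cannot extend beyond these limits
lower_fence <- q[1] - 1.5 * iqr
upper_fence <- q[2] + 1.5 * iqr

# the limits are exactly four IQRs apart: 1.5 + 1 + 1.5
(upper_fence - lower_fence) / iqr

# points beyond the limits are drawn individually as outliers
outliers <- x[x < lower_fence | x > upper_fence]
outliers
```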

Inspect the relationship between the observed temperatures and errors in predicted temperatures. With which of
the following statements do you agree the most?

✗ There is a strong positive correlation across the two seasons.

✗ The relationship is exponential in the summer while there is no discernable relationship in the winter.

✓ There is no discernable relationship across the two seasons.

Correct!


2. The data set was prepared by taking random samples from the U.S. weather forecast data set available via Tidytuesday
(https://github.com/rfordatascience/tidytuesday/tree/master/data/2022/2022-12-20). Sai Shreyas Bhavanasi, Harrison
Lanier, Lauren Schmiedeler, and Clayton Strauch at the Saint Louis University Department of Mathematics and Statistics
(https://github.com/speegled/weather_forecasts) collected and processed the source data.

25/01/2024, 13:05 Week 2: Statistical Modelling

Guess the distribution


Histograms and KDE plots can help us make an educated guess about the unknown distribution from which random samples
are generated. However, a poor choice of the bin width or the bandwidth may lead to less useful plots.

For example, consider sim_z , which contains 100 random samples generated from the standard normal distribution,
N(0, 1) .

sim_z <- rnorm(100)

In each of the plots below, a histogram of sim_z is plotted with the probability density function of the standard normal
distribution in red using different bin widths. The bin widths are controlled using the binwidth parameter of the
geom_histogram() function.

library(ggplot2)
ggplot() +
  theme_minimal() +
  geom_histogram(
    aes(x = sim_z, y = after_stat(density)),
    binwidth = 0.8 # set the bin width here
  )

The narrower the bins, the more spikes and gaps the histogram contains. On the other hand, when the bin width is too wide,
the histogram loses much of the information contained in the data. Similarly, the bandwidth of a KDE controls how
smooth and flat the resulting estimate is. The plots below use the Gaussian kernel with different bandwidths.

ggplot() +
  theme_minimal() +
  geom_density(
    aes(x = sim_z),
    bw = 0.5 # set the bandwidth here
  )

In the example above, a bin width between 0.6 and 1 seems suitable and a bandwidth of 0.5 or 0.6 seems fitting for the
Gaussian KDE.
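
The effect of the bandwidth can also be seen without plotting. The sketch below is only an illustration: it uses base R's density() on a freshly simulated vector, sim_z_demo (a hypothetical name, not the tutorial's sim_z), and measures how much each estimated curve moves up and down. The bandwidths 0.1 and 1 are arbitrary choices for contrast.

```r
set.seed(238)
sim_z_demo <- rnorm(100)

# Silverman's rule of thumb, the default bandwidth used by density()
bw_default <- bw.nrd0(sim_z_demo)

kde_narrow <- density(sim_z_demo, bw = 0.1, kernel = "gaussian")
kde_wide <- density(sim_z_demo, bw = 1, kernel = "gaussian")

# total up-and-down movement of an estimated density curve
total_variation <- function(kde) sum(abs(diff(kde$y)))

c(default_bw = bw_default,
  narrow = total_variation(kde_narrow),
  wide = total_variation(kde_wide))
```

A larger total variation means a taller or bumpier curve, which is what a bandwidth that is too small produces; the rule-of-thumb value gives a starting point to adjust from.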

Exercise 1
https://rconnect.utstat.utoronto.ca/content/46071d0f-46d2-4f2a-bbf9-522013a5ac33/#section-guess-the-distribution 1/10

The following plots display probability density and probability mass functions of different distributions. Use the plots
to answer the following questions.

Part 1

In the following code chunk, each of x1 , x2 , x3 , and x4 contains 500 samples from one of the distributions shown
in the plots above. Generate histograms and/or KDE plots to inspect the distributions of the random samples and
guess their distributions.


ggplot() +
  theme_minimal() +
  geom_histogram(
    aes(x = x2, y = after_stat(density)),
    binwidth = 0.5 # set the bin width here
  ) +
  ggtitle("Histogram of x2")

ggplot() +
  theme_minimal() +
  geom_density(
    aes(x = x2),
    bw = 0.5 # set the bandwidth here
  ) +
  ggtitle("KDE Plot of x2")

ggplot() +
  theme_minimal() +
  geom_histogram(
    aes(x = x3, y = after_stat(density)),
    binwidth = 0.5 # set the bin width here
  ) +
  ggtitle("Histogram of x3")

ggplot() +
  theme_minimal() +
  geom_density(
    aes(x = x3),
    bw = 0.5 # set the bandwidth here
  ) +
  ggtitle("KDE Plot of x3")

ggplot() +
  theme_minimal() +
  geom_histogram(
    aes(x = x4, y = after_stat(density)),
    binwidth = 0.5 # set the bin width here
  ) +
  ggtitle("Histogram of x4")

ggplot() +
  theme_minimal() +
  geom_density(
    aes(x = x4),
    bw = 0.5 # set the bandwidth here
  ) +
  ggtitle("KDE Plot of x4")


[1] -0.253826399 -0.726084906 0.411240004 2.537540273 1.330998216


[6] 1.142200694 3.793946214 -0.905728457 1.772262741 -2.637451345
[11] 2.631986782 -2.903834632 2.184177742 2.831276819 4.489878216
[16] 2.861337023 -0.290316089 -3.184120506 -2.834072982 -0.206657103
[21] 1.886607959 -1.100133656 5.891779625 0.380545531 3.163113792
[26] -3.435317787 1.967259606 4.538299930 1.017118100 -0.661207912
[31] 3.914885764 -1.546009305 -0.815840375 0.740961956 0.161321370
[36] -2.372426630 0.668490644 -0.886669007 -2.612710328 -3.130450209
[41] 2.569361736 -0.776716982 1.327053225 -0.600316510 -2.765235860
[46] 1.620141361 2.631264066 2.045405395 -0.838863597 -1.766565879
[51] -1.051192292 -3.599091846 2.367509968 0.626558061 1.148750935
[56] 3.012146419 0.005113479 0.973172408 -0.585106974 0.680806105
[61] -2.396138313 -4.981535811 -0.091081505 -4.918216566 -0.855712112
[66] -1.343815961 0.053431836 3.807062696 2.177603972 -3.802978394
[71] 5.955978062 1.390409735 -1.070469457 -1.056909880 -1.043034364
[76] -1.641970861 0.684435457 1.782316732 -1.476871681 -1.527400338
[81] -2.016520377 -1.639677022 -3.073658905 -0.036668392 0.230868747
[86] 1.516085691 -1.124456218 1.783163000 0.472946307 -0.134906311
[91] -1.720778345 -1.550332821 2.500836266 3.062078294 0.966760191
[96] 1.074233935 0.087474983 1.714348657 0.399112396 -0.098564674
[101] 2.210895032 0.476472491 1.859579718 -0.150513320 1.074203786
[106] 0.846336825 0.165483682 -0.406688940 3.144760301 -2.435009805
[111] 3.907697129 -3.578414831 -1.057640470 -1.175750239 -0.015476664
[116] -0.511323852 -1.037301791 1.092868680 2.299641884 -0.155724933
[121] 0.576212021 -2.216732972 -1.077733866 0.063157681 -0.842201309
[126] 4.317112642 -0.234558197 1.018220392 2.669359849 -1.467126645
[131] 0 879871790 2 931019307 2 431972304 1 502592643 1 270754245

[1] 0.120109201 0.190126955 0.293933909 0.363482027 0.090466836 0.296588265


[7] 0.058407098 0.185875307 0.391141386 2.148141066 0.827850734 0.346916308
[13] 1.033285828 1.130178256 0.476082729 0.184228322 0.357335548 1.233671473
[19] 0.382858512 1.016434034 0.001362972 1.122398200 4.234327247 0.632777472
[25] 0.055722430 1.696620969 0.849052450 2.161908550 0.783275843 1.670592895
[31] 0.078752785 0.160821476 1.318977978 0.062219732 3.321743496 0.498752391
[37] 0.107844381 0.583286479 0.247579039 0.973231916 0.034413177 0.881176611
[43] 0.260837452 1.034902708 1.034924453 3.632808888 0.047987824 1.553801845
[49] 1.034481351 0.859430273 0.411054747 0.536993460 0.225545691 1.797518696
[55] 0.102056195 0.032990923 0.717265627 1.117654562 0.703454706 0.001624452
[61] 0.359201495 2.469634367 0.240950028 0.586855276 0.366891942 3.866790080
[67] 1.223806428 0.510881430 0.096817419 0.644136401 0.076080724 0.181261058
[73] 0.118678604 1.302287354 0.696158662 0.064064357 1.117412560 0.759861630
[79] 0.438755328 0.243849142 0.707906176 1.702758924 0.042237292 0.645833077
[85] 0.212343876 0.160058309 0.507520449 0.379347124 1.199586330 1.331908952
[91] 0.811402117 0.784880675 0.273282493 1.320228668 0.187195079 2.004189741
[97] 2.182438394 0.094711678 0.007380299 1.164384997 0.387223320 0.036119024
[103] 0.335616429 0.509455970 0.434562127 1.108826065 2.164033558 2.052650990
[109] 1.130641787 1.030434159 1.294379359 0.578148676 0.872559356 0.765659593
[115] 1.232810479 1.408666116 0.966141568 0.968207096 0.766667226 0.469280950
[121] 0.976664873 0.046412180 0.584889405 1.362555501 0.448545618 0.059468638
[127] 0.883400425 0.991672320 1.876409540 1.018232461 1.699083717 0.441653924
[133] 1.087339070 0.545149560 1.247396598 0.031003722 1.475887165 0.800100333
[139] 0.114147603 1.840726172 0.207928381 0.533712496 0.781643074 0.878733709
[145] 0.090831167 0.169720036 2.279728567 4.161193955 0.109782203 1.065969690
[151] 1.366773996 0.587732674 0.632117509 0.471435210 0.974245339 0.682543578
[157] 0.465683515 2.169274000 0.344842688 0.105956589 0.065379409 0.239884931

[1] 3 2 4 2 2 3 4 2 4 2 3 3 2 0 2 4 2 0 4 6 4 1 3 3 2 3 2 3 3 1 5 2 1 2 2 3 3
[38] 3 6 2 0 0 3 5 1 4 3 4 3 2 3 3 3 6 6 2 6 3 2 3 1 2 4 3 5 4 0 4 3 2 4 4 2 3
[75] 5 5 3 3 4 3 4 4 3 3 3 5 3 3 3 2 2 5 6 6 2 4 5 3 5 2 4 1 3 0 3 2 2 1 3 4 0
[112] 3 0 2 3 3 3 1 3 4 5 4 3 3 2 4 3 5 2 2 3 3 2 2 3 2 5 2 2 3 4 6 2 3 4 2 3 5
[149] 1 1 1 3 4 3 1 2 1 2 4 2 4 5 2 5 4 7 4 1 3 2 3 4 2 4 3 5 2 1 3 7 0 2 3 3 2
[186] 1 0 2 2 4 2 5 4 2 3 2 2 5 4 3 1 2 3 2 1 1 6 4 2 3 2 1 4 4 6 4 5 2 5 1 3 2
[223] 1 2 5 6 4 3 1 2 2 3 3 5 5 7 4 3 2 3 3 4 6 4 4 3 5 3 3 2 1 2 4 2 3 1 5 3 4
[260] 4 1 3 4 3 3 3 5 2 4 2 3 2 2 3 4 2 6 2 0 1 2 3 3 1 0 4 2 2 0 3 2 3 2 1 1 4
[297] 1 5 1 1 2 7 2 1 3 3 3 3 4 4 4 2 4 0 2 2 3 2 4 3 5 3 3 4 6 4 1 3 3 3 3 6 4
[334] 3 3 7 4 4 2 7 3 2 5 4 2 4 4 3 2 4 2 3 2 2 4 1 3 2 3 5 5 0 5 1 5 4 1 2 3 3
[371] 4 1 4 3 2 4 6 3 1 2 3 0 5 4 1 6 4 2 3 5 4 4 4 2 4 2 3 4 3 6 4 3 5 4 4 3 4
[408] 2 5 4 3 1 3 1 4 2 3 4 3 3 2 4 1 1 3 4 5 2 1 1 4 3 2 3 2 5 5 2 3 4 3 4 3 5
[445] 4 4 4 4 5 4 5 1 2 5 3 3 3 3 2 2 3 5 3 4 3 3 6 3 4 1 2 1 2 4 4 2 1 2 3 1 4
[482] 2 2 2 4 3 2 1 4 4 2 1 4 3 2 1 0 3 1 7


[1] -1.5008264895 2.1782502662 2.3086793549 -0.9670355548 2.0528265006


[6] 0.5199165164 2.0737496583 -1.1129389954 0.4223767966 0.6478408445
[11] 0.0898537762 0.9450987654 -0.4572700607 1.0779663438 -1.5747121936
[16] -0.7410356374 -0.1889255185 -1.0188767235 -0.7288504534 -0.2074039145
[21] -0.3953447920 0.1137243406 -1.6869983154 0.2304183279 2.6471989590
[26] -0.8896148712 0.2246090578 -2.1269990853 -0.8797096555 -1.2605101735
[31] 0.0034474024 -0.0072848943 -1.2254471873 -0.5237227811 0.6066497109
[36] 0.9963269207 -1.8630911887 0.6295051091 -1.3212052636 1.1722973982
[41] 1.3631041887 0.6373804026 -0.2019478060 0.2685395354 0.5067579882
[46] -0.7377592595 -0.0892800120 -0.7956451365 1.1829834113 -0.4894782925
[51] -0.9299456821 0.5402157273 -1.1155273048 -0.9061984663 0.8519471901
[56] -0.2537659196 -0.2106080368 -1.1950737537 -1.2211404123 -0.5617024450
[61] 0.4796831936 1.7292042407 -0.0625711548 -0.3940357065 -0.3272479144
[66] 0.3579554559 -0.4793694575 -1.0290956760 -0.3802485215 1.2519996244
[71] 0.2268570930 0.5026470714 0.0067753263 -0.2911048762 0.6578359290
[76] -1.6229804051 -0.3302091026 1.2422905754 0.0784580272 0.3868311847
[81] -2.5160553032 1.5375799418 -0.5305525914 0.4971756312 -0.1164845937
[86] 0.2740980129 0.1663964190 1.4544294696 -0.2178743851 0.6623139041
[91] -0.9390209231 -0.5105812707 0.9472190027 -0.3298725367 1.3341004529
[96] -0.0017077394 -1.0163142717 1.3119347165 -0.6423145069 0.1954630916
[101] 0.1441143916 -0.9083025323 1.6961945290 -1.0314807316 -0.1599494009
[106] 1.6603331132 0.7756976051 0.5382664755 0.6098203619 -0.5818623459
[111] -0.5799406119 0.8992968336 -0.1243104486 1.1496852200 -0.0812553399
[116] 0.9212610611 0.3649685033 -0.4245203296 2.1987282982 0.6257581377
[121] -0.1728820695 -0.7093209436 1.1882665463 -0.6556038172 -1.3237779156
[126] 0.1341695461 0.2076293903 0.1325556357 0.0114305381 0.9159579372
[131] 1 6640280725 0 2687544698 1 0207917688 0 3229644189 1 0128411247


Guess the distributions

What is the underlying distribution that generated x1?

✗ N(0, 1)

✓ N(0, 4)
✗ Exp(1)

✗ Binom(10, 0.5)

Correct!

What is the underlying distribution that generated x2?

✗ N(0, 1)

✗ N(0, 4)

✓ Exp(1)
✗ Binom(10, 0.5)

Correct!

What is the underlying distribution that generated x3?

✗ N(0, 1)

✗ N(0, 4)

✗ Exp(1)

✓ Binom(10, 0.5)

Correct!

What is the underlying distribution that generated x4?

✓ N(0, 1)
✗ N(0, 4)

✗ Exp(1)

✗ Binom(10, 0.5)

Correct!

Part 2

x5 contains 500 random samples from a normal distribution with some parameters μ and σ². Generate histograms
and/or KDE plots to inspect the distribution of the random sample. Can you guess the approximate values of the
parameters?


ggplot() +
  theme_minimal() +
  geom_histogram(
    aes(x = x5, y = after_stat(density)),
    binwidth = 0.5 # set the bin width here
  ) +
  ggtitle("Histogram of x5")

# Generate KDE plot
ggplot() +
  theme_minimal() +
  geom_density(
    aes(x = x5),
    bw = 0.5 # set the bandwidth here
  ) +
  ggtitle("KDE Plot of x5")

# Make a guess about the parameters
mean_guess <- mean(x5)
sd_guess <- sd(x5)

cat("Approximate guess for mean (μ):", mean_guess, "\n")
cat("Approximate guess for standard deviation (σ):", sd_guess, "\n")

[1] 1.64697425 3.03375511 0.52999730 0.97780989 2.41425142 1.56560547


[7] 0.65772003 1.76157957 0.17911674 2.68940232 3.65274052 2.67432592
[13] 1.63543298 1.47186976 1.14157839 2.59923870 3.23875130 4.19853726
[19] 2.33121584 4.19226089 1.98594525 0.55541270 2.69475591 3.65076588
[25] 1.42216953 3.12595884 2.58865505 1.88863191 0.90691606 2.62801048
[31] 1.76535824 1.74487154 1.23603452 1.62375276 1.53118599 1.83717582
[37] 0.38330838 0.94183833 2.07647933 1.61533156 -0.04784858 -0.27741444
[43] 3.29113139 2.46572866 1.70229944 4.31773039 1.33128587 2.68140402
[49] 0.25117418 0.61752148 1.41019311 3.52091850 1.89849148 2.42085444
[55] 2.70903154 2.45715515 2.60727545 1.69878119 1.41621761 1.15760124
[61] 2.11720630 0.93961547 1.58549303 3.56446237 3.95465584 0.30938047
[67] 3.36774063 3.16842730 1.23463954 1.03129717 1.60583396 2.96621013
[73] 0.18976238 1.83142458 3.19827505 0.75595042 2.34081769 3.25817346
[79] 2.13463394 2.47776901 1.36428068 2.34506809 3.92869561 3.93286358
[85] 2.55468498 1.62055031 -0.16968090 2.31047451 3.90219646 2.30605263
[91] 2.73814040 1.77960840 2.39644534 -0.50721787 1.69303954 1.68671566
[97] 3.08788590 2.52884695 3.69085632 2.45713845 1.89144683 1.17846282
[103] 1.33529605 1.87623773 2.36398247 1.74541302 0.77938334 2.45796912
[109] 2.24764472 0.56338788 2.12935385 1.41292771 1.98488138 1.87312222
[115] 1.67000899 3.86811322 3.55545262 1.48300894 1.18121600 2.03265892
[121] 2.65633076 3.54247684 3.11036832 2.01121330 1.42015837 2.08233484
[127] 2.87797873 2.12506256 -0.21556463 0.73070685 3.35299449 1.14760771
[133] 2.89325129 0.44519024 2.54063163 1.47404183 1.42069435 3.04392592
[139] 0.90468627 0.81930377 1.54150125 2.33142842 3.18376291 0.29459059
[145] 0.64855392 2.40431568 2.29707202 0.80842806 2.19146153 2.87930559
[151] 2.86650519 2.96108262 1.15620104 1.89906993 -0.29464652 2.58848776
[157] 3 13980728 2 72628922 2 70752175 2 01958399 1 54594654 0 74082936


Approximate guess for mean (μ): 2.014973

Approximate guess for standard deviation (σ): 0.991421
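
The question asks for σ², while the output above reports the standard deviation σ, so one more step converts between the two. The sketch below squares the reported value (giving roughly 0.98) and, as a rough sanity check, simulates from N(2, 1) to see whether similar summaries come out. The vector x5_demo is a hypothetical stand-in, not the tutorial's x5, and N(2, 1) is only a guess suggested by the output.

```r
mean_guess <- 2.014973 # sample mean reported above
sd_guess <- 0.991421   # sample standard deviation reported above

# the variance is the square of the standard deviation
var_guess <- sd_guess^2
round(var_guess, 3)

# hypothetical check: 500 draws from N(2, 1) should give similar summaries
set.seed(238)
x5_demo <- rnorm(500, mean = 2, sd = 1)
c(mean = mean(x5_demo), var = var(x5_demo))
```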


Concentration around the mean


Chebyshev’s inequality
Recall that Chebyshev’s inequality states that

$$P(|Y - \mathbb{E}(Y)| \ge a) \le \frac{\mathrm{Var}(Y)}{a^2}$$

where Y is a random variable with 𝔼(Y) < ∞ and Var(Y) < ∞, and a is a positive constant. We can visualize the
theorem in action using simulation.

Consider Z ∼ N(0, 1) and a positive constant a. Since 𝔼(Z) = 0 and Var(Z) = 1, Chebyshev’s inequality gives the
bound P(|Z| ≥ a) ≤ 1/a². The following code computes the bounds for a sequence of constants, as: 1, 1.1, 1.2,
1.3, …, 10.

as <- seq(from = 1, to = 10, by = .1) # 1 to 10 increasing by 0.1
chebyshev_bounds <- 1 / as^2

We can plot the bounds against the constant a using ggplot() as below.

library(ggplot2)
ggplot() +
  theme_minimal() +
  # map the as constants to the x-axis
  # and the bounds to the y-axis
  geom_line(aes(x = as, y = chebyshev_bounds)) +
  labs(x = "a", y = "P(|Z| >= a)") +
  ylim(c(0, 1)) # set y-axis limits

Let’s now estimate the probabilities P(|Z| ≥ a) for a = 1, 2, 3, …, 10 using 50 simulated random samples of Z. For
example, you can estimate P(|Z| ≥ 1) with the following code.

Z <- rnorm(n = 50, mean = 0, sd = 1) # simulate 50 samples from N(0, 1)
mean(abs(Z) >= 1)

## [1] 0.34

We can repeat the procedure for a = 1, 2, 3, …, 10 using a for-loop. We will store the simulated probabilities in the
vector sim_probabilities.

sim_probabilities <- numeric(10) # create an empty vector of 10 numbers
for (i in 1:10) {
  Z <- rnorm(n = 50, mean = 0, sd = 1)
  sim_probabilities[i] <- mean(abs(Z) >= i)
}

Let’s now plot the simulated probabilities along with the upper bounds based on Chebyshev’s inequality. We can simply
add appropriate layers to the ggplot from above.

https://rconnect.utstat.utoronto.ca/content/46071d0f-46d2-4f2a-bbf9-522013a5ac33/#section-concentration-around-the-mean 1/5

ggplot() +
  theme_minimal() +
  geom_line(aes(x = as, y = chebyshev_bounds)) +
  labs(x = "a", y = "P(|Z| >= a)") +
  ylim(c(0, 1)) +
  # add a layer of dotted lines connecting the simulated probabilities
  geom_line(aes(x = 1:10, y = sim_probabilities),
            linetype = "dotted", colour = "maroon") +
  # add a layer of points for the simulated probabilities
  geom_point(aes(x = 1:10, y = sim_probabilities), colour = "maroon")

The plot shows that all simulated probabilities fall below the upper bound curve based on Chebyshev’s inequality.
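
For the standard normal distribution, the exact tail probability is also available, which gives a check that does not depend on simulation noise. The sketch below recomputes the bounds under new names (as_demo and chebyshev_demo are illustrative, not part of the tutorial) and compares them with the exact values from pnorm().

```r
as_demo <- seq(from = 1, to = 10, by = 0.1)
chebyshev_demo <- 1 / as_demo^2

# exact P(|Z| >= a) for Z ~ N(0, 1), using the symmetry of the normal density
exact_probabilities <- 2 * pnorm(-as_demo)

# the exact probabilities never exceed the Chebyshev bounds
all(exact_probabilities <= chebyshev_demo)

# how loose the bound is at a = 2
c(exact = 2 * pnorm(-2), bound = 1 / 2^2)
```

Chebyshev’s inequality holds for any distribution with finite variance, so it is usually far from tight for a specific one such as the normal.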

Exercise 2
Suppose W ∼ Exp(1/5). Create a plot similar to the one above demonstrating Chebyshev’s inequality for
P(|W − 𝔼(W)| < a) following the steps below:

1. Compute the Chebyshev bounds of the probability for a between 5 and 50 in increments of 0.5; store both the a
   values and the lower bounds.
2. Simulate 50 random samples of W to estimate the probability for each of a = 5, 10, 15, 20, …, 50; store both the a
   values and the simulated probabilities.
3. Use geom_line() to plot the Chebyshev bounds as a curve and geom_point() + geom_line() to plot the
   simulated probabilities as connected dots on the y-axis against the a values in a single plot; label the axes
   appropriately.


# step 1
as_chevyshev <- seq(from = 5, to = 50, by = .5)
# recall Var(W) = 1/lambda^2 for W ~ Exp(lambda)
chevyshev_bounds <- 1 - 25 / as_chevyshev^2
# step 2
n <- 50
as_simulation <- seq(from = 5, to = 50, by = 5)
sim_probabilities <- numeric(length(as_simulation))
for (i in seq(length(as_simulation))) {
  W <- rexp(n = n, rate = 1/5)
  # recall E(W) = 1/lambda for W ~ Exp(lambda)
  sim_probabilities[i] <- mean(abs(W - 5) < as_simulation[i])
}
# step 3
ggplot() +
  theme_minimal() + # optional but strongly recommended
  geom_line(aes(x = as_chevyshev, y = chevyshev_bounds)) +
  geom_line(aes(x = as_simulation, y = sim_probabilities),
            linetype = "dotted", colour = "red") +
  geom_point(aes(x = as_simulation, y = sim_probabilities),
             colour = "red") +
  labs(x = "a", y = "P(|W- E[W]| < a)") +
  ylim(c(0, 1))
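
The simulated probabilities in this exercise can be compared with exact values, since the exponential distribution function is available as pexp(). The sketch below is an optional check, not part of the exercise; the names a_demo, lower_bounds_demo, and exact_demo are illustrative.

```r
a_demo <- seq(from = 5, to = 50, by = 0.5)

# Chebyshev lower bound for P(|W - 5| < a) with Var(W) = 25
lower_bounds_demo <- 1 - 25 / a_demo^2

# exact P(|W - 5| < a) = P(5 - a < W < 5 + a) for W ~ Exp(1/5);
# pexp() returns 0 for non-positive arguments, which covers 5 - a <= 0
exact_demo <- pexp(5 + a_demo, rate = 1/5) - pexp(5 - a_demo, rate = 1/5)

# the exact probabilities sit on or above the lower bound everywhere
all(exact_demo >= lower_bounds_demo)
```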


The (Weak) Law of Large Numbers


Recall that the (weak) law of large numbers states that

$$\bar{X}_n \xrightarrow{p} \mu \quad \text{as } n \to \infty$$

or

$$\lim_{n \to \infty} P\left(\left|\bar{X}_n - \mu\right| > \varepsilon\right) = 0$$

where X₁, X₂, … are independent and identically distributed random variables with expectation μ and variance σ²,
$\bar{X}_n = \sum_{i=1}^{n} X_i / n$, and ε > 0. We will visualize the law in action using simulation.

Consider W ∼ Exp(1/5). Then 𝔼(W) = 5 since 𝔼(W) = 1/λ for W ∼ Exp(λ). The code below simulates 5 independent copies
of W and computes their arithmetic mean, w̄₅, which is a realization of W̄₅.

sim_W <- rexp(n = 5, rate = 1/5)
mean(sim_W)

## [1] 4.396142

As you increase the number of copies, n, the sample mean, W̄n, should converge to the mean of W based on the law of
large numbers. The code below visualizes the law in action by simulating 10 000 samples of W, (w1, w2, …, w10 000),
and taking the mean of the first n values for n = 1, 2, …, 10 000.

m <- 10000 # the number of samples to simulate
# simulate 10 000 copies of W ~ Exp(1/5)
sim_W <- rexp(n = m, rate = 1/5)
# create an empty vector of size m
sim_expectations <- numeric(m)
# compute the sample mean of the first n values, n = 1, ..., 10 000
for (n in 1:m) {
  sim_expectations[n] <- mean(sim_W[1:n])
}
# plot the sample means
ggplot() +
  theme_minimal() +
  # map n to the x-axis
  # and the running sample means to the y-axis
  geom_point(aes(x = 1:m, y = sim_expectations)) +
  labs(x = "n", y = "w bar")
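
The limit statement of the law can also be examined directly by estimating P(|W̄n − 5| > ε) for a few values of n. The sketch below assumes a tolerance of ε = 1 and 2 000 repetitions per n; both are arbitrary choices made for illustration, as are the variable names.

```r
set.seed(238)
epsilon <- 1            # hypothetical tolerance
n_reps <- 2000          # number of simulated sample means per n
ns <- c(10, 100, 1000)  # sample sizes to compare

deviation_probs <- numeric(length(ns))
for (j in seq_along(ns)) {
  # simulate n_reps sample means, each based on ns[j] copies of W ~ Exp(1/5)
  sample_means <- replicate(n_reps, mean(rexp(n = ns[j], rate = 1/5)))
  # proportion of sample means further than epsilon from E(W) = 5
  deviation_probs[j] <- mean(abs(sample_means - 5) > epsilon)
}
deviation_probs
```

The estimated probabilities shrink towards 0 as n grows, which is what convergence in probability describes.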


x[1:n] returns elements at indices from 1 to n in vector x in R.
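
Taking mean(x[1:n]) inside a loop recomputes each sum from scratch. An equivalent way to get all running means at once, shown here as an optional alternative rather than the tutorial's method, divides the cumulative sums by the index. The vector x_demo is a hypothetical example.

```r
set.seed(238)
x_demo <- rexp(n = 1000, rate = 1/5)

# running means with a loop, as in the code above
loop_means <- numeric(length(x_demo))
for (n in seq_along(x_demo)) {
  loop_means[n] <- mean(x_demo[1:n])
}

# the same running means using cumulative sums
running_means <- cumsum(x_demo) / seq_along(x_demo)

all.equal(loop_means, running_means)
```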

Exercise 3
Consider two independent random variables U and V with unknown cumulative distribution functions, and let
Y = U ⋅ V . In the code chunk below, simulate_U(m) returns m simulated values of U and simulate_V(m)
returns m simulated values of V .

Construct a plot showing the convergence of the sample means Ūn, V̄n, and Ȳn following the steps below.

1. Simulate 10 000 samples of U and V using simulate_U() and simulate_V().
2. Compute sample means, ūn, v̄n, and ȳn, for n = 1, 2, …, 10 000.
3. Plot the computed sample means.
   a. Plot the sample means for each random variable as a line with the values mapped to the y-axis and
      corresponding n mapped to the x-axis using geom_line().
   b. Plot the three random variables on a single plot by adding the geom_line() layers together.
   c. Use different colours for the three random variables.
   d. Label the final sample means, ū10 000, v̄10 000, and ȳ10 000, on the y-axis by adding
      scale_y_continuous(breaks = c(U_mean[m], V_mean[m], Y_mean[m])) to the ggplot.


# Simulate 10 000 samples.
m <- 10000
U <- simulate_U(m)
V <- simulate_V(m)
# Compute sample means n=1,2,...,10 000.
U_mean <- numeric(m)
V_mean <- numeric(m)
Y_mean <- numeric(m)
for (n in 1:m) {
  U_mean[n] <- mean(U[1:n])
  V_mean[n] <- mean(V[1:n])
  Y_mean[n] <- mean(U[1:n] * V[1:n])
}
# Plot the computed sample means.
ggplot() +
  theme_minimal() +
  geom_line(aes(x = 1:m, y = U_mean), color = "darkgreen") +
  geom_line(aes(x = 1:m, y = V_mean), color = "maroon") +
  geom_line(aes(x = 1:m, y = Y_mean), color = "darkgrey") +
  scale_y_continuous(breaks = c(U_mean[m], V_mean[m], Y_mean[m])) +
  labs(x = "n", y = "Sample means")


Based on the plot, can you guess the means of the three random variables, 𝔼(U), 𝔼(V), and 𝔼(Y)?
Note that the law of large numbers allows us to use simulations to estimate expectations.
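
As a sketch of this idea, the code below estimates expectations by simulation for two made-up distributions. The tutorial's simulate_U() and simulate_V() are not shown, so simulate_U_demo() and simulate_V_demo() are hypothetical stand-ins with known means, chosen only so that the estimates can be checked.

```r
# hypothetical stand-ins; these are not the tutorial's simulate_U() and simulate_V()
simulate_U_demo <- function(m) runif(m, min = 0, max = 2) # E(U) = 1
simulate_V_demo <- function(m) rnorm(m, mean = 3, sd = 1) # E(V) = 3

set.seed(238)
m <- 100000
U_demo <- simulate_U_demo(m)
V_demo <- simulate_V_demo(m)

# sample means as estimates of E(U), E(V), and E(Y) with Y = U * V
estimates <- c(U = mean(U_demo), V = mean(V_demo), Y = mean(U_demo * V_demo))
estimates

# for independent U and V, E(UV) = E(U) E(V), so this should be close to the Y estimate
product_of_means <- estimates[["U"]] * estimates[["V"]]
product_of_means
```

With these stand-ins the three estimates land near 1, 3, and 3, and the product of the first two agrees with the third up to simulation error.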
