Sta238 Wks - Week1+2
Review of basic R
In this section, we revisit basic concepts and usage of R. You may skip the section if you are comfortable
using R.
2 / 5 + (5 - 4) * 1
## [1] 1.4
## [1] TRUE
plot(rnorm(25))
my_number <- 5
my_number
## [1] 5
characters:
logicals:
## [1] TRUE
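The three basic data types used in this section can be checked with `class()`. The sketch below is only an illustration; the variable names and values are made up and do not come from the original page.

```r
# Hypothetical values, chosen only to illustrate the three basic types
a_number    <- 5        # numeric
a_character <- "five"   # character
a_logical   <- 5 > 3    # logical, evaluates to TRUE

class(a_number)     # "numeric"
class(a_character)  # "character"
class(a_logical)    # "logical"
```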
https://rconnect.utstat.utoronto.ca/content/a57e9d9b-336e-4f49-9347-afee2f53a192/#section-review-of-basic-r 1/4
25/01/2024, 12:30 Week 1: Review and Introduction to Data Analysis
## [1] 2 3 7
my_vector[1]
## [1] 2
my_vector[c(1, 3)]
## [1] 2 7
## [1] 7
my_vector[c(T, F, T)]
## [1] 2 7
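The indexing rules above can be tried on a small vector. In this sketch the definition of `my_vector` is an assumption inferred from the printed output `2 3 7`; the extra indexing forms are shown only for comparison.

```r
# Assumed definition, inferred from the printed output `2 3 7`
my_vector <- c(2, 3, 7)

my_vector[3]                     # positional index: 7
my_vector[-1]                    # a negative index drops that element: 3 7
my_vector[my_vector > 2]         # a logical condition as the index: 3 7
my_vector[c(TRUE, FALSE, TRUE)]  # spelled-out form of c(T, F, T): 2 7
```

Writing `TRUE` and `FALSE` in full is slightly safer than `T` and `F`, because `T` and `F` are ordinary variables that can be overwritten.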
n <- 5
numeric(n) # numbers
## [1] 0 0 0 0 0
character(n) # characters
logical(n) # logicals
Exercise 1
Create a vector that consists of the day of your birthday and your first name. Print the vector and comment on its data
type.
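One way to approach the exercise is sketched below with a made-up birthday and name. Because a vector can hold only one data type, R silently converts the number to a character string.

```r
# Hypothetical birthday and name, for illustration only
birthday_and_name <- c(14, "Alex")

birthday_and_name         # "14"   "Alex"
class(birthday_and_name)  # "character": the number was coerced to a string
```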
Functions
In R, a function is in the following form:
You can store the result of a function with result <- function_x(...)
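A function definition generally has a name, a list of arguments, and a body whose last evaluated expression is returned. The sketch below uses a made-up function, `add_and_double()`, purely as an illustration; it is not the example from the original page.

```r
# Hypothetical function: adds two numbers and doubles the sum
add_and_double <- function(a, b = 1) {  # `b` has a default value of 1
  total <- a + b
  2 * total                             # the last expression is the return value
}

result <- add_and_double(2, 3)  # store the result
result                          # 10
add_and_double(4)               # uses the default b = 1, also giving 10
```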
[1] 12
You can look at the help page of each function with ?<function name> . Whenever you need help using a function,
try the help page first. A quick search on the Internet will often provide many examples as well.
?sample
There are also plenty of resources and examples online for most R functions.
[1] 10
Exercise 2
Create a function named diff_in_absolute_diff() that takes 4 numbers, say w , x , y , and z , and computes
$$|w - x| - |y - z|$$
[1] 5
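A minimal sketch of one possible definition follows. The argument names follow the exercise statement; the test values in the call are arbitrary.

```r
# One possible definition; argument names follow the exercise statement
diff_in_absolute_diff <- function(w, x, y, z) {
  abs(w - x) - abs(y - z)
}

diff_in_absolute_diff(1, 10, 5, 50)  # |1 - 10| - |5 - 50| = 9 - 45 = -36
```

Because `abs()` and subtraction both work element by element, a function written this way also accepts vectors of equal length.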
## [1] 15 30 45 60 75
Many R functions are vectorized: they automatically apply the operation to each element and return a vector of the same size.
x + y
## [1] 15 30 45 60 75
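The element-wise behaviour can be seen with a few more operations. The vectors below are the `x` and `y` listed in Exercise 3; using them here is an assumption, although their sum does match the output shown above.

```r
# Vectors as listed in Exercise 3; x + y matches the output shown above
x <- c(10, 20, 30, 40, 50)
y <- c(5, 10, 15, 20, 25)

x + y       # element-wise sum: 15 30 45 60 75
x * 2       # the scalar 2 is recycled: 20 40 60 80 100
abs(y - x)  # functions such as abs() also work element-wise: 5 10 15 20 25
x > 25      # comparisons return a logical vector: FALSE FALSE TRUE TRUE TRUE
```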
Exercise 3
Compute $|w - x| - |y - z|$ for each element of the following vectors.
w <- c(1, 2, 3, 4, 5)
x <- c(10, 20, 30, 40, 50)
y <- c(5, 10, 15, 20, 25)
z <- c(50, 40, 30, 20, 10)
The vectors w , x , y , z , and the function diff_in_absolute_diff() have been predefined for your use in the
following chunk. Use a for-loop to compute and save the resulting values to the vector u .
# use a loop
u <- numeric(5)
for (i in 1:5) {
  u[i] <- diff_in_absolute_diff(w[i], x[i], y[i], z[i])
}
u
In the code chunk below, the vectors are directly used as inputs. Compare the results from the loop
version above.
# diff_in_absolute_diff() is vectorized
u <- diff_in_absolute_diff(w, x, y, z)
u
Data in R
Matrix
It is often useful to work in 2-dimensions when working with data. In R, a matrix is a 2-dimensional object consisting of
values of a single data type.
matrix(1:9, 3)
##      [,1] [,2] [,3] [,4] [,5] [,6] [,7] [,8] [,9] [,10] [,11] [,12] [,13]
## [1,] "A"  "C"  "E"  "G"  "I"  "K"  "M"  "O"  "Q"  "S"   "U"   "W"   "Y"
## [2,] "B"  "D"  "F"  "H"  "J"  "L"  "N"  "P"  "R"  "T"   "V"   "X"   "Z"
You can access individual elements using [ as with vectors. If you supply a single value or a single vector, R treats the matrix
as one long vector formed by concatenating its columns in order.
## [1] 5
## [1] 2 11
More intuitively, you can use 2-dimensional coordinates to extract the elements separated by a comma. The first value
corresponds to the row and the second value to the column. The index starts at 1 like a single value index.
## [1] 8
## [1] 7 9
mat_ex[c(2, 3), c(1, 4)] # second and third row, first and fourth column
## [,1] [,2]
## [1,] 2 11
## [2,] 3 12
If you leave one dimension blank, you extract entire rows or columns.
## [1] 1 4 7 10
## [1] 10 11 12
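The indexing outputs in this section are consistent with a 3-by-4 matrix filled column by column with 1 to 12. The sketch below assumes that definition of `mat_ex`, which is inferred from the outputs rather than taken from the original code.

```r
# Assumed definition: a 3-by-4 matrix filled column by column
mat_ex <- matrix(1:12, nrow = 3)

mat_ex[5]           # single index, counting down the columns: 5
mat_ex[c(2, 11)]    # 2 11
mat_ex[2, 3]        # row 2, column 3: 8
mat_ex[c(1, 3), 3]  # rows 1 and 3 of column 3: 7 9
mat_ex[1, ]         # entire first row: 1 4 7 10
mat_ex[, 4]         # entire fourth column: 10 11 12
```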
There are a few convenient functions provided for working with multidimensional objects. For example, rowMeans()
computes mean values for each row. See documentation for details and other similar functions.
?rowMeans
Description:
Form row and column sums and means for numeric arrays (or data
frames).
Usage:
Arguments:
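As a quick illustration of the functions documented on that help page, the sketch below applies them to a small made-up matrix.

```r
# Hypothetical 2-by-3 matrix for illustration
m <- matrix(1:6, nrow = 2)  # rows are (1, 3, 5) and (2, 4, 6)

rowMeans(m)  # 3 4
rowSums(m)   # 9 12
colMeans(m)  # 1.5 3.5 5.5
colSums(m)   # 3 7 11

which(rowMeans(m) > 3.5)  # row numbers meeting a condition: 2
```

`which()` is an alternative to indexing a sequence of row numbers with a logical vector.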
Exercise 4
mat is a 6 by 4 numeric matrix in the following code chunk. Use R code to
1. find the row numbers where the row mean is smaller than 15
2. find the column numbers where the column sum is larger than 75
3. check whether all values in the second row are larger than the corresponding values in the fourth row AND the value 5 is in the last
column
mat

(1:6)[rowMeans(mat) < 15] # fill in the bracket with a logical statement
(1:4)[colSums(mat) > 75] # fill in the bracket with a logical statement

all(mat[2, ] > mat[4, ]) && (5 %in% mat[, 4])
[1] 2 3 4 5 6
[1] 2 3
[1] FALSE
Data frames
Vectors and matrices can only store values of a single data type. We can create tables whose columns have different
data types via data.frame() in R.
For example, the data frame below has a column of numbers and a column of characters.
set.seed(238)
numbers  alphabets  l      y      w
<int>    <chr>      <lgl>  <int>  <int>
1        a          TRUE   106    14
2        b          TRUE   100    98
3        c          FALSE  89     1
5        e          TRUE   92     282
5 rows
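A data frame with columns of different types can be built as sketched below. The table name, column names, and values are hypothetical and are not the ones used to create the table above.

```r
# Hypothetical table; the column names and values are made up
example_table <- data.frame(
  id      = 1:3,                  # integer column
  label   = c("a", "b", "c"),     # character column
  flagged = c(TRUE, FALSE, TRUE)  # logical column
)

str(example_table)            # one line per column, showing its type
sapply(example_table, class)  # the class of each column
```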
Many functions in R are designed to work with data frames. For example, ggplot() expects a data frame as the first
input argument. You can then define the mappings between columns and aesthetics (e.g., x axis).
library(ggplot2)
ggplot(some_table, # provide the table
aes(x = numbers)) + # map the column `numbers` to x-axis
theme_minimal() + # use minimal theme
geom_point(aes(y = ifelse(l, y, w), # draw points using `y` or `w` based on the value of `l`
color = l)) # change color based on the value of `l`
Specifying aesthetics such as color , size , linetype , shape , etc. inside aes() changes the specified aesthetics
based on the mapped values.
Most multivariate data come in a table format and R makes it easy to work with them.
You can access elements of a data frame using 2D indexes in the same way you use 2D indexes with matrices.
   numbers  alphabets  l      y      w
   <int>    <chr>      <lgl>  <int>  <int>
1  1        a          TRUE   106    14
1 row
   numbers  alphabets  l      y      w
   <int>    <chr>      <lgl>  <int>  <int>
1  1        a          TRUE   106    14
2  2        b          TRUE   100    98
3  3        c          FALSE  89     1
3 rows
some_table[1, 3:4] # extracting the elements in the first row, third and fourth columns
   l      y
   <lgl>  <int>
1  TRUE   106
1 row
When you extract multiple columns, the result is still a data frame.
On the other hand, using a single positional index behaves differently from a matrix and returns a column.
some_table[3]
l
<lgl>
TRUE
TRUE
FALSE
FALSE
TRUE
5 rows
You can also extract a column using $ followed by the column name.
some_table$l
Using a 2D index to extract a column (e.g., some_table[ , 3] ) or the $ extractor returns a vector, whereas using a 1D index
(e.g., some_table[3] ) returns a single-column data frame. The difference rarely matters, but it is a good
place to check if you face an unexpected error.
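The difference between the extraction styles can be checked directly. The sketch below uses a small made-up data frame, `small_table`, rather than `some_table`.

```r
# Hypothetical two-column table for illustration
small_table <- data.frame(numbers = 1:3, l = c(TRUE, FALSE, TRUE))

is.data.frame(small_table[2])    # TRUE: a 1D index keeps the data frame
is.data.frame(small_table[, 2])  # FALSE: a 2D index returns a vector
is.logical(small_table$l)        # TRUE: `$` also returns a vector

# drop = FALSE keeps the data frame structure with a 2D index
is.data.frame(small_table[, 2, drop = FALSE])  # TRUE
```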
Exercise 5
In the code chunk below, some_table is available for you.
i. Extract values of y from rows where l is TRUE .
ii. Extract values of w from rows where l is FALSE .
iii. Add up the five extracted values.
some_table
sum(some_table$y[some_table$l]) + sum(some_table$w[!some_table$l])
numbers  alphabets  l      y      w
<int>    <chr>      <lgl>  <int>  <int>
1        a          TRUE   106    14
2        b          TRUE   100    98
3        c          FALSE  89     1
5        e          TRUE   92     282
5 rows
[1] 440
dplyr package
dplyr (https://dplyr.tidyverse.org) is a package that provides a set of functions for manipulating data. They are useful
when working with tabular data.
library(dplyr)
For example, mutate() allows you to create one or more new columns as function(s) of existing columns. The code
below computes the difference forecast_temp - observed_temp for each row and saves them to a new column named
forecast_err . We can save the resulting data frame to weather_forecasts again for future use.
|> is called the forward pipe operator. It feeds the object on the left-hand side as the first input argument to the function
on the right-hand side. Here, weather_forecasts is passed to the function mutate() as the first argument.
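The `mutate()` step described above might look like the sketch below. It uses a tiny made-up data frame, `toy_forecasts`, as a stand-in for `weather_forecasts`; only the column names are taken from the text.

```r
library(dplyr)

# Hypothetical stand-in for `weather_forecasts`; the values are made up
toy_forecasts <- data.frame(
  season        = c("Summer", "Summer", "Winter"),
  forecast_temp = c(25, 30, -2),
  observed_temp = c(24, 34, 1)
)

# mutate() adds the new column; assigning back keeps it for later use
toy_forecasts <- toy_forecasts |>
  mutate(forecast_err = forecast_temp - observed_temp)

toy_forecasts$forecast_err  # 1 -4 -3
```

The piped call is equivalent to `mutate(toy_forecasts, forecast_err = forecast_temp - observed_temp)`.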
Another useful function is filter() . As the name suggests, the function filters the given data set based on one or more
logical statements. For example, the statement below outputs the subset of the data set where the season is summer and the
absolute ( abs() in R) discrepancy between the forecast temperature and the observed temperature is greater than 3°C.
weather_forecasts |>
filter(season == "Summer", abs(forecast_err) > 3)
4 rows
You can supply multiple logical statements separated by , to request records that satisfy all of the conditions. If
you want records that satisfy at least one of multiple conditions, you can use the “OR” operator, | .
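The difference between separating conditions with a comma and combining them with `|` is sketched below on a made-up data frame, `toy_forecasts`, standing in for `weather_forecasts`.

```r
library(dplyr)

# Hypothetical stand-in for `weather_forecasts`; the values are made up
toy_forecasts <- data.frame(
  season       = c("Summer", "Summer", "Winter", "Winter"),
  forecast_err = c(1, -4, -3.5, 0.5)
)

# comma: both conditions must hold (only the Summer row with error -4)
both <- toy_forecasts |>
  filter(season == "Summer", abs(forecast_err) > 3)

# `|`: at least one condition must hold
either <- toy_forecasts |>
  filter(season == "Summer" | abs(forecast_err) > 3)

nrow(both)    # 1
nrow(either)  # 3
```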
You can find more about the package on this cheat sheet (https://rstudio.github.io/cheatsheets/html/data-transformation.html).
Numerical summaries
You can compute individual numerical summaries using built-in R functions that take numeric vectors as inputs. These
include:
For example, you can compute the 0.3 and 0.7 quantiles (the 30th and 70th percentiles) of the observed temperatures as below.
## 30% 70%
## -1.1111111 0.5555556
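A call of the following shape produces output in that format. The vector `temps` is made up for illustration and is not the observed temperature column.

```r
# Hypothetical temperatures for illustration
temps <- c(-5, -1, 0, 2, 4, 7, 10, 12, 15, 20, 21)

q <- quantile(temps, c(0.3, 0.7))
q         # a named vector with entries "30%" and "70%"
names(q)  # "30%" "70%"
```

With R's default quantile definition, these 11 values give 2 for the 0.3 quantile and 12 for the 0.7 quantile.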
Alternatively, you can use summarise() from dplyr to compute multiple numerical summaries at once. Similar to
mutate() , summarise() takes a data frame as the first input followed by one or more summarising functions of existing
columns. For example, the code below computes the five-number summary and the sample mean of the observed
temperatures.
weather_forecasts |>
summarise(
min = min(observed_temp),
lower_quartile = quantile(observed_temp, .25),
mean = mean(observed_temp),
median = median(observed_temp),
upper_quartile = quantile(observed_temp, .75),
max = max(observed_temp)
)
1 row
Exercise 6
Compute the mean, the median, the interquartile range, and the sample variance of the following quantities in
weather_forecasts for summer and winter separately:
i. predicted temperatures
ii. observed temperatures
iii. difference between the predicted and the observed temperatures
The data frame weather_forecasts with the computed column forecast_err is available in the code chunk
below.
weather_forecasts

# using dplyr --- advanced using `.by` argument in `summarise()`
weather_forecasts |>
  summarise(
    .by = season,
    predicted_mean = mean(forecast_temp),
    predicted_median = median(forecast_temp),
    predicted_iqr = quantile(forecast_temp, .75) - quantile(forecast_temp, .25),
    predicted_var = var(forecast_temp),
    observed_mean = mean(observed_temp),
    observed_median = median(observed_temp),
    observed_iqr = quantile(observed_temp, .75) - quantile(observed_temp, .25),
    observed_var = var(observed_temp),
    error_mean = mean(forecast_err),
    error_median = median(forecast_err),
    error_iqr = quantile(forecast_err, .75) - quantile(forecast_err, .25),
    error_var = var(forecast_err)
  )
Graphical summaries
We will use the ggplot2 (https://ggplot2.tidyverse.org) package to create plots in R. You can find the cheat sheet for the
package here (https://rstudio.github.io/cheatsheets/html/data-visualization.html).
You can create a canvas by passing a data frame to the ggplot() function and then add graphical objects that use
the column names to create plots. You can further customize the plot by adding more layers. For example, you can create a scatter
plot of the observed temperatures vs. predicted temperatures coloured by season in the following steps:
1. create a canvas by passing the data frame to ggplot()
ggplot(weather_forecasts)
2. add points with observed_temp mapped to x axis, forecast_temp to y axis, and season to colour
ggplot(weather_forecasts) +
geom_point(aes(x = observed_temp, y = forecast_temp, colour = season))
ggplot(weather_forecasts) +
geom_point(aes(x = observed_temp, y = forecast_temp, colour = season)) +
labs(title = "Weather forecast isn't perfect.",
x = "Observed temperature",
y = "Predicted temperature")
ggplot(weather_forecasts) +
geom_point(aes(x = observed_temp, y = forecast_temp, colour = season)) +
labs(title = "Weather forecast isn't perfect.",
x = "Observed temperature",
y = "Predicted temperature") +
scale_colour_discrete(name = NULL) + # remove the legend title
theme_minimal() + # apply the minimal theme
theme(legend.position = "top") # place the legend at the top
ggplot(weather_forecasts) +
geom_point(aes(x = observed_temp, y = forecast_temp, colour = season)) +
labs(title = "Weather forecast isn't perfect.",
x = "Observed temperature",
y = "Predicted temperature") +
scale_colour_discrete(name = NULL) +
theme_minimal() +
theme(legend.position = "none") + # remove the legend
facet_wrap(vars(season), nrow = 2) # separate by season in 2 rows
ggplot(weather_forecasts) +
geom_boxplot(aes(x = observed_temp, y = season)) +
labs(title = "Temperatures varied more in the winter across the U.S.",
x = NULL, y = NULL) + # remove axis titles
theme_minimal()
By mapping the categorical column season to the y-axis, you can easily compare the distributions between the two
seasons using box plots.
Mapping the quantity of interest to the x-axis results in horizontal box plots.
# using normal reference method for bin width (MIPS Remark 15.1)
B <- (24 * sqrt(pi))^(1 / 3) * sd(weather_forecasts$observed_temp) * nrow(weather_forecasts)^(-1 / 3)
ggplot(weather_forecasts) +
geom_histogram(aes(x = observed_temp), binwidth = B) +
labs(title = "Temperatures varied more in the winter across the U.S.",
x = "Observed temperature", y = "Frequency") +
theme_minimal() +
facet_wrap(vars(season), nrow = 2)
ggplot(weather_forecasts, aes(observed_temp)) +
geom_histogram(aes(y = after_stat(density)),
bins = 15, alpha = .5) +
geom_density(colour = "forestgreen") +
labs(title = "Temperatures varied more in the winter across the U.S.",
x = "Observed temperature", y = "Density") +
theme_minimal() +
facet_wrap(vars(season), nrow = 2)
Adding multiple geom_*() to ggplot() results in multiple layers of plot objects in a single plot.
You can define the aes() mapping inside ggplot() instead of the geom_*() functions. These mappings apply to all
geoms created.
after_stat(density) inside geom_histogram() rescales the bar heights to densities (relative frequency divided by bin width), so the total area of the bars is 1.
The bins argument controls the number of bins.
alpha controls the transparency of the plotting objects.
Use the colour argument outside aes() to directly change the colour of the plotting objects rather than mapping it to
data.
weather_forecasts |>
ggplot(aes(x = observed_temp, colour = season)) +
geom_step(stat = "ecdf") +
labs(title = "Temperatures varied more in the winter across the U.S.",
x = "Observed temperature", y = "Empirical CDF") +
theme_minimal() +
scale_colour_discrete(name = NULL) +
theme(legend.position = "top")
stat = "ecdf" computes the empirical CDF of the mapped column and geom_step() plots it as a step
function.
To plot an empirical CDF curve without the vertical lines with ggplot() , you can use geom_segment() but you
need to compute the empirical CDF separately. Below is a custom function that achieves this.
ggplot() +
geom_segment(aes(x = x1, y = y, xend = x2, yend = y)) +
geom_point(aes(x = x2[-length(x2)], y = y[-length(y)]),
shape = 21, fill = "white", size = point_size) +
geom_point(aes(x = xstart[-1], y = y[-1]), shape = 19, size = point_size)
}
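Since only the tail of that function survives here, the following is a separate, self-contained sketch of the same idea and not the original implementation. The helper names `ecdf_jumps()` and `plot_ecdf_no_verticals()` are hypothetical. The first helper computes the jump points of the empirical CDF; the second draws horizontal segments with an open point just before each jump and a closed point at it.

```r
library(ggplot2)

# Hypothetical helper (not the original function): one row per jump of the ECDF
ecdf_jumps <- function(x) {
  xs <- sort(unique(x))
  Fn <- ecdf(x)(xs)                 # ECDF value at each jump
  data.frame(
    x      = xs,
    x_next = c(xs[-1], NA),         # where the flat segment ends
    y      = Fn,                    # height at the jump (closed point)
    y_prev = c(0, Fn[-length(Fn)])  # height just before the jump (open point)
  )
}

plot_ecdf_no_verticals <- function(x, point_size = 2) {
  jumps <- ecdf_jumps(x)
  segs  <- jumps[!is.na(jumps$x_next), ]  # the last jump has no finite segment
  ggplot() +
    geom_segment(aes(x = x, y = y, xend = x_next, yend = y), data = segs) +
    geom_point(aes(x = x, y = y_prev), data = jumps,
               shape = 21, fill = "white", size = point_size) +
    geom_point(aes(x = x, y = y), data = jumps, shape = 19, size = point_size)
}

jumps_example <- ecdf_jumps(c(1, 2, 2, 3))
ecdf_plot <- plot_ecdf_no_verticals(c(1, 2, 2, 3))
```

Printing `ecdf_plot` draws the figure. Keeping the jump computation in its own function makes it easy to inspect the numbers before plotting them.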
Exercise 7
Use the code chunk below to create appropriate plots to answer the following questions. The data frame
weather_forecasts with the computed column forecast_err is available in the chunk.
weather_forecasts

# box plot of errors across the two seasons
weather_forecasts |>
  ggplot(aes(x = forecast_err)) +
  theme_minimal() +
  labs(x = "Error in predicted temperature",
       y = NULL) +
  geom_boxplot()
# box plot of errors for each season
weather_forecasts |>
  ggplot(aes(x = forecast_err, y = season)) +
  theme_minimal() +
  labs(x = "Error in predicted temperature",
       y = NULL) +
  geom_boxplot()
# histograms of errors for each season
weather_forecasts |>
  ggplot(aes(x = forecast_err)) +
  theme_minimal() +
  labs(x = "Error in predicted temperature",
       y = "Frequency") +
  geom_histogram(bins = 18) +
  facet_wrap(vars(season), nrow = 2)
# histogram and KDE of errors across the two seasons
weather_forecasts |>
  ggplot() +
  theme_minimal() +
  labs(x = "Error in predicted temperature",
       y = "Density") +
  geom_histogram(aes(x = forecast_err, y = after_stat(density)),
                 alpha = .5, bins = 25) +
  geom_density(aes(x = forecast_err), colour = "forestgreen")
# empirical CDF of the errors for each season
weather_forecasts |>
  ggplot(aes(x = forecast_err, colour = season)) +
  geom_step(stat = "ecdf") +
  labs(x = "Error in predicted temperature", y = "Empirical CDF") +
  theme_minimal() +
  scale_colour_discrete(name = NULL) +
  theme(legend.position = "top")
# scatter plot between observed temperatures and errors
weather_forecasts |>
  ggplot(aes(x = observed_temp, y = forecast_err, colour = season)) +
  theme_minimal() +
  geom_point() +
  labs(x = "Observed temperature", y = "Error in predicted temperature") +
  scale_colour_discrete(name = NULL) +
  theme(legend.position = "top")
Compare the median error in predicted temperatures between winter and summer. Which of the following
statements is correct?
✓ Box plots show that the median error was further from 0 in the winter.
✗ Empirical CDFs show that the median errors were 0 both in the winter and in the summer.
Correct!
Inspect the distribution of error in predicted temperatures across the two seasons. With which of the following
statements do you agree the least?
✓ The box plot reveals that all errors are concentrated within four times the interquartile range.
✗ The histogram shows that the errors are approximately centred around 0.
Correct!
From end to end, the whiskers span at most four times the IQR, since each whisker extends at most 1.5 times the IQR beyond the box. The box plot of the errors reveals 4 points
outside the whiskers.
Inspect the relationship between the observed temperatures and errors in predicted temperatures. With which of
the following statements do you agree the most?
✗ The relationship is exponential in the summer while there is no discernable relationship in the winter.
Correct!
2. The data set was prepared by taking random samples from the U.S. weather forecast data set available via TidyTuesday
(https://github.com/rfordatascience/tidytuesday/tree/master/data/2022/2022-12-20). Sai Shreyas Bhavanasi, Harrison
Lanier, Lauren Schmiedeler, and Clayton Strauch at the Saint Louis University Department of Mathematics and Statistics
(https://github.com/speegled/weather_forecasts) collected and processed the source data.
25/01/2024, 13:05 Week 2: Statistical Modelling
For example, consider sim_z , which contains 10 random samples generated from the standard normal distribution,
$N(0, 1)$.
In each of the plots below, a histogram of sim_z is plotted with the probability density function of the standard normal
distribution in red using different bin widths. The bin widths are controlled using the binwidth argument of the
geom_histogram() function.
ggplot() +
theme_minimal() +
geom_histogram(
aes(x = sim_z, y = after_stat(density)),
binwidth = # set the bin width here
)
The narrower the bins, the more spikes and gaps the histogram contains. On the other hand, when the bins are too wide,
the histogram loses much of the information contained in the data. Similarly, the bandwidth of a KDE controls how
smooth and flat the resulting estimate is. The plots below use the Gaussian kernel with different bandwidths.
ggplot() +
theme_minimal() +
geom_density(
aes(x = sim_z),
bw = # set the bandwidth here
)
In the example above, a bin width between 0.6 and 1 seems suitable and a bandwidth of 0.5 or 0.6 seems fitting for the
Gaussian KDE.
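The effect of the bandwidth can also be checked numerically with base R's `density()`. This is a sketch under assumptions: the seed is arbitrary and `sim_z` is regenerated here, so it is not the same sample as the one plotted above.

```r
set.seed(238)       # arbitrary seed, for reproducibility only
sim_z <- rnorm(10)  # stand-in for the `sim_z` used above

narrow <- density(sim_z, bw = 0.1, kernel = "gaussian")
wide   <- density(sim_z, bw = 1, kernel = "gaussian")

# A small bandwidth gives a spiky estimate with tall peaks;
# a large bandwidth spreads the mass out and flattens the estimate.
max(narrow$y)
max(wide$y)
```

Comparing the two maxima shows the narrow-bandwidth estimate peaking higher, which matches the visual impression of a spikier curve.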
Exercise 1
https://rconnect.utstat.utoronto.ca/content/46071d0f-46d2-4f2a-bbf9-522013a5ac33/#section-guess-the-distribution 1/10
The following plots display probability density and probability mass functions of different distributions. Use the plots
to answer the following questions.
Part 1
In the following code chunk, each of x1 , x2 , x3 , and x4 contains 500 samples from one of the distributions shown
in the plots above. Generate histograms and/or KDE plots to inspect the distributions of the random samples and
guess their distributions.
ggplot() +
  theme_minimal() +
  geom_histogram(
    aes(x = x2, y = after_stat(density)),
    binwidth = 0.5 # set the bin width here
  ) +
  ggtitle("Histogram of x2")

ggplot() +
  theme_minimal() +
  geom_density(
    aes(x = x2),
    bw = 0.5 # set the bandwidth here
  ) +
  ggtitle("KDE Plot of x2")

ggplot() +
  theme_minimal() +
  geom_histogram(
    aes(x = x3, y = after_stat(density)),
    binwidth = 0.5 # set the bin width here
  ) +
  ggtitle("Histogram of x3")

ggplot() +
  theme_minimal() +
  geom_density(
    aes(x = x3),
    bw = 0.5 # set the bandwidth here
  ) +
  ggtitle("KDE Plot of x3")

ggplot() +
  theme_minimal() +
  geom_histogram(
    aes(x = x4, y = after_stat(density)),
    binwidth = 0.5 # set the bin width here
  ) +
  ggtitle("Histogram of x4")

ggplot() +
  theme_minimal() +
  geom_density(
    aes(x = x4),
    bw = 0.5 # set the bandwidth here
  ) +
  ggtitle("KDE Plot of x4")
[1] 3 2 4 2 2 3 4 2 4 2 3 3 2 0 2 4 2 0 4 6 4 1 3 3 2 3 2 3 3 1 5 2 1 2 2 3 3
[38] 3 6 2 0 0 3 5 1 4 3 4 3 2 3 3 3 6 6 2 6 3 2 3 1 2 4 3 5 4 0 4 3 2 4 4 2 3
[75] 5 5 3 3 4 3 4 4 3 3 3 5 3 3 3 2 2 5 6 6 2 4 5 3 5 2 4 1 3 0 3 2 2 1 3 4 0
[112] 3 0 2 3 3 3 1 3 4 5 4 3 3 2 4 3 5 2 2 3 3 2 2 3 2 5 2 2 3 4 6 2 3 4 2 3 5
[149] 1 1 1 3 4 3 1 2 1 2 4 2 4 5 2 5 4 7 4 1 3 2 3 4 2 4 3 5 2 1 3 7 0 2 3 3 2
[186] 1 0 2 2 4 2 5 4 2 3 2 2 5 4 3 1 2 3 2 1 1 6 4 2 3 2 1 4 4 6 4 5 2 5 1 3 2
[223] 1 2 5 6 4 3 1 2 2 3 3 5 5 7 4 3 2 3 3 4 6 4 4 3 5 3 3 2 1 2 4 2 3 1 5 3 4
[260] 4 1 3 4 3 3 3 5 2 4 2 3 2 2 3 4 2 6 2 0 1 2 3 3 1 0 4 2 2 0 3 2 3 2 1 1 4
[297] 1 5 1 1 2 7 2 1 3 3 3 3 4 4 4 2 4 0 2 2 3 2 4 3 5 3 3 4 6 4 1 3 3 3 3 6 4
[334] 3 3 7 4 4 2 7 3 2 5 4 2 4 4 3 2 4 2 3 2 2 4 1 3 2 3 5 5 0 5 1 5 4 1 2 3 3
[371] 4 1 4 3 2 4 6 3 1 2 3 0 5 4 1 6 4 2 3 5 4 4 4 2 4 2 3 4 3 6 4 3 5 4 4 3 4
[408] 2 5 4 3 1 3 1 4 2 3 4 3 3 2 4 1 1 3 4 5 2 1 1 4 3 2 3 2 5 5 2 3 4 3 4 3 5
[445] 4 4 4 4 5 4 5 1 2 5 3 3 3 3 2 2 3 5 3 4 3 3 6 3 4 1 2 1 2 4 4 2 1 2 3 1 4
[482] 2 2 2 4 3 2 1 4 4 2 1 4 3 2 1 0 3 1 7
✗ N(0, 1)
✓ N(0, 4)
✗ Exp(1)
✗ Binom(10, 0.5)
Correct!
✗ N(0, 1)
✗ N(0, 4)
✓ Exp(1)
✗ Binom(10, 0.5)
Correct!
✗ N(0, 1)
✗ N(0, 4)
✗ Exp(1)
✓ Binom(10, 0.5)
Correct!
✓ N(0, 1)
✗ N(0, 4)
✗ Exp(1)
✗ Binom(10, 0.5)
Correct!
Part 2
x5 contains 500 random samples from a normal distribution with some parameters $\mu$ and $\sigma^2$. Generate histograms
and/or KDE plots to inspect the distribution of the random sample. Can you guess the approximate values of the
parameters?
ggplot() +
  theme_minimal() +
  geom_histogram(
    aes(x = x5, y = after_stat(density)),
    binwidth = 0.5 # set the bin width here
  ) +
  ggtitle("Histogram of x5")

# Generate KDE plot
ggplot() +
  theme_minimal() +
  geom_density(
    aes(x = x5),
    bw = 0.5 # set the bandwidth here
  ) +
  ggtitle("KDE Plot of x5")

# Make a guess about the parameters
mean_guess <- mean(x5)
sd_guess <- sd(x5)

cat("Approximate guess for mean (μ):", mean_guess, "\n")
cat("Approximate guess for standard deviation (σ):", sd_guess, "\n")
$$P(|Y - \mathbb{E}(Y)| \ge a) \le \frac{1}{a^2} \operatorname{Var}(Y)$$
where $Y$ is a random variable with $\mathbb{E}(Y) < \infty$ and $\operatorname{Var}(Y) < \infty$, and $a$ is a positive constant. We can visualize the
theorem in action using simulation.
Consider $Z \sim N(0, 1)$ and a positive constant $a$. Since $\mathbb{E}(Z) = 0$ and $\operatorname{Var}(Z) = 1$, Chebyshev’s
inequality gives the bound $P(|Z| \ge a) \le 1/a^2$. The following code computes the bounds for a sequence of constants stored in as : 1,
1.1, 1.2, 1.3, …, 10.
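A minimal sketch of that computation follows. It is an assumed reconstruction; only the names `as` and `chebyshev_bounds` are taken from the plotting code in this section.

```r
# Assumed reconstruction; the names match those used in the plotting code
as <- seq(from = 1, to = 10, by = 0.1)  # 1, 1.1, 1.2, ..., 10
chebyshev_bounds <- 1 / as^2            # Var(Z) / a^2 with Var(Z) = 1

length(as)                 # 91 constants
head(chebyshev_bounds, 3)  # 1.0000000 0.8264463 0.6944444
```

Because `as^2` is computed element by element, no loop is needed to obtain all 91 bounds.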
We can plot the bounds against the constant a using ggplot() as below.
library(ggplot2)
ggplot() +
theme_minimal() +
# map the as constants to the x-axis
# and the bounds to the y-axis
geom_line(aes(x = as, y = chebyshev_bounds)) +
labs(x = "a", y = "P(|Z| >= a)") +
ylim(c(0, 1)) # set y-axis limits
Let’s now estimate the probabilities $P(|Z| \ge a)$ for $a = 1, 2, 3, \ldots, 10$ using 50 simulated random samples of $Z$. For
example, you can estimate $P(|Z| \ge 1)$ with the following code.
## [1] 0.34
We can repeat the procedure for $a = 1, 2, 3, \ldots, 10$ using a for-loop. We will store the simulated probabilities in the
vector sim_probabilities .
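The estimation step and the loop could be written as in the sketch below. The seed is arbitrary, and drawing a fresh sample for every value of a is an assumption; the original code may reuse one sample.

```r
set.seed(238)  # arbitrary seed so the sketch is reproducible
n <- 50

# estimate P(|Z| >= 1) by the proportion of simulated values with |z| >= 1
z <- rnorm(n)
mean(abs(z) >= 1)

# repeat for a = 1, 2, ..., 10, drawing a fresh sample each time (an assumption)
sim_probabilities <- numeric(10)
for (a in 1:10) {
  z <- rnorm(n)
  sim_probabilities[a] <- mean(abs(z) >= a)
}
sim_probabilities
```

Taking the mean of a logical vector counts the proportion of `TRUE` values, which is the simulated probability.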
Let’s now plot the simulated probabilities along with the upper bounds based on Chebyshev’s inequality. We can simply
add appropriate layers to the ggplot from above.
ggplot() +
theme_minimal() +
geom_line(aes(x = as, y = chebyshev_bounds)) +
labs(x = "a", y = "P(|Z| >= a)") +
ylim(c(0, 1)) +
# add a layer of dotted lines connecting the simulated probabilities
geom_line(aes(x = 1:10, y = sim_probabilities),
linetype = "dotted", colour = "maroon") +
# add a layer of points for the simulated probabilities
geom_point(aes(x = 1:10, y = sim_probabilities), colour = "maroon")
The plot shows that all simulated probabilities fall below the upper bound curve based on Chebyshev’s inequality.
Exercise 2
Suppose $W \sim \text{Exp}(1/5)$. Create a plot similar to the one above demonstrating Chebyshev’s inequality for
$P(|W - \mathbb{E}(W)| < a)$ following the steps below:
1. Compute the Chebyshev bounds of the probability for a between 5 and 50 in increments of 0.5; store both the a values and the lower bounds.
2. Simulate 50 random samples of W to estimate the probability for each of a = 5, 10, 15, 20, ..., 50; store both the a values and the simulated probabilities.
3. Use geom_line() to plot the Chebyshev bounds as a curve and geom_point() + geom_line() to plot the
simulated probabilities as connected dots on the y-axis against the a values in a single plot; label the axes
appropriately.
# step 1
as_chevyshev <- seq(from = 5, to = 50, by = .5)
# recall Var(W) = 1/lambda^2 for W ~ Exp(lambda)
chevyshev_bounds <- 1 - 25 / as_chevyshev^2
# step 2
n <- 50
as_simulation <- seq(from = 5, to = 50, by = 5)
sim_probabilities <- numeric(length(as_simulation))
for (i in seq_along(as_simulation)) {
  W <- rexp(n = n, rate = 1/5)
  # recall E(W) = 1/lambda for W ~ Exp(lambda)
  sim_probabilities[i] <- mean(abs(W - 5) < as_simulation[i])
}
# step 3
ggplot() +
  theme_minimal() + # optional but strongly recommended
  geom_line(aes(x = as_chevyshev, y = chevyshev_bounds)) +
  geom_line(aes(x = as_simulation, y = sim_probabilities),
            linetype = "dotted", colour = "red") +
  geom_point(aes(x = as_simulation, y = sim_probabilities),
             colour = "red") +
  labs(x = "a", y = "P(|W- E[W]| < a)") +
  ylim(c(0, 1))
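$$\bar{X}_n \xrightarrow{p} \mu \quad \text{as } n \to \infty$$
or
$$\lim_{n \to \infty} P\left(|\bar{X}_n - \mu| > \varepsilon\right) = 0$$
where $X_1, X_2, \ldots$ are independent and identically distributed random variables with expectation $\mu$ and variance $\sigma^2$,
$\bar{X}_n = \sum_{i=1}^{n} X_i / n$, and $\varepsilon > 0$. We will visualize the law in action using simulation.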
Consider $W \sim \text{Exp}(1/5)$. Then $E(W) = 5$ since $E(W) = 1/\lambda$. The code below simulates 5 independent copies of $W$ and
computes their arithmetic mean, $\bar{w}_5$, which is a realization of $\bar{W}_5$.
## [1] 4.396142
As you increase the number of copies, $n$, the sample mean, $\bar{W}_n$, should converge to the mean of $W$ based on the law of
large numbers. The code below visualizes the law in action by simulating 10 000 samples of $W$, $(w_1, w_2, \ldots, w_{10\,000})$,
and taking the mean of the first $n$ values for $n = 1, 2, \ldots, 10\,000$.
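The running means can be computed without a loop, as in the sketch below. The seed and the names `N`, `w`, and `running_mean` are assumptions made for illustration; the original code is not shown on this page.

```r
set.seed(238)               # arbitrary seed, for reproducibility only
N <- 10000
w <- rexp(N, rate = 1 / 5)  # E(W) = 1 / rate = 5

# mean of the first n values, for n = 1, ..., N
running_mean <- cumsum(w) / seq_len(N)

running_mean[c(5, 100, N)]  # should settle near 5 as n grows
abs(running_mean[N] - 5)    # distance from E(W) after all N draws
```

`cumsum()` returns the partial sums, so dividing by 1, 2, ..., N gives every sample mean at once. Plotting `running_mean` against `seq_len(N)` shows the convergence.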
Exercise 3
Consider two independent random variables U and V with unknown cumulative distribution functions, and let
Y = U ⋅ V . In the code chunk below, simulate_U(m) returns m simulated values of U and simulate_V(m)
returns m simulated values of V .
Construct a plot showing the convergence of sample means $\bar{U}_n$ and $\bar{V}_n$ following the steps below.
Based on the plot, can you guess the means of the three random variables, $\mathbb{E}(U)$, $\mathbb{E}(V)$, and $\mathbb{E}(Y)$?
Note that the law of large numbers allows us to use simulations to estimate expectations.