Problem Set 1 Solution Numerical Methods


problem set 1

7 feb 2023

1 Setting up and intro problems


1.1 Setup
Install R and RStudio on your laptop. Make sure you install the most recent versions. You do not have to do
this if you are working on a university computer.
Many installation problems can be solved by first removing any existing installation of R and RStudio. Sometimes rebooting
the laptop solves problems as well. Make sure there is enough space available on the hard disk. You can find the
version of R and some other information as follows:
sessionInfo()

## R version 4.1.2 (2021-11-01)


## Platform: x86_64-apple-darwin17.0 (64-bit)
## Running under: macOS Big Sur 10.16
##
## Matrix products: default
## BLAS: /Library/Frameworks/R.framework/Versions/4.1/Resources/lib/libRblas.0.dylib
## LAPACK: /Library/Frameworks/R.framework/Versions/4.1/Resources/lib/libRlapack.dylib
##
## locale:
## [1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8
##
## attached base packages:
## [1] stats graphics grDevices utils datasets methods base
##
## loaded via a namespace (and not attached):
## [1] compiler_4.1.2 magrittr_2.0.1 fastmap_1.1.0 cli_3.1.0
## [5] tools_4.1.2 htmltools_0.5.2 rstudioapi_0.13 yaml_2.2.1
## [9] stringi_1.7.6 rmarkdown_2.11 knitr_1.37 stringr_1.4.0
## [13] xfun_0.29 digest_0.6.29 rlang_1.0.2 evaluate_0.14

1.2 Ordinary Least Squares


Generate 200 observations from a $N(0, 2)$ distribution and put these observations in a vector x. Use this
vector to generate observations $y_i$ according to

$$y_i = 0.3 + 0.2 x_i + \varepsilon_i, \quad i = 1, \dots, 200,$$

with $\varepsilon_i \sim N(0, 0.25)$. Construct an X matrix with a column of ones and the vector x, and calculate the OLS
estimator $(X'X)^{-1} X'y$.
The purpose is to do some simple linear algebra and regression, so that you are familiar with basic functions.
Note that one of the arguments of rnorm is sd and not the variance.
set.seed(3572)
x <- rnorm(200,mean=0,sd=sqrt(2))

y <- 0.3+0.2*x + rnorm(200,sd=sqrt(0.25))
X <- cbind(rep(1,200),x)
b.ols <- solve(t(X) %*% X, t(X) %*% y)
b.ols

## [,1]
## 0.2890479
## x 0.2115689
lm(y~x)

##
## Call:
## lm(formula = y ~ x)
##
## Coefficients:
## (Intercept) x
## 0.2890 0.2116
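As a cross-check, the same estimate can be computed with crossprod, which avoids forming t(X) explicitly. The following is a minimal sketch and not part of the original solution; the names X.chk and b.chk are illustrative, and the data are regenerated so that the block runs on its own.

```r
# Regenerate the data of this exercise so that the block is self-contained
set.seed(3572)
x <- rnorm(200, mean = 0, sd = sqrt(2))
y <- 0.3 + 0.2 * x + rnorm(200, sd = sqrt(0.25))

# Design matrix: a column of ones and the regressor
X.chk <- cbind(1, x)

# crossprod(X) is t(X) %*% X and crossprod(X, y) is t(X) %*% y
b.chk <- solve(crossprod(X.chk), crossprod(X.chk, y))

# Compare with lm(); unname() drops the coefficient labels
all.equal(as.numeric(b.chk), unname(coef(lm(y ~ x))))
```

Solving the normal equations with solve(A, b) rather than computing the inverse first is the same design choice as in the solution above.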

1.3 Problem 4
Consider the matrix
$$A = \begin{pmatrix} 1 & 0 & 0 & 1 \\ 0 & 2 & 0 & 0 \\ 0 & 0 & 3 & 0 \\ 0 & 0 & 0 & 4 \end{pmatrix}.$$
Calculate the inverse of A, and its eigenvalues and eigenvectors.
A <- diag(1:4)
A[1,4] <- 1
A

## [,1] [,2] [,3] [,4]


## [1,] 1 0 0 1
## [2,] 0 2 0 0
## [3,] 0 0 3 0
## [4,] 0 0 0 4
A.inv <- solve(A)
A.inv %*% A

## [,1] [,2] [,3] [,4]


## [1,] 1 0 0 0
## [2,] 0 1 0 0
## [3,] 0 0 1 0
## [4,] 0 0 0 1
A %*% A.inv

## [,1] [,2] [,3] [,4]


## [1,] 1 0 0 0
## [2,] 0 1 0 0
## [3,] 0 0 1 0
## [4,] 0 0 0 1
eigen(A)

## eigen() decomposition

## $values
## [1] 4 3 2 1
##
## $vectors
## [,1] [,2] [,3] [,4]
## [1,] 0.3162278 0 0 1
## [2,] 0.0000000 0 1 0
## [3,] 0.0000000 1 0 0
## [4,] 0.9486833 0 0 0
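The output of eigen can be verified against the defining property of eigenvectors. The sketch below is an illustration only; the names e, V and lambda are introduced here and do not appear in the original solution.

```r
A <- diag(1:4)
A[1, 4] <- 1

e <- eigen(A)
V <- e$vectors        # eigenvectors, one per column
lambda <- e$values    # eigenvalues, in the same order

# Defining property for all columns at once: A %*% V equals V %*% diag(lambda)
max(abs(A %*% V - V %*% diag(lambda)))

# A is upper triangular, so its eigenvalues are its diagonal entries
sort(lambda)
diag(A)
```

The maximum absolute difference should be zero up to rounding error.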

1.4 Write a function


Write a function that calculates $f(x) = a + bx + cx^2$, such that the function takes two arguments: a vector x
where the function needs to be evaluated (elementwise), and a vector of parameters $(a \; b \; c)'$.
It is important that all arguments and parameters are communicated to the function, and not taken from the
workspace when the function is evaluated.
f.correct <- function(x,p){
  p[1] + p[2]*x + p[3]*x^2
}

p <- c(0,1,1)
x <- c(1,2,3,4)
f.correct(x,p)

## [1] 2 6 12 20
R would also evaluate the function if its parameters are found in the workspace rather than passed as an argument, but this may
lead to unintended outcomes.
f.wrong <- function(x){
  p[1] + p[2]*x + p[3]*x^2
}

f.wrong(x)

## [1] 2 6 12 20
p <- p/2
f.wrong(x)

## [1] 1 3 6 10
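A hidden dependence on the workspace can also be detected mechanically. The sketch below assumes that the codetools package is available (it is distributed with R as a recommended package) and uses its findGlobals function; it is an illustration and not part of the original solution.

```r
library(codetools)

f.correct <- function(x, p) {
  p[1] + p[2] * x + p[3] * x^2
}

f.wrong <- function(x) {
  p[1] + p[2] * x + p[3] * x^2
}

# merge = FALSE separates global functions (such as `+` and `[`)
# from global variables
findGlobals(f.correct, merge = FALSE)$variables   # nothing taken from outside
findGlobals(f.wrong, merge = FALSE)$variables     # p is taken from outside
```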

1.5 AR-model
Write a loop that generates data according to an AR(1) process: $y_t = \alpha y_{t-1} + \varepsilon_t$, for $t = 1, \dots, 2000$. Make a
graph of your time series $y_t$ for three different values of $\alpha$. Take $\mathrm{var}(\varepsilon_t) = 0.25$.
We could use a package and a function to do this, but the point here is to write a loop.
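ar.one <- function(T,alpha,sigma){
  # starting value y0, drawn from the innovation distribution
  y0 <- rnorm(1,sd=sigma)
  y <- rep(NA,T)
  y[1] <- alpha * y0 + rnorm(1,sd=sigma)
  # y[t] depends on y[t-1], so the series is built up one step at a time
  for (t in 2:T){
    y[t] <- alpha*y[t-1] + rnorm(1,sd=sigma)
  }
  y
}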
y1 <- ar.one(2000,0.3,0.5)
y2 <- ar.one(2000,-0.1,0.5)
y3 <- ar.one(2000,0.95,0.5)

y.df <- data.frame(y1,y2,y3,i=1:2000)

plot(1:2000,y1)
[Figure: y1 plotted against 1:2000; the y-axis runs from about −1.5 to 1.5.]
plot(1:2000,y2)
[Figure: y2 plotted against 1:2000; the y-axis runs from about −1.5 to 1.5.]

plot(1:2000,y3)

[Figure: y3 plotted against 1:2000; the y-axis runs from about −4 to 4.]
These plots are not very helpful. Below is a slightly more elaborate attempt.
library(ggplot2)
y.df <- data.frame(y=c(y1,y2,y3),alpha=as.factor(c(rep(0.3,2000),rep(-0.1,2000),rep(0.95,2000))),
i=rep(1:2000,3))

ggplot(y.df) + geom_point(aes(x=i,y=y,color=alpha),size=0.2)

[Figure: y plotted against i, with points colored by alpha (−0.1, 0.3, 0.95); the y-axis runs from −5.0 to 5.0.]
ggplot(y.df) + geom_point(aes(x=i,y=y),size=0.2) + facet_wrap(~alpha)

[Figure: y plotted against i in three panels, one per value of alpha (−0.1, 0.3, 0.95); the y-axis runs from −5.0 to 5.0.]
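The exercise asks for a loop, but the same recursion is available in base R through stats::filter with method = "recursive". The sketch below is an illustration under stated assumptions: the innovations are drawn up front (so the random draws are not in the same order as in ar.one), and the seed and the names n.obs, e, y.loop and y.filter are chosen here.

```r
set.seed(1)
n.obs <- 2000
alpha <- 0.95
sigma <- 0.5

y0 <- rnorm(1, sd = sigma)       # starting value
e <- rnorm(n.obs, sd = sigma)    # innovations, drawn in advance

# Loop version, using the pre-drawn innovations
y.loop <- numeric(n.obs)
y.loop[1] <- alpha * y0 + e[1]
for (t in 2:n.obs) {
  y.loop[t] <- alpha * y.loop[t - 1] + e[t]
}

# Recursive filter: y[t] = e[t] + alpha * y[t-1], started from y0.
# stats:: is written out because dplyr also exports a function called filter.
y.filter <- as.numeric(stats::filter(e, filter = alpha,
                                     method = "recursive", init = y0))

all.equal(y.loop, y.filter)
```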

1.6 Computers are imprecise


Computers are not precise in a mathematical sense. Can you find a number $\varepsilon$ such that R thinks that $1 + \varepsilon = 1$?
eps <- 1
while ((1+eps)!=1) eps <- eps/2
eps

## [1] 1.110223e-16
1+eps

## [1] 1
(1+eps)==1

## [1] TRUE
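R stores the relevant constant in .Machine$double.eps, the smallest positive floating-point number x for which 1 + x != 1. The sketch below compares it with the result of the halving loop; it assumes standard IEEE double precision arithmetic.

```r
eps <- 1
while ((1 + eps) != 1) eps <- eps / 2

# The loop stops at the first power of two for which 1 + eps rounds back to 1,
# which is one halving below .Machine$double.eps
.Machine$double.eps
eps == .Machine$double.eps / 2
(1 + .Machine$double.eps) != 1
```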

1.7 Unbiasedness and coverage


Suppose you have a sample of independent observations $X_1, \dots, X_n$, distributed according to a $N(\mu, \sigma^2)$
distribution. Then you know that
• $\bar{X} = \frac{1}{n} \sum_i X_i$ is an unbiased estimator for $EX = \mu$ and,
• $\bar{X} \pm t^{-1}_{n-1}(0.975) \, S / \sqrt{n}$ is a 95% confidence interval for $\mu$, with $t^{-1}_{n-1}(0.975)$ the 97.5th percentile of a t
distribution with $n - 1$ degrees of freedom, and $S$ the square root of the unbiased estimator for the
variance of X, $\sigma^2$.
Take $n = 25$, $\mu = 0$ and $\sigma^2 = 2$, and show the validity of these two claims in a simple simulation experiment.

These statements concern finite sample properties, so we should assess them by repeatedly sampling from the
(true) distribution, and check whether the claims hold.
mu <- 0
sigma <- sqrt(2)
n <- 25

B <- 1000000 # number of replications


m.sampled <- rep(NA,B)
ci.sampled <- matrix(NA,ncol=2,nrow=B)

set.seed(63729)
for (b in 1:B){
# generate a sample
sample.b <- rnorm(n,mean=mu,sd=sigma)
m.sampled[b] <- mean(sample.b)
}

t.percentile <- qt(0.975,df=n-1)


for (b in 1:B){
# generate a sample
sample.b <- rnorm(n,mean=mu,sd=sigma)
sd.b <- sd(sample.b)
ci.sampled[b,1] <- mean(sample.b) - t.percentile*sd.b/sqrt(n)
ci.sampled[b,2] <- mean(sample.b) + t.percentile*sd.b/sqrt(n)
}

mean(m.sampled)

## [1] -0.000373545
mean(ci.sampled[,1]<mu & ci.sampled[,2]>mu)

## [1] 0.949782
The average of the sampled means is very close to the true value (µ = 0), and the coverage of the confidence
interval is very close to 0.95. Whether the choice of a million replications makes sense depends on the
specifications of your laptop or computer. First test your code with, say, B = 10. Later in the course, we will
see a technique that can be used to speed up this type of simulation study.
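One possible way to avoid the explicit loops (not necessarily the technique referred to above) is to draw all samples at once into a matrix with one row per replication. The sketch below is an illustration only; it uses a smaller B than the solution, and the names samples, m, s, half.width, bias.hat and coverage are chosen here.

```r
mu <- 0
sigma <- sqrt(2)
n <- 25
B <- 20000   # illustrative, smaller than in the solution

set.seed(63729)
# One row per replication, one column per observation
samples <- matrix(rnorm(B * n, mean = mu, sd = sigma), nrow = B)

# Sample mean per replication
m <- rowMeans(samples)
# Unbiased sample standard deviation per replication;
# m is recycled down the columns, so row i is centered at m[i]
s <- sqrt(rowSums((samples - m)^2) / (n - 1))

half.width <- qt(0.975, df = n - 1) * s / sqrt(n)
bias.hat <- mean(m) - mu
coverage <- mean(m - half.width < mu & m + half.width > mu)

c(bias = bias.hat, coverage = coverage)
```

Unlike the loop version, this sketch uses the same samples for the mean and for the confidence interval.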

2 Reading and processing data: Stock market data


This week we look at data processing, and we will construct a dataset that will be used the coming weeks.
For that reason, it is important that your dataset is of high quality!

2.1 Reading material


Please scroll through Wickham and Grolemund, chapters 1-16. Brush up your R skills by reading chapters
1-8 of Jones et al., should that be necessary.

2.2 Read market data


On nestor (course documents, week 1) you find two files with stock market data, one with the Dutch
index AEX and one with the German index DAX. Read both files into R. (Hint: use either read.csv or
readxl::read_excel.)

First, we set the working directory to the one containing the data, load the appropriate libraries, and read the
data. When reading a csv file, make sure that the decimal separator is a decimal point, and not a decimal
comma. This sometimes varies with the language settings of the operating system. Nothing beats looking at the first few
lines of data in a simple text editor.
library(readxl)
suppressMessages(library(tidyverse))
aex <- read_excel("aex.xlsx")
glimpse(aex)

## Rows: 7,831
## Columns: 7
## $ Date <chr> "1992-10-12", "1992-10-13", "1992-10-14", "1992-10-15", "1~
## $ Open <chr> "126.382904", "126.773155", "126.056175", "126.097015", "1~
## $ High <chr> "127.181557", "127.004585", "126.555336", "126.568954", "1~
## $ Low <chr> "126.056175", "126.491806", "126.001724", "125.688614", "1~
## $ Close <chr> "126.945595", "126.850296", "126.487274", "125.824753", "1~
## $ `Adj Close` <chr> "126.945595", "126.850296", "126.487274", "125.824753", "1~
## $ Volume <chr> "0", "0", "0", "0", "0", "0", "0", "0", "0", "0", "0", "0"~
dax <- read.csv("dax.csv")
glimpse(dax)

## Rows: 9,041
## Columns: 7
## $ Date <chr> "1987-12-30", "1987-12-31", "1988-01-01", "1988-01-04", "198~
## $ Open <chr> "1005.190002", "null", "null", "956.489990", "996.099976", "~
## $ High <chr> "1005.190002", "null", "null", "956.489990", "996.099976", "~
## $ Low <chr> "1005.190002", "null", "null", "956.489990", "996.099976", "~
## $ Close <chr> "1005.190002", "null", "null", "956.489990", "996.099976", "~
## $ Adj.Close <chr> "1005.190002", "null", "null", "956.489990", "996.099976", "~
## $ Volume <chr> "0", "null", "null", "0", "0", "0", "0", "0", "0", "0", "0",~
save(aex,dax,file="stock data.Rda")

Later, we will clean these data and do some transformations.
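The glimpse output shows that all DAX columns are read as character, because missing values are coded as the string null. One way to deal with this at read time, sketched below as an alternative to converting afterwards, is the na.strings argument of read.csv. The three data rows are copied from the output above (with fewer columns), and the names csv.text, d.chr and d.num are chosen here.

```r
# A few lines in the layout of dax.csv, restricted to three columns
csv.text <- "Date,Open,Close
1987-12-30,1005.190002,1005.190002
1987-12-31,null,null
1988-01-04,956.489990,956.489990"

# Default: the 'null' entries turn the columns into character
# (or factor in R versions before 4.0)
d.chr <- read.csv(text = csv.text)
class(d.chr$Open)

# Declaring 'null' as the missing-value code gives numeric columns with NA
d.num <- read.csv(text = csv.text, na.strings = "null")
class(d.num$Open)
is.na(d.num$Open)
```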

2.3 Index returns


Calculate the daily log-returns in your AEX and DAX data sets. Merge the datasets horizontally by day. It is
difficult to graph the densities of both markets, since they are in two different columns. Now, create a dataset
with the following variables: date, market, daily log-return. Use the command pivot_longer to do so, and
make a graph of the density of the log-returns in a faceted graph (one cell for the AEX, one for the DAX).
(The inverse operation of pivot_longer is pivot_wider.)
library(readxl)
suppressMessages(library(tidyverse))
library(lubridate)

##
## Attaching package: 'lubridate'
## The following objects are masked from 'package:base':
##
## date, intersect, setdiff, union
aex <- read_excel("aex.xlsx")
glimpse(aex)

## Rows: 7,831
## Columns: 7
## $ Date <chr> "1992-10-12", "1992-10-13", "1992-10-14", "1992-10-15", "1~
## $ Open <chr> "126.382904", "126.773155", "126.056175", "126.097015", "1~
## $ High <chr> "127.181557", "127.004585", "126.555336", "126.568954", "1~
## $ Low <chr> "126.056175", "126.491806", "126.001724", "125.688614", "1~
## $ Close <chr> "126.945595", "126.850296", "126.487274", "125.824753", "1~
## $ `Adj Close` <chr> "126.945595", "126.850296", "126.487274", "125.824753", "1~
## $ Volume <chr> "0", "0", "0", "0", "0", "0", "0", "0", "0", "0", "0", "0"~
dax <- read.csv("dax.csv",as.is=TRUE)
glimpse(dax)

## Rows: 9,041
## Columns: 7
## $ Date <chr> "1987-12-30", "1987-12-31", "1988-01-01", "1988-01-04", "198~
## $ Open <chr> "1005.190002", "null", "null", "956.489990", "996.099976", "~
## $ High <chr> "1005.190002", "null", "null", "956.489990", "996.099976", "~
## $ Low <chr> "1005.190002", "null", "null", "956.489990", "996.099976", "~
## $ Close <chr> "1005.190002", "null", "null", "956.489990", "996.099976", "~
## $ Adj.Close <chr> "1005.190002", "null", "null", "956.489990", "996.099976", "~
## $ Volume <chr> "0", "null", "null", "0", "0", "0", "0", "0", "0", "0", "0",~
names(aex) <- tolower(names(aex))
head(aex)

## # A tibble: 6 x 7
## date open high low close `adj close` volume
## <chr> <chr> <chr> <chr> <chr> <chr> <chr>
## 1 1992-10-12 126.382904 127.181557 126.056175 126.945595 126.945595 0
## 2 1992-10-13 126.773155 127.004585 126.491806 126.850296 126.850296 0
## 3 1992-10-14 126.056175 126.555336 126.001724 126.487274 126.487274 0
## 4 1992-10-15 126.097015 126.568954 125.688614 125.824753 125.824753 0
## 5 1992-10-16 126.264915 126.478195 124.858192 125.230293 125.230293 0
## 6 1992-10-19 124.967102 125.184914 124.141220 124.286430 124.286430 0
aex <- mutate(aex,date=ymd(date),open=as.numeric(open),high=as.numeric(high),
low=as.numeric(low),close=as.numeric(close),adj.close=as.numeric(`adj close`),
volume=as.numeric(volume)) %>% select(-`adj close`) %>%
mutate(d.return=(log(adj.close)-log(lag(adj.close))))

## Warning in mask$eval_all_mutate(quo): NAs introduced by coercion

## Warning in mask$eval_all_mutate(quo): NAs introduced by coercion

## Warning in mask$eval_all_mutate(quo): NAs introduced by coercion

## Warning in mask$eval_all_mutate(quo): NAs introduced by coercion

## Warning in mask$eval_all_mutate(quo): NAs introduced by coercion

## Warning in mask$eval_all_mutate(quo): NAs introduced by coercion


names(dax) <- tolower(names(dax))
head(dax)

## date open high low close adj.close volume

## 1 1987-12-30 1005.190002 1005.190002 1005.190002 1005.190002 1005.190002 0
## 2 1987-12-31 null null null null null null
## 3 1988-01-01 null null null null null null
## 4 1988-01-04 956.489990 956.489990 956.489990 956.489990 956.489990 0
## 5 1988-01-05 996.099976 996.099976 996.099976 996.099976 996.099976 0
## 6 1988-01-06 1006.010010 1006.010010 1006.010010 1006.010010 1006.010010 0
dax <- mutate(dax,date=ymd(date),open=as.numeric(open),high=as.numeric(high),
low=as.numeric(low),close=as.numeric(close),adj.close=as.numeric(adj.close),
volume=as.numeric(volume)) %>%
mutate(d.return=(log(adj.close)-log(lag(adj.close))))

## Warning in mask$eval_all_mutate(quo): NAs introduced by coercion

## Warning in mask$eval_all_mutate(quo): NAs introduced by coercion

## Warning in mask$eval_all_mutate(quo): NAs introduced by coercion

## Warning in mask$eval_all_mutate(quo): NAs introduced by coercion

## Warning in mask$eval_all_mutate(quo): NAs introduced by coercion

## Warning in mask$eval_all_mutate(quo): NAs introduced by coercion


names(aex)[2:8] <- paste(names(aex)[2:8],"aex",sep=".")
names(dax)[2:8] <- paste(names(dax)[2:8],"dax",sep=".")
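The log-return calculation can be checked by hand on a few prices. The sketch below uses base R only and the first four AEX adjusted closing prices from the head(aex) output above; the names r.diff, prev and r.lag are chosen here.

```r
# First AEX adjusted closing prices, copied from the head(aex) output
adj.close <- c(126.945595, 126.850296, 126.487274, 125.824753)

# Log-return: log(P_t) - log(P_{t-1}); the first day has no predecessor
r.diff <- c(NA, diff(log(adj.close)))

# Same calculation with an explicit one-day lag
prev <- c(NA, head(adj.close, -1))
r.lag <- log(adj.close / prev)

all.equal(r.diff, r.lag)
round(r.diff, 6)
```

Because the first observation has no previous price, its return is NA, which is one source of the missing values in the summaries.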

First, we look at the summaries, and remove apparent errors.


summary(aex$d.return.aex)

## Min. 1st Qu. Median Mean 3rd Qu. Max. NA's


## -0.11376 -0.00563 0.00069 0.00021 0.00664 0.10028 162
summary(dax$d.return.dax)

## Min. 1st Qu. Median Mean 3rd Qu. Max. NA's


## -0.14091 -0.00638 0.00076 0.00027 0.00741 0.10797 289
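It is not possible for the AEX to have dropped 78% in one day; such an observation would be an error and should be removed. In the summaries above, the minimum daily log-returns are about −0.114 (AEX) and −0.141 (DAX), so no observation is removed here.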
stock.returns1 <- full_join(aex,dax)

## Joining, by = "date"
stock.returns2 <- left_join(aex,dax)

## Joining, by = "date"
stock.returns3 <- right_join(aex,dax)

## Joining, by = "date"
glimpse(filter(stock.returns1,date=="1988-01-25"))

## Rows: 1
## Columns: 15
## $ date <date> 1988-01-25
## $ open.aex <dbl> NA
## $ high.aex <dbl> NA

## $ low.aex <dbl> NA
## $ close.aex <dbl> NA
## $ volume.aex <dbl> NA
## $ adj.close.aex <dbl> NA
## $ d.return.aex <dbl> NA
## $ open.dax <dbl> 962.63
## $ high.dax <dbl> 962.63
## $ low.dax <dbl> 962.63
## $ close.dax <dbl> 962.63
## $ adj.close.dax <dbl> 962.63
## $ volume.dax <dbl> 0
## $ d.return.dax <dbl> -0.003991457
glimpse(filter(stock.returns2,date=="1988-01-25"))

## Rows: 0
## Columns: 15
## $ date <date>
## $ open.aex <dbl>
## $ high.aex <dbl>
## $ low.aex <dbl>
## $ close.aex <dbl>
## $ volume.aex <dbl>
## $ adj.close.aex <dbl>
## $ d.return.aex <dbl>
## $ open.dax <dbl>
## $ high.dax <dbl>
## $ low.dax <dbl>
## $ close.dax <dbl>
## $ adj.close.dax <dbl>
## $ volume.dax <dbl>
## $ d.return.dax <dbl>
glimpse(filter(stock.returns3,date=="1988-01-25"))

## Rows: 1
## Columns: 15
## $ date <date> 1988-01-25
## $ open.aex <dbl> NA
## $ high.aex <dbl> NA
## $ low.aex <dbl> NA
## $ close.aex <dbl> NA
## $ volume.aex <dbl> NA
## $ adj.close.aex <dbl> NA
## $ d.return.aex <dbl> NA
## $ open.dax <dbl> 962.63
## $ high.dax <dbl> 962.63
## $ low.dax <dbl> 962.63
## $ close.dax <dbl> 962.63
## $ adj.close.dax <dbl> 962.63
## $ volume.dax <dbl> 0
## $ d.return.dax <dbl> -0.003991457
dim(merge(aex,dax))

## [1] 7792 15

dim(merge(aex,dax,all.x=TRUE))

## [1] 7831 15
dim(merge(aex,dax,all.y=TRUE))

## [1] 9041 15
dim(merge(aex,dax,all=TRUE))

## [1] 9080 15
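The four row counts are linked: the number of rows in the full merge equals the rows of the left merge plus the rows of the right merge minus the rows of the inner merge. The sketch below shows this on two made-up data frames (a and b, with hypothetical columns ret.a and ret.b) and then checks the dimensions reported above.

```r
# Two small made-up data frames with partly overlapping dates
a <- data.frame(date = as.Date("2020-01-01") + 0:3, ret.a = 1:4)
b <- data.frame(date = as.Date("2020-01-01") + 2:6, ret.b = 1:5)

n.inner <- nrow(merge(a, b))                 # dates present in both
n.left  <- nrow(merge(a, b, all.x = TRUE))   # all dates of a
n.right <- nrow(merge(a, b, all.y = TRUE))   # all dates of b
n.full  <- nrow(merge(a, b, all = TRUE))     # dates present in either

c(inner = n.inner, left = n.left, right = n.right, full = n.full)

# Inclusion-exclusion for the toy data and for the AEX/DAX dimensions
n.full == n.left + n.right - n.inner
7831 + 9041 - 7792
```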
stock.returns <- stock.returns1

save(aex,dax,stock.returns,file="stock data.Rda")

mean(is.element(aex$date,stock.returns$date))

## [1] 1
mean(is.element(dax$date,stock.returns$date))

## [1] 1
stocks.long <- select(stock.returns,date,aex=d.return.aex,dax=d.return.dax) %>%
pivot_longer(!date,names_to="market",values_to="return")
head(stocks.long)

## # A tibble: 6 x 3
## date market return
## <date> <chr> <dbl>
## 1 1992-10-12 aex NA
## 2 1992-10-12 dax 0.00272
## 3 1992-10-13 aex -0.000751
## 4 1992-10-13 dax 0.0206
## 5 1992-10-14 aex -0.00287
## 6 1992-10-14 dax -0.0121
ggplot(stocks.long) + geom_density(aes(x=return,color=market))

## Warning: Removed 1739 rows containing non-finite values (stat_density).

[Figure: density of return, one colored curve per market (aex, dax); the x-axis runs from −0.15 to 0.10.]
ggplot(stocks.long) + geom_density(aes(x=return)) +
facet_wrap(~market,nrow=1)

## Warning: Removed 1739 rows containing non-finite values (stat_density).

[Figure: density of return in two panels, one per market (aex, dax); the x-axis runs from −0.15 to 0.10.]
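The relation between pivot_longer and pivot_wider can be seen on a small example. The sketch below assumes the tidyr package (part of the tidyverse) and uses the returns of two days as printed in head(stocks.long); the names wide, long and back are chosen here.

```r
library(tidyr)

# Two days of returns in wide format: one column per market
wide <- data.frame(
  date = as.Date(c("1992-10-13", "1992-10-14")),
  aex  = c(-0.000751, -0.00287),
  dax  = c(0.0206, -0.0121)
)

# Wide to long: one row per date and market
long <- pivot_longer(wide, !date, names_to = "market", values_to = "return")
long

# Long back to wide: one column per market again
back <- pivot_wider(long, names_from = "market", values_from = "return")
back
```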

2.4 Monthly correlations


For each month in your returns dataset, calculate the correlation between the daily return on the AEX and
the DAX. Make a graph that shows this monthly correlation over time.
monthly.correlations <- mutate(stock.returns,month=format(date,format="%Y-%m")) %>%
group_by(month) %>%
summarise(correlation=cor(d.return.aex,d.return.dax,use="pairwise.complete.obs"))
ggplot(monthly.correlations,aes(x=month,y=correlation))+geom_point()+ylim(0,1)

## Warning: Removed 58 rows containing missing values (geom_point).

[Figure: correlation (0.00 to 1.00) plotted against month. Because month is a character variable, the x-axis carries one label per month from 1987-12 to 2023-02, which makes it unreadable.]
monthly.correlations <- mutate(monthly.correlations,month=ymd(month,truncated=2))
ggplot(monthly.correlations,aes(x=month,y=correlation))+geom_point()+ylim(0,1)

## Warning: Removed 58 rows containing missing values (geom_point).

[Figure: correlation (0.00 to 1.00) plotted against month, now on a date axis labelled 1990, 2000, 2010 and 2020.]
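The difference between the two graphs comes from the type of the month variable: character labels are treated as categories, while dates are placed on a continuous axis. The solution converts with lubridate's ymd(month, truncated = 2); the sketch below shows an equivalent conversion in base R, with three month labels taken from the axis of the first graph. The names month.chr and month.date are chosen here.

```r
month.chr <- c("1987-12", "1988-01", "2023-02")

# Character labels: every distinct value would get its own axis tick
class(month.chr)

# Appending a day turns the labels into proper dates (first day of the month)
month.date <- as.Date(paste0(month.chr, "-01"))
class(month.date)

# Dates carry distances, so the axis can be scaled and labelled sparsely
diff(month.date)
```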

3 Reading and processing data: Tennis data


On http://www.tennis-data.co.uk/alldata.php you find compressed files with tennis results. There is also
a file Notes.txt that has a short explanation of the data that are provided. Download all results for males
(2001-2017) and females (2007-2017). Read the xls files into R. Make sure that the variables in your datasets
with sex-year results are of the appropriate type. You will have to create a new variable, sex, that takes the values
male or female. Now merge all 20 + 14 = 34 datasets into one dataset. You can save this dataset using the
save command.
suppressMessages(library(tidyverse))
#library(tidyverse)
library(lubridate)
library(readxl)
m.files <- list.files("raw data/m",full.names=TRUE)
f1 <- read_excel(m.files[1])
class(f1)

## [1] "tbl_df" "tbl" "data.frame"


f1

## # A tibble: 2,963 x 34
## ATP Location Tournament Date Series Court Surface Round
## <dbl> <chr> <chr> <dttm> <chr> <chr> <chr> <chr>
## 1 1 Adelaide AAPT Champi~ 2001-01-01 00:00:00 Interna~ Outdo~ Hard 1st ~
## 2 1 Adelaide AAPT Champi~ 2001-01-01 00:00:00 Interna~ Outdo~ Hard 1st ~
## 3 1 Adelaide AAPT Champi~ 2001-01-01 00:00:00 Interna~ Outdo~ Hard 1st ~
## 4 1 Adelaide AAPT Champi~ 2001-01-01 00:00:00 Interna~ Outdo~ Hard 1st ~

## 5 1 Adelaide AAPT Champi~ 2001-01-01 00:00:00 Interna~ Outdo~ Hard 1st ~
## 6 1 Adelaide AAPT Champi~ 2001-01-01 00:00:00 Interna~ Outdo~ Hard 1st ~
## 7 1 Adelaide AAPT Champi~ 2001-01-01 00:00:00 Interna~ Outdo~ Hard 1st ~
## 8 1 Adelaide AAPT Champi~ 2001-01-01 00:00:00 Interna~ Outdo~ Hard 1st ~
## 9 1 Adelaide AAPT Champi~ 2001-01-01 00:00:00 Interna~ Outdo~ Hard 1st ~
## 10 1 Adelaide AAPT Champi~ 2001-01-01 00:00:00 Interna~ Outdo~ Hard 1st ~
## # ... with 2,953 more rows, and 26 more variables: Best of <dbl>, Winner <chr>,
## # Loser <chr>, WRank <dbl>, LRank <chr>, W1 <dbl>, L1 <dbl>, W2 <dbl>,
## # L2 <dbl>, W3 <dbl>, L3 <dbl>, W4 <dbl>, L4 <dbl>, W5 <dbl>, L5 <dbl>,
## # Wsets <dbl>, Lsets <dbl>, Comment <chr>, CBW <dbl>, CBL <dbl>, GBW <dbl>,
## # GBL <dbl>, IWW <dbl>, IWL <dbl>, SBW <dbl>, SBL <dbl>
The variable LRank should be <dbl>, so we write a function that forces WRank and LRank to be numeric. The
other variables that are forced to be numeric have incorrect types in some of the other Excel files.
force.type <- function(d){
nm <- names(d)
if (is.element("WRank",nm)) d$WRank <- as.numeric(d$WRank)
if (is.element("LRank",nm)) d$LRank <- as.numeric(d$LRank)
if (is.element("B365W",nm)) d$B365W <- as.numeric(d$B365W)
if (is.element("B365L",nm)) d$B365L <- as.numeric(d$B365L)
if (is.element("WPts",nm)) d$WPts <- as.numeric(d$WPts)
if (is.element("LPts",nm)) d$LPts <- as.numeric(d$LPts)
if (is.element("LBW",nm)) d$LBW <- as.numeric(d$LBW)
if (is.element("LBL",nm)) d$LBL <- as.numeric(d$LBL)
d
}
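The same idea can be written more compactly by looping over the columns that are actually present. The sketch below is a hypothetical variant (force.type2 and the toy data are not part of the original solution) that behaves like force.type for the eight columns listed above.

```r
force.type2 <- function(d, vars = c("WRank", "LRank", "B365W", "B365L",
                                    "WPts", "LPts", "LBW", "LBL")) {
  # Only touch the columns that are actually present in this file
  for (v in intersect(vars, names(d))) {
    d[[v]] <- as.numeric(d[[v]])
  }
  d
}

# Toy data: LRank was read as character because of an 'N/A' entry
toy <- data.frame(WRank = c(5, 12), LRank = c("40", "N/A"),
                  Winner = c("A", "B"), stringsAsFactors = FALSE)

# 'N/A' cannot be converted and becomes NA; R reports this with a warning,
# which is suppressed here
toy2 <- suppressWarnings(force.type2(toy))
str(toy2)
```

Keeping the column names in an argument makes it easy to extend the list when another file turns out to have a wrongly typed column.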

First we read all the files into tibbles and stack them into one tibble with bind_rows. Before stacking, we force the
types if necessary (and if the variable is present).
m.tennis <- force.type(f1)

## Warning in force.type(f1): NAs introduced by coercion


for (i in 2:length(m.files)){
f <- read_excel(m.files[i])
select(f,Date,B365W,B365L) %>% glimpse()
f <- force.type(f)
m.tennis <- bind_rows(m.tennis,f)
}

## Rows: 2,854
## Columns: 3
## $ Date <dttm> 2001-12-31, 2001-12-31, 2001-12-31, 2001-12-31, 2001-12-31, 200~
## $ B365W <lgl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, ~
## $ B365L <lgl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, ~
## Warning in force.type(f): NAs introduced by coercion
## Rows: 2,861
## Columns: 3
## $ Date <dttm> 2002-12-30, 2002-12-30, 2002-12-30, 2002-12-30, 2002-12-30, 200~
## $ B365W <dbl> NA, NA, 1.364, NA, NA, NA, 1.667, 1.400, 1.667, 1.286, 1.800, NA~
## $ B365L <dbl> NA, NA, 2.875, NA, NA, NA, 2.100, 2.750, 2.100, 3.250, 1.909, NA~
## Warning in force.type(f): NAs introduced by coercion
## Rows: 2,877

## Columns: 3
## $ Date <dttm> 2004-01-05, 2004-01-05, 2004-01-05, 2004-01-05, 2004-01-05, 200~
## $ B365W <dbl> NA, 1.160, 2.000, 1.830, 1.400, NA, 1.800, 1.800, NA, 1.533, 1.4~
## $ B365L <dbl> NA, 4.500, 1.720, 1.830, 2.750, NA, 1.909, 1.900, NA, 2.375, 2.6~
## Warning in force.type(f): NAs introduced by coercion
## Warning in force.type(f): NAs introduced by coercion
## Warning in read_fun(path = enc2native(normalizePath(path)), sheet_i = sheet, :
## Expecting logical in O1724 / R1724C15: got 'N/A'
## Warning in read_fun(path = enc2native(normalizePath(path)), sheet_i = sheet, :
## Expecting logical in O1906 / R1906C15: got 'N/A'
## Warning in read_fun(path = enc2native(normalizePath(path)), sheet_i = sheet, :
## Expecting logical in O2015 / R2015C15: got 'N/A'
## Rows: 2,909
## Columns: 3
## $ Date <dttm> 2005-01-03, 2005-01-03, 2005-01-03, 2005-01-03, 2005-01-03, 200~
## $ B365W <dbl> 1.286, 1.833, 1.800, 1.667, 1.615, 1.333, 1.500, 1.533, 1.400, 1~
## $ B365L <dbl> 3.250, 1.833, 1.909, 2.100, 2.200, 3.000, 2.500, 2.375, 2.750, 2~
## Warning in force.type(f): NAs introduced by coercion
## Warning in read_fun(path = enc2native(normalizePath(path)), sheet_i = sheet, :
## Expecting numeric in AA1402 / R1402C27: got '`1'
## Rows: 2,909
## Columns: 3
## $ Date <dttm> 2006-01-02, 2006-01-02, 2006-01-02, 2006-01-02, 2006-01-02, 200~
## $ B365W <dbl> 1.39, 1.53, 1.28, 1.53, 1.44, 1.50, 1.40, 1.83, 1.90, 1.12, 1.66~
## $ B365L <dbl> 2.75, 2.37, 3.25, 2.37, 2.62, 2.50, 2.75, 1.83, 1.80, 5.50, 2.10~
## Warning in force.type(f): NAs introduced by coercion
## Warning in force.type(f): NAs introduced by coercion

## Warning in force.type(f): NAs introduced by coercion

## Warning in force.type(f): NAs introduced by coercion


## Warning in read_fun(path = enc2native(normalizePath(path)), sheet_i = sheet, :
## Expecting numeric in M1384 / R1384C13: got 'N/A'
## Warning in read_fun(path = enc2native(normalizePath(path)), sheet_i = sheet, :
## Expecting numeric in O1384 / R1384C15: got 'N/A'
## Warning in read_fun(path = enc2native(normalizePath(path)), sheet_i = sheet, :
## Expecting numeric in L1475 / R1475C12: got 'N/A'
## Warning in read_fun(path = enc2native(normalizePath(path)), sheet_i = sheet, :
## Expecting numeric in N1475 / R1475C14: got 'N/A'
## Warning in read_fun(path = enc2native(normalizePath(path)), sheet_i = sheet, :
## Expecting numeric in M1485 / R1485C13: got 'N/A'
## Warning in read_fun(path = enc2native(normalizePath(path)), sheet_i = sheet, :
## Expecting numeric in O1485 / R1485C15: got 'N/A'
## Warning in read_fun(path = enc2native(normalizePath(path)), sheet_i = sheet, :
## Expecting numeric in M2384 / R2384C13: got 'N/A'

## Warning in read_fun(path = enc2native(normalizePath(path)), sheet_i = sheet, :
## Expecting numeric in O2384 / R2384C15: got 'N/A'
## Warning in read_fun(path = enc2native(normalizePath(path)), sheet_i = sheet, :
## Expecting numeric in M2386 / R2386C13: got 'N/A'
## Warning in read_fun(path = enc2native(normalizePath(path)), sheet_i = sheet, :
## Expecting numeric in O2386 / R2386C15: got 'N/A'
## Warning in read_fun(path = enc2native(normalizePath(path)), sheet_i = sheet, :
## Expecting numeric in M2698 / R2698C13: got 'N/A'
## Warning in read_fun(path = enc2native(normalizePath(path)), sheet_i = sheet, :
## Expecting numeric in O2698 / R2698C15: got 'N/A'
## Rows: 2,806
## Columns: 3
## $ Date <dttm> 2007-01-01, 2007-01-01, 2007-01-01, 2007-01-01, 2007-01-02, 200~
## $ B365W <dbl> 2.87, 3.00, 2.00, 2.37, 1.50, 1.08, 2.87, 1.19, 2.62, 1.14, 1.83~
## $ B365L <dbl> 1.36, 1.33, 1.72, 1.53, 2.50, 6.50, 1.36, 4.00, 1.44, 5.00, 1.83~
## Rows: 2,707
## Columns: 3
## $ Date <dttm> 2007-12-31, 2007-12-31, 2007-12-31, 2007-12-31, 2007-12-31, 200~
## $ B365W <dbl> 1.53, 6.50, 3.75, 1.40, 1.72, 2.50, 1.44, NA, 1.57, 1.16, 1.28, ~
## $ B365L <dbl> 2.37, 1.10, 1.25, 2.75, 2.00, 1.50, 2.62, NA, 2.25, 4.50, 3.50, ~
## Warning in force.type(f): NAs introduced by coercion
## Warning in force.type(f): NAs introduced by coercion

## Warning in force.type(f): NAs introduced by coercion

## Warning in force.type(f): NAs introduced by coercion


## Rows: 2,731
## Columns: 3
## $ Date <dttm> 2009-01-04, 2009-01-04, 2009-01-04, 2009-01-04, 2009-01-05, 200~
## $ B365W <dbl> 1.25, 2.75, 2.10, 1.44, 2.00, 1.36, 2.20, 1.06, 1.03, 2.75, 1.28~
## $ B365L <dbl> 3.75, 1.40, 1.66, 2.62, 1.72, 3.00, 1.61, 8.00, 11.00, 1.40, 3.5~
## Warning in force.type(f): NAs introduced by coercion

## Warning in force.type(f): NAs introduced by coercion


## Rows: 2,679
## Columns: 3
## $ Date <dttm> 2010-01-04, 2010-01-04, 2010-01-04, 2010-01-04, 2010-01-04, 201~
## $ B365W <dbl> 1.44, 2.25, 1.61, 2.62, 3.00, 1.61, 1.04, 1.08, 1.72, 1.12, 1.36~
## $ B365L <dbl> 2.62, 1.57, 2.20, 1.44, 1.36, 2.20, 10.00, 7.00, 2.00, 5.50, 3.0~
## Warning in force.type(f): NAs introduced by coercion

## Warning in force.type(f): NAs introduced by coercion

## Warning in force.type(f): NAs introduced by coercion

## Warning in force.type(f): NAs introduced by coercion


## Warning in read_fun(path = enc2native(normalizePath(path)), sheet_i = sheet, :
## Expecting numeric in M1232 / R1232C13: got 'N/A'

## Warning in read_fun(path = enc2native(normalizePath(path)), sheet_i = sheet, :
## Expecting numeric in O1232 / R1232C15: got 'N/A'
## Rows: 2,675
## Columns: 3
## $ Date <dttm> 2011-01-02, 2011-01-02, 2011-01-03, 2011-01-03, 2011-01-03, 201~
## $ B365W <dbl> 1.53, 1.90, 1.83, 1.72, 1.36, 3.50, 1.53, 1.06, 1.36, 3.00, NA, ~
## $ B365L <dbl> 2.37, 1.80, 1.83, 2.00, 3.00, 1.28, 2.37, 8.00, 3.00, 1.36, NA, ~
## Rows: 2,607
## Columns: 3
## $ Date <dttm> 2012-01-01, 2012-01-01, 2012-01-02, 2012-01-02, 2012-01-02, 201~
## $ B365W <dbl> 4.33, 1.25, 3.25, 1.66, 1.40, 1.20, 1.36, 1.22, 1.36, 1.44, 1.50~
## $ B365L <dbl> 1.20, 3.75, 1.33, 2.10, 2.75, 4.33, 3.00, 4.00, 3.00, 2.62, 2.50~
## Warning in force.type(f): NAs introduced by coercion
## Warning in force.type(f): NAs introduced by coercion

## Warning in force.type(f): NAs introduced by coercion


## Warning in read_fun(path = enc2native(normalizePath(path)), sheet_i = sheet, :
## Expecting numeric in AE1755 / R1755C31: got '2.,3'
## Warning in read_fun(path = enc2native(normalizePath(path)), sheet_i = sheet, :
## Expecting numeric in L2226 / R2226C12: got 'N/A'
## Warning in read_fun(path = enc2native(normalizePath(path)), sheet_i = sheet, :
## Expecting numeric in N2226 / R2226C14: got 'N/A'
## Rows: 2,631
## Columns: 3
## $ Date <dttm> 2012-12-31, 2012-12-31, 2012-12-31, 2012-12-31, 2013-01-01, 201~
## $ B365W <dbl> 1.36, 1.61, 1.25, 1.07, 1.90, 1.61, 2.20, 1.44, 3.00, 1.36, 1.57~
## $ B365L <dbl> 3.00, 2.20, 3.75, 9.00, 1.80, 2.20, 1.61, 2.62, 1.36, 3.00, 2.25~
## Warning in force.type(f): NAs introduced by coercion
## Warning in force.type(f): NAs introduced by coercion
## Rows: 2,600
## Columns: 3
## $ Date <dttm> 2013-12-30, 2013-12-30, 2013-12-30, 2013-12-30, 2013-12-30, 201~
## $ B365W <dbl> 1.72, 1.28, 1.36, 1.90, 1.25, 2.25, 1.25, 1.66, 1.44, 1.57, 1.28~
## $ B365L <dbl> 2.00, 3.50, 3.00, 1.80, 3.75, 1.57, 3.75, 2.10, 2.62, 2.25, 3.50~
## Warning in force.type(f): NAs introduced by coercion

## Warning in force.type(f): NAs introduced by coercion


## Rows: 2,630
## Columns: 3
## $ Date <dttm> 2015-01-05, 2015-01-05, 2015-01-05, 2015-01-05, 2015-01-06, 201~
## $ B365W <dbl> 4.50, 2.62, 1.28, 1.57, 1.53, 3.50, 2.00, 1.66, 1.66, 1.40, 1.25~
## $ B365L <dbl> 1.18, 1.44, 3.50, 2.25, 2.37, 1.28, 1.72, 2.10, 2.10, 2.75, 3.75~
## Warning in force.type(f): NAs introduced by coercion

## Warning in force.type(f): NAs introduced by coercion

## Warning in force.type(f): NAs introduced by coercion

## Warning in force.type(f): NAs introduced by coercion
## Warning in read_fun(path = enc2native(normalizePath(path)), sheet_i = sheet, :
## Expecting numeric in L1687 / R1687C12: got 'N/A'
## Warning in read_fun(path = enc2native(normalizePath(path)), sheet_i = sheet, :
## Expecting numeric in N1687 / R1687C14: got 'N/A'
## Rows: 2,626
## Columns: 3
## $ Date <dttm> 2016-01-04, 2016-01-04, 2016-01-04, 2016-01-04, 2016-01-05, 201~
## $ B365W <dbl> 1.66, 1.53, 1.72, 1.83, 1.28, 2.00, 1.33, 2.00, 1.06, 2.25, 2.00~
## $ B365L <dbl> 2.10, 2.37, 2.00, 1.83, 3.50, 1.72, 3.25, 1.72, 10.00, 1.57, 1.7~
## Warning in force.type(f): NAs introduced by coercion
## Warning in force.type(f): NAs introduced by coercion
## Rows: 2,633
## Columns: 3
## $ Date <dttm> 2017-01-01, 2017-01-02, 2017-01-02, 2017-01-02, 2017-01-02, 201~
## $ B365W <dbl> 1.28, 1.50, 1.90, 1.36, 1.40, 2.62, 2.25, 1.80, 1.50, 2.20, 1.44~
## $ B365L <dbl> 3.50, 2.50, 1.80, 3.00, 2.75, 1.44, 1.57, 1.90, 2.50, 1.61, 2.62~
## Warning in force.type(f): NAs introduced by coercion

## Warning in force.type(f): NAs introduced by coercion


## Rows: 2,637
## Columns: 3
## $ Date <dttm> 2017-12-31, 2017-12-31, 2018-01-01, 2018-01-01, 2018-01-01, 201~
## $ B365W <dbl> 2.20, 2.75, 1.61, 2.50, 1.40, 2.20, 1.83, 1.66, 1.53, 1.72, 2.62~
## $ B365L <dbl> 1.61, 1.40, 2.20, 1.50, 2.75, 1.61, 1.83, 2.10, 2.37, 2.00, 1.44~
## Warning in force.type(f): NAs introduced by coercion

## Warning in force.type(f): NAs introduced by coercion


## Rows: 2,610
## Columns: 3
## $ Date <dttm> 2018-12-31, 2018-12-31, 2018-12-31, 2018-12-31, 2018-12-31, 201~
## $ B365W <dbl> 1.36, 1.18, 1.57, 1.40, 2.62, 2.62, 2.10, 1.28, 1.40, 2.25, 1.20~
## $ B365L <dbl> 3.00, 4.50, 2.25, 2.75, 1.44, 1.44, 1.66, 3.50, 2.75, 1.57, 4.33~
## Warning in force.type(f): NAs introduced by coercion

## Warning in force.type(f): NAs introduced by coercion

## Warning in force.type(f): NAs introduced by coercion

## Warning in force.type(f): NAs introduced by coercion


## Rows: 1,267
## Columns: 3
## $ Date <dttm> 2020-01-06, 2020-01-06, 2020-01-06, 2020-01-06, 2020-01-06, 202~
## $ B365W <dbl> 2.00, 1.57, 1.25, 1.83, 1.50, 1.66, 1.83, 3.00, 2.00, 3.50, 1.28~
## $ B365L <dbl> 1.72, 2.25, 3.75, 1.83, 2.50, 2.10, 1.83, 1.36, 1.72, 1.28, 3.50~
## Warning in force.type(f): NAs introduced by coercion

## Warning in force.type(f): NAs introduced by coercion

## Warning in force.type(f): NAs introduced by coercion

## Warning in force.type(f): NAs introduced by coercion


## Rows: 58
## Columns: 3
## $ Date <dttm> 2021-01-07, 2021-01-07, 2021-01-07, 2021-01-07, 2021-01-07, 202~
## $ B365W <dbl> 1.50, 2.50, 1.50, 1.61, 1.40, 2.62, 1.22, 5.00, 1.20, 1.80, 1.03~
## $ B365L <dbl> 2.50, 1.50, 2.50, 2.20, 2.75, 1.44, 4.00, 1.16, 4.33, 1.90, 15.0~
m.tennis$Sex <- "male"

We do the same for the files with female tennis results, and combine the result into one dataset.
f.files <- list.files("raw data/f",full.names=TRUE)
f1 <- read_excel(f.files[1])

## Warning in read_fun(path = enc2native(normalizePath(path)), sheet_i = sheet, :
## Expecting numeric in Z2159 / R2159C26: got '5..5'
select(f1,Date,B365W,B365L) %>% glimpse()

## Rows: 2,491
## Columns: 3
## $ Date <dttm> 2007-01-01, 2007-01-01, 2007-01-01, 2007-01-01, 2007-01-01, 200~
## $ B365W <dbl> 1.33, 3.75, 1.72, 1.83, 1.16, 2.50, 1.36, 2.50, 1.83, 1.53, 1.72~
## $ B365L <dbl> 3.00, 1.22, 2.00, 1.83, 4.50, 1.50, 2.87, 1.50, 1.83, 2.37, 2.00~
class(f1)

## [1] "tbl_df" "tbl" "data.frame"


f1

## # A tibble: 2,491 x 34
## WTA Location Tournament Date Tier Court Surface Round
## <dbl> <chr> <chr> <dttm> <chr> <chr> <chr> <chr>
## 1 1 Auckland ASB Classic 2007-01-01 00:00:00 Tier 4 Outdoor Hard 1st Ro~
## 2 1 Auckland ASB Classic 2007-01-01 00:00:00 Tier 4 Outdoor Hard 1st Ro~
## 3 1 Auckland ASB Classic 2007-01-01 00:00:00 Tier 4 Outdoor Hard 1st Ro~
## 4 1 Auckland ASB Classic 2007-01-01 00:00:00 Tier 4 Outdoor Hard 1st Ro~
## 5 1 Auckland ASB Classic 2007-01-01 00:00:00 Tier 4 Outdoor Hard 1st Ro~
## 6 1 Auckland ASB Classic 2007-01-01 00:00:00 Tier 4 Outdoor Hard 1st Ro~
## 7 1 Auckland ASB Classic 2007-01-01 00:00:00 Tier 4 Outdoor Hard 1st Ro~
## 8 1 Auckland ASB Classic 2007-01-01 00:00:00 Tier 4 Outdoor Hard 1st Ro~
## 9 1 Auckland ASB Classic 2007-01-01 00:00:00 Tier 4 Outdoor Hard 1st Ro~
## 10 1 Auckland ASB Classic 2007-01-01 00:00:00 Tier 4 Outdoor Hard 1st Ro~
## # ... with 2,481 more rows, and 26 more variables: Best of <dbl>, Winner <chr>,
## # Loser <chr>, WRank <chr>, LRank <chr>, WPts <chr>, LPts <chr>, W1 <dbl>,
## # L1 <dbl>, W2 <dbl>, L2 <dbl>, W3 <dbl>, L3 <dbl>, Wsets <dbl>, Lsets <dbl>,
## # Comment <chr>, B365W <dbl>, B365L <dbl>, CBW <dbl>, CBL <dbl>, EXW <dbl>,
## # EXL <dbl>, PSW <dbl>, PSL <dbl>, UBW <dbl>, UBL <dbl>
f.tennis <- force.type(f1)

## Warning in force.type(f1): NAs introduced by coercion

## Warning in force.type(f1): NAs introduced by coercion

## Warning in force.type(f1): NAs introduced by coercion

## Warning in force.type(f1): NAs introduced by coercion


for (i in 2:length(f.files)){
f <- read_excel(f.files[i])
select(f,Date,B365W,B365L) %>% glimpse()
f <- force.type(f)
f.tennis <- bind_rows(f.tennis,f)
}

## Rows: 2,404
## Columns: 3
## $ Date <dttm> 2007-12-30, 2007-12-31, 2007-12-31, 2007-12-31, 2007-12-31, 200~
## $ B365W <dbl> 1.25, 1.07, 1.66, 1.16, 1.83, 1.33, 2.62, 1.20, 1.36, 1.50, 1.04~
## $ B365L <dbl> 3.75, 7.50, 2.10, 4.50, 1.83, 3.25, 1.44, 4.33, 3.00, 2.50, 9.00~
## Warning in force.type(f): NAs introduced by coercion
## Warning in force.type(f): NAs introduced by coercion

## Warning in force.type(f): NAs introduced by coercion

## Warning in force.type(f): NAs introduced by coercion


## Warning in read_fun(path = enc2native(normalizePath(path)), sheet_i = sheet, :
## Expecting numeric in M1051 / R1051C13: got 'N/A'
## Warning in read_fun(path = enc2native(normalizePath(path)), sheet_i = sheet, :
## Expecting numeric in O1051 / R1051C15: got 'N/A'
## Warning in read_fun(path = enc2native(normalizePath(path)), sheet_i = sheet, :
## Expecting numeric in L1482 / R1482C12: got 'N/A'
## Warning in read_fun(path = enc2native(normalizePath(path)), sheet_i = sheet, :
## Expecting numeric in N1482 / R1482C14: got 'N/A'
## Warning in read_fun(path = enc2native(normalizePath(path)), sheet_i = sheet, :
## Expecting numeric in M1493 / R1493C13: got 'N/A'
## Warning in read_fun(path = enc2native(normalizePath(path)), sheet_i = sheet, :
## Expecting numeric in O1493 / R1493C15: got 'N/A'
## Warning in read_fun(path = enc2native(normalizePath(path)), sheet_i = sheet, :
## Expecting numeric in M1689 / R1689C13: got 'N/A'
## Warning in read_fun(path = enc2native(normalizePath(path)), sheet_i = sheet, :
## Expecting numeric in O1689 / R1689C15: got 'N/A'
## Warning in read_fun(path = enc2native(normalizePath(path)), sheet_i = sheet, :
## Expecting numeric in M1704 / R1704C13: got 'N/A'
## Warning in read_fun(path = enc2native(normalizePath(path)), sheet_i = sheet, :
## Expecting numeric in O1704 / R1704C15: got 'N/A'
## Warning in read_fun(path = enc2native(normalizePath(path)), sheet_i = sheet, :
## Expecting numeric in M1731 / R1731C13: got 'N/A'
## Warning in read_fun(path = enc2native(normalizePath(path)), sheet_i = sheet, :
## Expecting numeric in O1731 / R1731C15: got 'N/A'

## Warning in read_fun(path = enc2native(normalizePath(path)), sheet_i = sheet, :
## Expecting numeric in L1785 / R1785C12: got 'N/A'
## Warning in read_fun(path = enc2native(normalizePath(path)), sheet_i = sheet, :
## Expecting numeric in N1785 / R1785C14: got 'N/A'
## Warning in read_fun(path = enc2native(normalizePath(path)), sheet_i = sheet, :
## Expecting numeric in L1808 / R1808C12: got 'N/A'
## Warning in read_fun(path = enc2native(normalizePath(path)), sheet_i = sheet, :
## Expecting numeric in N1808 / R1808C14: got 'N/A'
## Warning in read_fun(path = enc2native(normalizePath(path)), sheet_i = sheet, :
## Expecting numeric in L1817 / R1817C12: got 'N/A'
## Warning in read_fun(path = enc2native(normalizePath(path)), sheet_i = sheet, :
## Expecting numeric in N1817 / R1817C14: got 'N/A'
## Warning in read_fun(path = enc2native(normalizePath(path)), sheet_i = sheet, :
## Expecting numeric in M1824 / R1824C13: got 'N/A'
## Warning in read_fun(path = enc2native(normalizePath(path)), sheet_i = sheet, :
## Expecting numeric in O1824 / R1824C15: got 'N/A'
## Warning in read_fun(path = enc2native(normalizePath(path)), sheet_i = sheet, :
## Expecting numeric in L1853 / R1853C12: got 'N/A'
## Warning in read_fun(path = enc2native(normalizePath(path)), sheet_i = sheet, :
## Expecting numeric in N1853 / R1853C14: got 'N/A'
## Warning in read_fun(path = enc2native(normalizePath(path)), sheet_i = sheet, :
## Expecting numeric in L1867 / R1867C12: got 'N/A'
## Warning in read_fun(path = enc2native(normalizePath(path)), sheet_i = sheet, :
## Expecting numeric in N1867 / R1867C14: got 'N/A'
## Warning in read_fun(path = enc2native(normalizePath(path)), sheet_i = sheet, :
## Expecting numeric in M1877 / R1877C13: got 'N/A'
## Warning in read_fun(path = enc2native(normalizePath(path)), sheet_i = sheet, :
## Expecting numeric in O1877 / R1877C15: got 'N/A'
## Warning in read_fun(path = enc2native(normalizePath(path)), sheet_i = sheet, :
## Expecting numeric in L1923 / R1923C12: got 'N/A'
## Warning in read_fun(path = enc2native(normalizePath(path)), sheet_i = sheet, :
## Expecting numeric in N1923 / R1923C14: got 'N/A'
## Warning in read_fun(path = enc2native(normalizePath(path)), sheet_i = sheet, :
## Expecting numeric in L1989 / R1989C12: got 'N/A'
## Warning in read_fun(path = enc2native(normalizePath(path)), sheet_i = sheet, :
## Expecting numeric in N1989 / R1989C14: got 'N/A'
## Warning in read_fun(path = enc2native(normalizePath(path)), sheet_i = sheet, :
## Expecting numeric in L2018 / R2018C12: got 'N/A'
## Warning in read_fun(path = enc2native(normalizePath(path)), sheet_i = sheet, :
## Expecting numeric in N2018 / R2018C14: got 'N/A'
## Warning in read_fun(path = enc2native(normalizePath(path)), sheet_i = sheet, :
## Expecting numeric in L2030 / R2030C12: got 'N/A'

## Warning in read_fun(path = enc2native(normalizePath(path)), sheet_i = sheet, :
## Expecting numeric in N2030 / R2030C14: got 'N/A'
## Warning in read_fun(path = enc2native(normalizePath(path)), sheet_i = sheet, :
## Expecting numeric in L2036 / R2036C12: got 'N/A'
## Warning in read_fun(path = enc2native(normalizePath(path)), sheet_i = sheet, :
## Expecting numeric in N2036 / R2036C14: got 'N/A'
## Warning in read_fun(path = enc2native(normalizePath(path)), sheet_i = sheet, :
## Expecting numeric in L2040 / R2040C12: got 'N/A'
## Warning in read_fun(path = enc2native(normalizePath(path)), sheet_i = sheet, :
## Expecting numeric in N2040 / R2040C14: got 'N/A'
## Warning in read_fun(path = enc2native(normalizePath(path)), sheet_i = sheet, :
## Expecting numeric in L2042 / R2042C12: got 'N/A'
## Warning in read_fun(path = enc2native(normalizePath(path)), sheet_i = sheet, :
## Expecting numeric in N2042 / R2042C14: got 'N/A'
## Rows: 2,433
## Columns: 3
## $ Date <dttm> 2009-01-04, 2009-01-05, 2009-01-05, 2009-01-05, 2009-01-05, 200~
## $ B365W <dbl> 1.400, 1.400, 3.000, 1.380, 1.660, 1.330, 1.660, 1.280, 1.500, 2~
## $ B365L <dbl> 2.75, 2.75, 1.36, 2.87, 2.10, 3.25, 2.10, 3.50, 2.50, 1.50, 6.50~
## Warning in force.type(f): NAs introduced by coercion
## Rows: 2,448
## Columns: 3
## $ Date <dttm> 2010-01-03, 2010-01-04, 2010-01-04, 2010-01-04, 2010-01-04, 201~
## $ B365W <dbl> 1.22, 1.10, 1.28, 2.10, 1.83, 1.25, 1.16, 1.16, 1.50, 1.36, 1.36~
## $ B365L <dbl> 4.00, 6.50, 3.50, 1.66, 1.83, 3.75, 4.50, 4.50, 2.50, 3.00, 3.00~
## Warning in force.type(f): NAs introduced by coercion
## Warning in force.type(f): NAs introduced by coercion

## Warning in force.type(f): NAs introduced by coercion

## Warning in force.type(f): NAs introduced by coercion


## Warning in read_fun(path = enc2native(normalizePath(path)), sheet_i = sheet, :
## Expecting numeric in M1651 / R1651C13: got 'N/A'
## Warning in read_fun(path = enc2native(normalizePath(path)), sheet_i = sheet, :
## Expecting numeric in O1651 / R1651C15: got 'N/A'
## Warning in read_fun(path = enc2native(normalizePath(path)), sheet_i = sheet, :
## Expecting numeric in M1654 / R1654C13: got 'N/A'
## Warning in read_fun(path = enc2native(normalizePath(path)), sheet_i = sheet, :
## Expecting numeric in O1654 / R1654C15: got 'N/A'
## Rows: 2,468
## Columns: 3
## $ Date <dttm> 2011-01-02, 2011-01-03, 2011-01-03, 2011-01-03, 2011-01-03, 201~
## $ B365W <dbl> 1.50, 1.05, 1.57, 1.04, 1.66, 2.20, 1.44, 1.44, 1.72, 1.22, 1.16~
## $ B365L <dbl> 2.50, 9.00, 2.25, 10.00, 2.10, 1.61, 2.62, 2.62, 2.00, 4.00, 4.5~
## Warning in read_fun(path = enc2native(normalizePath(path)), sheet_i = sheet, :
## Expecting numeric in L1596 / R1596C12: got 'N/A'
## Warning in read_fun(path = enc2native(normalizePath(path)), sheet_i = sheet, :
## Expecting numeric in N1596 / R1596C14: got 'N/A'
## Rows: 2,407
## Columns: 3
## $ Date <dttm> 2012-01-02, 2012-01-02, 2012-01-02, 2012-01-02, 2012-01-02, 201~
## $ B365W <dbl> 1.36, 1.33, 1.44, 1.36, 1.18, 1.14, 1.66, 1.16, 1.36, 1.36, 1.30~
## $ B365L <dbl> 3.00, 3.25, 2.62, 3.00, 4.50, 5.50, 2.10, 5.00, 3.00, 3.00, 3.40~
## Warning in force.type(f): NAs introduced by coercion
## Warning in force.type(f): NAs introduced by coercion
## Warning in read_fun(path = enc2native(normalizePath(path)), sheet_i = sheet, :
## Expecting numeric in L1166 / R1166C12: got 'N/A'
## Warning in read_fun(path = enc2native(normalizePath(path)), sheet_i = sheet, :
## Expecting numeric in N1166 / R1166C14: got 'N/A'
## Rows: 2,442
## Columns: 3
## $ Date <dttm> 2012-12-30, 2012-12-31, 2012-12-31, 2012-12-31, 2012-12-31, 201~
## $ B365W <dbl> 2.50, 1.53, 1.83, 2.50, 1.40, 1.57, 1.36, 1.53, 1.83, 1.61, 1.44~
## $ B365L <dbl> 1.50, 2.37, 1.83, 1.50, 2.75, 2.25, 3.00, 2.37, 1.83, 2.20, 2.62~
## Warning in force.type(f): NAs introduced by coercion
## Warning in force.type(f): NAs introduced by coercion
## Rows: 2,476
## Columns: 3
## $ Date <dttm> 2013-12-30, 2013-12-30, 2013-12-30, 2013-12-30, 2013-12-30, 201~
## $ B365W <dbl> 2.20, 2.50, 1.57, 1.20, 1.61, 1.40, 1.16, 2.62, 1.53, 3.50, 3.25~
## $ B365L <dbl> 1.61, 1.50, 2.25, 4.33, 2.20, 2.75, 4.50, 1.44, 2.37, 1.28, 1.33~
## Warning in force.type(f): NAs introduced by coercion

## Warning in force.type(f): NAs introduced by coercion


## Rows: 2,521
## Columns: 3
## $ Date <dttm> 2015-01-04, 2015-01-04, 2015-01-05, 2015-01-05, 2015-01-05, 201~
## $ B365W <dbl> 1.66, 1.72, 1.61, 1.36, 2.75, 1.44, 1.57, 1.72, 1.83, 1.66, 4.00~
## $ B365L <dbl> 2.10, 2.00, 2.20, 3.00, 1.40, 2.62, 2.25, 2.00, 1.83, 2.10, 1.22~
## Warning in force.type(f): NAs introduced by coercion

## Warning in force.type(f): NAs introduced by coercion


## Rows: 2,522
## Columns: 3
## $ Date <dttm> 2016-01-03, 2016-01-03, 2016-01-04, 2016-01-04, 2016-01-04, 201~
## $ B365W <dbl> 2.10, 1.61, 1.40, 1.72, 2.50, 1.30, 2.75, 2.25, 1.50, 1.72, 1.50~
## $ B365L <dbl> 1.66, 2.20, 2.75, 2.00, 1.50, 3.40, 1.40, 1.57, 2.50, 2.00, 2.50~
## Warning in force.type(f): NAs introduced by coercion

## Warning in force.type(f): NAs introduced by coercion

## Rows: 2,500
## Columns: 3
## $ Date <dttm> 2017-01-01, 2017-01-01, 2017-01-02, 2017-01-02, 2017-01-03, 201~
## $ B365W <dbl> 1.40, 1.30, 1.20, 1.10, 2.37, 1.40, 2.50, 1.66, 1.66, 1.72, 1.16~
## $ B365L <dbl> 2.75, 3.40, 4.33, 7.00, 1.53, 2.75, 1.50, 2.10, 2.10, 2.00, 5.00~
## Warning in force.type(f): NAs introduced by coercion

## Warning in force.type(f): NAs introduced by coercion

## Warning in force.type(f): NAs introduced by coercion

## Warning in force.type(f): NAs introduced by coercion


## Warning in read_fun(path = enc2native(normalizePath(path)), sheet_i = sheet, :
## Expecting numeric in L1646 / R1646C12: got 'N/A'
## Warning in read_fun(path = enc2native(normalizePath(path)), sheet_i = sheet, :
## Expecting numeric in N1646 / R1646C14: got 'N/A'
## Rows: 2,469
## Columns: 3
## $ Date <dttm> 2017-12-31, 2017-12-31, 2018-01-01, 2018-01-01, 2018-01-01, 201~
## $ B365W <dbl> 1.90, 2.62, 1.66, 1.44, 2.75, 1.50, 2.10, 2.10, 1.90, 1.30, 1.72~
## $ B365L <dbl> 1.80, 1.44, 2.10, 2.62, 1.40, 2.50, 1.66, 1.66, 1.80, 3.40, 2.00~
## Warning in force.type(f): NAs introduced by coercion
## Warning in force.type(f): NAs introduced by coercion
## Rows: 2,472
## Columns: 3
## $ Date <dttm> 2018-12-31, 2018-12-31, 2018-12-31, 2018-12-31, 2018-12-31, 201~
## $ B365W <dbl> 1.36, 1.44, 1.61, 1.50, 2.25, 2.00, NA, 1.30, 2.62, 1.50, 1.44, ~
## $ B365L <dbl> 3.00, 2.62, 2.20, 2.50, 1.57, 1.72, NA, 3.40, 1.44, 2.50, 2.62, ~
## Warning in force.type(f): NAs introduced by coercion

## Warning in force.type(f): NAs introduced by coercion

## Warning in force.type(f): NAs introduced by coercion

## Warning in force.type(f): NAs introduced by coercion


## Rows: 1,055
## Columns: 3
## $ Date <dttm> 2020-01-06, 2020-01-06, 2020-01-06, 2020-01-06, 2020-01-06, 202~
## $ B365W <dbl> 1.61, 2.37, 1.90, 3.00, 1.04, 1.72, 2.50, 1.18, 1.50, 1.25, 1.22~
## $ B365L <dbl> 2.20, 1.53, 1.80, 1.36, 13.00, 2.00, 1.50, 4.50, 2.50, 3.75, 4.0~
## Warning in force.type(f): NAs introduced by coercion

## Warning in force.type(f): NAs introduced by coercion

## Warning in force.type(f): NAs introduced by coercion

## Warning in force.type(f): NAs introduced by coercion


## Rows: 63
## Columns: 3
## $ Date <dttm> 2021-01-06, 2021-01-06, 2021-01-06, 2021-01-06, 2021-01-06, 202~
## $ B365W <dbl> 1.66, 2.25, 1.22, 1.30, 2.50, 1.57, 1.57, 2.50, 2.10, 1.25, 1.10~
## $ B365L <dbl> 2.10, 1.57, 4.00, 3.40, 1.50, 2.25, 2.25, 1.50, 1.66, 3.75, 7.00~
## Warning in force.type(f): NAs introduced by coercion

## Warning in force.type(f): NAs introduced by coercion


f.tennis$Sex <- "female"
tennis.data <- bind_rows(m.tennis,f.tennis)
tennis.data <- tennis.data[,c(55,1:54,56,57)]
save(tennis.data,file="tennis.Rda")
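
The import warnings above come from cells that hold text such as 'N/A', '5..5' or '2.,3' in columns that should be numeric. The following is a minimal sketch of how such entries could be separated into "declared missing" and "malformed" before coercion. The helper `to_numeric_checked` and its argument `na_strings` are hypothetical names introduced for illustration; this is not the `force.type()` function used in this solution.

```r
# Hypothetical helper (an assumption, not the force.type() used above):
# convert a character vector to numeric, treat known missing-value markers
# as NA explicitly, and report every other entry that fails to parse.
to_numeric_checked <- function(x, na_strings = c("N/A", "NA", "")) {
  x <- trimws(as.character(x))
  x[x %in% na_strings] <- NA_character_      # declared missing values
  out <- suppressWarnings(as.numeric(x))     # malformed entries become NA
  bad <- which(!is.na(x) & is.na(out))       # non-missing input that did not parse
  list(values = out, bad_index = bad, bad_input = x[bad])
}

# Toy input mimicking the problematic cells reported in the warnings
raw <- c("1.53", "N/A", "5..5", "2.,3", "10")
res <- to_numeric_checked(raw)
res$values      # numeric vector with NA for missing and malformed entries
res$bad_input   # the malformed strings, for manual inspection
```

Listing the malformed strings separately makes it possible to decide per case whether a value such as '5..5' should be corrected by hand or left as NA. As an alternative at import time, `read_excel()` has an `na` argument, so passing `na = "N/A"` may already suppress the warnings caused by that marker.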

3.1 Tidy data


Is the dataset you have constructed a tidy dataset? For later convenience, we are going to reorder the data.
Set the seed of the random number generator as follows: set.seed(662288). Then select half of your dataset
by randomly sampling rows. Now you have two subsets, each equally large (give or take one row). In the first
subset, label the winner as player A, and the loser as player B, and rename all other variables according to
this labeling. In this first subset, variable Winner becomes PlayerA, variable Loser is renamed as variable
PlayerB, WRank becomes RankA, etc. In that first subset, add a variable Winner and that variable takes
value A only. The difficult work relates to the second subset. In that subset, the winner should be labelled as
PlayerB, and the loser as PlayerA. This labeling applies to all other variables as well. The variable Winner
should be added to the second subset, taking value B only. Finally, merge both datasets into one big dataset.
If you have done it correctly, mean(dataset$Winner=="A") should be approximately 0.50. Retain only the
betting quotes from B365.
The reason for relabeling is the following. Before any match, you know the names of the two players, and
we call them PlayerA and PlayerB. Before the match you don’t know who wins! That outcome variable will
be in a new column, with the name Winner. This variable takes two values, A and B. This way you can talk
sensibly about the question 'how does the probability that player A wins depend on the difference in ranking
between players A and B?'.
Note that the original data are not organized in a tidy way: the outcome is encoded in the column names (Winner and Loser) instead of being stored as a value.
A match is described by three variables: the first player (PlayerA), the second player (PlayerB) and the
outcome (either A or B wins). We need to organize the data in this A/B labeling. In the next few weeks,
we will model the outcome of the match to depend on covariates. The current labeling does not allow any
modeling, as there is no variation in the dependent variable (Winner always wins). As we need to retain
information from B365 only, we remove all other columns with betting quotes.
#tennis.data <- select(tennis.data,Sex:Comment,B365W,B365L,WTA,Tier)
# Keep the match information and the B365 quotes only
tennis.data <- select(tennis.data,Sex,ATP,Location,Tournament,Date,Series,
                      Court,Surface,Round,`Best of`,Winner,Loser,WRank,LRank,W1,L1,
                      W2,L2,W3,L3,W4,L4,W5,L5,Wsets,Lsets,Comment,B365W,B365L,WTA,Tier)
set.seed(662288)
# Randomly assign half of the rows to the first subset, the rest to the second
e <- sample(1:nrow(tennis.data),floor(nrow(tennis.data)/2))
part1 <- tennis.data[e,]
part2 <- tennis.data[-e,]
# Subset 1: the winner becomes player A
names(part1)[10:29] <- c("Best.of","PlayerA","PlayerB",
                         "RankA","RankB","A1","B1","A2","B2","A3","B3","A4","B4","A5","B5",
                         "SetsA","SetsB","Comment","B365A","B365B")
part1$Winner <- "A"
# Subset 2: the winner becomes player B, so every A/B label is swapped
names(part2)[10:29] <- c("Best.of","PlayerB","PlayerA",
                         "RankB","RankA","B1","A1","B2","A2","B3","A3","B4","A4","B5","A5",
                         "SetsB","SetsA","Comment","B365B","B365A")
part2$Winner <- "B"
# bind_rows() matches columns by name, so the swapped labels line up
tennis.data.relabeled <- bind_rows(part1,part2)
mean(tennis.data.relabeled$Winner=="A")

## [1] 0.4999942
This seems to have done the trick. Compare both datasets using View(tennis.data) and View(tennis.data.relabeled).
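
The relabeling works because the two parts are stacked by column name, not by column position. The toy sketch below illustrates this on four made-up matches. It uses base R `rbind()`, which, like `bind_rows()`, matches data frame columns by name. The player names and ranks are invented for illustration only.

```r
# Toy data: four made-up matches in the original Winner/Loser layout
matches <- data.frame(
  Winner = c("P1", "P2", "P3", "P4"),
  Loser  = c("P5", "P6", "P7", "P8"),
  WRank  = c(1, 10, 3, 20),
  LRank  = c(5, 2, 30, 4),
  stringsAsFactors = FALSE
)

set.seed(1)
e <- sample(seq_len(nrow(matches)), floor(nrow(matches) / 2))
part1 <- matches[e, ]
part2 <- matches[-e, ]

# Subset 1: winner is labelled A
names(part1) <- c("PlayerA", "PlayerB", "RankA", "RankB")
part1$Winner <- "A"

# Subset 2: winner is labelled B, so the A/B names are swapped
names(part2) <- c("PlayerB", "PlayerA", "RankB", "RankA")
part2$Winner <- "B"

# Columns are matched by name, so PlayerA of part2 ends up under PlayerA
relabeled <- rbind(part1, part2)

# Recover the name of the actual winner from the A/B labeling
actual_winner <- ifelse(relabeled$Winner == "A",
                        relabeled$PlayerA, relabeled$PlayerB)
relabeled
```

In the stacked result, the recovered `actual_winner` should always be one of the original winners, which is a quick sanity check that no labels were mixed up.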

3.2 Check your data


Check your dataset. Are there any weird values, or problems? Note that the betting quotes should be at least
1.0. Save your dataset.
We select completed matches only. Moreover, for later use we need data with ‘reasonable’ values for betting
quotes, so the sum of their inverses (known as the overround) should not exceed 1.15.
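
As a small worked example of this rule, the sketch below computes the sum of inverse quotes for three made-up pairs of quotes and marks which rows would pass the 1.15 threshold. The quotes are invented for illustration. The column names `B365A` and `B365B` follow the relabeled dataset, while `overround` and `keep` are names introduced here.

```r
# Toy decimal quotes for three hypothetical matches
quotes <- data.frame(
  B365A = c(1.50, 1.10, 2.00),
  B365B = c(2.50, 4.00, 2.00)
)

# Sum of the inverse quotes for each match
quotes$overround <- 1 / quotes$B365A + 1 / quotes$B365B

# Rows that satisfy the threshold used in this solution
quotes$keep <- quotes$overround <= 1.15
quotes
```

For the first row the sum is 1/1.50 + 1/2.50, roughly 1.067, so the match is kept. For the second row it is 1/1.10 + 1/4.00, roughly 1.159, so the match would be dropped.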
table(tennis.data.relabeled$Comment)

##
## Awarded Completed Disqualified Retired Sched Walkoer
## 1 83274 2 2700 2 1
## Walkover
## 461
tennis.data <- filter(tennis.data.relabeled,Comment=="Completed")
tennis.data <- filter(tennis.data,1/B365A + 1/B365B <=1.15)
lapply(tennis.data,summary)

## $Sex
## Length Class Mode
## 76363 character character
##
## $ATP
## Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
## 1.00 19.00 32.00 32.56 48.00 67.00 31718
##
## $Location
## Length Class Mode
## 76363 character character
##
## $Tournament
## Length Class Mode
## 76363 character character
##
## $Date
## Min. 1st Qu. Median
## "2002-12-30 00:00:00" "2008-08-22 00:00:00" "2012-07-21 00:00:00"
## Mean 3rd Qu. Max.
## "2012-06-27 11:51:29" "2016-06-28 00:00:00" "2021-01-13 00:00:00"
## NA's
## "2"
##
## $Series
## Length Class Mode
## 76363 character character
##

## $Court
## Length Class Mode
## 76363 character character
##
## $Surface
## Length Class Mode
## 76363 character character
##
## $Round
## Length Class Mode
## 76363 character character
##
## $Best.of
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 3.000 3.000 3.000 3.225 3.000 5.000
##
## $PlayerA
## Length Class Mode
## 76363 character character
##
## $PlayerB
## Length Class Mode
## 76363 character character
##
## $RankA
## Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
## 1.00 24.00 52.00 72.87 91.00 2159.00 112
##
## $RankB
## Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
## 1.00 24.00 53.00 73.67 92.00 1890.00 122
##
## $A1
## Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
## 0.000 3.000 6.000 4.834 6.000 7.000 2
##
## $B1
## Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
## 0.000 3.000 6.000 4.833 6.000 7.000 2
##
## $A2
## Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
## 0.000 3.000 6.000 4.759 6.000 7.000 5
##
## $B2
## Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
## 0.00 3.00 6.00 4.77 6.00 7.00 4
##
## $A3
## Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
## 0.00 3.00 6.00 4.81 6.00 16.00 44708
##
## $B3
## Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
## 0.0 3.0 6.0 4.8 6.0 15.0 44708
##
## $A4
## Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
## 0.00 3.00 6.00 4.86 6.00 7.00 72110
##
## $B4
## Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
## 0.00 3.00 6.00 4.86 6.00 7.00 72110
##
## $A5
## Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
## 0.0 3.0 6.0 5.2 6.0 70.0 74739
##
## $B5
## Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
## 0.0 4.0 6.0 5.3 6.0 68.0 74739
##
## $SetsA
## Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
## 0.000 0.000 2.000 1.246 2.000 3.000 2
##
## $SetsB
## Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
## 0.000 0.000 2.000 1.245 2.000 3.000 3
##
## $Comment
## Length Class Mode
## 76363 character character
##
## $B365A
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.971 1.390 1.830 2.624 2.750 67.000
##
## $B365B
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.967 1.400 1.830 2.639 2.870 101.000
##
## $WTA
## Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
## 1.00 15.00 27.00 27.72 42.00 61.00 44645
##
## $Tier
## Length Class Mode
## 76363 character character
##
## $Winner
## Length Class Mode
## 76363 character character
tennis.data$Date <- as.Date(tennis.data$Date)
save(tennis.data,file="tennis.Rda")

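The maximum values for the betting quotes are quite high, and the minimum values (0.971 and 0.967) lie below
the lower bound of 1.0 mentioned in the question. We should check both when we transform the quotes into
implied winning probabilities during one of the coming weeks.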
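
A possible way to prepare that check is sketched below. It converts quotes into implied probabilities by normalizing the inverse quotes so that they sum to one, and flags quotes below 1.0. The normalization is one common convention and is an assumption here, not necessarily the method used later in the course. The pairs of quotes are toy values (the extremes 0.971 and 67 are taken from the summary above, their pairing is made up), and `probA`, `probB` and `below_one` are names introduced for illustration.

```r
# Toy quotes: one ordinary match and two rows with extreme values
quotes <- data.frame(
  B365A = c(1.83, 0.971, 67.0),
  B365B = c(1.83, 15.0, 1.005)
)

inv_a <- 1 / quotes$B365A
inv_b <- 1 / quotes$B365B

# Assumed convention: rescale the inverse quotes so they sum to one
quotes$probA <- inv_a / (inv_a + inv_b)
quotes$probB <- inv_b / (inv_a + inv_b)

# Flag rows where a quote is below the lower bound of 1.0
quotes$below_one <- quotes$B365A < 1 | quotes$B365B < 1
quotes
```

Rows flagged in `below_one` can then be inspected by hand before deciding whether to drop or correct them.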
