Problem Set 1 Solution Numerical Methods
7 feb 2023
with $\varepsilon_i \sim N(0, 0.25)$. Construct an X matrix with a column of ones and the vector x, and calculate the OLS
estimator $(X'X)^{-1}X'y$.
The purpose is to do some simple linear algebra and regression, so that you are familiar with basic functions.
Note that one of the arguments of rnorm is sd and not the variance.
set.seed(3572)
x <- rnorm(200,mean=0,sd=sqrt(2))
y <- 0.3+0.2*x + rnorm(200,sd=sqrt(0.25))
X <- cbind(rep(1,200),x)
b.ols <- solve(t(X) %*% X, t(X) %*% y)
b.ols
## [,1]
## 0.2890479
## x 0.2115689
lm(y~x)
##
## Call:
## lm(formula = y ~ x)
##
## Coefficients:
## (Intercept) x
## 0.2890 0.2116
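As a cross-check, the same estimator can be computed in several ways. The sketch below is not part of the original solution: it uses its own seed and freshly simulated data (so the numbers differ from those above), and compares the normal equations written with `crossprod`, a QR-based least-squares solve, and `lm`.

```r
# Illustrative data; the seed and object names are assumptions for this sketch.
set.seed(1)
n <- 200
x <- rnorm(n, mean = 0, sd = sqrt(2))
y <- 0.3 + 0.2 * x + rnorm(n, sd = sqrt(0.25))
X <- cbind(1, x)

# Normal equations: solve (X'X) b = X'y without forming the inverse explicitly.
b.normal <- solve(crossprod(X), crossprod(X, y))

# QR-based least squares, usually numerically more stable than the normal equations.
b.qr <- qr.solve(X, y)

# Reference: the coefficients reported by lm().
b.lm <- coef(lm(y ~ x))

all.equal(as.numeric(b.normal), as.numeric(b.lm))
all.equal(as.numeric(b.qr), as.numeric(b.lm))
```

All three routes should agree up to floating-point error; `lm` itself relies on a QR decomposition.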
1.3 Problem 4
Consider the matrix
$$A = \begin{pmatrix} 1 & 0 & 0 & 1 \\ 0 & 2 & 0 & 0 \\ 0 & 0 & 3 & 0 \\ 0 & 0 & 0 & 4 \end{pmatrix}.$$
Calculate the inverse of A, and its eigenvalues and eigenvectors.
A <- diag(1:4)
A[1,4] <- 1
A
## eigen() decomposition
## $values
## [1] 4 3 2 1
##
## $vectors
## [,1] [,2] [,3] [,4]
## [1,] 0.3162278 0 0 1
## [2,] 0.0000000 0 1 0
## [3,] 0.0000000 1 0 0
## [4,] 0.9486833 0 0 0
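A quick way to check an inverse and an eigendecomposition is to verify the defining identities numerically. The following sketch does this for the matrix A above; the names `A.inv`, `e`, `err.inverse` and `err.eigen` are chosen here for illustration only.

```r
A <- diag(1:4)
A[1, 4] <- 1

A.inv <- solve(A)   # inverse of A
e <- eigen(A)       # eigenvalues and eigenvectors

# A %*% A.inv should be the identity matrix.
err.inverse <- max(abs(A %*% A.inv - diag(4)))

# Each column v of e$vectors should satisfy A v = lambda v.
err.eigen <- max(abs(A %*% e$vectors - e$vectors %*% diag(e$values)))

c(err.inverse, err.eigen)
```

Both errors should be zero up to rounding. Because A is upper triangular, its eigenvalues are simply its diagonal elements.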
p <- c(0,1,1)
x <- c(1,2,3,4)
f.correct(x,p)
## [1] 2 6 12 20
R would also evaluate the function if its parameters happen to exist in the workspace rather than being passed as arguments, but this may
lead to unintended outcomes.
f.wrong <- function(x){
p[1] + p[2]*x + p[3]*x^2
}
f.wrong(x)
## [1] 2 6 12 20
p <- p/2
f.wrong(x)
## [1] 1 3 6 10
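The safer pattern is to pass every parameter explicitly. The sketch below uses a hypothetical function name, `f.explicit`, to show that a function with an explicit `p` argument is unaffected by later changes to a workspace object that happens to be called `p`.

```r
# Hypothetical helper: the polynomial depends only on its arguments.
f.explicit <- function(x, p) {
  p[1] + p[2] * x + p[3] * x^2
}

x <- c(1, 2, 3, 4)
p <- c(0, 1, 1)
r1 <- f.explicit(x, p = c(0, 1, 1))

p <- p / 2                            # the workspace object changes ...
r2 <- f.explicit(x, p = c(0, 1, 1))   # ... but the result does not

identical(r1, r2)
```

Here `r1` and `r2` both equal 2, 6, 12, 20, whereas `f.wrong` silently picks up whatever `p` is in the workspace at the time of the call.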
1.5 AR-model
Write a loop that generates data according to an AR(1) process: $y_t = \alpha y_{t-1} + \varepsilon_t$, for $t = 1, \ldots, 2000$. Make a
graph of your time series $y_t$ for three different values of $\alpha$. Take $\operatorname{var}(\varepsilon_t) = 0.25$.
We could use a package and a function to do this, but the point here is to write a loop.
ar.one <- function(T,alpha,sigma){
y0 <- rnorm(1,sd=sigma)
y <- rep(NA,T)
y[1] <- alpha * y0 + rnorm(1,sd=sigma)
for (t in 2:T){
y[t] <- alpha*y[t-1] + rnorm(1,sd=sigma)
}
y
}
y1 <- ar.one(2000,0.3,0.5)
y2 <- ar.one(2000,-0.1,0.5)
y3 <- ar.one(2000,0.95,0.5)
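As a cross-check on a hand-written loop, the recursion can be compared with `stats::filter` using `method = "recursive"`. This sketch is an addition with its own assumptions: it starts from y_0 = 0 (whereas `ar.one` draws y_0 at random) and feeds the same innovations to both versions so that they can be compared exactly.

```r
set.seed(1)
alpha <- 0.95
sigma <- 0.5
n <- 2000
e <- rnorm(n, sd = sigma)          # innovations shared by both versions

# Explicit loop, assuming y_0 = 0.
y.loop <- numeric(n)
y.loop[1] <- e[1]
for (t in 2:n) {
  y.loop[t] <- alpha * y.loop[t - 1] + e[t]
}

# Same recursion through the built-in recursive filter.
y.filter <- as.numeric(stats::filter(e, alpha, method = "recursive"))

all.equal(y.loop, y.filter)

# Stationary variance of an AR(1) process with |alpha| < 1.
var.theory <- sigma^2 / (1 - alpha^2)
c(theory = var.theory, sample = var(y.loop))
```

The stationary variance grows quickly as alpha approaches 1, which is why the series with alpha = 0.95 covers a much wider range than the other two.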
plot(1:2000,y1)
[Figure: plot of y1 against 1:2000]
plot(1:2000,y2)
[Figure: plot of y2 against 1:2000]
plot(1:2000,y3)
[Figure: plot of y3 against 1:2000]
These plots are not very helpful. Below is a slightly more elaborate attempt.
library(ggplot2)
y.df <- data.frame(y=c(y1,y2,y3),alpha=as.factor(c(rep(0.3,2000),rep(-0.1,2000),rep(0.95,2000))),
i=rep(1:2000,3))
ggplot(y.df) + geom_point(aes(x=i,y=y,color=alpha),size=0.2)
[Figure: scatter plot of y against i, colored by alpha (-0.1, 0.3, 0.95)]
ggplot(y.df) + geom_point(aes(x=i,y=y),size=0.2) + facet_wrap(~alpha)
[Figure: scatter plots of y against i, one panel per value of alpha (-0.1, 0.3, 0.95)]
## [1] 1.110223e-16
1+eps
## [1] 1
(1+eps)==1
## [1] TRUE
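The value printed above is consistent with halving a candidate number until adding it to 1 no longer changes the result. The sketch below follows that idea; it is an assumption about how such a value can be found, not necessarily the code used in the original solution.

```r
eps <- 1
# Keep halving while 1 + eps is still distinguishable from 1.
while (1 + eps != 1) {
  eps <- eps / 2
}
eps                        # the first value for which 1 + eps == 1
(1 + eps) == 1

# R's documented machine epsilon is the smallest x with 1 + x != 1.
.Machine$double.eps
2 * eps == .Machine$double.eps
```

With standard double-precision arithmetic the loop stops at 2^-53, which is half of `.Machine$double.eps`.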
• $\bar{X} \pm t^{-1}_{n-1}(0.975)\, S/\sqrt{n}$ is a 95% confidence interval for $\mu$, with $t^{-1}_{n-1}(0.975)$ the 97.5th percentile of a t
distribution with $n - 1$ degrees of freedom, and $S$ the square root of the unbiased estimator for the
variance of $X$, $\sigma^2$.
Take $n = 25$, $\mu = 0$ and $\sigma^2 = 2$, and show the validity of these two claims in a simple simulation experiment.
These statements concern finite sample properties, so we should assess them by repeatedly sampling from the
(true) distribution, and check whether the claims hold.
mu <- 0
sigma <- sqrt(2)
n <- 25
set.seed(63729)
for (b in 1:B){
# generate a sample
sample.b <- rnorm(n,mean=mu,sd=sigma)
m.sampled[b] <- mean(sample.b)
}
mean(m.sampled)
## [1] -0.000373545
mean(ci.sampled[,1]<mu & ci.sampled[,2]>mu)
## [1] 0.949782
The average of the sampled means is very close to the true value (µ = 0), and the coverage of the confidence
interval is very close to 0.95. Whether the choice of a million replications makes sense depends on the
specifications of your laptop or computer. First test your code with, say, B = 10. Later in the course, we will
see a technique that can be used to speed up this type of simulation study.
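For reference, a complete version of such a simulation could look as follows. This is a sketch under stated assumptions: the number of replications (B = 2000) and the seed are chosen here to keep the run short, so the results will differ from the numbers reported above.

```r
mu <- 0
sigma <- sqrt(2)
n <- 25
B <- 2000                       # assumption: far fewer replications than a million
set.seed(1)

m.sampled <- rep(NA, B)         # sample means
ci.sampled <- matrix(NA, B, 2)  # lower and upper confidence bounds
t.crit <- qt(0.975, df = n - 1) # 97.5th percentile of t with n - 1 df

for (b in 1:B) {
  sample.b <- rnorm(n, mean = mu, sd = sigma)
  m.sampled[b] <- mean(sample.b)
  half.width <- t.crit * sd(sample.b) / sqrt(n)
  ci.sampled[b, ] <- m.sampled[b] + c(-1, 1) * half.width
}

mean(m.sampled)                                     # should be close to mu
mean(ci.sampled[, 1] < mu & ci.sampled[, 2] > mu)   # should be close to 0.95
```

Note that `sd()` already uses the n - 1 denominator, so it matches the S in the confidence interval.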
First, we set the working directory to the one containing the data, load the appropriate library, and read the
data. When reading a csv file, make sure that the decimal separator is a point and not a
comma. This sometimes varies with the language of the operating system. Nothing beats looking at the first few
lines of data in a simple text editor.
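The decimal-separator issue can be reproduced with a small made-up file. In the sketch below the file contents are hypothetical; it shows how `readLines` reveals the raw text and how the `sep` and `dec` arguments of `read.csv` handle a semicolon-separated file with decimal commas.

```r
# Hypothetical file with ";" as field separator and "," as decimal separator.
tmp <- tempfile(fileext = ".csv")
writeLines(c("date;close", "2023-01-02;126,38", "2023-01-03;126,77"), tmp)

readLines(tmp, n = 3)                          # look at the raw lines first

wrong <- read.csv(tmp, sep = ";")              # default dec = ".": close is not numeric
right <- read.csv(tmp, sep = ";", dec = ",")   # close is read as a number

str(wrong)
str(right)
unlink(tmp)
```

`read.csv2` uses these two settings as its defaults and would give the same result as `right`.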
library(readxl)
suppressMessages(library(tidyverse))
aex <- read_excel("aex.xlsx")
glimpse(aex)
## Rows: 7,831
## Columns: 7
## $ Date <chr> "1992-10-12", "1992-10-13", "1992-10-14", "1992-10-15", "1~
## $ Open <chr> "126.382904", "126.773155", "126.056175", "126.097015", "1~
## $ High <chr> "127.181557", "127.004585", "126.555336", "126.568954", "1~
## $ Low <chr> "126.056175", "126.491806", "126.001724", "125.688614", "1~
## $ Close <chr> "126.945595", "126.850296", "126.487274", "125.824753", "1~
## $ `Adj Close` <chr> "126.945595", "126.850296", "126.487274", "125.824753", "1~
## $ Volume <chr> "0", "0", "0", "0", "0", "0", "0", "0", "0", "0", "0", "0"~
dax <- read.csv("dax.csv")
glimpse(dax)
## Rows: 9,041
## Columns: 7
## $ Date <chr> "1987-12-30", "1987-12-31", "1988-01-01", "1988-01-04", "198~
## $ Open <chr> "1005.190002", "null", "null", "956.489990", "996.099976", "~
## $ High <chr> "1005.190002", "null", "null", "956.489990", "996.099976", "~
## $ Low <chr> "1005.190002", "null", "null", "956.489990", "996.099976", "~
## $ Close <chr> "1005.190002", "null", "null", "956.489990", "996.099976", "~
## $ Adj.Close <chr> "1005.190002", "null", "null", "956.489990", "996.099976", "~
## $ Volume <chr> "0", "null", "null", "0", "0", "0", "0", "0", "0", "0", "0",~
save(aex,dax,file="stock data.Rda")
##
## Attaching package: 'lubridate'
## The following objects are masked from 'package:base':
##
## date, intersect, setdiff, union
aex <- read_excel("aex.xlsx")
glimpse(aex)
## Rows: 7,831
## Columns: 7
## $ Date <chr> "1992-10-12", "1992-10-13", "1992-10-14", "1992-10-15", "1~
## $ Open <chr> "126.382904", "126.773155", "126.056175", "126.097015", "1~
## $ High <chr> "127.181557", "127.004585", "126.555336", "126.568954", "1~
## $ Low <chr> "126.056175", "126.491806", "126.001724", "125.688614", "1~
## $ Close <chr> "126.945595", "126.850296", "126.487274", "125.824753", "1~
## $ `Adj Close` <chr> "126.945595", "126.850296", "126.487274", "125.824753", "1~
## $ Volume <chr> "0", "0", "0", "0", "0", "0", "0", "0", "0", "0", "0", "0"~
dax <- read.csv("dax.csv",as.is=TRUE)
glimpse(dax)
## Rows: 9,041
## Columns: 7
## $ Date <chr> "1987-12-30", "1987-12-31", "1988-01-01", "1988-01-04", "198~
## $ Open <chr> "1005.190002", "null", "null", "956.489990", "996.099976", "~
## $ High <chr> "1005.190002", "null", "null", "956.489990", "996.099976", "~
## $ Low <chr> "1005.190002", "null", "null", "956.489990", "996.099976", "~
## $ Close <chr> "1005.190002", "null", "null", "956.489990", "996.099976", "~
## $ Adj.Close <chr> "1005.190002", "null", "null", "956.489990", "996.099976", "~
## $ Volume <chr> "0", "null", "null", "0", "0", "0", "0", "0", "0", "0", "0",~
names(aex) <- tolower(names(aex))
head(aex)
## # A tibble: 6 x 7
## date open high low close `adj close` volume
## <chr> <chr> <chr> <chr> <chr> <chr> <chr>
## 1 1992-10-12 126.382904 127.181557 126.056175 126.945595 126.945595 0
## 2 1992-10-13 126.773155 127.004585 126.491806 126.850296 126.850296 0
## 3 1992-10-14 126.056175 126.555336 126.001724 126.487274 126.487274 0
## 4 1992-10-15 126.097015 126.568954 125.688614 125.824753 125.824753 0
## 5 1992-10-16 126.264915 126.478195 124.858192 125.230293 125.230293 0
## 6 1992-10-19 124.967102 125.184914 124.141220 124.286430 124.286430 0
aex <- mutate(aex,date=ymd(date),open=as.numeric(open),high=as.numeric(high),
low=as.numeric(low),close=as.numeric(close),adj.close=as.numeric(`adj close`),
volume=as.numeric(volume)) %>% select(-`adj close`) %>%
mutate(d.return=(log(adj.close)-log(lag(adj.close))))
## 1 1987-12-30 1005.190002 1005.190002 1005.190002 1005.190002 1005.190002 0
## 2 1987-12-31 null null null null null null
## 3 1988-01-01 null null null null null null
## 4 1988-01-04 956.489990 956.489990 956.489990 956.489990 956.489990 0
## 5 1988-01-05 996.099976 996.099976 996.099976 996.099976 996.099976 0
## 6 1988-01-06 1006.010010 1006.010010 1006.010010 1006.010010 1006.010010 0
dax <- mutate(dax,date=ymd(date),open=as.numeric(open),high=as.numeric(high),
low=as.numeric(low),close=as.numeric(close),adj.close=as.numeric(adj.close),
volume=as.numeric(volume)) %>%
mutate(d.return=(log(adj.close)-log(lag(adj.close))))
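The two steps above, coercing text to numbers and taking log differences, can be illustrated in base R on a made-up price vector. The values below are hypothetical; note how a single "null" entry leads to missing returns on two consecutive days.

```r
raw <- c("100", "null", "110", "121")        # hypothetical prices stored as text

# "null" cannot be converted and becomes NA; as.numeric() normally warns about this.
price <- suppressWarnings(as.numeric(raw))

# Daily log return: log(p_t) - log(p_{t-1}); the first observation has no predecessor.
d.return <- c(NA, diff(log(price)))

data.frame(price, d.return)
```

Only the last return is available here, and it equals log(121/110).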
## Joining, by = "date"
stock.returns2 <- left_join(aex,dax)
## Joining, by = "date"
stock.returns3 <- right_join(aex,dax)
## Joining, by = "date"
glimpse(filter(stock.returns1,date=="1988-01-25"))
## Rows: 1
## Columns: 15
## $ date <date> 1988-01-25
## $ open.aex <dbl> NA
## $ high.aex <dbl> NA
## $ low.aex <dbl> NA
## $ close.aex <dbl> NA
## $ volume.aex <dbl> NA
## $ adj.close.aex <dbl> NA
## $ d.return.aex <dbl> NA
## $ open.dax <dbl> 962.63
## $ high.dax <dbl> 962.63
## $ low.dax <dbl> 962.63
## $ close.dax <dbl> 962.63
## $ adj.close.dax <dbl> 962.63
## $ volume.dax <dbl> 0
## $ d.return.dax <dbl> -0.003991457
glimpse(filter(stock.returns2,date=="1988-01-25"))
## Rows: 0
## Columns: 15
## $ date <date>
## $ open.aex <dbl>
## $ high.aex <dbl>
## $ low.aex <dbl>
## $ close.aex <dbl>
## $ volume.aex <dbl>
## $ adj.close.aex <dbl>
## $ d.return.aex <dbl>
## $ open.dax <dbl>
## $ high.dax <dbl>
## $ low.dax <dbl>
## $ close.dax <dbl>
## $ adj.close.dax <dbl>
## $ volume.dax <dbl>
## $ d.return.dax <dbl>
glimpse(filter(stock.returns3,date=="1988-01-25"))
## Rows: 1
## Columns: 15
## $ date <date> 1988-01-25
## $ open.aex <dbl> NA
## $ high.aex <dbl> NA
## $ low.aex <dbl> NA
## $ close.aex <dbl> NA
## $ volume.aex <dbl> NA
## $ adj.close.aex <dbl> NA
## $ d.return.aex <dbl> NA
## $ open.dax <dbl> 962.63
## $ high.dax <dbl> 962.63
## $ low.dax <dbl> 962.63
## $ close.dax <dbl> 962.63
## $ adj.close.dax <dbl> 962.63
## $ volume.dax <dbl> 0
## $ d.return.dax <dbl> -0.003991457
dim(merge(aex,dax))
## [1] 7792 15
dim(merge(aex,dax,all.x=TRUE))
## [1] 7831 15
dim(merge(aex,dax,all.y=TRUE))
## [1] 9041 15
dim(merge(aex,dax,all=TRUE))
## [1] 9080 15
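The four row counts above are linked: the full join has as many rows as the left join plus the right join minus the inner join (7831 + 9041 - 7792 = 9080). A toy example with made-up data frames `a` and `b` shows the same pattern.

```r
# Hypothetical data: a covers 1-3 January, b covers 2-5 January.
a <- data.frame(date = as.Date("2023-01-01") + 0:2, aex = c(1, 2, 3))
b <- data.frame(date = as.Date("2023-01-01") + 1:4, dax = c(10, 20, 30, 40))

n.inner <- nrow(merge(a, b))                # dates present in both: 2
n.left  <- nrow(merge(a, b, all.x = TRUE))  # all dates of a: 3
n.right <- nrow(merge(a, b, all.y = TRUE))  # all dates of b: 4
n.full  <- nrow(merge(a, b, all = TRUE))    # dates in either: 5

n.full == n.left + n.right - n.inner
```

This identity holds when the key (here `date`) is unique within each data frame.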
stock.returns <- stock.returns1
save(aex,dax,stock.returns,file="stock data.Rda")
mean(is.element(aex$date,stock.returns$date))
## [1] 1
mean(is.element(dax$date,stock.returns$date))
## [1] 1
stocks.long <- select(stock.returns,date,aex=d.return.aex,dax=d.return.dax) %>%
pivot_longer(!date,names_to="market",values_to="return")
head(stocks.long)
## # A tibble: 6 x 3
## date market return
## <date> <chr> <dbl>
## 1 1992-10-12 aex NA
## 2 1992-10-12 dax 0.00272
## 3 1992-10-13 aex -0.000751
## 4 1992-10-13 dax 0.0206
## 5 1992-10-14 aex -0.00287
## 6 1992-10-14 dax -0.0121
ggplot(stocks.long) + geom_density(aes(x=return,color=market))
[Figure: density of return, colored by market (aex, dax)]
[Figure: density of return, one panel per market (aex, dax); x-axis: return, y-axis: density]
[Figure: correlation (y-axis, 0 to 1) against month (x-axis), with one axis label per month from 1987-12 to 2023-02]
monthly.correlations <- mutate(monthly.correlations,month=ymd(month,truncated=2))
ggplot(monthly.correlations,aes(x=month,y=correlation))+geom_point()+ylim(0,1)
[Figure: correlation (y-axis, 0 to 1) against month]
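One way to obtain a correlation per month is to split the data by a year-month label and apply `cor` to each piece. The sketch below is a base R illustration on simulated returns; the data, the object names and the method itself are assumptions, since the original computation is not shown here.

```r
set.seed(1)
# Simulated daily returns for three months (hypothetical data).
date <- seq(as.Date("2020-01-01"), as.Date("2020-03-31"), by = "day")
aex <- rnorm(length(date), sd = 0.01)
dax <- 0.8 * aex + rnorm(length(date), sd = 0.005)

# Year-month label for each observation, e.g. "2020-01".
month <- format(date, "%Y-%m")

# Correlation within each month; complete.obs drops days with a missing return.
correlation <- sapply(split(data.frame(aex, dax), month),
                      function(d) cor(d$aex, d$dax, use = "complete.obs"))

# Store the month as a Date (first day of the month) so that it plots on a time axis.
monthly <- data.frame(month = as.Date(paste0(names(correlation), "-01")),
                      correlation = unname(correlation))
monthly
```

Converting the month label to a date serves the same purpose as the `ymd(month, truncated = 2)` step: the axis becomes continuous instead of showing one label per month.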
## # A tibble: 2,963 x 34
## ATP Location Tournament Date Series Court Surface Round
## <dbl> <chr> <chr> <dttm> <chr> <chr> <chr> <chr>
## 1 1 Adelaide AAPT Champi~ 2001-01-01 00:00:00 Interna~ Outdo~ Hard 1st ~
## 2 1 Adelaide AAPT Champi~ 2001-01-01 00:00:00 Interna~ Outdo~ Hard 1st ~
## 3 1 Adelaide AAPT Champi~ 2001-01-01 00:00:00 Interna~ Outdo~ Hard 1st ~
## 4 1 Adelaide AAPT Champi~ 2001-01-01 00:00:00 Interna~ Outdo~ Hard 1st ~
## 5 1 Adelaide AAPT Champi~ 2001-01-01 00:00:00 Interna~ Outdo~ Hard 1st ~
## 6 1 Adelaide AAPT Champi~ 2001-01-01 00:00:00 Interna~ Outdo~ Hard 1st ~
## 7 1 Adelaide AAPT Champi~ 2001-01-01 00:00:00 Interna~ Outdo~ Hard 1st ~
## 8 1 Adelaide AAPT Champi~ 2001-01-01 00:00:00 Interna~ Outdo~ Hard 1st ~
## 9 1 Adelaide AAPT Champi~ 2001-01-01 00:00:00 Interna~ Outdo~ Hard 1st ~
## 10 1 Adelaide AAPT Champi~ 2001-01-01 00:00:00 Interna~ Outdo~ Hard 1st ~
## # ... with 2,953 more rows, and 26 more variables: Best of <dbl>, Winner <chr>,
## # Loser <chr>, WRank <dbl>, LRank <chr>, W1 <dbl>, L1 <dbl>, W2 <dbl>,
## # L2 <dbl>, W3 <dbl>, L3 <dbl>, W4 <dbl>, L4 <dbl>, W5 <dbl>, L5 <dbl>,
## # Wsets <dbl>, Lsets <dbl>, Comment <chr>, CBW <dbl>, CBL <dbl>, GBW <dbl>,
## # GBL <dbl>, IWW <dbl>, IWL <dbl>, SBW <dbl>, SBL <dbl>
The variable LRank should be <dbl>, so we write a function that forces WRank and LRank to be numeric. The
other variables that are forced to be numeric come from other Excel files with incorrect types.
force.type <- function(d){
nm <- names(d)
if (is.element("WRank",nm)) d$WRank <- as.numeric(d$WRank)
if (is.element("LRank",nm)) d$LRank <- as.numeric(d$LRank)
if (is.element("B365W",nm)) d$B365W <- as.numeric(d$B365W)
if (is.element("B365L",nm)) d$B365L <- as.numeric(d$B365L)
if (is.element("WPts",nm)) d$WPts <- as.numeric(d$WPts)
if (is.element("LPts",nm)) d$LPts <- as.numeric(d$LPts)
if (is.element("LBW",nm)) d$LBW <- as.numeric(d$LBW)
if (is.element("LBL",nm)) d$LBL <- as.numeric(d$LBL)
d
}
First we read all the files into tibbles and store those in a list. Then we force the types if necessary (and if
the variable is present).
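The read-then-clean pattern can be sketched with `lapply` over a list of data frames. Everything below is illustrative: `force.numeric` is a hypothetical, more compact variant of the function above, and two small made-up data frames stand in for the Excel files.

```r
# Hypothetical compact variant: coerce only those listed columns that are present.
force.numeric <- function(d, vars = c("WRank", "LRank")) {
  for (v in intersect(vars, names(d))) {
    # Entries such as "N/A" become NA; the coercion warning is suppressed in this sketch.
    d[[v]] <- suppressWarnings(as.numeric(d[[v]]))
  }
  d
}

# Stand-ins for files read from disk: LRank arrives as text in both.
files <- list(
  data.frame(Winner = "X", WRank = 1, LRank = "5", stringsAsFactors = FALSE),
  data.frame(Winner = "Y", WRank = 3, LRank = "N/A", stringsAsFactors = FALSE)
)

cleaned <- lapply(files, force.numeric)   # fix the types file by file
combined <- do.call(rbind, cleaned)       # stack into one data frame
str(combined)
```

`rbind` requires identical columns in every element; when the files differ in their columns, `dplyr::bind_rows` fills the gaps with NA instead.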
m.tennis <- force.type(f1)
## Rows: 2,854
## Columns: 3
## $ Date <dttm> 2001-12-31, 2001-12-31, 2001-12-31, 2001-12-31, 2001-12-31, 200~
## $ B365W <lgl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, ~
## $ B365L <lgl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, ~
## Warning in force.type(f): NAs introduced by coercion
## Rows: 2,861
## Columns: 3
## $ Date <dttm> 2002-12-30, 2002-12-30, 2002-12-30, 2002-12-30, 2002-12-30, 200~
## $ B365W <dbl> NA, NA, 1.364, NA, NA, NA, 1.667, 1.400, 1.667, 1.286, 1.800, NA~
## $ B365L <dbl> NA, NA, 2.875, NA, NA, NA, 2.100, 2.750, 2.100, 3.250, 1.909, NA~
## Warning in force.type(f): NAs introduced by coercion
## Rows: 2,877
## Columns: 3
## $ Date <dttm> 2004-01-05, 2004-01-05, 2004-01-05, 2004-01-05, 2004-01-05, 200~
## $ B365W <dbl> NA, 1.160, 2.000, 1.830, 1.400, NA, 1.800, 1.800, NA, 1.533, 1.4~
## $ B365L <dbl> NA, 4.500, 1.720, 1.830, 2.750, NA, 1.909, 1.900, NA, 2.375, 2.6~
## Warning in force.type(f): NAs introduced by coercion
## Warning in force.type(f): NAs introduced by coercion
## Warning in read_fun(path = enc2native(normalizePath(path)), sheet_i = sheet, :
## Expecting logical in O1724 / R1724C15: got 'N/A'
## Warning in read_fun(path = enc2native(normalizePath(path)), sheet_i = sheet, :
## Expecting logical in O1906 / R1906C15: got 'N/A'
## Warning in read_fun(path = enc2native(normalizePath(path)), sheet_i = sheet, :
## Expecting logical in O2015 / R2015C15: got 'N/A'
## Rows: 2,909
## Columns: 3
## $ Date <dttm> 2005-01-03, 2005-01-03, 2005-01-03, 2005-01-03, 2005-01-03, 200~
## $ B365W <dbl> 1.286, 1.833, 1.800, 1.667, 1.615, 1.333, 1.500, 1.533, 1.400, 1~
## $ B365L <dbl> 3.250, 1.833, 1.909, 2.100, 2.200, 3.000, 2.500, 2.375, 2.750, 2~
## Warning in force.type(f): NAs introduced by coercion
## Warning in read_fun(path = enc2native(normalizePath(path)), sheet_i = sheet, :
## Expecting numeric in AA1402 / R1402C27: got '`1'
## Rows: 2,909
## Columns: 3
## $ Date <dttm> 2006-01-02, 2006-01-02, 2006-01-02, 2006-01-02, 2006-01-02, 200~
## $ B365W <dbl> 1.39, 1.53, 1.28, 1.53, 1.44, 1.50, 1.40, 1.83, 1.90, 1.12, 1.66~
## $ B365L <dbl> 2.75, 2.37, 3.25, 2.37, 2.62, 2.50, 2.75, 1.83, 1.80, 5.50, 2.10~
## Warning in force.type(f): NAs introduced by coercion
## Warning in force.type(f): NAs introduced by coercion
## Warning in read_fun(path = enc2native(normalizePath(path)), sheet_i = sheet, :
## Expecting numeric in O2384 / R2384C15: got 'N/A'
## Warning in read_fun(path = enc2native(normalizePath(path)), sheet_i = sheet, :
## Expecting numeric in M2386 / R2386C13: got 'N/A'
## Warning in read_fun(path = enc2native(normalizePath(path)), sheet_i = sheet, :
## Expecting numeric in O2386 / R2386C15: got 'N/A'
## Warning in read_fun(path = enc2native(normalizePath(path)), sheet_i = sheet, :
## Expecting numeric in M2698 / R2698C13: got 'N/A'
## Warning in read_fun(path = enc2native(normalizePath(path)), sheet_i = sheet, :
## Expecting numeric in O2698 / R2698C15: got 'N/A'
## Rows: 2,806
## Columns: 3
## $ Date <dttm> 2007-01-01, 2007-01-01, 2007-01-01, 2007-01-01, 2007-01-02, 200~
## $ B365W <dbl> 2.87, 3.00, 2.00, 2.37, 1.50, 1.08, 2.87, 1.19, 2.62, 1.14, 1.83~
## $ B365L <dbl> 1.36, 1.33, 1.72, 1.53, 2.50, 6.50, 1.36, 4.00, 1.44, 5.00, 1.83~
## Rows: 2,707
## Columns: 3
## $ Date <dttm> 2007-12-31, 2007-12-31, 2007-12-31, 2007-12-31, 2007-12-31, 200~
## $ B365W <dbl> 1.53, 6.50, 3.75, 1.40, 1.72, 2.50, 1.44, NA, 1.57, 1.16, 1.28, ~
## $ B365L <dbl> 2.37, 1.10, 1.25, 2.75, 2.00, 1.50, 2.62, NA, 2.25, 4.50, 3.50, ~
## Warning in force.type(f): NAs introduced by coercion
## Warning in force.type(f): NAs introduced by coercion
## Warning in read_fun(path = enc2native(normalizePath(path)), sheet_i = sheet, :
## Expecting numeric in O1232 / R1232C15: got 'N/A'
## Rows: 2,675
## Columns: 3
## $ Date <dttm> 2011-01-02, 2011-01-02, 2011-01-03, 2011-01-03, 2011-01-03, 201~
## $ B365W <dbl> 1.53, 1.90, 1.83, 1.72, 1.36, 3.50, 1.53, 1.06, 1.36, 3.00, NA, ~
## $ B365L <dbl> 2.37, 1.80, 1.83, 2.00, 3.00, 1.28, 2.37, 8.00, 3.00, 1.36, NA, ~
## Rows: 2,607
## Columns: 3
## $ Date <dttm> 2012-01-01, 2012-01-01, 2012-01-02, 2012-01-02, 2012-01-02, 201~
## $ B365W <dbl> 4.33, 1.25, 3.25, 1.66, 1.40, 1.20, 1.36, 1.22, 1.36, 1.44, 1.50~
## $ B365L <dbl> 1.20, 3.75, 1.33, 2.10, 2.75, 4.33, 3.00, 4.00, 3.00, 2.62, 2.50~
## Warning in force.type(f): NAs introduced by coercion
## Warning in force.type(f): NAs introduced by coercion
## Warning in force.type(f): NAs introduced by coercion
## Warning in read_fun(path = enc2native(normalizePath(path)), sheet_i = sheet, :
## Expecting numeric in L1687 / R1687C12: got 'N/A'
## Warning in read_fun(path = enc2native(normalizePath(path)), sheet_i = sheet, :
## Expecting numeric in N1687 / R1687C14: got 'N/A'
## Rows: 2,626
## Columns: 3
## $ Date <dttm> 2016-01-04, 2016-01-04, 2016-01-04, 2016-01-04, 2016-01-05, 201~
## $ B365W <dbl> 1.66, 1.53, 1.72, 1.83, 1.28, 2.00, 1.33, 2.00, 1.06, 2.25, 2.00~
## $ B365L <dbl> 2.10, 2.37, 2.00, 1.83, 3.50, 1.72, 3.25, 1.72, 10.00, 1.57, 1.7~
## Warning in force.type(f): NAs introduced by coercion
## Warning in force.type(f): NAs introduced by coercion
## Rows: 2,633
## Columns: 3
## $ Date <dttm> 2017-01-01, 2017-01-02, 2017-01-02, 2017-01-02, 2017-01-02, 201~
## $ B365W <dbl> 1.28, 1.50, 1.90, 1.36, 1.40, 2.62, 2.25, 1.80, 1.50, 2.20, 1.44~
## $ B365L <dbl> 3.50, 2.50, 1.80, 3.00, 2.75, 1.44, 1.57, 1.90, 2.50, 1.61, 2.62~
## Warning in force.type(f): NAs introduced by coercion
## Warning in force.type(f): NAs introduced by coercion
We do the same for the files with female tennis results, and combine the result into one dataset.
f.files <- list.files("raw data/f",full.names=TRUE)
f1 <- read_excel(f.files[1])
## Rows: 2,491
## Columns: 3
## $ Date <dttm> 2007-01-01, 2007-01-01, 2007-01-01, 2007-01-01, 2007-01-01, 200~
## $ B365W <dbl> 1.33, 3.75, 1.72, 1.83, 1.16, 2.50, 1.36, 2.50, 1.83, 1.53, 1.72~
## $ B365L <dbl> 3.00, 1.22, 2.00, 1.83, 4.50, 1.50, 2.87, 1.50, 1.83, 2.37, 2.00~
class(f1)
## # A tibble: 2,491 x 34
## WTA Location Tournament Date Tier Court Surface Round
## <dbl> <chr> <chr> <dttm> <chr> <chr> <chr> <chr>
## 1 1 Auckland ASB Classic 2007-01-01 00:00:00 Tier 4 Outdoor Hard 1st Ro~
## 2 1 Auckland ASB Classic 2007-01-01 00:00:00 Tier 4 Outdoor Hard 1st Ro~
## 3 1 Auckland ASB Classic 2007-01-01 00:00:00 Tier 4 Outdoor Hard 1st Ro~
## 4 1 Auckland ASB Classic 2007-01-01 00:00:00 Tier 4 Outdoor Hard 1st Ro~
## 5 1 Auckland ASB Classic 2007-01-01 00:00:00 Tier 4 Outdoor Hard 1st Ro~
## 6 1 Auckland ASB Classic 2007-01-01 00:00:00 Tier 4 Outdoor Hard 1st Ro~
## 7 1 Auckland ASB Classic 2007-01-01 00:00:00 Tier 4 Outdoor Hard 1st Ro~
## 8 1 Auckland ASB Classic 2007-01-01 00:00:00 Tier 4 Outdoor Hard 1st Ro~
## 9 1 Auckland ASB Classic 2007-01-01 00:00:00 Tier 4 Outdoor Hard 1st Ro~
## 10 1 Auckland ASB Classic 2007-01-01 00:00:00 Tier 4 Outdoor Hard 1st Ro~
## # ... with 2,481 more rows, and 26 more variables: Best of <dbl>, Winner <chr>,
## # Loser <chr>, WRank <chr>, LRank <chr>, WPts <chr>, LPts <chr>, W1 <dbl>,
## # L1 <dbl>, W2 <dbl>, L2 <dbl>, W3 <dbl>, L3 <dbl>, Wsets <dbl>, Lsets <dbl>,
## # Comment <chr>, B365W <dbl>, B365L <dbl>, CBW <dbl>, CBL <dbl>, EXW <dbl>,
## # EXL <dbl>, PSW <dbl>, PSL <dbl>, UBW <dbl>, UBL <dbl>
f.tennis <- force.type(f1)
## Warning in force.type(f1): NAs introduced by coercion
## Rows: 2,404
## Columns: 3
## $ Date <dttm> 2007-12-30, 2007-12-31, 2007-12-31, 2007-12-31, 2007-12-31, 200~
## $ B365W <dbl> 1.25, 1.07, 1.66, 1.16, 1.83, 1.33, 2.62, 1.20, 1.36, 1.50, 1.04~
## $ B365L <dbl> 3.75, 7.50, 2.10, 4.50, 1.83, 3.25, 1.44, 4.33, 3.00, 2.50, 9.00~
## Warning in force.type(f): NAs introduced by coercion
## Warning in force.type(f): NAs introduced by coercion
## Warning in read_fun(path = enc2native(normalizePath(path)), sheet_i = sheet, :
## Expecting numeric in L1785 / R1785C12: got 'N/A'
## Warning in read_fun(path = enc2native(normalizePath(path)), sheet_i = sheet, :
## Expecting numeric in N1785 / R1785C14: got 'N/A'
## Warning in read_fun(path = enc2native(normalizePath(path)), sheet_i = sheet, :
## Expecting numeric in L1808 / R1808C12: got 'N/A'
## Warning in read_fun(path = enc2native(normalizePath(path)), sheet_i = sheet, :
## Expecting numeric in N1808 / R1808C14: got 'N/A'
## Warning in read_fun(path = enc2native(normalizePath(path)), sheet_i = sheet, :
## Expecting numeric in L1817 / R1817C12: got 'N/A'
## Warning in read_fun(path = enc2native(normalizePath(path)), sheet_i = sheet, :
## Expecting numeric in N1817 / R1817C14: got 'N/A'
## Warning in read_fun(path = enc2native(normalizePath(path)), sheet_i = sheet, :
## Expecting numeric in M1824 / R1824C13: got 'N/A'
## Warning in read_fun(path = enc2native(normalizePath(path)), sheet_i = sheet, :
## Expecting numeric in O1824 / R1824C15: got 'N/A'
## Warning in read_fun(path = enc2native(normalizePath(path)), sheet_i = sheet, :
## Expecting numeric in L1853 / R1853C12: got 'N/A'
## Warning in read_fun(path = enc2native(normalizePath(path)), sheet_i = sheet, :
## Expecting numeric in N1853 / R1853C14: got 'N/A'
## Warning in read_fun(path = enc2native(normalizePath(path)), sheet_i = sheet, :
## Expecting numeric in L1867 / R1867C12: got 'N/A'
## Warning in read_fun(path = enc2native(normalizePath(path)), sheet_i = sheet, :
## Expecting numeric in N1867 / R1867C14: got 'N/A'
## Warning in read_fun(path = enc2native(normalizePath(path)), sheet_i = sheet, :
## Expecting numeric in M1877 / R1877C13: got 'N/A'
## Warning in read_fun(path = enc2native(normalizePath(path)), sheet_i = sheet, :
## Expecting numeric in O1877 / R1877C15: got 'N/A'
## Warning in read_fun(path = enc2native(normalizePath(path)), sheet_i = sheet, :
## Expecting numeric in L1923 / R1923C12: got 'N/A'
## Warning in read_fun(path = enc2native(normalizePath(path)), sheet_i = sheet, :
## Expecting numeric in N1923 / R1923C14: got 'N/A'
## Warning in read_fun(path = enc2native(normalizePath(path)), sheet_i = sheet, :
## Expecting numeric in L1989 / R1989C12: got 'N/A'
## Warning in read_fun(path = enc2native(normalizePath(path)), sheet_i = sheet, :
## Expecting numeric in N1989 / R1989C14: got 'N/A'
## Warning in read_fun(path = enc2native(normalizePath(path)), sheet_i = sheet, :
## Expecting numeric in L2018 / R2018C12: got 'N/A'
## Warning in read_fun(path = enc2native(normalizePath(path)), sheet_i = sheet, :
## Expecting numeric in N2018 / R2018C14: got 'N/A'
## Warning in read_fun(path = enc2native(normalizePath(path)), sheet_i = sheet, :
## Expecting numeric in L2030 / R2030C12: got 'N/A'
## Warning in read_fun(path = enc2native(normalizePath(path)), sheet_i = sheet, :
## Expecting numeric in N2030 / R2030C14: got 'N/A'
## Warning in read_fun(path = enc2native(normalizePath(path)), sheet_i = sheet, :
## Expecting numeric in L2036 / R2036C12: got 'N/A'
## Warning in read_fun(path = enc2native(normalizePath(path)), sheet_i = sheet, :
## Expecting numeric in N2036 / R2036C14: got 'N/A'
## Warning in read_fun(path = enc2native(normalizePath(path)), sheet_i = sheet, :
## Expecting numeric in L2040 / R2040C12: got 'N/A'
## Warning in read_fun(path = enc2native(normalizePath(path)), sheet_i = sheet, :
## Expecting numeric in N2040 / R2040C14: got 'N/A'
## Warning in read_fun(path = enc2native(normalizePath(path)), sheet_i = sheet, :
## Expecting numeric in L2042 / R2042C12: got 'N/A'
## Warning in read_fun(path = enc2native(normalizePath(path)), sheet_i = sheet, :
## Expecting numeric in N2042 / R2042C14: got 'N/A'
## Rows: 2,433
## Columns: 3
## $ Date <dttm> 2009-01-04, 2009-01-05, 2009-01-05, 2009-01-05, 2009-01-05, 200~
## $ B365W <dbl> 1.400, 1.400, 3.000, 1.380, 1.660, 1.330, 1.660, 1.280, 1.500, 2~
## $ B365L <dbl> 2.75, 2.75, 1.36, 2.87, 2.10, 3.25, 2.10, 3.50, 2.50, 1.50, 6.50~
## Warning in force.type(f): NAs introduced by coercion
## Rows: 2,448
## Columns: 3
## $ Date <dttm> 2010-01-03, 2010-01-04, 2010-01-04, 2010-01-04, 2010-01-04, 201~
## $ B365W <dbl> 1.22, 1.10, 1.28, 2.10, 1.83, 1.25, 1.16, 1.16, 1.50, 1.36, 1.36~
## $ B365L <dbl> 4.00, 6.50, 3.50, 1.66, 1.83, 3.75, 4.50, 4.50, 2.50, 3.00, 3.00~
## Warning in force.type(f): NAs introduced by coercion
## Warning in force.type(f): NAs introduced by coercion
## Expecting numeric in L1596 / R1596C12: got 'N/A'
## Warning in read_fun(path = enc2native(normalizePath(path)), sheet_i = sheet, :
## Expecting numeric in N1596 / R1596C14: got 'N/A'
## Rows: 2,407
## Columns: 3
## $ Date <dttm> 2012-01-02, 2012-01-02, 2012-01-02, 2012-01-02, 2012-01-02, 201~
## $ B365W <dbl> 1.36, 1.33, 1.44, 1.36, 1.18, 1.14, 1.66, 1.16, 1.36, 1.36, 1.30~
## $ B365L <dbl> 3.00, 3.25, 2.62, 3.00, 4.50, 5.50, 2.10, 5.00, 3.00, 3.00, 3.40~
## Warning in force.type(f): NAs introduced by coercion
## Warning in force.type(f): NAs introduced by coercion
## Warning in read_fun(path = enc2native(normalizePath(path)), sheet_i = sheet, :
## Expecting numeric in L1166 / R1166C12: got 'N/A'
## Warning in read_fun(path = enc2native(normalizePath(path)), sheet_i = sheet, :
## Expecting numeric in N1166 / R1166C14: got 'N/A'
## Rows: 2,442
## Columns: 3
## $ Date <dttm> 2012-12-30, 2012-12-31, 2012-12-31, 2012-12-31, 2012-12-31, 201~
## $ B365W <dbl> 2.50, 1.53, 1.83, 2.50, 1.40, 1.57, 1.36, 1.53, 1.83, 1.61, 1.44~
## $ B365L <dbl> 1.50, 2.37, 1.83, 1.50, 2.75, 2.25, 3.00, 2.37, 1.83, 2.20, 2.62~
## Warning in force.type(f): NAs introduced by coercion
## Warning in force.type(f): NAs introduced by coercion
## Rows: 2,476
## Columns: 3
## $ Date <dttm> 2013-12-30, 2013-12-30, 2013-12-30, 2013-12-30, 2013-12-30, 201~
## $ B365W <dbl> 2.20, 2.50, 1.57, 1.20, 1.61, 1.40, 1.16, 2.62, 1.53, 3.50, 3.25~
## $ B365L <dbl> 1.61, 1.50, 2.25, 4.33, 2.20, 2.75, 4.50, 1.44, 2.37, 1.28, 1.33~
## Warning in force.type(f): NAs introduced by coercion
## Rows: 2,500
## Columns: 3
## $ Date <dttm> 2017-01-01, 2017-01-01, 2017-01-02, 2017-01-02, 2017-01-03, 201~
## $ B365W <dbl> 1.40, 1.30, 1.20, 1.10, 2.37, 1.40, 2.50, 1.66, 1.66, 1.72, 1.16~
## $ B365L <dbl> 2.75, 3.40, 4.33, 7.00, 1.53, 2.75, 1.50, 2.10, 2.10, 2.00, 5.00~
## Warning in force.type(f): NAs introduced by coercion
## Columns: 3
## $ Date <dttm> 2021-01-06, 2021-01-06, 2021-01-06, 2021-01-06, 2021-01-06, 202~
## $ B365W <dbl> 1.66, 2.25, 1.22, 1.30, 2.50, 1.57, 1.57, 2.50, 2.10, 1.25, 1.10~
## $ B365L <dbl> 2.10, 1.57, 4.00, 3.40, 1.50, 2.25, 2.25, 1.50, 1.66, 3.75, 7.00~
## Warning in force.type(f): NAs introduced by coercion
part2$Winner <- "B"
tennis.data.relabeled <- bind_rows(part1,part2)
mean(tennis.data.relabeled$Winner=="A")
## [1] 0.4999942
This seems to have done the trick. Compare both datasets using View(tennis.data) and View(tennis.data.relabeled).
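The idea behind the relabeling is that the labels A and B should carry no information about who won. The sketch below illustrates this on made-up matches with a random swap indicator. This is a different technique from the two-part split used above, and the toy data and the object names `matches`, `swap` and `relabeled` are assumptions.

```r
set.seed(1)
n <- 10000
# Hypothetical matches in winner/loser format.
matches <- data.frame(Winner = paste0("w", 1:n), Loser = paste0("l", 1:n),
                      WRank = sample(1:100, n, replace = TRUE),
                      LRank = sample(1:100, n, replace = TRUE),
                      stringsAsFactors = FALSE)

# With probability 0.5 the loser becomes player A, otherwise the winner does.
swap <- runif(n) < 0.5

relabeled <- data.frame(
  PlayerA = ifelse(swap, matches$Loser, matches$Winner),
  PlayerB = ifelse(swap, matches$Winner, matches$Loser),
  RankA   = ifelse(swap, matches$LRank, matches$WRank),
  RankB   = ifelse(swap, matches$WRank, matches$LRank),
  Winner  = ifelse(swap, "B", "A"),
  stringsAsFactors = FALSE
)

mean(relabeled$Winner == "A")   # should be close to 0.5
```

All player-specific columns (ranks, set scores, odds) have to be swapped together with the names, otherwise the labels no longer line up.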
##
## Awarded Completed Disqualified Retired Sched Walkoer
## 1 83274 2 2700 2 1
## Walkover
## 461
tennis.data <- filter(tennis.data.relabeled,Comment=="Completed")
tennis.data <- filter(tennis.data,1/B365A + 1/B365B <=1.15)
lapply(tennis.data,summary)
## $Sex
## Length Class Mode
## 76363 character character
##
## $ATP
## Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
## 1.00 19.00 32.00 32.56 48.00 67.00 31718
##
## $Location
## Length Class Mode
## 76363 character character
##
## $Tournament
## Length Class Mode
## 76363 character character
##
## $Date
## Min. 1st Qu. Median
## "2002-12-30 00:00:00" "2008-08-22 00:00:00" "2012-07-21 00:00:00"
## Mean 3rd Qu. Max.
## "2012-06-27 11:51:29" "2016-06-28 00:00:00" "2021-01-13 00:00:00"
## NA's
## "2"
##
## $Series
## Length Class Mode
## 76363 character character
##
## $Court
## Length Class Mode
## 76363 character character
##
## $Surface
## Length Class Mode
## 76363 character character
##
## $Round
## Length Class Mode
## 76363 character character
##
## $Best.of
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 3.000 3.000 3.000 3.225 3.000 5.000
##
## $PlayerA
## Length Class Mode
## 76363 character character
##
## $PlayerB
## Length Class Mode
## 76363 character character
##
## $RankA
## Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
## 1.00 24.00 52.00 72.87 91.00 2159.00 112
##
## $RankB
## Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
## 1.00 24.00 53.00 73.67 92.00 1890.00 122
##
## $A1
## Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
## 0.000 3.000 6.000 4.834 6.000 7.000 2
##
## $B1
## Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
## 0.000 3.000 6.000 4.833 6.000 7.000 2
##
## $A2
## Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
## 0.000 3.000 6.000 4.759 6.000 7.000 5
##
## $B2
## Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
## 0.00 3.00 6.00 4.77 6.00 7.00 4
##
## $A3
## Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
## 0.00 3.00 6.00 4.81 6.00 16.00 44708
##
## $B3
## Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
## 0.0 3.0 6.0 4.8 6.0 15.0 44708
##
## $A4
## Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
## 0.00 3.00 6.00 4.86 6.00 7.00 72110
##
## $B4
## Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
## 0.00 3.00 6.00 4.86 6.00 7.00 72110
##
## $A5
## Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
## 0.0 3.0 6.0 5.2 6.0 70.0 74739
##
## $B5
## Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
## 0.0 4.0 6.0 5.3 6.0 68.0 74739
##
## $SetsA
## Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
## 0.000 0.000 2.000 1.246 2.000 3.000 2
##
## $SetsB
## Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
## 0.000 0.000 2.000 1.245 2.000 3.000 3
##
## $Comment
## Length Class Mode
## 76363 character character
##
## $B365A
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.971 1.390 1.830 2.624 2.750 67.000
##
## $B365B
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.967 1.400 1.830 2.639 2.870 101.000
##
## $WTA
## Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
## 1.00 15.00 27.00 27.72 42.00 61.00 44645
##
## $Tier
## Length Class Mode
## 76363 character character
##
## $Winner
## Length Class Mode
## 76363 character character
tennis.data$Date <- as.Date(tennis.data$Date)
save(tennis.data,file="tennis.Rda")
The maximum values for the betting quotes are quite high. We should check this later when we transform
them into implied winning probabilities during one of the coming weeks.
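As a preview, and only as a sketch, decimal odds can be turned into implied probabilities by taking reciprocals and rescaling them to sum to one. Proportional rescaling is one common choice and not necessarily the method used later in the course. The odds below are example values, and `booksum`, `prob.a` and `prob.b` are names introduced here.

```r
# Example decimal odds for players A and B in two matches.
odds.a <- c(1.53, 2.87)
odds.b <- c(2.37, 1.36)

# Reciprocals of the odds sum to more than 1: the excess is the bookmaker's margin.
booksum <- 1 / odds.a + 1 / odds.b

# Proportional normalization so that the two probabilities sum to 1.
prob.a <- (1 / odds.a) / booksum
prob.b <- (1 / odds.b) / booksum

data.frame(odds.a, odds.b, booksum, prob.a, prob.b)
```

The quantity `booksum` is the one restricted to at most 1.15 in the filter above. Very high odds correspond to implied probabilities close to zero, which is why the maxima deserve a second look.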