Solutions 1
WINTER 2025
Toby Kenney
Homework Sheet 1
Model Solutions
[Note: all data in this homework are simulated.]
[Note: With many of these problems, there is no “correct” solution. These
model solutions give a range of reasonable approaches, but there are many other
good approaches that could be taken.]
Basic Questions
1. A former colleague has produced the code in the file HW1Q1.R to process the
sports-analytics dataset in the file HW1Q1.txt, before leaving the company.
The code is intended to remove all rows with Balls.remaining equal to
zero. However, it does not work. Explain why the code does not work, and
how to make it work, and how to restructure it in a better way.
Upon running the code, we get the error message “missing value where
TRUE/FALSE needed”. This indicates that cricket.data$Balls.remaining[i]==0
is NA. We can see that this happens when i is 8390, and the corresponding
row is full of NA values. Looking at the new dimensions of the data
frame, we see that it has grown significantly, instead of having rows
removed. The problem comes from the precedence of the : operator, which
binds more tightly than + and -, so an expression such as 1:i-1 is read
as (1:i)-1. The line to remove the bad row should include parentheses.
cricket.data<-rbind(cricket.data[1:(i-1),],cricket.data[(i+1):n,])
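To see the precedence problem in isolation, here is a small sketch with illustrative values of i and n (they are not taken from the dataset, and df is a made-up data frame). It compares the index vectors with and without parentheses, and shows that indexing past the last row produces rows of NA.

```r
# Illustrative values only; the real i and n come from the data.
i <- 4
n <- 6

wrong.head <- 1:i-1       # parsed as (1:i)-1, giving 0 1 2 3
right.head <- 1:(i-1)     # 1 2 3
wrong.tail <- i+1:n       # parsed as i+(1:n), giving 5 6 7 8 9 10
right.tail <- (i+1):n     # 5 6

# Row indices beyond the end of a data frame return rows of NA,
# so the rebuilt data frame is longer than the original.
df <- data.frame(x=1:n)
grown <- rbind(df[wrong.head,,drop=FALSE],df[wrong.tail,,drop=FALSE])
nrow(grown)
```

With these values the unparenthesised version returns more rows than it started with, which matches the growth described above.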
After making this change, we still get the same error at the same value of
i. The dimension of the data frame is now almost correct, but
the last several rows of the data frame are all NA. This is because the for
loop evaluates its sequence of indices once, at the start, and runs over all
of those indices even after the data frame has become shorter. This code can
be fixed by running the loop backwards.
cricket.data<-read.table("HW1Q1.txt")
n<-dim(cricket.data)[1]
Note that we have also added the line to recalculate n every time a line is
deleted.
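The effect of the loop direction can be seen on a small made-up vector standing in for the Balls.remaining column (balls and its values are assumptions for illustration). This is only a sketch of the idea, not the homework code: the forward loop keeps visiting the original indices and skips an element after each deletion, while the backward loop removes every zero.

```r
# Made-up values standing in for the Balls.remaining column.
balls <- c(3, 0, 0, 5, 0, 2)

# Forward loop: the sequence is evaluated once, before the first iteration,
# so deleting elements inside the loop does not shorten the loop.
visited <- integer(0)
x <- balls
for (i in seq_along(x)) {
  visited <- c(visited, i)
  # Guard against indices that no longer exist after deletions.
  if (i <= length(x) && x[i] == 0) x <- x[-i]
}

# Backward loop: deleting element i never moves the elements still to visit.
y <- balls
for (i in rev(seq_along(y))) {
  if (y[i] == 0) y <- y[-i]
}
```

Here the forward loop still visits all six original indices and leaves one zero behind, whereas the backward loop leaves only the non-zero values.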
This code successfully removes the lines with Balls.remaining==0 in
this dataset. However, it still has a bug. If the first or last row had
Balls.remaining==0, then the code would not work. To make this more
robust, we should use seq_len, and we can use negative indices in the
second subset.
cricket.data<-read.table("HW1Q1.txt")
n<-dim(cricket.data)[1]
This fixes the bug, and produces working code. The code is however
inefficient and it is easy to introduce bugs when modifying it. There are
several better ways. The simplest is to directly use the subset operation
to select the desired elements.
cricket.data<-read.table("HW1Q1.txt")
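The listing is cut short here, so the following is a hedged sketch of what direct subsetting might look like, using a made-up data frame. The names toy.data, cleaned.data and cleaned.alt are illustrative, and dropping rows where Balls.remaining is NA is an assumption of this sketch rather than something the question specifies.

```r
# Made-up stand-in for the cricket data; only the column name follows the text.
toy.data <- data.frame(Balls.remaining=c(12,0,7,NA,0,30),
                       Runs=c(4,1,0,2,6,1))

# Keep the rows whose Balls.remaining is non-zero. which() discards
# comparisons that evaluate to NA; a bare logical index would instead
# return a row of NAs for each of them.
cleaned.data <- toy.data[which(toy.data$Balls.remaining!=0),]

# subset() treats NA conditions as FALSE, so it selects the same rows.
cleaned.alt <- subset(toy.data,Balls.remaining!=0)
```

Storing the result under a new name keeps toy.data available for comparison.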
Alternatively, we can use the dplyr package and its filter command.
cricket.data<-read.table("HW1Q1.txt")
Both the improved solutions use a new variable for the cleaned data. This
is not essential, but is usually a good practice. It allows for easier debug-
ging, as the original data is still available for comparison.
If we insist on using the loop to remove individual rows one at a time,
the code can be improved by directly creating a vector of the rows to
be removed. As in the previous code, running the loop in reverse avoids
problems with the rows being renumbered. It is still somewhat inefficient
as every removed row needs the table to be recopied.
cricket.data<-read.table("HW1Q1.txt")
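Only the first line of this listing survives, so here is a minimal sketch of the approach on a made-up data frame (toy.data and to.remove are illustrative names). It collects the row numbers first, then removes them one at a time in reverse order.

```r
# Made-up data in which the first and last rows both need removing.
toy.data <- data.frame(Balls.remaining=c(0,12,0,7,0),
                       Runs=c(4,1,0,2,6))

# Vector of the rows to be removed.
to.remove <- which(toy.data$Balls.remaining==0)

# Removing from the bottom up means earlier row numbers stay valid.
# If to.remove is empty, the loop body never runs.
for (i in rev(to.remove)) {
  toy.data <- toy.data[-i,,drop=FALSE]
}
```

Each pass through the loop still copies the table, which is the inefficiency noted above.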
Variable                    Meaning
Living.With                 The child’s household status: one of “Both”, “Mother”, “Father”, “Joint custody”, “Foster”, “Other”
Family.Income               The combined annual household income
Age                         The child’s age
Siblings                    The number of siblings the child has
Discipline.strict.rules     The extent to which the caregivers strictly enforce rules
Discipline.punishment       The extent to which the caregivers use punishment for misbehaviour
Parent.attention            The average number of hours per week that the caregivers spend with the child
Freedom                     The extent to which the child is allowed to act without supervision
Health.index                An index summarising the child’s overall physical health
Programmes                  The average number of hours per week that the child spends in extracurricular programmes
Screentime                  The average number of hours per week that the child spends using electronic devices
Friends                     The number of friends the child has
School.Grades.Mathematics   The child’s average grade in school mathematics
School.Grades.English       The child’s average grade in school English
Depression.Score            A summary of various psychological surveys assessing the child’s susceptibility to depression
Depression.data<-read.table("HW1Q2.txt")
library(dplyr) # for filter, mutate and select.
Screentime.pval<-rep(NA,4)
cut.offs<-c(5,10,20,30)
## We try four cut-off values for screentime.
for(i in seq_along(cut.offs)){
  cut.off<-cut.offs[i]
  temp.data<-Depression.study.data%>%
    mutate(screen.excess=Screentime>cut.off)%>%select(-c("Screentime"))
  ## We convert the Screentime to an indicator variable.
  model<-lm(Depression.Score~.,data=temp.data) # fit a linear model
  model.sum<-summary(model)
  Screentime.pval[i]<-model.sum$coefficients["screen.excessTRUE",4]
  ## extract the p-value for the coefficient of screen.excessTRUE
  ## (Naming conventions may be different on other systems, so you
  ## may need to modify this code).
}
best.cut.off<-cut.offs[Screentime.pval==min(Screentime.pval)]
### Select the cut-off that results in the smallest p-value.
3. A scientist is studying crop growth. Their research assistant was analysing
the data in the file HW1Q3.txt, and wrote the code in the file HW1Q3.R to
process the data, before leaving suddenly. Upon reviewing the code, the
scientist discovers that the code does not work. Fix the code.
There are several problems with the code. If we attempt to run the code,
we get an error message and two warnings:
Warning messages:
1: In distance[median, ] > cutoff :
longer object length is not a multiple of shorter object length
2: In distance[median, ] > cutoff :
longer object length is not a multiple of shorter object length
Error in outliers[[i]]$x : $ operator is invalid for atomic
vectors
The error messages are typically cryptic. However, examining the outliers
variable, we see that it consists of a single pair of variables x and y, instead
of a list of 10 such pairs — one for each cut-off.
[The error arises because outliers[[1]] refers to outliers$x, which is
a matrix, and the $ operator cannot be applied to a matrix or any other
atomic vector.]
The problem then becomes clear — the function get.outliers does not
accept a vector of values for cut.off, whereas the later code assumes
that it will. We can modify the function so that it can take a vector. [An
alternative approach would be to call the function from the loop. This is
slightly simpler, but if there is a danger that future users might expect it
to handle vectors of cut-offs, then fixing the function is a better solution.]
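The two warnings can be reproduced on a small scale. The sketch below uses made-up distances and cut-offs (d, cut.offs, masks and counts are illustrative names, not objects from HW1Q3.R). Comparing a vector with a vector of a different length recycles the shorter one element by element, whereas comparing against one cut-off at a time gives one logical mask per cut-off.

```r
d <- c(6.05, 6.25, 6.45, 6.65, 6.85, 7.05, 7.25)   # made-up distances
cut.offs <- c(6.2, 6.6)                            # made-up cut-offs

# Recycled comparison: successive elements of d are compared with the
# first and second cut-off in turn, and R warns because 7 is not a
# multiple of 2. The warning is suppressed here to keep the output clean.
recycled <- suppressWarnings(d > cut.offs)

# One mask per cut-off, which is the structure the later code expects.
masks <- lapply(cut.offs, function(co) d > co)
counts <- vapply(masks, sum, integer(1))
```

The recycled result is a single vector of the same length as d, so it cannot represent a separate set of outliers for each cut-off.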
Plant.data<-read.table("HW1Q3.txt")
get.outliers<-function(x,y,distance,median,cutoff){
  ## This function extracts all observations further from the median than the
  ## cut-off value, using the distance provided.
  ## If cutoff is a vector, then the function returns a list of the
  ## outliers for each cut-off value
  answer<-vector("list",length(cutoff))
  for(i in seq_along(cutoff)){
    answer[[i]]<-list("x"=x[distance[median,]>cutoff[i],],
                      "y"=y[distance[median,]>cutoff[i]])
  }
  return(answer)
}
library(glmnet) # for LASSO
X<-model.matrix(yield~.,data=Plant.data)
# This creates a matrix of predictors to be used in LASSO, converting
# categorical variables to indicators.
coeffs<-LASSO$glmnet.fit$beta[,LASSO$lambda.1se]
### Using one standard deviation above the smallest is common practice
### to ensure a sparse model
### Extract the outliers at 10 cut-off values
cut.offs<-6+seq_len(10)/10
outliers<-get.outliers(X,Plant.data$yield,distance,median,cut.offs)
Plant.data<-read.table("HW1Q3.txt")
get.outliers<-function(x,y,distance,median,cutoff){
  ## This function extracts all observations further from the median than the
  ## cut-off value, using the distance provided.
  ## If cutoff is a vector, then the function returns a list of the
  ## outliers for each cut-off value
  answer<-vector("list",length(cutoff))
  for(i in seq_along(cutoff)){
    answer[[i]]<-list("x"=x[distance[median,]>cutoff[i],,drop=FALSE],
                      "y"=y[distance[median,]>cutoff[i]])
  }
  return(answer)
}
library(glmnet) # for LASSO
X<-model.matrix(yield~.,data=Plant.data)
# This creates a matrix of predictors to be used in LASSO, converting
# categorical variables to indicators.
coeffs<-LASSO$glmnet.fit$beta[,LASSO$lambda.1se]
### Using one standard deviation above the smallest is common practice
### to ensure a sparse model
### Extract the outliers at 10 cut-off values
cut.offs<-6+seq_len(10)/10
outliers<-get.outliers(X,Plant.data$yield,distance,median,cut.offs)
Plant.data<-read.table("HW1Q3.txt")
get.outliers<-function(x,y,distance,median,cutoff){
  ## This function extracts all observations further from the median than the
  ## cut-off value, using the distance provided.
  ## If cutoff is a vector, then the function returns a list of the
  ## outliers for each cut-off value
  answer<-vector("list",length(cutoff))
  for(i in seq_along(cutoff)){
    answer[[i]]<-list("x"=x[distance[median,]>cutoff[i],,drop=FALSE],
                      "y"=y[distance[median,]>cutoff[i]])
  }
  return(answer)
}
library(glmnet) # for LASSO
X<-model.matrix(yield~.,data=Plant.data)
# This creates a matrix of predictors to be used in LASSO, converting
# categorical variables to indicators.
### Extract the outliers at 10 cut-off values
cut.offs<-6+seq_len(10)/10
outliers<-get.outliers(X,Plant.data$yield,distance,median,cut.offs)
create.folds<-function(nf,length){
  ### Make folds for cross-validation.
  ### The caret package has better functions to implement this,
  ### but we needed to avoid the dependency.
  nfold<-nf
  Folds<-sample(seq_len(nfold),length,replace=TRUE,prob=rep(1,nfold)/nfold)
  #This variable is local.
  return(Folds)
}
election.data<-read.table("HW1Q4.txt")
n<-dim(election.data)[1]
### Prepare a table for the results:
Folds<-data.frame("folds"=c(2,3,4,5,10,20),"MSE"=rep(NA,6))
make.functions<-function(){
  ## These variables are local to the make.functions function
  ## Thus, only functions defined inside this function can access them.
  nfold<-0
  Folds<-NULL
  create.folds<-function(nf,length){
    ### Make folds for cross-validation.
    ### The caret package has better functions to implement this,
    ### but we needed to avoid the dependency.
    nfold<<-nf
    Folds<<-sample(seq_len(nfold),length,replace=TRUE,prob=rep(1,nfold)/nfold)
  return(list("create.folds"=create.folds,"cross.validate"=cross.validate))
}
election.data<-read.table("HW1Q4.txt")
n<-dim(election.data)[1]
5. The code in the file HW1Q5.R is a script for processing a reinsurance com-
pany’s contract records. Improve the code to make it more reusable and
less error-prone.
Examining the code, we identify several issues that cause the code not to
be reusable or robust.
• The exchange rates are hard-coded into the code. If an exchange rate
changes, it will need to be updated in multiple places. If any one is
missed, it will cause errors.
• The code is almost the same for all cases, but is repeated in each case.
Any update to the methods needs to be made in every branch,
leading to the possibility of mistakes. The code should be redesigned
to use a single piece of code, either by using a function or by setting
up variables beforehand. Indeed, we can see a mistake in
the code: the policy limit is not checked for Quota-sharing
contracts in Germany. This is almost certainly a mistake caused by
the bad code design.
• The branching code assumes that all entries are from the available set
of entries, and are correctly input. If any entry is misformatted or if
a new country or contract type is added, the entry will be incorrectly
processed. The code should check for this and produce an error.
• The filename automatically uses the current date. It may be nec-
essary to process records from a previous day. The code should be
modified to allow that, probably using a function.
• The main loop over transactions uses the : operator. If the transac-
tion list is empty, it will cause an error.
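The last point can be checked directly. The sketch below uses a made-up empty transaction table (transactions, bad.index and good.index are illustrative names) to show what the : operator produces when there are no rows, compared with seq_len.

```r
# An empty, made-up transaction list.
transactions <- data.frame(loss=numeric(0))
n <- nrow(transactions)

bad.index <- 1:n            # counts down from 1 to 0, so it has two elements
good.index <- seq_len(n)    # integer(0), so a loop over it never runs

runs <- 0
for (i in good.index) runs <- runs + 1
```

A loop over bad.index would try to process rows 1 and 0 of an empty table, which is where the error comes from.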
We first make a more general function to open the data. The function
defaults to using today's date, but that can be overridden, or the full
filename can be given.
get.contracts<-function(dat=as.character(
                          as.Date(
                            date(),
                            format="%a %b %d %H:%M:%S %Y")),
                        filename=NULL){
  ### By default, this loads today's contracts, but another date can be provided.
  ### A full filename can be provided to override the default.
  if(is.null(filename)){
    ### if filename not given, use the default for the given date.
    return(read.table(paste("Contracts",dat,".txt",sep="")))
  } else {
    return(read.table(filename))
  }
}
contracts<-get.contracts()
Rather than using if statements to select the exchange rate, we can create
a function to look up the exchange rate from a table. Each lookup table is
built from two vectors, and we create one table for country and one for currency.
We could create a single table to give the exchange rate for each coun-
try. However, this would include the Euro exchange rate twice, making
it possible that it could be updated incorrectly. Alternatively, it might
be desirable to have different rates, even for countries that use the same
currency if the transactions are processed at different times.
### Easy to update list of all countries
### With relevant information
exchange_rate_data<-list(
  "countries"=c("Canada","USA","France","Germany","China"),
  "currencies"=c("CAD","USD","EUR","EUR","RMB"),
  "currency_list"=c("CAD","USD","EUR","RMB"),
  "exchange_rates"=c(1,1.24032,1.64720,1.80394,0.189401)
)
  if(country%in%data$countries){
    country.no=which(country==data$countries)[1]
    ### If multiple matches, this takes the first
    currency=data$currencies[country.no]
  } else {
    stop(paste("Country \"",country,"\" not in database.",sep=""))
  }
  if(currency%in%data$currency_list){
    currency.no=which(currency==data$currency_list)[1]
    return(data$exchange_rates[currency.no])
  } else {
    stop(paste("Currency \"",currency,"\" not in database.",sep=""))
  }
}
day.records<-NULL
  if(record$contract=="Excess of Loss"){
    record$claim=record$loss-record$attachment
  } else if(record$contract=="Quota Share"){
    record$claim=record$loss*record$percentage
  } else if(record$contract=="Catastrophe Cover"){
    record$claim=record$catastrophe.loss-record$attachment
  } else {
    ## Give a clear error message
    stop(paste("Contract type \"",record$contract,"\" not known.",sep=""))
    ## It is important to put quotations around the error, as
    ## trailing spaces can cause errors.
  }
  ### These lines appear to be almost the same in all cases
  ### It is better to put them only once.
  if(record$claim<0){
    record$claim=0
  }
  if(record$claim>record$limit){
    record$claim=record$limit
  }
  record$profit=record$premium-record$claim
  if(record$contract=="Quota Share"){
    record$profit=record$profit-record$ceding.commission
  }
  record$converted.profit<-record$profit*exch.rate
}
For the exchange rate lookup table, we can use the names attribute to
create slightly neater lookup tables and functions.
### Use names attribute to make better lookup tables
exchange_rate_data<-list(
  "currency"=currency.lookup.table,
  "exchange_rates"=exchange.rate.lookup.table
)
  currency<-data$currency[country]
  ## Indexing by a name that is not in the table gives NA, not NULL.
  if(is.na(currency)){
    stop(paste("Country \"",country,"\" not in database.",sep=""))
  }
  rate<-data$exchange_rates[currency]
  if(is.na(rate)){
    stop(paste("Currency \"",currency,"\" not in database.",sep=""))
  }
  return(rate)
}
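The table definitions are not shown in this listing, so here is a hedged sketch of how named vectors can serve as lookup tables. The table contents, the rates and the name lookup.rate are made up for illustration. One point worth checking in any such function is that indexing a named vector by a name it does not contain returns NA rather than NULL, so this sketch tests with is.na.

```r
# Made-up tables; the rates are placeholders, not real exchange rates.
currency.table <- c(Canada="CAD", USA="USD", France="EUR", Germany="EUR")
rate.table <- c(CAD=1, USD=1.25, EUR=1.5)

# Hypothetical helper: country -> currency -> rate, for a single country.
lookup.rate <- function(country, currency.table, rate.table) {
  curr <- currency.table[country]
  if (is.na(curr)) {
    stop(paste("Country \"", country, "\" not in database.", sep=""))
  }
  rate <- rate.table[curr]
  if (is.na(rate)) {
    stop(paste("Currency \"", curr, "\" not in database.", sep=""))
  }
  unname(rate)   # drop the currency name from the returned value
}

france.rate <- lookup.rate("France", currency.table, rate.table)
```

Because France and Germany both map to EUR, the Euro rate is stored only once in rate.table.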
One slightly awkward problem with the exchange rate lookup function is
that we need to either use the exchange rate data as a global variable, or
pass it as a parameter for every call to get.rate.
A more advanced solution uses something called a “closure” to build the
fixed values into the get.rate function.
### Use a closure to fix constant parameters in a function.
### We do this by using one function that returns a new function.
  return(function(country){
    ### This only works for a single country.
    ### Would need to be modified to handle a vector.
    curr<-currency[country]
    ## Indexing by a name that is not in the table gives NA, not NULL.
    if(is.na(curr)){
      stop(paste("Country \"",country,"\" not in database.",sep=""))
    }
    rate<-exchange_rates[curr]
    if(is.na(rate)){
      stop(paste("Currency \"",curr,"\" not in database.",sep=""))
    }
    return(rate)
  })
}
In this code, the look-up tables are built into the get.rate function when it
is created. Now the look-up tables are fixed. By calling make.rate.lookup
with different look-up tables, we could change them, but this creates a new
function.
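As a compact illustration of this behaviour, the sketch below builds two lookup functions from different made-up tables. The names make.rate.lookup and get.rate follow the text, while get.rate.new, the table contents and the simplified body are assumptions for the example.

```r
# A function that returns a function. The returned function keeps access
# to the currency and exchange.rates arguments it was created with.
make.rate.lookup <- function(currency, exchange.rates) {
  function(country) {
    curr <- currency[country]
    if (is.na(curr)) {
      stop(paste("Country \"", country, "\" not in database.", sep=""))
    }
    unname(exchange.rates[curr])
  }
}

# Made-up tables and placeholder rates.
countries <- c(Canada="CAD", France="EUR")
get.rate <- make.rate.lookup(countries, c(CAD=1, EUR=1.5))

# New tables give a new function; get.rate itself is unchanged.
get.rate.new <- make.rate.lookup(countries, c(CAD=1, EUR=1.6))

old.rate <- get.rate("France")
new.rate <- get.rate.new("France")
```

Neither function needs the tables as a global variable or as an extra argument on each call.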
Another approach we could take is to make a function to calculate the
claim for the different contract types so the main loop is shorter. This
does have the advantage of separating the code that deals with these types,
making it easier to add a new contract type or some similar modification.
get.claim.amount<-function(record){
  if(record$contract=="Excess of Loss"){
    ans<-record$loss-record$attachment
  } else if(record$contract=="Quota Share"){
    ans<-record$loss*record$percentage
  } else if(record$contract=="Catastrophe Cover"){
    ans<-record$catastrophe.loss-record$attachment
  } else {
    ## Give a clear error message
    stop(paste("Contract type \"",record$contract,"\" not known.",sep=""))
  }
  if(ans<0){
    ans=0
  }
  if(ans>record$limit){
    ans=record$limit
  }
  return(ans)
}
day.records<-NULL
  record$claim<-get.claim.amount(record)
  record$profit=record$premium-record$claim
  if(record$contract=="Quota Share"){
    record$profit=record$profit-record$ceding.commission
  }
  record$converted.profit=record$profit*exch.rate
6. A data scientist has produced the code in the file HW1Q6.R to process a
company’s data. Testing the code on a small subset of the data, she finds
that it takes 7 hours to process a dataset with 200,000 records, each with
400 predictors.
(a) Approximately how long would the program be expected to take for the
company’s whole database of 4,000,000 records with 1,200 variables each?
We see that the code has two nested loops, both of which concatenate
lists. Concatenating lists of length n is $O(n)$ complexity, so building a
list of length n is $O(n^2)$ complexity. Thus if n is the number of records,
and p is the number of predictors, then the complexity of this program
is $O(n^2 p^2)$. Thus, if we have 20 times as many records, and 3 times as
many predictors, then program execution will take $20^2 \times 3^2 = 3600$ times
as long, or 25200 hours, which is 1050 days.
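The arithmetic can be laid out step by step. This is only a check of the scaling argument above, using the figures given in the question; the variable names are illustrative.

```r
base.hours <- 7                        # time for the test subset
record.factor <- 4000000/200000        # 20 times as many records
variable.factor <- 1200/400            # 3 times as many variables

# Quadratic scaling in both the number of records and of variables.
scale <- record.factor^2 * variable.factor^2   # 400 * 9 = 3600

total.hours <- base.hours*scale        # 25200
total.days <- total.hours/24           # 1050
total.years <- total.days/365          # a little under 3
```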
(b) Management deems the time required unacceptable. Rewrite the code
to run more efficiently for big datasets.
Apparently three years is too long. To make the code faster, we can
reorganise the code to create the lists first.
records<-vector("list",num_records)
for(i in seq_len(num_records)){
  formatted.record<-rep(NA,num_variables)
  for(j in seq_len(num_variables)){
    rec<-record[i,j]
    rec<-strsplit(rec,"-")[[1]]
    rec<-mean(as.numeric(rec)) #take mean of range
    formatted.record[j]<-rec #add to the list
  }
  records[[i]]<-formatted.record
}
This should be easily fast enough. However, the code can be made even
more efficient using vectorisation — the strsplit function can take a
vector as input, and will return a list of outputs. We can use the lapply
function to more efficiently process this.
records<-vector("list",num_records)
for(i in seq_len(num_records)){
  records[[i]]<-unlist(
    # lapply produces a list of vectors of length 1. unlist turns
    # it into a single vector as required.
    lapply( # When the input to strsplit is a vector, it returns a list
      strsplit(record[i,],"-"),
      function(x){mean(as.numeric(x))}
    ))
}
This has the same complexity O(np), but a slightly faster running time.
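To make the vectorised version concrete, here is a small runnable sketch on made-up data. It assumes record is a character matrix whose entries are ranges written as "low-high" with non-negative bounds. The helper range.mean is a hypothetical name, and vapply is swapped in for unlist(lapply()) because it returns a numeric vector directly.

```r
# Made-up records: two records with two variables each.
record <- matrix(c("1-3", "10-20",
                   "4-6", "0-2"), nrow=2, byrow=TRUE)

# Hypothetical helper: mean of the two ends of a split range.
range.mean <- function(x) mean(as.numeric(x))

# Allocate the output list once, then fill it in place.
records <- vector("list", nrow(record))
for (i in seq_len(nrow(record))) {
  # strsplit on a whole row returns one character vector per entry.
  records[[i]] <- vapply(strsplit(record[i,], "-"), range.mean, numeric(1))
}
```

Ranges with negative bounds would need a different separator or a regular expression, since they also contain the "-" character.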
7. The file HW1Q7.txt contains data from an entertainment company about
electricity usage. The data are not formatted in a very convenient way.
Read the data into R and reformat into a more convenient way, and use
it to create a plot showing electricity used per hour (y-axis) vs number of
people (x-axis) with colour showing age group and size showing company
size, with a facet grid of type of event versus time of day. Make a list of
all corrections made to the data.
customers$ID<-as.integer(customers$ID)
str(customers)
summary(customers)
### Numerical values all look OK. Next we check the categorical
### variable "Sector".
table(customers$Sector)
### Merge "Live performance" and "Live Performance"
customers$Sector[customers$Sector=="Live performace"]<-"Live Performance"
### Now convert to factor
customers$Sector<-as.factor(customers$Sector)
events<-read.table("HW1Q7.txt",skip=57)
str(events)
### We note that Number.of.People is character, when it should be numeric.
### This indicates there are probably some errors that need to be fixed.
table(events$Time.of.day)
### "Late" probably means the same as "Late Evening".
events$Time.of.day[events$Time.of.day=="Late"]<-"Late Evening"
summary(events)
### Change strings to factors
events$Event.type<-as.factor(events$Event.type)
events$Age.group<-as.factor(events$Age.group)
events$Time.of.day<-as.factor(events$Time.of.day)
library(dplyr)
library(ggplot2)
ggplot(data=events.full,
       mapping=aes(x=Number.of.People,
                   y=Electricity.Usage,
                   colour=Age.group,
                   size=Size))+
  geom_point()+
  facet_grid(Event.type~Time.of.day)+
  theme(plot.title=element_text(size=20,hjust=0.5),
        axis.title=element_text(size=20,hjust=0.5),
        axis.text=element_text(size=16),
        legend.title=element_text(size=20,hjust=0.5),
        legend.text=element_text(size=16),
        strip.text=element_text(size=16))
[Figure: scatter plot of Electricity.Usage (y-axis) against Number.of.People
(x-axis), with colour showing Age.group (Adult, All, Child, Senior, Young
Adult, Young Child) and point size showing Size (50, 100, 150). The facet
grid has one row per event type (Activity, Film, Live Performance, Music,
Sport, Video game) and one column per time of day.]
Events
Age.group
• Events 185, 612 and 616 have age group “Adlut”, which is
almost certainly a misspelling of “Adult”.
• Events 14, 85, 184, 275, 300, 726 and 775 have age group
“YoungChild”, which should be “Young Child”.
Time.of.day
• Events 97, 214 and 457 have time of day “Late” which is
presumably the same as “Late Evening”.
Number.of.People
• Events 83 and 134 have commas in the number of people,
causing them to be interpreted as character.
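Corrections of this kind might be applied as in the sketch below. The vectors are made-up examples in the style of the errors listed, not the actual events data, and the names people, age.group and fixes are illustrative.

```r
# Made-up values showing the two kinds of error.
people <- c("1200", "3,500", "800", "12,000")
age.group <- c("Adult", "Adlut", "YoungChild", "Young Child")

# Remove thousands separators before converting to numeric.
people.clean <- as.numeric(gsub(",", "", people, fixed=TRUE))

# Map each misspelt level to its corrected form with a named vector.
fixes <- c("Adlut"="Adult", "YoungChild"="Young Child")
bad <- age.group %in% names(fixes)
age.group[bad] <- unname(fixes[age.group[bad]])

# Convert to a factor only after the spellings have been merged.
age.group <- factor(age.group)
```

Keeping the replacements in one named vector also gives a ready-made record of the corrections made.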