ACSC/STAT 3740, Predictive Analytics

WINTER 2025
Toby Kenney
Homework Sheet 1
Model Solutions
[Note: all data in this homework are simulated.]
[Note: With many of these problems, there is no “correct” solution. These
model solutions give a range of reasonable approaches, but there are many other
good approaches that could be taken.]

Basic Questions
1. A former colleague has produced the code in the file HW1Q1.R to process the
sports-analytics dataset in the file HW1Q1.txt, before leaving the company.
The code is intended to remove all rows with Balls.remaining equal to
zero. However, it does not work. Explain why the code does not work, and
how to make it work, and how to restructure it in a better way.

Upon running the code, we get the error message “missing value where
TRUE/FALSE needed”. This indicates that cricket.data$Balls.remaining[i]==0
is NA. We can see that this happens when i is 8390, and the corresponding
row is full of NA values. Looking at the new dimensions of the data
frame, we see that it has grown significantly, instead of having rows
removed. The problem comes from the precedence of the : operator, which
binds more tightly than + and -, so an expression such as 1:i-1 is
evaluated as (1:i)-1. The line to remove the bad row should include
parentheses.
cricket.data<-rbind(cricket.data[1:(i-1),],cricket.data[(i+1):n,])
After making this change, we still get the same error at the same value of
i. However, the dimension of the data frame is now almost correct, but
the last several rows of the data frame are all NA. This is because the for
loop sets the indices at the start, and always runs for those indices, even
after the data frame is shorter. This code can be fixed by running the
loop backwards.

cricket.data<-read.table("HW1Q1.txt")

n<-dim(cricket.data)[1]

### Remove entries with zero balls remaining, as they must be mistakes.

for(i in n:1){
  if(cricket.data$Balls.remaining[i]==0){
    n<-dim(cricket.data)[1]
    cricket.data<-rbind(cricket.data[1:(i-1),],cricket.data[(i+1):n,])
  }
}

Note that we have also added the line to recalculate n every time a line is
deleted.
This code successfully removes the lines with Balls.remaining==0 in
this dataset. However, it still has a bug. If the first or last row had
Balls.remaining==0, then the code would not work, because 1:0 and (n+1):n
count downwards instead of giving empty sequences. To make this more
robust, we should use seq_len, and we can use negative indices in the
second subset.

cricket.data<-read.table("HW1Q1.txt")

n<-dim(cricket.data)[1]

### Remove entries with zero balls remaining, as they must be mistakes.

for(i in n:1){
  if(cricket.data$Balls.remaining[i]==0){
    cricket.data<-rbind(cricket.data[seq_len(i-1),],cricket.data[-seq_len(i),])
  }
}

This fixes the bug, and produces working code. The code is however
inefficient and it is easy to introduce bugs when modifying it. There are
several better ways. The simplest is to directly use the subset operation
to select the desired elements.

cricket.data<-read.table("HW1Q1.txt")

cricket.data.good<-cricket.data[cricket.data$Balls.remaining>0,]

Alternatively, we can use the dplyr package and its filter command.

library(dplyr) # for filter and %>%

cricket.data<-read.table("HW1Q1.txt")

cricket.data.good<-cricket.data%>%filter(Balls.remaining>0)

Both the improved solutions use a new variable for the cleaned data. This
is not essential, but is usually a good practice. It allows for easier
debugging, as the original data is still available for comparison.
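As a small illustration of the subsetting approach, the sketch below uses a made-up five-row data frame (`toy` and its values are hypothetical, not taken from HW1Q1.txt) and keeps only the rows with a positive number of balls remaining. Wrapping the condition in `which()` is an optional extra, not part of the solutions above, that also drops rows where the comparison is NA.

```r
# Hypothetical toy data standing in for the cricket data set.
toy <- data.frame(Over = 1:5,
                  Balls.remaining = c(12, 0, 6, NA, 0))

# Logical subsetting keeps rows where the condition is TRUE,
# but a row whose condition is NA comes back as a row of NAs.
kept.logical <- toy[toy$Balls.remaining > 0, ]

# which() returns only the positions where the condition is TRUE,
# so rows with a missing Balls.remaining are dropped as well.
kept.which <- toy[which(toy$Balls.remaining > 0), ]

# The original data frame is untouched and still available for comparison.
print(nrow(toy))
print(kept.which)
```

Whether rows with a missing value should be kept or dropped depends on the analysis, so the choice between the two forms is a judgment call.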
If we insist on using the loop to remove individual rows one at a time,
the code can be improved by directly creating a vector of the rows to
be removed. As in the previous code, running the loop in reverse avoids
problems with the rows being renumbered. It is still somewhat inefficient
as every removed row needs the table to be recopied.

cricket.data<-read.table("HW1Q1.txt")

rows.to.remove<-which(cricket.data$Balls.remaining==0)

### Remove entries with zero balls remaining, as they must be mistakes.

for(i in rev(rows.to.remove)){
  cricket.data<-rbind(cricket.data[seq_len(i-1),],cricket.data[-seq_len(i),])
}
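
The two pitfalls discussed in this question can be reproduced on small vectors. The sketch below is only an illustration (the values of `i` and `x` are arbitrary): it shows that `:` binds more tightly than subtraction, and that `1:0` counts down rather than giving an empty sequence, which is why `seq_len` is the safer choice.

```r
i <- 4

# ':' has higher precedence than '-', so this is (1:i) - 1.
wrong <- 1:i-1      # 0 1 2 3
right <- 1:(i-1)    # 1 2 3

# When i is 1, 1:(i-1) is 1:0, which counts down instead of being empty.
i <- 1
counts.down <- 1:(i-1)   # 1 0
empty <- seq_len(i-1)    # integer(0)

# Negative indices drop elements; -seq_len(2) drops the first two elements.
x <- c(10, 20, 30, 40)
rest <- x[-seq_len(2)]   # 30 40
```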

2. A government worker is investigating the effect of various parenting
techniques on children’s mental health. He has used the code in the file
HW1Q2.R to process the data in the file HW1Q2.txt. Add comments to the
code to make it easier to follow.
The variables in the data set are explained in the following table:

Variable                     Meaning
Living.With                  The child’s household status: one of “Both”, “Mother”, “Father”, “Joint custody”, “Foster”, “Other”
Family.Income                The combined annual household income
Age                          The child’s age
Siblings                     The number of siblings the child has
Discipline.strict.rules      The extent to which the caregivers strictly enforce rules
Discipline.punishment        The extent to which the caregivers use punishment for misbehaviour
Parent.attention             The average number of hours per week that the caregivers spend with the child
Freedom                      The extent to which the child is allowed to act without supervision
Health.index                 An index summarising the child’s overall physical health
Programmes                   The average number of hours per week that the child spends in extracurricular programmes
Screentime                   The average number of hours per week that the child spends using electronic devices
Friends                      The number of friends the child has
School.Grades.Mathematics    The child’s average grade in school mathematics
School.Grades.English        The child’s average grade in school English
Depression.Score             A summary of various psychological surveys assessing the child’s susceptibility to depression

Here is one way the file could be commented.

Depression.data<-read.table("HW1Q2.txt")
library(dplyr) # for filter, mutate and select.

Depression.study.data<-Depression.data%>%
  filter(Age>10,
         ## The study focuses on children aged 11 and up
         Living.With%in%c("Both","Mother","Father","Joint Custody"),
         ## Only children living with their natural parents.
         Health.index>70) # Exclude children with very poor health.
Depression.study.data<-Depression.study.data%>%
  mutate(
    only.child=Siblings==0,
    ## to simplify analysis, we distinguish between only children
    ## and children with siblings.
    income.group=cut(Family.Income,
                     breaks=c(0,25000,40000,100000,Inf),
                     labels=c("poor","low-income","middle-income","high-income"))
    ## We divide income into 4 groups to deal with the
    ## heavy-tailed distribution.
  )%>%select(-c("Family.Income","Siblings"))
Depression.study.data.transformed<-Depression.study.data%>%
  mutate(
    log.prog=log(Programmes),
    log.attention=log(Parent.attention)
  )%>%select(-c("Programmes","Parent.attention"))
## We log-transform Programme time and Parent attention because they
## are heavy-tailed.

Screentime.pval<-rep(NA,4)
cut.offs<-c(5,10,20,30)
## We try four cut-off values for screentime.

for(i in seq_along(cut.offs)){
  cut.off<-cut.offs[i]
  temp.data<-Depression.study.data%>%
    mutate(screen.excess=Screentime>cut.off)%>%select(-c("Screentime"))
  ## We convert the Screentime to an indicator variable.
  model<-lm(Depression.Score~.,data=temp.data) # fit a linear model
  model.sum<-summary(model)
  Screentime.pval[i]<-model.sum$coefficients["screen.excessTRUE",4]
  ## extract the p-value for the coefficient of screen.excessTRUE
  ## (Naming conventions may be different on other systems, so you
  ## may need to modify this code).
}

best.cut.off<-cut.offs[Screentime.pval==min(Screentime.pval)]
### Select the cut-off that results in the smallest p-value.
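
To make the income grouping concrete, the following sketch applies the same `cut` call to a handful of made-up incomes (the values are hypothetical). By default the intervals are closed on the right, so an income of exactly 25000 falls in the first group.

```r
# Hypothetical incomes, chosen to sit on and around the break points.
incomes <- c(12000, 25000, 25001, 60000, 250000)

income.group <- cut(incomes,
                    breaks = c(0, 25000, 40000, 100000, Inf),
                    labels = c("poor", "low-income", "middle-income", "high-income"))

print(data.frame(incomes, income.group))
```

With these defaults an income of exactly 0 would be assigned NA, because the lowest break is excluded unless `include.lowest=TRUE` is set.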
3. A scientist is studying crop growth. Their research assistant was analysing
the data in the file HW1Q3.txt, and wrote the code in the file HW1Q3.R to
process the data, before leaving suddenly. Upon reviewing the code, the
scientist discovers that the code does not work. Fix the code.

There are several problems with the code. If we attempt to run the code,
we get an error message and two warnings:
Warning messages:
1: In distance[median, ] > cutoff :
longer object length is not a multiple of shorter object length
2: In distance[median, ] > cutoff :
longer object length is not a multiple of shorter object length
Error in outliers[[i]]$x : $ operator is invalid for atomic
vectors
The error messages are typically cryptic. However, examining the outliers
variable, we see that it consists of a single pair of variables x and y, instead
of a list of 10 such pairs — one for each cut-off.
[The error arises because outliers[[1]] refers to outliers$x, which is
a matrix, so the entries do not have names.]
The problem then becomes clear: the function get.outliers does not
accept a vector of values for cutoff, whereas the later code assumes
that it will. We can modify the function so that it can take a vector. [An
alternative approach would be to call the function from the loop. This is
slightly simpler, but if there is a danger that future users might expect it
to handle vectors of cut-offs, then fixing the function is a better solution.]

Plant.data<-read.table("HW1Q3.txt")

get.outliers<-function(x,y,distance,median,cutoff){
  ## This function extracts all observations further from the median than the
  ## cut-off value, using the distance provided.
  ## If cutoff is a vector, then the function returns a list of the
  ## outliers for each cut-off value
  answer<-vector("list",length(cutoff))
  for(i in seq_along(cutoff)){
    answer[[i]]<-list("x"=x[distance[median,]>cutoff[i],],
                      "y"=y[distance[median,]>cutoff[i]])
  }
  return(answer)
}

library(glmnet) # for LASSO
X<-model.matrix(yield~.,data=Plant.data)
# This creates a matrix of predictors to be used in LASSO, converting
# categorical variables to indicators.

### Calculate a distance matrix between predictors.

# n is the number of observations; distance holds the pairwise distances.
n<-nrow(X)
distance<-matrix(0,n,n)
inv.cov<-solve(var(X[,-1]))
for(i in seq_len(n)){
  for(j in seq_len(n)){
    diff<-X[i,-1]-X[j,-1]
    distance[i,j]<-sqrt(t(diff)%*%inv.cov%*%diff)
  }
}

# The median point is the one that minimises total distance to other
# points.
median<-which(colSums(distance)==min(colSums(distance)))

LASSO<-cv.glmnet(X,Plant.data$yield,alpha=1)
# This performs cross-validation to select the best penalty parameter lambda.

coeffs<-LASSO$glmnet.fit$beta[,LASSO$lambda.1se]
### Using the largest lambda within one standard error of the minimum is
### common practice to ensure a sparse model

### Extract the outliers at 10 cut-off values
cut.offs<-6+seq_len(10)/10
outliers<-get.outliers(X,Plant.data$yield,distance,median,cut.offs)

### For each cut-off, calculate the Mean Squared Error for the outliers
### Not including the intercept in the prediction
MSE<-rep(1,10)
for(i in seq_len(10)){
  predictions<-outliers[[i]]$x[,-1]%*%coeffs[-1]
  true<-outliers[[i]]$y
  MSE[i]<-mean((predictions-true)^2)
}
When we run this modified code, we get the error
Error in outliers[[i]]$x[, -1] : incorrect number of dimensions
This is telling us that outliers[[i]]$x is not a matrix, but a vector. We
find that i==9 and outliers[[i]]$x only has one row. Because it only
has one row, R reduces it to a vector, which makes the code not work. We
need to further improve the get.outliers function by using the option
drop=FALSE in the subset.

Plant.data<-read.table("HW1Q3.txt")

get.outliers<-function(x,y,distance,median,cutoff){
  ## This function extracts all observations further from the median than the
  ## cut-off value, using the distance provided.
  ## If cutoff is a vector, then the function returns a list of the
  ## outliers for each cut-off value
  answer<-vector("list",length(cutoff))
  for(i in seq_along(cutoff)){
    answer[[i]]<-list("x"=x[distance[median,]>cutoff[i],,drop=FALSE],
                      "y"=y[distance[median,]>cutoff[i]])
  }
  return(answer)
}

library(glmnet) # for LASSO
X<-model.matrix(yield~.,data=Plant.data)
# This creates a matrix of predictors to be used in LASSO, converting
# categorical variables to indicators.

### Calculate a distance matrix between predictors.

# n is the number of observations; distance holds the pairwise distances.
n<-nrow(X)
distance<-matrix(0,n,n)
inv.cov<-solve(var(X[,-1]))
for(i in seq_len(n)){
  for(j in seq_len(n)){
    diff<-X[i,-1]-X[j,-1]
    distance[i,j]<-sqrt(t(diff)%*%inv.cov%*%diff)
  }
}

# The median point is the one that minimises total distance to other
# points.
median<-which(colSums(distance)==min(colSums(distance)))

LASSO<-cv.glmnet(X,Plant.data$yield,alpha=1)
# This performs cross-validation to select the best penalty parameter lambda.

coeffs<-LASSO$glmnet.fit$beta[,LASSO$lambda.1se]
### Using the largest lambda within one standard error of the minimum is
### common practice to ensure a sparse model

### Extract the outliers at 10 cut-off values
cut.offs<-6+seq_len(10)/10
outliers<-get.outliers(X,Plant.data$yield,distance,median,cut.offs)

### For each cut-off, calculate the Mean Squared Error for the outliers
### Not including the intercept in the prediction
MSE<-rep(1,10)
for(i in seq_len(10)){
  predictions<-outliers[[i]]$x[,-1]%*%coeffs[-1]
  true<-outliers[[i]]$y
  MSE[i]<-mean((predictions-true)^2)
}
This code now runs without error and produces a vector of MSE val-
ues. The final value is NaN (Not a Number), which is because there are
no outliers above the cut-off 7.0, so taking the mean of an empty set is
undefined.
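
Both behaviours described above are easy to reproduce. In this sketch (the matrix `m` is an arbitrary example), selecting a single row drops the result to a vector unless `drop=FALSE` is given, and the mean of an empty selection is `NaN`.

```r
m <- matrix(1:6, nrow = 3)   # a 3 x 2 matrix

# Selecting one row: the dimensions are dropped by default.
one.row <- m[c(FALSE, TRUE, FALSE), ]
# With drop = FALSE the result is still a 1 x 2 matrix.
one.row.kept <- m[c(FALSE, TRUE, FALSE), , drop = FALSE]

print(is.matrix(one.row))
print(dim(one.row.kept))

# With no rows selected, the mean is taken over an empty vector.
no.rows <- m[rep(FALSE, 3), , drop = FALSE]
print(mean(no.rows))
```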
There is still an error in the code, which does not give any obvious signs.
However, if we check the vector coeffs, we see that it is all zeros. The
reason is that we used LASSO$lambda.1se as the index, whereas we need
to use the index for the corresponding value of lambda. That is, we can
use
coeffs<-LASSO$glmnet.fit$beta[,LASSO$lambda==LASSO$lambda.1se]
or
coeffs<-LASSO$glmnet.fit$beta[,which(LASSO$lambda==LASSO$lambda.1se)]
The full corrected code is

Plant.data<-read.table("HW1Q3.txt")

get.outliers<-function(x,y,distance,median,cutoff){
  ## This function extracts all observations further from the median than the
  ## cut-off value, using the distance provided.
  ## If cutoff is a vector, then the function returns a list of the
  ## outliers for each cut-off value
  answer<-vector("list",length(cutoff))
  for(i in seq_along(cutoff)){
    answer[[i]]<-list("x"=x[distance[median,]>cutoff[i],,drop=FALSE],
                      "y"=y[distance[median,]>cutoff[i]])
  }
  return(answer)
}

library(glmnet) # for LASSO
X<-model.matrix(yield~.,data=Plant.data)
# This creates a matrix of predictors to be used in LASSO, converting
# categorical variables to indicators.

### Calculate a distance matrix between predictors.

# n is the number of observations; distance holds the pairwise distances.
n<-nrow(X)
distance<-matrix(0,n,n)
inv.cov<-solve(var(X[,-1]))
for(i in seq_len(n)){
  for(j in seq_len(n)){
    diff<-X[i,-1]-X[j,-1]
    distance[i,j]<-sqrt(t(diff)%*%inv.cov%*%diff)
  }
}

# The median point is the one that minimises total distance to other
# points.
median<-which(colSums(distance)==min(colSums(distance)))

LASSO<-cv.glmnet(X,Plant.data$yield,alpha=1)
# This performs cross-validation to select the best penalty parameter lambda.

coeffs<-LASSO$glmnet.fit$beta[,which(LASSO$lambda==LASSO$lambda.1se)]
### Using the largest lambda within one standard error of the minimum is
### common practice to ensure a sparse model

### Extract the outliers at 10 cut-off values
cut.offs<-6+seq_len(10)/10
outliers<-get.outliers(X,Plant.data$yield,distance,median,cut.offs)

### For each cut-off, calculate the Mean Squared Error for the outliers
### Not including the intercept in the prediction
MSE<-rep(1,10)
for(i in seq_len(10)){
  predictions<-outliers[[i]]$x[,-1]%*%coeffs[-1]
  true<-outliers[[i]]$y
  MSE[i]<-mean((predictions-true)^2)
}
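
The indexing mistake can be illustrated without glmnet. In the sketch below, `lambda` and `beta` are small made-up stand-ins for `LASSO$lambda` and `LASSO$glmnet.fit$beta`. A numeric index is treated as a column position (truncated towards zero), not as a value to look up, so the position has to be found with `which`.

```r
# Hypothetical stand-ins: three penalty values and one column of
# coefficients per penalty value (columns are filled in order).
lambda <- c(2.5, 1.2, 0.4)
beta <- matrix(c(0,   0,
                 0.5, 0,
                 0.9, 0.3),
               nrow = 2,
               dimnames = list(c("x1", "x2"), NULL))
chosen <- 1.2   # plays the role of lambda.1se

# Wrong: 1.2 is truncated to 1, so this silently returns column 1.
by.value <- beta[, chosen]

# Right: look up the position of the chosen value first.
by.position <- beta[, which(lambda == chosen)]

print(by.value)
print(by.position)
```

The exact comparison with `==` is reasonable here only on the assumption that the chosen value was copied directly from the vector of penalties, as in the corrected code above.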
4. A government researcher is studying the effect of news coverage on
elections. Their research assistant was analysing the data in the file
HW1Q4.txt, and wrote the code in the file HW1Q4.R to process the data,
before leaving suddenly. Upon reviewing the code, the researcher discovers
that the code does not work. Fix the code and improve it to reduce the risk
of this type of mistake happening in future.

The error message here is somewhat cryptic. Examining it carefully, it is
saying that we have referred to the variable MSE from the data frame Folds,
but Folds is not a data frame, it is a vector. Looking at Folds confirms
that it is a vector of “1”s and “2”s. The problem is that Folds is the name
of the global variable used to communicate between the create.folds
and cross.validate functions. The problem can be fixed by renaming
the data frame used to save the results. However, using a global variable
to communicate implicitly between functions is dangerous, and it is likely
that this problem will happen again. To avoid this, the Folds variable
should be explicitly passed between the functions.

create.folds<-function(nf,length){
  ### Make folds for cross-validation.
  ### The caret package has better functions to implement this,
  ### but we needed to avoid the dependency.

  nfold<-nf
  Folds<-sample(seq_len(nfold),length,replace=TRUE,prob=rep(1,nfold)/nfold)
  #This variable is local.
  return(Folds)
}

cross.validate<-function(formula,data,Folds,nfold){

  ### First call the create.folds function to set up the folds.
  ### Then call this function to make cross-validated predictions.
  ### Folds is now explicitly passed as a parameter.

  predictions<-rep(NA,dim(data)[1])

  for(i in seq_len(nfold)){

    training.data<-data[Folds!=i,]
    model<-lm(formula,data=training.data)
    test.data<-data[Folds==i,]
    predictions[Folds==i]<-predict(model,newdata=test.data)
  }
  return(predictions)
}

election.data<-read.table("HW1Q4.txt")
n<-dim(election.data)[1]

### Try different numbers of folds and compare the cross-validated MSE

### Prepare a table for the results:
Folds<-data.frame("folds"=c(2,3,4,5,10,20),"MSE"=rep(NA,6))

for(i in seq_len(6)){

  ### Now we can use a different name for the variable to pass folds
  ### between the functions, using a lower-case "f".
  folds<-create.folds(Folds$folds[i],n)
  cv.pred<-cross.validate(Outcome~.,election.data,folds,Folds$folds[i])
  ### calculate mean-squared error.
  Folds$MSE[i]<-mean((cv.pred-election.data$Outcome)^2)
}
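
The failure mode is easy to reproduce in isolation. The sketch below is a stripped-down, hypothetical version of the original design (the function `set.folds.globally` is invented for illustration): a function writes to a variable called `Folds` outside its own scope, silently replacing a results table of the same name.

```r
# A results table that happens to share its name with the hidden variable.
Folds <- data.frame(folds = c(2, 3), MSE = c(NA, NA))

# Hypothetical helper that communicates through a non-local variable.
set.folds.globally <- function(nfold, n) {
  Folds <<- sample(seq_len(nfold), n, replace = TRUE)
}

set.folds.globally(2, 10)

# The data frame has been overwritten by a plain vector.
print(is.data.frame(Folds))

# Using $ on it now fails; tryCatch captures the error message.
result <- tryCatch(Folds$MSE, error = function(e) conditionMessage(e))
print(result)
```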
If we feel it is necessary, or at least convenient to share variables implicitly,
we need to create an environment for these functions that will allow us to
pass variables that cannot clash with global variables. This can be done
by creating the interacting functions with another function.

make.functions<-function(){
  ## These variables are local to the make.functions function
  ## Thus, only functions defined inside this function can access them.

  nfold<-0
  Folds<-NULL

  create.folds<-function(nf,length){
    ### Make folds for cross-validation.
    ### The caret package has better functions to implement this,
    ### but we needed to avoid the dependency.

    nfold<<-nf
    Folds<<-sample(seq_len(nfold),length,replace=TRUE,prob=rep(1,nfold)/nfold)
  }

  cross.validate<-function(formula,data){

    ### First call the create.folds function to set up the folds.
    ### Then call this function to make cross-validated predictions.

    predictions<-rep(NA,dim(data)[1])

    for(i in seq_len(nfold)){

      training.data<-data[Folds!=i,]
      model<-lm(formula,data=training.data)
      test.data<-data[Folds==i,]
      predictions[Folds==i]<-predict(model,newdata=test.data)
    }
    return(predictions)
  }

  return(list("create.folds"=create.folds,"cross.validate"=cross.validate))
}

### To make them accessible functions, we need a few extra lines of code.

functions<-make.functions()
create.folds<-functions$create.folds
cross.validate<-functions$cross.validate

election.data<-read.table("HW1Q4.txt")
n<-dim(election.data)[1]

### Try different numbers of folds and compare the cross-validated MSE

### Prepare a table for the results:
Folds<-data.frame("folds"=c(2,3,4,5,10,20),"MSE"=rep(NA,6))

for(i in seq_len(6)){

  create.folds(Folds$folds[i],n)
  cv.pred<-cross.validate(Outcome~.,election.data)
  ### calculate mean-squared error.
  Folds$MSE[i]<-mean((cv.pred-election.data$Outcome)^2)
}
Another approach would be to make the functions into a package. This
would mean that unless the global variable Folds is exported, it would
remain in the package’s namespace and thus avoid any clashes with other
variables with the same name. Approaches that use global variables are a
bad idea for this simple code where only a single vector of folds needs to be
passed between functions, but might make things simpler if the functions
need to pass a large number of variables.
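
As a minimal sketch of the enclosing-environment idea, independent of the cross-validation code, the hypothetical `make.counter` below creates two functions that share a private variable. A variable with the same name outside the function is unaffected.

```r
make.counter <- function() {
  count <- 0   # local to this call of make.counter

  increment <- function() {
    count <<- count + 1   # updates the enclosing variable, not an outer one
    invisible(count)
  }
  current <- function() count

  list(increment = increment, current = current)
}

count <- 100            # an outer variable with the same name
counter <- make.counter()
counter$increment()
counter$increment()

print(counter$current())  # the private value
print(count)              # the outer variable is unchanged
```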

5. The code in the file HW1Q5.R is a script for processing a reinsurance
company’s contract records. Improve the code to make it more reusable and
less error-prone.

Examining the code, we identify several issues that cause the code not to
be reusable or robust.

• The exchange rates are hard-coded into the code. If an exchange rate
changes, it will need to be updated in multiple places. If any one is
missed, it will cause errors.
• The code is almost the same for all cases, but is repeated in each case.
Any updates to the methods need to be changed in every branch,
leading to the possibility of mistakes. The code should be redesigned
to use a single piece of code, either using a function or by using
variables to prepare before the code. Indeed, we can see a mistake in
the code — there is no checking the policy limit for Quota-sharing
contracts in Germany. This is almost certainly a mistake caused by
the bad code design.
• The branching code assumes that all entries are from the available set
of entries, and are correctly input. If any entry is misformatted or if
a new country or contract type is added, the entry will be incorrectly
processed. The code should check for this and produce an error.
• The filename automatically uses the current date. It may be necessary
to process records from a previous day. The code should be
modified to allow that, probably using a function.
• The main loop over transactions uses the : operator. If the transaction
list is empty, the loop will still run (for example, 1:0 gives the
indices 1 and 0) and cause an error.
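
The third point can be sketched as a small validation helper. Everything here is illustrative: the function name `check.known`, the column names and the allowed values are assumptions, since the layout of the contract records is not shown.

```r
# Hypothetical helper: stop with a clear message if any value is not
# in the allowed set.
check.known <- function(values, allowed, what) {
  bad <- unique(values[!(values %in% allowed)])
  if (length(bad) > 0) {
    stop(paste0("Unknown ", what, ": \"",
                paste(bad, collapse = "\", \""), "\""))
  }
  invisible(TRUE)
}

# Made-up records; note the trailing space in the second contract type.
records <- data.frame(country = c("Canada", "France"),
                      contract = c("Quota Share", "Excess of Loss "))

# All countries are known, so this passes silently.
check.known(records$country,
            c("Canada", "USA", "France", "Germany", "China"), "country")

# The misformatted contract type is reported; tryCatch captures the message.
message.text <- tryCatch(
  check.known(records$contract,
              c("Excess of Loss", "Quota Share", "Catastrophe Cover"),
              "contract type"),
  error = function(e) conditionMessage(e))
print(message.text)
```

Quoting the offending value in the message makes problems such as trailing spaces visible.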

We first make a more general function to open the data. The function
defaults to using today’s date, but that can be overridden, or the full
filename can be given.

get.contracts<-function(dat=as.character(
                          as.Date(
                            date(),
                            format="%a %b %d %H:%M:%S %Y")),
                        filename=NULL){
  ### By default, this loads today's contracts, but another date can be provided.
  ### A full filename can be provided to override the default.
  if(is.null(filename)){
    ### if filename not given, use the default for the given date.
    return(read.table(paste("Contracts",dat,".txt",sep="")))
  } else {
    return(read.table(filename))
  }
}

contracts<-get.contracts()
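
For illustration, the filename construction can be checked without reading any file. The helper `contract.filename` below is hypothetical and simply isolates the `paste` step. It uses `Sys.Date()` as a shorter stand-in for parsing the output of `date()`, and assumes dates are written in the default YYYY-MM-DD form.

```r
# Hypothetical helper isolating the filename construction.
contract.filename <- function(dat = as.character(Sys.Date())) {
  paste("Contracts", dat, ".txt", sep = "")
}

print(contract.filename("2025-01-15"))   # a specific earlier date
print(contract.filename())               # defaults to today's date
```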

Rather than using if statements to select the exchange rate, we can create
a function to look up the exchange rate from a table. Each lookup table is
built from two vectors: one table maps countries to currencies, and the
other maps currencies to exchange rates.
We could create a single table to give the exchange rate for each country.
However, this would include the Euro exchange rate twice, making
it possible that it could be updated incorrectly. Alternatively, it might
be desirable to have different rates, even for countries that use the same
currency, if the transactions are processed at different times.

### Easy to update list of all countries
### With relevant information
exchange_rate_data<-list(
  "countries"=c("Canada","USA","France","Germany","China"),
  "currencies"=c("CAD","USD","EUR","EUR","RMB"),
  "currency_list"=c("CAD","USD","EUR","RMB"),
  ### TODO: five rates are listed for the four currencies in currency_list;
  ### there must be exactly one rate per currency, in the same order.
  "exchange_rates"=c(1,1.24032,1.64720,1.80394,0.189401)
)

get.rate<-function(country,data){

  ### This only works for a single country.
  ### Would need to be modified to handle a vector.

  ### We pass exchange rate data as a parameter to ensure that the user
  ### is aware it is used in the function. Leaving it as a global
  ### variable would not be terrible, but keeping it as a parameter
  ### reduces any danger of the user incorrectly specifying it.

  if(country%in%data$countries){
    country.no=which(country==data$countries)[1]
    ### If multiple matches, this takes the first
    currency=data$currencies[country.no]
  } else {
    stop(paste("Country \"",country,"\" not in database.",sep=""))
  }
  if(currency%in%data$currency_list){
    currency.no=which(currency==data$currency_list)[1]
    return(data$exchange_rates[currency.no])
  } else {
    stop(paste("Currency \"",currency,"\" not in database.",sep=""))
  }
}

The following modified loop avoids repeating the same code.

day.records<-NULL

### seq_len(nrow()) loops over rows and is safe when there are no contracts.
for(i in seq_len(nrow(contracts))){

  record<-contracts[i,]
  exch.rate<-get.rate(record$country,exchange_rate_data)

  if(record$contract=="Excess of Loss"){
    record$claim=record$loss-record$attachment
  } else if(record$contract=="Quota Share"){
    record$claim=record$loss*record$percentage
  } else if(record$contract=="Catastrophe Cover"){
    record$claim=record$catastrophe.loss-record$attachment
  } else {
    ## Give a clear error message
    stop(paste("Contract type \"",record$contract,"\" not known.",sep=""))
    ## It is important to put quotations around the error, as
    ## trailing spaces can cause errors.
  }
  ### These lines appear to be almost the same in all cases
  ### It is better to put them only once.
  if(record$claim<0){
    record$claim=0
  }
  if(record$claim>record$limit){
    record$claim=record$limit
  }
  record$profit=record$premium-record$claim
  if(record$contract=="Quota Share"){
    record$profit=record$profit-record$ceding.commission
  }
  record$converted.profit<-record$profit*exch.rate
  ### Store the processed record.
  day.records<-rbind(day.records,record)
}
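
If the records are processed as whole columns rather than one row at a time, the two clamping steps can be written with `pmax` and `pmin`. This is a sketch on made-up numbers (the vectors `raw.claim` and `limit` are hypothetical), not a replacement for the loop above.

```r
# Made-up raw claims and policy limits for four contracts.
raw.claim <- c(-500, 1200, 8000, 300)
limit     <- c(1000, 1000, 5000, 1000)

# Floor at zero, then cap at the policy limit, element by element.
claim <- pmin(pmax(raw.claim, 0), limit)
print(claim)
```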

For the exchange rate lookup table, we can use the names attribute to
create slightly neater lookup tables and functions.

19
### Use names a t t r i b u t e t o make b e t t e r lookup t a b l e s

c u r r e n c y . lookup . t a b l e <−c ( ”CAD” , ”USD” , ”EUR” , ”EUR” , ”RMB” ) ,

names ( c u r r e n c y . lookup . t a b l e )<−c ( ” Canada ” , ”USA” , ” France ” , ” Germany ” , ” China ” ) ,

exchange . r a t e . lookup . t a b l e=c ( 1 , 1 . 2 4 0 3 2 , 1 . 6 4 7 2 0 , 1 . 8 0 3 9 4 , 0 . 1 8 9 4 0 1 )

names ( exchange . r a t e . lookup . t a b l e )=c ( ”CAD” , ”USD” , ”EUR” , ”RMB” ) ,

exchange_rate_data<-list(
    "currency"=currency.lookup.table,
    "exchange_rates"=exchange.rate.lookup.table
)

get.rate<-function(country,data){

    ### This only works for a single country.
    ### Would need to be modified to handle a vector.

    currency<-data$currency[country]
    ### Indexing a named vector by an unknown name gives NA, not NULL.
    if(is.na(currency)){
        stop(paste("Country \"",country,"\" not in database.",sep=""))
    }
    rate<-data$exchange_rates[currency]
    if(is.na(rate)){
        stop(paste("Currency \"",currency,"\" not in database.",sep=""))
    }
    return(rate)
}
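
The comments in get.rate note that it only handles a single country. A minimal sketch of a vectorised variant is given below, assuming the same kind of named lookup vectors. The name get.rates and the small tables are made up for illustration; the rates reuse figures from the listing above, with a single euro rate assumed.

```r
## Hypothetical lookup tables, for illustration only.
currency.table <- c(Canada = "CAD", USA = "USD", France = "EUR",
                    Germany = "EUR", China = "RMB")
rate.table <- c(CAD = 1, USD = 1.24032, EUR = 1.64720, RMB = 0.189401)

## get.rates is an illustrative name, not part of the original code.
get.rates <- function(countries, currencies, rates) {
    ## Indexing a named vector by an unknown name gives NA, not NULL.
    curr <- unname(currencies[countries])
    if (anyNA(curr)) {
        stop(paste0("Country \"", countries[is.na(curr)][1],
                    "\" not in database."))
    }
    ans <- unname(rates[curr])
    if (anyNA(ans)) {
        stop(paste0("Currency \"", curr[is.na(ans)][1],
                    "\" not in database."))
    }
    ans
}

example.rates <- get.rates(c("France", "China", "Canada"),
                           currency.table, rate.table)
```

Because indexing by a character vector is already vectorised in R, the only real change is that the checks use anyNA and report the first offending entry.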

One slightly awkward problem with the exchange rate lookup function is
that we need to either use the exchange rate data as a global variable, or
pass it as a parameter for every call to get.rate.
A more advanced solution uses something called a “closure” to build the
fixed values into the get.rate function.

### Use a closure to fix constant parameters in a function.
### We do this by using one function that returns a new function.

make.rate.lookup<-function(currencies,exchange.rates){

    # Define these local variables to fix the values.
    currency=currencies
    exchange_rates=exchange.rates

    return(function(country){
        ### This only works for a single country.
        ### Would need to be modified to handle a vector.

        ### Now this function only needs a single parameter.

        curr<-currency[country]
        ### Indexing a named vector by an unknown name gives NA, not NULL.
        if(is.na(curr)){
            stop(paste("Country \"",country,"\" not in database.",sep=""))
        }
        rate<-exchange_rates[curr]
        if(is.na(rate)){
            stop(paste("Currency \"",curr,"\" not in database.",sep=""))
        }
        return(rate)
    })
}

### Define the same lookup tables as the previous example

currency.lookup.table<-c("CAD","USD","EUR","EUR","RMB")
names(currency.lookup.table)<-c("Canada","USA","France","Germany","China")

### The source lists five rates for four currency names. A single EUR
### rate is assumed here (the second EUR figure, 1.80394, is dropped).
exchange.rate.lookup.table=c(1,1.24032,1.64720,0.189401)

names(exchange.rate.lookup.table)=c("CAD","USD","EUR","RMB")

### And use them to make the look-up function.

get.rate<-make.rate.lookup(currency.lookup.table,exchange.rate.lookup.table)

In this code, the look-up tables are built into the get.rate function when it
is created. Now the look-up tables are fixed. By calling make.rate.lookup
with different look-up tables, we could change them, but this creates a new
function.
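
The way a closure fixes its values can be seen in a much smaller example. The sketch below is not part of the homework code: make.scaler and the variable names are invented for illustration. It uses force() to evaluate the argument when the function is created, which is the same effect the local assignments in make.rate.lookup have.

```r
## make.scaler is a made-up example, not part of the homework code.
make.scaler <- function(factor) {
    force(factor)  # evaluate the argument now, so its value is fixed
    function(x) {
        x * factor
    }
}

rate <- 2
double.it <- make.scaler(rate)
rate <- 10               # changing the global later has no effect
doubled <- double.it(5)  # still uses factor = 2

## A different value requires making a new function.
triple.it <- make.scaler(3)
tripled <- triple.it(5)
```

Without the force() call, R's lazy evaluation would delay evaluating the argument until the first call of double.it, by which time rate has changed.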
Another approach we could take is to make a function to calculate the
claim for the different contract types so the main loop is shorter. This
does have the advantage of separating the code that deals with these types,
making it easier to add a new contract type or some similar modification.

get.claim.amount<-function(record){
    if(record$contract=="Excess of Loss"){
        ans<-record$loss-record$attachment
    } else if(record$contract=="Quota Share"){
        ans<-record$loss*record$percentage
    } else if(record$contract=="Catastrophe Cover"){
        ans<-record$catastrophe.loss-record$attachment
    } else {
        ## Give a clear error message
        stop(paste("Contract type \"",record$contract,"\" not known.",sep=""))
    }
    if(ans<0){
        ans=0
    }
    if(ans>record$limit){
        ans=record$limit
    }
    return(ans)
}

day.records<-NULL

### Loop over rows: seq_along on a data frame would loop over columns.
for(i in seq_len(nrow(contracts))){

    record<-contracts[i,]
    exch.rate<-get.rate(record$country,exchange_rate_data)

    record$claim<-get.claim.amount(record)

    record$profit=record$premium-record$claim
    if(record$contract=="Quota Share"){
        record$profit=record$profit-record$ceding.commission
    }
    record$converted.profit=record$profit*exch.rate

    ### The listing is cut off here; appending the record and closing
    ### the loop is the assumed ending.
    day.records<-rbind(day.records,record)
}
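
The two if statements in get.claim.amount that floor the claim at zero and then cap it at the limit can also be written in one line. The following is a sketch only; cap.claim is an invented helper name, and the amounts are made up.

```r
## cap.claim is an illustrative helper, not part of the original code.
cap.claim <- function(amount, limit) {
    ## pmax floors at zero; pmin then caps at the limit.
    pmin(pmax(amount, 0), limit)
}

capped <- cap.claim(c(-50, 120, 900), limit = 500)
```

Using pmin and pmax rather than min and max means the same helper would also work on a whole column of claims at once.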

6. A data scientist has produced the code in the file HW1Q6.R to process a
company’s data. Testing the code on a small subset of the data, she finds
that it takes 7 hours to process a dataset with 200,000 records, each with
400 predictors.
(a) Approximately how long would the program be expected to take for the
company’s whole database of 4,000,000 records with 1,200 variables each?

We see that the code has two nested loops, both of which concatenate
lists. Concatenating lists of length n is O(n) complexity, so building a
list of length n is O(n^2) complexity. Thus if n is the number of records,
and p is the number of predictors, then the complexity of this program
is O(n^2 p^2). Thus, if we have 20 times as many records, and 3 times as
many predictors, then program execution will take 20^2 × 3^2 = 3600 times
as long, or 25200 hours, often referred to as 1050 days.
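
The arithmetic can be checked directly. This short calculation simply takes the O(n^2 p^2) scaling from the argument above as given; the variable names are chosen here for illustration.

```r
## Scaling estimate under the O(n^2 p^2) assumption from the text.
base.hours <- 7
record.ratio <- 4000000 / 200000   # 20 times as many records
predictor.ratio <- 1200 / 400      # 3 times as many predictors

scale.factor <- record.ratio^2 * predictor.ratio^2
est.hours <- base.hours * scale.factor
est.days <- est.hours / 24
```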

(b) Management deems the time required unacceptable. Rewrite the code
to run more efficiently for big datasets.

Apparently three years is too long. To make the code faster, we can
reorganise the code to create each list at its full length first, and then
fill it in, rather than growing it by concatenation.

### Create the list at full length first, rather than growing it.
records<-vector("list",num_records)

for(i in seq_len(num_records)){
    formatted.record<-rep(NA,num_variables)
    for(j in seq_len(num_variables)){
        rec<-record[i,j]
        rec<-strsplit(rec,"-")[[1]]
        rec<-mean(as.numeric(rec)) #take mean of range
        formatted.record[j]<-rec #add to the list
    }
    records[[i]]<-formatted.record
}

This should be easily fast enough. However, the code can be made even
more efficient using vectorisation — the strsplit function can take a
vector as input, and will return a list of outputs. We can use the lapply
function to more efficiently process this.

records<-vector("list",num_records)

for(i in seq_len(num_records)){
    # When the input to strsplit is a vector, it returns a list.
    # lapply produces a list of vectors of length 1. unlist turns
    # it into a single vector as required.
    records[[i]]<-unlist(
        lapply(
            strsplit(record[i,],"-"),
            function(x){mean(as.numeric(x))}
        ))
}

This has the same complexity O(np), but a slightly faster running time.
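
The vectorised step can be tried on a small made-up matrix. The sketch below is an illustration under stated assumptions: the data are invented, range.means is a hypothetical helper name, and vapply is used in place of lapply with unlist so that the result type is checked. It also assumes the values are non-negative, since a leading minus sign would be split as well.

```r
## Made-up data: each entry is a range written as "low-high".
toy <- matrix(c("1-3", "10-20", "5-7", "0-4"), nrow = 2, byrow = TRUE)

range.means <- function(x) {
    ## strsplit returns one character vector per input string.
    parts <- strsplit(x, "-", fixed = TRUE)
    vapply(parts, function(p) mean(as.numeric(p)), numeric(1))
}

## One numeric vector per row, stored in a preallocated list.
toy.records <- vector("list", nrow(toy))
for (i in seq_len(nrow(toy))) {
    toy.records[[i]] <- range.means(toy[i, ])
}
```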

7. The file HW1Q7.txt contains data from an entertainment company about
electricity usage. The data are not formatted in a very convenient way.
Read the data into R and reformat into a more convenient way, and use
it to create a plot showing electricity used per hour (y-axis) vs number of
people (x-axis) with colour showing age group and size showing company
size, with a facet grid of type of event versus time of day. Make a list of
all corrections made to the data.

We first read and clean the customers table:

customers<-read.table("HW1Q7.txt",skip=2,nrows=52)

### could use stringsAsFactors, but will need to fix factors manually
### anyway to correct mistakes.

customers$ID<-as.integer(customers$ID)
str(customers)
summary(customers)
### Numerical values all look OK. Next we check the categorical
### variable "Sector".
table(customers$Sector)
### Merge "Live performance" and "Live Performance"
customers$Sector[customers$Sector=="Live performance"]<-"Live Performance"

### Now convert to factor
customers$Sector<-as.factor(customers$Sector)

We identify and fix several misspellings and abbreviations. We then do
the same for the events table.

events<-read.table("HW1Q7.txt",skip=57)

str(events)
### We note that Number.of.People is character, when it should be numeric.
### This indicates there are probably some errors that need to be fixed.

which(is.na(as.numeric(events$Number.of.People)))

events$Number.of.People[c(83,134)]
### Commas should clearly be removed.
events$Number.of.People[c(83,134)]<-c(3216,5691)
### Now we can convert it to numeric
events$Number.of.People<-as.integer(events$Number.of.People)
### Check that it converted correctly
which(is.na(events$Number.of.People))

### Now check the levels of factor variables:

table(events$Event.type)
### Seems OK
table(events$Age.group)
### "Adlut" is a misspelling of "Adult" and "YoungChild" should be
### "Young Child".
events$Age.group[events$Age.group=="Adlut"]<-"Adult"
events$Age.group[events$Age.group=="YoungChild"]<-"Young Child"

table(events$Time.of.day)
### "Late" probably means the same as "Late Evening".
events$Time.of.day[events$Time.of.day=="Late"]<-"Late Evening"

summary(events)
### Change strings to factors
events$Event.type<-as.factor(events$Event.type)
events$Age.group<-as.factor(events$Age.group)
events$Time.of.day<-as.factor(events$Time.of.day)
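
With only two affected rows, typing the corrected values by hand is reasonable. For a larger dataset, the commas could instead be stripped programmatically. A minimal sketch, using a made-up character vector in place of the real column:

```r
## Made-up values standing in for a column read as character.
raw.counts <- c("250", "3,216", "5,691", "1200")

## Remove thousands separators, then convert.
clean.counts <- as.integer(gsub(",", "", raw.counts, fixed = TRUE))

## Any NA here would point to a problem other than commas.
bad.rows <- which(is.na(clean.counts))
```

This avoids hard-coding row numbers, but it should still be followed by a check for remaining NA values, as in the listing above.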

Now that each table is cleaned, we join the two tables.

library(dplyr)

events.full<-events%>%left_join(customers,by=c("Host.ID"="ID"))
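
Before relying on the joined table, it may be worth checking that every host ID in the events table has a matching customer, since a left join fills unmatched rows with NA. The sketch below uses small invented tables and base R only (merge with all.x=TRUE behaves like a left join); the column names follow the text, but the values are made up.

```r
## Toy tables with made-up values; column names follow the text.
toy.customers <- data.frame(ID = 1:3, Size = c(50, 100, 150))
toy.events <- data.frame(Host.ID = c(1, 3, 3, 4),
                         Electricity.Usage = c(900, 1200, 1100, 800))

## Host IDs in events that have no matching customer.
unmatched <- setdiff(toy.events$Host.ID, toy.customers$ID)

## Base-R equivalent of a left join: all.x = TRUE keeps every event.
toy.full <- merge(toy.events, toy.customers,
                  by.x = "Host.ID", by.y = "ID", all.x = TRUE)
```

Here host 4 has no customer record, so its Size is NA in the merged table. Note that merge sorts the result by the key, unlike left_join, which keeps the row order of the first table.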

Finally, we can make the plot requested.

library(ggplot2)
ggplot(data=events.full,
       mapping=aes(x=Number.of.People,
                   y=Electricity.Usage,
                   colour=Age.group,
                   size=Size))+
    geom_point()+
    facet_grid(Event.type~Time.of.day)+
    theme(plot.title=element_text(size=20,hjust=0.5),
          axis.title=element_text(size=20,hjust=0.5),
          axis.text=element_text(size=16),
          legend.title=element_text(size=20,hjust=0.5),
          legend.text=element_text(size=16),
          strip.text=element_text(size=16))

This gives the following plot.

[Figure: scatter plots of Electricity.Usage (y-axis) against
Number.of.People (x-axis), with colour showing Age.group (Adult, All,
Child, Senior, Young Adult, Young Child) and point size showing Size.
Facet rows: Activity, Film, Live Performance, Music, Sport, Video game.
Facet columns: Afternoon, Evening, Late Evening, Morning, Overnight.]

[This is not a particularly great figure. It could certainly be improved in
many ways.]
During this process, we fixed the following errors in the data:
Customers
  Sector
    • Customers 2 and 10 have sector “Live performance”, while
      customers 7, 12, 15, 34, 50 and 51 have sector “Live Performance”.
      These have been merged.
Events
  Age.group
    • Events 185, 612 and 616 have age group “Adlut”, which is
      almost certainly a misspelling of “Adult”.
    • Events 14, 85, 184, 275, 300, 726 and 775 have age group
      “YoungChild”, which should be “Young Child”.
  Time.of.day
    • Events 97, 214 and 457 have time of day “Late” which is
      presumably the same as “Late Evening”.
  Number.of.People
    • Events 83 and 134 have commas in the number of people,
      causing them to be interpreted as character.
