Stat-340 - Assignment 4 - 2014 Spring Term: Part 1 - Breakfast Cereals - Easy
$SE = C/\sqrt{n} = C n^{-0.5}$
for some constant C. Take logarithms of both sides to give:
$\log SE = \log C - 0.5 \log n$
(why?). This now looks like a linear regression of log SE vs. log n with
the intercept being log C and the slope of -0.5 (why?).
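One way to see where the intercept and slope come from is to write out the algebra step by step. This is only a worked sketch using the same symbols (SE, C, n) as above:

```latex
\begin{align*}
SE      &= C\, n^{-0.5} \\
\log SE &= \log\left(C\, n^{-0.5}\right) \\
        &= \log C + \log\left(n^{-0.5}\right) \\
        &= \log C - 0.5 \log n
\end{align*}
```

Matching this to the regression form $Y = \beta_0 + \beta_1 X$ with $Y = \log SE$ and $X = \log n$ gives $\beta_0 = \log C$ and $\beta_1 = -0.5$.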
Create derived variables for the log(sample size), log(se_p90), and log(se_mean).
9. Use Proc Reg to find the relationship between the log(se) of each statistic
and log(n). You can do both fits in the same procedure; your code will
look something like:
p90: model log_se_p90 = log_n;
mean: model log_se_mean = log_n;
Have a look at the residual plots for both models. Do you notice anything
odd about the plots for the model for the 90th percentile?
Send the parameter estimates to a table for inclusion in your report using
the ODS Output facility.
Use Proc Transpose to create a table that looks like:
Parameter   Intercept   Slope
P90         xxx         xxx
Mean        xxx         xxx
© 2013 Carl James Schwarz
Look at the estimated slope. Is the decline in standard error with sample
size consistent with the $\sqrt{n}$ rule? How did you tell?
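The steps in this question could be strung together roughly as follows. This is a minimal sketch, not the official solution: the input dataset name se_summary and its variables n, se_p90, and se_mean are assumptions made for illustration, as are the output dataset names.

```sas
/* Assumed input: se_summary with one row per sample size and      */
/* variables n, se_p90, se_mean (all hypothetical names).          */
data se_log;
   set se_summary;
   log_n       = log(n);
   log_se_p90  = log(se_p90);
   log_se_mean = log(se_mean);
run;

ods graphics on;                 /* so residual plots are produced */
proc reg data=se_log;
   p90:  model log_se_p90  = log_n;
   mean: model log_se_mean = log_n;
   ods output ParameterEstimates=reg_est;   /* estimates to a dataset */
run;
quit;

/* One row per model, one column per parameter */
proc transpose data=reg_est out=reg_wide(rename=(log_n=Slope));
   by model notsorted;           /* model labels are not in sorted order */
   var estimate;
   id variable;                  /* values Intercept and log_n become columns */
run;

proc print data=reg_wide noobs;
   var model Intercept Slope;
run;
```

The NOTSORTED option is used because the model labels appear in the order the MODEL statements were written rather than in alphabetical order.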
10. Plot the log(se) values against log(sample size) for both estimates on the
same graph using Proc SGplot. You should have 2 scatter statements to
plot the actual values, two series, and two reg statements for the separate
regression lines within SGplot. You can label each regression line using
the curvelabel= option on the reg statement.
Have a look at the series plots for both models. Do you notice anything
odd about the plots for the model for the 90th percentile? Why do you
think this has happened?
It is tedious to add the regression equation to the plot (but you could do
this using the annotate feature of SAS as shown in the previous question).
It is not necessary to annotate the plot with the equations.
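A plot along these lines might be sketched as below. It assumes a dataset se_log holding log_n, log_se_p90, and log_se_mean (hypothetical names, not given in the assignment), and it sorts by log_n first so that the series lines join the points in order of sample size.

```sas
proc sort data=se_log;
   by log_n;                     /* series statements join points in data order */
run;

proc sgplot data=se_log;
   /* actual values */
   scatter x=log_n y=log_se_p90  / legendlabel="P90 (observed)";
   scatter x=log_n y=log_se_mean / legendlabel="Mean (observed)";
   /* lines joining the actual values */
   series  x=log_n y=log_se_p90;
   series  x=log_n y=log_se_mean;
   /* fitted regression lines, labelled on the plot */
   reg     x=log_n y=log_se_p90  / nomarkers curvelabel="P90";
   reg     x=log_n y=log_se_mean / nomarkers curvelabel="Mean";
   xaxis label="log(sample size)";
   yaxis label="log(standard error)";
run;
```

NOMARKERS is used on the reg statements because the points are already drawn by the scatter statements.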
11. Finally, after you have debugged your program, increase the number of
replicate samples at each sample size to 100 from 10. If you set up the
macro variable correctly, this will require one simple change.
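The idea of a single macro variable controlling the number of replicates can be illustrated as follows. This is only a sketch: the macro variable name nrep, the dataset names, the sample size, the seed, and the use of Proc SurveySelect to draw the samples are all assumptions, and your own program may draw its samples differently.

```sas
%let nrep = 10;    /* hypothetical macro variable; change to 100 for the final run */

/* Draw &nrep replicate simple random samples of size 100 */
proc surveyselect data=accidents out=samples
                  method=srs sampsize=100 reps=&nrep seed=340;
run;
```

Because every later step refers to &nrep rather than a hard-coded 10, editing the %let statement is the only change needed.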
Hand in the following using the online submission system:
- Your SAS code.
- A PDF file containing the output from your SAS program.
- A one page (maximum) double spaced PDF file containing a short write
  up on this analysis suitable for a manager of traffic operations who has
  had one course in statistics. You should include:
  - A (very) brief description of the dataset.
  - A graph showing the decline in log(se) as a function of log(n) with an
    accompanying table of the fit and an explanation of the implications
    of the slope when investigating the improvement in precision as a
    function of sample size.
Part 03: Review and preparing for term test
In this part of this assignment, you will work on a few short exercises designed
to review some of the material from the first three assignments and introduce
some new things about the Data step.
Put all of the code from all of the sub-parts in one single SAS file. There is NO
writeup for this part of the assignment.
1. Problems with input data
Outdoor temperature is measured in degrees Celsius
(http://en.wikipedia.org/wiki/Celsius) in Canada and degrees Fahrenheit
(http://en.wikipedia.org/wiki/Fahrenheit) in the US. Look at
http://www.stat.sfu.ca/~cschwarz/Stat-340/Assignments/Assign04/assign04-part03-ds01.txt
which has the temperatures in degrees Celsius or degrees Fahrenheit
in a few cities in early January.
Read in the data and convert all temperatures to degrees Celsius. Note
that
$C = (F - 32) \times \frac{5}{9}$
Print out the final dataset that contains the city and temperature (1 decimal
place) but not the observation number. Make sure that the label
for the temperature indicates it is in degrees Celsius.
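The conversion, label, and format can be sketched on a few made-up records. The layout assumed here (city, temperature, then a C or F scale code) and the values themselves are invented for illustration; the actual file may be laid out differently, so inspect it before adapting this.

```sas
data temps;
   length city $20 scale $1;
   input city $ temp scale $;       /* assumed layout: city, value, C or F */
   if upcase(scale) = 'F' then tempC = (temp - 32) * 5 / 9;
   else tempC = temp;               /* already in Celsius */
   label  tempC = 'Temperature (degrees Celsius)';
   format tempC 6.1;                /* 1 decimal place */
   keep city tempC;
   datalines;
Vancouver 5 C
Seattle 41 F
Winnipeg -20 C
Chicago 14 F
;
run;

/* NOOBS suppresses the observation number; LABEL prints the variable label */
proc print data=temps noobs label;
run;
```

In these made-up records, 41 F converts to (41 - 32) x 5/9 = 5.0 C and 14 F converts to (14 - 32) x 5/9 = -10.0 C.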
2. Column input
So far you have used the list input style of SAS. In this style of input,
variables are separated by at least one delimiter (typically a blank) and
there is no requirement that data values be aligned in the input file.
In some cases (particularly in dealing with data that was collected many
years ago), space on the input record was at a premium and data was often
crunched together without spaces between values. For example, an old
style of input medium was the punch card
(http://en.wikipedia.org/wiki/Punched_card) in which you had at most 80
columns of data.
Look at
http://www.stat.sfu.ca/~cschwarz/Stat-340/Assignments/Assign04/assign04-part03-ds02.txt.
The first two records are the
variable names and a character counter so you can see where the various
columns in the data are. The data variables are:
- Make of car in columns 1-5.
- Model of car in columns 6-12.
- Miles per gallon (mpg), Imperial measurement of fuel economy, in
  columns 13-14.
- Weight of the car in columns 15-18.
- Price of the car in columns 19-22.
In column input, you specify the columns that contain the variable. For
example, to read in the make and model of the car, your code would look
something like:
data *****;
infile *****;
length make ***** model ****;
input make $ 1-5 model $ 6-12;
Write SAS code to read in all of the variables from the car data (using
the URL method) and print out the final dataset.
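Putting the column positions listed above together, the read step might look like the sketch below. The fileref and dataset name cars are arbitrary choices, and FIRSTOBS=3 assumes the data start on the third record, after the variable names and the character counter; check this against the file.

```sas
filename cars url
   "http://www.stat.sfu.ca/~cschwarz/Stat-340/Assignments/Assign04/assign04-part03-ds02.txt";

data cars;
   infile cars firstobs=3 truncover;   /* skip the two header records */
   length make $5 model $7;
   input make   $ 1-5
         model  $ 6-12
         mpg      13-14
         weight   15-18
         price    19-22;
run;

proc print data=cars;
run;
```

TRUNCOVER is included so that a record shorter than 22 columns gives a missing value rather than making SAS read into the next line.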
3. Split-Apply-Combine (SAC) paradigm
The Split-Apply-Combine (SAC) paradigm is a common task, implemented
in SAS using the By statement, various procedures, and the ODS OUTPUT
or OUTPUT OUT= commands within the procedure.
For example, suppose you wanted to compare the average amount of sugar
by shelf in the cereal data. Your code would look something like:
data cereal; /* define the grouping variable */
/* read in cereal data */
/* make sure shelf and sugar are defined */
run;
proc sort data=cereal; by shelf; run; /* sort by grouping variable */
proc univariate data=cereal .... cibasic ;
by shelf; /* separate analysis for each shelf */
var sugars;
ods output .....;
run;
proc sgplot data=....; /* plot the estimate and 95% ci */
....
run;
You've also analyzed the proportion of survival by passenger class in the
Titanic, the number of accidents per day across the months, etc.
Go back to the accident dataset, and make a side-by-side confidence
interval plot of the proportion of fatal accidents by MONTH. You will have
to create a month variable from the date, create a fatality indicator, use
Proc Freq or Proc Genmod to estimate the proportion of fatalities in each
month, and finally Proc SGplot to plot the final estimates and confidence
intervals.
Are you surprised by the results? There is NO write up for this part.
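The split, apply, and combine steps for this exercise can be sketched as below. Note the substitution: this sketch uses Proc Means on a 0/1 indicator, which gives a t-based interval for the mean, rather than the Proc Freq or Proc Genmod approach the question asks for, so treat it only as an outline of the workflow. The dataset name accidents, the date variable accdate, and the coding severity = 1 for a fatal accident are all assumptions.

```sas
/* Assumed input: accidents, with a SAS date accdate and a severity */
/* code where 1 marks a fatal accident (hypothetical names/coding). */
data acc2;
   set accidents;
   month = month(accdate);          /* 1-12 from the date */
   fatal = (severity = 1);          /* 1 if fatal, 0 otherwise */
run;

proc sort data=acc2;                /* split: sort by grouping variable */
   by month;
run;

proc means data=acc2 noprint;       /* apply: estimate for each month */
   by month;
   var fatal;
   output out=month_est mean=p_fatal lclm=lcl uclm=ucl;
run;

proc sgplot data=month_est;         /* combine: plot estimates and 95% ci */
   scatter x=month y=p_fatal / yerrorlower=lcl yerrorupper=ucl;
   xaxis label="Month" integer;
   yaxis label="Proportion of fatal accidents";
run;
```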
Comments from the marker.
Here are the comments from the marker from previous years' assignments.
Part 01 - Cereal
Most papers began with a quick blurb about the dataset itself, followed by a
detailed discussion of the mean, sd, gini, etc. without even a mention of the
fact that it was calories/serving that was the variable of interest. Most people
who lost marks on this question lost them because they didn't mention the variable
they were analyzing.
Other mistakes included not using enough replicates in bootstrapping. A
number of people used only 5 replicates, and deemed it sufficient. Some even
presented histograms from those bootstrap runs, and claimed that things looked
"normal".
CJS - use 5-10 replicates to TEST your program, but don't forget to increase
the number of replicate bootstrap samples to around 1000.
Part 02 - Accidents
Once again, plenty of papers didn't mention the accident index, or even a
method by which the graphs were generated. After a quick blurb about this
data having 150000 data points and hailing from the UK government, they
delved right into analyzing the graph.
I had very few truly satisfactory slope interpretations. A lot of people seem
to default into the formulaic: "If X increases by one unit, then Y increases by
blah units", which in this case is very confusing. I didn't penalize this in most
cases, but I still don't like it. In the instances where the interpretation was
simply the equivalent of "precision increases with sample size", I took marks off.
That statement in itself is not news; this experiment was performed specifically
to assess the nature of the relationship.
CJS - Yes, it is true that the formal definition of a slope is the change in Y
per unit change in X. But statisticians are not robots who just repeat textbook
definitions! You should always put your work in terms of the project.