ASSIGNMENT 2
This question uses information from the data file AIS.sav found under the Assessment tab on the
STA2300 Course StudyDesk (also see AIS.txt for more details about the source and the variables
reported in the data set). Make sure the Variable View in SPSS is setup properly with all ‘labels’
correctly defined (with units), all ‘values’ assigned correctly for categorical variables and the correct
‘measure’ selected for all variables.
Use SPSS to find the answers to the following questions, but do not copy and paste SPSS output into
your answers for parts (c) and (d).
(a) (5 marks) Using SPSS produce an appropriate graph to display the distribution of the body mass
index (BMI) of the athletes. In the graph, label the axes correctly, include units of measure and
provide an appropriate title which includes your name.
STA2300 Data Analysis S2, 2019
(b) (4 marks) Using the graph produced in part (a) only (don’t refer to SPSS summary statistics),
describe in no more than 60 words, the distribution of the BMI of athletes. Include comments on
shape, centre and spread of the distribution and the existence of outliers and/or gaps, if any. Do not
perform any calculations; use the graph only.
We believe that the distribution of the athletes' BMI is best described as positively skewed, and
that the degree of skew has increased over time; that is, the upper end of the distribution curve
has shifted proportionately much more than the lower end. In an obesogenic environment, the
positive skew of the BMI distribution increases over time as heavier individuals gain more weight
than lighter individuals.
(c) (4 marks) What are the mean and standard deviation of the distribution of the BMI of the athletes?
(You can use SPSS to calculate the descriptive statistics but do not copy/paste SPSS output).
Statistics
Body Mass Index (kg/cm^2)
Mean 22.9559
Std. Deviation 2.86393
(d) (4 marks) Using SPSS find the median and IQR of the distribution of the BMI of athletes and
report in your answer. (Do not copy/paste SPSS output).
Statistics
Body Mass Index (kg/cm^2)
Median 22.7200
IQR 3.42
(e) (3 marks) For the distribution of the BMI of the athletes, which statistics are appropriate to
measure its centre and spread? Give a reasonable explanation for your choice.
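The mean is the most frequently used measure of central tendency and is generally considered the
best measure of it. However, there are some situations, such as a skewed distribution or one with
outliers, where the median is preferred. Since the BMI distribution was described in part (b) as
positively skewed, the median and IQR are the appropriate measures of centre and spread here.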
i. response variable(s),
ii. factor and its levels, and
iii. experimental unit.
(c) (4 marks) Are the four principles of experimental design used in this study? Explain, in the context
of the study.
The first principle of experimental design is randomization, which is a random process of
assigning treatments to the experimental units. The random process implies that every possible
allotment of treatments has the same probability. Here, the education specialist divided a large
class of 90 high school students and randomly allocated each of the teaching methods to 30
students. He then assessed the performance of the students at the end of the term. The second
principle of experimental design is replication, which is a repetition of the basic experiment. In
other words, it is a complete run of all the treatments to be tested in the experiment. In all
experiments, some variation is introduced because the experimental units (here, the students)
differ from one another. In this study, an education specialist proposed three different teaching
methods to teach statistics in high school. The first method, A, consisted of classroom
instructions; the second method, B, consisted of classroom instructions and tutorials; and the
third method, C, consisted of classroom instructions, tutorials and homework.
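The random allocation described above can be sketched in a few lines. This is only an illustration of the idea, not the procedure the education specialist actually followed: the student IDs 1 to 90, the function name `allocate` and the seed are all hypothetical.

```python
import random


def allocate(student_ids, methods=("A", "B", "C"), seed=None):
    """Randomly split the students into equal-sized groups, one per method."""
    rng = random.Random(seed)
    shuffled = list(student_ids)
    rng.shuffle(shuffled)  # every ordering of the students is equally likely
    size = len(shuffled) // len(methods)
    return {
        method: sorted(shuffled[i * size:(i + 1) * size])
        for i, method in enumerate(methods)
    }


# Hypothetical IDs 1..90; the seed only makes the sketch repeatable.
groups = allocate(range(1, 91), seed=2019)
for method, members in groups.items():
    print(method, len(members))
```

Shuffling the whole class once and then cutting it into three blocks gives each student the same chance of receiving each method, which is what the randomization principle asks for.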
(d) (3 marks) Explain explicitly what a confounding variable is. Identify one plausible confounding
variable in this study and explain why it is a confounding variable.
Gender of student
Amount of Study time
An extraneous variable becomes a confounding variable when it changes systematically along with
the independent variable(s) being studied. The question arises: why is the suspect extraneous
variable, Amount of Study time, a confounding variable? The answer is that Amount of Study time
changed systematically with the independent variable being measured (i.e., teaching method).
The data set AIS.sav contains information that was collected as part of a comprehensive study on 102
male and 100 female athletes randomly selected from the Australian Institute of Sport (courtesy of
Richard Telford and Ross Cunningham).
A researcher is interested to know if the Plasma Ferritin concentration (coded) is associated with
the Gender.
(a) (4 marks) Use SPSS to produce a contingency table displaying the relationship between the
Plasma Ferritin concentration (coded) and Gender of the athletes (you should use SPSS to
produce this contingency table). The title for this table should reflect its contents. (Note that by
convention, a table title should appear above the table). Include your name in the title.
Chi-Square Tests
                     Value     df   Asymptotic Significance (2-sided)
Pearson Chi-Square   31.136ᵃ   2    .000

Symmetric Measures
                                               Value   Approximate Significance
Nominal by Nominal   Contingency Coefficient   .365    .000
N of Valid Cases                               202
(b) (2 marks) What proportion of athletes had medium level of Plasma Ferritin concentration
(coded) and were male?
Plasma ferritin - coded * Male
Medium (50ng/mL-120ng/mL) and male: 53 of 202 athletes (26.2%)
(c) (2 marks) Of the female athletes, what proportion had medium level of Plasma Ferritin
concentration (coded)?
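Parts (b) and (c) ask for two different kinds of proportion: a joint proportion out of all athletes, and a conditional proportion out of the female athletes only. A minimal sketch of the difference, using the counts from the SPSS crosstabulation for this question (the variable names are illustrative only):

```python
# Counts taken from the Plasma ferritin (coded) by Gender crosstabulation.
medium_male = 53
medium_female = 45
n_female = 100
n_total = 202

# Part (b): medium AND male, as a share of ALL athletes (a joint proportion).
joint_medium_and_male = medium_male / n_total

# Part (c): medium, as a share of FEMALE athletes only (a conditional proportion).
medium_given_female = medium_female / n_female

print(f"P(medium and male) = {joint_medium_and_male:.3f}")
print(f"P(medium | female) = {medium_given_female:.3f}")
```

The only thing that changes between the two is the denominator, which corresponds to the "% of Total" and "% within Gender" rows of the SPSS output.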
(d) (6 marks) Does there appear to be an association between the level of Plasma Ferritin
concentration (coded) and Gender? Explain in less than 100 words, using a numerical example
from a conditional distribution table (produced by SPSS) to support your conclusion.
                                                             Gender
Plasma ferritin - coded                                      Female   Male     Total
Low (Less than 50ng/mL)    Count                             51       21       72
                           % within Plasma ferritin - coded  70.8%    29.2%    100.0%
                           % within Gender                   51.0%    20.6%    35.6%
                           % of Total                        25.2%    10.4%    35.6%
Medium (50ng/mL-120ng/mL)  Count                             45       53       98
                           % within Plasma ferritin - coded  45.9%    54.1%    100.0%
                           % within Gender                   45.0%    52.0%    48.5%
                           % of Total                        22.3%    26.2%    48.5%
High (More than 120ng/mL)  Count                             4        28       32
                           % within Plasma ferritin - coded  12.5%    87.5%    100.0%
                           % within Gender                   4.0%     27.5%    15.8%
                           % of Total                        2.0%     13.9%    15.8%
Total                      Count                             100      102      202
                           % within Plasma ferritin - coded  49.5%    50.5%    100.0%
                           % within Gender                   100.0%   100.0%   100.0%
                           % of Total                        49.5%    50.5%    100.0%
Plasma ferritin level was obtained from 202 athletes. The normality test revealed that ferritin was
not normally distributed (W-S p-value < 0.001). There is an association between the level of Plasma
Ferritin concentration (coded) and Gender (Pearson chi-square = 31.136, df = 2, p < 0.001).
The analysis revealed that 51 (70.8%) of the athletes in the low ferritin group were female, while
28 (87.5%) of those in the high ferritin group were male. Additionally, a total of 98 athletes
(48.5%) were in the medium ferritin group.
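As a cross-check on the SPSS output, the Pearson chi-square statistic and the contingency coefficient can be recomputed from the six observed counts in the crosstabulation. The sketch below uses only the Python standard library and the textbook formulas (expected count = row total × column total / N); it is not a reproduction of how SPSS computes them internally.

```python
import math

# Observed counts from the crosstabulation, as (Female, Male) per ferritin level.
observed = {"Low": (51, 21), "Medium": (45, 53), "High": (4, 28)}

col_totals = [sum(row[j] for row in observed.values()) for j in range(2)]
n = sum(col_totals)

chi_square = 0.0
for row in observed.values():
    row_total = sum(row)
    for j, obs in enumerate(row):
        expected = row_total * col_totals[j] / n  # count expected under independence
        chi_square += (obs - expected) ** 2 / expected

df = (len(observed) - 1) * (len(col_totals) - 1)
contingency_coefficient = math.sqrt(chi_square / (chi_square + n))

print(f"chi-square = {chi_square:.3f}, df = {df}")
print(f"contingency coefficient = {contingency_coefficient:.3f}")
```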
a) (2 marks) Identify the variable of interest and the unit of measurement of this variable.
b) (3 marks) Based on this distribution, what is the weight loss that 99% of the clients of Body
Fit will exceed in the first week of the program?
The weight loss that 99% of clients will exceed is the 1st percentile, computed with the formula X = μ + Zσ
Using Z = -2.326 for the 1st percentile: X = 2.9 - 2.326(0.45) = 1.85 kg
c) (4 marks) Based on this distribution, what percentage of clients of Body Fit will lose weight
between 1.9kg and 3.9kg in the first week of the program?
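Mean = (1.9kg + 3.9kg)/2
= 2.9
Std dev = (3.9 - 1.9)/4.4
= 0.45
z = (1.9 - 2.9)/0.45 = -2.22 and z = (3.9 - 2.9)/0.45 = 2.22
P(1.9 < X < 3.9) = P(-2.22 < Z < 2.22) = 0.9868 - 0.0132 = 0.9736, or about 97.4%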
d) (3 marks) Based on this distribution, if Body Fit had 8500 clients, how many of them had lost
less than 3.5kg in the first week of the program?
z = (3.5 - 2.9)/0.45 = 1.33, so P(X < 3.5) = P(Z < 1.33) = 0.9082
8500 × 0.9082 ≈ 7,720 clients
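The three normal-distribution calculations in parts (b) to (d) can be checked with the standard library. This is a sketch under a loud assumption: the mean of 2.9 kg and standard deviation of 0.45 kg are the values worked out in part (c) of this answer, not values confirmed by the question itself.

```python
from statistics import NormalDist

# ASSUMPTION: mu and sigma are the values used in part (c) of this answer.
weight_loss = NormalDist(mu=2.9, sigma=0.45)

# (b) Loss exceeded by 99% of clients = the 1st percentile.
exceeded_by_99_percent = weight_loss.inv_cdf(0.01)

# (c) Share of clients losing between 1.9 kg and 3.9 kg.
share_between = weight_loss.cdf(3.9) - weight_loss.cdf(1.9)

# (d) Expected number of the 8500 clients losing less than 3.5 kg.
clients_below_3_5 = 8500 * weight_loss.cdf(3.5)

print(f"(b) {exceeded_by_99_percent:.2f} kg")
print(f"(c) {share_between:.1%}")
print(f"(d) {clients_below_3_5:.0f} clients")
```

Because the software does not round z to two decimals, its results can differ slightly from answers read off a printed z-table.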
(a) (2 marks) What are the two variables the doctors at AIS will need to include in the analysis?
What type of variables are they?
Weight (kg) and Body Mass Index (BMI) are the two variables that need to be included in the
analysis. Both are quantitative variables; weight is the explanatory variable and BMI is the
response variable.
(b) (4 marks) Use an appropriate graph to display the relationship between the two variables
identified in part (a). Label the axes correctly, include units of measurement and provide an
appropriate title which includes your name.
Correlations
R= .846**
P-Value = .000
(c) (4 marks) From the graph in part (b), describe (in no more than 30 words) the form, direction
and scatter of this relationship, and identify any outliers.
The main objective of this study is to examine the relationship between Body Mass Index (BMI)
and Weight. Our analysis revealed a strong positive correlation between Weight and BMI
(r = .846). The association is statistically significant (P<0.05).
(d) (4 marks) Calculate an appropriate statistic to measure the strength and direction of the
relationship between the two variables you identified in part (a). Justify your choice of this
statistic and interpret what it tells you about the relationship.
Correlations
                                                              Weight (kg)   Body Mass Index (kg/cm^2)
Weight (kg)                Pearson Correlation                1             .846**
                           Sig. (2-tailed)                                  .000
                           Sum of Squares and Cross-products  38978.244     6781.399
                           Covariance                         193.922       33.738
                           N                                  202           202
Body Mass Index (kg/cm^2)  Pearson Correlation                .846**        1
                           Sig. (2-tailed)                    .000
                           Sum of Squares and Cross-products  6781.399      1648.624
                           Covariance                         33.738        8.202
                           N                                  202           202
**. Correlation is significant at the 0.01 level (2-tailed).
The aim of this analysis was to examine the relationship between body weight and BMI in athletes.
Our analysis revealed a strong positive linear correlation between Weight and BMI, r = 0.846, which
is statistically significant (p<0.05).
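The Pearson correlation in the SPSS table can be recovered from the other figures the same table reports, since r is the covariance divided by the product of the two standard deviations. A small sketch using the values from the Correlations output (variable names are illustrative):

```python
import math

# Values read from the SPSS Correlations table.
cov_weight_bmi = 33.738
var_weight = 193.922
var_bmi = 8.202

# r = cov(x, y) / (sd_x * sd_y)
r_from_cov = cov_weight_bmi / math.sqrt(var_weight * var_bmi)

# The same ratio written with sums of squares and cross-products.
sscp_weight_bmi = 6781.399
ss_weight = 38978.244
ss_bmi = 1648.624
r_from_ss = sscp_weight_bmi / math.sqrt(ss_weight * ss_bmi)

print(f"r from covariance     = {r_from_cov:.3f}")
print(f"r from sums of squares = {r_from_ss:.3f}")
```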
(e) (6 marks) Use SPSS output to write the equation of the regression line which could be used
to predict the BMI of the athletes from their weight, and then plot the regression line on the
graph produced in part (b).
Model Summaryᵇ
Model   R       R Square   Adjusted R Square   Std. Error of the Estimate
1       .846ᵃ   .716       .714                1.53102
a. Predictors: (Constant), Weight (kg)
b. Dependent Variable: Body Mass Index (kg/cm^2)

ANOVAᵃ
Model           Sum of Squares   df    Mean Square   F         Sig.
1  Regression   1179.821         1     1179.821      503.334   .000ᵇ
   Residual     468.803          200   2.344
   Total        1648.624         201
a. Dependent Variable: Body Mass Index (kg/cm^2)
b. Predictors: (Constant), Weight (kg)

Coefficientsᵃ
                  Unstandardized Coefficients   Standardized Coefficients
Model             B        Std. Error           Beta                        t        Sig.
1  (Constant)     9.906    .592                                             16.746   .000
   Weight (kg)    .174     .008                 .846                        22.435   .000
a. Dependent Variable: Body Mass Index (kg/cm^2)
Residuals Statisticsᵃ
                       Minimum    Maximum   Mean      Std. Deviation   N
Predicted Value        16.4824    31.3403   22.9559   2.42276          202
Residual               -3.75746   5.53074   .00000    1.52720          202
Std. Predicted Value   -2.672     3.461     .000      1.000            202
Std. Residual          -2.454     3.612     .000      .998             202
a. Dependent Variable: Body Mass Index (kg/cm^2)
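The slope in the Coefficients table is tied to the figures already shown in the Correlations table: for a simple regression the least-squares slope is the cross-product sum divided by the sum of squares of the predictor, and the standardized coefficient equals the correlation. A brief sketch of that check, using only values printed in the SPSS output:

```python
import math

# Sums of squares and cross-products from the Correlations table.
sscp_weight_bmi = 6781.399
ss_weight = 38978.244

# Least-squares slope for predicting BMI from weight: b = Sxy / Sxx.
slope = sscp_weight_bmi / ss_weight

# Standardized slope: beta = b * sd_x / sd_y (variances from the same table).
var_weight = 193.922
var_bmi = 8.202
beta = slope * math.sqrt(var_weight) / math.sqrt(var_bmi)

# Standard error of the estimate = sqrt(residual mean square) from the ANOVA table.
std_error_estimate = math.sqrt(2.344)

print(f"slope = {slope:.3f}, beta = {beta:.3f}, s = {std_error_estimate:.3f}")
```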
(f) (3 marks) Using the regression equation from part (e), predict the BMI of the athlete who
weighs 130kg. Would you consider this to be an accurate prediction? Why or why not?
Coefficientsᵃ
                  Unstandardized Coefficients   Standardized Coefficients
Model             B        Std. Error           Beta                        t        Sig.
1  (Constant)     9.906    .592                                             16.746   .000
   Weight (kg)    .174     .008                 .846                        22.435   .000
a. Dependent Variable: Body Mass Index (kg/cm^2)
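Ŷ = a + bx
= 9.906 + 0.174(Weight)
= 9.906 + 0.174(130)
= 32.53
The predicted BMI is about 32.5. This prediction may not be accurate: the largest fitted BMI in the
Residuals Statistics table is 31.34, so 130kg lies beyond the weights observed in the data and the
prediction is an extrapolation.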
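The prediction can be sketched as a small function, together with a rough check of whether 130 kg falls inside the range of weights behind the model. The function name `predict_bmi` is illustrative, and the implied maximum weight is back-solved from the rounded coefficients, so it is approximate rather than a value reported by SPSS.

```python
# Coefficients from the SPSS Coefficients table.
INTERCEPT = 9.906
SLOPE = 0.174


def predict_bmi(weight_kg):
    """Predicted BMI from the fitted line: BMI-hat = 9.906 + 0.174 * weight."""
    return INTERCEPT + SLOPE * weight_kg


predicted_130 = predict_bmi(130)

# The largest fitted value in the Residuals Statistics table belongs to the
# heaviest athlete, so solving the line for weight gives an approximate maximum.
max_fitted_bmi = 31.3403
implied_max_weight = (max_fitted_bmi - INTERCEPT) / SLOPE
is_extrapolation = 130 > implied_max_weight

print(f"Predicted BMI at 130 kg: {predicted_130:.2f}")
print(f"Approximate heaviest weight in the data: {implied_max_weight:.1f} kg")
print(f"Extrapolation: {is_extrapolation}")
```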
(g) (1 mark) What proportion of the variability in the BMI of the athletes has been explained by
their weight?
Model Summaryᵇ
Model   R       R Square   Adjusted R Square   Std. Error of the Estimate
1       .846ᵃ   .716       .714                1.53102
a. Predictors: (Constant), Weight (kg)
b. Dependent Variable: Body Mass Index (kg/cm^2)
The proportion of the variability in the BMI of the athletes explained by their weight is
R Square = 0.716, that is, 71.6%.
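R Square can also be read as a ratio of sums of squares from the ANOVA table: the regression sum of squares divided by the total. A minimal check using the figures printed in the SPSS output:

```python
# Sums of squares from the SPSS ANOVA table.
ss_regression = 1179.821
ss_residual = 468.803
ss_total = 1648.624

r_square = ss_regression / ss_total            # explained share of variability
r_square_alt = 1 - ss_residual / ss_total      # same quantity via the residuals

print(f"R Square = {r_square:.3f}")
```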
(b) (3 marks) What is an appropriate model to represent the above variable? Write down the
parameters of the model.
The Binomial probability model is the appropriate one. The parameters of the model are the
number of trials, n, and the probability of a defective microchip, p.
(c) (4 marks) Discuss how the conditions of the appropriate model are satisfied in the current
study. Explain the conditions in the context of the problem.
The normal distribution is the most important and most widely used distribution in statistics.
It is sometimes called the "bell curve," although the tonal qualities of such a bell would be
less than pleasing. It is continuous for all values of X between -∞ and ∞, so that every
interval of real numbers has a probability other than zero.
(d) (2 marks) Find the mean and standard deviation of the variable using the parameters of the
model.
Mean = np = 12*0.15
= 1.8
Variance = np(1-p) = 12*0.15*0.85 = 1.53
Standard deviation = √1.53 = 1.24
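For a binomial count the usual formulas are mean = np, variance = np(1 - p) and standard deviation = √(np(1 - p)). The sketch below applies them with n = 12 and p = 0.15, which are the values used in the calculation above; whether those are the parameters given in the question is an assumption.

```python
import math

# ASSUMPTION: n and p are the values used in this part of the answer.
n = 12
p = 0.15

mean = n * p
variance = n * p * (1 - p)
std_dev = math.sqrt(variance)

print(f"mean = {mean:.2f}, variance = {variance:.2f}, sd = {std_dev:.2f}")
```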
(e) (3 marks) Find the probability that at most 3 of the 10 randomly selected microchips are
defective.
Naively, this is the Binomial distribution, where finding a defective chip is the event of interest,
so it is labeled a "SUCCESS". The sampling must, as stated in the problem, be done with replacement.
The formula for finding "k successes in N trials", or in this case 3 defective chips from a
sample of 5 selected randomly with replacement, is:
(N choose k) p^k (1-p)^(N-k), where N is the sample size, in this case 5, p is the probability of
"SUCCESS", which in this case is p=0.2 as given, and (N choose k) = N!/{k!(N-k)!}
Expanding the factorials for this combinatoric results in:
N choose k = 5!/{3!(5-3)!} = 5!/(3!2!) = (5*4*3*2*1)/{(3*2*1)(2*1)}
= (5*4)/(2*1) <-- 3*2*1 = 6 cancels out
= 20/2
= 10
So the probability for part (a) is 10*(0.2)^3*(0.8)^2 = 10 * 0.2 * 0.2 * 0.2 * 0.8 * 0.8 = 0.0512, or
just over 5%. Using this formula with these parameters and an Excel spreadsheet, the following
table shows the probability of getting k=0,1,2,3,4,5 successes (defective computer chips). As a side
note, the Excel spreadsheet function COMBIN calculates the combinatoric part of the formula.
k P(X=k)
0 0.32768
1 0.4096
2 0.2048
3 0.0512
4 0.0064
5 0.00032
So for part B, the probability of at least one defective chip is the same as the probability of ONE,
TWO, THREE, FOUR, or even FIVE defective chips in the sample. Adding the last probabilities in
the table results in:
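The table above, and the sum it leads to, can be reproduced with a short sketch that uses `math.comb` in place of Excel's COMBIN. It uses the same N = 5 and p = 0.2 as the worked example; the helper name `binomial_pmf` is illustrative.

```python
import math


def binomial_pmf(k, n, p):
    """P(X = k) for a binomial count: (n choose k) * p^k * (1 - p)^(n - k)."""
    return math.comb(n, k) * p**k * (1 - p) ** (n - k)


n, p = 5, 0.2  # the values used in the worked example above
pmf = [binomial_pmf(k, n, p) for k in range(n + 1)]

for k, prob in enumerate(pmf):
    print(k, round(prob, 5))

# "At least one defective" is the sum for k = 1..5, or equivalently 1 - P(X = 0).
at_least_one = sum(pmf[1:])
print("P(X >= 1) =", round(at_least_one, 5))
```

The complement form 1 - P(X = 0) gives the same value with a single term, which is usually the quicker route for "at least one" questions.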
(f) (5 marks) In a random sample of 150 microchips produced by the company, determine the
probability that 30 or more will be defective. State and check relevant assumptions,
conditions or rules of thumb that should be considered before performing the calculations to
determine this probability.
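One common way to approach a question like part (f) is the normal approximation to the binomial, after checking the rule of thumb that both np and n(1 - p) are at least 10. The sketch below is hedged on two points: p = 0.15 is borrowed from the calculation in part (d) and is an assumption here, and the threshold of 10 is one widely used rule of thumb that may differ from the one taught in the course.

```python
import math
from statistics import NormalDist

n = 150
p = 0.15  # ASSUMPTION: defect probability taken from the part (d) calculation

mean = n * p
std_dev = math.sqrt(n * p * (1 - p))

# Rule-of-thumb check before using the normal approximation.
conditions_met = mean >= 10 and n * (1 - p) >= 10

# P(X >= 30) with a continuity correction: use 29.5 as the cut-off.
approx_tail = 1 - NormalDist(mean, std_dev).cdf(29.5)

# Exact binomial tail for comparison.
exact_tail = sum(
    math.comb(n, k) * p**k * (1 - p) ** (n - k) for k in range(30, n + 1)
)

print(f"np = {mean:.1f}, n(1-p) = {n * (1 - p):.1f}, conditions met: {conditions_met}")
print(f"normal approximation: {approx_tail:.4f}")
print(f"exact binomial:       {exact_tail:.4f}")
```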