
FE590. Assignment #1.

2017-09-22

Question 1

Question 1.1

Generate a vector x containing 10,000 realizations of a random normal variable with mean 2.0 and standard
deviation 3.0, and plot a histogram of x using 100 bins. To get help generating the data, you can type ?rnorm
at the R prompt, and to get help with the histogram function, type ?hist at the R prompt.

Solution:

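# Draw 10,000 values from a normal distribution with mean 2 and sd 3
x <- rnorm(10000, mean = 2, sd = 3)
# Build 101 equally spaced break points (100 bins) that span the data
i <- floor(min(x))
j <- ceiling(max(x))
breaks <- seq(i, j, length.out = 101)
hist(x, breaks = breaks)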

[Figure: "Histogram of x"; y-axis: Frequency]

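The link between break points and bins can be checked without drawing anything: 101 equally spaced breaks give 100 bins. The sketch below is illustrative only; the seed and the names x_demo, breaks_demo, and h are assumptions introduced here, not part of the assignment.

```r
set.seed(1)  # arbitrary seed, only so the sketch is repeatable
x_demo <- rnorm(10000, mean = 2, sd = 3)

# 101 equally spaced break points covering the whole sample
breaks_demo <- seq(floor(min(x_demo)), ceiling(max(x_demo)), length.out = 101)

# plot = FALSE returns the bin counts instead of drawing the histogram
h <- hist(x_demo, breaks = breaks_demo, plot = FALSE)

length(h$counts)  # number of bins
sum(h$counts)     # every observation falls in some bin
```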
Question 1.2

Confirm that the mean and standard deviation are what you expected using the commands mean and sd.

Solution:

mean(x)

## [1] 2.072998
sd(x)

## [1] 3.012833
The sample mean (2.073) and sample standard deviation (3.013) are close to the expected values of 2.0 and 3.0.

Question 1.3

Using the sample function, take out 10 random samples of 500 observations each. Calculate the mean of
each sample. Then calculate the mean of the sample means and the standard deviation of the sample means.

Solution:

y=replicate(10,sample(x,500))
z=colMeans(y)
mean(z)

## [1] 2.052088
sd(z)

## [1] 0.104518
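
As a rough check on the result above, the standard deviation of the sample means should be near the standard error sigma / sqrt(n). The sketch below assumes the population standard deviation of 3.0 from Question 1.1 and ignores the small correction for sampling without replacement from a finite vector; the names sigma, n, and se are introduced here for illustration.

```r
sigma <- 3.0   # population standard deviation assumed from Question 1.1
n <- 500       # observations per sample

# Theoretical standard error of the mean of n observations
se <- sigma / sqrt(n)
se   # about 0.134
```

With only 10 sample means, the observed value of 0.1045 is itself a noisy estimate, so agreement to within a few hundredths is about what one would expect.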

Question 2
Sir Francis Galton was a controversial genius who discovered the phenomenon of “Regression to the Mean.”
In this problem, we will examine some of the data that illustrates the principle.

Question 2.1

First, install and load the library HistData that contains many famous historical data sets. Then load the
Galton data using the command data(Galton). Take a look at the first few rows of Galton data using the
command head(Galton).

Solution:

library(HistData)
data(Galton)
attach(Galton)
head(Galton)

##   parent child
## 1   70.5  61.7
## 2   68.5  61.7
## 3   65.5  61.7
## 4   64.5  61.7
## 5   64.0  61.7
## 6   67.5  62.2
As you can see, the data consist of two columns. One is the height of a parent, and the second is the height
of a child. Both heights are measured in inches.
Plot one histogram of the heights of the children and one histogram of the heights of the parents. These
histograms should use the same x and y scales.

Solution:

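# Parent heights in red; child heights overlaid in translucent green.
# xlim spans both columns so the child histogram is not clipped.
hist(Galton$parent, col = "red", main = "Heights", xlab = "Height (inches)", xlim = range(Galton))
hist(Galton$child, add = TRUE, col = rgb(0, 1, 0, 0.5))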

[Figure: "Heights", overlaid histograms of parent and child heights; y-axis: Frequency]

Comment on the shapes of the histograms.

Solution: Both histograms appear roughly symmetric, with several peaks near the
middle, and are not noticeably skewed to either side.

Question 2.2

Make a scatterplot of the height of the child as a function of the height of the parent. Label the x-axis “Parent
Height (inches),” and label the y-axis “Child Height (inches).” Give the plot a main title of “Galton Data.”
Perform a linear regression of the child’s height onto the parent’s height. Add the regression line to the
scatter plot.
Using the summary command, print a summary of the linear regression results.

Solution:

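# Scatterplot of child height against parent height
plot(parent, child, xlab = "Parent Height (inches)", ylab = "Child Height (inches)", main = "Galton Data")
# Regress child height on parent height and overlay the fitted line
mod <- lm(child ~ parent)
abline(mod)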

[Figure: "Galton Data" scatterplot with the fitted regression line; x-axis: Parent Height (inches), y-axis: Child Height (inches)]

summary(mod)

##
## Call:
## lm(formula = child ~ parent)
##
## Residuals:
##     Min      1Q  Median      3Q     Max
## -7.8050 -1.3661  0.0487  1.6339  5.9264
##
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)
## (Intercept) 23.94153    2.81088   8.517   <2e-16 ***
## parent       0.64629    0.04114  15.711   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 2.239 on 926 degrees of freedom
## Multiple R-squared:  0.2105,  Adjusted R-squared:  0.2096
## F-statistic: 246.8 on 1 and 926 DF,  p-value: < 2.2e-16
coefficients(mod)

## (Intercept)      parent
##  23.9415302   0.6462906
What is the slope of the line relating a child’s height to the parent’s height? Can you guess why Galton says
that there is a “regression to the mean”?

Solution: In the fitted line y = mx + c, the coefficient of x is the slope. From
the coefficients output, the slope is 0.6462906. Because the slope is less than 1,
children of unusually tall or unusually short parents are predicted to be closer
to the average height than their parents are. Extreme values tend to be followed
by more moderate ones, which is what Galton called regression to the mean.
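
To make the effect concrete, the fitted coefficients printed above can be applied to a short and a tall parent. This is a minimal sketch using only those two numbers; the parent heights of 64 and 72 inches are example inputs chosen here, and the variable names are illustrative.

```r
intercept <- 23.9415302   # from coefficients(mod) above
slope <- 0.6462906        # from coefficients(mod) above

parent_height <- c(64, 72)   # example inputs, in inches

# Predicted child height on the fitted line
predicted_child <- intercept + slope * parent_height
predicted_child   # about 65.30 and 70.47
```

The 64-inch parent is predicted to have a taller child and the 72-inch parent a shorter one, so both predictions move toward the middle of the data.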

Is there a significant relationship between a child’s height and the parent’s height? If so, how can you tell from the
regression summary?

Solution: Yes, there is a significant relationship between the two heights. In
the summary, the p-value for the parent coefficient is below 2e-16, far smaller
than 0.05, and it is marked with ***, the strongest significance code (p < 0.001);
a blank marks the weakest. The p-value of the F-statistic (< 2.2e-16) leads to
the same conclusion.
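
The p-value in the summary can be reproduced from the reported t value and residual degrees of freedom. The sketch below assumes the usual two-sided t-test for a regression coefficient and takes both inputs from the printed summary; the variable names are illustrative.

```r
t_value <- 15.711   # t value for the parent coefficient, from the summary
df_resid <- 926     # residual degrees of freedom, from the summary

# Two-sided p-value: probability of a t statistic at least this extreme
p_value <- 2 * pt(-abs(t_value), df = df_resid)
p_value < 2e-16   # TRUE, matching the "<2e-16" shown in the summary
```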

Question 3
If necessary, install the ISwR package, and then attach the bp.obese data from the package. The data frame
has 102 rows and 3 columns. It contains data from a random sample of Mexican-American adults in a small
California town.

Question 3.1

The variable sex is an integer code with 0 representing male and 1 representing female. Use the table
function operation on the variable ‘sex’ to display how many men and women are represented in the sample.

Solution:

library(ISwR)
data(bp.obese)
attach(bp.obese)
table(sex)

## sex
##  0  1
## 44 58

Question 3.2

The cut function can convert a continuous variable into a categorical one. Convert the blood pressure variable
bp into a categorical variable called bpc with break points at 80, 120, and 240. Rename the levels of bpc
using the command levels(bpc) <- c("low", "high").

Solution:

bpc=cut(bp,breaks=c(80,120,240))
levels(bpc) <- c("low", "high")
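
By default, cut builds intervals that are open on the left and closed on the right, so a reading of exactly 120 falls in the lower group. The small sketch below shows this on made-up readings; bp_demo and bpc_demo are hypothetical names and values, not data from bp.obese.

```r
bp_demo <- c(95, 120, 121, 180)   # made-up readings for illustration

# Two intervals: (80,120] and (120,240]
bpc_demo <- cut(bp_demo, breaks = c(80, 120, 240))
levels(bpc_demo)

# Relabel the intervals in order
levels(bpc_demo) <- c("low", "high")
table(bpc_demo)
```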

Question 3.3

Use the table function to display a relationship between sex and bpc.

Solution:

table(sex,bpc)

##    bpc
## sex low high
##   0  16   28
##   1  28   30

Question 3.4

Now cut the obese variable into a categorical variable obesec with break points 0, 1.25, and 2.5. Rename
the levels of obesec using the command levels(obesec) <- c("low", "high").
Use the ftable function to display a 3-way relationship between sex, bpc, and obesec.

Solution:

obesec=cut(obese,breaks=c(0,1.25,2.5))
levels(obesec) <- c("low", "high")
ftable(sex,bpc,obesec)

##          obesec low high
## sex bpc
## 0   low          12    4
##     high         15   13
## 1   low          14   14
##     high          4   26
Which group do you think is most at risk of suffering from obesity?

Solution: The group most at risk of obesity is women with high blood pressure:
26 of the 30 women in the high bpc category (about 87%) are also in the high
obesec category.
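
The comparison between groups is easier to read as row proportions. The sketch below re-enters the counts from the ftable output by hand, so it runs without the ISwR package; the group labels and variable names are chosen here for illustration.

```r
# Counts copied from the ftable output (rows: sex/bpc group, columns: obesec)
counts <- matrix(
  c(12,  4,
    15, 13,
    14, 14,
     4, 26),
  ncol = 2, byrow = TRUE,
  dimnames = list(
    group  = c("male/low bp", "male/high bp", "female/low bp", "female/high bp"),
    obesec = c("low", "high")
  )
)

# Share of each group that falls in the high obesec category
share_high <- prop.table(counts, margin = 1)[, "high"]
round(share_high, 2)   # 0.25, 0.46, 0.50, 0.87
```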

Question 4

Using the Boston data in the MASS library, run a linear regression fit to determine a predictive model for
the median value of a home using the indicators of rooms per dwelling and the property tax.

Solution:

library(MASS)
data(Boston)
attach(Boston)
# Regress average rooms per dwelling (rm) on the property-tax rate (tax)
lm.fit <- lm(rm ~ tax)
plot(rm ~ tax, data = Boston)
abline(lm.fit)

[Figure: scatterplot of rm against tax with the fitted regression line; x-axis: tax, y-axis: rm]

summary(lm.fit)

##
## Call:
## lm(formula = rm ~ tax)
##
## Residuals:
##      Min       1Q   Median       3Q      Max
## -2.40980 -0.43974 -0.05209  0.37595  2.80920
##
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)
## (Intercept)  6.7816725  0.0784284  86.470  < 2e-16 ***
## tax         -0.0012175  0.0001776  -6.855 2.09e-11 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.6727 on 504 degrees of freedom
## Multiple R-squared:  0.08529,  Adjusted R-squared:  0.08348
## F-statistic:    47 on 1 and 504 DF,  p-value: 2.087e-11
Is there evidence that the indicators are useful (why or why not)?

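Solution: The fitted model regresses rm on tax rather than the median home value
on both indicators. Within that model there is evidence of a relationship: the
tax coefficient is significant (p = 2.09e-11, marked ***). The fit is weak,
however. The multiple R-squared of 0.08529 means tax explains only about 8.5% of
the variation in rm, and the residual standard error of 0.6727 has to be judged
against the range of rm (roughly 4 to 8 in the plot), so it does not by itself
show that the predicted values are close to the observed ones.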
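
A fit with the median home value as the response and both indicators as predictors, as the question wording describes, might be sketched as follows. It assumes the MASS package is installed and uses the Boston column names medv, rm, and tax; fit_medv is a name chosen here for illustration, and no output is shown.

```r
library(MASS)  # provides the Boston data frame

# Median home value (medv) modelled on rooms per dwelling (rm) and tax rate (tax)
fit_medv <- lm(medv ~ rm + tax, data = Boston)

# Coefficient table: estimates, standard errors, t values, p-values
coef(summary(fit_medv))

# Share of the variance in medv explained by the two indicators
summary(fit_medv)$r.squared
```

Passing data = Boston keeps the column lookup explicit, so the sketch does not depend on attach.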
