Returns To Education: Chapter 1: Defining and Collecting Data
The data cover 3,010 men aged 24 to 34, with 34 variables, collected from the
Young Men Cohort of the National Longitudinal Survey.
First, read the data description on the second spreadsheet. Second, create a
random sample of size 30 from the data. Finally, answer the following
questions.
Chapter 1: Defining and Collecting Data
1. How was this data collected? Is the sample representative of the population?
2. Does the data contain missing values? How do you find them? How do you deal with
them?
Chapter 2: Organizing and Visualizing Variables
3. Classify each of the following variables: nearc4, educ, age, black, wage, IQ, married,
exper, lwage and expersq.
4. Use your subsample to graph the variables in exercise 3. Describe the characteristics
of each variable.
Chapter 3: Numerical Descriptive Measures
5. Use your subsample to calculate the descriptive statistics for wage, IQ, educ and exper.
6. Use your subsample to calculate the covariance and the coefficient of correlation
between wage and IQ. Repeat for wage and educ, wage and exper.
Chapter 5: Discrete Probability Distributions
7. Suppose that the proportion of blacks in the population is equal to the proportion of
blacks in your subsample.
7.1 How many blacks do you expect in your subsample? How many blacks are in the
data?
7.2 Choose a simple random sample of size 20 from the population. What is the
probability that it contains 6 blacks?
7.3 Choose 5 people at random from your subsample. Find the probability that at
least one of them is black.
8. Suppose that “educ” is a discrete random variable that follows a Poisson distribution
whose mean equals the sample mean in your subsample. If we randomly choose a
person, what is the probability that he/she has 6 years of education?
Chapter 6: The Normal Distribution and Other Continuous Distributions
9. Suppose that lwage is a normally distributed random variable with mean μ = 6.26 and
standard deviation σ = 0.44. Find the probability that lwage is in the interval (5,6).
10. Assume age is a discrete uniform random variable X such that 24 ≤ x ≤ 34. Find the
mean and the variance of age. Compare these values to the corresponding values in your
subsample. Create a bar chart for age. Is age uniformly distributed?
Chapter 7: Sampling Distributions
Assume $\mathrm{IQ} \sim N(\mu, \sigma^2)$, where $\mu = \mathrm{mean}(\mathrm{IQ})$ and $\sigma^2 = \mathrm{var}(\mathrm{IQ})$.
11. Find the probability that a random sample n = 20 observations will have a sample
mean IQ that falls in the interval from 100 to 103.
12. How large must the random sample be if we want the standard error of the sample
mean to be 1?
Chapter 8: Confidence Interval Estimation
13. Construct a 97% confidence interval on the true mean lwage based on data in the
subsample. Suppose that lwage is a normally distributed random variable with standard
deviation σ = 0.44.
14. Construct a 94% confidence interval on the true mean IQ based on data in the
subsample. Assume that IQ is a normally distributed random variable.
15. If we want the error in estimating the mean lwage from the two-sided confidence
interval to be 0.2 at 95% confidence, what sample size should be used? Assume that
lwage is a normally distributed random variable with standard deviation σ = 0.44.
16. Calculate a 99% confidence interval on the true proportion of all people who lived
near a 4-year college (nearc4) based on the data.
17. If we want the error in estimating the proportion of blacks from the two-sided
confidence interval to be 0.01 at 99% confidence, what sample size should be used?
Chapter 9: Fundamentals of Hypothesis Testing: One-Sample Tests
18. Use your subsample to test the hypothesis H0: mean(IQ) = 100 against H1: mean(IQ)
≠ 100 at the 2% significance level by three methods. Assume that IQ is a normally
distributed random variable with standard deviation σ = 15.
19. Use your subsample to test the hypothesis H0: mean(lwage) = 6 against H1:
mean(lwage) > 6 at the 10% significance level. Assume that lwage is a normally
distributed random variable.
20. It is argued that the proportion of people with less than 10 years of work experience is
not more than 0.07. Use the given data to test this claim at the 2% significance level.
Chapter 10: Two-Sample Tests
Assume that the variables are normally distributed and that the population
variances are equal.
21. Test the difference in mean wage between the groups defined by the variable black
at the 1% significance level, based on the subsample.
22. Test the difference between the two independent population proportions of people
with low IQ (IQ < 90), grouped by the variable black, at the 5% significance level.
23. Test the difference in the variance of lwage between the groups defined by the
variable black at the 1% significance level, based on the subsample.
Chapter 11: Analysis of Variance
Assume that the groups defined by the variable black are normally distributed and have
equal variances. Use your subsample to test the hypothesis.
24. Use the F-test to test the difference in mean wage between the groups defined by the
variable black at the 5% significance level. Compare the result with the t-test.
Chapter 13: Simple Linear Regression
Suppose you are interested in studying the effect of education on wages. Use your
subsample to estimate the regression of wage on educ.
25. Interpret the slope and the intercept in the output.
Model 1:
$\mathrm{wage} = \beta_0 + \beta_1\,\mathrm{educ} + \beta_2\,\mathrm{exper} + \varepsilon_1$
Model 2:
$\mathrm{lwage} = \beta_0 + \beta_1\,\mathrm{educ} + \beta_2\,\mathrm{exper} + \beta_3\,\mathrm{expersq} + \varepsilon_2$
Interpret the regression coefficients in the models. Which model is better? Why?
Instructions
Q Answers R-code
Preparation:
# install the packages if they are not installed yet
install.packages("readxl")
install.packages("ggplot2")
library(ggplot2)
library(readxl)
# copy wage.xlsx to the Documents folder or change the path
wage <- read_excel("wage.xlsx")
# choose a number for set.seed
id_number <- 101
set.seed(id_number)
# generate a random sample of size 30 from the data
yourdata <- wage[sample(1:nrow(wage), 30), ]
Each student uses their own id_number so that the samples differ.
1 The data were collected by surveying men aged 24 to 34. Some people were most
likely not reached by the survey, so the sample is prone to bias.
2 The data contain 2059 missing values.
Missing data can be handled in two ways: imputation or removal. Imputation can
itself be done in two ways: re-recording the value or estimating it. The appropriate
treatment depends on the specific case. Here, we can remove the observations that
have missing values.
R code:
table(is.na(wage))
# new_data <- na.omit(wage)
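To locate missing values per variable rather than only counting them in total, a per-column tally helps. The following is a minimal sketch on a small made-up data frame (`toy` and its values are hypothetical, standing in for `wage`); it only uses base R.

```r
# Hypothetical toy data standing in for `wage`; the values are made up.
toy <- data.frame(
  wage = c(548, 481, NA, 721),
  IQ   = c(NA, 93, 103, NA),
  educ = c(7, 12, 12, 11)
)

na_per_column <- colSums(is.na(toy))  # number of missing values in each variable
na_total      <- sum(is.na(toy))      # total number of missing cells
complete      <- na.omit(toy)         # keep only rows without any missing value

print(na_per_column)
print(na_total)
print(nrow(complete))
```

Applied to the real data, `colSums(is.na(wage))` would show which variables carry the missing values before deciding whether removal is acceptable.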
4 R code:
# dot plot
ggplot(data = yourdata, aes(x = educ)) + geom_dotplot(fill = "blue", binwidth = 0.5)
# stem-and-leaf plot
stem(yourdata$educ, 2)
# bar chart
ggplot(data = yourdata, aes(x = factor(black), fill = factor(black))) + geom_bar()
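5 R code:
summary(yourdata[, c("wage", "IQ", "educ", "exper")])
6 Compute the covariance and the correlation coefficient r for each pair of variables
when missing data are present (rows with missing values are dropped by na.omit).
R code:
cov(na.omit(yourdata[, c("wage", "IQ", "educ", "exper")]))
cor(na.omit(yourdata[, c("wage", "IQ", "educ", "exper")]))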
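10 For a discrete uniform variable on the integers from $a = 24$ to $b = 34$:
$$\mu = \frac{a + b}{2} = 29, \qquad \sigma^2 = \frac{(b - a + 1)^2 - 1}{12} = 10$$
The graph suggests that age may be uniformly distributed.
R code:
(24 + 34)/2
# 29
((34 - 24 + 1)^2 - 1)/12
# 10
mean(yourdata$age)
# 28.23333
var(yourdata$age)
# 10.59885
# bar chart of age
barplot(table(yourdata$age))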
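The closed-form mean and variance of the discrete uniform distribution can be checked by enumerating the support directly. This is a small sketch, assuming age takes each integer from 24 to 34 with equal probability; the variable names are illustrative only.

```r
x <- 24:34                                 # support of the discrete uniform variable
mu <- mean(x)                              # population mean
sigma2 <- mean((x - mu)^2)                 # population variance (divides by N, not N - 1)
closed_form <- ((34 - 24 + 1)^2 - 1) / 12  # closed-form variance

print(c(mu = mu, sigma2 = sigma2, closed_form = closed_form))
```

Note that `var(x)` would not reproduce this value, because `var()` divides by N - 1 and therefore estimates a sample variance rather than the population variance.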
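11 With $n = 20$:
$$P(100 \le \bar{X} \le 103) = \Phi\left(\frac{103 - \mu}{\sigma/\sqrt{n}}\right) - \Phi\left(\frac{100 - \mu}{\sigma/\sqrt{n}}\right) = 0.3246231$$
R code:
mu <- mean(wage$IQ, na.rm = TRUE)
sig <- sd(wage$IQ, na.rm = TRUE)
# 15.42376
pnorm((103 - mu)*sqrt(20)/sig) - pnorm((100 - mu)*sqrt(20)/sig)
# 0.3246231
# equivalently:
# pnorm(103, mu, sig/sqrt(20)) - pnorm(100, mu, sig/sqrt(20))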
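12 The standard error of the sample mean is $SE(\bar{X}) = \sigma/\sqrt{n}$, so
$$n = \left(\frac{\sigma}{SE(\bar{X})}\right)^2 = 237.8923 \;\Rightarrow\; n = 238$$
R code:
(sig/1)^2
# 237.8923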
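13 $n = 30$, $\sigma = 0.44$, $1 - \alpha = 97\%$, so $z_{\alpha/2} = 2.17$. The confidence interval is
$$\left(\bar{x} - z_{\alpha/2}\frac{\sigma}{\sqrt{n}},\ \bar{x} + z_{\alpha/2}\frac{\sigma}{\sqrt{n}}\right)$$
R code:
z.alpha.2 <- qnorm(1.5/100, lower.tail = FALSE)
# 2.17009
n <- length(yourdata$lwage)
# 30
mu <- mean(yourdata$lwage)
# 6.289448
CI <- c(mu - z.alpha.2*0.44/sqrt(n), mu + z.alpha.2*0.44/sqrt(n))
# 6.115119 6.463777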
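The interval in exercise 13 follows one fixed recipe, so it can be wrapped in a small helper. The sketch below assumes a known population standard deviation; the function name `z_ci` and its arguments are hypothetical, and the inputs are the summary values reported for this subsample.

```r
# Hypothetical helper: two-sided z confidence interval for a mean with known sigma.
z_ci <- function(xbar, sigma, n, conf) {
  alpha <- 1 - conf
  z <- qnorm(alpha / 2, lower.tail = FALSE)  # upper alpha/2 critical value
  half <- z * sigma / sqrt(n)                # margin of error
  c(lower = xbar - half, upper = xbar + half)
}

# Summary values from exercise 13: sample mean 6.289448, sigma 0.44, n 30, 97% confidence.
ci <- z_ci(xbar = 6.289448, sigma = 0.44, n = 30, conf = 0.97)
print(ci)
```

Passing the confidence level rather than a hard-coded tail probability makes it harder to confuse alpha with alpha/2.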
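14 $n = 2061$, $\bar{x} = 102.450$, $s = 15.424$, $1 - \alpha = 94\%$, so $z_{\alpha/2} = 1.881$. The confidence interval is
$$\left(\bar{x} - z_{\alpha/2}\frac{s}{\sqrt{n}},\ \bar{x} + z_{\alpha/2}\frac{s}{\sqrt{n}}\right)$$
R code:
IQ <- na.omit(wage$IQ)
n <- length(IQ)
mu <- mean(IQ)
sig <- sd(IQ)
z <- qnorm(0.03, lower.tail = FALSE)
CI <- c(mu - z*sig/sqrt(n), mu + z*sig/sqrt(n))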
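15 Setting the margin of error equal to 0.2:
$$z_{\alpha/2}\frac{\sigma}{\sqrt{n}} = 0.2 \;\Rightarrow\; n = \left(\frac{z_{\alpha/2}\,\sigma}{0.2}\right)^2 = 18.59334 \;\Rightarrow\; n = 19$$
R code:
n <- (1.96*0.44/0.2)^2
# 18.59334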
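16 $n = 3010$, $f = 2053/3010$, $1 - \alpha = 99\%$, so $z_{\alpha/2} = 2.576$. The confidence interval is
$$\left(f - z_{\alpha/2}\sqrt{\frac{f(1-f)}{n}},\ f + z_{\alpha/2}\sqrt{\frac{f(1-f)}{n}}\right)$$
R code:
n <- length(wage$nearc4)
f <- length(wage$nearc4[wage$nearc4 == 1])/n
# table(wage$nearc4)
CI <- c(f - 2.576*sqrt(f*(1 - f)/n), f + 2.576*sqrt(f*(1 - f)/n))
# 0.6601949 0.7039247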
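17 Setting the margin of error equal to 0.01, with $z_{\alpha/2} = 2.576$ at 99% confidence:
$$z_{\alpha/2}\sqrt{\frac{f(1-f)}{n}} = 0.01 \;\Rightarrow\; n = \left(\frac{z_{\alpha/2}}{0.01}\right)^2 f(1-f) = 11878.5 \;\Rightarrow\; n = 11879$$
R code:
f <- length(wage$black[wage$black == 1])/length(wage$black)
# table(wage$black)
n <- (2.576/0.01)^2*f*(1 - f)
# 11878.5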
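The sample-size questions in exercises 15 and 17 both invert the margin-of-error formula and round up. A minimal sketch of that inversion follows; the function names `n_for_mean` and `n_for_prop` are hypothetical, and the proportion example uses the conservative planning value p = 0.5 as an assumption rather than the proportion observed in the data.

```r
# Hypothetical helpers: smallest n that keeps the margin of error at or below `error`.
n_for_mean <- function(sigma, error, conf) {
  z <- qnorm((1 - conf) / 2, lower.tail = FALSE)
  ceiling((z * sigma / error)^2)
}

n_for_prop <- function(p, error, conf) {
  z <- qnorm((1 - conf) / 2, lower.tail = FALSE)
  ceiling((z / error)^2 * p * (1 - p))
}

n_mean <- n_for_mean(sigma = 0.44, error = 0.2, conf = 0.95)  # setting of exercise 15
n_prop <- n_for_prop(p = 0.5, error = 0.01, conf = 0.99)      # assumed worst case p = 0.5
print(c(n_mean = n_mean, n_prop = n_prop))
```

Rounding up with `ceiling()` matters: rounding to the nearest integer could leave the margin of error slightly above the target.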
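18 $H_0: \mu = 100$, $H_1: \mu \ne 100$, $\alpha = 0.02$, $n = 24$, $\bar{x} = 105.875$, $\sigma = 15$, $z_{\alpha/2} = 2.326$.
Method 1 (confidence interval):
$$CI = \left(\bar{x} - z_{\alpha/2}\frac{\sigma}{\sqrt{n}},\ \bar{x} + z_{\alpha/2}\frac{\sigma}{\sqrt{n}}\right) = (98.75204,\ 112.99796)$$
The hypothesized value 100 lies inside the confidence interval, so we do not reject H0.
Method 2 (critical value):
$$z = \frac{(\bar{x} - \mu_0)\sqrt{n}}{\sigma} = 1.92, \qquad z_{\alpha/2} = 2.326$$
Since $|z| < z_{\alpha/2}$, we do not reject H0.
Method 3 (p-value):
$$p\text{-value} = 2\,\Phi(-|z|) = 0.05501383$$
Since p-value $> \alpha$, we do not reject H0.
R code:
IQ <- na.omit(yourdata$IQ)
n <- length(IQ)
mu <- mean(IQ)
z.alpha.2 <- qnorm(0.01, lower.tail = FALSE)
CI <- c(mu - z.alpha.2*15/sqrt(n), mu + z.alpha.2*15/sqrt(n))
# 98.75204 112.99796
z.test <- (mu - 100)*sqrt(n)/15
# 1.918767
abs(z.test) < z.alpha.2
# TRUE
2*pnorm(-abs(z.test))
# 0.05501383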
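The three methods in exercise 18 are equivalent ways of reading the same statistic, so they must always agree. The sketch below bundles them into one function to make that visible; `z_test_two_sided` is a hypothetical name, and the inputs are the summary values reported for this subsample.

```r
# Hypothetical helper: two-sided one-sample z test with known sigma, three decision rules.
z_test_two_sided <- function(xbar, mu0, sigma, n, alpha) {
  se <- sigma / sqrt(n)
  z.crit <- qnorm(alpha / 2, lower.tail = FALSE)
  z.stat <- (xbar - mu0) / se
  ci <- c(xbar - z.crit * se, xbar + z.crit * se)
  p.value <- 2 * pnorm(-abs(z.stat))
  list(
    z.stat = z.stat,
    p.value = p.value,
    reject_by_ci = mu0 < ci[1] || mu0 > ci[2],  # mu0 outside the interval
    reject_by_critical = abs(z.stat) > z.crit,  # statistic beyond the critical value
    reject_by_pvalue = p.value < alpha          # p-value below alpha
  )
}

res <- z_test_two_sided(xbar = 105.875, mu0 = 100, sigma = 15, n = 24, alpha = 0.02)
print(res)
```

If the three logical flags ever disagreed, that would point to an inconsistency such as mixing alpha and alpha/2 between the methods.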
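19 $H_0: \mu = 6$, $H_1: \mu > 6$, $\alpha = 0.1$, $n = 30$, $\bar{x} = 6.289$, $s = 0.496$.
$$t = \frac{(\bar{x} - \mu_0)\sqrt{n}}{s} = 3.194853, \qquad t_{\alpha,\,n-1} = 1.311$$
Since $t > t_{\alpha,\,n-1}$, we reject H0.
R code:
n <- length(yourdata$lwage)
mean.x <- mean(yourdata$lwage)
s <- sd(yourdata$lwage)
t <- (mean.x - 6)*sqrt(n)/s
# 3.194853
t.alpha <- qt(0.1, n - 1, lower.tail = FALSE)
# 1.311434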
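20 $H_0: p \le 0.07$, $H_1: p > 0.07$, $n = 3010$, $f = 0.07076412$, $\alpha = 2\%$.
$$z = \frac{(f - p_0)\sqrt{n}}{\sqrt{p_0(1 - p_0)}} = 9.094, \qquad z_{\alpha} = 2.054, \qquad p_0 = 0.07$$
Since $z > z_{\alpha}$, we reject H0; that is, there is sufficient evidence to reject the
claim above.
R code:
n <- length(wage$educ)
f <- length(wage$educ[wage$educ < 10])/length(wage$educ)
z.alpha <- qnorm(0.02, lower.tail = FALSE)
# 2.053749
z <- (f - 0.07)*sqrt(n)/sqrt(0.07*(1 - 0.07))
# 9.093994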
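21 $H_0: \mu_1 = \mu_2$, $H_1: \mu_1 \ne \mu_2$, $n_1 = 7$, $n_2 = 23$, $\bar{x}_1 = 373.1429$, $\bar{x}_2 = 677$.
$$s_p^2 = \frac{(n_1 - 1)s_1^2 + (n_2 - 1)s_2^2}{n_1 + n_2 - 2} = 78667.96$$
$$t = \frac{(\bar{x}_1 - \bar{x}_2) - (\mu_1 - \mu_2)}{s_p\sqrt{\frac{1}{n_1} + \frac{1}{n_2}}} = -2.5097, \qquad t_{\alpha/2,\,n_1 + n_2 - 2} = 2.763$$
Since $|t| < t_{\alpha/2,\,n_1 + n_2 - 2}$, we do not reject H0.
R code:
x1 <- subset(yourdata, black == 1)$wage
x2 <- subset(yourdata, black == 0)$wage
n1 <- length(x1)
n2 <- length(x2)
mean.x1 <- mean(x1)
mean.x2 <- mean(x2)
s <- sqrt(((n1 - 1)*var(x1) + (n2 - 1)*var(x2))/(n1 + n2 - 2))
t <- (mean.x1 - mean.x2)/(s*sqrt(1/n1 + 1/n2))
# -2.509706
t.alpha.2 <- qt(0.005, 28, lower.tail = FALSE)
# 2.763262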
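The pooled-variance t statistic in exercise 21 can be recomputed from summary statistics alone, which is a useful cross-check on the decision. The sketch below uses the group sizes and means from exercise 21 and the two group variances of wage reported in exercise 23 (14718.81 for black = 1, 96108.64 for black = 0); the function name `pooled_t` is hypothetical.

```r
# Hypothetical helper: pooled-variance two-sample t test from summary statistics.
pooled_t <- function(m1, m2, v1, v2, n1, n2, alpha) {
  df <- n1 + n2 - 2
  sp2 <- ((n1 - 1) * v1 + (n2 - 1) * v2) / df   # pooled variance
  t.stat <- (m1 - m2) / sqrt(sp2 * (1 / n1 + 1 / n2))
  t.crit <- qt(alpha / 2, df, lower.tail = FALSE)
  list(sp2 = sp2, t.stat = t.stat, t.crit = t.crit,
       reject = abs(t.stat) > t.crit)
}

res21 <- pooled_t(m1 = 373.1429, m2 = 677,
                  v1 = 14718.81, v2 = 96108.64,
                  n1 = 7, n2 = 23, alpha = 0.01)
print(res21)
```

With the raw subsample available, `t.test(x1, x2, var.equal = TRUE)` performs the same pooled test directly.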
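22 $H_0: \pi_1 = \pi_2$, $H_1: \pi_1 \ne \pi_2$, $n_1 = 1764$, $p_1 = 0.1218821$, $n_2 = 297$, $p_2 = 0.5791246$.
With $x_1$ and $x_2$ the numbers of people with IQ < 90 in each group:
$$\bar{p} = \frac{x_1 + x_2}{n_1 + n_2} = 0.1877729$$
$$z = \frac{(p_1 - p_2) - (\pi_1 - \pi_2)}{\sqrt{\bar{p}(1 - \bar{p})\left(\frac{1}{n_1} + \frac{1}{n_2}\right)}} = -18.667$$
R code:
newdata <- na.omit(wage[, c("IQ", "black")])
x1 <- subset(newdata, black == 0)$IQ
x2 <- subset(newdata, black == 1)$IQ
n1 <- length(x1)
n2 <- length(x2)
p1 <- length(x1[x1 < 90])/n1
p2 <- length(x2[x2 < 90])/n2
p.bar <- length(newdata$IQ[newdata$IQ < 90])/length(newdata$IQ)
# (n1*p1 + n2*p2)/(n1 + n2)
z <- (p1 - p2)/sqrt(p.bar*(1 - p.bar)*(1/n1 + 1/n2))
# -18.66723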
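23 $H_0: \sigma_1^2 = \sigma_2^2$, $H_1: \sigma_1^2 \ne \sigma_2^2$, $n_1 = 23$, $n_2 = 7$, $s_1^2 = 96108.64$, $s_2^2 = 14718.81$.
$$f = \frac{s_1^2}{s_2^2} = 6.529647, \qquad f_{\alpha/2,\,22,\,6} = 9.526, \qquad f_{1-\alpha/2,\,22,\,6} = 0.231$$
Since $f_{1-\alpha/2,\,22,\,6} < f < f_{\alpha/2,\,22,\,6}$, we do not reject H0.
R code:
x1 <- yourdata$wage[yourdata$black == 0]
x2 <- yourdata$wage[yourdata$black == 1]
n1 <- length(x1)
n2 <- length(x2)
s1.2 <- var(x1)
s2.2 <- var(x2)
f <- s1.2/s2.2
# 6.529647
f.u <- qf(0.005, 22, 6, lower.tail = FALSE)
# 9.526445
f.l <- qf(1 - 0.005, 22, 6, lower.tail = FALSE)
# 0.2313489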
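24
$$SST = \sum_{j=1}^{2}\sum_{i=1}^{n_j}(x_{ij} - \bar{x})^2 = 2698202.7$$
$$SSA = \sum_{j=1}^{2} n_j(\bar{x}_j - \bar{x})^2 = n_1(\bar{x}_1 - \bar{x})^2 + n_2(\bar{x}_2 - \bar{x})^2 = 495499.8$$
$$SSE = \sum_{j=1}^{2}\sum_{i=1}^{n_j}(x_{ij} - \bar{x}_j)^2 = \sum_{i=1}^{n_1}(x_{i1} - \bar{x}_1)^2 + \sum_{i=1}^{n_2}(x_{i2} - \bar{x}_2)^2 = 2202702.9$$
$$SST = SSA + SSE$$
$$MSA = \frac{SSA}{c - 1} = 495499.8, \qquad MSE = \frac{SSE}{n - c} = 78667.96$$
$$f = \frac{MSA}{MSE} = 6.299, \qquad f_{\alpha/2,\,1,\,28} = 5.6096, \qquad f_{1-\alpha/2,\,1,\,28} = 0.001$$
Since $f > f_{\alpha/2,\,1,\,28}$, we reject H0.
Comparison with the t-test in exercise 21: there H0 is not rejected at the 1% level,
but it would be rejected at the 5% level, which agrees with the one-way ANOVA
result. With only two groups, however, the t-test is easier to use and more flexible
than ANOVA.
R code:
x1 <- yourdata$wage[yourdata$black == 0]
x2 <- yourdata$wage[yourdata$black == 1]
SSA <- length(x1)*(mean(x1) - mean(yourdata$wage))^2 + length(x2)*(mean(x2) - mean(yourdata$wage))^2
SSE <- sum((x1 - mean(x1))^2) + sum((x2 - mean(x2))^2)
MSA <- SSA/(2 - 1)
MSE <- SSE/(30 - 2)
f <- MSA/MSE
# 6.298623
f.low <- qf(0.025, 1, 28, lower.tail = TRUE)
# 0.0009997776
f.up <- qf(0.025, 1, 28, lower.tail = FALSE)
# 5.609564
# ANV <- lm(data = yourdata, wage ~ factor(black))
# anova(ANV)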
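The agreement between the two-group ANOVA and the pooled t-test is exact: the F statistic equals the square of the t statistic, and the p-values coincide. The following sketch illustrates this on a small made-up data set (`toy` and its values are hypothetical, not the course data).

```r
# Hypothetical toy data: wage for two groups coded by `black` (values are made up).
toy <- data.frame(
  wage  = c(300, 420, 380, 510, 650, 720, 690, 800, 560, 610),
  black = c(1, 1, 1, 1, 0, 0, 0, 0, 0, 0)
)

tt <- t.test(wage ~ black, data = toy, var.equal = TRUE)  # pooled two-sample t test
aov.tab <- anova(lm(wage ~ factor(black), data = toy))    # one-way ANOVA table

t.stat <- unname(tt$statistic)
f.stat <- aov.tab[["F value"]][1]
print(c(t.squared = t.stat^2, f = f.stat))
print(c(p.t = tt$p.value, p.f = aov.tab[["Pr(>F)"]][1]))
```

In the exercise itself the same relationship holds: the t statistic -2.509706 squared is approximately the F statistic 6.298623.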
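$H_0: \beta_1 = 0$, $H_1: \beta_1 \ne 0$
$$t = \frac{b_1 - 0}{s_{b_1}} = \frac{55.34}{17.5} = 3.163, \qquad t_{\alpha/2,\,28} = 1.701131$$
Since $|t| > t_{\alpha/2,\,28}$, we reject H0.
R code:
# ols is the fitted model of wage on educ, e.g. ols <- lm(wage ~ educ, data = yourdata)
# t <- summary(ols)$coefficients[2, 1]/summary(ols)$coefficients[2, 2]
# 3.162504
qt(0.05, 28, lower.tail = FALSE)
# 1.701131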
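$$SST = \sum_{i=1}^{30}(y_i - \bar{y})^2 = 2698202.7$$
$$SSR = \sum_{i=1}^{30}(\hat{y}_i - \bar{y})^2 = 710128.39$$
$$SSE = \sum_{i=1}^{30}(y_i - \hat{y}_i)^2 = 1988074.36$$
$$r^2 = \frac{SSR}{SST} = 1 - \frac{SSE}{SST} = 0.2632$$
Interpretation: the coefficient of determination $r^2$ measures the goodness of fit of
the model. Specifically, the model explains about 26.32% of the variation in wage
through educ.
R code:
SST <- sum((yourdata$wage - mean(yourdata$wage))^2)
SSR <- sum((-153.92036 + 55.34129*yourdata$educ - mean(yourdata$wage))^2)
SSE <- sum((-153.92036 + 55.34129*yourdata$educ - yourdata$wage)^2)
SSR/SST
# 0.2632
# summary(ols)
# cor(yourdata$educ, yourdata$wage)^2
# summary(ols)$r.squared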
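In simple linear regression the different routes to the coefficient of determination all give the same number. The sketch below checks this on a small made-up data set (`toy` and its values are hypothetical); it takes the fitted values from `lm()` rather than typing the coefficients by hand, which avoids rounding error.

```r
# Hypothetical toy data for a regression of wage on educ (values are made up).
toy <- data.frame(
  educ = c(8, 10, 12, 12, 14, 16, 16, 18),
  wage = c(310, 420, 400, 560, 610, 580, 790, 850)
)

fit <- lm(wage ~ educ, data = toy)
y <- toy$wage
y.hat <- fitted(fit)

SST <- sum((y - mean(y))^2)      # total sum of squares
SSR <- sum((y.hat - mean(y))^2)  # regression (explained) sum of squares
SSE <- sum((y - y.hat)^2)        # error (residual) sum of squares
r2 <- SSR / SST

print(c(r2 = r2,
        one_minus = 1 - SSE / SST,
        cor_sq = cor(toy$educ, toy$wage)^2,
        lm_r2 = summary(fit)$r.squared))
```

The identity with the squared correlation holds only when there is a single explanatory variable and an intercept in the model.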