Problem Set #1
Instructions:
1. Upload file nlsw88.csv into R
> nlsw88 <- read.csv('nlsw88.csv')
> View(nlsw88)
> ncol(nlsw88)*nrow(nlsw88)
[Image 1 and Image 2: histograms]
Outliers in these two histograms show up as isolated bars in the tails, far from the bulk of
the distribution. We also experimented with bin widths, either to highlight details or to
emphasize the overall shape of the distribution, which can be seen clearly in Image 2.
[1] 106
[1] 8984
Out of 8984 values, 106 are flagged as outliers: their z-scores exceed the chosen
threshold, meaning they lie unusually far from the mean.
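The counting step can be sketched as follows. This is a minimal illustration on simulated data, not the code behind the figures above: the vector x, the planted extreme values, and the cutoff of 3 are all assumptions made for the example.

```r
# Simulated stand-in for a numeric column; the real data set is not used here.
set.seed(1)
x <- c(rnorm(1000, mean = 10, sd = 2), 25, 30, -8)  # three planted extremes

# z-score: distance from the mean measured in standard deviations
z <- (x - mean(x)) / sd(x)

threshold <- 3                 # assumed cutoff, not taken from the original
is_outlier <- abs(z) > threshold

n_outliers <- sum(is_outlier)  # number of flagged values
n_total <- length(x)           # number of values checked
print(n_outliers)
print(n_total)
```

A different threshold (for example 2 or 2.5) changes the count, so the cutoff should be reported together with the number of outliers.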
4. What is your point OLS estimate of beta_1 hat? Construct a 99% confidence interval
for beta_1 hat.
To construct the 99% confidence interval for beta_1 hat we used the following code.
> conf_interval<-confint(model_1,level = 0.99)
> print(conf_interval)
The lower and upper bounds of the interval are read from the 0.5 % and 99.5 % columns of
the "yrs_school" row:
                 0.5 %    99.5 %
(Intercept) 0.50364239 0.8015131
yrs_school  0.08174972 0.1040900
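For reference, the interval that confint() reports is the point estimate plus or minus a t critical value times the standard error. The sketch below rebuilds it by hand on simulated data; the data-generating values and the object name fit are assumptions for illustration and are unrelated to model_1.

```r
# Simulated data with the same variable names; coefficients are made up.
set.seed(42)
n <- 200
yrs_school <- round(runif(n, 8, 18))
lwage <- 0.65 + 0.09 * yrs_school + rnorm(n, sd = 0.5)
fit <- lm(lwage ~ yrs_school)

est <- coef(summary(fit))["yrs_school", "Estimate"]
se  <- coef(summary(fit))["yrs_school", "Std. Error"]

# 99% two-sided interval: 0.5% in each tail, so use the 0.995 quantile
t_crit <- qt(0.995, df = df.residual(fit))

manual_ci  <- c(est - t_crit * se, est + t_crit * se)
builtin_ci <- confint(fit, level = 0.99)["yrs_school", ]

print(manual_ci)
print(builtin_ci)
```

Because the interval is symmetric around the estimate, the point estimate is the midpoint of the two bounds. For the interval reported above, (0.08174972 + 0.1040900) / 2 is approximately 0.0929.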
5. Compute the covariance between lwage and yrs_school variables. Compute the
variance of yrs_school variable. Estimate beta_1 hat coefficient using the statistical
measures you have computed in this step.
To calculate the variance of the variable yrs_school we used the following code.
> var(Book2$yrs_school)
The variance of the yrs_school variable is 6.50374.
To estimate beta_1 hat from these statistics, we divided the covariance between lwage and
yrs_school by the variance of yrs_school:
> cov(Book2$lwage,Book2$yrs_school)/var(Book2$yrs_school)
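As a quick check that this ratio reproduces the OLS slope, the sketch below compares it with lm() on simulated data. The simulated values and the names beta1_moments, beta0_moments, and fit are assumptions for the example, not part of the original analysis.

```r
# Simulated data with the same variable names; coefficients are made up.
set.seed(7)
n <- 500
yrs_school <- round(runif(n, 8, 18))
lwage <- 0.65 + 0.09 * yrs_school + rnorm(n, sd = 0.5)

# Slope from the sample moments: cov(y, x) / var(x)
beta1_moments <- cov(lwage, yrs_school) / var(yrs_school)
# Intercept from the sample means
beta0_moments <- mean(lwage) - beta1_moments * mean(yrs_school)

fit <- lm(lwage ~ yrs_school)

print(c(beta0_moments, beta1_moments))
print(coef(fit))
```

Both cov() and var() divide by n - 1, so the factor cancels in the ratio and the result matches the lm() slope up to floating-point error.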
6. “For any simple linear regression, the model forecast for mean value of the
regressor is the mean value of y variable”. Statement is TRUE or FALSE? Explain
briefly.
We know that the simple linear regression model is:
y = beta_0 + beta_1 x + u
The model forecast at the mean value of the regressor, x bar, is:
y hat = beta_0 hat + beta_1 hat x bar
To check whether this matches the mean value of y, we take expectations of the model:
E[y] = beta_0 + beta_1 E[x] + E[𝑢]
The simple linear regression model assumes that the error term 𝑢 has a mean of zero,
E[𝑢] = 0, so E[y] = beta_0 + beta_1 E[x]. The forecast at the mean of the regressor
therefore equals the mean of y, and the statement is TRUE.
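The claim can also be checked in the sample directly from the OLS formula for the intercept. A short derivation, where \bar{x} and \bar{y} denote the sample means (notation introduced here for the example):

```latex
\hat{\beta}_0 = \bar{y} - \hat{\beta}_1 \bar{x}
\quad\Longrightarrow\quad
\hat{y}(\bar{x}) = \hat{\beta}_0 + \hat{\beta}_1 \bar{x}
= \left(\bar{y} - \hat{\beta}_1 \bar{x}\right) + \hat{\beta}_1 \bar{x}
= \bar{y}
```

So the fitted line passes through the point of means whenever the regression includes an intercept.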
Coefficients:
             Estimate Std. Error t value Pr(>|t|)
(Intercept) -2.560314   0.101986 -25.105  < 2e-16 ***
yrs_school   0.132494   0.007284  18.189  < 2e-16 ***
ttl_exp      0.069832   0.003984  17.527  < 2e-16 ***
black       -0.188827   0.041584  -4.541  5.9e-06 ***
In a multiple linear regression analysis predicting the logarithm of wages (lwage) based on years
of schooling (yrs_school), total work experience (ttl_exp), and a dummy variable indicating
whether a woman is black (black), the interpretation of the regression coefficients is as follows:
● Intercept: The expected logarithm of wages is 0.397540 when all independent
variables are set to 0.
● yrs_school: Holding all other variables constant, one additional year of schooling
raises the expected logarithm of wages by 0.076128.
● ttl_exp: Holding all other variables constant, one additional unit of total work
experience raises the expected logarithm of wages by 0.040124.
● black: Holding all other variables constant, the expected logarithm of wages for a
black woman (dummy variable equal to 1) is 0.108496 lower than for a non-black woman.
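Since the dependent variable is a logarithm, the coefficients quoted in the list above can be read as approximate percentage effects on wages. The sketch below assumes lwage is the natural logarithm of the wage (the text does not state the base); the names b, approx_pct, and exact_pct are introduced for the example.

```r
# Coefficients as quoted in the interpretation above
b <- c(yrs_school = 0.076128, ttl_exp = 0.040124, black = -0.108496)

# Common shortcut: a coefficient b is read as a 100 * b percent change
approx_pct <- 100 * b

# Exact percentage change implied by a natural-log outcome
exact_pct <- 100 * (exp(b) - 1)

print(round(cbind(approx_pct, exact_pct), 2))
```

Under that assumption, one more year of schooling corresponds to roughly 7.9% higher wages, and the "black" coefficient to roughly 10.3% lower wages. The shortcut and the exact figure stay close as long as the coefficient is small.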
All coefficients are statistically significant: the t-values are far from zero and the
p-values are below any conventional alpha level. Consequently, we can reject the null
hypothesis that the true value of each coefficient is zero.
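Each t value in the printed coefficient table is the estimate divided by its standard error. The sketch below recomputes them from the printed figures. The p-values use a large-sample normal approximation, which is an assumption made here because the residual degrees of freedom are not shown; summary() on an lm fit uses the t distribution instead.

```r
# Estimates and standard errors copied from the printed coefficient table
est <- c("(Intercept)" = -2.560314, yrs_school = 0.132494,
         ttl_exp = 0.069832, black = -0.188827)
se  <- c("(Intercept)" = 0.101986, yrs_school = 0.007284,
         ttl_exp = 0.003984, black = 0.041584)

# t statistic for H0: coefficient = 0
t_value <- est / se

# Two-sided p-value under a normal approximation (assumption, see above)
p_approx <- 2 * pnorm(-abs(t_value))

print(round(t_value, 3))
print(signif(p_approx, 2))
```

Small differences in the last digit relative to the printed table come from the rounding of the estimates and standard errors.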
10. Could you claim that there is racial discrimination based on women's race?
In the multiple linear regression analysis presented earlier, the dummy variable "black"
exhibits a negative coefficient of -0.108496. This implies that, on average, the logarithm of
wages for black women tends to be lower. The significance of this negative coefficient, at a very
low alpha level, provides initial evidence supporting the assertion that black women earn less
than their non-black counterparts.
However, it's crucial to acknowledge that factors beyond race may contribute to this
observed difference. Incorporating additional potential explanatory variables into the model
would be enlightening. Examining whether the "black" dummy variable remains statistically
significant after considering factors such as the specific professions pursued, educational
attainment, cost of living at their place of employment, and others, is essential. This investigation
aims to discern whether the lower average salary for black women is predominantly influenced
by their race or if other variables play a significant role.
A more conclusive understanding of the impact of race on earnings will emerge only if
the model includes additional variables beyond years of schooling and total experience
and still yields a statistically significant coefficient for the "black" dummy variable.
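The effect of adding a control can be illustrated with a small simulation. Everything here is made up for the example: the variable occupation_score is hypothetical, the effect sizes are arbitrary, and the output says nothing about the actual nlsw88 result. The sketch only shows how a dummy coefficient can change once a correlated variable enters the model.

```r
set.seed(123)
n <- 5000
black <- rbinom(n, 1, 0.25)

# Hypothetical control that is correlated with the dummy and affects wages
occupation_score <- -0.5 * black + rnorm(n)

# Simulated log wage: direct dummy effect of -0.05, control effect of 0.3
lwage <- 1 + 0.3 * occupation_score - 0.05 * black + rnorm(n, sd = 0.3)

short_fit <- lm(lwage ~ black)                     # control omitted
long_fit  <- lm(lwage ~ black + occupation_score)  # control included

b_short <- coef(short_fit)[["black"]]
b_long  <- coef(long_fit)[["black"]]
print(c(short = b_short, long = b_long))
```

In this setup the short regression attributes part of the control's effect to the dummy, so its coefficient is larger in absolute value than in the long regression. On real data the same comparison would be made by refitting the model with the extra variables and checking whether the dummy coefficient remains significant.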