STAT Activity
1. Question
1.1. Bivariate statistical analysis
1.1.1. Line graph
1.1.2. Variable relationship by Bar graph
1.1.3. Histograms
1.1.4. Correlation coefficients
1.1.5. Partial and semi partial correlation
1.1.6. Linear Regression Model
2. Question_2
1. Question
Assume that you wanted to investigate the determinants of wage earning in Wolaita Sodo town.
Accordingly, you have collected data on wage earnings of 100 randomly selected daily laborers
working in Wolaita Sodo town. The daily earning data of the wage earners were collected in Birr.
Assume also that you have collected data on some independent variables such as sex (male=0,
female=1); age of the respondents in years; birth location of the respondents, where urban=0 and
rural=1; marital status of the respondents, measured as married=1 and 0 otherwise; family size of
the respondents, measured in numbers; education of the respondents, measured in years of
schooling; and experience as a daily laborer (in years). The file is named “earning data”. Run the
regression, check for validity of the assumptions and interpret your results (13 Pts).
Given earning Data:
Q1.1: Labeled variables with respective values
1.1.1. Line graph
(g) Wage vs married
1.1.2. Variable relationship by Bar graph
1.1.3. Histograms
1.1.4. Correlation coefficients
1.1.5. Partial and semi partial correlation
1.1.6. Linear Regression Model
The fitted linear regression model is shown below, where wage is the dependent variable and the
others are independent variables, and the result is interpreted.
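Written out, a model of this kind has the general form given here. The variable names are assumptions taken from the description in Question 1, and the coefficients are left symbolic because the estimates themselves are not reproduced in the text.

```latex
\[
\text{wage}_i = \beta_0 + \beta_1\,\text{sex}_i + \beta_2\,\text{age}_i
  + \beta_3\,\text{birthloc}_i + \beta_4\,\text{married}_i
  + \beta_5\,\text{famsize}_i + \beta_6\,\text{educ}_i
  + \beta_7\,\text{exper}_i + \varepsilon_i
\]
```

Each slope is read as the expected change in daily wage (in Birr) for a one-unit change in that variable, holding the others constant.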
Q1.2:
As part of your inferential analysis, test whether daily wage earnings vary by sex. Explain
whether the wage earnings statistically significantly vary between male and female. Which sex is
more likely to earn more wage per day? Explain.
Answer:
Since the p-value (0.008) is less than the significance level of 0.05, we reject the null hypothesis of equal
mean wages. This means that there is a statistically significant difference in daily wage earnings between
males and females. As shown in the graphs above (especially the bar graph and histograms), the mean daily
wage for females is higher than that for males, so I conclude that females are more likely to earn more
wage per day on average.
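A comparison of two group means like this one can be sketched as a Welch two-sample t statistic. The wages below are hypothetical values in Birr, not the "earning data" file, and `welch_t` is a name made up for this illustration; the p-value would then be read from a t distribution with the returned degrees of freedom using a statistics package.

```python
import math
import statistics

def welch_t(sample_a, sample_b):
    """Return (t statistic, Welch-Satterthwaite df) for mean(b) - mean(a)."""
    n_a, n_b = len(sample_a), len(sample_b)
    # Squared standard error of each group mean
    se2_a = statistics.variance(sample_a) / n_a
    se2_b = statistics.variance(sample_b) / n_b
    diff = statistics.mean(sample_b) - statistics.mean(sample_a)
    t_stat = diff / math.sqrt(se2_a + se2_b)
    df = (se2_a + se2_b) ** 2 / (se2_a ** 2 / (n_a - 1) + se2_b ** 2 / (n_b - 1))
    return t_stat, df

# Hypothetical daily wages in Birr (assumption, for illustration only)
male_wages = [50, 55, 60, 65, 70]
female_wages = [70, 75, 80, 85, 90]

t_stat, df = welch_t(male_wages, female_wages)
print(f"t = {t_stat:.2f}, df = {df:.1f}")
```

A positive t statistic here means the second group (females) has the higher mean, which is the direction reported in the answer.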
Q1.3:
o Regression diagnostics test for Heteroskedasticity
The regression table for the heteroskedasticity test is outlined as follows:
Explanation:
A p-value of 0.000 is less than 0.05 (the 5% significance level), which indicates strong evidence against
the null hypothesis of homoskedasticity (constant variance of errors) and suggests the presence of
heteroskedasticity (non-constant variance of errors).
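To make the logic of such a test concrete, here is a minimal sketch of a studentized (Koenker) Breusch-Pagan check for a model with a single regressor. It is an illustration under stated assumptions: the data are hypothetical, the function names are invented for this example, and it is not the procedure that produced the p-value reported above. The statistic is n times the R-squared from regressing the squared residuals on the regressor, compared with a chi-square distribution with one degree of freedom.

```python
import math

def simple_ols(x, y):
    """Fit y = a + b*x by least squares; return (a, b, residuals)."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    b = sxy / sxx
    a = mean_y - b * mean_x
    residuals = [yi - (a + b * xi) for xi, yi in zip(x, y)]
    return a, b, residuals

def r_squared(x, y):
    """R-squared of the simple regression of y on x (0.0 if y is constant)."""
    _, _, resid = simple_ols(x, y)
    mean_y = sum(y) / len(y)
    sst = sum((yi - mean_y) ** 2 for yi in y)
    if sst == 0:
        return 0.0
    return 1 - sum(e ** 2 for e in resid) / sst

def breusch_pagan_one_regressor(x, residuals):
    """Return (LM statistic n*R^2, chi-square(1) p-value)."""
    squared = [e ** 2 for e in residuals]
    lm = max(0.0, len(x) * r_squared(x, squared))
    # Upper tail of a chi-square distribution with 1 degree of freedom
    p_value = math.erfc(math.sqrt(lm / 2))
    return lm, p_value

# Hypothetical data in which the spread of y around the line grows with x
x = [1, 2, 3, 4, 5, 6, 7, 8]
y = [2.1, 3.8, 6.4, 7.3, 11.0, 10.6, 15.8, 13.9]
_, _, resid = simple_ols(x, y)
lm, p = breusch_pagan_one_regressor(x, resid)
print(f"LM = {lm:.3f}, p-value = {p:.3f}")
```

With several regressors the auxiliary regression would include all of them and the degrees of freedom would equal their number, which needs a chi-square routine from a statistics library.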
o Multicollinearity test
Explanation:
When conducting regression diagnostics to test for heteroskedasticity, a common approach is to
use statistical tests such as the Breusch-Pagan test or the White test. If the p-value from these
tests is less than 0.05, it indicates strong evidence against the null hypothesis of
homoskedasticity (constant variance of errors), suggesting the presence of heteroskedasticity
(non-constant variance of errors).
Based on the results of the multicollinearity test, which shows a mean variance of 1.15 and a
covariance of 7.748, we can draw the following conclusions:
Mean Variance: This is the mean Variance Inflation Factor (VIF). A mean VIF of 1.15 indicates
that there is no significant multicollinearity among the predictors in the regression model. As a
common rule of thumb, a VIF greater than 10 suggests significant multicollinearity. Since 1.15 is
well below this threshold, multicollinearity is not a concern for the model.
Covariance: The covariance value of 7.748 does not directly indicate multicollinearity.
Covariance measures the joint variability of two variables, and its size depends on the units in
which they are measured, so it does not reflect multicollinearity the way VIF does.
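As a rough illustration of what a VIF measures, the sketch below covers the special case of a model with exactly two predictors, where the VIF of each equals 1 / (1 - r^2) and r is the Pearson correlation between them. The data and the names `pearson_r` and `vif_two_predictors` are hypothetical; with more predictors, each VIF comes from regressing that predictor on all the others.

```python
import math

def pearson_r(x, y):
    """Pearson correlation coefficient between two equal-length lists."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)

def vif_two_predictors(x1, x2):
    """VIF shared by both predictors in a two-predictor model."""
    r = pearson_r(x1, x2)
    return 1 / (1 - r ** 2)

# Hypothetical years of schooling and years of experience (assumption)
education = [1, 2, 3, 4, 5]
experience = [2, 1, 4, 3, 5]

vif = vif_two_predictors(education, experience)
print(f"VIF = {vif:.2f}")
```

A VIF of 1 corresponds to uncorrelated predictors, so a mean VIF of 1.15 is very close to the no-collinearity case.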
o Specification and Normality
To test the null hypothesis, a multiple regression was run as follows.
Explanation: The F-value of 9.65466, coupled with a p-value of 0.00, indicates that the overall
regression model is statistically significant. If all slope coefficients were truly zero, an
F-value this large would be very unlikely to arise by random chance. Specifically, the p-value of
0.00 (less than any conventional alpha level, such as 0.05 or 0.01) strongly suggests that at
least one of the predictor variables is significantly related to the outcome variable.
Q1.4: Evaluate the test of the null hypothesis that the slope coefficients in the model are
jointly zero. Is it accepted or rejected? Why?
Answer: Since the p-value is significantly less than 0.05, we reject the null hypothesis. This
means there is sufficient evidence to suggest that at least one of the independent variables
(education, experience, sex, married, family size, birth location) has a non-zero effect on the
wage-earning dependent variable.
In conclusion, based on the F-test results, we reject the null hypothesis that the slope coefficients
in the model are jointly zero.
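The overall F statistic can be cross-checked from the R-squared value using F = (R^2 / k) / ((1 - R^2) / (n - k - 1)). The sketch below uses the reported R-squared of 0.4235 and the sample size of 100; taking k = 7 slope coefficients (the seven variables listed in the question) is an assumption, and `f_statistic` is a name chosen for this example.

```python
def f_statistic(r_squared, n_obs, n_slopes):
    """Overall F statistic for H0: all slope coefficients are zero."""
    numerator = r_squared / n_slopes
    denominator = (1 - r_squared) / (n_obs - n_slopes - 1)
    return numerator / denominator

# R-squared and n as reported in the text; k = 7 regressors is an assumption
f_value = f_statistic(0.4235, n_obs=100, n_slopes=7)
print(f"F(7, 92) = {f_value:.2f}")
```

Under that assumption the result is about 9.65, in line with the F-value reported in the regression output.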
Q1.5: Evaluate the measure of goodness-of-fit measure.
Answer: The measure of goodness-of-fit in this context is the R-squared value, which is given as
0.4235. R-squared, also known as the coefficient of determination, represents the proportion of
the variance in the dependent variable (wage-earning) that is explained by the independent
variables (education, experience, sex, married, family size, birth location).
R-squared value of 0.4235 indicates that approximately 42.35% of the variance in wage-earning
can be explained by the independent variables included in the model.
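For reference, R-squared is computed as 1 minus the ratio of the residual sum of squares to the total sum of squares. The following minimal sketch uses hypothetical observed and predicted wages (not the "earning data" file) purely to show the arithmetic.

```python
def r_squared(observed, predicted):
    """Coefficient of determination: 1 - SSE / SST."""
    mean_obs = sum(observed) / len(observed)
    sse = sum((o - p) ** 2 for o, p in zip(observed, predicted))
    sst = sum((o - mean_obs) ** 2 for o in observed)
    return 1 - sse / sst

# Hypothetical daily wages in Birr and model predictions (assumption)
observed = [40, 50, 60, 70, 80]
predicted = [42, 48, 63, 68, 79]

r2 = r_squared(observed, predicted)
print(f"R-squared = {r2:.3f}")
```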
Now, regarding the test of the null hypothesis that the slope coefficients in the model are jointly
zero:
The F-value is 9.65, with a p-value of 0.000.
This indicates a significant relationship between the independent variables and the dependent
variable. The p-value of 0.000 means that the probability of observing such an extreme F-value
under the null hypothesis (that all slope coefficients are zero) is virtually zero.
We therefore reject the null hypothesis.
In summary:
The R-squared value (0.4235) indicates the proportion of variance in the dependent variable
explained by the independent variables.
The F-test with an F-value of 9.65 and a p-value of 0.000 indicates that the overall model is
statistically significant.
Consequently, we reject the null hypothesis that the slope coefficients in the model are jointly
zero, implying that at least one of the independent variables has a significant effect on the
dependent variable.
Therefore, the null hypothesis is rejected on the basis of the statistically significant F-test,
and the R-squared value of 0.4235 indicates a moderate level of explanatory power of the
independent variables on the dependent variable.
Q1.6: Which of those independent variables are statistically significant in explaining the
dependent variable about the mean? How do you know this?
Answer: The regression analysis reveals that education and experience significantly boost daily
wage earnings, highlighting the importance of investing in education and gaining work
experience. However, the analysis also uncovers a persistent gender wage gap, with females
earning more than males. Additionally, marital status positively influences earnings, while larger
family sizes are associated with lower daily wages. These findings emphasize the need for
targeted policies to address wage disparities and support individuals in balancing family
responsibilities with career advancement.
Q1.7: Which of those independent variables are not statistically significant in explaining the
dependent variable about the mean? How do you know this?
Answer: Larger family sizes negatively impact daily wage earnings. This could be because
individuals with larger families might face greater financial pressures and responsibilities,
potentially limiting their availability for higher-paying jobs or requiring them to accept lower-
paying positions due to immediate financial needs. I know this from data on covariance.
Q1.8: Interpret the coefficients of those statistically significant variables. Explain whether the
coefficients are meaningful both theoretically and in practice
Answer:
Theoretical and Practical Meaningfulness
Sex (Gender): If sex is found to be statistically significant, the coefficient associated with it
indicates the average difference in wage-earning between the genders. Theoretical interpretations
could suggest gender-based wage disparities, with one gender earning more than the other for
similar work. However, it's important to consider societal, cultural, and legal factors influencing
gender-based wage differentials.
Age: The coefficient for age could reflect factors such as career progression, skills
accumulation, or age-related discrimination in the labor market.
Education: A positive coefficient for education implies that as education increases by one
unit (one additional year of schooling), wage-earning tends to increase by the amount indicated
by the coefficient, holding the other variables constant. Theoretically, this aligns with the
idea that higher education often leads to better job opportunities and higher income potential.
In practice, this suggests that individuals with more years of schooling may earn more than
those with fewer.
Experience: Similarly, a positive coefficient for experience suggests that as years of experience
increase, wage-earning tends to increase. This is consistent with the notion that individuals gain
skills and expertise over time, which can lead to higher wages. Practically, this implies that more
experienced workers may command higher salaries than less experienced counterparts.
Married: The coefficient for being married might signify differences in wage-earning between
married and unmarried individuals. Theoretically, this could relate to various factors such as
stability, responsibility, and potential differences in job opportunities for married individuals.
However, it's essential to be cautious in interpreting this variable, as it may also reflect other
socio-economic factors associated with marriage.
Family Size: A statistically significant coefficient for family size might imply that individuals
with larger families tend to have different wage-earning levels compared to those with smaller
families. Theoretically, this could be related to the financial responsibilities associated with
supporting a larger family, which might influence career choices and opportunities. However,
practical implications may vary depending on cultural and societal norms regarding family
dynamics.
Birth Location: The coefficient associated with birth location suggests that individuals born in
different locations may have different wage-earning levels. Theoretical interpretations could
involve differences in economic opportunities, cost of living, and regional disparities in job
markets. Practically, this implies that individuals born in certain regions may have higher or
lower earning potentials compared to others.
Rank the importance of the variables according to their strength in influencing wage earning of
the respondents from the highest to the lowest.
2. Question_2
Assume that you are investigating causes and effects of crime in Ethiopia. Accordingly, assume
that you have regressed crime rate measured as violent crimes per 100,000 people (crime rate),
on six other explanatory variables: murders per 1,000,000 (murder rate), the percent of the
population living in slum areas devoid of basic services (slum areas), the percent of the
population that is fully employed (employed), percent of population with a high school education
or above (educated), percent of population living under poverty line (poverty), and percent of
population that are single parents (single). Use the file named "murder rate data". (12 Pts)
Q- (a) Graph your dependent variable (crime rate) by each of your independent variable using
scatter line and explain your results. Which of the independent variables are positively
correlated with the crime rate? Which one is negatively correlated?
Scatter lines: Crime rate vs Murder; Crime rate vs Slum; Crime rate vs unemployment; Crime rate vs education; Crime rate vs Poverty; Crime rate vs Single
Answer:
Based on the relationship observed from the scatter plots above
Positively Correlated with Crime Rate:
Murder: There appears to be a positive linear relationship between the murder rate and the crime
rate. As the murder rate increases, the overall crime rate tends to increase as well.
Slum: Similarly, there seems to be a positive linear relationship between the presence of slums
and the crime rate. Higher levels of slum areas are associated with higher crime rates.
Poverty: The scatter plot suggests a positive linear relationship between poverty and crime rate.
Areas with higher poverty levels tend to have higher crime rates.
Single Parent: The scatter plot indicates a positive linear relationship between the percentage of
single-parent households and the crime rate. Higher rates of single-parent households are
associated with higher crime rates.
Negatively Correlated with Crime Rate:
Unemployed: The scatter plot suggests that the relationship between the unemployment rate and
crime rate is not linear. This implies that while unemployment may be related to crime, the
association is not straightforward and may be influenced by other factors.
Education: Similarly, the scatter plot indicates a non-linear relationship between education
levels and crime rate. While higher education levels may generally correlate with lower crime
rates, the relationship may not be strictly linear and can be influenced by various socio-economic
factors.
In conclusion, the murder, slum, poverty, and single parent variables are positively correlated
with the crime rate.
The unemployed and education variables appear negatively correlated with the crime rate, although
the scatter plots suggest that neither relationship is strictly linear.
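The direction of each relationship can also be checked numerically with the Pearson correlation coefficient, whose sign shows whether two variables move together or in opposite directions. The sketch below uses small hypothetical series, not the "murder rate data" file, and the variable names are assumptions chosen to mirror the question.

```python
import math

def pearson_r(x, y):
    """Pearson correlation coefficient between two equal-length lists."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)

# Hypothetical values for illustration only
crime_rate = [100, 200, 300, 400]
predictors = {
    "murder": [1, 3, 2, 4],
    "educated": [80, 60, 70, 50],
}

signs = {}
for name, values in predictors.items():
    r = pearson_r(values, crime_rate)
    signs[name] = "positive" if r > 0 else "negative" if r < 0 else "none"
    print(f"{name}: r = {r:.2f} ({signs[name]})")
```

Applying the same function to every pair of variables gives the entries of a correlation matrix.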
Q - (b) Present correlation matrix of your dependent variable against your independent variables
and interpret coefficients.
Answer:
Model Summary:
F-value: 62.5
p-value: 0.000
R-squared: 0.8950
Root MSE (Root Mean Squared Error): 152.36
Interpretation:
F-value and p-value:
F-value (62.5): The F-value indicates the overall significance of the regression model. A high F-
value suggests that the model explains a significant portion of the variance in the dependent
variable.
p-value (0.000): The p-value associated with the F-statistic is very small (0.000), which indicates
that the overall model is statistically significant. This means that at least one of the independent
variables is significantly related to the dependent variable (crime rate).
R-squared (0.8950): The R-squared value represents the proportion of variance in the dependent
variable (crime rate) that can be explained by the independent variables. An R-squared value of
0.8950 means that 89.50% of the variability in crime rate is explained by the variables included
in the model. This is a very high R-squared value, suggesting that the model fits the data very
well.
Root MSE (152.36): The root mean squared error measures the standard deviation of the
residuals (prediction errors). It provides an estimate of the typical distance between the observed
values and the predicted values. A lower RMSE indicates a better fit of the model. In this context,
an RMSE of 152.36 means that the model’s predictions typically miss the observed crime rate by
about 152 violent crimes per 100,000 people.
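The calculation behind a root MSE can be sketched as follows, assuming the definition commonly used in regression output: the square root of the residual sum of squares divided by the residual degrees of freedom, n - k - 1. The residuals are hypothetical and `root_mse` and `n_slopes` are names chosen for this example.

```python
import math

def root_mse(residuals, n_slopes):
    """sqrt(SSE / (n - k - 1)), with k slope coefficients plus an intercept."""
    dof = len(residuals) - n_slopes - 1
    sse = sum(e ** 2 for e in residuals)
    return math.sqrt(sse / dof)

# Hypothetical residuals from a model with one slope coefficient (assumption)
residuals = [2, -2, 2, -2, 0, 0]
rmse = root_mse(residuals, n_slopes=1)
print(f"Root MSE = {rmse:.2f}")
```

Because the result is in the units of the dependent variable, it is best judged against the typical size of the crime rate values themselves.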
Interpretation of Coefficients:
To interpret the coefficients of each independent variable, we typically look at their values and
the corresponding p-values. Since specific coefficients are not provided, we will discuss the
general interpretation:
Murder: A positive coefficient would indicate that higher murder rates are associated with higher
overall crime rates. Whether the effect is statistically significant is judged from the
coefficient's own p-value rather than from the overall model fit.
Slum: A positive coefficient for slum areas would suggest that higher proportions of slum areas
are associated with higher crime rates. This is consistent with theories linking poor living
conditions to higher crime rates.
Unemployed: The coefficient for unemployment rate would show the relationship between
unemployment and crime rate. If it’s positive, higher unemployment is associated with higher
crime rates. This is plausible as higher unemployment can lead to economic strain and potentially
more crime.
Education: A negative coefficient for education would suggest that higher education levels are
associated with lower crime rates. This aligns with theories that higher education can lead to
better job opportunities and lower propensity for crime.
Poverty: A positive coefficient for poverty would indicate that higher poverty levels are
associated with higher crime rates. Poverty can lead to economic desperation and higher crime
rates.
Single Parent: A positive coefficient for single-parent households would suggest that higher
rates of single-parent households are associated with higher crime rates. This could reflect socio-
economic challenges and reduced supervision often faced by single-parent families.
Conclusion:
The high F-value and very low p-value indicate that the overall regression model is highly
significant.
The R-squared value of 0.8950 suggests that the model explains a very large portion of the
variance in crime rates.
The Root MSE of 152.36 indicates a reasonable level of prediction error.
Q – (C): Now, regress your dependent variable (crime rate) on the five independent variables
explained above.
o Homoscedasticity assumption of error term
Explanation:
The R-squared value of 0.8950 indicates a very high level of explanatory power, meaning that
89.50% of the variance in the dependent variable is explained by the independent variables in the
model. This suggests that the model fits the data extremely well.
The F-value of 62.51, along with a p-value of 0.00, indicates that the overall regression model is
highly statistically significant. The p-value of 0.00 (less than any conventional significance level,
such as 0.05 or 0.01) implies that an F-value this large would be very unlikely if all slope
coefficients were zero. Therefore, we can confidently reject the null hypothesis that there is no
relationship between the variables.
Homoscedasticity, the assumption that the variance of the error terms is constant across all levels
of the independent variables, is crucial for the validity of regression results. The high R-squared
value and the significant F-value describe the overall fit of the model, but these statistics do not
test for homoscedasticity. The assumption has to be checked separately, for example with a
residual plot or a test such as the Breusch-Pagan test.
o Multicollinearity assumption
Q-(d): Interpret statistically significant coefficients of your regression results. How did you
know as to which variables are statistically significant or not?
Answer: To interpret the statistically significant coefficients of your regression results, let's first identify
which variables are statistically significant. The significance of a variable in regression analysis is
typically determined by its p-value. A common threshold for statistical significance is a p-value of less
than 0.05.
Murder (p-value = 0.000): This p-value is less than 0.05, indicating that the variable 'Murder' is
statistically significant. The coefficient for 'Murder' in the regression model can be interpreted as having a
meaningful impact on the dependent variable.
Slum (p-value = 0.000): This p-value is also less than 0.05, indicating that the variable 'Slum' is
statistically significant. The coefficient for 'Slum' is considered meaningful.
Unemployed (p-value = 0.782): This p-value is greater than 0.05, indicating that the variable
'Unemployed' is not statistically significant. Its coefficient does not provide strong evidence of having an
impact on the dependent variable.
Education (p-value = 0.477): This p-value is greater than 0.05, indicating that the variable 'Education' is
not statistically significant. Its coefficient does not provide strong evidence of an effect on the dependent
variable.
Poverty (p-value = 0.130): This p-value is greater than 0.05, indicating that the variable 'Poverty' is
not statistically significant. Its coefficient does not provide strong evidence of an effect on the dependent
variable.
Single (p-value = 0.013): This p-value is less than 0.05, indicating that the variable 'Single' is statistically
significant. The coefficient for 'Single' is meaningful.
Significance is decided using the common threshold for statistical significance, which is a p-value of less
than 0.05.
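The decision rule can be applied mechanically to the p-values listed in this answer. The short sketch below does exactly that; the dictionary simply restates the reported p-values, and `ALPHA` is the 0.05 threshold named above.

```python
ALPHA = 0.05  # significance threshold used in the answer

# p-values as reported for each explanatory variable
p_values = {
    "Murder": 0.000,
    "Slum": 0.000,
    "Unemployed": 0.782,
    "Education": 0.477,
    "Poverty": 0.130,
    "Single": 0.013,
}

significant = [name for name, p in p_values.items() if p < ALPHA]
not_significant = [name for name, p in p_values.items() if p >= ALPHA]

print("Significant:", significant)
print("Not significant:", not_significant)
```

Only the coefficients in the first list call for an interpretation of their size and sign.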