Assignment (1)

The assignment focuses on data analysis using a dataset from Kaggle related to student performance, covering summary statistics, visualizations, correlations, regression modeling, and hypothesis testing. Key findings include strong correlations between scores in different subjects and significant regression coefficients indicating the impact of reading and writing scores on math scores. The analysis concludes that all tested relationships are statistically significant.

Uploaded by

abrarmahir818
Copyright
© All Rights Reserved
Jahangirnagar University

Department of Statistics and Data Science


Course Title: Econometrics
Course Code: STAT-305
Submitted to: Md Moyazzem Hossain, PhD
Professor
Department of Statistics and Data Science
Jahangirnagar University

Submitted by: Group-10 (Roll: 85, 92, 94, 96, 98, 109, 122)

Assignment On:
Data Analysis: Summary Statistics, Visualization, Correlation,
Regression, and Hypothesis Testing
WELCOME TO OUR PRESENTATION

Name Roll
Zihadul Islam Api 96
Mahir Abrar Hossain 98
Protiva Paul Diba 122
Md. Alif Hossain 94
Md. Arif Faisal Anik 92
Md. Saif Masnun 109
Jubaer Al Hasan Tanvir 85
Sadique Ahmed Shovon 2227
Assignment Tasks:

1. Download a dataset from Kaggle.
2. Compute the summary statistics of the available variables. Make comments.
3. Visualize the variables using the appropriate charts/graphs. Make comments.
4. Compute the correlations of the variables. Comment on your results.
5. Estimate a regression model and interpret your findings.
6. Test the hypotheses about your calculated correlation coefficients and regression coefficients.
Dataset Acquisition from Kaggle
The dataset we used for this analysis is the “Student Performance Dataset”.
The dataset can be accessed and downloaded from Kaggle at the following link:
https://www.kaggle.com/datasets/spscientist/students-performance-in-exams

R Programming:
StudentPerformance=read.csv("D:\\Stat 3rd year\\Stat 305 Assignment\\StudentsPerformance.csv")
head(StudentPerformance)
Variables In The Dataset
Numerical Variables:
The numerical variables in this dataset are Math Score, Reading Score, and Writing Score.

Categorical Variables:
The categorical variables we have in this dataset are Gender, Race/Ethnicity, Parental Level of Education, and Test Preparation Course.
Summary Statistics and Key Insights
Math Score:
R Programming:
summary(StudentPerformance$math.score)
varianceMathScore=var(StudentPerformance$math.score)
varianceMathScore
StandardDeviationMathScore=sd(StudentPerformance$math.score)
StandardDeviationMathScore
library(moments)
SkewnessMathScore=moments::skewness(StudentPerformance$math.score)
SkewnessMathScore
KurtosisMathScore=moments::kurtosis(StudentPerformance$math.score)
KurtosisMathScore

Summary Statistics    Value
Minimum               0.00
1st Quartile          57.00
Median                66.00
Mean                  66.09
3rd Quartile          77.00
Maximum               100.00
Variance              229.919
Standard Deviation    15.16308
Skewness              -0.2785166
Kurtosis              3.267597
Summary Statistics and Key Insights
Reading Score:
R Programming:
summary(StudentPerformance$reading.score)
varianceReadingScore=var(StudentPerformance$reading.score)
varianceReadingScore
StandardDeviationReadingScore=sd(StudentPerformance$reading.score)
StandardDeviationReadingScore
SkewnessReadingScore=moments::skewness(StudentPerformance$reading.score)
SkewnessReadingScore
KurtosisReadingScore=moments::kurtosis(StudentPerformance$reading.score)
KurtosisReadingScore

Summary Statistics    Value
Minimum               17.00
1st Quartile          59.00
Median                70.00
Mean                  69.17
3rd Quartile          79.00
Maximum               100.00
Variance              213.1656
Standard Deviation    14.60019
Skewness              -0.2587157
Kurtosis              2.926081
Summary Statistics and Key Insights
Writing Score:
R Programming:
summary(StudentPerformance$writing.score)
varianceWritingScore=var(StudentPerformance$writing.score)
varianceWritingScore
StandardDeviationWritingScore=sd(StudentPerformance$writing.score)
StandardDeviationWritingScore
SkewnessWritingScore=moments::skewness(StudentPerformance$writing.score)
SkewnessWritingScore
KurtosisWritingScore=moments::kurtosis(StudentPerformance$writing.score)
KurtosisWritingScore

Summary Statistics    Value
Minimum               10.00
1st Quartile          57.75
Median                69.00
Mean                  68.05
3rd Quartile          79.00
Maximum               100.00
Variance              230.908
Standard Deviation    15.19566
Skewness              -0.2890096
Kurtosis              2.960808
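The skewness and kurtosis values reported above come from the moments package. As a rough sketch of what those two calls compute, assuming the moment-based definitions (third and fourth central moments divided by powers of the second central moment, all with divisor n and no excess-kurtosis correction), the same quantities can be written in base R. The helper names central_moment, skewness_by_hand, and kurtosis_by_hand, and the vector example_scores, are illustrative and do not appear in the assignment.

```r
# Illustrative base-R versions of moment-based skewness and kurtosis.
# k-th central moment with divisor n (not n - 1).
central_moment <- function(x, k) mean((x - mean(x))^k)

# Skewness: third central moment scaled by the 3/2 power of the second.
skewness_by_hand <- function(x) central_moment(x, 3) / central_moment(x, 2)^1.5

# Kurtosis: fourth central moment scaled by the square of the second.
kurtosis_by_hand <- function(x) central_moment(x, 4) / central_moment(x, 2)^2

# Hypothetical scores, used only to exercise the helpers.
example_scores <- c(40, 55, 60, 66, 70, 72, 80, 95)
skewness_by_hand(example_scores)
kurtosis_by_hand(example_scores)
```

Under this definition a normal distribution has kurtosis 3, so the reported values (about 3.27, 2.93, and 2.96) are all close to normal, and the small negative skewness values indicate a slightly longer left tail in each score.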
Frequency Table for Categorical Variables
R Programming:
table(StudentPerformance$gender)
table(StudentPerformance$test.preparation.course)
table(StudentPerformance$race.ethnicity)
table(StudentPerformance$parental.level.of.education)

Gender:
Male    Female
482     518

Test Preparation Course:
Completed    None
358          642
Frequency Table for Categorical Variables
Race/Ethnicity:
Group      Frequency
Group A    89
Group B    190
Group C    319
Group D    262
Group E    140

Parental Level of Education:
Degree                Frequency
Associate’s Degree    222
Bachelor’s Degree     118
Master’s Degree       59
High School           196
Some College          226
Some High School      179
Data Visualization and Observation
R Programming:
#Histogram for Math Scores:
hist(StudentPerformance$math.score,
col="blue",
main="Distribution of Math Score",
xlab="Math Scores",ylab="Count")
Data Visualization and Observation
R Programming:
#Histogram for Reading Score
hist(StudentPerformance$reading.score,
col="gray",
main="Distribution of Reading Score",
xlab="Reading Scores",ylab="Count")
Data Visualization and Observation
R Programming:
#Histogram for Writing Score
hist(StudentPerformance$writing.score,
col="pink",
main="Distribution of Writing Score",
xlab="Writing Scores",ylab="Count")
Data Visualization and Observation
R Programming:
#Bar chart for Gender
barplot(table(StudentPerformance$gender),
col=c("skyblue","lightgreen"),
main="Distribution of Gender",
xlab="Gender",ylab="Count")
Data Visualization and Observation
R Programming:
#Bar chart for Test Preparation Course
barplot(table(StudentPerformance$test.preparation.course),
col=c("orange","cyan"),
main="Test Preparation Course Completion",
xlab="Course Status",
ylab="Count")
Data Visualization and Observation
R Programming:
#Pie chart for Ethnicity
EthnicityFreq=table(StudentPerformance$race.ethnicity)
EthnicityLabels=paste0(names(EthnicityFreq),
"(",round(100*EthnicityFreq/sum(EthnicityFreq),1),"%)")
pie(EthnicityFreq,
labels=EthnicityLabels,
col=rainbow(length(EthnicityFreq)),
main="Distribution of Ethnicity")
Data Visualization and Observation
R Programming:
#Pie chart for Parental Level of Education
EduFreq=table(StudentPerformance$parental.level.of.education)
EduLabels=paste0(names(EduFreq),
"(",round(100*EduFreq/sum(EduFreq),1),"%)")
pie(EduFreq,
labels=EduLabels,
col=rainbow(length(EduFreq)),
main="Distribution of Parental Level of Education")
Correlation Analysis and Interpretation
R Programming:
library(dplyr)
NumericalVariables=StudentPerformance %>%
select(math.score,reading.score,writing.score)
CorrelationMatrix=cor(NumericalVariables,use="complete.obs")
print(CorrelationMatrix)

                 math.score    reading.score    writing.score
math.score       1.0000000     0.8175797        0.8026420
reading.score    0.8175797     1.0000000        0.9545981
writing.score    0.8026420     0.9545981        1.0000000

Here we see that the variables are strongly correlated with each other. The strongest correlation is between Reading Score and Writing Score.
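The coefficients above are Pearson correlations, which is the default method of cor(). A minimal sketch of that calculation in base R is shown here: the sum of cross-products of the deviations from the means, divided by the square root of the product of the two sums of squared deviations. The function name pearson_by_hand and the two short vectors are hypothetical and are not taken from the dataset.

```r
# Pearson correlation written out from its definition.
pearson_by_hand <- function(x, y) {
  dx <- x - mean(x)   # deviations of x from its mean
  dy <- y - mean(y)   # deviations of y from its mean
  sum(dx * dy) / sqrt(sum(dx^2) * sum(dy^2))
}

# Hypothetical reading and writing scores for six students.
reading <- c(72, 90, 95, 57, 78, 83)
writing <- c(74, 88, 93, 44, 75, 78)

pearson_by_hand(reading, writing)
cor(reading, writing)   # built-in function, should agree with the line above
```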
Regression Model Estimation and Findings

R Programming:
StudentPerformance$gender=as.factor(StudentPerformance$gender)
StudentPerformance$race.ethnicity=as.factor(StudentPerformance$race.ethnicity)
StudentPerformance$parental.level.of.education=as.factor(StudentPerformance$parental.level.of.education)
StudentPerformance$test.preparation.course=as.factor(StudentPerformance$test.preparation.course)
RegressionModel=lm(math.score~reading.score+writing.score+gender+test.preparation.course,data=StudentPerformance)
summary(RegressionModel)
Regression Model Estimation and Findings
Regression Coefficients:
Variable                          Estimate    Std. Error    t-value    P-value
(Intercept)                       -10.904     1.131         -9.642     <2e-16***
Reading Score                     0.298       0.044         6.750      2.51e-11***
Writing Score                     0.698       0.044         15.756     <2e-16***
Gender (Male)                     13.633      0.397         34.303     <2e-16***
Test Preparation Course (None)    3.582       0.420         8.535      <2e-16***
Regression Model Estimation and Interpretation

Thus, the model is given below:

Math Score = -10.904 + 0.298*Reading Score + 0.698*Writing Score + 13.633*Gender (Male) + 3.582*Test Preparation Course (None)
Interpretation:
Here, all the coefficients are statistically significant since p < 0.005.
If the reading score increases by one unit, the math score increases, on average, by 0.298 units, holding all other variables constant.
If the writing score increases by one unit, the math score increases, on average, by 0.698 units, holding all other variables constant.
If all predictors are zero (reading and writing scores of 0 for a female student who completed the test preparation course), the predicted math score is -10.904.
Male students score, on average, 13.633 units higher in math than female students, holding all other variables constant.
Students who did not complete the test preparation course score, on average, 3.582 units higher in math than those who completed it (the coefficient is positive), holding all other variables constant.
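To make the fitted equation concrete, here is a small sketch that plugs values into it. The coefficients are copied from the regression table. The function name predict_math, the 0/1 coding of the arguments male and no_prep, and the input scores of 70 are illustrative assumptions, not part of the assignment.

```r
# Coefficients copied from the regression table in the slides.
coefs <- c(intercept = -10.904, reading = 0.298, writing = 0.698,
           male = 13.633, no_prep = 3.582)

# Predicted math score; male and no_prep are 0/1 indicators
# (0 = female / completed the course, 1 = male / no course).
predict_math <- function(reading, writing, male, no_prep) {
  unname(coefs["intercept"] + coefs["reading"] * reading +
           coefs["writing"] * writing + coefs["male"] * male +
           coefs["no_prep"] * no_prep)
}

# Female student who completed the course, reading = writing = 70.
predict_math(70, 70, male = 0, no_prep = 0)   # 58.816

# Male student without the course, same reading and writing scores.
predict_math(70, 70, male = 1, no_prep = 1)   # 76.031
```

The difference between the two predictions, 17.215, is the sum of the Gender (Male) and Test Preparation Course (None) coefficients, since the reading and writing scores are the same in both calls.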
Hypothesis Testing for Correlations
R Programming:
CorrelationTestReadingVsMath=cor.test(StudentPerformance$reading.score,StudentPerformance$math.score)
print(CorrelationTestReadingVsMath)
CorrelationTestReadingVsWriting=cor.test(StudentPerformance$reading.score,StudentPerformance$writing.score)
print(CorrelationTestReadingVsWriting)
CorrelationTestWritingVsMath=cor.test(StudentPerformance$writing.score,StudentPerformance$math.score)
print(CorrelationTestWritingVsMath)
library(corrplot)
corrplot(CorrelationMatrix,method="color")
Hypothesis Testing for Correlations
H0: The true correlation is equal to 0.
Ha: The true correlation is not equal to 0.
Under the null,
Variables                          Correlation Coefficient (r)    95% Confidence Interval    P-value
Reading Score vs. Math Score       0.818                          (0.796, 0.837)             <2.2e-16***
Reading Score vs. Writing Score    0.955                          (0.949, 0.960)             <2.2e-16***
Writing Score vs. Math Score       0.803                          (0.779, 0.824)             <2.2e-16***
Hypothesis Testing for Correlations

Since the p-value is less than 0.005 for all the correlation coefficients, the null hypothesis is rejected. That is, for all pairs of variables, the correlation coefficient is statistically significant.
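The quantities that cor.test() reports for a Pearson correlation can be approximated by hand. The sketch below uses the reading vs. math correlation from the correlation matrix and assumes n = 1000 complete observations (482 male plus 518 female students in the frequency table). It computes the t statistic with n - 2 degrees of freedom and a 95% confidence interval from the Fisher z transformation. The variable names are illustrative.

```r
r <- 0.8175797   # reading vs. math correlation from the correlation matrix
n <- 1000        # assumed sample size (482 male + 518 female students)

# t statistic for H0: true correlation = 0, with n - 2 degrees of freedom.
t_stat <- r * sqrt(n - 2) / sqrt(1 - r^2)
p_value <- 2 * pt(-abs(t_stat), df = n - 2)

# 95% confidence interval via the Fisher z transformation.
z <- atanh(r)
half_width <- qnorm(0.975) / sqrt(n - 3)
ci <- tanh(c(z - half_width, z + half_width))

t_stat
p_value
round(ci, 3)
```

With these inputs the interval rounds to the (0.796, 0.837) shown in the table, and the p-value is far below 2.2e-16, the smallest value that R prints for such tests.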
Hypothesis Testing for Regression Coefficients
H0: The coefficient is equal to 0.
Ha: The coefficient is not equal to 0.

R Programming:
CoefficientsSummary=summary(RegressionModel)$coefficients
print(CoefficientsSummary)
Under the null,
Variable                          Estimate    Std. Error    t-value    P-value
(Intercept)                       -10.904     1.131         -9.642     <2e-16***
Reading Score                     0.298       0.044         6.750      2.51e-11***
Writing Score                     0.698       0.044         15.756     <2e-16***
Gender (Male)                     13.633      0.397         34.303     <2e-16***
Test Preparation Course (None)    3.582       0.420         8.535      <2e-16***
Hypothesis Testing for Regression Coefficients

Since the p-value is less than the significance level (0.005) for all the regression coefficients, the null hypothesis is rejected. That is, the coefficients are statistically significant.
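Each t-value in the coefficient table is the estimate divided by its standard error, and the p-value is the two-sided tail probability of a t distribution with the residual degrees of freedom. The sketch below recomputes both from the rounded table entries, assuming n = 1000 observations and 5 estimated coefficients (995 residual degrees of freedom). The vector names are illustrative, and because the inputs are rounded to three decimals, the recomputed t-values differ slightly from those printed by summary().

```r
# Estimates and standard errors copied (rounded) from the coefficient table.
estimate  <- c(-10.904, 0.298, 0.698, 13.633, 3.582)
std_error <- c(1.131, 0.044, 0.044, 0.397, 0.420)
names(estimate) <- c("(Intercept)", "reading", "writing", "male", "no_prep")

# Assumed residual degrees of freedom: 1000 observations, 5 coefficients.
df_resid <- 1000 - 5

# t statistic and two-sided p-value for H0: coefficient = 0.
t_value <- estimate / std_error
p_value <- 2 * pt(-abs(t_value), df = df_resid)

round(t_value, 3)
p_value < 0.005   # TRUE means H0 is rejected at the 0.005 level
```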
THANK YOU