Student Analysis
In [2]: df = pd.read_csv("StudentsPerformance.csv")
In [3]: df.head()
   gender  race/ethnicity  parental level of education  lunch         test preparation course  math score  reading score  writing score
0  female  group B         bachelor's degree            standard      none                     72          72             74
1  female  group C         some college                 standard      completed                69          90             88
2  female  group B         master's degree              standard      none                     90          95             93
3  male    group A         associate's degree           free/reduced  none                     47          57             44
4  male    group C         some college                 standard      none                     76          78             75
In [4]: df.shape
Out[4]: (1000, 8)
1. Dataset information
math score
reading score
writing score
Check Duplicates
In [5]: df.isna().sum()
Out[5]:
gender 0
race/ethnicity 0
parental level of education 0
lunch 0
test preparation course 0
math score 0
reading score 0
writing score 0
dtype: int64
In [6]: df.duplicated().sum()
Out[6]: 0
In [7]: df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1000 entries, 0 to 999
Data columns (total 8 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 gender 1000 non-null object
1 race/ethnicity 1000 non-null object
2 parental level of education 1000 non-null object
3 lunch 1000 non-null object
4 test preparation course 1000 non-null object
5 math score 1000 non-null int64
6 reading score 1000 non-null int64
7 writing score 1000 non-null int64
dtypes: int64(3), object(5)
memory usage: 62.6+ KB
In [8]: df.nunique()
Out[8]:
gender 2
race/ethnicity 5
parental level of education 6
lunch 2
test preparation course 2
math score 81
reading score 72
writing score 77
dtype: int64
In [9]: df.describe()
Insight
From the above description of the numerical data, all means are very close to each other, between 66 and 68.05.
All standard deviations are also close, between 14.6 and 15.19.
The minimum score is 0 for math, while the minimum is much higher for writing (10) and higher still for reading (17).
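The statistics this insight compares can be pulled out on their own rather than read off the full `describe()` table. A minimal sketch, assuming only the five rows shown by `df.head()`, so the numbers differ from the 1000-row figures quoted above; `sample`, `score_cols` and `summary` are illustrative names, not from the notebook:

```python
import pandas as pd

# Five rows copied from the df.head() output; illustrative only, not the full dataset.
sample = pd.DataFrame({
    "math score": [72, 69, 90, 47, 76],
    "reading score": [72, 90, 95, 57, 78],
    "writing score": [74, 88, 93, 44, 75],
})
score_cols = ["math score", "reading score", "writing score"]

# Keep just the rows of describe() that the insight talks about.
summary = sample[score_cols].agg(["mean", "std", "min"])
print(summary.round(2))
```

On the real `df`, replacing `sample` with `df` should reproduce the mean, standard deviation and minimum rows of `df.describe()`.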
# print columns
print('We have {} numerical features : {}'.format(len(numeric_features), numeric_fe
print('\nWe have {} categorical features : {}'.format(len(categorical_features), ca
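The cell that defines `numeric_features` and `categorical_features` is not shown. One plausible way to build them, offered as an assumption rather than the notebook's actual code, is to split the columns by dtype. The sketch uses two rows copied from `df.head()`:

```python
import pandas as pd

# Two rows copied from df.head(); enough to show the dtype split.
sample = pd.DataFrame({
    "gender": ["female", "male"],
    "race/ethnicity": ["group B", "group A"],
    "parental level of education": ["bachelor's degree", "associate's degree"],
    "lunch": ["standard", "free/reduced"],
    "test preparation course": ["none", "none"],
    "math score": [72, 47],
    "reading score": [72, 57],
    "writing score": [74, 44],
})

# Assumed definitions: numeric columns versus everything else.
numeric_features = sample.select_dtypes(include="number").columns.tolist()
categorical_features = sample.select_dtypes(exclude="number").columns.tolist()

print("We have {} numerical features : {}".format(len(numeric_features), numeric_features))
print("\nWe have {} categorical features : {}".format(len(categorical_features), categorical_features))
```

This split agrees with the `df.info()` output earlier, which lists three `int64` columns and five `object` columns.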
   gender  race/ethnicity  parental level of education  lunch         test preparation course  math score  reading score  writing score  total  average
0  female  group B         bachelor's degree            standard      none                     72          72             74             218    72
1  female  group C         some college                 standard      completed                69          90             88             247    82
2  female  group B         master's degree              standard      none                     90          95             93             278    92
3  male    group A         associate's degree           free/reduced  none                     47          57             44             148    49
4  male    group C         some college                 standard      none                     76          78             75             229    76
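The two extra columns in this output are consistent with a row-wise sum of the three scores and that sum divided by three. The cell that creates them is not shown, so the following is a sketch under that assumption; `average` is the name used by the later plots, while `total` is an assumed column name:

```python
import pandas as pd

# Scores copied from the df.head() output.
sample = pd.DataFrame({
    "math score": [72, 69, 90, 47, 76],
    "reading score": [72, 90, 95, 57, 78],
    "writing score": [74, 88, 93, 44, 75],
})
score_cols = ["math score", "reading score", "writing score"]

# "total" is an assumed name; "average" matches the column used in the histograms.
sample["total"] = sample[score_cols].sum(axis=1)
sample["average"] = sample["total"] / len(score_cols)
print(sample)
```

The totals match the values shown above (218, 247, 278, 148, 229). The averages are not whole numbers (218 / 3 is about 72.67), so the integers in the displayed output appear to be cut off.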
Insights
From the above values, students performed worst in maths.
Histogram
In [18]: plt.subplots(1,3,figsize=(25,6))
# three panels: use a 1x3 grid to match plt.subplots above
plt.subplot(131)
sns.histplot(data=df,x='average',kde=True,hue='lunch')
plt.subplot(132)
sns.histplot(data=df[df.gender=='female'],x='average',kde=True,hue='lunch')
plt.subplot(133)
sns.histplot(data=df[df.gender=='male'],x='average',kde=True,hue='lunch')
plt.show()
Insights
In [19]: plt.subplots(1,3,figsize=(25,6))
plt.subplot(131)
ax =sns.histplot(data=df,x='average',kde=True,hue='parental level of education')
plt.subplot(132)
ax =sns.histplot(data=df[df.gender=='male'],x='average',kde=True,hue='parental leve
plt.subplot(133)
ax =sns.histplot(data=df[df.gender=='female'],x='average',kde=True,hue='parental le
plt.show()
Insights
The 2nd plot suggests that male students whose parents hold an associate's or master's degree tend to perform well in the exam.
In the 3rd plot, parental education shows no clear effect on female students.
In [20]: plt.subplots(1,3,figsize=(25,6))
plt.subplot(131)
ax =sns.histplot(data=df,x='average',kde=True,hue='race/ethnicity')
plt.subplot(132)
ax =sns.histplot(data=df[df.gender=='female'],x='average',kde=True,hue='race/ethnic
plt.subplot(133)
ax =sns.histplot(data=df[df.gender=='male'],x='average',kde=True,hue='race/ethnicit
plt.show()
Insights
Students of group A and group B tend to perform poorly in the exam, irrespective of whether they are male or female.
In [21]: plt.figure(figsize=(18,8))
plt.subplot(1, 4, 1)
plt.title('MATH SCORES')
sns.violinplot(y='math score',data=df,color='red',linewidth=3)
plt.subplot(1, 4, 2)
plt.title('READING SCORES')
sns.violinplot(y='reading score',data=df,color='green',linewidth=3)
plt.subplot(1, 4, 3)
plt.title('WRITING SCORES')
sns.violinplot(y='writing score',data=df,color='blue',linewidth=3)
plt.show()
Insights
From the above three plots, it is clearly visible that most students score between 60 and 80 in maths, whereas in reading and writing most of them score between 50 and 80.
plt.subplot(1, 5, 1)
size = df['gender'].value_counts()
labels = 'Female', 'Male'
color = ['red','green']
plt.subplot(1, 5, 2)
size = df['race/ethnicity'].value_counts()
labels = 'Group C', 'Group D','Group B','Group E','Group A'
color = ['red', 'green', 'blue', 'cyan','orange']
plt.subplot(1, 5, 3)
size = df['lunch'].value_counts()
labels = 'Standard', 'Free'
color = ['red','green']
plt.subplot(1, 5, 4)
size = df['test preparation course'].value_counts()
labels = 'None', 'Completed'
color = ['red','green']
plt.subplot(1, 5, 5)
size = df['parental level of education'].value_counts()
labels = 'Some College', "Associate's Degree",'High School','Some High School',"Bac
color = ['red', 'green', 'blue', 'cyan','orange','grey']
plt.tight_layout()
plt.grid()
plt.show()
Insights
More students have not enrolled in a test preparation course than have completed one.
"Some College" is the most common parental education level, followed closely by "Associate's Degree".
In [23]: f,ax=plt.subplots(1,2,figsize=(20,10))
sns.countplot(x=df['gender'],data=df,palette ='bright',ax=ax[0],saturation=0.95)
for container in ax[0].containers:
    ax[0].bar_label(container,color='black',size=20)
# value_counts() sorts by count, so female (518) comes before male (482)
plt.pie(x=df['gender'].value_counts(),labels=['Female','Male'],explode=[0,0.1],auto
plt.show()
Insights
Gender is roughly balanced: 518 students are female (52%) and 482 are male (48%).
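The percentage split follows directly from the counts. A minimal sketch, assuming the counts reported in this notebook (518 female, 482 male out of 1000); on the full DataFrame, `df['gender'].value_counts(normalize=True)` would give the same shares as fractions:

```python
import pandas as pd

# Counts as reported in the notebook; the Series name is illustrative.
counts = pd.Series({"female": 518, "male": 482}, name="gender")

# Share of each gender as a percentage of all students.
share = (counts / counts.sum() * 100).round(1)
print(share)
```

Taking pie-chart labels from the index of the counts, rather than typing them by hand, keeps each label attached to its own value.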
In [24]: f,ax=plt.subplots(1,2,figsize=(20,10))
sns.countplot(x=df['race/ethnicity'],data=df,palette = 'bright',ax=ax[0],saturation
for container in ax[0].containers:
    ax[0].bar_label(container,color='black',size=20)
plt.pie(x = df['race/ethnicity'].value_counts(),labels=df['race/ethnicity'].value_c
plt.show()
Insights
In [25]: Group_data2=df.groupby('race/ethnicity')
f,ax=plt.subplots(1,3,figsize=(20,8))
sns.barplot(x=Group_data2['math score'].mean().index,y=Group_data2['math score'].me
ax[0].set_title('Math score',color='#005ce6',size=20)
Insights
Students from a lower socioeconomic status have a lower average in all subjects.
BIVARIATE ANALYSIS (Does lunch type have any impact on student performance?)
In [27]: f,ax=plt.subplots(1,2,figsize=(20,8))
sns.countplot(x=df['parental level of education'],data=df,palette = 'bright',hue='t
ax[0].set_title('Students vs test preparation course ',color='black',size=25)
for container in ax[0].containers:
    ax[0].bar_label(container,color='black',size=20)
Insights
Students who get standard lunch tend to perform better than students who get free/reduced lunch.
BIVARIATE ANALYSIS (Does the test preparation course have any impact on student performance?)
In [28]: plt.figure(figsize=(12,6))
plt.subplot(2,2,1)
sns.barplot (x=df['lunch'], y=df['math score'], hue=df['test preparation course'])
plt.subplot(2,2,2)
sns.barplot (x=df['lunch'], y=df['reading score'], hue=df['test preparation course
plt.subplot(2,2,3)
sns.barplot (x=df['lunch'], y=df['writing score'], hue=df['test preparation course
Insights
Students who have completed the test preparation course score higher in all three categories than those who haven't taken the course.
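The group comparisons in this section and in the lunch analysis can also be checked numerically with a grouped mean. A minimal sketch, assuming only the five rows from `df.head()`; with one "completed" row and one "free/reduced" row it shows the mechanics, not the notebook's conclusions. `mean_scores_by` is a hypothetical helper name:

```python
import pandas as pd

# Five rows copied from df.head(); far too few to support any conclusion.
sample = pd.DataFrame({
    "lunch": ["standard", "standard", "standard", "free/reduced", "standard"],
    "test preparation course": ["none", "completed", "none", "none", "none"],
    "math score": [72, 69, 90, 47, 76],
    "reading score": [72, 90, 95, 57, 78],
    "writing score": [74, 88, 93, 44, 75],
})
score_cols = ["math score", "reading score", "writing score"]


def mean_scores_by(frame, column):
    """Mean of each score column per category of `column` (hypothetical helper)."""
    return frame.groupby(column)[score_cols].mean()


by_prep = mean_scores_by(sample, "test preparation course")
by_lunch = mean_scores_by(sample, "lunch")
print(by_prep)
print(by_lunch)
```

Calling `mean_scores_by(df, 'lunch')` on the full DataFrame would give the per-group averages that the bar plots display.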
In [29]: plt.subplots(1,4,figsize=(16,5))
plt.subplot(141)
sns.boxplot(df['math score'],color='skyblue')
plt.subplot(142)
sns.boxplot(df['reading score'],color='hotpink')
plt.subplot(143)
sns.boxplot(df['writing score'],color='yellow')
plt.subplot(144)
sns.boxplot(df['average'],color='lightgreen')
plt.show()
4.4.7 MULTIVARIATE ANALYSIS USING PAIRPLOT
Insights
From the above plot, it is clear that all the scores increase roughly linearly with one another.
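A pairwise correlation matrix gives a numeric counterpart to the visual impression from the pairplot. A minimal sketch, assuming only the five rows from `df.head()`, so the coefficients are illustrative and not those of the full dataset:

```python
import pandas as pd

# Scores copied from the df.head() output; illustrative only.
sample = pd.DataFrame({
    "math score": [72, 69, 90, 47, 76],
    "reading score": [72, 90, 95, 57, 78],
    "writing score": [74, 88, 93, 44, 75],
})

# Pearson correlation between every pair of score columns.
corr = sample.corr()
print(corr.round(2))
```

Values close to 1 on the full `df` would support the reading that the three scores rise together.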
1. Conclusions