
Student Analysis

Uploaded by

Ahaan Raza
Copyright
© All Rights Reserved

1. Import Data and Required Packages

Importing the Pandas, NumPy, Matplotlib, Seaborn and Warnings libraries.

In [1]: import numpy as np
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
%matplotlib inline
import warnings
warnings.filterwarnings('ignore')

Import the CSV Data as Pandas DataFrame

In [2]: df = pd.read_csv("StudentsPerformance.csv")

Show Top 5 Records

In [3]: df.head()

Out[3]:
   gender  race/ethnicity  parental level of education  lunch         test preparation course  math score  reading score  writing score
0  female  group B         bachelor's degree            standard      none                             72             72             74
1  female  group C         some college                 standard      completed                        69             90             88
2  female  group B         master's degree              standard      none                             90             95             93
3  male    group A         associate's degree           free/reduced  none                             47             57             44
4  male    group C         some college                 standard      none                             76             78             75

Shape of the dataset

In [4]: df.shape

Out[4]: (1000, 8)

2. Dataset information

gender : sex of the student -> (male/female)

race/ethnicity : ethnicity group of the student -> (group A, B, C, D, E)

parental level of education : parents' final education -> (bachelor's degree, some college, master's degree, associate's degree, high school, some high school)

lunch : lunch the student had before the test -> (standard or free/reduced)

test preparation course : whether the course was completed before the test -> (none/completed)

math score

reading score

writing score

3. Data Checks to perform

Check Missing values

Check Duplicates

Check data type

Check the number of unique values of each column

Check statistics of data set

Check various categories present in the different categorical column
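
The checks listed above can also be bundled into one helper. The following is only a minimal sketch: the function name `run_data_checks` and the tiny `toy` frame are made up for illustration and are not part of the original notebook.

```python
import pandas as pd

def run_data_checks(frame: pd.DataFrame) -> dict:
    """Collect the basic quality checks into one dictionary."""
    return {
        "missing_per_column": frame.isna().sum().to_dict(),
        "duplicate_rows": int(frame.duplicated().sum()),
        "dtypes": frame.dtypes.astype(str).to_dict(),
        "unique_per_column": frame.nunique().to_dict(),
    }

# Made-up frame (not the real data) with one missing value and one duplicate row.
toy = pd.DataFrame({
    "gender": ["female", "male", "male", "female"],
    "math score": [72, 47, 47, None],
})
report = run_data_checks(toy)
print(report)
```

On the real data the call would be `run_data_checks(df)`; the individual cells below show the same checks one by one.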

3.1 Check Missing values

In [5]: df.isna().sum()

Out[5]:
gender                         0
race/ethnicity                 0
parental level of education    0
lunch                          0
test preparation course        0
math score                     0
reading score                  0
writing score                  0
dtype: int64

There are no missing values in the data set

3.2 Check Duplicates

In [6]: df.duplicated().sum()

Out[6]: 0

There are no duplicate rows in the data set

3.3 Check data types

In [7]: df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1000 entries, 0 to 999
Data columns (total 8 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 gender 1000 non-null object
1 race/ethnicity 1000 non-null object
2 parental level of education 1000 non-null object
3 lunch 1000 non-null object
4 test preparation course 1000 non-null object
5 math score 1000 non-null int64
6 reading score 1000 non-null int64
7 writing score 1000 non-null int64
dtypes: int64(3), object(5)
memory usage: 62.6+ KB

3.4 Checking the number of unique values of each column

In [8]: df.nunique()

Out[8]:
gender                          2
race/ethnicity                  5
parental level of education     6
lunch                           2
test preparation course         2
math score                     81
reading score                  72
writing score                  77
dtype: int64

3.5 Check statistics of data set

In [9]: df.describe()

Out[9]:
       math score  reading score  writing score
count  1000.00000    1000.000000    1000.000000
mean     66.08900      69.169000      68.054000
std      15.16308      14.600192      15.195657
min       0.00000      17.000000      10.000000
25%      57.00000      59.000000      57.750000
50%      66.00000      70.000000      69.000000
75%      77.00000      79.000000      79.000000
max     100.00000     100.000000     100.000000

Insight

From the above description of the numerical data, all means are very close to each other: between 66.09 and 69.17.

All standard deviations are also close: between 14.60 and 15.20.

While the minimum score for math is 0, the minimum for writing is much higher (10) and for reading higher still (17).
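
As a quick cross-check of these ranges, the sketch below copies the mean and std rows from the `df.describe()` output into a small frame and takes their minimum and maximum. The names `summary` and `spread` are illustrative only.

```python
import pandas as pd

# Means and standard deviations copied from the df.describe() output.
summary = pd.DataFrame(
    {
        "mean": [66.089, 69.169, 68.054],
        "std": [15.16308, 14.600192, 15.195657],
    },
    index=["math score", "reading score", "writing score"],
)

# Smallest and largest value of each statistic across the three subjects.
spread = summary.agg(["min", "max"]).round(2)
print(spread)
```

With the full data loaded, the same table could be obtained from `df.describe().loc[["mean", "std"]].T.agg(["min", "max"])`.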

3.6 Exploring Data


In [10]: df.head()

Out[10]:
   gender  race/ethnicity  parental level of education  lunch         test preparation course  math score  reading score  writing score
0  female  group B         bachelor's degree            standard      none                             72             72             74
1  female  group C         some college                 standard      completed                        69             90             88
2  female  group B         master's degree              standard      none                             90             95             93
3  male    group A         associate's degree           free/reduced  none                             47             57             44
4  male    group C         some college                 standard      none                             76             78             75

In [11]: print("Categories in 'gender' variable: ",end=" " )
print(df['gender'].unique())

print("Categories in 'race_ethnicity' variable: ",end=" ")
print(df['race/ethnicity'].unique())

print("Categories in'parental level of education' variable:",end=" " )
print(df['parental level of education'].unique())

print("Categories in 'lunch' variable: ",end=" " )
print(df['lunch'].unique())

print("Categories in 'test preparation course' variable: ",end=" " )
print(df['test preparation course'].unique())

Categories in 'gender' variable: ['female' 'male']
Categories in 'race_ethnicity' variable: ['group B' 'group C' 'group A' 'group D' 'group E']
Categories in'parental level of education' variable: ["bachelor's degree" 'some college' "master's degree" "associate's degree"
'high school' 'some high school']
Categories in 'lunch' variable: ['standard' 'free/reduced']
Categories in 'test preparation course' variable: ['none' 'completed']

In [12]: # define numerical & categorical columns
numeric_features = [feature for feature in df.columns if df[feature].dtype != 'O']
categorical_features = [feature for feature in df.columns if df[feature].dtype == 'O']

# print columns
print('We have {} numerical features : {}'.format(len(numeric_features), numeric_features))
print('\nWe have {} categorical features : {}'.format(len(categorical_features), categorical_features))

We have 3 numerical features : ['math score', 'reading score', 'writing score']

We have 5 categorical features : ['gender', 'race/ethnicity', 'parental level of education', 'lunch', 'test preparation course']
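
An alternative way to make the same split is `DataFrame.select_dtypes`, which avoids comparing against the `'O'` dtype code. This is a sketch on a made-up two-row frame (`toy`), not the notebook's own approach.

```python
import pandas as pd

# Made-up frame with two text columns and two numeric columns.
toy = pd.DataFrame({
    "gender": ["female", "male"],
    "lunch": ["standard", "free/reduced"],
    "math score": [72, 47],
    "reading score": [72, 57],
})

# Numeric columns are selected directly; everything else is treated as categorical.
numeric_features = toy.select_dtypes(include="number").columns.tolist()
categorical_features = toy.select_dtypes(exclude="number").columns.tolist()

print(numeric_features)
print(categorical_features)
```

Selecting on `"number"` keeps working if text columns are stored with a dedicated string dtype rather than `object`.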

3.8 Adding columns for "Total Score" and "Average"

In [13]: df['total score'] = df['math score'] + df['reading score'] + df['writing score']
df['average'] = df['total score']/3
df.head()

Out[13]:
   gender  race/ethnicity  parental level of education  lunch         test preparation course  math score  reading score  writing score  total score    average
0  female  group B         bachelor's degree            standard      none                             72             72             74          218  72.666667
1  female  group C         some college                 standard      completed                        69             90             88          247  82.333333
2  female  group B         master's degree              standard      none                             90             95             93          278  92.666667
3  male    group A         associate's degree           free/reduced  none                             47             57             44          148  49.333333
4  male    group C         some college                 standard      none                             76             78             75          229  76.333333


In [14]: reading_full = df[df['reading score'] == 100]['average'].count()
writing_full = df[df['writing score'] == 100]['average'].count()
math_full = df[df['math score'] == 100]['average'].count()

print(f'Number of students with full marks in Maths: {math_full}')
print(f'Number of students with full marks in Writing: {writing_full}')
print(f'Number of students with full marks in Reading: {reading_full}')

Number of students with full marks in Maths: 7
Number of students with full marks in Writing: 14
Number of students with full marks in Reading: 17

In [15]: # the filter is <= 20, so the messages say "20 marks or less"
reading_less_20 = df[df['reading score'] <= 20]['average'].count()
writing_less_20 = df[df['writing score'] <= 20]['average'].count()
math_less_20 = df[df['math score'] <= 20]['average'].count()

print(f'Number of students with 20 marks or less in Maths: {math_less_20}')
print(f'Number of students with 20 marks or less in Writing: {writing_less_20}')
print(f'Number of students with 20 marks or less in Reading: {reading_less_20}')

Number of students with 20 marks or less in Maths: 4
Number of students with 20 marks or less in Writing: 3
Number of students with 20 marks or less in Reading: 1

Insights

From the above values, students performed worst in Maths.

The best performance is in the reading section.
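
The same counts can be taken for all three subjects at once with boolean masks. The sketch below uses a made-up four-row frame (`toy`) and also shows how `<= 20` and `< 20` can differ when a score is exactly 20.

```python
import pandas as pd

score_cols = ["math score", "reading score", "writing score"]

# Made-up scores (not the real data); note the math score of exactly 20.
toy = pd.DataFrame({
    "math score": [100, 20, 55, 0],
    "reading score": [100, 100, 61, 17],
    "writing score": [93, 19, 58, 10],
})

# Comparing a frame yields a frame of booleans; summing counts the True values per column.
full_marks = (toy[score_cols] == 100).sum()
at_most_20 = (toy[score_cols] <= 20).sum()
below_20 = (toy[score_cols] < 20).sum()

print(full_marks, at_most_20, below_20, sep="\n\n")
```

On the real data, `toy` would simply be replaced by `df`.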

4. Exploring Data (Visualization)

4.1 Visualize the average score distribution to draw some conclusions.

Histogram

Kernel Density Estimate (KDE)

4.1.1 Histogram & KDE

In [16]: fig, axs = plt.subplots(1, 2, figsize=(15, 7))
plt.subplot(121)
sns.histplot(data=df,x='average',bins=30,kde=True,color='g')
plt.subplot(122)
sns.histplot(data=df,x='average',kde=True,hue='gender')
plt.show()

In [17]: fig, axs = plt.subplots(1, 2, figsize=(15, 7))
plt.subplot(121)
sns.histplot(data=df,x='total score',bins=30,kde=True,color='g')
plt.subplot(122)
sns.histplot(data=df,x='total score',kde=True,hue='gender')
plt.show()

Female students tend to perform better than male students.

In [18]: plt.subplots(1,3,figsize=(25,6))
# subplot codes match the 1x3 grid created above
plt.subplot(131)
sns.histplot(data=df,x='average',kde=True,hue='lunch')
plt.subplot(132)
sns.histplot(data=df[df.gender=='female'],x='average',kde=True,hue='lunch')
plt.subplot(133)
sns.histplot(data=df[df.gender=='male'],x='average',kde=True,hue='lunch')
plt.show()

Insights

Students with a standard lunch tend to perform better in exams.

This holds for both male and female students.

In [19]: plt.subplots(1,3,figsize=(25,6))
# subplot codes match the 1x3 grid created above
plt.subplot(131)
ax =sns.histplot(data=df,x='average',kde=True,hue='parental level of education')
plt.subplot(132)
ax =sns.histplot(data=df[df.gender=='male'],x='average',kde=True,hue='parental level of education')
plt.subplot(133)
ax =sns.histplot(data=df[df.gender=='female'],x='average',kde=True,hue='parental level of education')
plt.show()

Insights

In general, parents' education does not appear to help students perform well in exams.

The 2nd plot shows that male students whose parents hold an associate's degree or a master's degree tend to perform well in exams.

In the 3rd plot we can see that parents' education has no visible effect on female students.

In [20]: plt.subplots(1,3,figsize=(25,6))
# subplot codes match the 1x3 grid created above
plt.subplot(131)
ax =sns.histplot(data=df,x='average',kde=True,hue='race/ethnicity')
plt.subplot(132)
ax =sns.histplot(data=df[df.gender=='female'],x='average',kde=True,hue='race/ethnicity')
plt.subplot(133)
ax =sns.histplot(data=df[df.gender=='male'],x='average',kde=True,hue='race/ethnicity')
plt.show()

Insights

Students of group A and group B tend to perform poorly in exams.

This holds irrespective of whether they are male or female.

In [21]: plt.figure(figsize=(18,8))
plt.subplot(1, 4, 1)
plt.title('MATH SCORES')
sns.violinplot(y='math score',data=df,color='red',linewidth=3)
plt.subplot(1, 4, 2)
plt.title('READING SCORES')
sns.violinplot(y='reading score',data=df,color='green',linewidth=3)
plt.subplot(1, 4, 3)
plt.title('WRITING SCORES')
sns.violinplot(y='writing score',data=df,color='blue',linewidth=3)
plt.show()

Insights

From the above three plots it is clearly visible that most students score between 60 and 80 in Maths, whereas in reading and writing most of them score between 50 and 80.

4.3 Multivariate analysis using pieplot


In [22]: plt.rcParams['figure.figsize'] = (30, 12)

# autopct uses '%.2f%%' (two decimals followed by a percent sign)
plt.subplot(1, 5, 1)
size = df['gender'].value_counts()
labels = 'Female', 'Male'
color = ['red','green']
plt.pie(size, colors = color, labels = labels,autopct = '%.2f%%')
plt.title('Gender', fontsize = 20)
plt.axis('off')

plt.subplot(1, 5, 2)
size = df['race/ethnicity'].value_counts()
labels = 'Group C', 'Group D','Group B','Group E','Group A'
color = ['red', 'green', 'blue', 'cyan','orange']
plt.pie(size, colors = color,labels = labels,autopct = '%.2f%%')
plt.title('Race/Ethnicity', fontsize = 20)
plt.axis('off')

plt.subplot(1, 5, 3)
size = df['lunch'].value_counts()
labels = 'Standard', 'Free'
color = ['red','green']
plt.pie(size, colors = color,labels = labels,autopct = '%.2f%%')
plt.title('Lunch', fontsize = 20)
plt.axis('off')

plt.subplot(1, 5, 4)
size = df['test preparation course'].value_counts()
labels = 'None', 'Completed'
color = ['red','green']
plt.pie(size, colors = color,labels = labels,autopct = '%.2f%%')
plt.title('Test Course', fontsize = 20)
plt.axis('off')

plt.subplot(1, 5, 5)
size = df['parental level of education'].value_counts()
# the labels line was cut off after "Bac" in the source; the last two labels are the remaining categories
labels = 'Some College', "Associate's Degree",'High School','Some High School',"Bachelor's Degree","Master's Degree"
color = ['red', 'green', 'blue', 'cyan','orange','grey']
plt.pie(size, colors = color,labels = labels,autopct = '%.2f%%')
plt.title('Parental Education', fontsize = 20)
plt.axis('off')

plt.tight_layout()
plt.grid()

plt.show()
Insights

The number of male and female students is almost equal.

The number of students is greatest in Group C.

More students have a standard lunch than a free/reduced lunch.

More students have not enrolled in any test preparation course than have completed one.

The most common parental education is "Some College", followed closely by "Associate's Degree".

4.4 Feature Wise Visualization

4.4.1 GENDER COLUMN

How is gender distributed?

Does gender have any impact on student performance?

UNIVARIATE ANALYSIS ( How is gender distributed? )

In [23]: f,ax=plt.subplots(1,2,figsize=(20,10))
sns.countplot(x=df['gender'],data=df,palette ='bright',ax=ax[0],saturation=0.95)
for container in ax[0].containers:
    ax[0].bar_label(container,color='black',size=20)

# value_counts() lists 'female' first (518 vs 482), so the labels follow that order.
# The original call was cut off after "auto"; the autopct format is an assumed completion.
plt.pie(x=df['gender'].value_counts(),labels=['Female','Male'],explode=[0,0.1],autopct='%1.1f%%')
plt.show()

Insights

Gender has balanced data: female students are 518 (52%) and male students are 482 (48%).
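
The shares follow directly from the counts: with 1000 students, 518 is 51.8% and 482 is 48.2%. A small sketch, rebuilding only the gender column from those two counts (the Series `gender` is constructed here for illustration):

```python
import pandas as pd

# Rebuild the gender column from the reported counts (518 female, 482 male).
gender = pd.Series(["female"] * 518 + ["male"] * 482, name="gender")

counts = gender.value_counts()
# normalize=True returns fractions of the total instead of raw counts.
percent = (gender.value_counts(normalize=True) * 100).round(1)

print(counts)
print(percent)
```

On the real data the same two calls would be applied to `df['gender']`.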

4.4.2 RACE/ETHNICITY COLUMN

How are the groups distributed?

Does race/ethnicity have any impact on student performance?

UNIVARIATE ANALYSIS ( How are the groups distributed? )

In [24]: f,ax=plt.subplots(1,2,figsize=(20,10))
# Both plotting calls were cut off in the source (after "saturation" and "value_c");
# the saturation value, ".value_counts().index" and autopct below are assumed completions.
sns.countplot(x=df['race/ethnicity'],data=df,palette = 'bright',ax=ax[0],saturation=0.95)
for container in ax[0].containers:
    ax[0].bar_label(container,color='black',size=20)

plt.pie(x = df['race/ethnicity'].value_counts(),labels=df['race/ethnicity'].value_counts().index,autopct='%1.1f%%')
plt.show()

Insights

Most of the students belong to group C or group D.

The lowest number of students belong to group A.

BIVARIATE ANALYSIS ( Does race/ethnicity have any impact on student performance? )

In [25]: Group_data2=df.groupby('race/ethnicity')
f,ax=plt.subplots(1,3,figsize=(20,8))
# The three barplot calls were cut off in the source after ".me" / "scor";
# the ".mean().values" and "ax=" arguments below are assumed completions.
sns.barplot(x=Group_data2['math score'].mean().index,y=Group_data2['math score'].mean().values,ax=ax[0])
ax[0].set_title('Math score',color='#005ce6',size=20)

for container in ax[0].containers:
    ax[0].bar_label(container,color='black',size=15)

sns.barplot(x=Group_data2['reading score'].mean().index,y=Group_data2['reading score'].mean().values,ax=ax[1])
ax[1].set_title('Reading score',color='#005ce6',size=20)

for container in ax[1].containers:
    ax[1].bar_label(container,color='black',size=15)

sns.barplot(x=Group_data2['writing score'].mean().index,y=Group_data2['writing score'].mean().values,ax=ax[2])
ax[2].set_title('Writing score',color='#005ce6',size=20)

for container in ax[2].containers:
    ax[2].bar_label(container,color='black',size=15)

Insights

Group E students have scored the highest marks.

Group A students have scored the lowest marks.

Students from a lower socioeconomic status have a lower average in all subjects.
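
The group comparison read off the bar charts can also be tabulated. A minimal sketch on made-up scores for two groups (the frame `toy` and its values are invented; only the column names come from the data set):

```python
import pandas as pd

score_cols = ["math score", "reading score", "writing score"]

# Made-up scores for two groups, two students each.
toy = pd.DataFrame({
    "race/ethnicity": ["group A", "group A", "group E", "group E"],
    "math score": [40, 60, 80, 90],
    "reading score": [50, 70, 70, 90],
    "writing score": [45, 65, 75, 85],
})

# Mean of each subject per group, then groups ranked by their overall mean.
group_means = toy.groupby("race/ethnicity")[score_cols].mean()
ranking = group_means.mean(axis=1).sort_values(ascending=False)

print(group_means)
print(ranking)
```

Running the same two lines on `df` gives the numbers behind the three bar charts in one table.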

4.4.3 PARENTAL LEVEL OF EDUCATION COLUMN

What is the educational background of the students' parents?

Does parental education have any impact on student performance?

UNIVARIATE ANALYSIS ( What is the educational background of the students' parents? )

In [26]: plt.rcParams['figure.figsize'] = (15, 9)
plt.style.use('fivethirtyeight')
sns.histplot(df["parental level of education"], palette = 'Blues')
plt.title('Comparison of Parental Education', fontweight = 30, fontsize = 20)
plt.xlabel('Degree')
plt.ylabel('count')
plt.show()

Insights

The largest group of parents has a "some college" education.

4.4.4 LUNCH COLUMN

Which type of lunch is most common among students?

What is the effect of lunch type on test results?

BIVARIATE ANALYSIS ( Does lunch type have any impact on student performance? )

In [27]: f,ax=plt.subplots(1,2,figsize=(20,8))
# Both countplot calls were cut off in the source (after "hue='t" and "hue='l");
# the hue columns and "ax=" arguments below are assumed completions.
sns.countplot(x=df['parental level of education'],data=df,palette = 'bright',hue='test preparation course',ax=ax[0])
ax[0].set_title('Students vs test preparation course ',color='black',size=25)
for container in ax[0].containers:
    ax[0].bar_label(container,color='black',size=20)

sns.countplot(x=df['parental level of education'],data=df,palette = 'bright',hue='lunch',ax=ax[1])
for container in ax[1].containers:
    ax[1].bar_label(container,color='black',size=20)

Insights

Students who get a standard lunch tend to perform better than students who get a free/reduced lunch.

4.4.5 TEST PREPARATION COURSE COLUMN

Which type of lunch is most common among students?

Does the test preparation course have any impact on student performance?

BIVARIATE ANALYSIS ( Does the test preparation course have any impact on student performance? )

In [28]: plt.figure(figsize=(12,6))
plt.subplot(2,2,1)
sns.barplot (x=df['lunch'], y=df['math score'], hue=df['test preparation course'])
plt.subplot(2,2,2)
sns.barplot (x=df['lunch'], y=df['reading score'], hue=df['test preparation course'])
plt.subplot(2,2,3)
sns.barplot (x=df['lunch'], y=df['writing score'], hue=df['test preparation course'])

Out[28]: <Axes: xlabel='lunch', ylabel='writing score'>

Insights

Students who have completed the test preparation course score higher in all three categories than those who have not taken the course.

4.4.6 CHECKING OUTLIERS

In [29]: plt.subplots(1,4,figsize=(16,5))
plt.subplot(141)
sns.boxplot(df['math score'],color='skyblue')
plt.subplot(142)
sns.boxplot(df['reading score'],color='hotpink')
plt.subplot(143)
sns.boxplot(df['writing score'],color='yellow')
plt.subplot(144)
sns.boxplot(df['average'],color='lightgreen')
plt.show()
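
With default settings, a box plot draws as outliers the points lying more than 1.5 times the interquartile range (IQR) beyond the quartiles. As a worked example, the sketch below computes those fences for the math score from the quartiles in the `df.describe()` output (25% = 57, 75% = 77). The helper name `iqr_fences` is made up for illustration.

```python
def iqr_fences(q1, q3, whis=1.5):
    """Return the (lower, upper) limits beyond which points count as outliers."""
    iqr = q3 - q1
    return q1 - whis * iqr, q3 + whis * iqr

# Quartiles of the math score taken from the df.describe() output.
low, high = iqr_fences(57.0, 77.0)
print(low, high)  # 27.0 107.0
```

Since scores cannot exceed 100, only math scores below 27 would be flagged under this rule.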

4.4.7 MULTIVARIATE ANALYSIS USING PAIRPLOT

In [30]: sns.pairplot(df,hue = 'gender')
plt.show()

Insights

From the above plot it is clear that all the scores increase linearly with each other.
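
The linear relationship visible in a pair plot can be quantified with a correlation matrix. The sketch below uses only the five rows shown by `df.head()`, so the numbers are illustrative and not the correlations of the full data set.

```python
import pandas as pd

# The three score columns of the five rows displayed by df.head().
toy = pd.DataFrame({
    "math score": [72, 69, 90, 47, 76],
    "reading score": [72, 90, 95, 57, 78],
    "writing score": [74, 88, 93, 44, 75],
})

# Pearson correlation between every pair of score columns.
corr = toy.corr()
print(corr.round(3))
```

For the full data, `df[['math score', 'reading score', 'writing score']].corr()` would give the corresponding matrix.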

5. Conclusions

Student performance is related to lunch type, race/ethnicity and parental level of education.

Females lead in pass percentage and are also the top scorers.

Student performance is not strongly related to the test preparation course.

Completing the preparation course is nevertheless beneficial.
