machine learning with python
machine learning with python
import numpy as np
import pandas as pd
# for visualizations
import matplotlib.pyplot as plt
import seaborn as sns
import plotly.express as px
import dabl
['StudentsPerformance.csv']
(1000, 8)
data.head()
Descriptive Statistics
In [5]: # describing the dataset
data.describe()
Loading [MathJax]/extensions/Safe.js
Out[5]: math score reading score writing score
In [6]: # lets check the no. of unique items present in the categorical column
data.select_dtypes('object').nunique()
gender 2
Out[6]:
race/ethnicity 5
parental level of education 6
lunch 2
test preparation course 2
dtype: int64
In [7]: # lets check the percentage of missing data in each columns present in the data
no_of_columns = data.shape[0]
percentage_of_missing_data = data.isnull().sum()/no_of_columns
print(percentage_of_missing_data)
gender 0.0
race/ethnicity 0.0
parental level of education 0.0
lunch 0.0
test preparation course 0.0
math score 0.0
reading score 0.0
writing score 0.0
dtype: float64
plt.rcParams['figure.figsize'] = (18, 6)
plt.style.use('fivethirtyeight')
dabl.plot(data, target_col = 'math score')
A future version of pandas will default to `skipna=True`. To silence this warning, pass
`skipna=True|False` explicitly.
Loading [MathJax]/extensions/Safe.js
In [9]: # comparison of all other attributes with respect to Reading Marks
plt.rcParams['figure.figsize'] = (18, 6)
plt.style.use('fivethirtyeight')
dabl.plot(data, target_col = 'reading score')
/opt/conda/lib/python3.6/site-packages/pandas/core/apply.py:186: FutureWarning:
A future version of pandas will default to `skipna=True`. To silence this warning, pass
`skipna=True|False` explicitly.
Loading [MathJax]/extensions/Safe.js
In [10]: # comparison of all other attributes with respect to Writing Marks
plt.rcParams['figure.figsize'] = (18, 6)
plt.style.use('fivethirtyeight')
dabl.plot(data, target_col = 'writing score')
/opt/conda/lib/python3.6/site-packages/pandas/core/apply.py:186: FutureWarning:
A future version of pandas will default to `skipna=True`. To silence this warning, pass
`skipna=True|False` explicitly.
Loading [MathJax]/extensions/Safe.js
Inferential Statistics
Lets check the Probability of Students Scoring More than 50 Marks in Maths
probability_of_students_scoring_more_than_50_in_maths = (students_score_more_than_50/tot
print("Probability of Students Scoring more than 50 marks in Maths :", probability_of_st
probability_of_students_scoring_more_than_50_in_reading = (students_score_more_than_50/t
print("Probability of Students Scoring more than 50 marks in Reading :", probability_of_
Lets also check the Probability of Students Passing in all the three Subjects
Lets also check the Probability of Students Scoring more than 90 in all the three Subjects
probability_of_students_scoring_more_than_90_in_all_subjects = (number_of_students_scori
print("The Probability of Students Passing in all the Subjects is {0:.2f} %".
format(probability_of_students_scoring_more_than_90_in_all_subjects))
Checking for Skewness for the Maths, Reading and Writing Scores
In [16]: plt.subplot(1, 3, 1)
sns.distplot(data['math score'])
plt.subplot(1, 3, 2)
sns.distplot(data['reading score'])
plt.subplot(1, 3, 3)
sns.distplot(data['writing score'])
In [17]: # lets take seed so that everytime the random values come out to be constant
np.random.seed(6)
# lets take 100 sample values from the dataset of 1000 values
sample_math_marks = np.random.choice(a= data['math score'], size=100)
# lets take 100 sample values from the dataset of 1000 values
sample_reading_marks = np.random.choice(a= data['reading score'], size=100)
# lets take 100 sample values from the dataset of 1000 values
sample_writing_marks = np.random.choice(a= data['writing score'], size=100)
Grouping Operations
Number of Girl Students Scoring 90 in all the Subjects
Loading [MathJax]/extensions/Safe.js
Out[21]: gender race/ethnicity
parental level of
lunch
test preparation math reading writing
education course score score score
179 female group D some high school standard completed 97 100 100
458 female group E bachelor's degree standard none 100 100 100
962 female group E associate's degree standard none 100 100 100
In [22]: data.groupby(['gender']).agg(['min','median','max'])
gender
lunch gender
Loading [MathJax]/extensions/Safe.js
In [24]: data[['test preparation course',
'gender',
'math score',
'writing score',
'reading score']].groupby(['test preparation course','gender']).agg('median')
female 67 79 78
completed
male 73 70 71
female 62 70 71
none
male 67 60 63
In [25]: data[['race/ethnicity',
'math score',
'writing score',
'reading score']].groupby(['race/ethnicity']).agg('median')
race/ethnicity
Data Visualizations
In [26]: # visualising the number of male and female in the dataset
plt.rcParams['figure.figsize'] = (15, 5)
plt.style.use('_classic_test')
sns.countplot(data['gender'], palette = 'bone')
plt.title('Comparison of Males and Females', fontweight = 30)
plt.xlabel('Gender')
plt.ylabel('Count')
plt.show()
Loading [MathJax]/extensions/Safe.js
In [27]: # visualizing the different groups in the dataset
plt.rcParams['figure.figsize'] = (15, 9)
plt.style.use('ggplot')
Loading [MathJax]/extensions/Safe.js
In [28]: # visualizing the differnt parental education levels
plt.rcParams['figure.figsize'] = (15, 9)
plt.style.use('fivethirtyeight')
Loading [MathJax]/extensions/Safe.js
In [29]: # visualizing different types of lunch
plt.rcParams['figure.figsize'] = (15, 9)
plt.style.use('seaborn-talk')
Loading [MathJax]/extensions/Safe.js
In [30]: # visualizing maths score
plt.rcParams['figure.figsize'] = (15, 9)
plt.style.use('tableau-colorblind10')
Loading [MathJax]/extensions/Safe.js
In [31]: ## visualizing reading score
plt.rcParams['figure.figsize'] = (15, 9)
plt.style.use('tableau-colorblind10')
Loading [MathJax]/extensions/Safe.js
In [32]: # visualizing writing score
plt.rcParams['figure.figsize'] = (15, 9)
plt.style.use('tableau-colorblind10')
Loading [MathJax]/extensions/Safe.js
In [33]: # gender vs race/etnicity
plt.rcParams['figure.figsize'] = (15, 9)
x = pd.crosstab(data['gender'], data['race/ethnicity'])
x.div(x.sum(1).astype(float), axis = 0).plot(kind = 'bar', stacked = False)
plt.title('Gender vs Race', fontweight = 30, fontsize = 20)
plt.show()
Loading [MathJax]/extensions/Safe.js
In [34]: # comparison of race/ethnicity and parental level of education
plt.rcParams['figure.figsize'] = (15, 9)
x = pd.crosstab(data['race/ethnicity'], data['parental level of education'])
x.div(x.sum(1).astype(float), axis = 0).plot(kind = 'bar', stacked = 'True')
plt.title('Race vs Parental Education', fontweight = 30, fontsize = 20)
plt.show()
plt.rcParams['figure.figsize'] = (15, 9)
sns.countplot(x = 'parental level of education', data = data, hue = 'test preparation co
plt.title('Parental Education vs Test Preparation Course', fontweight = 30, fontsize = 2
plt.show()
Loading [MathJax]/extensions/Safe.js
In [36]: # comparison of race/ethnicity and test preparation course
In [37]: # feature engineering on the data to visualize and solve the dataset more accurately
# setting a passing mark for the students to pass on the three subjects individually
passmarks = 40
Loading [MathJax]/extensions/Safe.js
# creating a new column pass_math, this column will tell us whether the students are pas
data['pass_math'] = np.where(data['math score']< passmarks, 'Fail', 'Pass')
data['pass_math'].value_counts().plot.pie(colors = ['lightblue', 'lightgreen'])
In [38]: # creating a new column pass_reading, this column will tell us whether the students are p
Loading [MathJax]/extensions/Safe.js
In [39]: # creating a new column pass_writing, this column will tell us whether the students are p
Loading [MathJax]/extensions/Safe.js
In [40]: # computing the total score for each student
import warnings
warnings.filterwarnings('ignore')
Loading [MathJax]/extensions/Safe.js
In [41]: # computing percentage for each of the students
# importing math library to use ceil
from math import *
import warnings
warnings.filterwarnings('ignore')
data['percentage'] = data['total_score']/3
plt.rcParams['figure.figsize'] = (15, 9)
sns.distplot(data['percentage'], color = 'orange')
Loading [MathJax]/extensions/Safe.js
In [42]: # checking which student is fail overall
Loading [MathJax]/extensions/Safe.js
In [43]: # Assigning grades to the grades according to the following criteria :
# 0 - 40 marks : grade E
# 41 - 60 marks : grade D
# 60 - 70 marks : grade C
# 70 - 80 marks : grade B
# 80 - 90 marks : grade A
# 90 - 100 marks : grade O
data['grades'].value_counts()
B 260
Out[43]:
C 252
D 223
A 156
O 58
E 51
Name: grades, dtype: int64
In [44]: # plotting a pie chart for the distribution of various grades amongst the students
labels = ['Grade 0', 'Grade A', 'Grade B', 'Grade C', 'Grade D', 'Grade E']
sizes = [58, 156, 260, 252, 223, 51]
colors = ['yellow', 'gold', 'lightskyblue', 'lightcoral', 'pink', 'cyan']
explode = (0.0001, 0.0001, 0.0001, 0.0001, 0.0001, 0.0001)
plt.rcParams['figure.figsize'] = (15, 9)
Loading [MathJax]/extensions/Safe.js
In [46]: # for better visualization we will plot it again using seaborn
Loading [MathJax]/extensions/Safe.js
In [47]: # comparing the distribution of grades among males and females
Label Encoding
Loading [MathJax]/extensions/Safe.js
In [48]: from sklearn.preprocessing import LabelEncoder
# creating an encoder
le = LabelEncoder()
Data Preparation
In [49]: # splitting the dependent and independent variables
x = data.iloc[:,:14]
y = data.iloc[:,14]
print(x.shape)
print(y.shape)
(1000, 14)
(1000,)
print(x_train.shape)
print(y_train.shape)
print(x_test.shape)
print(y_test.shape)
Loading [MathJax]/extensions/Safe.js
(750, 14)
(750,)
(250, 14)
(250,)
# creating a scaler
mm = MinMaxScaler()
# visualising the principal components that will explain the highest share of variance
#explained_variance = pca.explained_variance_ratio_
#print(explained_variance)
Modelling
Logistic Regression
In [53]: from sklearn.linear_model import LogisticRegression
# creating a model
model = LogisticRegression()
Random Forest
# creating a model
model = RandomForestClassifier()
Loading [MathJax]/extensions/Safe.js