0% found this document useful (0 votes)

4 views

machine learning with python

Uploaded by

Moïse Sankara

Available Formats

Download as PDF, TXT or read online on Scribd

0% found this document useful (0 votes)

4 views

machine learning with python

Uploaded by

Moïse Sankara

Available Formats

Download as PDF, TXT or read online on Scribd

You are on page 1/ 33

In [2]: # for some basic operations

import numpy as np
import pandas as pd

# for visualizations
import matplotlib.pyplot as plt
import seaborn as sns
import plotly.express as px

import dabl

# for providing path

import os
print(os.listdir('../input/'))

['StudentsPerformance.csv']

Reading the Data set

In [3]: # reading the data

data = pd.read_csv('../input/StudentsPerformance.csv')

# getting the shape of the data

print(data.shape)

(1000, 8)

In [4]: # looking at the head of the data

data.head()

Out[4]: gender race/ethnicity

parental level of
lunch
test preparation math reading writing
education course score score score

0 female group B bachelor's degree standard none 72 72 74

1 female group C some college standard completed 69 90 88

2 female group B master's degree standard none 90 95 93

3 male group A associate's degree free/reduced none 47 57 44

4 male group C some college standard none 76 78 75

Descriptive Statistics
In [5]: # describing the dataset

data.describe()

Loading [MathJax]/extensions/Safe.js
Out[5]: math score reading score writing score

count 1000.00000 1000.000000 1000.000000

mean 66.08900 69.169000 68.054000

std 15.16308 14.600192 15.195657

min 0.00000 17.000000 10.000000

25% 57.00000 59.000000 57.750000

50% 66.00000 70.000000 69.000000

75% 77.00000 79.000000 79.000000

max 100.00000 100.000000 100.000000

In [6]: # lets check the no. of unique items present in the categorical column

data.select_dtypes('object').nunique()

gender 2
Out[6]:
race/ethnicity 5
parental level of education 6
lunch 2
test preparation course 2
dtype: int64

In [7]: # lets check the percentage of missing data in each columns present in the data

no_of_columns = data.shape[0]
percentage_of_missing_data = data.isnull().sum()/no_of_columns
print(percentage_of_missing_data)

gender 0.0
race/ethnicity 0.0
parental level of education 0.0
lunch 0.0
test preparation course 0.0
math score 0.0
reading score 0.0
writing score 0.0
dtype: float64

In [8]: # comparison of all other attributes with respect to Math Marks

plt.rcParams['figure.figsize'] = (18, 6)
plt.style.use('fivethirtyeight')
dabl.plot(data, target_col = 'math score')

Target looks like regression

/opt/conda/lib/python3.6/site-packages/pandas/core/apply.py:186: FutureWarning:

A future version of pandas will default to `skipna=True`. To silence this warning, pass
`skipna=True|False` explicitly.

Loading [MathJax]/extensions/Safe.js
In [9]: # comparison of all other attributes with respect to Reading Marks

plt.rcParams['figure.figsize'] = (18, 6)
plt.style.use('fivethirtyeight')
dabl.plot(data, target_col = 'reading score')

/opt/conda/lib/python3.6/site-packages/pandas/core/apply.py:186: FutureWarning:

A future version of pandas will default to `skipna=True`. To silence this warning, pass
`skipna=True|False` explicitly.

Target looks like regression

Loading [MathJax]/extensions/Safe.js
In [10]: # comparison of all other attributes with respect to Writing Marks

plt.rcParams['figure.figsize'] = (18, 6)
plt.style.use('fivethirtyeight')
dabl.plot(data, target_col = 'writing score')

/opt/conda/lib/python3.6/site-packages/pandas/core/apply.py:186: FutureWarning:

A future version of pandas will default to `skipna=True`. To silence this warning, pass
`skipna=True|False` explicitly.

Target looks like regression

Loading [MathJax]/extensions/Safe.js
Inferential Statistics
Lets check the Probability of Students Scoring More than 50 Marks in Maths

In [11]: total_students = data.shape[0]

students_score_more_than_50 = data[data['math score'] > 50].shape[0]

probability_of_students_scoring_more_than_50_in_maths = (students_score_more_than_50/tot
print("Probability of Students Scoring more than 50 marks in Maths :", probability_of_st

Probability of Students Scoring more than 50 marks in Maths : 85.0

In [12]: total_students = data.shape[0]

students_score_more_than_50 = data[data['reading score'] > 50].shape[0]

probability_of_students_scoring_more_than_50_in_reading = (students_score_more_than_50/t
print("Probability of Students Scoring more than 50 marks in Reading :", probability_of_

Probability of Students Scoring more than 50 marks in Reading : 90.3

In [13]: total_students = data.shape[0]

students_score_more_than_50 = data[data['writing score'] > 50].shape[0]
Loading [MathJax]/extensions/Safe.js
probability_of_students_scoring_more_than_50_in_writing = (students_score_more_than_50/t
print("Probability of Students Scoring more than 50 marks in Writing :", probability_of_

Probability of Students Scoring more than 50 marks in Writing : 87.6

Lets also check the Probability of Students Passing in all the three Subjects

In [14]: total_students = data.shape[0]

number_of_students_passing_in_all_subjects = data[(data['math score'] > 40) &
(data['writing score'] > 40) &
(data['reading score'] > 40)].shape[0]
probability_of_students_passing_in_all_the_subjects = (number_of_students_passing_in_all_
print("The Probability of Students Passing in all the Subjects is {0:.2f} %".format(prob

The Probability of Students Passing in all the Subjects is 93.90 %

Lets also check the Probability of Students Scoring more than 90 in all the three Subjects

In [15]: total_students = data.shape[0]

number_of_students_scoring_more_than_90 = data[(data['math score'] > 90) &
(data['writing score'] > 90) &
(data['reading score'] > 90)].shape[0]

probability_of_students_scoring_more_than_90_in_all_subjects = (number_of_students_scori
print("The Probability of Students Passing in all the Subjects is {0:.2f} %".
format(probability_of_students_scoring_more_than_90_in_all_subjects))

The Probability of Students Passing in all the Subjects is 2.30 %

Checking for Skewness for the Maths, Reading and Writing Scores

In [16]: plt.subplot(1, 3, 1)
sns.distplot(data['math score'])

plt.subplot(1, 3, 2)
sns.distplot(data['reading score'])

plt.subplot(1, 3, 3)
sns.distplot(data['writing score'])

plt.suptitle('Checking for Skewness', fontsize = 18)

plt.show()

It is very much clear, that there is no skewness in the Target Columns,

Loading [MathJax]/extensions/Safe.js
Lets check the Inference

In [17]: # lets take seed so that everytime the random values come out to be constant
np.random.seed(6)

# lets take 100 sample values from the dataset of 1000 values
sample_math_marks = np.random.choice(a= data['math score'], size=100)

# getting the sample mean

print ("Sample mean for Math Scores:", sample_math_marks.mean() )

# getting the population mean

print("Population mean for Math Scores:", data['math score'].mean())

# lets take 100 sample values from the dataset of 1000 values
sample_reading_marks = np.random.choice(a= data['reading score'], size=100)

# getting the sample mean

print ("\nSample mean for Reading Scores:", sample_reading_marks.mean() )

# getting the population mean

print("Population mean for Reading Scores:", data['reading score'].mean())

# lets take 100 sample values from the dataset of 1000 values
sample_writing_marks = np.random.choice(a= data['writing score'], size=100)

# getting the sample mean

print ("\nSample mean for Writing Scores:", sample_math_marks.mean() )

# getting the population mean

print("Population mean for Writing Scores:", data['writing score'].mean())

Sample mean for Math Scores: 63.12

Population mean for Math Scores: 66.089

Sample mean for Reading Scores: 68.5

Population mean for Reading Scores: 69.169

Sample mean for Writing Scores: 63.12

Population mean for Writing Scores: 68.054

Let check the Confidence Interval for Math Score

In [18]: # lets import the scipy package

import scipy.stats as stats
import math

# lets seed the random values

np.random.seed(10)

# lets take a sample size

sample_size = 1000
sample = np.random.choice(a= data['math score'],
size = sample_size)
sample_mean = sample.mean()

# Get the z-critical value*

z_critical = stats.norm.ppf(q = 0.95)

# Check the z-critical value

print("z-critical value: ",z_critical)

# Get the population standard deviation

Loading [MathJax]/extensions/Safe.js
pop_stdev = data['math score'].std()

# checking the margin of error

margin_of_error = z_critical * (pop_stdev/math.sqrt(sample_size))

# defining our confidence interval

confidence_interval = (sample_mean - margin_of_error,
sample_mean + margin_of_error)

# lets print the results

print("Confidence interval:",end=" ")
print(confidence_interval)
print("True mean: {}".format(data['math score'].mean()))

z-critical value: 1.6448536269514722

Confidence interval: (64.82729483328328, 66.40470516671672)
True mean: 66.089

Let check the Confidence Interval for Reading Score

In [19]: # lets import the scipy package

import scipy.stats as stats
import math

# lets seed the random values

np.random.seed(10)

# lets take a sample size

sample_size = 1000
sample = np.random.choice(a= data['reading score'],
size = sample_size)
sample_mean = sample.mean()

# Get the z-critical value*

z_critical = stats.norm.ppf(q = 0.95)

# Check the z-critical value

print("z-critical value: ",z_critical)

# Get the population standard deviation

pop_stdev = data['reading score'].std()

# checking the margin of error

margin_of_error = z_critical * (pop_stdev/math.sqrt(sample_size))

# defining our confidence interval

confidence_interval = (sample_mean - margin_of_error,
sample_mean + margin_of_error)

# lets print the results

print("Confidence interval:",end=" ")
print(confidence_interval)
print("True mean: {}".format(data['reading score'].mean()))

z-critical value: 1.6448536269514722

Confidence interval: (67.75757337011645, 69.27642662988355)
True mean: 69.169

Let check the Confidence Interval for Writing Score

In [20]: # lets seed the random values

np.random.seed(10)

# lets take a sample size

Loading [MathJax]/extensions/Safe.js
sample_size = 1000
sample = np.random.choice(a= data['writing score'],
size = sample_size)
sample_mean = sample.mean()

# Get the z-critical value*

z_critical = stats.norm.ppf(q = 0.95)

# Check the z-critical value

print("z-critical value: ",z_critical)

# Get the population standard deviation

pop_stdev = data['writing score'].std()

# checking the margin of error

margin_of_error = z_critical * (pop_stdev/math.sqrt(sample_size))

# defining our confidence interval

confidence_interval = (sample_mean - margin_of_error,
sample_mean + margin_of_error)

# lets print the results

print("Confidence interval:",end=" ")
print(confidence_interval)
print("True mean: {}".format(data['writing score'].mean()))

z-critical value: 1.6448536269514722

Confidence interval: (66.80960035030861, 68.39039964969137)
True mean: 68.054

Notice that the true mean is contained in our interval.

A confidence interval of 95% would mean that if we take many samples and create confidence intervals
for each of them, 95% of our samples' confidence intervals will contain the true population mean.
Now, let's create several confidence intervals and plot them to get a better sense of what it means to
"capture" the true mean

Grouping Operations
Number of Girl Students Scoring 90 in all the Subjects

In [21]: data[(data['gender'] == 'female') &

(data['math score'] > 90) &
(data['writing score'] > 90) &
(data['reading score'] > 90)]

Loading [MathJax]/extensions/Safe.js
Out[21]: gender race/ethnicity
parental level of
lunch
test preparation math reading writing
education course score score score

114 female group E bachelor's degree standard completed 99 100 100

165 female group C bachelor's degree standard completed 96 100 100

179 female group D some high school standard completed 97 100 100

451 female group E some college standard none 100 92 97

458 female group E bachelor's degree standard none 100 100 100

546 female group A some high school standard completed 92 100 97

566 female group E bachelor's degree free/reduced completed 92 100 100

594 female group C bachelor's degree standard completed 92 100 99

685 female group E master's degree standard completed 94 99 100

712 female group D some college standard none 98 100 99

717 female group C associate's degree standard completed 96 96 99

855 female group B bachelor's degree standard none 97 97 96

886 female group E associate's degree standard completed 93 100 95

903 female group D bachelor's degree free/reduced completed 93 100 100

957 female group D master's degree standard none 92 100 100

962 female group E associate's degree standard none 100 100 100

979 female group C associate's degree standard none 91 95 94

Lets compare the scores secured by Boys and Girls

In [22]: data.groupby(['gender']).agg(['min','median','max'])

Out[22]: math score reading score writing score

min median max min median max min median max

gender

female 0 65 100 17 73 100 10 74 100

male 27 69 100 23 66 100 15 64 100

Lets check the Effect of Lunch on Student's Performnce

In [23]: data[['lunch','gender','math score','writing score','reading score']].groupby(['lunch','

Out[23]: math score writing score reading score

lunch gender

female 57.0 68.0 67.0

free/reduced
male 62.0 59.0 61.0

female 67.0 76.0 75.0

standard
male 72.0 67.0 67.5

Lets check the Effect of Test Preparation Course on Scores

Loading [MathJax]/extensions/Safe.js
In [24]: data[['test preparation course',
'gender',
'math score',
'writing score',
'reading score']].groupby(['test preparation course','gender']).agg('median')

Out[24]: math score writing score reading score

test preparation course gender

female 67 79 78
completed
male 73 70 71

female 62 70 71
none
male 67 60 63

Lets check the Effect of Race and Ethnicity on Student's Performance

In [25]: data[['race/ethnicity',
'math score',
'writing score',
'reading score']].groupby(['race/ethnicity']).agg('median')

Out[25]: math score writing score reading score

race/ethnicity

group A 61.0 62.0 64.0

group B 63.0 67.0 67.0

group C 65.0 68.0 71.0

group D 69.0 72.0 71.0

group E 74.5 72.0 74.0

Data Visualizations
In [26]: # visualising the number of male and female in the dataset

plt.rcParams['figure.figsize'] = (15, 5)
plt.style.use('_classic_test')
sns.countplot(data['gender'], palette = 'bone')
plt.title('Comparison of Males and Females', fontweight = 30)
plt.xlabel('Gender')
plt.ylabel('Count')
plt.show()

Loading [MathJax]/extensions/Safe.js
In [27]: # visualizing the different groups in the dataset

plt.rcParams['figure.figsize'] = (15, 9)
plt.style.use('ggplot')

sns.countplot(data['race/ethnicity'], palette = 'pink')

plt.title('Comparison of various groups', fontweight = 30, fontsize = 20)
plt.xlabel('Groups')
plt.ylabel('count')
plt.show()

Loading [MathJax]/extensions/Safe.js
In [28]: # visualizing the differnt parental education levels

plt.rcParams['figure.figsize'] = (15, 9)
plt.style.use('fivethirtyeight')

sns.countplot(data['parental level of education'], palette = 'Blues')

plt.title('Comparison of Parental Education', fontweight = 30, fontsize = 20)
plt.xlabel('Degree')
plt.ylabel('count')
plt.show()

Loading [MathJax]/extensions/Safe.js
In [29]: # visualizing different types of lunch

plt.rcParams['figure.figsize'] = (15, 9)
plt.style.use('seaborn-talk')

sns.countplot(data['lunch'], palette = 'PuBu')

plt.title('Comparison of different types of lunch', fontweight = 30, fontsize = 20)
plt.xlabel('types of lunch')
plt.ylabel('count')
plt.show()

Loading [MathJax]/extensions/Safe.js
In [30]: # visualizing maths score

plt.rcParams['figure.figsize'] = (15, 9)
plt.style.use('tableau-colorblind10')

sns.countplot(data['math score'], palette = 'BuPu')

plt.title('Comparison of math scores', fontweight = 30, fontsize = 20)
plt.xlabel('score')
plt.ylabel('count')
plt.xticks(rotation = 90)
plt.show()

Loading [MathJax]/extensions/Safe.js
In [31]: ## visualizing reading score

plt.rcParams['figure.figsize'] = (15, 9)
plt.style.use('tableau-colorblind10')

sns.countplot(data['reading score'], palette = 'RdPu')

plt.title('Comparison of Reading scores', fontweight = 30, fontsize = 20)
plt.xlabel('score')
plt.ylabel('count')
plt.xticks(rotation = 90)
plt.show()

Loading [MathJax]/extensions/Safe.js
In [32]: # visualizing writing score

plt.rcParams['figure.figsize'] = (15, 9)
plt.style.use('tableau-colorblind10')

sns.countplot(data['writing score'], palette = 'prism')

plt.title('Comparison of Writing scores', fontweight = 30, fontsize = 20)
plt.xlabel('score')
plt.ylabel('count')
plt.xticks(rotation = 90)
plt.show()

Loading [MathJax]/extensions/Safe.js
In [33]: # gender vs race/etnicity

plt.rcParams['figure.figsize'] = (15, 9)
x = pd.crosstab(data['gender'], data['race/ethnicity'])
x.div(x.sum(1).astype(float), axis = 0).plot(kind = 'bar', stacked = False)
plt.title('Gender vs Race', fontweight = 30, fontsize = 20)
plt.show()

Loading [MathJax]/extensions/Safe.js
In [34]: # comparison of race/ethnicity and parental level of education

plt.rcParams['figure.figsize'] = (15, 9)
x = pd.crosstab(data['race/ethnicity'], data['parental level of education'])
x.div(x.sum(1).astype(float), axis = 0).plot(kind = 'bar', stacked = 'True')
plt.title('Race vs Parental Education', fontweight = 30, fontsize = 20)
plt.show()

In [35]: # comparison of parental degree and test course

plt.rcParams['figure.figsize'] = (15, 9)
sns.countplot(x = 'parental level of education', data = data, hue = 'test preparation co
plt.title('Parental Education vs Test Preparation Course', fontweight = 30, fontsize = 2
plt.show()

Loading [MathJax]/extensions/Safe.js
In [36]: # comparison of race/ethnicity and test preparation course

sns.countplot(x = 'race/ethnicity', data = data, hue = 'test preparation course', palet

plt.title('Race vs Test Preparion', fontweight = 30, fontsize = 20)
plt.show()

In [37]: # feature engineering on the data to visualize and solve the dataset more accurately

# setting a passing mark for the students to pass on the three subjects individually
passmarks = 40
Loading [MathJax]/extensions/Safe.js
# creating a new column pass_math, this column will tell us whether the students are pas
data['pass_math'] = np.where(data['math score']< passmarks, 'Fail', 'Pass')
data['pass_math'].value_counts().plot.pie(colors = ['lightblue', 'lightgreen'])

plt.title('Pass/Fail in Maths', fontweight = 30, fontsize = 20)

plt.xlabel('status')
plt.ylabel('count')
plt.show()

In [38]: # creating a new column pass_reading, this column will tell us whether the students are p

data['pass_reading'] = np.where(data['reading score']< passmarks, 'Fail', 'Pass')

data['pass_reading'].value_counts(dropna = False).plot.pie(colors = ['pink', 'yellow'])

plt.title('Pass/Fail in Reading', fontweight = 30, fontsize = 20)

plt.xlabel('status')
plt.ylabel('count')
plt.show()

Loading [MathJax]/extensions/Safe.js
In [39]: # creating a new column pass_writing, this column will tell us whether the students are p

data['pass_writing'] = np.where(data['writing score']< passmarks, 'Fail', 'Pass')

data['pass_writing'].value_counts(dropna = False).plot.pie(colors = ['orange', 'gray'])

plt.title('Pass/Fail in Writing', fontweight = 30, fontsize = 20)

plt.xlabel('status')
plt.ylabel('count')
plt.show()

Loading [MathJax]/extensions/Safe.js
In [40]: # computing the total score for each student

import warnings
warnings.filterwarnings('ignore')

data['total_score'] = data['math score'] + data['reading score'] + data['writing score']

sns.distplot(data['total_score'], color = 'magenta')

plt.title('comparison of total score of all the students', fontweight = 30, fontsize = 2

plt.xlabel('total score scored by the students')
plt.ylabel('count')
plt.show()

Loading [MathJax]/extensions/Safe.js
In [41]: # computing percentage for each of the students
# importing math library to use ceil
from math import *
import warnings
warnings.filterwarnings('ignore')

data['percentage'] = data['total_score']/3

for i in range(0, 1000):

data['percentage'][i] = ceil(data['percentage'][i])

plt.rcParams['figure.figsize'] = (15, 9)
sns.distplot(data['percentage'], color = 'orange')

plt.title('Comparison of percentage scored by all the students', fontweight = 30, fontsi

plt.xlabel('Percentage scored')
plt.ylabel('Count')
plt.show()

Loading [MathJax]/extensions/Safe.js
In [42]: # checking which student is fail overall

data['status'] = data.apply(lambda x : 'Fail' if x['pass_math'] == 'Fail' or

x['pass_reading'] == 'Fail' or x['pass_writing'] == 'Fail'
else 'pass', axis = 1)

data['status'].value_counts(dropna = False).plot.pie(colors = ['grey', 'crimson'])

plt.title('overall results', fontweight = 30, fontsize = 20)
plt.xlabel('status')
plt.ylabel('count')
plt.show()

Loading [MathJax]/extensions/Safe.js
In [43]: # Assigning grades to the grades according to the following criteria :
# 0 - 40 marks : grade E
# 41 - 60 marks : grade D
# 60 - 70 marks : grade C
# 70 - 80 marks : grade B
# 80 - 90 marks : grade A
# 90 - 100 marks : grade O

def getgrade(percentage, status):

if status == 'Fail':
return 'E'
if(percentage >= 90):
return 'O'
if(percentage >= 80):
return 'A'
if(percentage >= 70):
return 'B'
if(percentage >= 60):
return 'C'
if(percentage >= 40):
return 'D'
else :
return 'E'
Loading [MathJax]/extensions/Safe.js
data['grades'] = data.apply(lambda x: getgrade(x['percentage'], x['status']), axis = 1 )

data['grades'].value_counts()

B 260
Out[43]:
C 252
D 223
A 156
O 58
E 51
Name: grades, dtype: int64

In [44]: # plotting a pie chart for the distribution of various grades amongst the students

labels = ['Grade 0', 'Grade A', 'Grade B', 'Grade C', 'Grade D', 'Grade E']
sizes = [58, 156, 260, 252, 223, 51]
colors = ['yellow', 'gold', 'lightskyblue', 'lightcoral', 'pink', 'cyan']
explode = (0.0001, 0.0001, 0.0001, 0.0001, 0.0001, 0.0001)

patches, texts = plt.pie(sizes, colors=colors, shadow=True, startangle=90)

plt.legend(patches, labels)
plt.title('Distribution of Grades among Students', fontweight = 30, fontsize = 20)
plt.axis('equal')
plt.tight_layout()
plt.show()

In [45]: # comparison parent's degree and their corresponding grades

plt.rcParams['figure.figsize'] = (15, 9)

x = pd.crosstab(data['parental level of education'], data['grades'])

color = plt.cm.copper(np.linspace(0, 1, 8))
x.div(x.sum(1).astype(float), axis = 0).plot(kind = 'bar', stacked = True, color = color
plt.title("Parental Education vs Student's Grades", fontweight = 30, fontsize = 20)
plt.show()

Loading [MathJax]/extensions/Safe.js
In [46]: # for better visualization we will plot it again using seaborn

sns.countplot(x = data['parental level of education'], data = data, hue = data['grades']

plt.title('Parental Education vs Grades of Students', fontsize = 20, fontweight = 30)
plt.show()

Loading [MathJax]/extensions/Safe.js
In [47]: # comparing the distribution of grades among males and females

sns.countplot(x = data['grades'], data = data, hue = data['gender'], palette = 'cubeheli

#sns.palplot(sns.dark_palette('purple'))
plt.title('Grades vs Gender', fontweight = 30, fontsize = 20)
plt.show()

Label Encoding
Loading [MathJax]/extensions/Safe.js
In [48]: from sklearn.preprocessing import LabelEncoder

# creating an encoder
le = LabelEncoder()

# label encoding for test preparation course

data['test preparation course'] = le.fit_transform(data['test preparation course'])

# label encoding for lunch

data['lunch'] = le.fit_transform(data['lunch'])

# label encoding for race/ethnicity

# we have to map values to each of the categories
data['race/ethnicity'] = data['race/ethnicity'].replace('group A', 1)
data['race/ethnicity'] = data['race/ethnicity'].replace('group B', 2)
data['race/ethnicity'] = data['race/ethnicity'].replace('group C', 3)
data['race/ethnicity'] = data['race/ethnicity'].replace('group D', 4)
data['race/ethnicity'] = data['race/ethnicity'].replace('group E', 5)

# label encoding for parental level of education

data['parental level of education'] = le.fit_transform(data['parental level of education

#label encoding for gender

data['gender'] = le.fit_transform(data['gender'])

# label encoding for pass_math

data['pass_math'] = le.fit_transform(data['pass_math'])

# label encoding for pass_reading

data['pass_reading'] = le.fit_transform(data['pass_reading'])

# label encoding for pass_writing

data['pass_writing'] = le.fit_transform(data['pass_writing'])

# label encoding for status

data['status'] = le.fit_transform(data['status'])

Data Preparation
In [49]: # splitting the dependent and independent variables

x = data.iloc[:,:14]
y = data.iloc[:,14]

print(x.shape)
print(y.shape)

(1000, 14)
(1000,)

In [50]: # splitting the dataset into training and test sets

from sklearn.model_selection import train_test_split

x_train, x_test, y_train, y_test = train_test_split(x, y, test_size = 0.25, random_state

print(x_train.shape)
print(y_train.shape)
print(x_test.shape)
print(y_test.shape)

Loading [MathJax]/extensions/Safe.js
(750, 14)
(750,)
(250, 14)
(250,)

In [51]: # importing the MinMaxScaler

from sklearn.preprocessing import MinMaxScaler

# creating a scaler
mm = MinMaxScaler()

# feeding the independent variable into the scaler

x_train = mm.fit_transform(x_train)
x_test = mm.transform(x_test)

In [52]: # applying principal components analysis

from sklearn.decomposition import PCA

# creating a principal component analysis model

#pca = PCA(n_components = None)

# feeding the independent variables to the PCA model

#x_train = pca.fit_transform(x_train)
#x_test = pca.transform(x_test)

# visualising the principal components that will explain the highest share of variance
#explained_variance = pca.explained_variance_ratio_
#print(explained_variance)

# creating a principal component analysis model

#pca = PCA(n_components = 2)

# feeding the independent variables to the PCA model

#x_train = pca.fit_transform(x_train)
#x_test = pca.transform(x_test)

Modelling

Logistic Regression
In [53]: from sklearn.linear_model import LogisticRegression

# creating a model
model = LogisticRegression()

# feeding the training data to the model

model.fit(x_train, y_train)

# predicting the test set results

y_pred = model.predict(x_test)

# calculating the classification accuracies

print("Training Accuracy :", model.score(x_train, y_train))
print("Testing Accuracy :", model.score(x_test, y_test))

Training Accuracy : 0.7786666666666666

Testing Accuracy : 0.768

In [54]: # printing the confusion matrix

Loading [MathJax]/extensions/Safe.js
from sklearn.metrics import confusion_matrix

# creating a confusion matrix

cm = confusion_matrix(y_test, y_pred)

# printing the confusion matrix

plt.rcParams['figure.figsize'] = (8, 8)
sns.heatmap(cm, annot = True, cmap = 'Greens')
plt.title('Confusion Matrix for Logistic Regression', fontweight = 30, fontsize = 20)
plt.show()

Random Forest

In [55]: from sklearn.ensemble import RandomForestClassifier

# creating a model
model = RandomForestClassifier()

# feeding the training data to the model

model.fit(x_train, y_train)

# predicting the x-test results

y_pred = model.predict(x_test)

# calculating the accuracies

print("Training Accuracy :", model.score(x_train, y_train))
Loading [MathJax]/extensions/Safe.js
print("Testing Accuracy :", model.score(x_test, y_test))
Training Accuracy : 1.0
Testing Accuracy : 1.0

In [56]: # printing the confusion matrix

from sklearn.metrics import confusion_matrix

# creating a confusion matrix

cm = confusion_matrix(y_test, y_pred)

# printing the confusion matrix

plt.rcParams['figure.figsize'] = (8, 8)
sns.heatmap(cm, annot = True, cmap = 'Reds')
plt.title('Confusion Matrix for Random Forest', fontweight = 30, fontsize = 20)
plt.show()

Loading [MathJax]/extensions/Safe.js

Random Motors Project: Rocinate 36
75% (12)
Random Motors Project: Rocinate 36
2 pages
ANSWER SCHEME TEST STA404 JULY 2021 Latest
No ratings yet
ANSWER SCHEME TEST STA404 JULY 2021 Latest
3 pages
ML Practical 205160694034
No ratings yet
ML Practical 205160694034
33 pages
Lab 2 - Basic Statistical Analysis
No ratings yet
Lab 2 - Basic Statistical Analysis
7 pages
Information Practice ProjectRushda
No ratings yet
Information Practice ProjectRushda
24 pages
Ai Programs
No ratings yet
Ai Programs
19 pages
Information Practice Projecta aryan
No ratings yet
Information Practice Projecta aryan
24 pages
Assignment 02
No ratings yet
Assignment 02
4 pages
Practical 6A & 6B
No ratings yet
Practical 6A & 6B
4 pages
Correction TP 11 Statistiques
No ratings yet
Correction TP 11 Statistiques
9 pages
ML Python Exercises UOM BDS Classification
No ratings yet
ML Python Exercises UOM BDS Classification
18 pages
Email Spam Classifier
No ratings yet
Email Spam Classifier
22 pages
Pattern Recognition
No ratings yet
Pattern Recognition
26 pages
students result maker project
No ratings yet
students result maker project
4 pages
student analysis
No ratings yet
student analysis
16 pages
Machine Learning Lab New
No ratings yet
Machine Learning Lab New
14 pages
Seaborn Besant
No ratings yet
Seaborn Besant
27 pages
Lab 1
No ratings yet
Lab 1
3 pages
Data Preprocessing Python Tome III
No ratings yet
Data Preprocessing Python Tome III
12 pages
LIST OF PRACTICAL IP065 XII SESSION 2025 CKC ACADEMY
No ratings yet
LIST OF PRACTICAL IP065 XII SESSION 2025 CKC ACADEMY
19 pages
List of Programs For Informatics - XII - IP
No ratings yet
List of Programs For Informatics - XII - IP
26 pages
List of Practical Ip065 Xii Session 2025 Ckc Academy
No ratings yet
List of Practical Ip065 Xii Session 2025 Ckc Academy
19 pages
Machine Learning
No ratings yet
Machine Learning
67 pages
IT Project Pandas Matplotlib SQL
No ratings yet
IT Project Pandas Matplotlib SQL
4 pages
AIPT practical exam codes
No ratings yet
AIPT practical exam codes
12 pages
Import Import As Import As: #Default To CSV
No ratings yet
Import Import As Import As: #Default To CSV
6 pages
41 Perusse Alexander Aperusse PDF
No ratings yet
41 Perusse Alexander Aperusse PDF
7 pages
SOURCE CODE (1)
No ratings yet
SOURCE CODE (1)
20 pages
Module 2.9
No ratings yet
Module 2.9
11 pages
Students Performance Analysis
No ratings yet
Students Performance Analysis
12 pages
IP_Practical
No ratings yet
IP_Practical
15 pages
Class 14 - Basic Coding in Python - 5
No ratings yet
Class 14 - Basic Coding in Python - 5
24 pages
AIL303 M
No ratings yet
AIL303 M
22 pages
Man Avg
No ratings yet
Man Avg
2 pages
Employee Analysis
No ratings yet
Employee Analysis
19 pages
Documentation
No ratings yet
Documentation
7 pages
ml record
No ratings yet
ml record
21 pages
Week2 lab
No ratings yet
Week2 lab
8 pages
12 IP File Programs 6 To 17
No ratings yet
12 IP File Programs 6 To 17
9 pages
Report File
No ratings yet
Report File
22 pages
I Am Sharing 'I Am Sharing 'Presentation (1)' With You' With You
No ratings yet
I Am Sharing 'I Am Sharing 'Presentation (1)' With You' With You
30 pages
Naive Bayes Classification
No ratings yet
Naive Bayes Classification
6 pages
Ip-12-2023-24 Practical File
No ratings yet
Ip-12-2023-24 Practical File
19 pages
Data Manipulation With Python Pandas 1700003764
No ratings yet
Data Manipulation With Python Pandas 1700003764
10 pages
October 11, 2020: 0.1 Applied Machine Learning, Module 1: A Simple Classification Task
No ratings yet
October 11, 2020: 0.1 Applied Machine Learning, Module 1: A Simple Classification Task
4 pages
OOP MP
No ratings yet
OOP MP
9 pages
IP12 gargi
No ratings yet
IP12 gargi
32 pages
ST Joseph'S Convent Senior Secondary School: Name:-Shatakshi Gaur Class:-Xii Sec:-A Board Roll No.
No ratings yet
ST Joseph'S Convent Senior Secondary School: Name:-Shatakshi Gaur Class:-Xii Sec:-A Board Roll No.
65 pages
3 Classification
No ratings yet
3 Classification
16 pages
天池SMP 2023 ChatGLM金融大模型挑战赛
No ratings yet
天池SMP 2023 ChatGLM金融大模型挑战赛
6 pages
EXP NO 3 SM
No ratings yet
EXP NO 3 SM
8 pages
R Record
No ratings yet
R Record
16 pages
Aiml Ex 4-7
No ratings yet
Aiml Ex 4-7
8 pages
Php+da
No ratings yet
Php+da
26 pages
CS Final Project
No ratings yet
CS Final Project
26 pages
House Pricing
No ratings yet
House Pricing
15 pages
scholarship management project comp sc
No ratings yet
scholarship management project comp sc
7 pages
exp 4
No ratings yet
exp 4
6 pages
csml1819 PDF
No ratings yet
csml1819 PDF
36 pages
Final-23W-key
No ratings yet
Final-23W-key
5 pages
6 - 10 Python
No ratings yet
6 - 10 Python
6 pages
Python For Beginners
From Everand
Python For Beginners
Célio Azevedo
No ratings yet
Tutorial 5 Discrete Distributions
No ratings yet
Tutorial 5 Discrete Distributions
6 pages
Daftar Isi
No ratings yet
Daftar Isi
10 pages
Introduction To Logistic Regression
No ratings yet
Introduction To Logistic Regression
20 pages
Student Solutions Manual to Introductory Econometrics 2nd edition Edition Jeffrey M. Wooldridgepdf download
100% (1)
Student Solutions Manual to Introductory Econometrics 2nd edition Edition Jeffrey M. Wooldridgepdf download
45 pages
Math 3031 Syllabus and HW
No ratings yet
Math 3031 Syllabus and HW
4 pages
2018-Panel Data by Baun PDF
100% (1)
2018-Panel Data by Baun PDF
88 pages
Stat 11 Q4 Week 5-SSLM
No ratings yet
Stat 11 Q4 Week 5-SSLM
4 pages
Stat 6 Regression Analysis
No ratings yet
Stat 6 Regression Analysis
53 pages
Independent Components Analysis
No ratings yet
Independent Components Analysis
26 pages
Homework Transforming and Combining Random Variables
100% (1)
Homework Transforming and Combining Random Variables
7 pages
Website Has Effect Towards Students' Achievements in Fractions? This Question Requires Analysis
No ratings yet
Website Has Effect Towards Students' Achievements in Fractions? This Question Requires Analysis
4 pages
The Moving Average Models MA (1) and MA (2) : Al Nosedal University of Toronto
No ratings yet
The Moving Average Models MA (1) and MA (2) : Al Nosedal University of Toronto
47 pages
Statistics Skittles Project Part 4
No ratings yet
Statistics Skittles Project Part 4
2 pages
Passing-Bablok Regression For Method Comparison
No ratings yet
Passing-Bablok Regression For Method Comparison
9 pages
Chapter 5
No ratings yet
Chapter 5
43 pages
Statistics For Managers Using Microsoft Excel: 5 Edition
No ratings yet
Statistics For Managers Using Microsoft Excel: 5 Edition
54 pages
X Variable 1 Line Fit Plot: Regression Statistics
No ratings yet
X Variable 1 Line Fit Plot: Regression Statistics
3 pages
Survival Using R
100% (1)
Survival Using R
277 pages
Random Variables
No ratings yet
Random Variables
22 pages
An Actuarial Model For Assessing General Practictioner Prescribing Costs
No ratings yet
An Actuarial Model For Assessing General Practictioner Prescribing Costs
20 pages
MA2216 Summary
100% (1)
MA2216 Summary
1 page
CH 6
No ratings yet
CH 6
64 pages
Understanding Econometric Analysis Using
No ratings yet
Understanding Econometric Analysis Using
13 pages
The Problem of Overfitting - Coursera
No ratings yet
The Problem of Overfitting - Coursera
1 page
Week 1 Graded Quiz On Solution PDF
No ratings yet
Week 1 Graded Quiz On Solution PDF
2 pages
Homework 2 New
No ratings yet
Homework 2 New
5 pages
The Normal Distributions
No ratings yet
The Normal Distributions
9 pages
(Ebook) Propensity Score Analysis: Statistical Methods and Applications by Shenyang Y. Guo; Mark W. Fraser ISBN 9781452235004, 1452235007 - The ebook is available for instant download, read anywhere
100% (1)
(Ebook) Propensity Score Analysis: Statistical Methods and Applications by Shenyang Y. Guo; Mark W. Fraser ISBN 9781452235004, 1452235007 - The ebook is available for instant download, read anywhere
46 pages