
Missing Values- Feature Engineering- Day 1

Lifecycle of a Data Science Project

1. Data Collection Strategy: internal company data, 3rd-party APIs, surveys


2. Feature Engineering: Handling Missing Values

Why are there missing values? Consider a survey, for example a depression survey:

1. Respondents hesitate to put down the information


2. Survey responses are not always valid
3. Men may not disclose their salary
4. Women may not disclose their age
5. Respondents may have died, leaving NaN values

In Data Science projects, the dataset should be collected from multiple sources.

What are the different types of Missing Data?

1. Missing Completely at Random (MCAR):


A variable is missing completely at random (MCAR) if the probability of being missing is the
same for all observations. When data is MCAR, there is no relationship between the
missingness and any other values, observed or missing, within the dataset. In other words,
the missing data points are a random subset of the data; nothing systematic makes some data
more likely to be missing than others.
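As an illustration (not part of the original notebook), the following sketch simulates MCAR on a made-up toy dataset: each row has the same chance of a missing salary, regardless of any value in the data. The column names and probabilities here are assumptions chosen only for the example.

import numpy as np
import pandas as pd

rng = np.random.default_rng(42)

# Hypothetical toy data, not the Titanic dataset
toy = pd.DataFrame({
    "age": rng.integers(18, 80, size=1000),
    "salary": rng.normal(50_000, 10_000, size=1000),
})

# MCAR: every row has the same 10% chance of a missing salary,
# independent of age, salary, or anything else in the data
mcar_mask = rng.random(len(toy)) < 0.10
toy.loc[mcar_mask, "salary"] = np.nan

print(toy["salary"].isnull().mean())   # roughly 0.10, overall and within any subgroup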

In [1]: import pandas as pd

In [2]: df=pd.read_csv('titanic.csv')


In [3]: df.head()

Out[3]:
   PassengerId  Survived  Pclass                                               Name     Sex   Age  SibSp  Parch            Ticket     Fare  ...
0            1         0       3                            Braund, Mr. Owen Harris    male  22.0      1      0         A/5 21171   7.2500  ...
1            2         1       1  Cumings, Mrs. John Bradley (Florence Briggs Th...  female  38.0      1      0          PC 17599  71.2833  ...
2            3         1       3                             Heikkinen, Miss. Laina  female  26.0      0      0  STON/O2. 3101282   7.9250  ...
3            4         1       1       Futrelle, Mrs. Jacques Heath (Lily May Peel)  female  35.0      1      0            113803  53.1000  ...
4            5         0       3                           Allen, Mr. William Henry    male  35.0      0      0            373450   8.0500  ...

In [4]: df.isnull().sum()

Out[4]: PassengerId 0
Survived 0
Pclass 0
Name 0
Sex 0
Age 177
SibSp 0
Parch 0
Ticket 0
Fare 0
Cabin 687
Embarked 2
dtype: int64


In [5]: df[df['Embarked'].isnull()]

Out[5]:
     PassengerId  Survived  Pclass                                        Name     Sex   Age  SibSp  Parch  Ticket  Fare Cabin Embarked
61            62         1       1                         Icard, Miss. Amelie  female  38.0      0      0  113572  80.0   B28      NaN
829          830         1       1  Stone, Mrs. George Nelson (Martha Evelyn)  female  62.0      0      0  113572  80.0   B28      NaN

2. Missing Data Not At Random (MNAR): Systematic missing values


There is a systematic relationship between whether a value is missing and other values,
observed or missing, within the dataset.

In [6]: import numpy as np


df['cabin_null']=np.where(df['Cabin'].isnull(),1,0)

##find the percentage of null values
df['cabin_null'].mean()

Out[6]: 0.7710437710437711


In [7]: df.columns

Out[7]: Index(['PassengerId', 'Survived', 'Pclass', 'Name', 'Sex', 'Age', 'SibSp',


'Parch', 'Ticket', 'Fare', 'Cabin', 'Embarked', 'cabin_null'],
dtype='object')

In [8]: df.groupby(['Survived'])['cabin_null'].mean()

Out[8]: Survived
0 0.876138
1 0.602339
Name: cabin_null, dtype: float64
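The output above suggests the missingness of Cabin is related to survival (about 88% of non-survivors have no Cabin recorded versus about 60% of survivors), which is why it is treated as systematic. As an optional check (not part of the original notebook), a cross-tabulation makes the same point:

# Proportion of missing/non-missing Cabin within each Survived group
pd.crosstab(df['Survived'], df['cabin_null'], normalize='index')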

3. Missing At Random (MAR)


The probability of a value being missing depends on other observed information in the dataset, not on the missing value itself. For example:

1. Men tend to hide their salary
2. Women tend to hide their age

In both cases the missingness of salary (or age) depends on another observed variable, gender.
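As a hedged illustration (not part of the original notebook), the sketch below simulates MAR: the chance that salary is missing depends on an observed gender column, not on the salary value itself. The column names and probabilities are assumptions made up for the example.

import numpy as np
import pandas as pd

rng = np.random.default_rng(0)

people = pd.DataFrame({
    "gender": rng.choice(["male", "female"], size=1000),
    "salary": rng.normal(50_000, 10_000, size=1000),
})

# MAR: men are more likely to leave salary blank than women,
# so missingness depends only on the observed 'gender' column
p_missing = np.where(people["gender"] == "male", 0.30, 0.05)
people.loc[rng.random(len(people)) < p_missing, "salary"] = np.nan

print(people.groupby("gender")["salary"].apply(lambda s: s.isnull().mean()))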

All the techniques for handling missing values:

1. Mean/Median/Mode replacement
2. Random Sample Imputation
3. Capturing NaN values with a new feature
4. End of Distribution imputation
5. Arbitrary value imputation
6. Frequent categories imputation

Mean/Median/Mode imputation

When should we apply it? Mean/median imputation assumes that the data are missing
completely at random (MCAR). We replace the NaN values with the mean or median of the
variable; for categorical variables, we use the mode (the most frequent category).
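The cells below demonstrate median imputation for the numerical Age column. For a categorical column such as Embarked (which had 2 missing values above), the equivalent mode imputation could look like this sketch; the `full` variable and `Embarked_mode` column name are assumptions introduced only for the example.

# Mode (most frequent category) imputation for a categorical column
full = pd.read_csv('titanic.csv')
most_frequent = full['Embarked'].mode()[0]                     # 'S' in the Titanic data
full['Embarked_mode'] = full['Embarked'].fillna(most_frequent)
print(full['Embarked_mode'].isnull().sum())                    # 0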

In [11]: df=pd.read_csv('titanic.csv',usecols=['Age','Fare','Survived'])
df.head()

Out[11]: Survived Age Fare

0 0 22.0 7.2500

1 1 38.0 71.2833

2 1 26.0 7.9250

3 1 35.0 53.1000

4 0 35.0 8.0500

In [12]: ## Let's check the percentage of missing values


df.isnull().mean()

Out[12]: Survived 0.000000


Age 0.198653
Fare 0.000000
dtype: float64


In [13]: def impute_nan(df, variable, median):
             # Add a new column where NaN values are replaced by the supplied median
             df[variable + "_median"] = df[variable].fillna(median)

In [14]: median=df.Age.median()
median

Out[14]: 28.0

In [16]: impute_nan(df,'Age',median)
df

Out[16]:
     Survived   Age     Fare  Age_median
0           0  22.0   7.2500        22.0
1           1  38.0  71.2833        38.0
2           1  26.0   7.9250        26.0
3           1  35.0  53.1000        35.0
4           0  35.0   8.0500        35.0
..        ...   ...      ...         ...
886         0  27.0  13.0000        27.0
887         1  19.0  30.0000        19.0
888         0   NaN  23.4500        28.0
889         1  26.0  30.0000        26.0
890         0  32.0   7.7500        32.0

891 rows × 4 columns

In [22]: print(df['Age'].std())
print(df['Age_median'].std())

14.526497332334042
13.019696550973201

In [23]: import matplotlib.pyplot as plt


%matplotlib inline


In [25]: fig = plt.figure()
         ax = fig.add_subplot(111)
         # Compare the Age distribution before and after median imputation
         df['Age'].plot(kind='kde', ax=ax)
         df.Age_median.plot(kind='kde', ax=ax, color='red')
         lines, labels = ax.get_legend_handles_labels()
         ax.legend(lines, labels, loc='best')

Out[25]: <matplotlib.legend.Legend at 0x273541c2828>
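For reference (not part of the original notebook), the same median imputation can be done with scikit-learn's SimpleImputer. This is a sketch assuming scikit-learn is installed; the `Age_sklearn` column name is introduced only for the example.

from sklearn.impute import SimpleImputer

# strategy='median' fills NaN with the column median (28.0 here);
# 'mean' and 'most_frequent' are the other common strategies
imputer = SimpleImputer(strategy='median')
df['Age_sklearn'] = imputer.fit_transform(df[['Age']]).ravel()

print(df['Age_sklearn'].isnull().sum())   # 0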

Advantages And Disadvantages of Mean/Median Imputation

Advantages

1. Easy to implement (and the median is robust to outliers)


2. A fast way to obtain a complete dataset

Disadvantages
1. Changes or distorts the original variance (illustrated in the sketch below)
2. Impacts the correlation with other variables
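To see these effects concretely (an optional sketch, not part of the original notebook), compare the variance and the Age-Fare correlation before and after imputation using the columns created above:

# Variance shrinks because 177 identical values (the median, 28.0) are added
print(df['Age'].var(), df['Age_median'].var())

# The correlation with Fare also shifts after imputation
print(df['Age'].corr(df['Fare']))
print(df['Age_median'].corr(df['Fare']))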
