
Missing Values- Feature Engineering- Day 1

Lifecycle of a Data Science Project

1. Data Collection Strategy: internal company data, 3rd-party APIs, surveys


2. Feature Engineering: Handling Missing Values

Why are there missing values? Consider a survey, for example a depression survey:

1. Respondents hesitate to put down the information


2. Survey responses are not always valid
3. Men may not disclose their salary
4. Women may not disclose their age
5. Respondents may have died, leaving NaN values

In Data Science projects, the dataset should be collected from multiple sources.

What are the different types of Missing Data?

1. Missing Completely at Random (MCAR):


A variable is missing completely at random (MCAR) if the probability of being missing is the
same for all observations. When data is MCAR, there is no relationship between the
missingness and any other values, observed or missing, within the dataset. In other words,
the missing data points are a random subset of the data; nothing systematic makes some data
more likely to be missing than others.
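As an illustration (not part of the original notebook), the following sketch simulates MCAR on a made-up toy dataset: each row has the same chance of a missing salary, regardless of any value in the data. The column names and probabilities here are assumptions chosen only for the example.

import numpy as np
import pandas as pd

rng = np.random.default_rng(42)

# Hypothetical toy data, not the Titanic dataset
toy = pd.DataFrame({
    "age": rng.integers(18, 80, size=1000),
    "salary": rng.normal(50_000, 10_000, size=1000),
})

# MCAR: every row has the same 10% chance of a missing salary,
# independent of age, salary, or anything else in the data
mcar_mask = rng.random(len(toy)) < 0.10
toy.loc[mcar_mask, "salary"] = np.nan

print(toy["salary"].isnull().mean())   # roughly 0.10, overall and within any subgroup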

In [1]: import pandas as pd

In [2]: df=pd.read_csv('titanic.csv')


In [3]: df.head()

Out[3]:
   PassengerId  Survived  Pclass                                               Name     Sex   Age  SibSp  Parch            Ticket     Fare  ...
0            1         0       3                            Braund, Mr. Owen Harris    male  22.0      1      0         A/5 21171   7.2500  ...
1            2         1       1  Cumings, Mrs. John Bradley (Florence Briggs Th...  female  38.0      1      0          PC 17599  71.2833  ...
2            3         1       3                             Heikkinen, Miss. Laina  female  26.0      0      0  STON/O2. 3101282   7.9250  ...
3            4         1       1       Futrelle, Mrs. Jacques Heath (Lily May Peel)  female  35.0      1      0            113803  53.1000  ...
4            5         0       3                           Allen, Mr. William Henry    male  35.0      0      0            373450   8.0500  ...

In [4]: df.isnull().sum()

Out[4]: PassengerId 0
Survived 0
Pclass 0
Name 0
Sex 0
Age 177
SibSp 0
Parch 0
Ticket 0
Fare 0
Cabin 687
Embarked 2
dtype: int64


In [5]: df[df['Embarked'].isnull()]

Out[5]:
     PassengerId  Survived  Pclass                                        Name     Sex   Age  SibSp  Parch  Ticket  Fare Cabin Embarked
61            62         1       1                         Icard, Miss. Amelie  female  38.0      0      0  113572  80.0   B28      NaN
829          830         1       1  Stone, Mrs. George Nelson (Martha Evelyn)  female  62.0      0      0  113572  80.0   B28      NaN

2. Missing Data Not At Random (MNAR): Systematic missing values


There is a systematic relationship between whether a value is missing and other values,
observed or missing, within the dataset.

In [6]: import numpy as np


df['cabin_null']=np.where(df['Cabin'].isnull(),1,0)

##find the percentage of null values
df['cabin_null'].mean()

Out[6]: 0.7710437710437711


In [7]: df.columns

Out[7]: Index(['PassengerId', 'Survived', 'Pclass', 'Name', 'Sex', 'Age', 'SibSp',


'Parch', 'Ticket', 'Fare', 'Cabin', 'Embarked', 'cabin_null'],
dtype='object')

In [8]: df.groupby(['Survived'])['cabin_null'].mean()

Out[8]: Survived
0 0.876138
1 0.602339
Name: cabin_null, dtype: float64
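The output above suggests the missingness of Cabin is related to survival (about 88% of non-survivors have no Cabin recorded versus about 60% of survivors), which is why it is treated as systematic. As an optional check (not part of the original notebook), a cross-tabulation makes the same point:

# Proportion of missing/non-missing Cabin within each Survived group
pd.crosstab(df['Survived'], df['cabin_null'], normalize='index')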

3. Missing At Random (MAR)


The probability of a value being missing depends on other observed information in the dataset, not on the missing value itself. For example:

1. Men tend to hide their salary
2. Women tend to hide their age

In both cases the missingness of salary (or age) depends on another observed variable, gender.
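As a hedged illustration (not part of the original notebook), the sketch below simulates MAR: the chance that salary is missing depends on an observed gender column, not on the salary value itself. The column names and probabilities are assumptions made up for the example.

import numpy as np
import pandas as pd

rng = np.random.default_rng(0)

people = pd.DataFrame({
    "gender": rng.choice(["male", "female"], size=1000),
    "salary": rng.normal(50_000, 10_000, size=1000),
})

# MAR: men are more likely to leave salary blank than women,
# so missingness depends only on the observed 'gender' column
p_missing = np.where(people["gender"] == "male", 0.30, 0.05)
people.loc[rng.random(len(people)) < p_missing, "salary"] = np.nan

print(people.groupby("gender")["salary"].apply(lambda s: s.isnull().mean()))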

All the techniques for handling missing values:

1. Mean/Median/Mode replacement
2. Random Sample Imputation
3. Capturing NaN values with a new feature
4. End of Distribution imputation
5. Arbitrary value imputation
6. Frequent categories imputation

Mean/Median/Mode imputation

When should we apply it? Mean/median imputation assumes that the data are missing
completely at random (MCAR). We replace the NaN values with the mean or median of the
variable; for categorical variables, we use the mode (the most frequent category).
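The cells below demonstrate median imputation for the numerical Age column. For a categorical column such as Embarked (which had 2 missing values above), the equivalent mode imputation could look like this sketch; the `full` variable and `Embarked_mode` column name are assumptions introduced only for the example.

# Mode (most frequent category) imputation for a categorical column
full = pd.read_csv('titanic.csv')
most_frequent = full['Embarked'].mode()[0]                     # 'S' in the Titanic data
full['Embarked_mode'] = full['Embarked'].fillna(most_frequent)
print(full['Embarked_mode'].isnull().sum())                    # 0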

In [11]: df=pd.read_csv('titanic.csv',usecols=['Age','Fare','Survived'])
df.head()

Out[11]: Survived Age Fare

0 0 22.0 7.2500

1 1 38.0 71.2833

2 1 26.0 7.9250

3 1 35.0 53.1000

4 0 35.0 8.0500

In [12]: ## Let's check the percentage of missing values


df.isnull().mean()

Out[12]: Survived 0.000000


Age 0.198653
Fare 0.000000
dtype: float64


In [13]: def impute_nan(df, variable, median):
             # Add a new column where NaN values are replaced by the supplied median
             df[variable + "_median"] = df[variable].fillna(median)

In [14]: median=df.Age.median()
median

Out[14]: 28.0

In [16]: impute_nan(df,'Age',median)
df

Out[16]:
     Survived   Age     Fare  Age_median
0           0  22.0   7.2500        22.0
1           1  38.0  71.2833        38.0
2           1  26.0   7.9250        26.0
3           1  35.0  53.1000        35.0
4           0  35.0   8.0500        35.0
..        ...   ...      ...         ...
886         0  27.0  13.0000        27.0
887         1  19.0  30.0000        19.0
888         0   NaN  23.4500        28.0
889         1  26.0  30.0000        26.0
890         0  32.0   7.7500        32.0

891 rows × 4 columns

In [22]: print(df['Age'].std())
print(df['Age_median'].std())

14.526497332334042
13.019696550973201

In [23]: import matplotlib.pyplot as plt


%matplotlib inline


In [25]: fig = plt.figure()
         ax = fig.add_subplot(111)
         # Compare the Age distribution before and after median imputation
         df['Age'].plot(kind='kde', ax=ax)
         df.Age_median.plot(kind='kde', ax=ax, color='red')
         lines, labels = ax.get_legend_handles_labels()
         ax.legend(lines, labels, loc='best')

Out[25]: <matplotlib.legend.Legend at 0x273541c2828>
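For reference (not part of the original notebook), the same median imputation can be done with scikit-learn's SimpleImputer. This is a sketch assuming scikit-learn is installed; the `Age_sklearn` column name is introduced only for the example.

from sklearn.impute import SimpleImputer

# strategy='median' fills NaN with the column median (28.0 here);
# 'mean' and 'most_frequent' are the other common strategies
imputer = SimpleImputer(strategy='median')
df['Age_sklearn'] = imputer.fit_transform(df[['Age']]).ravel()

print(df['Age_sklearn'].isnull().sum())   # 0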

Advantages And Disadvantages of Mean/Median Imputation

Advantages

1. Easy to implement (and the median is robust to outliers)


2. A fast way to obtain a complete dataset

Disadvantages
1. Changes or distorts the original variance (illustrated in the sketch below)
2. Impacts the correlation with other variables
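To see these effects concretely (an optional sketch, not part of the original notebook), compare the variance and the Age-Fare correlation before and after imputation using the columns created above:

# Variance shrinks because 177 identical values (the median, 28.0) are added
print(df['Age'].var(), df['Age_median'].var())

# The correlation with Fare also shifts after imputation
print(df['Age'].corr(df['Fare']))
print(df['Age_median'].corr(df['Fare']))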
