Feature Engineering - MeanMedianDay 1 - Jupyter Notebook
In [1]: import pandas as pd
In [2]: df=pd.read_csv('titanic.csv')
In [3]: df.head()
Out[3]:
   PassengerId  Survived  Pclass  Name                                               Sex     Age   SibSp  Parch  Ticket            Fare     ...
0  1            0         3       Braund, Mr. Owen Harris                            male    22.0  1      0      A/5 21171         7.2500   ...
1  2            1         1       Cumings, Mrs. John Bradley (Florence Briggs Th...  female  38.0  1      0      PC 17599          71.2833  ...
2  3            1         3       Heikkinen, Miss. Laina                             female  26.0  0      0      STON/O2. 3101282  7.9250   ...
3  4            1         1       Futrelle, Mrs. Jacques Heath (Lily May Peel)       female  35.0  1      0      113803            53.1000  ...
4  5            0         3       Allen, Mr. William Henry                           male    35.0  0      0      373450            8.0500   ...
In [4]: df.isnull().sum()
Out[4]: PassengerId 0
Survived 0
Pclass 0
Name 0
Sex 0
Age 177
SibSp 0
Parch 0
Ticket 0
Fare 0
Cabin 687
Embarked 2
dtype: int64
In [5]: df[df['Embarked'].isnull()]
Out[5]:
     PassengerId  Survived  Pclass  Name                                       Sex     Age   SibSp  Parch  Ticket  Fare  Cabin  ...
61   62           1         1       Icard, Miss. Amelie                        female  38.0  0      0      113572  80.0  B28    ...
829  830          1         1       Stone, Mrs. George Nelson (Martha Evelyn)  female  62.0  0      0      113572  80.0  B28    ...
Out[6]: 0.7710437710437711
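The cell that produced `Out[6]` is not present in this export. Since the later `groupby` uses a `cabin_null` column, one plausible reading is that a 0/1 missing-value indicator was built for `Cabin` and then averaged. The sketch below shows that idea on a small made-up frame; the toy values and the variable names `toy`, `missing_fraction`, and `missing_by_survival` are assumptions for illustration, not the notebook's own code.

```python
import numpy as np
import pandas as pd

# Toy stand-in for the Titanic frame (values are made up for illustration).
toy = pd.DataFrame({
    "Survived": [0, 0, 0, 1, 1, 1],
    "Cabin": [np.nan, np.nan, np.nan, "A1", np.nan, "B2"],
})

# Indicator column: 1 where Cabin is missing, 0 otherwise.
toy["cabin_null"] = np.where(toy["Cabin"].isnull(), 1, 0)

# The mean of a 0/1 indicator is the fraction of rows with a missing value.
missing_fraction = toy["cabin_null"].mean()
print(missing_fraction)

# The same fraction computed separately for each Survived group.
missing_by_survival = toy.groupby("Survived")["cabin_null"].mean()
print(missing_by_survival)
```

If the missing fraction differs noticeably between groups, as it does in the notebook's `groupby` output, the values are probably not missing completely at random.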
In [7]: df.columns
In [8]: df.groupby(['Survived'])['cabin_null'].mean()
Out[8]: Survived
0 0.876138
1 0.602339
Name: cabin_null, dtype: float64
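Missing At Random (MAR)
Example: men may hide their salary. The chance that a value is missing then depends on another observed variable (here, sex) rather than on the missing value itself.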
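A small simulation can make the MAR idea concrete. The sketch below assumes hypothetical non-response rates (40% for men, 10% for women) that depend only on `sex`, never on the salary itself; the data, the rates, and all names are invented for illustration.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
n = 10_000

# Synthetic survey data (not from the Titanic dataset).
people = pd.DataFrame({
    "sex": rng.choice(["male", "female"], size=n),
    "salary": rng.normal(50_000, 10_000, size=n),
})

# Assumed non-response rates: they depend on sex only, not on salary.
p_missing = np.where(people["sex"] == "male", 0.40, 0.10)
hidden = rng.random(n) < p_missing
people.loc[hidden, "salary"] = np.nan

# Under MAR the missing rate differs across the observed variable.
missing_rate_by_sex = people["salary"].isnull().groupby(people["sex"]).mean()
print(missing_rate_by_sex)
```

The per-group missing rates should land close to the assumed 0.40 and 0.10, which is the pattern to look for when checking whether missingness is tied to an observed column.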
1. Mean/Median/Mode replacement
When should we apply it? Mean/median imputation assumes that the data are missing
completely at random (MCAR). We handle the missing values by replacing each NaN with the
mean or median of the variable (or, for mode replacement, with its most frequent value).
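As a minimal sketch of the three replacement strategies, the example below fills the gaps in a small made-up series with its mean, median, and mode. The values and variable names are illustrative assumptions only.

```python
import numpy as np
import pandas as pd

# Toy numeric column with two missing values (made-up numbers).
values = pd.Series([10.0, 20.0, np.nan, 20.0, 30.0, np.nan, 70.0])

# Each statistic is computed from the observed values only (NaN is skipped).
mean_filled = values.fillna(values.mean())
median_filled = values.fillna(values.median())
# mode() returns a Series because ties are possible; take the first entry.
mode_filled = values.fillna(values.mode()[0])

print(mean_filled.tolist())
print(median_filled.tolist())
print(mode_filled.tolist())
```

The single large value (70.0) pulls the mean above the median here, which is one reason the median is often preferred for skewed columns.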
In [11]: df=pd.read_csv('titanic.csv',usecols=['Age','Fare','Survived'])
df.head()
Out[11]:
   Survived  Age   Fare
0  0         22.0  7.2500
1  1         38.0  71.2833
2  1         26.0  7.9250
3  1         35.0  53.1000
4  0         35.0  8.0500
In [14]: median=df.Age.median()
median
Out[14]: 28.0
In [16]: impute_nan(df,'Age',median)
df
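The definition of `impute_nan` is not included in this export. Judging from the call above and the `Age_median` column used later, it probably adds a new `<variable>_median` column in which NaN is replaced by the supplied value. The following is a sketch under that assumption, run on a made-up frame rather than the Titanic data.

```python
import numpy as np
import pandas as pd

def impute_nan(df, variable, median):
    """Add a '<variable>_median' column with NaN replaced by `median`.

    Assumed behaviour, reconstructed from how the helper is called;
    the original column is left unchanged.
    """
    df[variable + "_median"] = df[variable].fillna(median)

# Toy frame with two missing ages (made-up values).
toy = pd.DataFrame({"Age": [22.0, np.nan, 26.0, 35.0, np.nan]})
median = toy["Age"].median()  # median of the observed values only
impute_nan(toy, "Age", median)
print(toy)
```

Writing to a separate column keeps the original `Age` available, so the distributions before and after imputation can be compared.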
In [22]: print(df['Age'].std())
print(df['Age_median'].std())
14.526497332334042
13.019696550973201
Advantages
Disadvantages
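3. Changes or distorts the original variance: in the run above, the standard deviation of Age drops from about 14.53 to about 13.02 after median imputation.
4. Impacts the correlation with other variables.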
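Both disadvantages can be reproduced on synthetic data. The sketch below assumes two correlated, randomly generated columns named `Age` and `Fare` (not the real Titanic values), removes about 20% of the ages completely at random, and compares the spread and correlation before and after median imputation.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(42)
n = 1_000

# Synthetic, positively correlated columns (illustrative only).
age = rng.normal(30, 14, size=n)
fare = 2.0 * age + rng.normal(0, 10, size=n)
toy = pd.DataFrame({"Age": age, "Fare": fare})

# Remove roughly 20% of the ages completely at random.
mask = rng.random(n) < 0.2
toy.loc[mask, "Age"] = np.nan

# Median imputation into a separate column.
toy["Age_median"] = toy["Age"].fillna(toy["Age"].median())

# pandas skips NaN, so the "before" numbers use observed values only.
std_before = toy["Age"].std()
std_after = toy["Age_median"].std()
corr_before = toy["Age"].corr(toy["Fare"])
corr_after = toy["Age_median"].corr(toy["Fare"])

print(std_before, std_after)
print(corr_before, corr_after)
```

Every imputed row sits exactly at the median, so the spread shrinks, and those rows carry no information about `Fare`, so the correlation weakens.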