AM19 EDA Assignment1
[3]: import pandas as pd  # required for pd.read_excel
df = pd.read_excel('Titanic-Dataset.xlsx')
df
Parch Ticket Fare Cabin Embarked
0 0 A/5 21171 7.2500 NaN S
1 0 PC 17599 71.2833 C85 C
2 0 STON/O2. 3101282 7.9250 NaN S
3 0 113803 53.1000 C123 S
4 0 373450 8.0500 NaN S
.. … … … … …
886 0 211536 13.0000 NaN S
887 0 112053 30.0000 B42 S
888 2 W./C. 6607 23.4500 NaN S
889 0 111369 30.0000 C148 C
890 0 370376 7.7500 NaN Q
[7]: df_d
[9]: 10692
[10]: 2
[890, 1, 1, …, 30.0, 'C148', 'C'],
[891, 0, 3, …, 7.75, nan, 'Q']], dtype=object)
[16]: False
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 891 entries, 0 to 890
Data columns (total 12 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 PassengerId 891 non-null int64
1 Survived 891 non-null int64
2 Pclass 891 non-null int64
3 Name 891 non-null object
4 Sex 891 non-null object
5 Age 714 non-null float64
6 SibSp 891 non-null int64
7 Parch 891 non-null int64
8 Ticket 891 non-null object
9 Fare 891 non-null float64
10 Cabin 204 non-null object
11 Embarked 889 non-null object
dtypes: float64(2), int64(5), object(5)
memory usage: 83.7+ KB
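The Non-Null Count column above implies missing values in Age, Cabin, and Embarked (for example, 891 - 714 = 177 missing ages). A minimal sketch of how those counts could be computed directly, using a small hypothetical frame named `demo` in place of `df` (its values are illustrative only, not taken from the dataset):

```python
import numpy as np
import pandas as pd

# Hypothetical miniature frame standing in for df; values are illustrative only.
demo = pd.DataFrame({
    "Age": [22.0, np.nan, 26.0, 35.0],
    "Cabin": [np.nan, "C85", np.nan, "C123"],
    "Embarked": ["S", "C", "S", "S"],
})

# Missing values per column: the complement of the non-null count from info().
missing = demo.isna().sum()

# Share of missing values per column (mean of a boolean mask).
missing_share = demo.isna().mean()

print(missing)
print(missing_share)
```

Running the same two calls on `df` should give the per-column gaps that `df.info()` only shows indirectly.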
2 Heikkinen, Miss. Laina female 26.0 0
3 Futrelle, Mrs. Jacques Heath (Lily May Peel) female 35.0 1
4 Allen, Mr. William Henry male 35.0 0
[21]: #15. Display descriptive statistics for the numerical columns of df
df.describe()
Parch Fare
count 891.000000 891.000000
mean 0.381594 32.204208
std 0.806057 49.693429
min 0.000000 0.000000
25% 0.000000 7.910400
50% 0.000000 14.454200
75% 0.000000 31.000000
max 6.000000 512.329200
[22]: #16. Find mean, median, mode, std values for numerical columns in df
df['Age'].mean()
[22]: 29.69911764705882
[23]: df['Age'].median()
[23]: 28.0
[24]: df['Age'].mode()
[24]: 0 24.0
Name: Age, dtype: float64
[25]: df['Age'].min()
[25]: 0.42
[26]: df['Age'].max()
[26]: 80.0
[27]: df['Age'].std()
[27]: 14.526497332334042
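Task #16 asks for these statistics across the numerical columns, while the cells above compute them for Age only. One possible way to cover every numeric column at once is sketched below; `demo` is a hypothetical stand-in for `df`, and `select_dtypes` plus `agg` is just one option, not necessarily what the assignment intends:

```python
import pandas as pd

# Hypothetical stand-in for df with two numeric columns and one text column.
demo = pd.DataFrame({
    "Age": [22.0, 38.0, 26.0, 35.0, 35.0],
    "Fare": [7.25, 71.2833, 7.925, 53.1, 8.05],
    "Sex": ["male", "female", "female", "female", "male"],
})

# Keep only the numeric columns so the statistics are well defined.
numeric = demo.select_dtypes(include="number")

# One row per statistic, one column per numeric variable.
summary = numeric.agg(["mean", "median", "std", "min", "max"])

# mode() can return several values per column; take the first one for each.
modes = numeric.mode().iloc[0]

print(summary)
print(modes)
```

When a column has no repeated value, `mode()` returns every value in sorted order, so the first row is simply the smallest value and should be read with care.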
[30]: 2
[31]: PassengerId 1
Survived 0
Pclass 3
Name Braund, Mr. Owen Harris
Sex male
Age 22.0
SibSp 1
Parch 0
Ticket A/5 21171
Fare 7.25
Cabin NaN
Embarked S
Name: 0, dtype: object
[32]: PassengerId 5
Survived 0
Pclass 3
Name Allen, Mr. William Henry
Sex male
Age 35.0
SibSp 0
Parch 0
Ticket 373450
Fare 8.05
Cabin NaN
Embarked S
Name: 4, dtype: object
Ticket 370376
Fare 7.75
Cabin NaN
Embarked Q
Name: 890, dtype: object
[35]: #24. Locate all the rows in df with specific criteria using loc
df.loc[df['Age']>20]
0 Braund, Mr. Owen Harris male 22.0 1
1 Cumings, Mrs. John Bradley (Florence Briggs Th… female 38.0 1
2 Heikkinen, Miss. Laina female 26.0 0
3 Futrelle, Mrs. Jacques Heath (Lily May Peel) female 35.0 1
4 Allen, Mr. William Henry male 35.0 0
.. … … … …
884 Sutehall, Mr. Henry Jr male 25.0 0
885 Rice, Mrs. William (Margaret Norton) female 39.0 0
886 Montvila, Rev. Juozas male 27.0 0
889 Behr, Mr. Karl Howell male 26.0 0
890 Dooley, Mr. Patrick male 32.0 0
[36]: #25. Locate all the rows in df with specific criteria with logical AND operator using loc
df.loc[(df['Age']>15)&(df['Age']<20)]
44 Devaney, Miss. Margaret Delia female 19.0 0
49 Arnold-Franchi, Mrs. Josef (Josefine Franchi) female 18.0 1
67 Crease, Mr. Ernest James male 19.0 0
.. … … … …
844 Culumovic, Mr. Jeso male 17.0 0
853 Lines, Miss. Mary Conover female 16.0 0
855 Aks, Mrs. Sam (Leah Rosen) female 18.0 0
877 Petroff, Mr. Nedelio male 19.0 0
887 Graham, Miss. Margaret Edith female 19.0 0
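The AND filter above combines two boolean masks with `&`. The parentheses matter because `&` binds more tightly than `>` and `<` in Python. A small sketch on a hypothetical frame `demo` (names and ages are made up), with `Series.between` as an equivalent alternative, assuming pandas 1.3 or later for the string form of `inclusive`:

```python
import pandas as pd

# Hypothetical data; ages chosen to sit on and around the boundaries 15 and 20.
demo = pd.DataFrame({
    "Name": ["A", "B", "C", "D"],
    "Age": [15.0, 16.0, 19.0, 20.0],
})

# Each comparison yields a boolean Series; & combines them element-wise.
teens = demo.loc[(demo["Age"] > 15) & (demo["Age"] < 20)]

# Same strict range expressed with between(); both bounds are excluded.
same = demo.loc[demo["Age"].between(15, 20, inclusive="neither")]

print(teens)
```

Both forms exclude the boundary ages 15 and 20, matching the strict inequalities used in cell [36].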
[37]: PassengerId 1
Survived 0
Pclass 3
Name Braund, Mr. Owen Harris
Sex male
Age 22.0
SibSp 1
Parch 0
Ticket A/5 21171
Fare 7.25
Cabin NaN
Embarked S
Name: 0, dtype: object
[38]: PassengerId 5
Survived 0
Pclass 3
Name Allen, Mr. William Henry
Sex male
Age 35.0
SibSp 0
Parch 0
Ticket 373450
Fare 8.05
Cabin NaN
Embarked S
Name: 4, dtype: object
[40]: 0 S
1 C
2 S
3 S
4 S
..
886 S
887 S
888 S
889 C
890 Q
Name: Embarked, Length: 891, dtype: object
[41]: #30. Locate all the rows in df within a specific range of row positions
df.iloc[0:5]
[41]: PassengerId Survived Pclass \
0 1 0 3
1 2 1 1
2 3 1 3
3 4 1 1
4 5 0 3
[43]: #32. Locate the second through third rows and the first 2 columns
df.iloc[1:3,0:2]
[44]: #33. Locate the 1st and 6th rows and the 1st and 4th columns
df.iloc[[0,5],[0,3]]
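The `iloc` calls above select by integer position, whereas `loc` selects by label. The sketch below contrasts the two on a hypothetical six-row frame `demo` (the Name values and the sixth row are made up); note that a `loc` slice includes its end label while an `iloc` slice excludes its end position:

```python
import pandas as pd

# Hypothetical six-row frame with a default RangeIndex, like df.
demo = pd.DataFrame({
    "PassengerId": [1, 2, 3, 4, 5, 6],
    "Survived": [0, 1, 1, 1, 0, 0],
    "Pclass": [3, 1, 3, 1, 3, 3],
    "Name": ["a", "b", "c", "d", "e", "f"],
})

# Position-based: rows 0 and 5, columns 0 and 3.
by_position = demo.iloc[[0, 5], [0, 3]]

# Label-based equivalent: same rows (labels equal positions here), named columns.
by_label = demo.loc[[0, 5], ["PassengerId", "Name"]]

# iloc slices stop before the end position: rows 1 and 2 only.
slice_pos = demo.iloc[1:3, 0:2]

# loc slices include the end label: rows 1, 2, and 3.
slice_lab = demo.loc[1:3, ["PassengerId", "Survived"]]

print(by_position)
print(slice_pos.shape, slice_lab.shape)
```

The two selections only coincide because the index is the default 0..n-1 range; after filtering or sorting, labels and positions diverge.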
[46]: #34. Save the created data to CSV and Excel files, then reload and check
df_d.to_csv('EDA_data.csv')
[47]: df_d.to_excel('EDA_data.xlsx')
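Task #34 also asks to reload and check, which the cells above do not show. A minimal sketch of a CSV round trip follows, using a hypothetical frame `demo` and a temporary directory instead of `df_d` and the working folder. It passes `index=False`, which differs from the calls above: without it, the saved index comes back as an extra `Unnamed: 0` column on reload.

```python
import os
import tempfile

import pandas as pd

# Hypothetical stand-in for df_d.
demo = pd.DataFrame({"PassengerId": [1, 2], "Fare": [7.25, 71.2833]})

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "EDA_data.csv")

    # index=False keeps the row index out of the file.
    demo.to_csv(path, index=False)

    # float_precision="round_trip" asks the parser to recover floats exactly.
    reloaded = pd.read_csv(path, float_precision="round_trip")

# True when values, column names, and dtypes all match.
round_trip_ok = demo.equals(reloaded)
print(round_trip_ok)
```

The Excel file can be checked the same way with `pd.read_excel`, assuming an Excel engine such as openpyxl is installed.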