0% found this document useful (0 votes)
28 views7 pages

Pandas Day 4

The document summarizes analysis done on the Titanic dataset using pandas. It loads the dataset, explores missing data, groups data by variables like sex and class to analyze patterns in fare prices, and imputes missing age values with the mean. Key findings include: females had higher average fares than males; first class fares were highest; and replacing missing ages with the overall age mean of 29.7 years.

Uploaded by

Ali Khokhar
Copyright
© © All Rights Reserved
We take content rights seriously. If you suspect this is your content, claim it here.
Available Formats
Download as PDF, TXT or read online on Scribd
0% found this document useful (0 votes)
28 views7 pages

Pandas Day 4

The document summarizes analysis done on the Titanic dataset using pandas. It loads the dataset, explores missing data, groups data by variables like sex and class to analyze patterns in fare prices, and imputes missing age values with the mean. Key findings include: females had higher average fares than males; first class fares were highest; and replacing missing ages with the overall age mean of 29.7 years.

Uploaded by

Ali Khokhar
Copyright
© © All Rights Reserved
We take content rights seriously. If you suspect this is your content, claim it here.
Available Formats
Download as PDF, TXT or read online on Scribd
You are on page 1/ 7

pandas Day 4

November 8, 2023

[ ]: import pandas as pd
import seaborn as sns

[ ]: df = sns.load_dataset('titanic')
df.head()

[ ]: survived pclass sex age sibsp parch fare embarked class \


0 0 3 male 22.0 1 0 7.2500 S Third
1 1 1 female 38.0 1 0 71.2833 C First
2 1 3 female 26.0 0 0 7.9250 S Third
3 1 1 female 35.0 1 0 53.1000 S First
4 0 3 male 35.0 0 0 8.0500 S Third

who adult_male deck embark_town alive alone


0 man True NaN Southampton no False
1 woman False C Cherbourg yes False
2 woman False NaN Southampton yes True
3 woman False C Southampton yes False
4 man True NaN Southampton no True

[ ]: df.shape

[ ]: (891, 15)

[ ]: df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 891 entries, 0 to 890
Data columns (total 15 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 survived 891 non-null int64
1 pclass 891 non-null int64
2 sex 891 non-null object
3 age 714 non-null float64
4 sibsp 891 non-null int64
5 parch 891 non-null int64
6 fare 891 non-null float64
7 embarked 889 non-null object

1
8 class 891 non-null category
9 who 891 non-null object
10 adult_male 891 non-null bool
11 deck 203 non-null category
12 embark_town 889 non-null object
13 alive 891 non-null object
14 alone 891 non-null bool
dtypes: bool(2), category(2), float64(2), int64(4), object(5)
memory usage: 80.7+ KB

[ ]: df.isnull().sum()

[ ]: survived 0
pclass 0
sex 0
age 177
sibsp 0
parch 0
fare 0
embarked 2
class 0
who 0
adult_male 0
deck 688
embark_town 2
alive 0
alone 0
dtype: int64

[ ]: df.isnull().sum() / df.shape[0]

[ ]: survived 0.000000
pclass 0.000000
sex 0.000000
age 0.198653
sibsp 0.000000
parch 0.000000
fare 0.000000
embarked 0.002245
class 0.000000
who 0.000000
adult_male 0.000000
deck 0.772166
embark_town 0.002245
alive 0.000000
alone 0.000000
dtype: float64

2
[ ]: df.shape[0]

[ ]: 891

[ ]: df.shape[0]
df.shape[1]

[ ]: 15

[ ]: sns.heatmap(df.isnull(), cbar=False)

[ ]: <Axes: >

[ ]: sns.heatmap(df.isnull(), cbar=True)

[ ]: <Axes: >

3
[ ]: df[['sex', 'age', 'class']]

[ ]: sex age class


0 male 22.0 Third
1 female 38.0 First
2 female 26.0 Third
3 female 35.0 First
4 male 35.0 Third
.. … … …
886 male 27.0 Second
887 female 19.0 First
888 female NaN Third
889 male 26.0 First
890 male 32.0 Third

[891 rows x 3 columns]

4
[ ]: print(df['sex'].unique())
df.sex.nunique()

['male' 'female']

[ ]: 2

[ ]: df.nunique()

[ ]: survived 2
pclass 3
sex 2
age 88
sibsp 7
parch 7
fare 248
embarked 3
class 3
who 3
adult_male 2
deck 7
embark_town 3
alive 2
alone 2
dtype: int64

[ ]: df.columns

[ ]: Index(['survived', 'pclass', 'sex', 'age', 'sibsp', 'parch', 'fare',


'embarked', 'class', 'who', 'adult_male', 'deck', 'embark_town',
'alive', 'alone'],
dtype='object')

[ ]: df['embark_town'].unique()

[ ]: array(['Southampton', 'Cherbourg', 'Queenstown', nan], dtype=object)

[ ]: df['embark_town'].value_counts()

[ ]: embark_town
Southampton 644
Cherbourg 168
Queenstown 77
Name: count, dtype: int64

[ ]: df.groupby('sex')['fare'].mean()

5
[ ]: sex
female 44.479818
male 25.523893
Name: fare, dtype: float64

[ ]: df.groupby(['class', 'sex'])['fare'].mean()

C:\Users\Lala Abu\AppData\Local\Temp\ipykernel_9808\299023266.py:1:
FutureWarning: The default of observed=False is deprecated and will be changed
to True in a future version of pandas. Pass observed=False to retain current
behavior or observed=True to adopt the future default and silence this warning.
df.groupby(['class', 'sex'])['fare'].mean()

[ ]: class sex
First female 106.125798
male 67.226127
Second female 21.970121
male 19.741782
Third female 16.118810
male 12.661633
Name: fare, dtype: float64

[ ]: df.groupby(['survived', 'sex', 'who'])['fare'].mean()

[ ]: survived sex who


0 female child 32.076113
woman 20.967174
male child 33.073905
man 21.490736
1 female child 30.887800
woman 54.813801
male child 35.562700
man 42.076422
Name: fare, dtype: float64

[ ]: df.groupby(['survived', 'who']).size()

[ ]: survived who
0 child 34
man 449
woman 66
1 child 49
man 88
woman 205
dtype: int64

[ ]: df.drop('deck', axis=1, inplace=True)

6
[ ]: print (df.age.median())
print (df.age.mean())

28.0
29.69911764705882

[ ]: df['age'].fillna(df['age'].mean(), inplace=True)

[ ]: print (df.age.median())
print (df.age.mean())

29.69911764705882
29.69911764705882

[ ]:

You might also like