Assignment 5

The document provides a comprehensive overview of data manipulation using pandas in Python, including importing libraries, loading datasets, and performing basic data exploration. It covers handling missing values, modifying and adding columns, and creating new DataFrames from dictionaries. Key operations demonstrated include checking for missing values, filling them, extracting names, and categorizing age groups.

Uploaded by

Ehtisham Amjad

Importing Libraries and Loading Data

import pandas as pd
import numpy as np

train = pd.read_csv('/content/drive/MyDrive/train.csv')
test = pd.read_csv('/content/drive/MyDrive/test.csv')
gender_submission = pd.read_csv('/content/drive/MyDrive/gender_submission.csv')

# DataFrame.info() prints its report directly and returns None,
# so print each label first and then call info() on its own line.
print("Train Dataset:")
train.info()

print("\nTest Dataset:")
test.info()

print("\nGender Submission Dataset:")
gender_submission.info()

# Preview the first 5 rows of the train dataset
print("\nFirst 5 rows of the Train Dataset:")
print(train.head())

Train Dataset:
...
memory usage: 83.7+ KB

Test Dataset:
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 418 entries, 0 to 417
Data columns (total 11 columns):
 #   Column       Non-Null Count  Dtype
---  ------       --------------  -----
 0   PassengerId  418 non-null    int64
 1   Pclass       418 non-null    int64
 2   Name         418 non-null    object
 3   Sex          418 non-null    object
 4   Age          332 non-null    float64
 5   SibSp        418 non-null    int64
 6   Parch        418 non-null    int64
 7   Ticket       418 non-null    object
 8   Fare         417 non-null    float64
 9   Cabin        91 non-null     object
 10  Embarked     418 non-null    object
dtypes: float64(2), int64(4), object(5)
memory usage: 36.0+ KB

Gender Submission Dataset:
...

First 5 rows of the Train Dataset:
   PassengerId  Survived  Pclass  \
0            1         0       3
1            2         1       1
2            3         1       3
3            4         1       1
4            5         0       3

                                                Name     Sex   Age  SibSp  \
0                            Braund, Mr. Owen Harris    male  22.0      1
1  Cumings, Mrs. John Bradley (Florence Briggs Th...  female  38.0      1
2                             Heikkinen, Miss. Laina  female  26.0      0
3       Futrelle, Mrs. Jacques Heath (Lily May Peel)  female  35.0      1
4                           Allen, Mr. William Henry    male  35.0      0

   Parch            Ticket     Fare  Cabin  Embarked
0      0         A/5 21171   7.2500    NaN         S
1      0          PC 17599  71.2833    C85         C
2      0  STON/O2. 3101282   7.9250    NaN         S
3      0            113803  53.1000   C123         S
4      0            373450   8.0500    NaN         S

Basic Data Exploration

a. See the first and last 5 rows:

print("First 5 rows of train data:")


print(train.head())

print("\nLast 5 rows of train data:")


print(train.tail())

First 5 rows of train data:
   PassengerId  Survived  Pclass  \
0            1         0       3
1            2         1       1
2            3         1       3
3            4         1       1
4            5         0       3

                                                Name     Sex   Age  SibSp  \
0                            Braund, Mr. Owen Harris    male  22.0      1
1  Cumings, Mrs. John Bradley (Florence Briggs Th...  female  38.0      1
2                             Heikkinen, Miss. Laina  female  26.0      0
3       Futrelle, Mrs. Jacques Heath (Lily May Peel)  female  35.0      1
4                           Allen, Mr. William Henry    male  35.0      0

   Parch            Ticket     Fare  Cabin  Embarked
0      0         A/5 21171   7.2500    NaN         S
1      0          PC 17599  71.2833    C85         C
2      0  STON/O2. 3101282   7.9250    NaN         S
3      0            113803  53.1000   C123         S
4      0            373450   8.0500    NaN         S

Last 5 rows of train data:
     PassengerId  Survived  Pclass                                      Name  \
886          887         0       2                     Montvila, Rev. Juozas
887          888         1       1              Graham, Miss. Margaret Edith
888          889         0       3  Johnston, Miss. Catherine Helen "Carrie"
889          890         1       1                     Behr, Mr. Karl Howell
890          891         0       3                       Dooley, Mr. Patrick

        Sex   Age  SibSp  Parch      Ticket   Fare  Cabin  Embarked
886    male  27.0      0      0      211536  13.00    NaN         S
887  female  19.0      0      0      112053  30.00    B42         S
888  female   NaN      1      2  W./C. 6607  23.45    NaN         S
889    male  26.0      0      0      111369  30.00   C148         C
890    male  32.0      0      0      370376   7.75    NaN         Q

b. Shape of the dataset:

print("\nShape of the train data (rows, columns):")


print(train.shape)

Shape of the train data (rows, columns):


(891, 12)

c. List of all columns:

print("\nList of all columns in the train dataset:")


print(train.columns.tolist())

List of all columns in the train dataset:


['PassengerId', 'Survived', 'Pclass', 'Name', 'Sex', 'Age', 'SibSp', 'Parch', 'Ticket', 'Fare', 'Cabin', 'Embarked']

d. Values with their counts in a column:

print("\nValue counts for 'Pclass' column:")


print(train['Pclass'].value_counts())

Value counts for 'Pclass' column:


Pclass
3 491
1 216
2 184
Name: count, dtype: int64
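
The counts above can also be expressed as shares of the total. A minimal sketch, assuming a small made-up Series (`pclass` below is hypothetical toy data, not the real train.csv column):

```python
import pandas as pd

# Hypothetical toy column standing in for train['Pclass']
pclass = pd.Series([3, 3, 1, 2, 3], name="Pclass")

# normalize=True returns the fraction of rows per value instead of raw counts
shares = pclass.value_counts(normalize=True)
print(shares)
```

On the real data the same call would be `train['Pclass'].value_counts(normalize=True)`.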

e. General description:

print("\nStatistical description of the dataset:")


print(train.describe(include='all'))

Statistical description of the dataset:


        PassengerId    Survived      Pclass                     Name   Sex  \
count    891.000000  891.000000  891.000000                      891   891
unique          NaN         NaN         NaN                      891     2
top             NaN         NaN         NaN  Braund, Mr. Owen Harris  male
freq            NaN         NaN         NaN                        1   577
mean     446.000000    0.383838    2.308642                      NaN   NaN
std      257.353842    0.486592    0.836071                      NaN   NaN
min        1.000000    0.000000    1.000000                      NaN   NaN
25%      223.500000    0.000000    2.000000                      NaN   NaN
50%      446.000000    0.000000    3.000000                      NaN   NaN
75%      668.500000    1.000000    3.000000                      NaN   NaN
max      891.000000    1.000000    3.000000                      NaN   NaN

               Age       SibSp       Parch  Ticket        Fare    Cabin  \
count   714.000000  891.000000  891.000000     891  891.000000      204
unique         NaN         NaN         NaN     681         NaN      147
top            NaN         NaN         NaN  347082         NaN  B96 B98
freq           NaN         NaN         NaN       7         NaN        4
mean     29.699118    0.523008    0.381594     NaN   32.204208      NaN
std      14.526497    1.102743    0.806057     NaN   49.693429      NaN
min       0.420000    0.000000    0.000000     NaN    0.000000      NaN
25%      20.125000    0.000000    0.000000     NaN    7.910400      NaN
50%      28.000000    0.000000    0.000000     NaN   14.454200      NaN
75%      38.000000    1.000000    0.000000     NaN   31.000000      NaN
max      80.000000    8.000000    6.000000     NaN  512.329200      NaN

        Embarked
count        889
unique         3
top            S
freq         644
mean         NaN
std          NaN
min          NaN
25%          NaN
50%          NaN
75%          NaN
max          NaN

f. Index of rows:

print("\nRow index of the dataset:")


print(train.index)

Row index of the dataset:


RangeIndex(start=0, stop=891, step=1)

Creating a DataFrame

a. Create an empty DataFrame:

empty_df = pd.DataFrame()
print("\nEmpty DataFrame:")
print(empty_df)

Empty DataFrame:
Empty DataFrame
Columns: []
Index: []
b. Create a DataFrame from a dictionary:

import pandas as pd

data_dict = {
    'Name': ['Ehtisham', 'Amjad', 'Muhammad'],
    'Age': [25, 30, 35]
}
# Create a DataFrame from the dictionary
dict_df = pd.DataFrame(data_dict)

# Display the DataFrame


print("\nDataFrame created from a dictionary:")
print(dict_df)

DataFrame created from a dictionary:


Name Age
0 Ehtisham 25
1 Amjad 30
2 Muhammad 35

Handling Missing Values

a. Check for missing values:

print("\nMissing values in the dataset:")


print(train.isnull().sum())

Missing values in the dataset:


PassengerId 0
Survived 0
Pclass 0
Name 0
Sex 0
Age 177
SibSp 0
Parch 0
Ticket 0
Fare 0
Cabin 687
Embarked 2
dtype: int64
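
Raw counts are easier to judge as a percentage of all rows. A minimal sketch, assuming a small made-up frame (`toy` is hypothetical, not the real dataset):

```python
import numpy as np
import pandas as pd

# Hypothetical toy data with gaps in both columns
toy = pd.DataFrame({
    "Age": [22.0, np.nan, 26.0, np.nan],
    "Cabin": [None, "C85", None, None],
})

# isnull() gives True/False per cell; the mean of booleans is the missing share
missing_pct = toy.isnull().mean().mul(100).sort_values(ascending=False)
print(missing_pct)
```

Applied to the train data, this would presumably put Cabin at the top, since 687 of its 891 values are missing.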

b. Handle missing values in a specific column:

import pandas as pd

train = pd.read_csv('/content/drive/MyDrive/train.csv')

print("\nFirst 5 rows of the dataset:")


print(train.head())
mean_age = train['Age'].mean()
train['Age'] = train['Age'].fillna(mean_age)

print("\nAfter filling missing values in 'Age' column with mean:")


print(train['Age'].head())

First 5 rows of the dataset:


PassengerId Survived Pclass \
0 1 0 3
1 2 1 1
2 3 1 3
3 4 1 1
4 5 0 3

Name Sex Age SibSp \


0 Braund, Mr. Owen Harris male 22.0 1
1 Cumings, Mrs. John Bradley (Florence Briggs Th... female 38.0 1
2 Heikkinen, Miss. Laina female 26.0 0
3 Futrelle, Mrs. Jacques Heath (Lily May Peel) female 35.0 1
4 Allen, Mr. William Henry male 35.0 0

Parch Ticket Fare Cabin Embarked


0 0 A/5 21171 7.2500 NaN S
1 0 PC 17599 71.2833 C85 C
2 0 STON/O2. 3101282 7.9250 NaN S
3 0 113803 53.1000 C123 S
4 0 373450 8.0500 NaN S

After filling missing values in 'Age' column with mean:


0 22.0
1 38.0
2 26.0
3 35.0
4 35.0
Name: Age, dtype: float64
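
Filling with the overall mean is one option among several. The sketch below shows two common alternatives, the median and a per-class mean, on a small made-up frame (`toy` and its values are hypothetical):

```python
import numpy as np
import pandas as pd

# Hypothetical toy data standing in for the Pclass and Age columns
toy = pd.DataFrame({
    "Pclass": [1, 1, 3, 3, 3],
    "Age": [40.0, np.nan, 20.0, np.nan, 30.0],
})

# Option 1: fill every gap with the overall median (less sensitive to outliers)
median_filled = toy["Age"].fillna(toy["Age"].median())

# Option 2: fill each gap with the mean age of the passenger's own class
class_mean = toy.groupby("Pclass")["Age"].transform("mean")
group_filled = toy["Age"].fillna(class_mean)

print(median_filled.tolist())
print(group_filled.tolist())
```

Which fill is appropriate depends on the analysis; the assignment itself uses the overall mean.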

Modifying and Adding Columns

a. Extract first and last names from the Name column:

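# Names have the form "LastName, Title. First names", so split on the comma.
train['LastName'] = train['Name'].apply(lambda x: x.split(',')[0].strip())

# Split only on the first comma and strip the space that follows it.
# Note that FirstName keeps the title (Mr., Mrs., Miss., ...).
train['FirstName'] = train['Name'].apply(lambda x: x.split(',', 1)[1].strip() if ',' in x else '')
print("\nFirst and Last Names added as new columns:")
print(train[['Name', 'LastName', 'FirstName']].head())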

First and Last Names added as new columns:


Name LastName \
0 Braund, Mr. Owen Harris Braund
1 Cumings, Mrs. John Bradley (Florence Briggs Th... Cumings
2 Heikkinen, Miss. Laina Heikkinen
3 Futrelle, Mrs. Jacques Heath (Lily May Peel) Futrelle
4 Allen, Mr. William Henry Allen

FirstName
0 Mr. Owen Harris
1 Mrs. John Bradley (Florence Briggs Thayer)
2 Miss. Laina
3 Mrs. Jacques Heath (Lily May Peel)
4 Mr. William Henry
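
The same split can be done without `apply`, using the vectorized `.str` accessor. This is a sketch under the assumption that every name contains a comma and a title ending in a period; the `title` column is an illustrative extra, not part of the assignment:

```python
import pandas as pd

# Two names taken from the output above
names = pd.Series([
    "Braund, Mr. Owen Harris",
    "Heikkinen, Miss. Laina",
])

# Split once on the first comma into two columns
parts = names.str.split(",", n=1, expand=True)
last_name = parts[0].str.strip()
first_name = parts[1].str.strip()

# Title: the text between the comma and the first period
title = names.str.extract(r",\s*([^.]+)\.", expand=False)

print(last_name.tolist(), first_name.tolist(), title.tolist())
```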

b. Add a column to indicate male passengers in 3rd class:

# axis=1 applies the lambda to each row; the result is 1 for 3rd-class males, else 0
train['MenIn3rdClass'] = train.apply(lambda row: 1 if row['Pclass'] == 3 and row['Sex'] == 'male' else 0, axis=1)


print("\nNew column 'MenIn3rdClass' added:")
print(train[['Pclass', 'Sex', 'MenIn3rdClass']].head())

New column 'MenIn3rdClass' added:


Pclass Sex MenIn3rdClass
0 3 male 1
1 1 female 0
2 3 female 0
3 1 female 0
4 3 male 1
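
A row-wise `apply` works, but a boolean mask usually does the same job faster. A minimal sketch on made-up rows (`toy` is hypothetical):

```python
import pandas as pd

# Hypothetical toy rows with the two columns the flag depends on
toy = pd.DataFrame({
    "Pclass": [3, 1, 3],
    "Sex": ["male", "female", "female"],
})

# Combine the two conditions element-wise, then convert True/False to 1/0
mask = (toy["Pclass"] == 3) & (toy["Sex"] == "male")
toy["MenIn3rdClass"] = mask.astype(int)
print(toy)
```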

c. Create age groups using a custom function:

def find_age_group(age):
    if age < 18:
        return 'Child'
    elif 18 <= age < 40:
        return 'Young Adult'
    elif 40 <= age < 60:
        return 'Middle Aged'
    else:
        return 'Senior'

train['AgeGroup'] = train['Age'].apply(find_age_group)
print("\nAge groups created:")
print(train[['Age', 'AgeGroup']].head())

Age groups created:


Age AgeGroup
0 22.0 Young Adult
1 38.0 Young Adult
2 26.0 Young Adult
3 35.0 Young Adult
4 35.0 Young Adult
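
The same binning can be sketched with `pd.cut`, assuming the bin edges 18, 40 and 60 from the function above. One difference worth noting: `find_age_group` returns 'Senior' for a missing age (every comparison with NaN is False, so the `else` branch runs), while `pd.cut` leaves it as NaN. The `ages` values below are made up for illustration:

```python
import numpy as np
import pandas as pd

# Hypothetical ages, including the boundary values and one missing age
ages = pd.Series([5.0, 18.0, 39.9, 40.0, 60.0, np.nan])

bins = [0, 18, 40, 60, float("inf")]
labels = ["Child", "Young Adult", "Middle Aged", "Senior"]

# right=False makes each bin closed on the left: [0, 18), [18, 40), ...
age_group = pd.cut(ages, bins=bins, labels=labels, right=False)
print(age_group)
```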

Renaming and Deleting Columns

a. Rename a column:
train.rename(columns={'Age': 'PassengerAge'}, inplace=True)
print("\nColumns after renaming 'Age' to 'PassengerAge':")
print(train.columns)

Columns after renaming 'Age' to 'PassengerAge':


Index(['PassengerId', 'Survived', 'Pclass', 'Name', 'Sex', 'PassengerAge',
'SibSp', 'Parch', 'Ticket', 'Fare', 'Cabin', 'Embarked', 'LastName',
'FirstName', 'MenIn3rdClass', 'AgeGroup'],
dtype='object')

b. Delete the Cabin column:

train.drop(columns=['Cabin'], inplace=True, errors='ignore')


print("\nDataset after dropping the 'Cabin' column:")
print(train.head())

Dataset after dropping the 'Cabin' column:


PassengerId Survived Pclass \
0 1 0 3
1 2 1 1
2 3 1 3
3 4 1 1
4 5 0 3

Name Sex PassengerAge \


0 Braund, Mr. Owen Harris male 22.0
1 Cumings, Mrs. John Bradley (Florence Briggs Th... female 38.0
2 Heikkinen, Miss. Laina female 26.0
3 Futrelle, Mrs. Jacques Heath (Lily May Peel) female 35.0
4 Allen, Mr. William Henry male 35.0

SibSp Parch Ticket Fare Embarked LastName \


0 1 0 A/5 21171 7.2500 S Braund
1 1 0 PC 17599 71.2833 C Cumings
2 0 0 STON/O2. 3101282 7.9250 S Heikkinen
3 1 0 113803 53.1000 S Futrelle
4 0 0 373450 8.0500 S Allen

FirstName MenIn3rdClass AgeGroup


0 Mr. Owen Harris 1 Young Adult
1 Mrs. John Bradley (Florence Briggs Thayer) 0 Young Adult
2 Miss. Laina 0 Young Adult
3 Mrs. Jacques Heath (Lily May Peel) 0 Young Adult
4 Mr. William Henry 1 Young Adult

Slicing the DataFrame

a. Filter passengers in 3rd class:

pclass_3 = train[train['Pclass'] == 3]
print("\nPassengers in 3rd class:")
print(pclass_3.head())
Passengers in 3rd class:
PassengerId Survived Pclass Name Sex \
0 1 0 3 Braund, Mr. Owen Harris male
2 3 1 3 Heikkinen, Miss. Laina female
4 5 0 3 Allen, Mr. William Henry male
5 6 0 3 Moran, Mr. James male
7 8 0 3 Palsson, Master. Gosta Leonard male

PassengerAge SibSp Parch Ticket Fare Embarked LastName \


0 22.000000 1 0 A/5 21171 7.2500 S Braund
2 26.000000 0 0 STON/O2. 3101282 7.9250 S Heikkinen
4 35.000000 0 0 373450 8.0500 S Allen
5 29.699118 0 0 330877 8.4583 Q Moran
7 2.000000 3 1 349909 21.0750 S Palsson

FirstName MenIn3rdClass AgeGroup


0 Mr. Owen Harris 1 Young Adult
2 Miss. Laina 0 Young Adult
4 Mr. William Henry 1 Young Adult
5 Mr. James 1 Young Adult
7 Master. Gosta Leonard 1 Child

b. Filter females older than 60:

females_60_plus = train[(train['Sex'] == 'female') & (train['PassengerAge'] > 60)]


print("\nFemales older than 60:")
print(females_60_plus)

Females older than 60:


PassengerId Survived Pclass Name \
275 276 1 1 Andrews, Miss. Kornelia Theodosia
483 484 1 3 Turkula, Mrs. (Hedwig)
829 830 1 1 Stone, Mrs. George Nelson (Martha Evelyn)

Sex PassengerAge SibSp Parch Ticket Fare Embarked LastName \


275 female 63.0 1 0 13502 77.9583 S Andrews
483 female 63.0 0 0 4134 9.5875 S Turkula
829 female 62.0 0 0 113572 80.0000 NaN Stone

FirstName MenIn3rdClass AgeGroup


275 Miss. Kornelia Theodosia 0 Senior
483 Mrs. (Hedwig) 0 Senior
829 Mrs. George Nelson (Martha Evelyn) 0 Senior

c. Select numerical and categorical columns:

numerical = train.select_dtypes(include=[np.number])
categorical = train.select_dtypes(include=['object'])
print("\nNumerical columns:")
print(numerical.head())
print("\nCategorical columns:")
print(categorical.head())

Numerical columns:
PassengerId Survived Pclass PassengerAge SibSp Parch Fare \
0 1 0 3 22.0 1 0 7.2500
1 2 1 1 38.0 1 0 71.2833
2 3 1 3 26.0 0 0 7.9250
3 4 1 1 35.0 1 0 53.1000
4 5 0 3 35.0 0 0 8.0500

MenIn3rdClass
0 1
1 0
2 0
3 0
4 1

Categorical columns:
Name Sex \
0 Braund, Mr. Owen Harris male
1 Cumings, Mrs. John Bradley (Florence Briggs Th... female
2 Heikkinen, Miss. Laina female
3 Futrelle, Mrs. Jacques Heath (Lily May Peel) female
4 Allen, Mr. William Henry male

Ticket Embarked LastName \


0 A/5 21171 S Braund
1 PC 17599 C Cumings
2 STON/O2. 3101282 S Heikkinen
3 113803 S Futrelle
4 373450 S Allen

FirstName AgeGroup
0 Mr. Owen Harris Young Adult
1 Mrs. John Bradley (Florence Briggs Thayer) Young Adult
2 Miss. Laina Young Adult
3 Mrs. Jacques Heath (Lily May Peel) Young Adult
4 Mr. William Henry Young Adult

d. Using .iloc and .loc:

iloc_subset = train.iloc[:100, :3] # First 100 rows, first 3 columns


print("\nSubset using iloc:")
print(iloc_subset)

loc_subset = train.loc[train['PassengerId'] <= 100, ['Name', 'Sex', 'PassengerAge']]


print("\nSubset using loc:")
print(loc_subset)

Subset using iloc:


PassengerId Survived Pclass
0 1 0 3
1 2 1 1
2 3 1 3
3 4 1 1
4 5 0 3
.. ... ... ...
95 96 0 3
96 97 0 1
97 98 1 1
98 99 1 2
99 100 0 2
[100 rows x 3 columns]

Subset using loc:


Name Sex PassengerAge
0 Braund, Mr. Owen Harris male 22.000000
1 Cumings, Mrs. John Bradley (Florence Briggs Th... female 38.000000
2 Heikkinen, Miss. Laina female 26.000000
3 Futrelle, Mrs. Jacques Heath (Lily May Peel) female 35.000000
4 Allen, Mr. William Henry male 35.000000
.. ... ... ...
95 Shorney, Mr. Charles Joseph male 29.699118
96 Goldschmidt, Mr. George B male 71.000000
97 Greenfield, Mr. William Bertram male 23.000000
98 Doling, Mrs. John T (Ada Julia Bone) female 34.000000
99 Kantor, Mr. Sinai male 34.000000

[100 rows x 3 columns]
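
Both subsets above have 100 rows because the index runs from 0 and PassengerId from 1. In general `.iloc` selects by position and excludes the end of a slice, while `.loc` selects by label and includes it. A minimal sketch with a made-up frame whose index labels are not 0, 1, 2, ... (`toy` is hypothetical):

```python
import pandas as pd

# Hypothetical frame with non-default index labels
toy = pd.DataFrame(
    {"Name": ["A", "B", "C", "D"], "Fare": [7.25, 71.28, 7.92, 53.10]},
    index=[10, 20, 30, 40],
)

by_position = toy.iloc[0:2]   # positions 0 and 1 (end excluded)
by_label = toy.loc[10:30]     # labels 10, 20 and 30 (end included)

print(by_position)
print(by_label)
```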

Adding Rows

a. Append a new row:

# 'Age' was renamed to 'PassengerAge' above, so the new row must use that key;
# an 'Age' key matches no column and would leave PassengerAge as NaN.
new_row_2 = {
    'PassengerId': 893,
    'Survived': 1,
    'Pclass': 2,
    'Name': 'Example, Miss Dummy',
    'Sex': 'female',
    'PassengerAge': 28,
    'SibSp': 1,
    'Parch': 0,
    'Ticket': 'SC/Paris 21195',
    'Fare': 26.0,
    'Embarked': 'C'
}

# Append the new row using loc; columns missing from the dict
# (LastName, FirstName, MenIn3rdClass, AgeGroup) are filled with NaN
train.loc[len(train.index)] = new_row_2

print("\nDataset after appending a new row using loc:")
print(train.tail())

Dataset after appending a new row using loc:
     PassengerId  Survived  Pclass                                      Name  \
887          888         1       1              Graham, Miss. Margaret Edith
888          889         0       3  Johnston, Miss. Catherine Helen "Carrie"
889          890         1       1                     Behr, Mr. Karl Howell
890          891         0       3                       Dooley, Mr. Patrick
891          893         1       2                       Example, Miss Dummy

        Sex  PassengerAge  SibSp  Parch          Ticket   Fare  Embarked  \
887  female     19.000000      0      0          112053  30.00         S
888  female     29.699118      1      2      W./C. 6607  23.45         S
889    male     26.000000      0      0          111369  30.00         C
890    male     32.000000      0      0          370376   7.75         Q
891  female     28.000000      1      0  SC/Paris 21195  26.00         C

     LastName                       FirstName  MenIn3rdClass     AgeGroup
887    Graham            Miss. Margaret Edith            0.0  Young Adult
888  Johnston  Miss. Catherine Helen "Carrie"            0.0  Young Adult
889      Behr                 Mr. Karl Howell            0.0  Young Adult
890    Dooley                     Mr. Patrick            1.0  Young Adult
891       NaN                             NaN            NaN          NaN
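
An alternative to assigning through `loc` is to build a one-row DataFrame and concatenate it. A minimal sketch, assuming a made-up two-column frame (`toy` and `new_row` are hypothetical):

```python
import pandas as pd

# Hypothetical toy frame
toy = pd.DataFrame({"PassengerId": [1, 2], "Fare": [7.25, 71.28]})

# Wrap the new record in a list so it becomes a one-row DataFrame
new_row = pd.DataFrame([{"PassengerId": 3, "Fare": 8.05}])

# ignore_index=True renumbers the result 0, 1, 2, ...
toy = pd.concat([toy, new_row], ignore_index=True)
print(toy)
```

`pd.concat` is also the usual choice when several rows are added at once.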

Sorting the DataFrame

a. Sort by PassengerAge in descending order:

sorted_df = train.sort_values(by='PassengerAge', ascending=False)


print("\nData sorted by PassengerAge in descending order:")
print(sorted_df.head())

Data sorted by PassengerAge in descending order:


PassengerId Survived Pclass Name \
630 631 1 1 Barkworth, Mr. Algernon Henry Wilson
851 852 0 3 Svensson, Mr. Johan
96 97 0 1 Goldschmidt, Mr. George B
493 494 0 1 Artagaveytia, Mr. Ramon
116 117 0 3 Connors, Mr. Patrick

Sex PassengerAge SibSp Parch Ticket Fare Embarked \


630 male 80.0 0 0 27042 30.0000 S
851 male 74.0 0 0 347060 7.7750 S
96 male 71.0 0 0 PC 17754 34.6542 C
493 male 71.0 0 0 PC 17609 49.5042 C
116 male 70.5 0 0 370369 7.7500 Q

LastName FirstName MenIn3rdClass AgeGroup


630 Barkworth Mr. Algernon Henry Wilson 0 Senior
851 Svensson Mr. Johan 1 Senior
96 Goldschmidt Mr. George B 0 Senior
493 Artagaveytia Mr. Ramon 0 Senior
116 Connors Mr. Patrick 1 Senior
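
Sorting also accepts several columns, each with its own direction. A minimal sketch on made-up rows (`toy` is hypothetical), sorting by class ascending and then by age descending, with missing ages placed last:

```python
import numpy as np
import pandas as pd

# Hypothetical toy rows, one with a missing age
toy = pd.DataFrame({
    "Pclass": [3, 1, 3, 1],
    "PassengerAge": [22.0, 38.0, np.nan, 54.0],
})

sorted_toy = toy.sort_values(
    by=["Pclass", "PassengerAge"],
    ascending=[True, False],   # one flag per column in `by`
    na_position="last",        # keep NaN ages at the end of each class
)
print(sorted_toy)
```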

GroupBy Operations

a. Group by Pclass and calculate average age:

grouped = train.groupby('Pclass')
print("\nAverage PassengerAge per Pclass:")
print(grouped['PassengerAge'].mean())

Average PassengerAge per Pclass:


Pclass
1 37.048118
2 29.866958
3 26.403259
Name: PassengerAge, dtype: float64
b. Perform multiple aggregations:

agg_results = grouped.agg({'PassengerAge': ['min', 'max', 'mean'], 'Survived': 'sum'})


print("\nAggregated results (min, max, mean of age and total survivors):")
print(agg_results)

Aggregated results (min, max, mean of age and total survivors):


        PassengerAge                   Survived
                 min   max       mean      sum
Pclass
1               0.92  80.0  37.048118      136
2               0.67  70.0  29.866958       87
3               0.42  74.0  26.403259      119
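
The dictionary form produces two-level column headers. Named aggregation is one way to get flat, self-describing column names instead. A minimal sketch on made-up rows (`toy` and the output names such as `min_age` are illustrative choices):

```python
import pandas as pd

# Hypothetical toy rows
toy = pd.DataFrame({
    "Pclass": [1, 1, 3, 3],
    "PassengerAge": [40.0, 20.0, 30.0, 10.0],
    "Survived": [1, 0, 0, 1],
})

# Each keyword is an output column: name=(source column, aggregation)
summary = toy.groupby("Pclass").agg(
    min_age=("PassengerAge", "min"),
    max_age=("PassengerAge", "max"),
    mean_age=("PassengerAge", "mean"),
    survivors=("Survived", "sum"),
)
print(summary)
```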

Joining DataFrames

a. Perform inner, left, and outer joins:

marks_df = pd.DataFrame({'Sno': [1, 2, 3], 'Marks': [90, 80, 70]})


age_df = pd.DataFrame({'Sno': [2, 3, 4], 'Age': [25, 35, 45]})

# Inner Join
inner = marks_df.merge(age_df, on='Sno', how='inner')
print("\nInner Join:")
print(inner)

# Left Join
left = marks_df.merge(age_df, on='Sno', how='left')
print("\nLeft Join:")
print(left)

# Outer Join
outer = marks_df.merge(age_df, on='Sno', how='outer')
print("\nOuter Join:")
print(outer)

Inner Join:
Sno Marks Age
0 2 80 25
1 3 70 35

Left Join:
Sno Marks Age
0 1 90 NaN
1 2 80 25.0
2 3 70 35.0

Outer Join:
Sno Marks Age
0 1 90.0 NaN
1 2 80.0 25.0
2 3 70.0 35.0
3 4 NaN 45.0
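
The remaining join type, `how='right'`, keeps every row of the second frame. With `indicator=True`, `merge` also adds a `_merge` column showing where each row came from. A minimal sketch reusing the two frames defined above:

```python
import pandas as pd

marks_df = pd.DataFrame({"Sno": [1, 2, 3], "Marks": [90, 80, 70]})
age_df = pd.DataFrame({"Sno": [2, 3, 4], "Age": [25, 35, 45]})

# Right join: all rows of age_df, Marks is NaN where marks_df has no match
right = marks_df.merge(age_df, on="Sno", how="right")
print(right)

# Outer join with a _merge column: left_only, right_only or both
outer = marks_df.merge(age_df, on="Sno", how="outer", indicator=True)
print(outer)
```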
