2/21/24, 11:35 PM PRAC3_23BME053
23BME053 MAZIN VORA P3 PRACTICAL - 3
In [ ]: USING PANDAS FOR READING DATA FROM CSV FILE
In [3]: import pandas as pd
df = pd.read_csv(r'C:\Users\mazin\Downloads\train.csv')
print(df.head(7))
   PassengerId  Survived  Pclass  \
0            1         0       3
1            2         1       1
2            3         1       3
3            4         1       1
4            5         0       3
5            6         0       3
6            7         0       1

                                                Name     Sex   Age  SibSp  \
0                            Braund, Mr. Owen Harris    male  22.0      1
1  Cumings, Mrs. John Bradley (Florence Briggs Th...  female  38.0      1
2                             Heikkinen, Miss. Laina  female  26.0      0
3       Futrelle, Mrs. Jacques Heath (Lily May Peel)  female  35.0      1
4                           Allen, Mr. William Henry    male  35.0      0
5                                   Moran, Mr. James    male   NaN      0
6                            McCarthy, Mr. Timothy J    male  54.0      0

   Parch            Ticket     Fare Cabin Embarked
0      0         A/5 21171   7.2500   NaN        S
1      0          PC 17599  71.2833   C85        C
2      0  STON/O2. 3101282   7.9250   NaN        S
3      0            113803  53.1000  C123        S
4      0            373450   8.0500   NaN        S
5      0            330877   8.4583   NaN        Q
6      0             17463  51.8625   E46        S
C:\Users\mazin\AppData\Local\Temp\ipykernel_22848\1674290458.py:1: DeprecationWarning:
Pyarrow will become a required dependency of pandas in the next major release of pandas (pandas 3.0),
(to allow more performant data types, such as the Arrow string type, and better interoperability with other libraries)
but was not found to be installed on your system.
If this would cause problems for you,
please provide us feedback at https://github.com/pandas-dev/pandas/issues/54466
  import pandas as pd
In [4]: df.info()
localhost:8888/doc/tree/Documents/PRAC3_23BME053.ipynb 1/5
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 891 entries, 0 to 890
Data columns (total 12 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 PassengerId 891 non-null int64
1 Survived 891 non-null int64
2 Pclass 891 non-null int64
3 Name 891 non-null object
4 Sex 891 non-null object
5 Age 714 non-null float64
6 SibSp 891 non-null int64
7 Parch 891 non-null int64
8 Ticket 891 non-null object
9 Fare 891 non-null float64
10 Cabin 204 non-null object
11 Embarked 889 non-null object
dtypes: float64(2), int64(5), object(5)
memory usage: 83.7+ KB
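The Non-Null Count column above implies 177 missing values in Age (891 - 714), 687 in Cabin and 2 in Embarked. A minimal sketch of getting those counts directly, using a small made-up frame (`toy` and its values are illustrative, not taken from train.csv):

```python
import numpy as np
import pandas as pd

# Toy stand-in for columns that contain gaps (values are made up).
toy = pd.DataFrame({
    "Age": [22.0, np.nan, 26.0, np.nan],
    "Cabin": [None, "C85", None, None],
    "Fare": [7.25, 71.28, 7.92, 8.05],
})

missing_per_column = toy.isna().sum()    # number of nulls in each column
missing_share = toy.isna().mean()        # fraction of nulls in each column

print(missing_per_column)
print(missing_share)
```

On the real data the same two calls would be applied to `df` instead of `toy`.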
In [5]: print(df.shape)
(891, 12)
In [6]: print(df.describe())
       PassengerId    Survived      Pclass         Age       SibSp  \
count   891.000000  891.000000  891.000000  714.000000  891.000000
mean    446.000000    0.383838    2.308642   29.699118    0.523008
std     257.353842    0.486592    0.836071   14.526497    1.102743
min       1.000000    0.000000    1.000000    0.420000    0.000000
25%     223.500000    0.000000    2.000000   20.125000    0.000000
50%     446.000000    0.000000    3.000000   28.000000    0.000000
75%     668.500000    1.000000    3.000000   38.000000    1.000000
max     891.000000    1.000000    3.000000   80.000000    8.000000

            Parch        Fare
count  891.000000  891.000000
mean     0.381594   32.204208
std      0.806057   49.693429
min      0.000000    0.000000
25%      0.000000    7.910400
50%      0.000000   14.454200
75%      0.000000   31.000000
max      6.000000  512.329200
In [ ]: MAKING COPIES OF INITIAL DATAFRAME FOR APPLYING DIFFERENT TYPES OF METHOD FOR FIL
In [7]: df1 = df.copy()
df2 = df.copy()
df3 = df.copy()
df4 = df.copy()
df5 = df.copy()
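The copies matter because plain assignment only creates a second name for the same DataFrame, so edits made through one name would show up under the other. A small sketch of the difference, with made-up data and illustrative names (`original`, `alias`, `independent`):

```python
import pandas as pd

original = pd.DataFrame({"Age": [22.0, None, 26.0]})

alias = original                 # same object, nothing is copied
independent = original.copy()    # separate data (deep copy by default)

independent.loc[1, "Age"] = 21.0  # fills the gap in the copy only

print(alias is original)                 # True
print(original["Age"].isna().sum())      # 1, the original still has its gap
print(independent["Age"].isna().sum())   # 0
```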
In [ ]: REMOVING ROWS HAVING NULL VALUES
In [8]: df1 = df1.dropna(axis=0)
print(df1.shape)
print(df.shape)
(183, 12)
(891, 12)
In [ ]: REMOVING COLUMNS HAVING NULL VALUES
In [9]: df2 = df2.dropna(axis=1)
print(df2.shape)
print(df.shape)
(891, 9)
(891, 12)
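The two cells above differ only in `axis`: `axis=0` drops every row that has a null in any column (891 rows down to 183), while `axis=1` drops every column that has a null in any row (12 columns down to 9). A sketch on a made-up frame, which also shows the `subset` parameter of `dropna` as a less drastic option when only one column matters:

```python
import numpy as np
import pandas as pd

# Made-up data: Age and Cabin each have gaps, PassengerId has none.
toy = pd.DataFrame({
    "PassengerId": [1, 2, 3, 4],
    "Age": [22.0, np.nan, 26.0, 35.0],
    "Cabin": [np.nan, "C85", np.nan, "C123"],
})

rows_dropped = toy.dropna(axis=0)        # keep rows with no nulls at all
cols_dropped = toy.dropna(axis=1)        # keep columns with no nulls at all
age_only = toy.dropna(subset=["Age"])    # drop a row only if Age is null

print(rows_dropped.shape)   # (1, 3)
print(cols_dropped.shape)   # (4, 1)
print(age_only.shape)       # (3, 3)
```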
In [10]: df3.loc[1,"Age"]
Out[10]: 38.0
In [ ]: INSERTING A CONSTANT IN
In [11]: df3.loc[df3.loc[:,"Age"].isna(),"Age"] = 21
df3.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 891 entries, 0 to 890
Data columns (total 12 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 PassengerId 891 non-null int64
1 Survived 891 non-null int64
2 Pclass 891 non-null int64
3 Name 891 non-null object
4 Sex 891 non-null object
5 Age 891 non-null float64
6 SibSp 891 non-null int64
7 Parch 891 non-null int64
8 Ticket 891 non-null object
9 Fare 891 non-null float64
10 Cabin 204 non-null object
11 Embarked 889 non-null object
dtypes: float64(2), int64(5), object(5)
memory usage: 83.7+ KB
In [12]: df3.loc[df3["Embarked"].isna(),"Embarked"]='S'
df3.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 891 entries, 0 to 890
Data columns (total 12 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 PassengerId 891 non-null int64
1 Survived 891 non-null int64
2 Pclass 891 non-null int64
3 Name 891 non-null object
4 Sex 891 non-null object
5 Age 891 non-null float64
6 SibSp 891 non-null int64
7 Parch 891 non-null int64
8 Ticket 891 non-null object
9 Fare 891 non-null float64
10 Cabin 204 non-null object
11 Embarked 891 non-null object
dtypes: float64(2), int64(5), object(5)
memory usage: 83.7+ KB
In pandas, .isna() is a method that checks each element of a DataFrame or Series for
missing (null) values. It returns a boolean DataFrame or Series of the same shape, in which
True marks an element that is null or missing and False marks one that is present.
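A short sketch of how such a boolean mask is used, both to select the known values and to assign into the missing positions with `.loc`. The Series below is made up; the constant 21 is the one used for Age earlier in this practical.

```python
import numpy as np
import pandas as pd

ages = pd.Series([22.0, np.nan, 35.0, np.nan], name="Age")

mask = ages.isna()        # True where the value is missing
known = ages[~mask]       # ~ inverts the mask: only the present values

filled = ages.copy()
filled.loc[mask] = 21     # assign the constant at the missing positions

print(mask.tolist())      # [False, True, False, True]
print(known.tolist())     # [22.0, 35.0]
print(filled.tolist())    # [22.0, 21.0, 35.0, 21.0]
```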
In [13]: df3.loc[df3["Cabin"].isna(),"Cabin"]='C85'
df3.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 891 entries, 0 to 890
Data columns (total 12 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 PassengerId 891 non-null int64
1 Survived 891 non-null int64
2 Pclass 891 non-null int64
3 Name 891 non-null object
4 Sex 891 non-null object
5 Age 891 non-null float64
6 SibSp 891 non-null int64
7 Parch 891 non-null int64
8 Ticket 891 non-null object
9 Fare 891 non-null float64
10 Cabin 891 non-null object
11 Embarked 891 non-null object
dtypes: float64(2), int64(5), object(5)
memory usage: 83.7+ KB
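The three constant fills above (21 for Age, 'S' for Embarked, 'C85' for Cabin) can also be written as a single `fillna` call that takes a dictionary mapping column names to fill values. This is an alternative sketch on made-up data, not the method used in the cells above:

```python
import numpy as np
import pandas as pd

toy = pd.DataFrame({
    "Age": [22.0, np.nan, 26.0],
    "Embarked": ["S", None, "C"],
    "Cabin": [None, "C85", None],
})

# One constant per column, same values as in the cells above.
constants = {"Age": 21, "Embarked": "S", "Cabin": "C85"}
filled = toy.fillna(value=constants)   # returns a new frame, toy is unchanged

print(filled)
print(filled.isna().sum().sum())       # 0
```

Filling Cabin with one fixed cabin number makes the column complete but not meaningful, so the result should be read as a demonstration of the mechanics only.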
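In [15]: import numpy as np
import statistics as stats
# Mean of the known ages in the original frame, rounded to a whole number
meanage=np.mean(df.loc[~df.loc[:,"Age"].isna(),"Age"].values)
meanage=np.round(meanage)
# Fill the missing ages in df4 with the rounded mean
df4.loc[df4.loc[:,"Age"].isna(),"Age"]=meanage
# Most frequent value among the known Embarked entries
modeembarked=stats.mode(df.loc[~df.loc[:,"Embarked"].isna(),"Embarked"])
# Fill the missing Embarked entries in df4 with that mode
df4.loc[df4.loc[:,"Embarked"].isna(),"Embarked"]=modeembarked
df4.info()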
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 891 entries, 0 to 890
Data columns (total 12 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 PassengerId 891 non-null int64
1 Survived 891 non-null int64
2 Pclass 891 non-null int64
3 Name 891 non-null object
4 Sex 891 non-null object
5 Age 891 non-null float64
6 SibSp 891 non-null int64
7 Parch 891 non-null int64
8 Ticket 891 non-null object
9 Fare 891 non-null float64
10 Cabin 204 non-null object
11 Embarked 891 non-null object
dtypes: float64(2), int64(5), object(5)
memory usage: 83.7+ KB
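The mean and mode fill above can be sketched with pandas alone, since `Series.mean()` skips NaN by default and `Series.mode()` ignores NaN and returns the most frequent values as a Series. The frame and the names `mean_age` and `mode_port` are made up for illustration:

```python
import numpy as np
import pandas as pd

toy = pd.DataFrame({
    "Age": [20.0, np.nan, 30.0, 41.0],
    "Embarked": ["S", "C", None, "S"],
})

mean_age = round(toy["Age"].mean())      # mean of 20, 30, 41 rounded to 30
mode_port = toy["Embarked"].mode()[0]    # first (here the only) mode: "S"

filled = toy.copy()
filled["Age"] = filled["Age"].fillna(mean_age)
filled["Embarked"] = filled["Embarked"].fillna(mode_port)

print(filled)
```

`mode()` can return more than one value when there is a tie; taking `[0]` picks the first of them.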
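In [21]: meanage0 = df.loc[df["Survived"] == 0, "Age"].mean()  # mean age of non-survivors (NaN skipped)
meanage1 = df.loc[df["Survived"] == 1, "Age"].mean()  # mean age of survivors (NaN skipped)
# Fill each missing age with the mean age of that passenger's survival group
df5.loc[df5["Age"].isna() & (df5["Survived"] == 0), "Age"] = meanage0
df5.loc[df5["Age"].isna() & (df5["Survived"] == 1), "Age"] = meanage1
df5.info()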
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 891 entries, 0 to 890
Data columns (total 12 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 PassengerId 891 non-null int64
1 Survived 891 non-null int64
2 Pclass 891 non-null int64
3 Name 891 non-null object
4 Sex 891 non-null object
5 Age 891 non-null float64
6 SibSp 891 non-null int64
7 Parch 891 non-null int64
8 Ticket 891 non-null object
9 Fare 891 non-null float64
10 Cabin 204 non-null object
11 Embarked 889 non-null object
dtypes: float64(2), int64(5), object(5)
memory usage: 83.7+ KB
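In df5 only Age was filled, which is why Embarked still shows 889 non-null values. The group-wise fill itself generalises to any number of groups with `groupby(...).transform("mean")`, which returns the group mean aligned to every row. A sketch under that assumption, on made-up data:

```python
import numpy as np
import pandas as pd

toy = pd.DataFrame({
    "Survived": [0, 0, 0, 1, 1, 1],
    "Age": [20.0, 40.0, np.nan, 10.0, 30.0, np.nan],
})

# Mean age of each row's own Survived group (NaN is skipped in the mean).
group_mean = toy.groupby("Survived")["Age"].transform("mean")

filled = toy.copy()
filled["Age"] = filled["Age"].fillna(group_mean)   # fills only the gaps

print(filled)
```

Here the missing age in group 0 becomes 30.0 and the one in group 1 becomes 20.0, without writing one line per group.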