AM19 EDA Assignment1

Assignment on EDA
am19-eda-assignment1

November 28, 2024

Name: Swapnil Chaudhari


PRN: 2122000238
Roll No.: AM19
Assignment No. 1
[1]: import pandas as pd

[3]: df=pd.read_excel('Titanic-Dataset.xlsx')
df

[3]: PassengerId Survived Pclass \


0 1 0 3
1 2 1 1
2 3 1 3
3 4 1 1
4 5 0 3
.. … … …
886 887 0 2
887 888 1 1
888 889 0 3
889 890 1 1
890 891 0 3

Name Sex Age SibSp \


0 Braund, Mr. Owen Harris male 22.0 1
1 Cumings, Mrs. John Bradley (Florence Briggs Th… female 38.0 1
2 Heikkinen, Miss. Laina female 26.0 0
3 Futrelle, Mrs. Jacques Heath (Lily May Peel) female 35.0 1
4 Allen, Mr. William Henry male 35.0 0
.. … … … …
886 Montvila, Rev. Juozas male 27.0 0
887 Graham, Miss. Margaret Edith female 19.0 0
888 Johnston, Miss. Catherine Helen "Carrie" female NaN 1
889 Behr, Mr. Karl Howell male 26.0 0
890 Dooley, Mr. Patrick male 32.0 0

Parch Ticket Fare Cabin Embarked
0 0 A/5 21171 7.2500 NaN S
1 0 PC 17599 71.2833 C85 C
2 0 STON/O2. 3101282 7.9250 NaN S
3 0 113803 53.1000 C123 S
4 0 373450 8.0500 NaN S
.. … … … … …
886 0 211536 13.0000 NaN S
887 0 112053 30.0000 B42 S
888 2 W./C. 6607 23.4500 NaN S
889 0 111369 30.0000 C148 C
890 0 370376 7.7500 NaN Q

[891 rows x 12 columns]

[5]: #1.Create a dataframe


data = {"Name":["Ram","Shyam","Gita","Sita","Druv","Om","Radhika"],
"Age":[20,22,19,18,20,21,20],
"Gender":["Male","Male","Female","Female","Male","Male","Female"],
"Salary":[12000,23000,20000,19000,10000,25000,40000]}

[6]: df_d = pd.DataFrame(data)

[7]: df_d

[7]: Name Age Gender Salary


0 Ram 20 Male 12000
1 Shyam 22 Male 23000
2 Gita 19 Female 20000
3 Sita 18 Female 19000
4 Druv 20 Male 10000
5 Om 21 Male 25000
6 Radhika 20 Female 40000

[8]: #2.Find shape of the data


df.shape

[8]: (891, 12)

[9]: #3.Find size of the data


df.size

[9]: 10692

[10]: #4.Find dimensions of the data


df.ndim

[10]: 2
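The three attributes above are related: size is the number of rows times the number of columns (891 x 12 = 10692 for the Titanic frame), and ndim is 2 for any DataFrame. A minimal sketch on a small hypothetical frame (`toy` is an illustrative name, not part of the assignment data):

```python
import pandas as pd

# Hypothetical toy frame, used only to illustrate the attributes.
toy = pd.DataFrame({
    "Name": ["Ram", "Shyam", "Gita"],
    "Age": [20, 22, 19],
    "Salary": [12000, 23000, 20000],
})

n_rows, n_cols = toy.shape   # shape is a (rows, columns) tuple
print(toy.shape)             # (3, 3)
print(toy.size)              # 9, i.e. n_rows * n_cols
print(toy.ndim)              # 2, a DataFrame always has two axes
```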

[11]: #5.List all columns in df


df.columns

[11]: Index(['PassengerId', 'Survived', 'Pclass', 'Name', 'Sex', 'Age', 'SibSp',


'Parch', 'Ticket', 'Fare', 'Cabin', 'Embarked'],
dtype='object')

[12]: #6.Find datatypes of each column


df.dtypes

[12]: PassengerId int64


Survived int64
Pclass int64
Name object
Sex object
Age float64
SibSp int64
Parch int64
Ticket object
Fare float64
Cabin object
Embarked object
dtype: object

[13]: #7.Find axes of the df


df.axes

[13]: [RangeIndex(start=0, stop=891, step=1),


Index(['PassengerId', 'Survived', 'Pclass', 'Name', 'Sex', 'Age', 'SibSp',
'Parch', 'Ticket', 'Fare', 'Cabin', 'Embarked'],
dtype='object')]

[14]: #8.Find index of the df


df.index

[14]: RangeIndex(start=0, stop=891, step=1)

[15]: #9.Find all values of df


df.values

[15]: array([[1, 0, 3, …, 7.25, nan, 'S'],


[2, 1, 1, …, 71.2833, 'C85', 'C'],
[3, 1, 3, …, 7.925, nan, 'S'],
…,
[889, 0, 3, …, 23.45, nan, 'S'],

[890, 1, 1, …, 30.0, 'C148', 'C'],
[891, 0, 3, …, 7.75, nan, 'Q']], dtype=object)

[16]: #10.Check whether the df is empty


df.empty

[16]: False

[17]: #11.Transpose the df


# df.T

[18]: #12.Find detailed info of the df


df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 891 entries, 0 to 890
Data columns (total 12 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 PassengerId 891 non-null int64
1 Survived 891 non-null int64
2 Pclass 891 non-null int64
3 Name 891 non-null object
4 Sex 891 non-null object
5 Age 714 non-null float64
6 SibSp 891 non-null int64
7 Parch 891 non-null int64
8 Ticket 891 non-null object
9 Fare 891 non-null float64
10 Cabin 204 non-null object
11 Embarked 889 non-null object
dtypes: float64(2), int64(5), object(5)
memory usage: 83.7+ KB
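The non-null counts in the info() output imply missing values: Age has 891 - 714 = 177, Cabin has 891 - 204 = 687, and Embarked has 891 - 889 = 2. One way to get these counts directly is isna().sum(). The sketch below uses a small hypothetical frame (`toy`) rather than the Titanic file, so the numbers it prints are for the toy data only:

```python
import numpy as np
import pandas as pd

# Hypothetical frame with a few missing entries, for illustration only.
toy = pd.DataFrame({
    "Age": [22.0, np.nan, 26.0, np.nan],
    "Cabin": [None, "C85", None, None],
    "Fare": [7.25, 71.2833, 7.925, 8.05],
})

missing = toy.isna().sum()    # missing values per column
non_null = toy.notna().sum()  # same numbers info() reports as "non-null"

print(missing)
print(non_null)
```

On the Titanic frame the same call would be df.isna().sum().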

[19]: #13.Display top n records from the df (default n=5)


df.head()

[19]: PassengerId Survived Pclass \


0 1 0 3
1 2 1 1
2 3 1 3
3 4 1 1
4 5 0 3

Name Sex Age SibSp \


0 Braund, Mr. Owen Harris male 22.0 1
1 Cumings, Mrs. John Bradley (Florence Briggs Th… female 38.0 1

2 Heikkinen, Miss. Laina female 26.0 0
3 Futrelle, Mrs. Jacques Heath (Lily May Peel) female 35.0 1
4 Allen, Mr. William Henry male 35.0 0

Parch Ticket Fare Cabin Embarked


0 0 A/5 21171 7.2500 NaN S
1 0 PC 17599 71.2833 C85 C
2 0 STON/O2. 3101282 7.9250 NaN S
3 0 113803 53.1000 C123 S
4 0 373450 8.0500 NaN S

[20]: #14.Display bottom n records from the df (default n=5)


df.tail()

[20]: PassengerId Survived Pclass Name \


886 887 0 2 Montvila, Rev. Juozas
887 888 1 1 Graham, Miss. Margaret Edith
888 889 0 3 Johnston, Miss. Catherine Helen "Carrie"
889 890 1 1 Behr, Mr. Karl Howell
890 891 0 3 Dooley, Mr. Patrick

Sex Age SibSp Parch Ticket Fare Cabin Embarked


886 male 27.0 0 0 211536 13.00 NaN S
887 female 19.0 0 0 112053 30.00 B42 S
888 female NaN 1 2 W./C. 6607 23.45 NaN S
889 male 26.0 0 0 111369 30.00 C148 C
890 male 32.0 0 0 370376 7.75 NaN Q

[21]: #15.Display descriptive statistics for the numerical columns from the df
df.describe()

[21]: PassengerId Survived Pclass Age SibSp \


count 891.000000 891.000000 891.000000 714.000000 891.000000
mean 446.000000 0.383838 2.308642 29.699118 0.523008
std 257.353842 0.486592 0.836071 14.526497 1.102743
min 1.000000 0.000000 1.000000 0.420000 0.000000
25% 223.500000 0.000000 2.000000 20.125000 0.000000
50% 446.000000 0.000000 3.000000 28.000000 0.000000
75% 668.500000 1.000000 3.000000 38.000000 1.000000
max 891.000000 1.000000 3.000000 80.000000 8.000000

Parch Fare
count 891.000000 891.000000
mean 0.381594 32.204208
std 0.806057 49.693429
min 0.000000 0.000000
25% 0.000000 7.910400

50% 0.000000 14.454200
75% 0.000000 31.000000
max 6.000000 512.329200

[22]: #16. Find mean, median, mode, std values for numerical columns in df
df['Age'].mean()

[22]: 29.69911764705882

[23]: df['Age'].median()

[23]: 28.0

[24]: df['Age'].mode()

[24]: 0 24.0
Name: Age, dtype: float64

[25]: df['Age'].min()

[25]: 0.42

[26]: df['Age'].max()

[26]: 80.0

[27]: df['Age'].std()

[27]: 14.526497332334042
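The separate calls above can also be collected in one step with agg. Note that mode() returns a Series rather than a scalar (as in cell [24]) because a column can have more than one mode. A minimal sketch, using the Age values from the small df_d frame created earlier (`ages` and `tied` are illustrative names):

```python
import pandas as pd

# Age values copied from the df_d frame defined earlier in the notebook.
ages = pd.Series([20, 22, 19, 18, 20, 21, 20], name="Age")

# Several statistics in a single call; the result is a Series indexed by name.
summary = ages.agg(["mean", "median", "min", "max", "std"])
print(summary)

# mode() returns every most-frequent value, so the result is a Series.
modes = ages.mode()
print(modes.tolist())        # [20]

# Hypothetical tied example: two values occur equally often.
tied = pd.Series([1, 1, 2, 2])
print(tied.mode().tolist())  # [1, 2]
```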

[28]: #17.Return a random sample from df


df.sample()

[28]: PassengerId Survived Pclass Name Sex Age SibSp \


470 471 0 3 Keefe, Mr. Arthur male NaN 0

Parch Ticket Fare Cabin Embarked


470 0 323592 7.25 NaN S

[29]: #18.Find unique values for the categorical columns


df['Sex'].unique()

[29]: array(['male', 'female'], dtype=object)

[30]: #19.Find number of unique values for the categorical columns


df['Sex'].nunique()

[30]: 2

[31]: #20.Locate first row in the df using loc


df.loc[0]

[31]: PassengerId 1
Survived 0
Pclass 3
Name Braund, Mr. Owen Harris
Sex male
Age 22.0
SibSp 1
Parch 0
Ticket A/5 21171
Fare 7.25
Cabin NaN
Embarked S
Name: 0, dtype: object

[32]: #21.Locate nth row in the df using loc


df.loc[4]

[32]: PassengerId 5
Survived 0
Pclass 3
Name Allen, Mr. William Henry
Sex male
Age 35.0
SibSp 0
Parch 0
Ticket 373450
Fare 8.05
Cabin NaN
Embarked S
Name: 4, dtype: object

[33]: #22.Locate last row in the df using loc


df.loc[df.index[-1]]

[33]: PassengerId 891


Survived 0
Pclass 3
Name Dooley, Mr. Patrick
Sex male
Age 32.0
SibSp 0
Parch 0

Ticket 370376
Fare 7.75
Cabin NaN
Embarked Q
Name: 890, dtype: object

[34]: #23.Locate all values in a range in df using loc


df.loc[1:5]

[34]: PassengerId Survived Pclass \


1 2 1 1
2 3 1 3
3 4 1 1
4 5 0 3
5 6 0 3

Name Sex Age SibSp \


1 Cumings, Mrs. John Bradley (Florence Briggs Th… female 38.0 1
2 Heikkinen, Miss. Laina female 26.0 0
3 Futrelle, Mrs. Jacques Heath (Lily May Peel) female 35.0 1
4 Allen, Mr. William Henry male 35.0 0
5 Moran, Mr. James male NaN 0

Parch Ticket Fare Cabin Embarked


1 0 PC 17599 71.2833 C85 C
2 0 STON/O2. 3101282 7.9250 NaN S
3 0 113803 53.1000 C123 S
4 0 373450 8.0500 NaN S
5 0 330877 8.4583 NaN Q

[35]: #24.Locate all the rows in df with specific criteria using loc
df.loc[df['Age']>20]

[35]: PassengerId Survived Pclass \


0 1 0 3
1 2 1 1
2 3 1 3
3 4 1 1
4 5 0 3
.. … … …
884 885 0 3
885 886 0 3
886 887 0 2
889 890 1 1
890 891 0 3

Name Sex Age SibSp \

0 Braund, Mr. Owen Harris male 22.0 1
1 Cumings, Mrs. John Bradley (Florence Briggs Th… female 38.0 1
2 Heikkinen, Miss. Laina female 26.0 0
3 Futrelle, Mrs. Jacques Heath (Lily May Peel) female 35.0 1
4 Allen, Mr. William Henry male 35.0 0
.. … … … …
884 Sutehall, Mr. Henry Jr male 25.0 0
885 Rice, Mrs. William (Margaret Norton) female 39.0 0
886 Montvila, Rev. Juozas male 27.0 0
889 Behr, Mr. Karl Howell male 26.0 0
890 Dooley, Mr. Patrick male 32.0 0

Parch Ticket Fare Cabin Embarked


0 0 A/5 21171 7.2500 NaN S
1 0 PC 17599 71.2833 C85 C
2 0 STON/O2. 3101282 7.9250 NaN S
3 0 113803 53.1000 C123 S
4 0 373450 8.0500 NaN S
.. … … … … …
884 0 SOTON/OQ 392076 7.0500 NaN S
885 5 382652 29.1250 NaN Q
886 0 211536 13.0000 NaN S
889 0 111369 30.0000 C148 C
890 0 370376 7.7500 NaN Q

[535 rows x 12 columns]

[36]: #25.Locate all the rows in df with specific criteria with logical AND operator using loc

df.loc[(df['Age']>15)&(df['Age']<20)]

[36]: PassengerId Survived Pclass \


27 28 0 1
38 39 0 3
44 45 1 3
49 50 0 3
67 68 0 3
.. … … …
844 845 0 3
853 854 1 1
855 856 1 3
877 878 0 3
887 888 1 1

Name Sex Age SibSp \


27 Fortune, Mr. Charles Alexander male 19.0 3
38 Vander Planke, Miss. Augusta Maria female 18.0 2

44 Devaney, Miss. Margaret Delia female 19.0 0
49 Arnold-Franchi, Mrs. Josef (Josefine Franchi) female 18.0 1
67 Crease, Mr. Ernest James male 19.0 0
.. … … … …
844 Culumovic, Mr. Jeso male 17.0 0
853 Lines, Miss. Mary Conover female 16.0 0
855 Aks, Mrs. Sam (Leah Rosen) female 18.0 0
877 Petroff, Mr. Nedelio male 19.0 0
887 Graham, Miss. Margaret Edith female 19.0 0

Parch Ticket Fare Cabin Embarked


27 2 19950 263.0000 C23 C25 C27 S
38 0 345764 18.0000 NaN S
44 0 330958 7.8792 NaN Q
49 0 349237 17.8000 NaN S
67 0 S.P. 3464 8.1583 NaN S
.. … … … … …
844 0 315090 8.6625 NaN S
853 1 PC 17592 39.4000 D28 S
855 1 392091 9.3500 NaN S
877 0 349212 7.8958 NaN S
887 0 112053 30.0000 B42 S

[81 rows x 12 columns]
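Two details of the filter above are easy to miss: each condition needs its own parentheses because & binds more tightly than > and <, and rows where Age is NaN never satisfy a comparison, so they are dropped silently. A minimal sketch on a hypothetical frame (`toy`, `mask`, and `teens` are illustrative names):

```python
import numpy as np
import pandas as pd

# Hypothetical ages, including one missing value.
toy = pd.DataFrame({"Age": [14.0, 16.0, 19.0, 20.0, np.nan, 35.0]})

# Element-wise AND of two boolean Series; parentheses are required.
mask = (toy["Age"] > 15) & (toy["Age"] < 20)
teens = toy.loc[mask]

print(mask.tolist())          # [False, True, True, False, False, False]
print(teens["Age"].tolist())  # [16.0, 19.0]

# The NaN row is excluded by the mask and by its complement on Age.
outside = toy.loc[(toy["Age"] <= 15) | (toy["Age"] >= 20)]
print(len(teens) + len(outside))  # 5, one row fewer than len(toy)
```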

[37]: #26.Locate first row in the df using iloc


df.iloc[0]

[37]: PassengerId 1
Survived 0
Pclass 3
Name Braund, Mr. Owen Harris
Sex male
Age 22.0
SibSp 1
Parch 0
Ticket A/5 21171
Fare 7.25
Cabin NaN
Embarked S
Name: 0, dtype: object

[38]: #27.Locate nth row in the df using iloc


df.iloc[4]

[38]: PassengerId 5
Survived 0

Pclass 3
Name Allen, Mr. William Henry
Sex male
Age 35.0
SibSp 0
Parch 0
Ticket 373450
Fare 8.05
Cabin NaN
Embarked S
Name: 4, dtype: object

[39]: #28.Locate last row in the df using iloc


df.iloc[-1]

[39]: PassengerId 891


Survived 0
Pclass 3
Name Dooley, Mr. Patrick
Sex male
Age 32.0
SibSp 0
Parch 0
Ticket 370376
Fare 7.75
Cabin NaN
Embarked Q
Name: 890, dtype: object

[40]: #29.Locate first, nth, last col in df using iloc (here: the last col, index -1)


df.iloc[:,-1]

[40]: 0 S
1 C
2 S
3 S
4 S
..
886 S
887 S
888 S
889 C
890 Q
Name: Embarked, Length: 891, dtype: object

[41]: #30 locate all the rows in df within specific range of row index
df.iloc[0:5]

[41]: PassengerId Survived Pclass \
0 1 0 3
1 2 1 1
2 3 1 3
3 4 1 1
4 5 0 3

Name Sex Age SibSp \


0 Braund, Mr. Owen Harris male 22.0 1
1 Cumings, Mrs. John Bradley (Florence Briggs Th… female 38.0 1
2 Heikkinen, Miss. Laina female 26.0 0
3 Futrelle, Mrs. Jacques Heath (Lily May Peel) female 35.0 1
4 Allen, Mr. William Henry male 35.0 0

Parch Ticket Fare Cabin Embarked


0 0 A/5 21171 7.2500 NaN S
1 0 PC 17599 71.2833 C85 C
2 0 STON/O2. 3101282 7.9250 NaN S
3 0 113803 53.1000 C123 S
4 0 373450 8.0500 NaN S
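Comparing this with cell [34]: df.loc[1:5] returned five rows (labels 1 through 5, end label included), whereas df.iloc[0:5] returns positions 0 through 4 (end position excluded). The two only look alike because the Titanic frame has a default 0-based index. A minimal sketch of the difference on a hypothetical frame (`toy` and `shifted` are illustrative names):

```python
import pandas as pd

# Hypothetical frame with the default RangeIndex 0..9.
toy = pd.DataFrame({"value": range(10, 20)})

by_label = toy.loc[1:5]      # label slice: both ends included
by_position = toy.iloc[1:5]  # position slice: end excluded

print(len(by_label))      # 5
print(len(by_position))   # 4

# With a non-default index, labels and positions no longer coincide.
shifted = toy.set_index(pd.Index(range(100, 110)))
first_by_position = shifted.iloc[0]   # first row, whatever its label
first_by_label = shifted.loc[100]     # row whose label is 100
print(first_by_position["value"], first_by_label["value"])  # 10 10
```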

[42]: #31 locate all cols within a range of col index


df.iloc[:,0:2]

[42]: PassengerId Survived


0 1 0
1 2 1
2 3 1
3 4 1
4 5 0
.. … …
886 887 0
887 888 1
888 889 0
889 890 1
890 891 0

[891 rows x 2 columns]

[43]: #32 locate second through third row and first 2 cols
df.iloc[1:3,0:2]

[43]: PassengerId Survived


1 2 1
2 3 1

[44]: #33 locate 1st and 6th rows, 1st and 4th cols
df.iloc[[0,5],[0,3]]

[44]: PassengerId Name


0 1 Braund, Mr. Owen Harris
5 6 Moran, Mr. James

[46]: #34. save created data in csv, excel file and reload and check
df_d.to_csv('EDA_data.csv')

[47]: df_d.to_excel('EDA_data.xlsx')

[48]: df_csv = pd.read_csv('EDA_data.csv')


df_excel = pd.read_excel('EDA_data.xlsx')
print(df_csv)
print(df_excel)

Unnamed: 0 Name Age Gender Salary


0 0 Ram 20 Male 12000
1 1 Shyam 22 Male 23000
2 2 Gita 19 Female 20000
3 3 Sita 18 Female 19000
4 4 Druv 20 Male 10000
5 5 Om 21 Male 25000
6 6 Radhika 20 Female 40000
Unnamed: 0 Name Age Gender Salary
0 0 Ram 20 Male 12000
1 1 Shyam 22 Male 23000
2 2 Gita 19 Female 20000
3 3 Sita 18 Female 19000
4 4 Druv 20 Male 10000
5 5 Om 21 Male 25000
6 6 Radhika 20 Female 40000
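The extra "Unnamed: 0" column in both reloaded frames is the row index, which to_csv and to_excel write by default. Two common ways to avoid it are passing index=False when saving, or index_col=0 when reading. A minimal sketch of the CSV case, assuming an in-memory buffer (io.StringIO) in place of a file on disk so that nothing is written; `df_small` and the other names are illustrative:

```python
import io

import pandas as pd

# Hypothetical two-row frame standing in for df_d.
df_small = pd.DataFrame({"Name": ["Ram", "Shyam"], "Age": [20, 22]})

# Default behaviour: the index is written as an unnamed first column.
csv_with_index = df_small.to_csv()
reloaded_with_index = pd.read_csv(io.StringIO(csv_with_index))
print(list(reloaded_with_index.columns))  # ['Unnamed: 0', 'Name', 'Age']

# Option 1: do not write the index at all.
csv_clean = df_small.to_csv(index=False)
reloaded_clean = pd.read_csv(io.StringIO(csv_clean))
print(list(reloaded_clean.columns))       # ['Name', 'Age']

# Option 2: keep the file as is and read the first column back as the index.
reloaded_indexed = pd.read_csv(io.StringIO(csv_with_index), index_col=0)
print(list(reloaded_indexed.columns))     # ['Name', 'Age']
```

to_excel accepts the same index=False argument.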

[ ]:
