AI Lab5
AI Lab5
Practical No. 5
To perform Exploratory Data Analysis using python
Student’s Roll no: _______________ Points Scored: __________________________
OBJECTIVES: Upon successful completion of this practical, the students will be able to:
Understand the Exploratory Data Analysis
Perform the Exploratory Data Analysis on a given dataset.
Exploratory Data Analysis or EDA is used to take insights from the data. Data Scientists and
Analysts try to find different patterns, relations, and anomalies in the data using some statistical
graphs and other visualization techniques. Following things are part of EDA :
● Get maximum insights from a data set.
● Uncover underlying structure.
● Extract important variables from the dataset.
● Detect outliers and anomalies(if any).
● Test underlying assumptions.
● Determine the optimal factor settings.
Steps:
The columns having null values are: Age, Cabin, Embarked. They need to be filled up with
appropriate values later on.
Using Seaborn
Seaborn:
It is a python library used to statistically visualize data. Seaborn, built over Matplotlib, provides a
better interface and ease of usage. It can be installed using the following command,
It helps in determining if higher-class passengers had more survival rate than the lower class ones
or vice versa. Class 1 passengers have a higher survival chance compared to classes 2 and 3. It
implies that Pclass contributes a lot to a passenger’s survival rate.
Age (Continuous Feature) vs Survived
This graph gives a summary of the age range of men, women and children who were saved. The
survival rate is –
Good for children.
High for women in the age range 20-50.
Less for men as the age increases.
Since Age column is important, the missing values need to be filled, either by using the Name
column(ascertaining age based on salutation – Mr, Mrs etc.) or by using a regressor.
After this step, another column – Age_Range (based on age column) can be created and the data can
be analyzed again.
Bar Plot for Fare (Continuous Feature)
Fare denotes the fare paid by a passenger. As the values in this column are continuous, they need to
be put in separate bins(as done for Age feature) to get a clear idea. It can be concluded that if a
passenger paid a higher fare, the survival rate is more.
Categorical Count Plots for Embarked Feature
Some notable observations are:
Majority of the passengers boarded from S. So, the missing values can be filled with S.
Majority of class 3 passengers boarded from Q.
S looks lucky for class 1 and 2 passengers compared to class 3.
Lab Tasks
The End