Intro To Exploratory Data Analysis Eda in Python

The document provides a tutorial on Exploratory Data Analysis (EDA) in Python, emphasizing its importance in understanding datasets before applying machine learning models. It outlines the steps involved in EDA, including importing libraries, loading data, checking data types, dropping irrelevant columns, renaming columns, removing duplicates, handling missing values, and detecting outliers. The tutorial uses a car dataset from Kaggle to illustrate these concepts and techniques.

Uploaded by

h367565601

Exploratory data analysis in Python.

Let us understand how to explore data in Python.

Image Credits: Morioh

Introduction

What is Exploratory Data Analysis?

Exploratory Data Analysis (EDA) means understanding a data set by summarizing its main characteristics, often by plotting them. This step is
especially important before we model the data for machine learning. Typical EDA plots include histograms, box plots, scatter plots and many more.
Exploring the data often takes a lot of time, but the process also helps us define the problem statement for our data set, which is very important.

How to perform Exploratory Data Analysis?

Everyone is keen to know the answer to this question. The answer is that it depends on the data set you are working with. There is no single
method for performing EDA, but this tutorial covers some common methods and plots that are used in the EDA process.

What data are we exploring today?

Since I am a huge fan of cars, I picked a very nice data set of cars from Kaggle. The data-set can be downloaded from here. In brief, the data
contains more than 10,000 rows and more than 10 columns describing features of each car, such as Engine Fuel Type, Engine HP, Transmission Type,
highway MPG, city MPG and many more. In this tutorial, we will explore the data and make it ready for modeling.

1. Importing the required libraries for EDA

Below are the libraries that are used in order to perform EDA (Exploratory data analysis) in this tutorial.

In [1]: import pandas as pd
import numpy as np
import seaborn as sns # visualisation
import matplotlib.pyplot as plt # visualisation
%matplotlib inline
sns.set(color_codes=True)

2. Loading the data into the data frame.

Loading the data into a pandas data frame is certainly one of the most important steps in EDA. The values in the data set are comma-separated,
so all we have to do is read the CSV file into a data frame, and pandas does the job for us.

To get the data set into the notebook, all I did was one trivial step. In Google Colab, at the left-hand side of the notebook, you will find a >
(greater-than symbol). When you click it, you will find a tab with three options; select Files. Then you can upload your file with the Upload
option. There is no need to mount Google Drive or use any specific libraries: just upload the data set and your job is done. One thing to
remember is that uploaded files are deleted when the runtime is recycled. This is how I got the data set into the notebook.

In [2]: df = pd.read_csv("../input/cardataset/data.csv")
# To display the top 5 rows
df.head(5)

Out [2]:
   Make  Model       Year  Engine Fuel Type             Engine HP  Engine Cylinders  Transmission Type  Driven_Wheels     Number of Doors  Market Category                        Vehicle Size  Vehicle Style  highway MPG  ...
0  BMW   1 Series M  2011  premium unleaded (required)  335.0      6.0               MANUAL             rear wheel drive  2.0              Factory Tuner,Luxury,High-Performance  Compact       Coupe          26           ...
1  BMW   1 Series    2011  premium unleaded (required)  300.0      6.0               MANUAL             rear wheel drive  2.0              Luxury,Performance                     Compact       Convertible    28           ...
2  BMW   1 Series    2011  premium unleaded (required)  300.0      6.0               MANUAL             rear wheel drive  2.0              Luxury,High-Performance                Compact       Coupe          28           ...
3  BMW   1 Series    2011  premium unleaded (required)  230.0      6.0               MANUAL             rear wheel drive  2.0              Luxury,Performance                     Compact       Coupe          28           ...
4  BMW   1 Series    2011  premium unleaded (required)  230.0      6.0               MANUAL             rear wheel drive  2.0              Luxury                                 Compact       Convertible    28           ...

In [3]: df.tail(5) # To display the bottom 5 rows

Out [3]:
       Make     Model   Year  Engine Fuel Type                Engine HP  Engine Cylinders  Transmission Type  Driven_Wheels      Number of Doors  Market Category             Vehicle Size  ...
11909  Acura    ZDX     2012  premium unleaded (required)     300.0      6.0               AUTOMATIC          all wheel drive    4.0              Crossover,Hatchback,Luxury  Midsize       ...
11910  Acura    ZDX     2012  premium unleaded (required)     300.0      6.0               AUTOMATIC          all wheel drive    4.0              Crossover,Hatchback,Luxury  Midsize       ...
11911  Acura    ZDX     2012  premium unleaded (required)     300.0      6.0               AUTOMATIC          all wheel drive    4.0              Crossover,Hatchback,Luxury  Midsize       ...
11912  Acura    ZDX     2013  premium unleaded (recommended)  300.0      6.0               AUTOMATIC          all wheel drive    4.0              Crossover,Hatchback,Luxury  Midsize       ...
11913  Lincoln  Zephyr  2006  regular unleaded                221.0      6.0               AUTOMATIC          front wheel drive  4.0              Luxury                      Midsize       ...
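
As a minimal, self-contained sketch of the loading step, the snippet below reads a tiny comma-separated sample from an in-memory string instead of a file. The name `sample_csv` and the cut-down set of four columns are illustrative assumptions; the two rows reuse values shown in the outputs of this tutorial.

```python
import io
import pandas as pd

# Hypothetical two-row sample standing in for the real data.csv file.
sample_csv = io.StringIO(
    "Make,Model,Year,MSRP\n"
    "BMW,1 Series M,2011,46135\n"
    "BMW,1 Series,2011,40650\n"
)

sample_df = pd.read_csv(sample_csv)  # the same call accepts a file path
print(sample_df.shape)               # (rows, columns)
print(sample_df.head(1))             # first row
print(sample_df.tail(1))             # last row
```

Reading from a string keeps the example runnable without any upload; with the real file, only the argument to `pd.read_csv` changes.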

3. Checking the types of data

Here we check the data types because sometimes the MSRP, or price of the car, is stored as a string. In that case, we have to convert the string
to an integer before we can plot the data. In this case, the data is already in integer format, so there is nothing to worry about.

In [4]: df.dtypes

Out [4]:
Make                 object
Model                object
Year                 int64
Engine Fuel Type     object
Engine HP            float64
Engine Cylinders     float64
Transmission Type    object
Driven_Wheels        object
Number of Doors      float64
Market Category      object
Vehicle Size         object
Vehicle Style        object
highway MPG          int64
city mpg             int64
Popularity           int64
MSRP                 int64
dtype: object
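
If the price had been stored as text, the conversion mentioned above might look like the following sketch. The `prices` frame and its "$46,135"-style formatting are made-up assumptions for illustration; the real data set already stores MSRP as int64.

```python
import pandas as pd

# Hypothetical frame where the price was read as text, e.g. "$46,135".
prices = pd.DataFrame({"MSRP": ["$46,135", "$40,650", "$36,350"]})
print(prices.dtypes)  # MSRP is a text column here

# Strip the currency symbol and thousands separators, then convert.
cleaned = prices["MSRP"].str.replace(r"[$,]", "", regex=True)
prices["MSRP"] = pd.to_numeric(cleaned).astype("int64")
print(prices.dtypes)  # MSRP is now int64
```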

4. Dropping irrelevant columns

This step is needed in most EDA work because a data set often has many columns that we never use, and dropping them is the simplest solution.
In this case, columns such as Engine Fuel Type, Market Category, Vehicle Style, Popularity, Number of Doors and Vehicle Size are not relevant to
my analysis, so I dropped them.

In [5]: df = df.drop(['Engine Fuel Type', 'Market Category', 'Vehicle Style', 'Popularity', 'Number of Doors', 'Vehicle
df.head(5)

Out [5]:
   Make  Model       Year  Engine HP  Engine Cylinders  Transmission Type  Driven_Wheels     highway MPG  city mpg  MSRP
0  BMW   1 Series M  2011  335.0      6.0               MANUAL             rear wheel drive  26           19        46135
1  BMW   1 Series    2011  300.0      6.0               MANUAL             rear wheel drive  28           19        40650
2  BMW   1 Series    2011  300.0      6.0               MANUAL             rear wheel drive  28           20        36350
3  BMW   1 Series    2011  230.0      6.0               MANUAL             rear wheel drive  28           18        29450
4  BMW   1 Series    2011  230.0      6.0               MANUAL             rear wheel drive  28           18        34500
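
The drop cell above is cut off in the source. Assuming that exactly the six columns named in the text were dropped (which matches the ten columns left in the output), the step might look like this sketch. The one-row `cars` frame is demo data, and its Popularity value is a placeholder because the real value is not shown anywhere in this tutorial.

```python
import pandas as pd

# One-row stand-in using the 16 column names from the dtypes listing.
cars = pd.DataFrame([{
    "Make": "BMW", "Model": "1 Series M", "Year": 2011,
    "Engine Fuel Type": "premium unleaded (required)",
    "Engine HP": 335.0, "Engine Cylinders": 6.0,
    "Transmission Type": "MANUAL", "Driven_Wheels": "rear wheel drive",
    "Number of Doors": 2.0,
    "Market Category": "Factory Tuner,Luxury,High-Performance",
    "Vehicle Size": "Compact", "Vehicle Style": "Coupe",
    "highway MPG": 26, "city mpg": 19,
    "Popularity": 0,  # placeholder: the real value is not shown in the source
    "MSRP": 46135,
}])

# Assumed list: the six columns the text says were dropped.
to_drop = ["Engine Fuel Type", "Market Category", "Vehicle Style",
           "Popularity", "Number of Doors", "Vehicle Size"]
cars = cars.drop(columns=to_drop)
print(list(cars.columns))
```

Passing `columns=` makes the intent explicit and avoids having to remember which axis number refers to columns.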

5. Renaming the columns

In this instance, most of the column names are long and awkward to read, so I shortened them. This is a good practice because it improves the
readability of the data set.

In [6]: df = df.rename(columns={"Engine HP": "HP", "Engine Cylinders": "Cylinders", "Transmission Type": "Transmission"
df.head(5)

Out [6]:
   Make  Model       Year  HP     Cylinders  Transmission  Drive Mode        MPG-H  MPG-C  Price
0  BMW   1 Series M  2011  335.0  6.0        MANUAL        rear wheel drive  26     19     46135
1  BMW   1 Series    2011  300.0  6.0        MANUAL        rear wheel drive  28     19     40650
2  BMW   1 Series    2011  300.0  6.0        MANUAL        rear wheel drive  28     20     36350
3  BMW   1 Series    2011  230.0  6.0        MANUAL        rear wheel drive  28     18     29450
4  BMW   1 Series    2011  230.0  6.0        MANUAL        rear wheel drive  28     18     34500
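
The rename cell above is also cut off in the source. Only its first three pairs are visible; the remaining four in the sketch below are inferred by comparing the column headers before and after the rename, so treat them as assumptions.

```python
import pandas as pd

# Empty frame with the ten column names left after the drop step.
cars = pd.DataFrame(columns=[
    "Make", "Model", "Year", "Engine HP", "Engine Cylinders",
    "Transmission Type", "Driven_Wheels", "highway MPG", "city mpg", "MSRP",
])

rename_map = {
    "Engine HP": "HP",                    # visible in the source cell
    "Engine Cylinders": "Cylinders",      # visible in the source cell
    "Transmission Type": "Transmission",  # visible in the source cell
    "Driven_Wheels": "Drive Mode",        # inferred from the output header
    "highway MPG": "MPG-H",               # inferred from the output header
    "city mpg": "MPG-C",                  # inferred from the output header
    "MSRP": "Price",                      # inferred from the output header
}
cars = cars.rename(columns=rename_map)
print(list(cars.columns))
```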

6. Dropping the duplicate rows

This is often a handy thing to do because a large data set, such as this one with more than 10,000 rows, often contains duplicate rows that can
distort the analysis, so here I remove all duplicate rows from the data set. Before removal I had 11,914 rows; after removing the duplicates,
10,925 rows remained, meaning that 989 rows were duplicates.

In [7]: df.shape

Out [7]: (11914, 10)

In [8]: duplicate_rows_df = df[df.duplicated()]

print("number of duplicate rows: ", duplicate_rows_df.shape)

number of duplicate rows: (989, 10)

Now let us remove the duplicate rows, since they carry no additional information.

In [9]: df.count() # Counts the non-null values in each column

Out [9]:
Make            11914
Model           11914
Year            11914
HP              11845
Cylinders       11884
Transmission    11914
Drive Mode      11914
MPG-H           11914
MPG-C           11914
Price           11914
dtype: int64

As seen above, there are 11,914 rows, and we are removing 989 duplicate rows.

In [10]: df = df.drop_duplicates()
df.head(5)

Out [10]:
   Make  Model       Year  HP     Cylinders  Transmission  Drive Mode        MPG-H  MPG-C  Price
0  BMW   1 Series M  2011  335.0  6.0        MANUAL        rear wheel drive  26     19     46135
1  BMW   1 Series    2011  300.0  6.0        MANUAL        rear wheel drive  28     19     40650
2  BMW   1 Series    2011  300.0  6.0        MANUAL        rear wheel drive  28     20     36350
3  BMW   1 Series    2011  230.0  6.0        MANUAL        rear wheel drive  28     18     29450
4  BMW   1 Series    2011  230.0  6.0        MANUAL        rear wheel drive  28     18     34500

In [11]: df.count()

Out [11]:
Make            10925
Model           10925
Year            10925
HP              10856
Cylinders       10895
Transmission    10925
Drive Mode      10925
MPG-H           10925
MPG-C           10925
Price           10925
dtype: int64
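
The same duplicate check can be tried on a toy frame. This is only a sketch with made-up rows; `cars`, `n_before` and `n_dupes` are illustrative names.

```python
import pandas as pd

# Made-up rows: the second and third rows are exact copies of each other.
cars = pd.DataFrame({
    "Make":  ["BMW", "Audi", "Audi", "FIAT"],
    "Model": ["1 Series", "100", "100", "124 Spider"],
    "Year":  [2011, 1992, 1992, 2017],
})

n_before = len(cars)
n_dupes = int(cars.duplicated().sum())  # copies beyond the first occurrence
cars = cars.drop_duplicates()           # keeps the first occurrence of each row
print(n_before, n_dupes, len(cars))     # rows before, duplicates, rows after
```

The tutorial's numbers follow the same arithmetic: 11,914 rows minus 989 duplicates leaves 10,925 rows.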

7. Dropping the missing or null values.

This is similar to the previous step, but here the missing values are detected and the rows containing them are dropped. Dropping is not always
the best approach; many people instead replace missing values with the mean of the column. In this case, I simply dropped the rows with missing
values, because there are only about 100 missing values among more than 10,000 rows, which is negligible.

In [12]: print(df.isnull().sum())

Make 0
Model 0
Year 0
HP 69
Cylinders 30
Transmission 0
Drive Mode 0
MPG-H 0
MPG-C 0
Price 0
dtype: int64

This is why, in the count above, Horsepower (HP) and Cylinders showed 10,856 and 10,895 values instead of 10,925.

In [13]: df = df.dropna() # Dropping the missing values.
df.count()

Out [13]:
Make            10827
Model           10827
Year            10827
HP              10827
Cylinders       10827
Transmission    10827
Drive Mode      10827
MPG-H           10827
MPG-C           10827
Price           10827
dtype: int64

Now we have removed all the rows that contain null or N/A values (in Cylinders and Horsepower (HP)).

In [14]: print(df.isnull().sum()) # After dropping the values

Make 0
Model 0
Year 0
HP 0
Cylinders 0
Transmission 0
Drive Mode 0
MPG-H 0
MPG-C 0
Price 0
dtype: int64
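
For comparison, here is a small sketch of the two options mentioned in this step: dropping rows with missing values (what this tutorial does) and filling them with the column mean. The three-row `cars` frame is made up for illustration.

```python
import numpy as np
import pandas as pd

# Made-up values with one missing horsepower reading.
cars = pd.DataFrame({"HP": [300.0, np.nan, 230.0],
                     "Cylinders": [6.0, 6.0, 4.0]})

dropped = cars.dropna()                          # approach used in this tutorial
filled = cars.fillna({"HP": cars["HP"].mean()})  # alternative: mean imputation

print(len(dropped))           # 2 rows survive
print(filled["HP"].tolist())  # [300.0, 265.0, 230.0]
```

Dropping loses the whole row, including its valid Cylinders value, while imputation keeps the row at the cost of inventing a horsepower figure.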

8. Detecting Outliers

An outlier is a point or set of points that differs markedly from the other points. Outliers can be very high or very low. It's often a good
idea to detect and remove them, because outliers are one of the primary causes of a less accurate model. The outlier detection and removal
technique that I am going to use is called the IQR score technique. Outliers can often be seen in a box plot. Shown below are the box plots of
Price (MSRP), Horsepower and Cylinders. In each plot, you can find individual points drawn beyond the whiskers; these are the outliers. The
technique of finding and removing outliers that I use here is taken from a tutorial on Towards Data Science.

In [15]: sns.boxplot(x=df['Price'])

Out [15]: <matplotlib.axes._subplots.AxesSubplot at 0x7f89e9c87a90>

In [16]: sns.boxplot(x=df['HP'])

Out [16]: <matplotlib.axes._subplots.AxesSubplot at 0x7f89e79e2990>

In [17]: sns.boxplot(x=df['Cylinders'])

Out [17]: <matplotlib.axes._subplots.AxesSubplot at 0x7f89e79e2250>

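In [18]: # numeric_only=True limits the quantiles to numeric columns (pandas 2.x raises an error on text columns otherwise)
Q1 = df.quantile(0.25, numeric_only=True)
Q3 = df.quantile(0.75, numeric_only=True)
IQR = Q3 - Q1
print(IQR)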

Year 9.0
HP 130.0
Cylinders 2.0
MPG-H 8.0
MPG-C 6.0
Price 21327.5
dtype: float64

Don't worry about the individual values above; what matters is knowing how to use this technique to remove the outliers.

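In [19]: # Compare only the numeric columns for which quartiles were computed
numeric = df[Q1.index]
df = df[~((numeric < (Q1 - 1.5 * IQR)) | (numeric > (Q3 + 1.5 * IQR))).any(axis=1)]
df.shape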

Out [19]: (9191, 10)

As seen above, around 1,600 rows contained outliers (10,827 rows before, 9,191 after). You cannot remove every outlier this way: even after
applying the technique, one or two outliers may remain, but that is acceptable given how many were removed. Something is better than nothing.
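
To see the IQR rule on numbers small enough to check by hand, here is a sketch for a single column. The helper `iqr_bounds` and the horsepower values are hypothetical; the fences follow the same Q1 - 1.5 * IQR and Q3 + 1.5 * IQR rule used above.

```python
import pandas as pd

def iqr_bounds(values: pd.Series, k: float = 1.5):
    """Return the (lower, upper) fences Q1 - k*IQR and Q3 + k*IQR."""
    q1 = values.quantile(0.25)
    q3 = values.quantile(0.75)
    iqr = q3 - q1
    return q1 - k * iqr, q3 + k * iqr

# Made-up horsepower values; 1001 is an obvious outlier.
hp = pd.Series([150, 170, 180, 200, 210, 230, 250, 1001])
low, high = iqr_bounds(hp)
kept = hp[(hp >= low) & (hp <= high)]

print(low, high)      # 91.25 321.25
print(kept.tolist())  # [150, 170, 180, 200, 210, 230, 250]
```

Here Q1 is 177.5 and Q3 is 235.0, so the IQR is 57.5 and only the value 1001 falls outside the fences.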

9. Plot different features against one another (scatter), against frequency (histogram)

Histogram

A histogram shows how frequently values occur within each interval. In this case, the data covers many different car manufacturers, and it is
often useful to know which of them has the most cars in the data set. A simple way to find out is to plot the number of cars per make; strictly
speaking, the plot below is a bar chart of counts per category rather than a histogram over numeric bins.

In [20]: df.Make.value_counts().nlargest(40).plot(kind='bar', figsize=(10,5))
plt.title("Number of cars by make")
plt.ylabel('Number of cars')
plt.xlabel('Make');
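
The counting that feeds this bar chart can be inspected without plotting. A minimal sketch on a made-up list of makes (`makes`, `counts` and `top2` are illustrative names):

```python
import pandas as pd

# Made-up list of makes standing in for the Make column.
makes = pd.Series(["BMW", "Audi", "BMW", "FIAT", "Audi", "BMW"], name="Make")

counts = makes.value_counts()  # number of rows per make
top2 = counts.nlargest(2)      # the two most frequent makes
print(top2)
```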

Heat Maps

A heat map is a useful type of plot when we need to find which variables are related. One of the best ways to see the relationships between
features is a heat map of their correlations. In the heat map below, Price is most strongly correlated with Horsepower (0.74) and Year (0.59),
and more weakly with Cylinders (0.35).

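In [21]: plt.figure(figsize=(10,5))
# Correlate numeric columns only; text columns such as Make cannot be correlated
c = df.select_dtypes(include="number").corr()
sns.heatmap(c, cmap="BrBG", annot=True)
c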

Out [21]:
           Year       HP         Cylinders  MPG-H      MPG-C      Price
Year        1.000000   0.326726  -0.133920   0.378479   0.338145   0.592983
HP          0.326726   1.000000   0.715237  -0.443807  -0.544551   0.739042
Cylinders  -0.133920   0.715237   1.000000  -0.703856  -0.755540   0.354013
MPG-H       0.378479  -0.443807  -0.703856   1.000000   0.939141  -0.106320
MPG-C       0.338145  -0.544551  -0.755540   0.939141   1.000000  -0.180515
Price       0.592983   0.739042   0.354013  -0.106320  -0.180515   1.000000
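
One way to read such a matrix is to rank the features by the absolute strength of their correlation with Price. The sketch below copies the Price column from the output above; `price_corr` and `ranked` are illustrative names.

```python
import pandas as pd

# Price column of the correlation matrix above (self-correlation left out).
price_corr = pd.Series(
    {"Year": 0.592983, "HP": 0.739042, "Cylinders": 0.354013,
     "MPG-H": -0.106320, "MPG-C": -0.180515},
    name="Price",
)

# Order by absolute value, strongest first, while keeping the original signs.
order = price_corr.abs().sort_values(ascending=False).index
ranked = price_corr.reindex(order)
print(ranked)
```

Sorting by absolute value matters because a strong negative correlation is just as informative as a strong positive one.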

Scatterplot

We generally use scatter plots to examine the correlation between two variables. Here the scatter plot shows Horsepower against Price, as seen
below. From the plot, we can easily draw a trend line. These two features give a good spread of points.

In [22]: fig, ax = plt.subplots(figsize=(10,6))

ax.scatter(df['HP'], df['Price'])
ax.set_xlabel('HP')
ax.set_ylabel('Price')
plt.show()
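
The trend line mentioned above is not drawn in this tutorial. One possible sketch fits a straight line by least squares with `np.polyfit`; the (HP, Price) pairs are made up so that the result is easy to check.

```python
import numpy as np

# Made-up (HP, Price) pairs lying exactly on price = 100 * hp + 5000.
hp = np.array([150.0, 200.0, 250.0, 300.0])
price = 100.0 * hp + 5000.0

# Least-squares fit of a degree-1 polynomial: price ~ slope * hp + intercept.
slope, intercept = np.polyfit(hp, price, deg=1)
trend = slope * hp + intercept  # y-values of the fitted line at each hp

print(round(slope, 2), round(intercept, 2))
# With matplotlib, ax.plot(hp, trend) would overlay the line on the scatter plot.
```

On the real data the points will not lie exactly on a line, so the fit only summarizes the overall direction of the relationship.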

These are some of the steps involved in exploratory data analysis; they are general steps that you can follow to perform EDA. There are many
more, but for now this should give a good idea of how to perform EDA on any data set. Stay tuned for more updates.

Thank you.
