Exploratory Data Analysis (EDA)
FUNDAMENTALS
• Processing such data elicits useful information.
Data analysis draws on statistics and mathematics, among other fields. There are several
phases of data analysis, including
• data requirements,
• data collection,
• data processing,
• data cleaning,
• exploratory data analysis (EDA),
• data product, and
• communication.
These phases are similar to the Cross-Industry Standard Process for Data Mining (CRISP-DM).
• Data requirements:
For example, an application for patients suffering from dementia requires several types of
sensor data to be stored, such as sleep data, the patient's heart rate, electrodermal
activity, and user activity patterns. All of these data points are required to correctly
diagnose the mental state of the person. Hence, these are mandatory requirements for the
application.
• Data collection:
Data collected from several sources must be stored in the correct format.
• Data processing:
Data is preprocessed before analysis. Common tasks involve correctly exporting the dataset,
placing it under the right tables, structuring it, and exporting it in the correct format.
• Data cleaning:
Preprocessed data is still not ready for detailed analysis. It must be correctly checked,
for example with a missing value check. These tasks are performed in the data cleaning stage.
• EDA:
Some variables depend on other variables to cause an event. For example, when buying, say,
pens, the total price of pens (Total) = price for one pen (UnitPrice) * the number of pens
(Quantity). Here, the total price depends on the unit price. Hence, the total price is
referred to as the dependent variable and the unit price is referred to as
an independent variable.
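The pen example can be sketched in a few lines of Python. The function name `total_price` and the sample prices are illustrative assumptions, not part of the notes. The quantity is held fixed so that the effect of the independent variable (the unit price) on the dependent variable (the total) is visible.

```python
def total_price(unit_price, quantity):
    """Total = UnitPrice * Quantity, as in the pen example."""
    return unit_price * quantity


# Hypothetical values chosen only for illustration.
quantity = 10
unit_prices = [1.0, 1.5, 2.0]

# The total (dependent variable) changes whenever the unit price
# (independent variable) changes.
totals = [total_price(price, quantity) for price in unit_prices]

for price, total in zip(unit_prices, totals):
    print(f"UnitPrice={price:.2f}  Quantity={quantity}  Total={total:.2f}")
```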
• Data Product:
Any computer software that uses data as inputs and produces outputs is referred to
as a data product, for example, a recommendation model that inputs user purchase history and
recommends related items.
• Communication:
This stage deals with disseminating the results to end stakeholders so that they can use the
result for business intelligence. One of the most notable steps in this stage is data
visualization, which uses techniques such as tables, charts, summary diagrams, and bar charts
to show the analyzed result.
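As a rough illustration of communicating a result as a bar chart, the sketch below renders counts as plain-text bars. The helper `text_bar_chart` and the quarterly figures are hypothetical. In practice a plotting library would normally be used instead of text output.

```python
def text_bar_chart(counts):
    """Render a {label: count} mapping as one text bar per line."""
    width = max(len(label) for label in counts)
    return "\n".join(
        f"{label.ljust(width)} | {'#' * value} {value}"
        for label, value in counts.items()
    )


# Hypothetical analyzed result: pens sold per quarter.
pens_sold = {"Q1": 4, "Q2": 7, "Q3": 5}
print(text_bar_chart(pens_sold))
```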
It is difficult to make sense of more than a handful of data points without the help of
computer programs. To be certain of the insights that the collected data provides and to
make further decisions, data mining is performed.
Exploratory data analysis is key, and usually the first exercise in data mining.
The exploratory analysis centers around creating a synopsis of data or insights for
further analysis.
❖ EDA reveals the ground truth about the content without making any underlying assumptions,
which is why data scientists use this process to understand the data. Its key components
include summarizing data, statistical analysis, and visualization of data. Python provides
expert tools for exploratory analysis, with pandas for summarizing and scipy, along with
others, for statistical analysis.
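A minimal sketch of the summarizing and statistical-analysis components follows. To stay self-contained it uses only the standard library's `statistics` module rather than pandas or scipy, and the heart-rate readings are made-up sample values. With pandas installed, `DataFrame.describe()` gives a comparable summary.

```python
import statistics

# Hypothetical heart-rate readings (beats per minute).
heart_rates = [62, 71, 68, 75, 80, 66, 71]

# A small descriptive summary of one numerical variable.
summary = {
    "count": len(heart_rates),
    "min": min(heart_rates),
    "max": max(heart_rates),
    "mean": statistics.mean(heart_rates),
    "median": statistics.median(heart_rates),
    "stdev": statistics.stdev(heart_rates),  # sample standard deviation
}

for name, value in summary.items():
    print(f"{name:<7}{value:>8.2f}")
```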
Steps in EDA
• Problem definition:
Before trying to extract useful insight from the data, it is essential to define the
business problem to be solved. The problem definition works as the driving force for
executing a data analysis plan. The main tasks involved in problem definition are defining
the main objective of the analysis, defining the main deliverables, outlining the main
roles and responsibilities, and obtaining the current status of the data.
• Data preparation:
This step involves methods for preparing the dataset before actual analysis. In
this step, we define the sources of data, define data schemas and tables,
understand the main characteristics of the data, clean the dataset, delete
non-relevant datasets, transform the data, and divide the data into required chunks
for analysis.
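Two of the preparation tasks named above, cleaning the dataset and dividing it into chunks, can be sketched as follows. The helpers `clean` and `chunk` and the sample rows are hypothetical. Cleaning is assumed here to mean dropping rows with missing values and exact duplicates, which is only one possible policy.

```python
# Hypothetical raw rows; None marks a missing value.
raw_rows = [
    {"id": 1, "weight": 68},
    {"id": 2, "weight": None},  # missing value
    {"id": 3, "weight": 24},
    {"id": 3, "weight": 24},    # exact duplicate
    {"id": 4, "weight": 23},
    {"id": 5, "weight": 75},
]


def clean(rows):
    """Drop rows that contain a missing value or repeat an earlier row."""
    seen = set()
    cleaned = []
    for row in rows:
        if any(value is None for value in row.values()):
            continue
        key = tuple(sorted(row.items()))
        if key in seen:
            continue
        seen.add(key)
        cleaned.append(row)
    return cleaned


def chunk(rows, size):
    """Split rows into consecutive chunks of at most `size` rows."""
    return [rows[i:i + size] for i in range(0, len(rows), size)]


cleaned = clean(raw_rows)
chunks = chunk(cleaned, 2)
print(len(raw_rows), "raw rows ->", len(cleaned), "clean rows in", len(chunks), "chunks")
```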
• Data analysis:
This is one of the most crucial steps and deals with descriptive statistics and
analysis of the data. The main tasks involve summarizing the data.
Techniques used for data summarization include summary tables.
The result analyzed from the dataset should be interpretable by the business.
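A summary table can be sketched by grouping records and computing descriptive statistics per group. The purchase records below are invented for illustration, and the standard library stands in for pandas, whose `groupby` would be the more usual tool.

```python
import statistics
from collections import defaultdict

# Hypothetical purchase records: (product, quantity bought).
purchases = [("pen", 3), ("pencil", 5), ("pen", 7), ("pencil", 1), ("pen", 2)]

# Group the quantities by product.
groups = defaultdict(list)
for product, quantity in purchases:
    groups[product].append(quantity)

# One row of descriptive statistics per product.
summary_table = {
    product: {
        "count": len(quantities),
        "total": sum(quantities),
        "mean": statistics.mean(quantities),
    }
    for product, quantities in groups.items()
}

print(f"{'product':<8}{'count':>6}{'total':>6}{'mean':>8}")
for product, stats in summary_table.items():
    print(f"{product:<8}{stats['count']:>6}{stats['total']:>6}{stats['mean']:>8.2f}")
```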
It is crucial to identify the type of data under analysis. In this section, we are going to
learn about different types of data that you can encounter during analysis. Different
disciplines store different kinds of data for different purposes. For example, medical
researchers store patients' data, universities store students' and teachers' data, and real
estate industries store house data.
A dataset about patients in a hospital can contain many observations. A patient can be
described by a patient identifier (ID), name, address, weight, date of birth, email,
and gender. Each of these features that describes a patient is a variable. Each observation
has a specific value for each of these variables. For example, a patient can have the following:
PATIENT_ID = 1001
Gender = Female
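PATIENT_ID | NAME                 | ADDRESS                    | DOB       | EMAIL             | GENDER | WEIGHT
-----------|----------------------|----------------------------|-----------|-------------------|--------|-------
001        | Suresh Kumar Mukhiya | Mannsverk, 61              | 30.12.198 | [email protected] | Male   | 68
002        | Yoshmi Mukhiya       | Mannsverk 61, 5094, Bergen | 10.07.201 | om                | Female | 1
003        | Anju Mukhiya         | Mannsverk 61, 5094, Bergen | 10.12.199 | [email protected] | Female | 24
004        | Asha Gaire           | Butwal, Nepal              | 30.11.199 | [email protected] | Female | 23
005        | Ola Nordmann         | Danmark, Sweden            | 12.12.178 | [email protected] | Male   | 75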
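The patient records above can be represented in Python as a list of dictionaries, where each dictionary is one observation and each key is one variable. The key names are assumptions, in particular `WEIGHT` for the unlabeled numeric column. The address, date of birth, and email fields are left out because they are incomplete in the notes.

```python
# Each dictionary is one observation (a patient); each key is a variable.
patients = [
    {"PATIENT_ID": "001", "NAME": "Suresh Kumar Mukhiya", "GENDER": "Male", "WEIGHT": 68},
    {"PATIENT_ID": "002", "NAME": "Yoshmi Mukhiya", "GENDER": "Female", "WEIGHT": 1},
    {"PATIENT_ID": "003", "NAME": "Anju Mukhiya", "GENDER": "Female", "WEIGHT": 24},
    {"PATIENT_ID": "004", "NAME": "Asha Gaire", "GENDER": "Female", "WEIGHT": 23},
    {"PATIENT_ID": "005", "NAME": "Ola Nordmann", "GENDER": "Male", "WEIGHT": 75},
]

variables = list(patients[0].keys())
n_observations = len(patients)

# Reading one variable across all observations gives a column of values.
weights = [patient["WEIGHT"] for patient in patients]

print("variables:", variables)
print("observations:", n_observations)
print("weights:", weights)
```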
Most datasets broadly fall into two groups: numerical data and categorical data.
1. Numerical data
This data has a sense of measurement involved in it; for example, a person's age,
height, weight, blood pressure, heart rate, temperature, number of teeth, number of
bones, and the number of family members. This data is often referred to as quantitative data
in statistics. Numerical data can be either discrete or continuous.
a) Discrete data
This is data that is countable and whose values can be listed out. For example, if we
flip a coin, the number of heads in 200 coin flips can take values from 0 to 200 (a finite
set). Similarly, the Country variable can have values such as Nepal, India, Norway, and
Japan; the set of values is fixed. The Rank variable of a student in a classroom can take
values such as 1, 2, 3, 4, 5, and so on.
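The coin-flip example can be simulated to show that the number of heads is a countable value drawn from a finite list. The helper `count_heads` and the seed are arbitrary choices made for this sketch, and a fair coin is assumed.

```python
import random


def count_heads(flips, rng):
    """Count heads in `flips` simulated tosses of a fair coin."""
    return sum(rng.random() < 0.5 for _ in range(flips))


rng = random.Random(42)  # fixed seed so the run is repeatable
heads = count_heads(200, rng)

# Every possible outcome can be listed: 0, 1, 2, ..., 200.
possible_values = range(0, 201)

print("heads:", heads)
print("number of possible values:", len(possible_values))
```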
b) Continuous data
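A variable that can have an infinite number of numerical values within a specific
range is called a continuous variable.
For example, what is the temperature of your city today? Can we list a finite set of
possible values? No: between any two temperature readings there is always another possible
value.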
2. Categorical data
This type of data represents the characteristics of an object; for example, gender,
marital status, type of address, or categories of the movies. This data is often referred to
as qualitative datasets in statistics. To understand clearly, here are some of the most
common types of categorical data:
• Marital status (..., Widowed, or Unknown)
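Categorical values are usually summarized by counting how often each category occurs rather than by averaging. The sketch below does this for the gender values of the five patients listed earlier; the variable names are illustrative.

```python
from collections import Counter

# Gender values of the five patients from the table earlier in these notes.
genders = ["Male", "Female", "Female", "Female", "Male"]

counts = Counter(genders)          # frequency of each category
categories = sorted(counts)        # the distinct categories
most_common = counts.most_common(1)[0]

print("categories:", categories)
print("counts:", dict(counts))
print("most frequent:", most_common)
```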