Exploratory Data Analysis (Eda)

Uploaded by

vishnuai4568
EXPLORATORY DATA ANALYSIS (EDA)

FUNDAMENTALS

• Data encompasses a collection of discrete objects, numbers, words, events, facts, measurements, observations, or even descriptions of things. Such data is collected and stored by every event or process occurring in several disciplines, including biology, economics, engineering, marketing, and others.

• Processing such data elicits useful information, and processing such information generates useful knowledge.

“EDA is a process of examining the available dataset to discover patterns, spot anomalies, test hypotheses, and check assumptions using statistical measures.”

UNDERSTANDING DATA SCIENCE

Data science involves cross-disciplinary knowledge from computer science, data, statistics, and mathematics. There are several phases of data analysis, including

• data requirements,
• data collection,
• data processing,
• data cleaning,
• exploratory data analysis,
• modeling and algorithms, and
• data product and communication.

These phases are similar to the Cross-Industry Standard Process for Data Mining (CRISP-DM) framework.

• Data requirements:

There can be various sources of data for an organization. It is important to comprehend what type of data the organization needs to collect, curate, and store. For example, an application tracking the sleeping pattern of patients suffering from dementia requires storing several types of sensor data, such as sleep data, the patient’s heart rate, electro-dermal activity, and user activity patterns. All of these data points are required to correctly diagnose the mental state of the person. Hence, these are mandatory requirements for the application. In addition, the data must be categorized as numerical or categorical, and the format of storage and dissemination must be decided.

• Data collection:

Data collected from several sources must be stored in the correct format and transferred to the right information technology personnel within a company.

✓ As mentioned previously, data can be collected from several objects during several events using different types of sensors and storage tools.

• Data processing:

Preprocessing involves pre-curating the dataset before actual analysis. Common tasks involve correctly exporting the dataset, placing it under the right tables, structuring it, and exporting it in the correct format.

• Data cleaning:

Preprocessed data is still not ready for detailed analysis. It must be correctly transformed for an incompleteness check, duplicates check, error check, and missing value check. These tasks are performed in the data cleaning stage, which involves responsibilities such as matching the correct record, finding inaccuracies in the dataset, understanding the overall data quality, removing duplicate items, and filling in the missing values.

✓ Finding such data issues requires us to perform some analytical techniques. Hence, it is essential for data scientists or EDA experts to comprehend different types of datasets. An example of data cleaning would be using outlier detection methods for quantitative data cleaning.
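
The cleaning checks described above (duplicates, missing values, and outliers) can be sketched with pandas as follows. This is only an illustration: the column names, the sample values, the median fill, and the 1.5 * IQR outlier rule are assumptions, since the notes do not prescribe a particular method.

```python
import pandas as pd

# Hypothetical sensor readings; names and values are illustrative only.
readings = pd.DataFrame({
    "patient_id": [1, 2, 2, 3, 4, 5],
    "heart_rate": [72.0, 80.0, 80.0, None, 76.0, 190.0],
})

# Duplicates check: drop rows that repeat an earlier row exactly.
deduped = readings.drop_duplicates().reset_index(drop=True)

# Missing value check: count the gaps, then fill them with the column median
# (the median is an assumed fill strategy, one of several possibilities).
missing_count = int(deduped["heart_rate"].isna().sum())
cleaned = deduped.copy()
cleaned["heart_rate"] = cleaned["heart_rate"].fillna(cleaned["heart_rate"].median())

# Outlier detection with the 1.5 * IQR rule (an assumed, commonly used rule).
q1 = cleaned["heart_rate"].quantile(0.25)
q3 = cleaned["heart_rate"].quantile(0.75)
iqr = q3 - q1
is_outlier = (cleaned["heart_rate"] < q1 - 1.5 * iqr) | (cleaned["heart_rate"] > q3 + 1.5 * iqr)
outliers = cleaned[is_outlier]
```

Whether a flagged value is an error or a genuine extreme reading still has to be judged in context; the rule only marks candidates for review.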

• EDA:

✓ Exploratory data analysis, as mentioned before, is the stage where we actually start to understand the message contained in the data. It should be noted that several types of data transformation techniques might be required during the process of exploration.

• Modeling and algorithm:

✓ From a data science perspective, generalized models or mathematical formulas can represent or exhibit relationships among different variables, such as correlation or causation.

These models or equations involve one or more variables that depend on other variables to cause an event. For example, when buying, say, pens, the total price of pens (Total) = price for one pen (UnitPrice) * the number of pens bought (Quantity). Hence, our model would be Total = UnitPrice * Quantity. Here, the total price depends on the unit price and the quantity. Hence, the total price is referred to as the dependent variable, and the unit price and the quantity are referred to as independent variables.

✓ In general, a model always describes the relationship between independent and dependent variables. Inferential statistics deals with quantifying relationships between particular variables.
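
The pen-price model can be written as a small function. This is a minimal sketch: the function name `total_price` and the sample values are hypothetical, chosen only to show the dependent variable being computed from its inputs.

```python
def total_price(unit_price: float, quantity: int) -> float:
    """Return Total = UnitPrice * Quantity."""
    return unit_price * quantity


# Hypothetical purchase: 12 pens at 1.5 each.
total = total_price(unit_price=1.5, quantity=12)
print(total)
```

Changing either input changes the output, which is what it means for Total to depend on them.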

• Data product:

Any computer software that uses data as inputs, produces outputs, and provides feedback based on the output to control the environment is referred to as a data product.

✓ A data product is generally based on a model developed during data analysis, for example, a recommendation model that inputs user purchase history and recommends a related item that the user is highly likely to buy.

• Communication:

✓ This stage deals with disseminating the results to end stakeholders to use the result for business intelligence. One of the most notable steps in this stage is data visualization. Visualization deals with information relay techniques such as tables, charts, summary diagrams, and bar charts to show the analyzed result.

THE SIGNIFICANCE OF EDA

Different fields of science, economics, engineering, and marketing accumulate and store data primarily in electronic databases. Appropriate and well-established decisions should be made using the data collected.

It is practically impossible to make sense of datasets containing more than a handful of data points without the help of computer programs. To be certain of the insights that the collected data provides and to make further decisions, data mining is performed, where we go through distinctive analysis processes.

Exploratory data analysis is key, and usually the first exercise in data mining. It allows us to visualize data to understand it as well as to create hypotheses for further analysis. The exploratory analysis centers around creating a synopsis of data or insights for the next steps in a data mining project.

❖ EDA reveals the ground truth about the content without making any underlying assumptions. This is why data scientists use this process to understand what type of modeling and hypotheses can be created.

❖ Key components of exploratory data analysis include summarizing data, statistical analysis, and visualization of data. Python provides expert tools for exploratory analysis, with pandas for summarizing; scipy, along with others, for statistical analysis; and matplotlib and plotly for visualizations.
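
As a minimal sketch of the summarizing component, pandas can produce descriptive statistics for every numeric column in one call. The column names and values below are hypothetical and serve only to illustrate the call.

```python
import pandas as pd

# Hypothetical measurements; values are illustrative only.
df = pd.DataFrame({
    "height_cm": [150.0, 160.0, 170.0, 180.0],
    "weight_kg": [50.0, 60.0, 70.0, 80.0],
})

# Summarizing: count, mean, standard deviation, min, quartiles, and max per column.
summary = df.describe()
print(summary)
```

The resulting table is itself a DataFrame, so individual statistics can be looked up by row and column label, for example `summary.loc["mean", "height_cm"]`.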

STEPS IN EDA

• Problem definition:

Before trying to extract useful insight from the data, it is essential to define the business problem to be solved. The problem definition works as the driving force for executing a data analysis plan. The main tasks involved in problem definition are defining the main objective of the analysis, defining the main deliverables, outlining the main roles and responsibilities, obtaining the current status of the data, defining the timetable, and performing cost/benefit analysis. Based on such a problem definition, an execution plan can be created.

• Data preparation:

This step involves methods for preparing the dataset before actual analysis. In this step, we define the sources of data, define data schemas and tables, understand the main characteristics of the data, clean the dataset, delete non-relevant datasets, transform the data, and divide the data into required chunks for analysis.

• Data analysis:

This is one of the most crucial steps that deals with descriptive statistics and analysis of the data. The main tasks involve summarizing the data, finding the hidden correlation and relationships among the data, developing predictive models, evaluating the models, and calculating the accuracies. Some of the techniques used for data summarization are summary tables, graphs, descriptive statistics, inferential statistics, correlation statistics, searching, grouping, and mathematical models.
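
Three of the summarization techniques listed above (grouping, correlation statistics, and searching) can be sketched with pandas. The dataset, its column names, and the threshold are hypothetical, chosen only for illustration.

```python
import pandas as pd

# Hypothetical sales records; names and values are illustrative only.
sales = pd.DataFrame({
    "region": ["north", "north", "south", "south"],
    "units": [10, 20, 30, 40],
    "revenue": [100.0, 200.0, 300.0, 400.0],
})

# Grouping: a summary table of mean units and revenue per region.
by_region = sales.groupby("region")[["units", "revenue"]].mean()

# Correlation statistics: Pearson correlation between units and revenue.
units_revenue_corr = sales["units"].corr(sales["revenue"])

# Searching: rows where revenue exceeds an arbitrary threshold.
large_orders = sales[sales["revenue"] > 250.0]
```

In this made-up data revenue is an exact multiple of units, so the correlation is as strong as it can be; real data rarely behaves this cleanly.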

• Development and representation of the results:

This step involves presenting the dataset to the target audience in the form of graphs, summary tables, maps, and diagrams. This is also an essential step, as the result analyzed from the dataset should be interpretable by the business stakeholders, which is one of the major goals of EDA. Commonly used graphical analysis techniques include scatter plots, character plots, histograms, box plots, residual plots, mean plots, and others.
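
Two of the graphical techniques named above, the histogram and the box plot, can be sketched with matplotlib. The weight values, the number of bins, and the in-memory PNG buffer are assumptions made for this illustration.

```python
import io

import matplotlib
matplotlib.use("Agg")  # non-interactive backend so the script runs without a display
import matplotlib.pyplot as plt

# Hypothetical weights in kilograms; values are illustrative only.
weights = [52, 58, 61, 64, 66, 68, 70, 73, 75, 95]

fig, (ax_hist, ax_box) = plt.subplots(1, 2, figsize=(8, 3))

# Histogram: how the values are distributed across bins.
counts, bin_edges, _ = ax_hist.hist(weights, bins=5)
ax_hist.set_title("Histogram")
ax_hist.set_xlabel("Weight (kg)")

# Box plot: median, quartiles, and potential outliers.
ax_box.boxplot(weights)
ax_box.set_title("Box plot")

# Render the figure to an in-memory PNG.
fig.tight_layout()
buffer = io.BytesIO()
fig.savefig(buffer, format="png")
```

To write the figure to disk instead, pass a file name to `fig.savefig`.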

MAKING SENSE OF DATA

It is crucial to identify the type of data under analysis. In this section, we are going to learn about different types of data that you can encounter during analysis. Different disciplines store different kinds of data for different purposes. For example, medical researchers store patients’ data, universities store students’ and teachers’ data, and real estate industries store house and building datasets.

A dataset contains many observations about a particular object. For instance, a dataset about patients in a hospital can contain many observations. A patient can be described by a patient identifier (ID), name, address, weight, date of birth, email, and gender. Each of these features that describes a patient is a variable. Each observation can have a specific value for each of these variables. For example:

PATIENT_ID = 1001

Name = Yoshmi Mukhiya

Address = Mannsverk 61, 5094, Bergen, Norway

Email = [email protected]

Weight = 10

Gender = Female

PATIENT_ID | NAME                 | ADDRESS                    | DOB       | EMAIL             | GENDER | WEIGHT
001        | Suresh Kumar Mukhiya | Mannsverk, 61              | 30.12.198 | [email protected] | Male   | 68
002        | Yoshmi Mukhiya       | Mannsverk 61, 5094, Bergen | 10.07.201 | [email protected] | Female | 1
003        | Anju Mukhiya         | Mannsverk 61, 5094, Bergen | 10.12.199 | [email protected] | Female | 24
004        | Asha Gaire           | Butwal, Nepal              | 30.11.199 | [email protected] | Female | 23
005        | Ola Nordmann         | Danmark, Sweden            | 12.12.178 | [email protected] | Male   | 75
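
A dataset like the patient table above maps naturally onto a pandas DataFrame, where each row is an observation and each column is a variable. The sketch below copies only the ID, name, gender, and weight columns; the address, date of birth, and email columns are left out because their values are truncated or redacted in the source.

```python
import pandas as pd

# Rows copied from the patient table; truncated or redacted columns are omitted.
patients = pd.DataFrame({
    "PATIENT_ID": ["001", "002", "003", "004", "005"],
    "NAME": ["Suresh Kumar Mukhiya", "Yoshmi Mukhiya", "Anju Mukhiya",
             "Asha Gaire", "Ola Nordmann"],
    "GENDER": ["Male", "Female", "Female", "Female", "Male"],
    "WEIGHT": [68, 1, 24, 23, 75],
})

# Rows are observations, columns are variables.
n_observations, n_variables = patients.shape

# One observation holds a specific value for each variable.
first_observation = patients.iloc[0].to_dict()
print(n_observations, n_variables)
print(first_observation)
```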

Most datasets broadly fall into two groups: numerical data and categorical data.

1. Numerical data

This data has a sense of measurement involved in it; for example, a person’s age, height, weight, blood pressure, heart rate, temperature, number of teeth, number of bones, and the number of family members. This data is often referred to as quantitative data in statistics. A numerical dataset can be either discrete or continuous.

a) Discrete data

This is data that is countable and its values can be listed out. For example, if we flip a coin, the number of heads in 200 coin flips can take values from 0 to 200 (a finite number of cases). A variable that represents a discrete dataset is referred to as a discrete variable. The discrete variable takes a fixed number of distinct values. For example, the Country variable can have values such as Nepal, India, Norway, and Japan. It is fixed. The Rank variable of a student in a classroom can take values from 1, 2, 3, 4, 5, and so on.
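
The coin-flip example can be simulated to show what "countable" means in practice. This is a sketch using Python's standard `random` module; the fixed seed is an arbitrary choice made so that the run is repeatable.

```python
import random

rng = random.Random(0)  # fixed seed (arbitrary) so the run is repeatable

# Flip a fair coin 200 times and count the heads.
flips = [rng.choice(["H", "T"]) for _ in range(200)]
heads = flips.count("H")

# The count is discrete: it must be one of the 201 integers 0, 1, ..., 200.
possible_values = range(0, 201)
print(heads, heads in possible_values)
```

However many times the simulation is run, the result is always a whole number from this finite list, never a value such as 99.5.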

b) Continuous data

A variable that can have an infinite number of numerical values within a specific range is classified as continuous data. A variable describing continuous data is a continuous variable. For example, what is the temperature of your city today? Can we list a finite set of possible values? We cannot. Similarly, the weight variable in the previous section is a continuous variable.

2. Categorical data

This type of data represents the characteristics of an object; for example, gender, marital status, type of address, or categories of the movies. This data is often referred to as qualitative datasets in statistics. To understand clearly, here are some of the most common types of categorical data you can find in data:

• Gender (Male, Female, Other, or Unknown)
• Marital Status (Annulled, Divorced, Interlocutory, Legally Separated, Married, Polygamous, Never Married, Domestic Partner, Unmarried, Widowed, or Unknown)
• Movie genres (Action, Adventure, Comedy, Crime, Drama, Fantasy, Historical, Horror, Mystery, Philosophical, Political, Romance, Saga, Satire, Science Fiction, Social, Thriller, Urban, or Western)
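
A categorical variable such as Gender can be represented in pandas with an explicit list of allowed categories. In this sketch the category list comes from the Gender example above, while the observed values are hypothetical.

```python
import pandas as pd

# Category list taken from the Gender example; the observed values are hypothetical.
gender_categories = ["Male", "Female", "Other", "Unknown"]
gender = pd.Series(
    pd.Categorical(["Female", "Male", "Female", "Unknown"], categories=gender_categories)
)

# Frequency of each category.
counts = gender.value_counts()
print(counts)
```

Declaring the categories up front keeps the set of allowed values fixed, which matches the idea that a categorical variable takes one of a known set of labels rather than a measured quantity.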
