Chapter 4
• Typos and invalid or missing data. Data cleansing corrects various structural errors in data
sets. These include misspellings and other typographical errors, wrong numerical entries,
syntax errors and missing values, such as blank or null fields that should contain data.
• Inconsistent data. Names, addresses and other attributes are often formatted differently
from system to system. For example, one data set might include a customer's middle initial,
while another doesn't. Data elements such as terms and identifiers may also vary. Data
cleansing helps ensure that data is consistent so it can be analyzed accurately.
• Duplicate data. Data cleansing identifies duplicate records in data sets and either removes
or merges them through the use of deduplication measures. For example, when data from
two systems is combined, duplicate data entries can be reconciled to create single records.
• Irrelevant data. Some data -- outliers or out-of-date entries, for example -- may not be
relevant to analytics applications and could skew their results. Data cleansing
removes irrelevant and redundant data from data sets, which streamlines data preparation
and reduces the required amount of data processing and storage resources.
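The fixes above can be sketched with pandas. The records and values below are hypothetical, chosen only to show a typo correction, a missing-value fill, and deduplication in one pass:

```python
import pandas as pd

# Hypothetical customer records illustrating the issues above:
# a misspelling, a missing (null) field, and a duplicate entry.
df = pd.DataFrame({
    "name": ["Alice Smith", "Bob Jones", "Bob Jones", "Cara Diaz"],
    "city": ["New Yrok", "Boston", "Boston", None],
})

# Typos: fix known misspellings with a replacement map.
df["city"] = df["city"].replace({"New Yrok": "New York"})

# Missing values: fill blank/null fields with an explicit placeholder.
df["city"] = df["city"].fillna("Unknown")

# Duplicates: drop exact duplicate rows (deduplication).
df = df.drop_duplicates().reset_index(drop=True)

print(df)
```

In practice the replacement map would come from profiling the data, and the fill strategy (placeholder, mean, or dropping the row) depends on the downstream analysis.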
Steps in data cleaning
1. Inspection and profiling. First, data is inspected and audited to assess its quality
level and identify issues that need to be fixed. This step usually involves data profiling,
which documents relationships between data elements, checks data quality and
gathers statistics on data sets to help find errors, discrepancies and other problems.
2. Cleaning. This is the heart of the cleansing process, when data errors are corrected
and inconsistent, duplicate and redundant data is addressed.
3. Verification. After the cleaning step is completed, the person or team that did the
work should inspect the data again to verify its cleanliness and make sure it conforms
to internal data quality rules and standards.
4. Reporting. The results of the data cleansing work should then be reported to IT and
business executives to highlight data quality trends and progress. The report could
include the number of issues found and corrected, plus updated metrics on the data's
quality levels.
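The first step, inspection and profiling, can be sketched as a few summary statistics. The data set below is hypothetical, seeded with one missing value, one duplicate ID, and one suspicious negative amount:

```python
import pandas as pd

# Hypothetical data set to profile before cleaning.
df = pd.DataFrame({
    "order_id": [1, 2, 2, 4],
    "amount": [19.99, None, 15.00, -3.50],
})

# Step 1: inspection and profiling -- gather statistics on the data
# set to find errors, discrepancies and other problems.
profile = {
    "rows": len(df),
    "missing_amounts": int(df["amount"].isna().sum()),
    "duplicate_ids": int(df["order_id"].duplicated().sum()),
    "negative_amounts": int((df["amount"] < 0).sum()),
}
print(profile)
```

Counts like these feed both the cleaning step (what to fix) and the reporting step (issues found and corrected).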
Characteristics of clean data
• accuracy
• completeness
• consistency
• integrity
• timeliness
• uniformity
• validity
The benefits of effective data cleansing
• Data fuels ML. Harnessing this data to reinvent your business, while
challenging, is imperative to staying relevant now and in the future. It
is survival of the most informed: those who can put their data to
work make better, more informed decisions, respond faster to the
unexpected and uncover new opportunities. Data preparation, though
tedious, is a prerequisite for building accurate ML models and
analytics, and it is often the most time-consuming part of an ML
project. To minimize this time investment, data scientists can use
tools that help automate data preparation in various ways.
How do you prepare your data?
• Data preparation follows a series of steps that starts with collecting the right data, followed by
cleaning, labeling, and then validation and visualization.
• Collect data
• Collecting data is the process of assembling all the data you need for ML. Data collection can be
tedious because data resides in many data sources, including on laptops, in data warehouses, in the
cloud, inside applications, and on devices. Finding ways to connect to different data sources can be
challenging. Data volumes are also increasing exponentially, so there is a lot of data to search
through. Additionally, data has vastly different formats and types depending on the source. For
example, video data and tabular data are not easy to use together.
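One small sketch of the collection step: pulling the same kind of records from two differently formatted sources and combining them. Both sources here are hypothetical stand-ins (an in-memory CSV export and a list of application records):

```python
import io

import pandas as pd

# Hypothetical: the same customer data arrives from two sources in
# different formats -- a CSV export and records from an application.
csv_source = io.StringIO("customer_id,region\n101,East\n102,West\n")
app_records = [{"customer_id": 103, "region": "North"}]

df_csv = pd.read_csv(csv_source)
df_app = pd.DataFrame(app_records)

# Combine both sources into a single data set for ML.
combined = pd.concat([df_csv, df_app], ignore_index=True)
print(combined)
```

Real pipelines connect to warehouses, cloud storage and devices instead of in-memory strings, but the pattern of loading each source into a common structure and concatenating is the same.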
• Clean data
• Cleaning data corrects errors and fills in missing data as a step to ensure data quality. After you
have clean data, you will need to transform it into a consistent, readable format. This process can
include changing field formats like dates and currency, modifying naming conventions, and
correcting values and units of measure so they are consistent.
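The transformations just described (date formats, currency fields, units of measure) can be sketched as follows. The column names and conversion factor are illustrative, not from the source:

```python
import pandas as pd

# Hypothetical records with inconsistent date formats, a currency
# string field, and a unit of measure to standardize.
df = pd.DataFrame({
    "order_date": ["2023-01-15", "01/20/2023"],
    "price": ["$19.99", "$5.00"],
    "weight_lb": [2.0, 10.0],
})

# Normalize dates: parse each value into a single datetime format.
df["order_date"] = df["order_date"].apply(pd.to_datetime)

# Strip currency symbols and convert to numeric values.
df["price"] = df["price"].str.replace("$", "", regex=False).astype(float)

# Convert units so all weights use kilograms (1 lb = 0.453592 kg).
df["weight_kg"] = df["weight_lb"] * 0.453592
df = df.drop(columns=["weight_lb"])

print(df)
```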
Label data
• Data labeling is the process of identifying raw data (images, text files, videos, and so on) and adding
one or more meaningful and informative labels to provide context so an ML model can learn from it.
For example, labels might indicate if a photo contains a bird or car, which words were mentioned in
an audio recording, or if an X-ray discovered an irregularity. Data labeling is required for various
use cases, including computer vision, natural language processing, and speech recognition.
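In code, the result of labeling is simply raw data paired with its labels. The file names and labels below are hypothetical; in practice the annotations would come from human annotators or a labeling tool:

```python
# Hypothetical: attaching labels to raw image files so a computer
# vision model can learn from them.
raw_files = ["img_001.jpg", "img_002.jpg", "img_003.jpg"]

# Labels from annotators, keyed by file name (hard-coded here).
annotations = {"img_001.jpg": "bird", "img_002.jpg": "car", "img_003.jpg": "bird"}

# Pair each raw item with its label to form a training set.
labeled = [(f, annotations[f]) for f in raw_files]
print(labeled)
```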
Validate and visualize
• After data is cleaned and labeled, ML teams often explore the data to make sure it is correct and
ready for ML. Visualizations like histograms, scatter plots, box and whisker plots, line plots, and bar
charts are all useful tools to confirm data is correct. Visualizations also help data
science teams complete exploratory data analysis. This process uses visualizations to discover
patterns, spot anomalies, test a hypothesis, or check assumptions. Exploratory data analysis does not
require formal modeling; instead, data science teams can use visualizations to decipher the data.
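The numeric checks behind such plots can be sketched directly. The readings below are hypothetical: histogram bin counts summarize the distribution, and the 1.5 × IQR rule used by box-and-whisker plots flags the anomaly:

```python
import numpy as np

# Hypothetical sensor readings with one anomalous value.
readings = np.array([10.1, 9.8, 10.3, 10.0, 9.9, 10.2, 25.0])

# Histogram bins summarize the distribution (what a histogram plots).
counts, edges = np.histogram(readings, bins=4)

# Box-plot logic: flag points beyond 1.5 * IQR as anomalies.
q1, q3 = np.percentile(readings, [25, 75])
iqr = q3 - q1
outliers = readings[(readings < q1 - 1.5 * iqr) | (readings > q3 + 1.5 * iqr)]

print(counts, outliers)
```

In an interactive setting the same data would be drawn with a plotting library such as matplotlib; the underlying statistics are identical.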
• Feature Engineering for Machine Learning
• Feature engineering is the preprocessing step of machine
learning that transforms raw data into features that can be
used to create a predictive model with machine learning or
statistical modelling. Feature engineering aims to improve
the performance of models.
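A brief sketch of deriving features from raw data. The transaction fields and derived features below are hypothetical examples of the kind of inputs feature engineering produces:

```python
import pandas as pd

# Hypothetical raw transaction data; feature engineering derives new
# predictive inputs from it.
df = pd.DataFrame({
    "timestamp": pd.to_datetime(["2023-03-04 09:30", "2023-03-06 22:15"]),
    "amount": [120.0, 30.0],
    "n_items": [4, 2],
})

# Derived features: time of day, weekend flag, and price per item.
df["hour"] = df["timestamp"].dt.hour
df["is_weekend"] = df["timestamp"].dt.dayofweek >= 5
df["amount_per_item"] = df["amount"] / df["n_items"]

print(df[["hour", "is_weekend", "amount_per_item"]])
```

Which features help depends on the model and the problem; candidates like these are usually evaluated against a validation set.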
What is a feature?