
Chapter 4

Exploratory Data Analysis


Exploratory Data Analysis
• Exploratory data analysis (EDA) involves using statistics and
visualizations to analyze and identify trends in data sets.
• The primary intent of EDA is to determine whether a predictive model
is a feasible analytical tool for a given business challenge.
• EDA helps data scientists gain an understanding of the data set beyond
the formal modeling or hypothesis testing task.
• Exploratory data analysis is essential for any research analysis, because
it provides the first insights into a data set.
The importance of using EDA to analyze data sets:

• Exploratory data analysis is essential for any business. It allows data
scientists to examine the data before making any assumptions, and it
ensures that the results produced are valid and applicable to business
outcomes and goals.
• Helps identify errors in data sets.
• Gives a better understanding of the data set. 
• Helps detect outliers (an outlier is an observation that lies an abnormal
distance from other values in a random sample from a population) or
anomalous (irregular) events; a minimal detection sketch follows this list.
• Helps understand data set variables and the relationships among them.
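As referenced above, a minimal sketch of the interquartile-range rule for flagging outliers, assuming a
pandas DataFrame with a numeric column (the data and column name here are hypothetical):

import pandas as pd

# Hypothetical example data; in practice the DataFrame comes from your own data set.
df = pd.DataFrame({"amount": [10, 12, 11, 13, 12, 11, 300]})

q1 = df["amount"].quantile(0.25)
q3 = df["amount"].quantile(0.75)
iqr = q3 - q1

# Observations more than 1.5 * IQR outside the quartiles lie an abnormal distance from the rest.
lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr
outliers = df[(df["amount"] < lower) | (df["amount"] > upper)]
print(outliers)   # here only the value 300 is flagged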
Exploratory data analysis is also essential for tackling specific tasks such as:

• Spotting missing and erroneous data;
• Mapping and understanding the underlying structure of your data;
• Identifying the most important variables in your dataset;
• Testing a hypothesis or checking assumptions related to a specific model;
• Establishing a parsimonious model (one that can explain your data using
minimum variables);
• Estimating parameters and figuring out the margins of error.
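A minimal pandas sketch of how several of these tasks are often started (the file and column names are
hypothetical assumptions, not part of a fixed recipe):

import pandas as pd

df = pd.read_csv("customers.csv")      # hypothetical input file

df.info()                              # map the underlying structure: columns, dtypes, non-null counts
print(df.describe())                   # summary statistics help spot erroneous values
print(df.isnull().sum())               # count missing values per column
print(df.corr(numeric_only=True))      # correlations hint at the most important variables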
Data Cleaning
• Data cleansing, also referred to as data cleaning or data
scrubbing, is the process of fixing incorrect, incomplete,
duplicate or otherwise erroneous data in a data set.
• It involves identifying data errors and then changing, updating
or removing data to correct them. Data cleansing improves 
data quality and helps provide more accurate, consistent and
reliable information for decision-making in an organization.
• If data isn't properly cleansed, customer records and other
business data may not be accurate and analytics applications
may provide faulty information. That can lead to flawed
business decisions, misguided strategies, missed opportunities
and operational problems, which ultimately may increase costs
and reduce revenue and profits. IBM estimated that data quality
issues cost organizations in the U.S. a total of $3.1 trillion in
2016, a figure that's still widely cited.
The types of issues that are commonly fixed as part of data
cleansing projects include the following:

• Typos and invalid or missing data. Data cleansing corrects various structural errors in data
sets. These include misspellings and other typographical errors, wrong numerical entries,
syntax errors and missing values, such as blank or null fields that should contain data.
• Inconsistent data. Names, addresses and other attributes are often formatted differently
from system to system. For example, one data set might include a customer's middle initial,
while another doesn't. Data elements such as terms and identifiers may also vary. Data
cleansing helps ensure that data is consistent so it can be analyzed accurately.
• Duplicate data. Data cleansing identifies duplicate records in data sets and either removes
or merges them through the use of deduplication measures. For example, when data from
two systems is combined, duplicate data entries can be reconciled to create single records.
• Irrelevant data. Some data -- outliers or out-of-date entries, for example -- may not be
relevant to analytics applications and could skew their results. Data cleansing 
removes redundant data from data sets, which streamlines data preparation and reduces the
required amount of data processing and storage resources.
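A minimal pandas sketch of fixes for these issue types; the file, columns, and replacement rules below
are hypothetical examples rather than a general-purpose recipe:

import pandas as pd

df = pd.read_csv("orders.csv")                                   # hypothetical input file

# Typos and invalid or missing data: correct known misspellings and fill blank fields.
df["country"] = df["country"].replace({"U.S.A": "USA", "Untied States": "USA"})
df["quantity"] = df["quantity"].fillna(0)

# Inconsistent data: standardize names and date formats so records compare cleanly.
df["name"] = df["name"].str.strip().str.title()
df["order_date"] = pd.to_datetime(df["order_date"], errors="coerce")

# Duplicate data: deduplicate on a business key, keeping the first record.
df = df.drop_duplicates(subset=["order_id"], keep="first")

# Irrelevant data: drop out-of-date entries that would skew analytics results.
df = df[df["order_date"] >= "2015-01-01"]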
Steps in data cleaning
1. Inspection and profiling. First, data is inspected and audited to assess its quality
level and identify issues that need to be fixed. This step usually involves data profiling,
which documents relationships between data elements, checks data quality and
gathers statistics on data sets to help find errors, discrepancies and other problems.
2. Cleaning. This is the heart of the cleansing process, when data errors are corrected
and inconsistent, duplicate and redundant data is addressed.
3. Verification. After the cleaning step is completed, the person or team that did the
work should inspect the data again to verify its cleanliness and make sure it conforms
to internal data quality rules and standards.
4. Reporting. The results of the data cleansing work should then be reported to IT and
business executives to highlight data quality trends and progress. The report could
include the number of issues found and corrected, plus updated metrics on the data's
quality levels.
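A minimal sketch of the inspection/profiling and verification steps, expressed as simple pandas checks
(the column names and quality rules are illustrative assumptions, not internal standards):

import pandas as pd

df = pd.read_csv("orders_clean.csv")                 # hypothetical cleaned data set

# Step 1: profile the data and gather statistics that expose errors and discrepancies.
profile = {
    "rows": len(df),
    "missing_per_column": df.isnull().sum().to_dict(),
    "duplicate_rows": int(df.duplicated().sum()),
}
print(profile)

# Step 3: verify that the cleaned data conforms to the organization's quality rules.
assert df["order_id"].is_unique, "order_id must be unique"
assert df["quantity"].ge(0).all(), "quantity must be non-negative"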
Characteristics of clean data

• accuracy
• completeness
• consistency
• integrity
• timeliness
• uniformity
• validity
The benefits of effective data cleansing

• Improved decision-making. With more accurate data, analytics applications can produce better
results. That enables organizations to make more informed decisions on business strategies and
operations, as well as things like patient care and government programs.
• More effective marketing and sales. Customer data is often wrong, inconsistent or out of date.
Cleaning up the data in customer relationship management and sales systems helps improve the
effectiveness of marketing campaigns and sales efforts.
• Better operational performance. Clean, high-quality data helps organizations avoid inventory
shortages, delivery snafus and other business problems that can result in higher costs, lower
revenues and damaged relationships with customers.
• Increased use of data. Data has become a key corporate asset, but it can't generate business value
if it isn't used. By making data more trustworthy, data cleansing helps convince business managers
and workers to rely on it as part of their jobs.
• Reduced data costs. Data cleansing stops data errors and issues from further propagating in
systems and analytics applications. In the long term, that saves time and money, because IT and
data management teams don't have to continue fixing the same errors in data sets.
DATA PREPARATION
• Data preparation is the process of preparing raw data so that it is
suitable for further processing and analysis. Key steps include
collecting, cleaning, and labeling raw data into a form suitable for
machine learning (ML) algorithms and then exploring and visualizing
the data. Data preparation can take up to 80% of the time spent on an
ML project. Using specialized data preparation tools is important to
optimize this process.
What is the connection between ML and data preparation?

• Data flows through organizations like never before, arriving from everything from smartphones to
smart cities as both structured data and unstructured data (images, documents, geospatial data, and
more). Unstructured data makes up 80% of data today. ML can analyze not just structured data, but
also discover patterns in unstructured data. ML is the process where a computer learns to interpret
data and make decisions and recommendations based on that data. During the learning process, and
later when used to make predictions, incorrect, biased, or incomplete data can result in inaccurate
predictions.
Why is data preparation important for ML?

• Data fuels ML. Harnessing this data to reinvent your business, while
challenging, is imperative to staying relevant now and in the future. It
is survival of the most informed: those who can put their data to work
to make better, more informed decisions respond faster to the unexpected
and uncover new opportunities. Data preparation is an important yet
tedious prerequisite for building accurate ML models and analytics, and
it is the most time-consuming part of an ML project. To minimize this
time investment, data scientists can use tools that help automate data
preparation in various ways.
How do you prepare your data?

• Data preparation follows a series of steps that starts with collecting the right data, followed by
cleaning, labeling, and then validation and visualization.
• Collect data
• Collecting data is the process of assembling all the data you need for ML. Data collection can be
tedious because data resides in many data sources, including on laptops, in data warehouses, in the
cloud, inside applications, and on devices. Finding ways to connect to different data sources can be
challenging. Data volumes are also increasing exponentially, so there is a lot of data to search
through. Additionally, data has vastly different formats and types depending on the source. For
example, video data and tabular data are not easy to use together.
• Clean data
• Cleaning data corrects errors and fills in missing data as a step to ensure data quality. After you
have clean data, you will need to transform it into a consistent, readable format. This process can
include changing field formats like dates and currency, modifying naming conventions, and
correcting values and units of measure so they are consistent.
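A minimal sketch of the transformation described above, putting dates, currency values, column names,
and units into a consistent format (the columns and conversion rules are hypothetical):

import pandas as pd

df = pd.read_csv("sales.csv")                                    # hypothetical input file

# Standardize date fields.
df["sale_date"] = pd.to_datetime(df["sale_date"], errors="coerce")

# Strip currency symbols and thousands separators so amounts become numeric.
df["price"] = df["price"].astype(str).str.replace(r"[$,]", "", regex=True).astype(float)

# Modify naming conventions: lower-case, underscore-separated column names.
df.columns = df.columns.str.strip().str.lower().str.replace(" ", "_")

# Correct units of measure so they are consistent, e.g. grams to kilograms.
df["weight_kg"] = df["weight_g"] / 1000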
How do you prepare your data?

Label data
• Data labeling is the process of identifying raw data (images, text files, videos, and so on) and adding
one or more meaningful and informative labels to provide context so an ML model can learn from it.
For example, labels might indicate whether a photo contains a bird or a car, which words were mentioned
in an audio recording, or whether an X-ray shows an irregularity. Data labeling is required for various
use cases, including computer vision, natural language processing, and speech recognition.
Validate and visualize
• After data is cleaned and labeled, ML teams often explore the data to make sure it is correct and
ready for ML. Visualizations like histograms, scatter plots, box and whisker plots, line plots, and bar
charts are all useful tools to confirm data is correct. Visualizations also help data science teams
complete exploratory data analysis. This process uses visualizations to discover patterns, spot
anomalies, test a hypothesis, or check assumptions. Exploratory data analysis does not require formal
modeling; instead, data science teams can use visualizations to decipher the data.
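A minimal sketch of these visual checks using pandas plotting with matplotlib (the file, columns, and
label field are hypothetical):

import pandas as pd
import matplotlib.pyplot as plt

df = pd.read_csv("measurements.csv")                             # hypothetical labeled data set

df["height"].plot.hist(bins=30, title="Distribution check")      # histogram of one variable
df.plot.scatter(x="height", y="weight")                          # scatter plot of two variables
df.boxplot(column="height", by="label")                          # box and whisker plot per label
plt.show()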
Feature Engineering for Machine Learning

• Feature engineering is the pre-processing step of machine learning
that transforms raw data into features which can be used to create a
predictive model using machine learning or statistical modelling.
Feature engineering in machine learning aims to improve the
performance of models.
What is a feature?

Generally, all machine learning algorithms take input data to generate
output. The input data is usually in tabular form, consisting of rows
(instances or observations) and columns (variables or attributes), and
these attributes are often known as features. For example, in computer
vision an image is an instance and a line in the image could be a feature.
Similarly, in NLP a document can be an observation and the word count
could be a feature. So, we can say a feature is an attribute that impacts
a problem or is useful for the problem.
What is Feature Engineering?

• Feature engineering is the pre-processing step of machine
learning that extracts features from raw data. It helps to
represent an underlying problem to predictive models in a better way,
which as a result improves the accuracy of the model on unseen data.
A predictive model contains predictor variables and an outcome
variable, and the feature engineering process selects the most
useful predictor variables for the model.
Feature engineering in ML contains mainly four processes: Feature Creation,
Transformations, Feature Extraction, and Feature Selection.

• These processes are described below:
1. Feature Creation: Feature creation is finding the most useful variables to be used in a predictive model. The process is
subjective, and it requires human creativity and intervention. New features are created by combining existing features using
addition, subtraction, or ratios, and these new features have great flexibility.
2. Transformations: The transformation step of feature engineering involves adjusting the predictor variables to improve the
accuracy and performance of the model. For example, it ensures that the model can accept a variety of data as input and that
all the variables are on the same scale, making the model easier to understand. It improves the model's accuracy and ensures
that all the features are within an acceptable range to avoid computational errors.
3. Feature Extraction: Feature extraction is an automated feature engineering process that generates new variables by extracting
them from the raw data. The main aim of this step is to reduce the volume of data so that it can be easily used and managed for
data modelling. Feature extraction methods include cluster analysis, text analytics, edge detection algorithms, and
principal components analysis (PCA).
4. Feature Selection: While developing a machine learning model, only a few variables in the dataset are useful for building
the model; the rest of the features are either redundant or irrelevant. If we input the dataset with all of these redundant and
irrelevant features, it may negatively impact and reduce the overall performance and accuracy of the model. Hence it is very
important to identify and select the most appropriate features from the data and remove the irrelevant or less important
features, which is done with the help of feature selection in machine learning. "Feature selection is a way of selecting the
subset of the most relevant features from the original feature set by removing redundant, irrelevant, or noisy features."
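A minimal scikit-learn and pandas sketch illustrating each of the four processes on one hypothetical
data set (the file, column names, target, and parameter choices such as k=3 are assumptions for
illustration only):

import pandas as pd
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA
from sklearn.feature_selection import SelectKBest, f_classif

df = pd.read_csv("houses.csv")                       # hypothetical data set with a 'price_band' target

# 1. Feature creation: derive a new variable as a ratio of existing features.
df["price_per_sqft"] = df["price"] / df["area_sqft"]

X = df[["price", "area_sqft", "price_per_sqft", "rooms", "age"]]
y = df["price_band"]

# 2. Transformation: put all predictor variables on the same scale.
X_scaled = StandardScaler().fit_transform(X)

# 3. Feature extraction: compress the predictors into fewer components with PCA.
X_pca = PCA(n_components=3).fit_transform(X_scaled)

# 4. Feature selection: keep only the k features most relevant to the target.
X_selected = SelectKBest(score_func=f_classif, k=3).fit_transform(X_scaled, y)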
