
BI - Lecture04B - Intro To DataWrangling and EDA


Intro to Data Wrangling and EDA


CS 459 Business Intelligence
Data Wrangling
February 24 CS459 - Business Intelligence - Abeera Tariq 2
(also called Data Munging)
• Data wrangling is the process of gathering, collecting, and
transforming raw data into another format for better
understanding, decision-making, accessing, and analysis
in less time.
• "All the activity that you do on the raw data to make it “clean”
enough to input to your analytical algorithm is called data
wrangling or data munging." (Shubham Simar Tomar, 2016)

1. Discovering
• Getting familiar with the data
• Identify multiple ways to use data for
different purposes – check the
ingredients before cooking a meal
• Data possibly collected from multiple
sources; formatting is required to
understand relationships.
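
A minimal sketch of the discovering step, using only the Python standard library. The sample rows and column names are hypothetical; the idea is simply to see which columns exist, how many values are missing, and how many distinct values each column holds before doing anything else.

```python
from collections import Counter

# Hypothetical sample records merged from two sources; None marks a missing value.
rows = [
    {"customer": "Ali", "city": "Karachi", "amount": 1200},
    {"customer": "Sara", "city": None, "amount": 800},
    {"customer": "Omar", "city": "Lahore", "amount": None},
    {"customer": "Ali", "city": "Karachi", "amount": 450},
]

# Which columns exist across all rows?
columns = sorted({key for row in rows for key in row})

# How many values are missing in each column?
missing_per_column = {
    col: sum(1 for row in rows if row.get(col) is None) for col in columns
}

# How many distinct non-missing values does each column hold?
distinct_per_column = {
    col: len({row.get(col) for row in rows if row.get(col) is not None})
    for col in columns
}

# Which Python types appear in each column (a quick hint at formatting problems)?
types_per_column = {
    col: Counter(type(row.get(col)).__name__ for row in rows) for col in columns
}

for col in columns:
    print(col, missing_per_column[col], distinct_per_column[col], dict(types_per_column[col]))
```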

2. Structuring

• Data structuring transforms raw data into a structured format for
easier interpretation and analysis.
• Raw data is often of little use to analysts as it stands, because
it can be incomplete or hard to interpret.
• It needs to be parsed so that analysts can extract relevant
information.
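
One way to picture structuring is parsing free-text lines into typed records. The sketch below assumes a made-up line format (date, outlet, item, quantity, price separated by " | "); the format, field names, and the `parse_line` helper are illustrative and not taken from the lecture.

```python
import re

# Hypothetical raw export: one free-text line per sale (the format is an assumption).
raw_lines = [
    "2024-02-01 | outlet=Clifton | item=Milk 1L | qty=3 | price=250.00",
    "2024-02-01 | outlet=Saddar | item=Bread | qty=1 | price=120.50",
    "corrupted line without fields",
]

LINE_PATTERN = re.compile(
    r"^(?P<date>\d{4}-\d{2}-\d{2}) \| outlet=(?P<outlet>[^|]+?) \| "
    r"item=(?P<item>[^|]+?) \| qty=(?P<qty>\d+) \| price=(?P<price>\d+\.\d+)$"
)


def parse_line(line):
    """Return a structured record, or None if the line does not match the format."""
    match = LINE_PATTERN.match(line)
    if match is None:
        return None
    record = match.groupdict()
    record["qty"] = int(record["qty"])        # convert text to numbers
    record["price"] = float(record["price"])
    return record


records = [parse_line(line) for line in raw_lines]
structured = [r for r in records if r is not None]
rejected = [line for line, r in zip(raw_lines, records) if r is None]

print(structured)
print(rejected)
```

Lines that cannot be parsed are kept in `rejected` rather than silently dropped, so they can be inspected later.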

3. Cleaning
• People often use data cleaning
and data wrangling
interchangeably. However,
data cleaning is one step in
the data wrangling process.
• Cleaning resolves issues in the data, such as duplicate,
outdated, incomplete, incorrect, or inconsistent values.
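
As a rough illustration of cleaning, the sketch below normalizes whitespace and letter case, drops incomplete rows, and removes duplicates. The records and the choice of email as the duplicate key are assumptions made for the example.

```python
# Hypothetical customer records with typical "dirty" entries.
raw_customers = [
    {"name": "  Ali Khan ", "email": "ALI@example.com"},
    {"name": "Ali Khan", "email": "ali@example.com"},    # duplicate once normalized
    {"name": "Sara Ahmed", "email": " sara@example.com"},
    {"name": "", "email": "unknown@example.com"},         # incomplete: no name
]


def normalize(record):
    """Collapse extra whitespace in the name and lowercase the email."""
    return {
        "name": " ".join(record["name"].split()),
        "email": record["email"].strip().lower(),
    }


cleaned = []
seen_emails = set()
for record in map(normalize, raw_customers):
    if not record["name"]:               # drop incomplete rows
        continue
    if record["email"] in seen_emails:   # drop duplicates (email is the assumed key)
        continue
    seen_emails.add(record["email"])
    cleaned.append(record)

print(cleaned)
```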

4. Enriching

• After transforming data into a usable format, check whether
data from other datasets can make your analysis more
effective.
• Enrichment helps improve the quality of the data if it does not
meet the requirements.
• Example: combine two databases where one contains phone
numbers and the other doesn't.
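
The phone-number example can be sketched as a left join on a shared key. The two small tables and the `customer_id` key are hypothetical; customers without a match simply keep a missing phone value.

```python
# Hypothetical dataset without phone numbers.
customers = [
    {"customer_id": 1, "name": "Ali Khan"},
    {"customer_id": 2, "name": "Sara Ahmed"},
    {"customer_id": 3, "name": "Omar Malik"},
]

# Hypothetical second dataset that does contain phone numbers.
contacts = [
    {"customer_id": 1, "phone": "0300-0000001"},
    {"customer_id": 3, "phone": "0300-0000003"},
]

# Build a lookup table, then attach the phone number where one exists.
phone_by_id = {c["customer_id"]: c["phone"] for c in contacts}
enriched = [
    {**cust, "phone": phone_by_id.get(cust["customer_id"])} for cust in customers
]

for row in enriched:
    print(row)
```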
5. Validating
• Check for data
accuracy and
quality.
• Data validation
ensures that data
is fit for analysis.
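
Validation is often expressed as a set of explicit rules. The sketch below checks three illustrative rules (unique order id, non-negative amount, no future dates); the rules, field names, and reference date are assumptions, not a complete validation scheme.

```python
from datetime import date

# Hypothetical order records, some of which break a rule.
records = [
    {"order_id": 1, "amount": 250.0, "order_date": date(2024, 2, 1)},
    {"order_id": 2, "amount": -40.0, "order_date": date(2024, 2, 3)},   # negative amount
    {"order_id": 2, "amount": 90.0, "order_date": date(2024, 2, 4)},    # repeated id
    {"order_id": 4, "amount": 120.0, "order_date": date(2030, 1, 1)},   # future date
]
AS_OF = date(2024, 2, 24)  # assumed "today" for the check


def validate(rows, as_of):
    """Return a list of (order_id, problem) pairs; an empty list means all rules pass."""
    problems = []
    seen_ids = set()
    for row in rows:
        if row["order_id"] in seen_ids:
            problems.append((row["order_id"], "duplicate order_id"))
        seen_ids.add(row["order_id"])
        if row["amount"] < 0:
            problems.append((row["order_id"], "negative amount"))
        if row["order_date"] > as_of:
            problems.append((row["order_id"], "order_date in the future"))
    return problems


problems = validate(records, AS_OF)
for order_id, problem in problems:
    print(order_id, problem)
```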

6. Publishing

• Publish the data after validating it.
• The data can be shared as a report or an electronic document, or
deposited into a database, where it can be processed further to
create larger and more complex structures such as data warehouses.
• Once published, the data is ready for analysis.
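
As one possible form of publishing, validated rows can be loaded into a database table. The sketch uses an in-memory SQLite database from the Python standard library; the table name, columns, and rows are hypothetical.

```python
import sqlite3

# Hypothetical validated rows: (sale_date, outlet, revenue).
validated = [
    ("2024-02-01", "Clifton", 750.0),
    ("2024-02-01", "Saddar", 120.5),
]

# An in-memory database keeps the example self-contained.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE sales (sale_date TEXT, outlet TEXT, revenue REAL)")
conn.executemany("INSERT INTO sales VALUES (?, ?, ?)", validated)
conn.commit()

# Once published, the data can be queried by downstream analysis.
row_count = conn.execute("SELECT COUNT(*) FROM sales").fetchone()[0]
total_revenue = conn.execute("SELECT SUM(revenue) FROM sales").fetchone()[0]
print(row_count, total_revenue)
```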

Summarizing the 6 Steps of Data Wrangling
1. Discovering
2. Structuring
3. Cleaning
4. Enriching
5. Validating
6. Publishing

Importance of
Data Wrangling
• In data science and data analysis, the amount of work that goes
into data wrangling is often summed up by the 80/20 rule: data
scientists are commonly said to spend about 80% of their time
‘wrangling’ or preparing data and 20% of their time actually
analyzing it.

Exploratory Data Analysis (EDA)

In data science, exploratory data analysis involves examining the
distribution of the variables in the dataset, identifying outliers,
finding trends and patterns, and looking for relationships between
variables, for example by using heat maps or correlation metrics.
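
A heat map of relationships is usually built from pairwise correlation coefficients. The sketch below computes one such coefficient (Pearson's r) by hand for two hypothetical variables; the numbers are invented for illustration.

```python
from math import sqrt

# Hypothetical monthly figures.
ad_spend = [10, 20, 30, 40, 50]
revenue = [25, 41, 62, 79, 103]


def pearson(xs, ys):
    """Pearson correlation: covariance divided by the product of the spreads."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / sqrt(var_x * var_y)


r = pearson(ad_spend, revenue)
print(round(r, 3))  # close to +1: the two variables rise together
```

A value near +1 or -1 suggests a strong linear relationship, while a value near 0 suggests little linear relationship.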

EDA

Data Wrangling

Data Cleaning

Types of dirty data

Duplicate data

Outdated Data

Incomplete data

Missing Values

Missing Values

• Every value in every column has a certain probability of being
missing (Rubin, 1976).
• Generally, every column in a dataset has a probability distribution,
which describes how likely each of its values is to occur
(e.g., bell curve, exponential, logarithmic).
• Missing Completely at Random (MCAR)
• Missing at Random (MAR)
• Missing Not at Random (MNAR)

Missing Values - MCAR

• Missing Completely at Random (MCAR):
• Every column value has the same probability of being missing.
• The causes of the missing data are unrelated to the data.
• A product weighing scale generates missing data because its
batteries have run out.
• Sales data for an outlet is missing because the outlet was closed
for maintenance.
• ATM data is missing over some time period because the ATM was
being refilled with cash, or because a technical glitch affected
multiple locations.
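
MCAR can be simulated by deleting each value with the same fixed probability, regardless of what the value or its group is. The sketch below does this for hypothetical product weights; the 10% probability, the categories, and the seed are arbitrary choices for the illustration.

```python
import random

rng = random.Random(42)          # fixed seed so the run is repeatable
MISSING_PROBABILITY = 0.10       # the same for every value (assumed figure)

# Hypothetical products with a category and a measured weight.
products = [
    {"category": rng.choice(["light", "heavy"]), "weight": rng.uniform(0.1, 50.0)}
    for _ in range(20000)
]


def apply_mcar(rows, probability, rng):
    """Blank out each weight with the same probability, ignoring all other data."""
    return [
        {**row, "weight": None if rng.random() < probability else row["weight"]}
        for row in rows
    ]


def missing_rate(rows, category):
    """Share of missing weights within one category."""
    group = [row for row in rows if row["category"] == category]
    return sum(row["weight"] is None for row in group) / len(group)


observed = apply_mcar(products, MISSING_PROBABILITY, rng)
print(missing_rate(observed, "light"), missing_rate(observed, "heavy"))
```

Under MCAR the missing rate should come out roughly the same in every group, here close to 10% for both categories.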

Missing Values - MAR

• Missing at Random (MAR):
• Different column values (e.g., different groups) can have different
probabilities of being missing. This is the most common case.
• The causes of the missing data are related to other, observed data.
• A weighing scale produces more missing values for products in
heavier categories, where the category is recorded.
• Sales data is missing more often for teenage customers, because no
promotion targets teenagers.
• ATM data is missing for a time period because of weekends or
holidays, or because of lower transaction volumes. The missingness is
related to an observed variable (day of the week) but not directly to
the missing values themselves.
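
The ATM example can be simulated by letting the probability of a missing value depend on an observed variable (the day of the week) and on nothing else. The probabilities, record counts, and seed below are assumptions chosen for illustration.

```python
import random

rng = random.Random(7)  # fixed seed so the run is repeatable
DAYS = ["Mon", "Tue", "Wed", "Thu", "Fri", "Sat", "Sun"]
# Missingness depends only on the observed day type (assumed figures).
MISSING_PROBABILITY = {"weekday": 0.05, "weekend": 0.40}


def day_type(day):
    return "weekend" if day in ("Sat", "Sun") else "weekday"


# Hypothetical ATM records: the day is always observed, the count may go missing.
records = [
    {"day": rng.choice(DAYS), "transactions": rng.randint(50, 500)}
    for _ in range(20000)
]

observed = [
    {
        **rec,
        "transactions": None
        if rng.random() < MISSING_PROBABILITY[day_type(rec["day"])]
        else rec["transactions"],
    }
    for rec in records
]


def missing_rate(rows, kind):
    """Share of missing transaction counts for weekdays or weekends."""
    group = [row for row in rows if day_type(row["day"]) == kind]
    return sum(row["transactions"] is None for row in group) / len(group)


print(missing_rate(observed, "weekday"), missing_rate(observed, "weekend"))
```

The missing rate differs between groups, but within each group it does not depend on the transaction count itself, which is what distinguishes MAR from MNAR.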
Missing Values - MNAR

• Missing Not at Random (MNAR):
• The case cannot be categorized as MCAR or MAR: the probability of
being missing varies for reasons that are unknown or unobserved.
• A weighing scale gives more missing values over time because it is
wearing out, and this goes undetected.
• Sales data goes missing more and more over time because customers
are relocating, and this goes undetected.
• ATM data thins out as fewer and fewer people use the ATM for fear
of theft.

Incorrect/Inaccurate Data

• If an online store mistakenly records double the number of sales in
a certain month, the average customer spend appears inflated.
• While the data might make it seem like the store is
performing well, this false information could lead to poor
decision-making.
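
A small worked example of the effect, assuming every sale in the month was recorded twice while the number of customers stayed the same. The figures are invented.

```python
# Hypothetical true sales for one month, one sale per customer.
true_sales = [1200.0, 800.0, 450.0, 1550.0]
customers = len(true_sales)

# The faulty system records every sale twice.
recorded_sales = true_sales + true_sales

true_average = sum(true_sales) / customers          # 4000 / 4
recorded_average = sum(recorded_sales) / customers  # 8000 / 4

print(true_average, recorded_average)
```

The recorded average spend per customer is twice the true value, even though nothing about customer behaviour has changed.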

• Incorrect data leads to
incorrect insights
• Will the analysis be
useful?
• A waste of time, energy
and resources.

Inconsistent Data

Types of dirty data

Data Cleaning

Problems with the Data

Interpreting Histograms and Box Plots

What is a Histogram?
A histogram is a graphical representation of the frequency
distribution of a continuous data series using rectangles.
The x-axis of the graph represents the class intervals, and the
y-axis shows the frequency corresponding to each class interval.
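
The counting behind a histogram can be sketched in a few lines: each value is assigned to a class interval and the frequencies are tallied. The test scores and the interval width of 10 are hypothetical.

```python
# Hypothetical test scores.
scores = [52, 55, 61, 64, 67, 68, 70, 71, 73, 75, 76, 78, 81, 84, 88, 95]


def histogram(values, start, width, bins):
    """Count how many values fall into each interval [start + i*width, start + (i+1)*width)."""
    counts = [0] * bins
    for value in values:
        index = int((value - start) // width)
        if 0 <= index < bins:      # values outside the covered range are ignored
            counts[index] += 1
    return counts


counts = histogram(scores, start=50, width=10, bins=5)

# Text rendering: class interval on the left, one '#' per observation.
for i, count in enumerate(counts):
    low = 50 + i * 10
    print(f"{low}-{low + 9}: {'#' * count}")
```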

Analyzing Histograms:
Shape, Skew and Kurtosis

Mean, Median, Mode

• Mean: The "average" number; found by adding all data points and
dividing by the number of data points.
(Sensitive to outliers.)
• Median: The middle number; found by ordering all data points and
picking out the one in the middle (or if there are two middle numbers,
taking the mean of those two numbers).
(Largely unaffected by outliers.)
• Mode: The most frequent number, that is, the number that occurs the
highest number of times.
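
The effect of an outlier on the three measures can be checked with Python's `statistics` module. The salary figures below are hypothetical.

```python
from statistics import mean, median, mode

# Hypothetical salaries (in thousands), then the same list with one extreme value.
salaries = [40, 42, 45, 45, 48, 50]
with_outlier = salaries + [400]

print(mean(salaries), median(salaries), mode(salaries))
print(mean(with_outlier), median(with_outlier), mode(with_outlier))
```

Adding the single extreme value roughly doubles the mean, while the median and the mode stay where they were.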

Skew

• Skewness is a statistical measure that assesses the asymmetry of a
probability distribution. It quantifies the extent to which the data
is skewed or shifted to one side.
• Skew can be positive (long tail on the right) or negative (long
tail on the left).
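
One common way to put a number on skew is the moment-based coefficient: the third central moment divided by the variance raised to the power 1.5. The sketch below uses that population formula on invented data; other definitions (for example, sample-adjusted versions) give slightly different values.

```python
def skewness(values):
    """Moment-based (population) skewness: m3 / m2**1.5."""
    n = len(values)
    mean = sum(values) / n
    m2 = sum((v - mean) ** 2 for v in values) / n   # variance
    m3 = sum((v - mean) ** 3 for v in values) / n   # third central moment
    return m3 / m2 ** 1.5


# Hypothetical datasets.
symmetric = [1, 2, 3, 4, 5]
right_tailed = [1, 2, 2, 3, 3, 3, 4, 20]        # long tail on the right
left_tailed = [-v for v in right_tailed]         # mirror image: long tail on the left

print(skewness(symmetric), skewness(right_tailed), skewness(left_tailed))
```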

Kurtosis

• Kurtosis is a statistical measure that quantifies the shape of a
probability distribution. It provides information about the tails
and peakedness of the distribution compared to a normal distribution.
• Positive (excess) kurtosis indicates heavier tails and a more peaked
distribution than the normal, while negative (excess) kurtosis
suggests lighter tails and a flatter distribution.
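
"Positive" and "negative" kurtosis are usually read relative to the normal distribution, whose kurtosis is 3. The sketch below computes excess kurtosis (fourth central moment over squared variance, minus 3) with the population formula; the two datasets are invented to show one flat and one heavy-tailed case.

```python
def excess_kurtosis(values):
    """Moment-based (population) excess kurtosis: m4 / m2**2 - 3."""
    n = len(values)
    mean = sum(values) / n
    m2 = sum((v - mean) ** 2 for v in values) / n   # variance
    m4 = sum((v - mean) ** 4 for v in values) / n   # fourth central moment
    return m4 / m2 ** 2 - 3


# Hypothetical datasets.
flat = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]               # evenly spread, light tails
heavy_tailed = [-10, -1, -1, 0, 0, 0, 0, 1, 1, 10]   # mostly central, two extremes

print(excess_kurtosis(flat), excess_kurtosis(heavy_tailed))
```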
Example: Scores on a Test



Interpreting Box Plots
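
A box plot is typically drawn from a five-number summary, with points beyond 1.5 times the interquartile range flagged as outliers. The sketch below computes those quantities for hypothetical scores; the "inclusive" quartile method and the 1.5 multiplier are common conventions assumed here, and plotting tools may use slightly different ones.

```python
from statistics import quantiles

# Hypothetical test scores, including one unusually low score.
scores = [20, 52, 55, 61, 64, 67, 68, 70, 71, 73, 75, 76, 78, 81, 84, 88, 95]

# Quartiles: the box spans q1 to q3, with a line at the median.
q1, med, q3 = quantiles(scores, n=4, method="inclusive")
iqr = q3 - q1

# Fences: points outside them are drawn individually as outliers.
lower_fence = q1 - 1.5 * iqr
upper_fence = q3 + 1.5 * iqr
outliers = [s for s in scores if s < lower_fence or s > upper_fence]

# Whiskers extend to the most extreme points that are not outliers.
low_whisker = min(s for s in scores if s >= lower_fence)
high_whisker = max(s for s in scores if s <= upper_fence)

print(q1, med, q3, iqr)
print(low_whisker, high_whisker, outliers)
```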

Histograms and Box Plots
