L1-D2 Basics of Data Preparation and Quality
www.infocepts.com L2D1
Data Preparation
Data preparation is the process of preparing (or pre-processing) raw data into refined information assets.
To get correct analysis results, the analyst needs data that is appropriate in all four aspects of data, viz. quality, attributes (variables), timeliness and granularity.
Variables (V): “Data must have all required variables important for analysis, or it should at least provide the means to derive the required attributes correctly.”
Timeliness (T): “Lifetime and recency of data must be at the required frequency or below to derive time-bound conclusions from analysis.”
Granularity (G): “The raw data must have been collected at the required attributes so as to ease its transformation as needed in analysis.”
Quality (Q): “Quality data appropriate for analysis must be Correct, Complete, Consistent, Valid, Standardized, and adhering to business rules.”
Types of data values
Variables
Variables are units of data that can change between different cases.
The type of values a variable holds decides the type of the variable.
Statistical methods can only be used with certain data types.
Continuous data must be analyzed differently from categorical data; otherwise the analysis will be wrong.
Variables can be analyzed on their own (univariate analysis), with one other variable (bivariate analysis) or with a number of others (multivariate analysis).
Three universal rules of defining a variable are:
1. One variable can only contain one type of value.
2. One instance of a variable represents only one value.
3. A value can be assigned to a variable or its instance, but not vice versa, i.e. variables are unidirectional.
A variable can have an empty or null value.
Example: say the variable Order count is X; then its instances are X1 = 33, X2 = 26 and so on.
Figure labels: Quantitative; Qualitative, Categorical, Nominal, Non-binary variables
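The point that continuous and categorical data need different analysis can be sketched in a few lines of Python. This is only an illustration: the function name summarize and the ship method values "Road" and "Air" are hypothetical, while the order counts 33 and 26 come from the example on this slide.

```python
from collections import Counter
from statistics import mean


def summarize(values):
    """Univariate summary that depends on the type of the variable."""
    # A variable may hold empty (null) values; leave them out of the summary.
    present = [v for v in values if v is not None]
    if not present:
        return {"type": "unknown"}
    is_number = all(
        isinstance(v, (int, float)) and not isinstance(v, bool) for v in present
    )
    if is_number:
        # Quantitative variable: a mean is meaningful.
        return {"type": "quantitative", "mean": mean(present)}
    # Categorical variable: count how often each value occurs instead.
    return {"type": "categorical", "counts": dict(Counter(present))}


order_count = [33, 26, None]          # instances X1, X2 and one null value
ship_method = ["Road", "Air", "Road"]  # hypothetical categorical values

print(summarize(order_count))
print(summarize(ship_method))
```

Taking a mean of the ship method values would be meaningless, which is why the sketch switches to frequency counts for them.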
Granularity
Granularity is the scale or level of detail in a set of data.
The higher the granularity, the greater the detail of the data.
Data is analyzed at different levels of detail depending on the problem it is intended to address.
The required granularity of data may have to be established by combining different attributes from one or more source data sets.
Data needs to be pre-processed to assemble the final dataset at the required granularity.
Granularity is always related to measured numerical data and is made up of categorical data.
Data transformations such as aggregation are used to decrease the granularity of data, making it coarser.
An incorrect choice of granularity leads to incorrect analysis results.
Figure labels: Categorical variable in required granularity; Measured numerical variable; Observation
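A minimal sketch of decreasing granularity by aggregation, assuming hypothetical observations recorded per order date and sales person. The rows, the order counts and the function name aggregate are invented for illustration only.

```python
from collections import defaultdict

# Hypothetical observations at (order date, sales person ID) granularity.
# Columns: order date, sales person ID, measured order count.
rows = [
    ("2018-09-01", 276, 2),
    ("2018-09-01", 279, 3),
    ("2018-09-02", 276, 1),
    ("2018-09-02", 283, 3),
]


def aggregate(rows, key_index):
    """Sum the measured order count over one categorical variable."""
    totals = defaultdict(int)
    for row in rows:
        totals[row[key_index]] += row[2]
    return dict(totals)


per_day = aggregate(rows, 0)     # coarser: one observation per day
per_person = aggregate(rows, 1)  # coarser: one observation per sales person

print(per_day)
print(per_person)
```

Each aggregated result has fewer observations than the source rows, and the detail dropped in the process cannot be recovered from it. That is why the required granularity has to be decided before the dataset is assembled.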
Timeliness
Timeliness consists of the following parts:
Duration of data to be used for analysis, e.g. September 2018.
Refresh rate at which data is refreshed in source systems, e.g. daily, weekly, monthly.
Duration of historical or recent data required for analysis.
  The historical duration of raw data is decided by the duration for which the analysis is to be done.
  Recency, i.e. the latest refreshed data, is used for real-time, near real-time or present data analysis.
Granularity of time or interval of data, i.e. whether hourly, daily or weekly observations are needed, must be identified for the required analysis.
  Choosing the lowest level of time granularity is advised, as it can be aggregated to a higher time interval if required, e.g. hourly data aggregated to daily, then to weekly.
Figure labels: Time granularity is daily; Measured numerical variable; Historical duration is likely 2011; Observation
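The advice to keep the lowest time granularity can be illustrated with a short sketch that rolls hourly counts up to daily and then weekly totals. The hourly values and the function names to_daily and to_weekly are hypothetical, and grouping weeks by ISO week number is an assumption, not something the slide prescribes.

```python
from collections import defaultdict
from datetime import datetime

# Hypothetical hourly order counts (timestamp of the hour -> orders).
hourly = {
    datetime(2018, 9, 1, 9): 4,
    datetime(2018, 9, 1, 15): 1,
    datetime(2018, 9, 2, 10): 4,
    datetime(2018, 9, 3, 11): 2,
}


def to_daily(hourly_counts):
    """Aggregate hourly counts to one total per calendar day."""
    daily = defaultdict(int)
    for timestamp, count in hourly_counts.items():
        daily[timestamp.date()] += count
    return dict(daily)


def to_weekly(daily_counts):
    """Aggregate daily counts to one total per (ISO year, ISO week)."""
    weekly = defaultdict(int)
    for day, count in daily_counts.items():
        iso = day.isocalendar()
        weekly[(iso[0], iso[1])] += count
    return dict(weekly)


daily = to_daily(hourly)
weekly = to_weekly(daily)
print(daily)
print(weekly)
```

The reverse direction is not possible: weekly totals cannot be split back into daily or hourly values.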
Data Quality
Data Preparation Process
01 Variable Selection
Select the attributes of data that represent and impact the problem under analysis, eliminate irrelevant variables, and identify granularity.
02 Data Cleansing
Check quality, and clean the data of quality issues.
03 Data Transformation
Identify the transformation needs of the data, viz. computation of derived variables, aggregation, and reduction of statistically irrelevant variables.
04 Finalize dataset
Create the final dataset with all transformed observations. Extract a statistical random sample from it for analysis.
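For the "Finalize dataset" step, extracting a random sample might look like the sketch below. It assumes a simple random sample without replacement; the slides do not say which sampling scheme is intended, and the names draw_sample, sample_size and seed are hypothetical.

```python
import random


def draw_sample(observations, sample_size, seed=0):
    """Draw a simple random sample without replacement.

    A fixed seed makes the sample reproducible between runs.
    """
    rng = random.Random(seed)
    return rng.sample(observations, sample_size)


# Hypothetical final dataset: one identifier per transformed observation.
final_dataset = list(range(1, 31))
sample = draw_sample(final_dataset, sample_size=5)
print(sample)
```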
Variable Selection
Selecting relevant variables for analysis is most crucial for correct results.
Business knowledge, problem understanding, and discussion and verification with business users play a key role in selecting relevant variables.
Eliminating irrelevant variables is equally important.
Illustration I: Variable Selection
Problem Statement
What is average order count per day in September 2018?
Numeric Measure (01)
1. Numeric measure is count of orders in each day of September 2018.
2. It is a derived numeric variable, as count of orders per day is not measured.
Figure labels: Required variables; Identifying Observations (03)
Raw Data
Sales Order Number | Purchase Order Number | Customer ID | Sales Person ID | Ship Method ID | TaxAmt | Freight | TotalDue | Online Order Flag | OrderDate | ShipDate | SubTotal
SO43663 | PO18009186470 | 29565 | 276 | 5 | 40.2681 | 12.5838 | 472.3108 | FALSE | 9/1/2018 | 6/7/2011 | 419.4589
SO43665 | PO16588191572 | 29580 | 283 | 5 | 1375.943 | 429.9821 | 16158.696 | FALSE | 9/2/2018 | 6/7/2011 | 14352.771
SO43668 | PO14732180295 | 29614 | 282 | 5 | 3461.765 | 1081.802 | 40487.723 | FALSE | 9/3/2018 | 6/7/2011 | 35944.156
SO43660 | PO18850127500 | 29672 | 279 | 5 | 124.2483 | 38.8276 | 1457.3288 | FALSE | 9/1/2018 | 6/7/2011 | 1294.2529
SO43661 | PO18473189620 | 29734 | 282 | 5 | 3153.77 | 985.553 | 36865.801 | FALSE | 9/1/2018 | 6/7/2011 | 32726.479
SO43669 | PO14123169936 | 29747 | 283 | 5 | 70.5175 | 22.0367 | 807.2585 | FALSE | 9/3/2018 | 6/7/2011 | 714.7043
SO43659 | PO522145787 | 29825 | 279 | 5 | 1971.515 | 616.0984 | 23153.234 | FALSE | 9/1/2018 | 6/7/2011 | 20565.621
SO43664 | PO16617121983 | 29898 | 280 | 5 | 2344.992 | 732.81 | 27510.411 | FALSE | 9/2/2018 | 6/7/2011 | 24432.609
SO43667 | PO15428132599 | 29974 | 277 | 5 | 586.1203 | 183.1626 | 6876.3649 | FALSE | 9/2/2018 | 6/7/2011 | 6107.082
SO43662 | PO18444174044 | 29994 | 282 | 5 | 2775.165 | 867.2389 | 32474.932 | FALSE | 9/1/2018 | 6/7/2011 | 28832.529
SO43666 | PO16008173883 | 30052 | 276 | 5 | 486.3747 | 151.9921 | 5694.8564 | FALSE | 9/2/2018 | 6/7/2011 | 5056.4896
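The derived variable in this illustration (order count per day) can be computed from just two of the raw columns. The sketch below uses the Sales Order Number and OrderDate values from the raw data above; the function name orders_per_day is hypothetical. It averages over the days that actually appear in the sample, which is an assumption: whether days without orders should count towards the average is a decision for the analyst.

```python
from collections import Counter
from datetime import datetime

# (Sales Order Number, OrderDate) pairs taken from the raw data sample.
orders = [
    ("SO43663", "9/1/2018"), ("SO43665", "9/2/2018"), ("SO43668", "9/3/2018"),
    ("SO43660", "9/1/2018"), ("SO43661", "9/1/2018"), ("SO43669", "9/3/2018"),
    ("SO43659", "9/1/2018"), ("SO43664", "9/2/2018"), ("SO43667", "9/2/2018"),
    ("SO43662", "9/1/2018"), ("SO43666", "9/2/2018"),
]


def orders_per_day(orders, year, month):
    """Count orders per calendar day within the given month."""
    counts = Counter()
    for _, order_date in orders:
        day = datetime.strptime(order_date, "%m/%d/%Y").date()
        if day.year == year and day.month == month:
            counts[day] += 1
    return counts


counts = orders_per_day(orders, 2018, 9)
# Average over the days present in the sample only (see the note above).
average = sum(counts.values()) / len(counts)
print(dict(counts))
print(round(average, 2))
```

Only OrderDate and the order identifier are needed here, which is the point of variable selection: the remaining columns are irrelevant to this problem statement.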
Characteristics and Tests
Completeness: Check for blank values (not zero) and mark the observation for missing values of variables.
Data Consistency: Check categorical values for consistent spellings and associated numeric labels. Check date variables for a single format, e.g. either DDMMYYYY for all or MMDDYYYY for all. Check for uniform decimal places and rounding rules in all numerical columns.
Data Format Compliance: Check for standard formats in variables such as zip code, mobile number and phone number, which have a common standard format, e.g. zip code is 5 alphanumeric characters, mobile number is 10 digits, phone number is 7 or 8 digits.
Validity: Check that data is within standard defined ranges, e.g. an email must have an @ symbol, a name has no numeric characters, counts are not decimal, binary data cannot have more than two distinct values.
Business Rule Compliance: Verify that data complies with the business rules related to its variables, e.g. the order date is not a Sunday, as it is a holiday and no order is accepted on that day.
Duplicates: Data observations are not exact duplicates, i.e. no two observations have exactly the same values for all variables.
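Some of these tests are easy to automate. The following sketch covers a subset (completeness, mobile number format, email and count validity, exact duplicates) over a list of observations. The field names email, mobile and order_count, the sample rows and the function name quality_issues are hypothetical.

```python
import re


def quality_issues(rows):
    """Return (row index, issue) pairs for a list of dict observations."""
    issues = []
    seen = set()
    for index, row in enumerate(rows):
        # Completeness: blank values (not zero) count as missing.
        for name, value in row.items():
            if value is None or value == "":
                issues.append((index, f"missing {name}"))
        # Format compliance: a mobile number must be exactly 10 digits.
        mobile = row.get("mobile")
        if mobile and not re.fullmatch(r"\d{10}", mobile):
            issues.append((index, "bad mobile format"))
        # Validity: an email must contain @ and counts are not decimal.
        email = row.get("email")
        if email and "@" not in email:
            issues.append((index, "invalid email"))
        count = row.get("order_count")
        if isinstance(count, float) and not count.is_integer():
            issues.append((index, "decimal count"))
        # Duplicates: all values identical to an earlier observation.
        key = tuple(sorted(row.items()))
        if key in seen:
            issues.append((index, "duplicate"))
        seen.add(key)
    return issues


rows = [
    {"email": "a@example.com", "mobile": "9876543210", "order_count": 3},
    {"email": "b.example.com", "mobile": "98765", "order_count": 2.5},
    {"email": "", "mobile": "9876543210", "order_count": 0},
    {"email": "a@example.com", "mobile": "9876543210", "order_count": 3},
]

for issue in quality_issues(rows):
    print(issue)
```

The sketch only marks issues and leaves the data unchanged, so that cleansing can be done as a separate step.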
Basic Data Cleansing
2. Data Cleansing
Clean the data of the quality issues found in the quality assessment. The table below provides some basic cleansing actions.
These tests can be performed visually or with tools such as SQL or Excel.
Business Rule Compliance: Eliminate observations that do not adhere to business rules, unless corrections are provided and verified by business users or domain experts.
Duplicates: Remove duplicate observations.
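The two cleansing actions listed here can be sketched as follows. The Sunday rule is the example business rule used earlier in this deck; the sample orders and the function name cleanse are hypothetical. Note the simplification: the sketch drops every rule violation outright, whereas in practice a violation may instead be corrected once business users have verified the correction.

```python
from datetime import datetime


def cleanse(orders):
    """Remove exact duplicates and orders dated on a Sunday."""
    cleaned = []
    seen = set()
    for order in orders:
        # Duplicates: keep only the first copy of an identical observation.
        if order in seen:
            continue
        seen.add(order)
        # Business rule: no order is accepted on a Sunday.
        order_date = datetime.strptime(order[1], "%m/%d/%Y")
        if order_date.weekday() == 6:  # Monday is 0, Sunday is 6
            continue
        cleaned.append(order)
    return cleaned


# Hypothetical (order number, order date) observations.
orders_to_clean = [
    ("SO43663", "9/1/2018"),  # Saturday, kept
    ("SO43665", "9/2/2018"),  # Sunday, removed by the business rule
    ("SO43668", "9/3/2018"),  # Monday, kept
    ("SO43663", "9/1/2018"),  # exact duplicate, removed
]

print(cleanse(orders_to_clean))
```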
Illustration II: Data Cleansing
Problem Statement
What is average order count per day in September 2018?
Data Cleansing (02)
Figure panels: Intermediate Raw Dataset; DQ Assessment; Data cleansing result
After transformation is complete, dataset finalization includes the following steps, after which the dataset is ready for analysis.
Illustration III: Data Transformation and Dataset Finalization
Problem Statement
What is average order count per day in September 2018?
Figure panels: Intermediate Raw Dataset; Data Transformation & Dataset Finalization; Final Dataset