L1-D2 Basics of Data Preparation and Quality

The document discusses the importance of data preparation and quality for analysis. It outlines four key aspects of data that must be addressed: quality, variables, timeliness, and granularity. Data preparation is crucial to ensure analysis yields correct results: it involves preparing raw data by addressing quality issues, selecting relevant variables, and ensuring the data has the proper level of detail, time frame, and values for the intended analysis, turning raw data into a refined asset that can be effectively analyzed.


Basic Data Preparation & Quality

Understanding the methods

www.infocepts.com L2D1

Data Preparation

Data preparation is the process of preparing (or pre-processing) raw data into refined information assets that can be used effectively for analysis.

To get correct results from analysis, the analyst needs data that is appropriate in all four aspects: Quality, Variables, Timeliness, and Granularity.

Variables (V): "Data must have all required variables important for the analysis, or it should at least provide the means to derive the required attributes correctly."

Timeliness (T): "The lifetime and recency of data must be at the required frequency or below, to derive time-bound conclusions from the analysis."

Granularity (G): "The raw data must have been collected at the required attributes so as to ease its transformation as needed in the analysis."

Quality (Q): "Quality data appropriate for analysis must be correct, complete, consistent, valid, standardized, and adhering to business rules."
Types of data values

- Values are discrete and taken at regular intervals.
- Each value represents a single instance of an event.
- Examples: monthly revenue, daily temperature.
Variables

- Variables are units of data that can change between different cases.
- The different types of values decide the type of variable.
- Statistical methods can only be used with certain data types: continuous data has to be analyzed differently than categorical data, otherwise the analysis will be wrong.
- Variables can be analyzed on their own (univariate analysis), with one other variable (bivariate analysis), or with a number of others (multivariate analysis).
- Three universal rules of defining a variable:
  - One variable can only contain one type of values.
  - One instance of a variable represents only one value.
  - A value can be assigned to a variable or its instance, but not vice versa, i.e. variables are unidirectional.
- A variable can have an empty or null value.

Examples: a quantitative variable such as Order Count, say X, with instances X1 = 33, X2 = 26, and so on; qualitative variables, which are categorical (e.g. nominal, non-binary).
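The rule that one variable can only contain one type of values can be checked mechanically. A minimal sketch in Python (names and the classification function are illustrative, not from the deck):

```python
def classify_variable(values):
    """Return 'quantitative' if every non-null value is numeric, else 'qualitative'."""
    non_null = [v for v in values if v is not None]  # a variable can have null values
    if non_null and all(isinstance(v, (int, float)) for v in non_null):
        return "quantitative"
    return "qualitative"

# Instances of a quantitative variable X: X1 = 33, X2 = 26, and so on
order_count = [33, 26, 41, None]
# A qualitative (categorical, nominal) variable
ship_method = ["Air", "Road", "Sea", "Road"]

print(classify_variable(order_count))  # quantitative
print(classify_variable(ship_method))  # qualitative
```

Deciding this up front matters because, as the slide notes, the statistical methods available differ for the two kinds of variables.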
Granularity

- Granularity is the scale or level of detail in a set of data.
- The higher the granularity, the greater the detail of the data.
- Data is analyzed at different levels of detail depending on the problem it intends to address.
- The required granularity of data may have to be established by combining different attributes from one or more source data sets.
- Data needs to be pre-processed to assemble the final dataset at the required granularity.
- Granularity always relates to measured numerical data and is defined by categorical data.
- Data transformations like aggregation are used to decrease the granularity of data, making it coarser.
- An incorrect choice of granularity leads to incorrect results of analysis.

(Illustration: a table of observations with a categorical variable at the required granularity alongside a measured numerical variable.)
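Decreasing granularity by aggregation, as described above, can be sketched with the standard library alone (field names and values are illustrative):

```python
from collections import defaultdict

# Order-level data: the finest granularity available
orders = [
    {"order_date": "2018-09-01", "total_due": 472.31},
    {"order_date": "2018-09-01", "total_due": 1457.33},
    {"order_date": "2018-09-02", "total_due": 16158.70},
]

# Aggregate to daily granularity: the categorical variable (date) defines the
# grain, the measured numerical variable (total_due) is summed per grain
daily_revenue = defaultdict(float)
for order in orders:
    daily_revenue[order["order_date"]] += order["total_due"]

print(dict(daily_revenue))
```

The result is coarser (one observation per day instead of per order), which is exactly the trade-off the slide describes.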
Timeliness

- Timeliness consists of the following parts:
  - The duration of data to be used for the analysis, e.g. September 2018.
  - The refresh rate at which data is refreshed in source systems, e.g. daily, weekly, monthly, etc.
  - The duration of historical or recent data required for the analysis.
- Based on the duration for which the analysis is to be done, the historical duration for raw data is decided.
- Recency, i.e. the latest refreshed data, is used for real-time, near-real-time, or present-data analysis.
- The granularity of time, or interval of data (hourly, daily, weekly observations), needs to be identified for the required analysis.
- Choosing the lowest-level time granularity is advised, as it can be aggregated to a higher time interval if required, e.g. hourly data aggregated to daily, then to weekly.

(Illustration: a dataset with daily time granularity and a measured numerical variable, where the historical duration extends back to 2011.)
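The advice to capture the lowest time granularity and aggregate upward can be sketched as follows (timestamps are illustrative):

```python
from collections import Counter
from datetime import datetime

# Events captured at the lowest available time granularity (time of day)
hourly_events = [
    "2018-09-01 09:15", "2018-09-01 17:40",
    "2018-09-02 08:05",
]

# Aggregate upward to daily granularity: parse each timestamp and count per day
daily_counts = Counter(
    datetime.strptime(ts, "%Y-%m-%d %H:%M").date().isoformat()
    for ts in hourly_events
)
print(daily_counts)
```

Going the other way (daily to hourly) is impossible, which is why starting at the lowest granularity is the safe choice.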
Data Quality

Example of a quality issue: the business rule is that no invoice can have a null or empty customer address, yet records show empty values in the customer address field of the invoice data.
Data Preparation Process

01 Variable Selection
Selecting attributes of data that represent and impact the problem under analysis, eliminating irrelevant variables, and identifying granularity.

02 Data Cleansing
Check quality, and clean the data of quality issues.

03 Data Transformation
Identify the transformation needs of the data, viz. computation of derived variables, aggregation, and reduction of statistically irrelevant variables.

04 Finalize Dataset
Create the final dataset with all transformed observations. Extract a statistical random sample from it for analysis.
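The four process steps can be sketched as one small pipeline. This is a minimal illustration with hypothetical field names, not the deck's implementation:

```python
def select_variables(raw):
    """01: keep only the variables relevant to the analysis."""
    keep = ("SalesOrderNumber", "OrderDate")
    return [{k: row[k] for k in keep} for row in raw]

def cleanse(rows):
    """02: drop observations that are incomplete or exact duplicates."""
    seen, clean = set(), []
    for row in rows:
        key = tuple(sorted(row.items()))
        if all(v is not None for v in row.values()) and key not in seen:
            seen.add(key)
            clean.append(row)
    return clean

def transform(rows):
    """03: derive an aggregated measure, order count per OrderDate."""
    counts = {}
    for row in rows:
        counts[row["OrderDate"]] = counts.get(row["OrderDate"], 0) + 1
    return counts

def finalize(counts):
    """04: final dataset as (date, order_count) observations."""
    return sorted(counts.items())

raw = [
    {"SalesOrderNumber": "SO43663", "OrderDate": "9/1/2018"},
    {"SalesOrderNumber": "SO43663", "OrderDate": "9/1/2018"},  # exact duplicate
    {"SalesOrderNumber": "SO43660", "OrderDate": "9/1/2018"},
    {"SalesOrderNumber": "SO43665", "OrderDate": None},        # incomplete
    {"SalesOrderNumber": "SO43664", "OrderDate": "9/2/2018"},
]
final = finalize(transform(cleanse(select_variables(raw))))
print(final)  # [('9/1/2018', 2), ('9/2/2018', 1)]
```

Each stage mirrors one numbered step; the duplicate and incomplete observations are removed before the aggregation runs.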
Variable Selection

- Selecting relevant variables for the analysis is the most crucial step for correct results.
- Business knowledge, an understanding of the problem, and discussion with and verification from the business users play a key role in selecting relevant variables.
- Eliminating irrelevant variables is equally important.

Step 1: Enlist measured numerical variables
- Based on domain knowledge and help from experts and business users.
- Identify and enlist the measured numerical variables required for the analysis.
- Identify source systems for the raw data.
- Identify the subtype and measurement level of each numerical variable, i.e. continuous or discrete, interval or ratio.

Step 2: Enlist relevant categorical variables
- Based on domain knowledge and help from experts and business users.
- Enlist the categorical variables required for the analysis.
- Identify the subtype and measurement level of each categorical variable, i.e. nominal or ordinal.
- Identify the need for derived variables computed from available variables.
- Enlist required descriptive attributes, e.g. customer name.

Step 3: Identify observations
- From the raw data, identify observations that contain all categorical and measured numerical variables.
- Identify missing variables.
- Eliminate irrelevant variables from the observations.
- Ensure the granularity required for the analysis is achieved by the final observations.

Step 4: Create raw dataset
- Create an intermediate raw dataset from the identified observations.
- Ensure all required actual and derived variables and measured numerical variables are present in the dataset.
- Ensure all descriptive attributes/variables with their observed values are included in each observation.
- Ensure the timeliness of data is as required for the analysis, e.g. daily data of September 2018.
- Go to the next step for data quality checks and data cleansing.
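The "identify missing variables" step above reduces to a set comparison between the enlisted variables and the columns actually present in the raw data. A sketch (variable names are illustrative):

```python
# Variables enlisted in Steps 1 and 2
required_numerical = {"TotalDue", "TaxAmt"}
required_categorical = {"SalesOrderNumber", "OrderDate"}

# Columns actually found in the source system's raw data
raw_columns = {"SalesOrderNumber", "OrderDate", "TotalDue", "Freight"}

# Missing variables must be derived or sourced from another dataset
missing = (required_numerical | required_categorical) - raw_columns
# Irrelevant variables present in the raw data can be eliminated
irrelevant = raw_columns - (required_numerical | required_categorical)

print(missing)     # {'TaxAmt'}
print(irrelevant)  # {'Freight'}
```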
Illustration I: Variable Selection

Problem Statement: What is the average order count per day in September 2018?

01 Numeric measure
1. The numeric measure is the count of orders on each day of September 2018.
2. It is a derived numeric variable, as the count of orders per day is not measured directly.

02 Categorical variable needed
- The duration of required data is September 2018.
- The raw data is available at its lowest time granularity, time of day; the data needs to be aggregated to day level.
- We need the order number (to count orders) and a day (date) variable together with the numerical measure.

03 Identifying observations: required variables in the raw data.

Raw Data:

Sales Order Number | Purchase Order Number | Customer ID | Sales Person ID | Ship Method ID | TaxAmt | Freight | TotalDue | Online Order Flag | OrderDate | ShipDate | SubTotal
SO43663 | PO18009186470 | 29565 | 276 | 5 | 40.2681 | 12.5838 | 472.3108 | FALSE | 9/1/2018 | 6/7/2011 | 419.4589
SO43665 | PO16588191572 | 29580 | 283 | 5 | 1375.943 | 429.9821 | 16158.696 | FALSE | 9/2/2018 | 6/7/2011 | 14352.771
SO43668 | PO14732180295 | 29614 | 282 | 5 | 3461.765 | 1081.802 | 40487.723 | FALSE | 9/3/2018 | 6/7/2011 | 35944.156
SO43660 | PO18850127500 | 29672 | 279 | 5 | 124.2483 | 38.8276 | 1457.3288 | FALSE | 9/1/2018 | 6/7/2011 | 1294.2529
SO43661 | PO18473189620 | 29734 | 282 | 5 | 3153.77 | 985.553 | 36865.801 | FALSE | 9/1/2018 | 6/7/2011 | 32726.479
SO43669 | PO14123169936 | 29747 | 283 | 5 | 70.5175 | 22.0367 | 807.2585 | FALSE | 9/3/2018 | 6/7/2011 | 714.7043
SO43659 | PO522145787 | 29825 | 279 | 5 | 1971.515 | 616.0984 | 23153.234 | FALSE | 9/1/2018 | 6/7/2011 | 20565.621
SO43664 | PO16617121983 | 29898 | 280 | 5 | 2344.992 | 732.81 | 27510.411 | FALSE | 9/2/2018 | 6/7/2011 | 24432.609
SO43667 | PO15428132599 | 29974 | 277 | 5 | 586.1203 | 183.1626 | 6876.3649 | FALSE | 9/2/2018 | 6/7/2011 | 6107.082
SO43662 | PO18444174044 | 29994 | 282 | 5 | 2775.165 | 867.2389 | 32474.932 | FALSE | 9/1/2018 | 6/7/2011 | 28832.529
SO43666 | PO16008173883 | 30052 | 276 | 5 | 486.3747 | 151.9921 | 5694.8564 | FALSE | 9/2/2018 | 6/7/2011 | 5056.4896

04 Intermediate raw dataset: the Sales Order Number and OrderDate variables selected from the identified observations.
Basic Data Cleansing

1. Data Quality Assessment

- The intermediate raw dataset is then subjected to data quality tests (some important tests are in the table below).
- These tests can be performed visually or using tools like SQL, Excel, or other tools.

Characteristic | Tests
Completeness | Check for blank values (not zero) and mark observations with missing values of variables.
Data Consistency | Check categorical values for consistent spellings and associated numeric labels. Check date variables for similar formats, i.e. either DDMMYYYY for all or MMDDYYYY for all. Check for uniform decimal places and rounding rules in all numerical columns.
Data Format Compliance | Check for standard formats in variables like zip code, mobile number, and phone number, which have common standard formats (a zip code is 5 alphanumeric characters, a mobile number is 10 digits, a phone number is 7 or 8 digits).
Validity | Check that data is within standard defined ranges, e.g. an email must have an @ symbol, no numeric characters in a name, counts are not decimal, binary data cannot have more than two distinct values, etc.
Business Rule Compliance | From business rules related to the variables, verify that the data complies, e.g. an order date is never a Sunday, as it is a holiday and no orders are accepted that day.
Duplicates | Check that observations are not exact duplicates, i.e. no two observations have exactly the same values for all variables.
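Besides SQL or Excel, the tests in the table can be expressed as simple predicates. A sketch of three of them (the regex pattern and field names are assumptions for illustration):

```python
import re

def check_completeness(row, fields):
    """Completeness: the listed fields must not be blank (None or empty)."""
    return all(row.get(f) not in (None, "") for f in fields)

def check_date_format(value, pattern=r"^\d{1,2}/\d{1,2}/\d{4}$"):
    """Consistency: every date must use the same layout, here M/D/YYYY."""
    return re.match(pattern, value) is not None

def check_validity_email(value):
    """Validity: an email address must contain an @ symbol."""
    return "@" in value

row = {"SalesOrderNumber": "SO43663", "OrderDate": "9/1/2018", "Email": "a@b.com"}
print(check_completeness(row, ["SalesOrderNumber", "OrderDate"]))  # True
print(check_date_format(row["OrderDate"]))                         # True
print(check_validity_email(row["Email"]))                          # True
```

Observations failing any predicate are marked for the cleansing step that follows.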
Basic Data Cleansing

2. Data Cleansing

- Clean the data of the quality issues found in the quality assessment. The table below provides some basic cleansing actions.
- These actions can be performed visually or using tools like SQL, Excel, or other tools.

Dimension | Cleansing actions
Completeness | While there are various advanced techniques to replace missing or empty values, it is best to exclude observations with missing values from the dataset or sample, if possible.
Data Consistency | Correct the spellings of categorical values to get consistent spellings and associated numeric labels. Correct date formats so all are similar, i.e. either DDMMYYYY for all or MMDDYYYY for all. Maintain uniform decimal places and rounding rules in all numerical columns.
Data Format Compliance | There are standard formats for global variables like zip codes, phone numbers, country names, state names and abbreviations, PIN codes, etc.; follow them for the respective variables and make the necessary corrections.
Validity | Correct errors in the data so that values are within standard defined ranges, e.g. an email must have an @ symbol, no numeric characters in a name, counts are not decimal, binary data cannot have more than two distinct values, etc.
Business Rule Compliance | Eliminate observations not adhering to business rules, unless corrections are provided and verified by business users or domain experts.
Duplicates | Remove duplicate observations.
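Three of the cleansing actions (drop incomplete observations, normalize an inconsistent date format, remove exact duplicates) can be sketched with the standard library; the data and the supported formats are illustrative:

```python
from datetime import datetime

rows = [
    {"order": "SO43663", "date": "9/1/2018"},
    {"order": "SO43663", "date": "9/1/2018"},    # exact duplicate: removed
    {"order": "SO43665", "date": "2018-09-02"},  # inconsistent format: corrected
    {"order": "SO43668", "date": ""},            # incomplete: dropped
]

def normalize_date(text):
    """Re-emit any recognized date format as MM/DD/YYYY for consistency."""
    for fmt in ("%m/%d/%Y", "%Y-%m-%d"):
        try:
            return datetime.strptime(text, fmt).strftime("%m/%d/%Y")
        except ValueError:
            continue
    raise ValueError(f"unrecognized date format: {text}")

clean, seen = [], set()
for row in rows:
    if not row["date"]:                                  # completeness
        continue
    row = {**row, "date": normalize_date(row["date"])}   # consistency
    key = (row["order"], row["date"])
    if key in seen:                                      # duplicates
        continue
    seen.add(key)
    clean.append(row)

print(clean)
```

The order of checks matters: normalizing formats before deduplication ensures that the same observation written in two date styles is still recognized as a duplicate.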
Illustration II: Data Cleansing

Problem Statement: What is the average order count per day in September 2018?

01 Data Quality Assessment
- All data quality checks for completeness, consistency, validity, adherence to business rules, and duplicates are performed on the intermediate raw dataset.
- Issues found: incomplete and inconsistent observations, a duplicate observation, and an inconsistent date format.

02 Data Cleansing
- Incomplete observations and duplicate observations are removed.
- The inconsistent date format is corrected.
- The incomplete and inconsistent data would otherwise have produced an incorrect order count.
Basic Data Transformation

- Data transformation is the process of converting data from one format or structure into another format or structure.
- Some common data transformation types are as below.

Computed Variables: Derived numerical measures computed mathematically. These computations are at the observation level and do not cross the observation boundary. E.g. profit margin is computed as order value minus production cost; margin percentage is the ratio of profit to order value.

Aggregation: Derived numerical measurements using aggregation. Aggregation reduces the granularity, leaving fewer variables in the dataset; the computation is of addition or count type.

Classification: Classification of observations based on a combination of category variables and/or their values. E.g. Quarter is a classification of a year's observations into 4 quarters based on their month and date.

Dummy Variables: Indicator variables, flags, or dummy variables that classify observations into two classes for the purpose of filtering or analysis. E.g. an Online Order flag of TRUE or FALSE (1 or 0) classifies orders into those received from the website and those from the retail store, so that separate analyses can be performed on online and offline orders.
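All four transformation types can be shown on one toy record set (field names and values are illustrative):

```python
orders = [
    {"id": "SO1", "order_value": 100.0, "production_cost": 60.0,
     "month": 2, "online": True},
    {"id": "SO2", "order_value": 200.0, "production_cost": 150.0,
     "month": 5, "online": False},
]

for o in orders:
    # Computed variables: observation-level, do not cross the observation boundary
    o["profit_margin"] = o["order_value"] - o["production_cost"]
    o["margin_pct"] = o["profit_margin"] / o["order_value"]
    # Classification: month mapped to quarter (1-4)
    o["quarter"] = (o["month"] - 1) // 3 + 1
    # Dummy variable: two-class flag for filtering online vs offline orders
    o["online_flag"] = 1 if o["online"] else 0

# Aggregation: crosses observations, reducing granularity to a single total
total_profit = sum(o["profit_margin"] for o in orders)
print(orders[0]["quarter"], orders[1]["quarter"], total_profit)  # 1 2 90.0
```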
Dataset Finalization

After transformation is complete, dataset finalization includes the following steps, after which the dataset is ready for analysis.

Combine: Combine the derived variables with the dataset to form the final observations in a new dataset.
Data Quality: Check consistency, completeness, correctness, and other data quality aspects of the new dataset.
Sampling: If the dataset is too large, create a sample from it using a random sampling method; otherwise, use the full dataset for analysis.
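The sampling step can be sketched with `random.sample`; the size threshold here is an arbitrary illustration, not a recommendation from the deck:

```python
import random

dataset = list(range(10_000))   # stand-in for the final observations
MAX_SIZE = 1_000                # hypothetical limit for "too large"

random.seed(42)                 # seed only to make the sketch reproducible
analysis_set = (random.sample(dataset, MAX_SIZE)
                if len(dataset) > MAX_SIZE else dataset)
print(len(analysis_set))  # 1000
```

`random.sample` draws without replacement, so every observation appears at most once in the analysis set.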
Illustration III: Data Transformation and Dataset Finalization

Problem Statement: What is the average order count per day in September 2018?

Starting from the cleansed intermediate raw dataset, the transformation and finalization are performed with Excel functions:

- Observation-level derived variable: Day of Month = DAY(OrderDate)
- Aggregated derived numerical measurement: Order Count = COUNT(SalesOrderNumber) for each OrderDate
- Combine the "Day of Month" categorical variable and the "Order Count" numerical measure using the OrderDate raw variable, and create the final dataset.
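The same Excel steps translate directly to a stdlib sketch: derive the day of month, count orders per date, then average the daily counts (the sample observations are illustrative, not the full dataset from Illustration I):

```python
from collections import Counter
from datetime import datetime
from statistics import mean

# Cleansed (SalesOrderNumber, OrderDate) observations
orders = [
    ("SO43663", "9/1/2018"), ("SO43660", "9/1/2018"),
    ("SO43665", "9/2/2018"),
    ("SO43668", "9/3/2018"), ("SO43669", "9/3/2018"),
]

# Order Count = COUNT(SalesOrderNumber) for each OrderDate
order_count = Counter(date for _, date in orders)

# Day of Month = DAY(OrderDate), combined with the count via OrderDate
final_dataset = {
    datetime.strptime(d, "%m/%d/%Y").day: n
    for d, n in order_count.items()
}

avg_orders_per_day = mean(final_dataset.values())
print(final_dataset)  # {1: 2, 2: 1, 3: 2}
print(avg_orders_per_day)
```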
Thank You

www.infocepts.com