Week 07 & 08 & 09
Data Integration Concepts, Processes, and Techniques
Extract, Transform, Load (ETL)
Transform
• This stage applies a series of rules to the data extracted from the source to derive the data for loading into the end target
• Selecting only certain columns to load
• Translating coded values (e.g., 1 for male and 2 for female)
• Encoding free-form values (e.g., mapping "Male" to "M")
• Deriving a new calculated value
• Sorting
• Joining data from multiple sources (e.g., lookup, merge) and de-duplicating the data
• Aggregation (e.g., summarizing multiple rows of data, such as total sales for each store and region)
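The transform rules above can be sketched in plain Python. This is a minimal illustration, not production ETL; the field names (store_id, gender_code, qty, unit_price) are hypothetical.

```python
# Sketch of common transform-stage rules over rows from a hypothetical source.
from collections import defaultdict

rows = [
    {"store_id": "S1", "gender_code": 1, "qty": 2, "unit_price": 9.99, "note": "x"},
    {"store_id": "S1", "gender_code": 2, "qty": 1, "unit_price": 19.99, "note": "y"},
    {"store_id": "S2", "gender_code": 1, "qty": 3, "unit_price": 4.50, "note": "z"},
]

GENDER = {1: "M", 2: "F"}  # translating coded values to a standard code

transformed = []
for r in rows:
    transformed.append({
        "store_id": r["store_id"],                   # selecting only certain columns
        "gender": GENDER[r["gender_code"]],          # coded value -> "M"/"F"
        "sale_amount": r["qty"] * r["unit_price"],   # deriving a calculated value
    })

# Aggregation: total sales for each store
totals = defaultdict(float)
for r in transformed:
    totals[r["store_id"]] += r["sale_amount"]
```

Each bullet in the slide maps to one line of the loop; real ETL tools express the same rules declaratively.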
Transform
Generating surrogate-key values
Transposing or pivoting
Splitting a column into multiple columns
Looking up and validating the relevant data from tables or reference files for slowly changing dimensions
Applying any form of simple or complex data validation
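Two of the rules above, surrogate-key generation and column splitting, sketched in Python. The customer data and field names are hypothetical.

```python
# Sketch: split one column into two and assign warehouse surrogate keys that
# are independent of any source-system key.
import itertools

customers = [{"full_name": "Ada Lovelace"}, {"full_name": "Alan Turing"}]

surrogate = itertools.count(1)   # simple sequential surrogate-key generator
dim_rows = []
for c in customers:
    first, last = c["full_name"].split(" ", 1)   # split column into two columns
    dim_rows.append({"customer_key": next(surrogate),
                     "first_name": first,
                     "last_name": last})
```

Surrogate keys insulate the warehouse from changes in source-system identifiers, which matters for slowly changing dimensions.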
Motivation for Data Integration
• Add value to disparate data sources through data transformations
• Provide a single source of truth for decision making
• Populating and maintaining a warehouse is complex
• Overcome challenges
– Large volumes of data
– Widely varying formats & units of measure
– Different update frequencies
– Missing data
– Lack of common identifiers
• Critical success factor for data warehouse projects
– Initially populating a data warehouse &
– Periodically refreshing a warehouse as data sources change
• Significant investments in effort, hardware, and software
Data Sources
• Internal Data Sources
– procured and consolidated from different branches
within your organization
• purchase orders from the sales team, transactions from accounting, pre-orders from inventory management, leads from marketing
[Figure: example internal data sources (location, marketplace, management, compensation) with employee turnover as the quantitative outcome variable]
Periodic Refresh Processing of DWH
• Valid time lag: the difference between the occurrence of an event in the real world (valid time) and the storage of the event in an operational database (transaction time)
• Load time lag: the difference between transaction time and the storage of the event in a data warehouse (load time)
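A small illustration of the two lags, using hypothetical timestamps.

```python
# Sketch: computing valid time lag and load time lag from three timestamps.
from datetime import datetime

valid_time = datetime(2024, 3, 1, 9, 0)        # event occurs in the real world
transaction_time = datetime(2024, 3, 1, 9, 5)  # event stored in operational DB
load_time = datetime(2024, 3, 2, 2, 0)         # event stored in the warehouse

valid_time_lag = transaction_time - valid_time   # operational capture delay
load_time_lag = load_time - transaction_time     # warehouse refresh delay
```

The load time lag is typically much larger than the valid time lag because warehouse refresh is periodic rather than continuous.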
Periodic Refresh Workflow of DWH
Recording the results of the cleaning process, performing completeness and reasonableness checks, and handling exceptions
Merging the separate cleaned sources into one source, removing inconsistencies
Recording the results of the merging process, performing completeness and reasonableness checks, and handling exceptions
Propagating the integrated change data to fact tables, dimension tables, materialized views, stored data cubes, and data marts
Notifying user groups and administrators
Data Quality (7 Sources of poor data quality)
• Entry Quality
– Did the information enter the system correctly at the origin?
– Incorrect phone number/email address
– Cost of entry problems depends on use
• If used for informational purposes, the cost is low
• If used for marketing and driving new sales, the cost is significant
• Process Quality
– Was the integrity of the information maintained during processing
through the system?
• May result from a system crash, lost file or any technical occurrence
– The source of the problem needs to be identified for rectification
• Identification Quality
– Are two similar objects identified correctly to be the same or different?
Data Quality (Sources of poor data quality)
• Integration Quality (Quality of completeness)
– Is all the known information about an object integrated to the point of
providing an accurate representation of the object?
– Example: It might be important for an auto claims adjuster to know that
a customer is also a high-value life insurance customer
– It creates a need to develop MDM (Master Data Management)
• Enables the process of identifying records from multiple systems that refer to the
same entity. Records will then be consolidated.
• Usage Quality
– Is the information used and interpreted correctly at the point of access?
– Occurs due to lack of access to legacy source documentation or
subject matter experts.
– Forces data warehouse experts to guess the meaning and use of certain data elements
– Need thorough documentation, robust metadata, and a data governance program
Data Quality (Sources of poor data quality)
• Aging Quality
– Has enough time passed that the validity of the information can no
longer be trusted?
• (1) Maintaining a former customer's address for more than five years is probably not
useful. If customers haven't been heard from in several years despite marketing
efforts, how can we be certain they still live at the same address?
• (2) Maintaining customer address information for a homeowner's insurance claim
may be necessary and even required by law.
– Decisions need to be made by the business owners
• Organizational Quality
– Can the same information be reconciled between two systems based
on the way the organization constructs and views the data?
– A less technical, more organizational issue
• marketing tries to "tie" their calculations to finance, where the reporting systems of
both the departments are quite different
– The biggest challenge to reconciliation is getting the various departments to agree that one department's A equals another's B equals another's C plus D
Data Quality (Record Linkage – Example)
Data Profiling
• Process of examining data available from an existing information source
and collecting statistics or informative summaries about that data
OR
• A process of developing information about data instead of information from
data.
– Utilizes statistical variables
– Metadata
• Clarifies
– Structure, content, relationships, derivation rules of the data
– Metadata about data – to discover illegal values, misspellings, missing values,
varying value representation, duplicates
• Performed several times, with varying intensity:
– (1) soon after candidate source systems are identified
– (2) prior to the dimensional modeling process
– (3) after the data has been loaded into the staging area
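A minimal profiling sketch in Python, collecting null counts, distinct counts, and min/max per column. The sample rows are hypothetical; the varying "WA"/"wa" representation is planted to show what profiling surfaces.

```python
# Sketch: per-column profiling statistics over a small hypothetical table.
rows = [
    {"cust_id": 1, "state": "WA", "age": 34},
    {"cust_id": 2, "state": "wa", "age": None},  # missing value, varying representation
    {"cust_id": 3, "state": "OR", "age": 29},
]

def profile(rows, column):
    """Return simple informative summaries about one column."""
    values = [r[column] for r in rows]
    non_null = [v for v in values if v is not None]
    return {
        "nulls": len(values) - len(non_null),
        "distinct": len(set(non_null)),
        "min": min(non_null),
        "max": max(non_null),
    }

age_profile = profile(rows, "age")
state_profile = profile(rows, "state")
```

A distinct count of 3 for only two real states hints at the "WA"/"wa" representation problem, exactly the kind of issue profiling is meant to reveal before modeling.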
MS SQL Server Data Profiling Tool
Statistical Analysis System (SAS)
MS SQL Server Data Profiling Tool
Data Profiling Tools
Initial Data Warehouse Load
[Figure: discover and resolve data quality problems before the initial load]
Refresh Processing Decision Making
• Data timeliness depends on the sensitivity of decision making to the currency of the data
• Completeness constraints involve inclusion of changed data from each data source
• Availability constraints involve conflicts between online availability and warehouse loading
[Figure: trade-off among refresh costs, completeness, and consistency]
Data Integration Concepts, Processes,
and Techniques
Basics of Change Data
• Derived from internal and external data sources
• Used to populate and refresh a data warehouse
– Insert rows in fact and dimension tables (common)
– Update rows in dimension tables (less common)
• Challenges
– Difficult to change source systems, especially external systems
– Lack of SQL access and descriptive (meta) data
especially for legacy data
Cooperative Change Data
Logged Change Data
Queryable Change Data
• Queryable change data comes directly from a data source via a query.
• Requires timestamping in the data source.
• Since few data sources contain timestamps for all data, queryable change
data usually is augmented with other kinds of change data.
• Queryable change data is most applicable for fact tables, using fields such as order date, shipment date, and hire date that are stored in operational data sources.
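A sketch of a timestamp-based extract. In SQL this would be a `WHERE order_date > :last_extract` filter; the order table and the extract timestamp here are hypothetical.

```python
# Sketch: queryable change data -- pull only rows whose timestamp is newer
# than the previous extract.
from datetime import datetime

orders = [
    {"order_id": 1, "order_date": datetime(2024, 5, 1)},
    {"order_id": 2, "order_date": datetime(2024, 5, 8)},
    {"order_id": 3, "order_date": datetime(2024, 5, 15)},
]

last_extract = datetime(2024, 5, 5)  # recorded by the previous refresh run
change_data = [r for r in orders if r["order_date"] > last_extract]
```

Note the limitation the slide mentions: this only works for fields that carry a timestamp, so it is usually combined with other change-data techniques.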
Snapshot Change Data
• Involves periodic dumps of source data.
• To derive change data, a difference operation uses the two most recent snapshots.
• The result of a difference operation is called a delta.
• Snapshots are the only form of change data without requirements on a source system.
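The difference operation can be sketched as a keyed comparison of the two most recent snapshots. The keys and fields are hypothetical; inserts, updates, and deletes fall out of set operations on the key sets.

```python
# Sketch: deriving a delta from two snapshots, each keyed by a primary key.
old = {1: {"name": "Ada", "city": "London"},
       2: {"name": "Alan", "city": "Wilmslow"}}
new = {1: {"name": "Ada", "city": "Paris"},        # row changed
       3: {"name": "Grace", "city": "Arlington"}}  # row inserted; key 2 deleted

delta = {
    "inserted": {k: new[k] for k in new.keys() - old.keys()},
    "deleted":  {k: old[k] for k in old.keys() - new.keys()},
    "updated":  {k: new[k] for k in new.keys() & old.keys() if new[k] != old[k]},
}
```

Only the delta, not the full snapshot, is then propagated to the warehouse tables during refresh.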
Change Data Classification
Data Quality Problems
• Multiple identifiers
• Different units
• Missing values
• Text data with different components and formats
• Conflicting data
• Different update times
Data Integration Concepts, Processes,
and Techniques
Parsing
• Locates and separates individual data elements
in text
• Studied in computer science for decades
• Regular expressions for pattern specification
• Natural Language Processing (NLP)
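For example, a regular expression can locate and separate the elements of a phone string. The North American format assumed here is purely illustrative.

```python
# Sketch: parsing a phone string into named elements with a regular expression.
import re

phone = re.compile(r"\((?P<area>\d{3})\)\s*(?P<prefix>\d{3})-(?P<line>\d{4})")
parts = phone.match("(206) 555-0147").groupdict()
```

Named groups give each separated element a label, which is exactly what the transform stage needs when splitting text into columns.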
Parsing Example
Correcting Values
Missing values
- Default value for inapplicable values
• For example, missing values for an order without an employee can be replaced with a default value indicating a web order.
- Typical value: the average or median for numeric fields; the mode for non-numeric fields
- Complex processing that predicts values from relationships to other fields, using data mining algorithms
Conflicting values
- Prefer the more recent value
- Prefer the more credible source, judged via domain experts
44
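The missing-value rules above, sketched in Python. The order data and the "WEB" default are hypothetical.

```python
# Sketch: correcting missing values with a default (inapplicable value) and
# the median (typical value for a numeric field).
from statistics import median

orders = [
    {"order_id": 1, "employee": "Lee", "amount": 120.0},
    {"order_id": 2, "employee": None,  "amount": 80.0},   # web order, no employee
    {"order_id": 3, "employee": "Kim", "amount": None},   # missing numeric value
]

known_amounts = [r["amount"] for r in orders if r["amount"] is not None]
fill_amount = median(known_amounts)   # typical value for a numeric field

for r in orders:
    if r["employee"] is None:
        r["employee"] = "WEB"         # default for an inapplicable value
    if r["amount"] is None:
        r["amount"] = fill_amount
```

The third rule on the slide (predicting values from other fields with data mining algorithms) would replace the median with a trained model.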
Correction Example
Regular Expressions (regex)
A search expression combines:
• Literals: characters matched exactly
• Metacharacters: ? * + {n} {n,m} [ ] [^ ] . ^ $ |
• Escape sequences: \ followed by a metacharacter, to match it literally

Example search expression: ^[a-z]+\.com$
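The example expression can be checked with Python's re module:

```python
# Sketch: testing ^[a-z]+\.com$ against a few candidate strings.
import re

pattern = re.compile(r"^[a-z]+\.com$")
matches = {s: bool(pattern.match(s))
           for s in ("example.com", "Example.com", "examplecom")}
```

"Example.com" fails because of the uppercase E outside [a-z], and "examplecom" fails because the escaped literal dot is missing.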
Meta Character Summary
Metacharacter  Type         Meaning
?              Iteration    Matches preceding character 0 or 1 time
*              Iteration    Matches preceding character 0 or more times
+              Iteration    Matches preceding character 1 or more times
{n}            Iteration    Matches preceding character exactly n times
{n,m}          Iteration    Matches preceding character at least n and at most m times
[]             Range        Matches one of the enclosed characters one time; a hyphen inside the brackets defines a range of characters
[^]            Range        Negates the search pattern when ^ appears inside the brackets
^              Position     Matches at the beginning of the target string; only has meaning as the first character in a regular expression
$              Position     Matches at the end of the target string; only has meaning as the last character in a regular expression
.              Wildcard     Matches any single character except a newline
|              Alternation  Matches either the pattern to the left or the pattern to the right of the |
()             Group        Groups parts of the pattern for matching parts of target strings
Meta Character Examples I
• This table shows six examples with multiple target strings per example.

"colou?r" vs. "color", "colour": matches both target strings.
"tre*" vs. "tree", "tread", "trough": matches all three target strings; the preceding character e is matched 0 times in the third.
"tre+" vs. "tree", "tread", "trough": does not match the third target string because its third character is o.
"[abcd]" vs. "dog", "fond", "pen": matches the first two strings but not the third, which contains none of the letters inside the square brackets.
"[0-9]{3}-[0-9]{4}" vs. "123-4567", "1234-567": matches the first string but not the second; the first range must be matched exactly three times and the second exactly four times.
"ba{2,3}b" vs. "baab", "baaab", "bab", "baaaab": matches the first two strings but not the last two; the preceding character a must be matched two or three times.
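These evaluations can be spot-checked with Python's re module (re.search looks for a match anywhere in the target string, as the examples assume):

```python
# Sketch: verifying selected rows of the examples above.
import re

checks = {
    "colou?r vs colour":   bool(re.search(r"colou?r", "colour")),
    "tre* vs trough":      bool(re.search(r"tre*", "trough")),   # e matched 0 times
    "tre+ vs trough":      bool(re.search(r"tre+", "trough")),   # needs at least one e
    "[abcd] vs pen":       bool(re.search(r"[abcd]", "pen")),
    "ranges vs 1234-567":  bool(re.search(r"[0-9]{3}-[0-9]{4}", "1234-567")),
    "ba{2,3}b vs bab":     bool(re.search(r"ba{2,3}b", "bab")),
}
```

Running the checks confirms the table: only the first two expressions find a match in their target string.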
Meta Characters II
• This table shows search expressions using position, iteration, and alternation metacharacters.
Entity Matching Applications
• Marketing: combine customers from different companies after a merger
• Law enforcement: link crimes to individuals
                    Actual
Predicted        Match            Non Match
Match            True match       False match
Possible Match   Investigation    Investigation
Non Match        False non match  True non match

• The rows represent predictions and the columns represent actual results of matching two records for duplication.
• A true match involves a predicted match and an actual match, allowing the two records to be combined correctly.
• The possible match situations involve predictions without enough evidence to decide, requiring investigation.
Household Consolidation
Household consolidation involves linking records from individuals living in the same household.
Transaction Linking
[Figure: account 83451234, policy ME309451-2, and transaction B498/97 linked to one person]
In transaction linking, all accounts and transactions are associated with the same person.
Data Integration Concepts, Processes,
and Techniques
Edit Distance
• Used for comparing relatively short text
values occurring in entity matching applications.
• Very common distance function for text
• Operations to transform two text values
– Delete a character
– Insert a character
– Substitute one character for another
• Edit distance is defined as the minimal number
of operations to transform a source text value
into a target text value.
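This definition is the classic dynamic-programming edit distance (Wagner-Fischer). A minimal sketch, counting insertions, deletions, and substitutions:

```python
# Sketch: edit distance as the minimal number of delete/insert/substitute
# operations to transform the source text into the target text.
def edit_distance(source: str, target: str) -> int:
    m, n = len(source), len(target)
    # dist[i][j] = edits needed to turn source[:i] into target[:j]
    dist = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(m + 1):
        dist[i][0] = i                  # i deletions
    for j in range(n + 1):
        dist[0][j] = j                  # j insertions
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            cost = 0 if source[i - 1] == target[j - 1] else 1
            dist[i][j] = min(dist[i - 1][j] + 1,         # delete a character
                             dist[i][j - 1] + 1,         # insert a character
                             dist[i - 1][j - 1] + cost)  # substitute
    return dist[m][n]
```

For example, edit_distance("Saturday", "Sunday") returns 3: delete "a" and "t", then substitute "r" with "n".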
Edit Distance Example
Saturday → Sunday (edit distance 3)
Quiz
• What is the edit distance between “Break”
and “Trick”
• 5
• 2
• 3
• 4
Phonetic Distance Functions
• Many applications in law enforcement to account
for different name spellings, but similar
pronunciations.
• Words with the same pronunciation should have the same phonetic value.
• Phonetic distance mainly codes words into
standard consonant sounds.
• Two phonetic distance functions are widely available in DBMSs and data integration tools:
– Soundex: 6 consonant sounds
– Metaphone: 16 consonant sounds
– Metaphone, with more consonant sounds, was developed as an improvement to Soundex.
Phonetic Matching Examples
• Soundex
– Soundex(Assistance) = A223
– Soundex(Assistants) = A223
• Metaphone
– Metaphone(Assistance) = ASSTNS
– Metaphone(Assistants) = ASSTNTS
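A simplified Soundex coder that reproduces the A223 values above. This is a sketch of the classic 6-consonant-code scheme; real DBMS implementations add rules (for example, special handling of h and w) omitted here.

```python
# Simplified Soundex sketch: keep the first letter, map remaining letters to
# the classic 6 consonant codes, collapse adjacent duplicate codes, pad to 4.
CODES = {}
for letters, digit in [("bfpv", "1"), ("cgjkqsxz", "2"), ("dt", "3"),
                       ("l", "4"), ("mn", "5"), ("r", "6")]:
    for ch in letters:
        CODES[ch] = digit

def soundex(word: str) -> str:
    word = word.lower()
    digits = []
    prev = CODES.get(word[0], "")
    for ch in word[1:]:
        code = CODES.get(ch, "")
        if code and code != prev:   # skip repeats of the same consonant sound
            digits.append(code)
        prev = code                 # vowels reset prev, separating sounds
    return (word[0].upper() + "".join(digits) + "000")[:4]
```

Both "Assistance" and "Assistants" code to A223, which is exactly why Soundex matches them while Metaphone, with more consonant sounds, keeps them apart.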