
Week 07 & 08 & 09

The document discusses concepts and processes related to data integration. It explains the extract, transform, load (ETL) process used to integrate data from multiple sources into a data warehouse. This involves extracting data from sources, transforming it (e.g., cleaning, aggregating), and loading it into the target data warehouse. Periodically refreshing the data warehouse is a complex process that involves overcoming challenges like data quality issues, different data formats and frequencies. Tools like data profiling and matching are used to ensure high quality data.
Data Integration Concepts, Processes, and Techniques

Concepts of Data Integration Processes


Lesson Objectives
• Explain diagrams for refresh processing and
typical tasks
• Discuss difficulties of initial population of a data
warehouse
• Understand tradeoffs and constraints in
managing refresh processing

Extract, Transform, Load (ETL)

Transform
• This stage applies a series of rules or functions to the data extracted
from the source to derive the data for loading into the end target
• Selecting only certain columns to load
• Translating coded values (e.g., a source system that stores 1 for male and 2 for
female)
• Encoding free-form values (e.g., mapping "Male" to "M")
• Deriving a new calculated value
• Sorting
• Joining data from multiple sources (e.g., lookup, merge)
and de-duplicating the data
• Aggregation (e.g., summarizing multiple rows of data:
total sales for each store, for each region, etc.)
Transform (cont.)
• Generating surrogate-key values
• Transposing or pivoting
• Splitting a column into multiple columns
• Looking up and validating the relevant data from tables or
referential files for slowly changing dimensions
• Applying any form of simple or complex data validation
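A minimal sketch of a few of the transformations listed above (column selection, code translation, a derived value, surrogate keys, and aggregation). The rows, column names, and the code table are hypothetical and exist only for illustration; a real ETL tool would express these steps in its own configuration.

```python
from collections import defaultdict

# Hypothetical extracted rows; all column names and values are illustrative.
extracted_rows = [
    {"store": "S1", "region": "East", "gender": "1", "qty": 2, "unit_price": 5.0, "memo": "gift"},
    {"store": "S1", "region": "East", "gender": "2", "qty": 1, "unit_price": 7.5, "memo": ""},
    {"store": "S2", "region": "West", "gender": "2", "qty": 4, "unit_price": 2.0, "memo": ""},
]

# Assumed code table for translating source codes into warehouse codes.
GENDER_CODES = {"1": "M", "2": "F"}


def transform(rows):
    """Select columns, translate codes, derive a value, and add a surrogate key."""
    transformed = []
    for surrogate_key, row in enumerate(rows, start=1):
        transformed.append({
            "sale_key": surrogate_key,                       # generated surrogate key
            "store": row["store"],                           # selected column
            "region": row["region"],                         # selected column
            "gender": GENDER_CODES.get(row["gender"], "U"),  # translated code, "U" if unknown
            "amount": row["qty"] * row["unit_price"],        # derived calculated value
        })
    return transformed


def total_sales_by_store(rows):
    """Aggregate transformed rows into total sales per store."""
    totals = defaultdict(float)
    for row in rows:
        totals[row["store"]] += row["amount"]
    return dict(totals)


transformed_rows = transform(extracted_rows)
store_totals = total_sales_by_store(transformed_rows)
```

The "memo" column is dropped simply by not copying it, which is the "selecting only certain columns" step.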
Motivation for Data Integration
• Add value to disparate data sources by data transformations
• Find single source of truth for decision making
• Populating and maintaining a warehouse is complex
• Overcome challenges
– Large volumes of data
– Widely varying formats & units of measure
– Different update frequencies
– Missing data
– Lack of common identifiers
• Critical success factor for data warehouse projects
– Initially populating a data warehouse &
– Periodically refreshing a warehouse as data sources change
• Significant investments in effort, hardware, and software
Data Sources
• Internal Data Sources
– Procured and consolidated from different branches
within your organization
• Examples: purchase orders from the sales team, transactions from accounting,
pre-orders from inventory management, leads from marketing

• External Data Sources
– Not collected by your organization
– Obtained from a source outside of your
organization
• Examples: purchasing a list from a list broker or gaining
access to a proprietary database
Business Analyst Perspective

[Figure: Location, Marketplace, Management, and Compensation shown as factors (qualitative variables) related to Employee Turnover as the outcome variable (quantitative variable)]
Periodic Refresh Processing of DWH

• Valid time lag: the difference between the occurrence of an event in the real world (valid time)
and the storage of the event in an operational database (transaction time)
• Load time lag: the difference between transaction time
and the storage of the event in a data warehouse (load time)
Periodic Refresh Workflow of DWH
Steps in execution order:
1. Retrieve data from individual source systems
2. Move extracted data to the staging area
3. Standardize and improve the quality of extracted data
4. Record results of the cleaning process, perform completeness and reasonableness checks, and handle exceptions
5. Merge the separate cleaned sources into one source, removing inconsistencies
6. Record results of the merging process, perform completeness and reasonableness checks, and handle exceptions
7. Propagate the integrated change data to fact and dimension tables, materialized views, stored data cubes, and data marts
8. Notify user groups and administrators


Data Quality
• An essential characteristic that determines the reliability of
data for making decisions
• Data is of “high quality” if it is “fit” for its intended uses in
operations, decision making, and planning
• Tools to ensure Data Quality
– Data Profiling - initially assessing the data to understand its quality challenges
– Data Standardization - a business rules engine that ensures that data conforms
to quality rules
– Geocoding - for name and address data; corrects data to worldwide postal
standards
– Matching or Linking - similar but slightly different records can be aligned
– Monitoring - keeping track of data quality over time and reporting variations
– Batch and real time
Data Quality (7 Sources of poor data quality)
• Entry Quality
– Did the information enter the system correctly at the origin?
– Incorrect phone number/email address
– Cost of entry problems depends on use
• If used for informational purposes then its cost is low
• If used for marketing & driving new sales then its cost is significant
• Process Quality
– Was the integrity of the information maintained during processing
through the system?
• May result from a system crash, lost file or any technical occurrence
– The source of the problem needs to be identified so that it can be remediated
• Identification Quality
– Are two similar objects identified correctly to be the same or different?
Data Quality (Sources of poor data quality)
• Integration Quality (Quality of completeness)
– Is all the known information about an object integrated to the point of
providing an accurate representation of the object?
– Example: It might be important for an auto claims adjuster to know that
a customer is also a high-value life insurance customer
– It creates a need to develop MDM (Master Data Management)
• Enables the process of identifying records from multiple systems that refer to the
same entity. Records will then be consolidated.

• Usage Quality
– Is the information used and interpreted correctly at the point of access?
– Occurs due to lack of access to legacy source documentation or
subject matter experts.
– Forces data warehouse experts to guess the meaning and use of
certain data elements
– Need to have thorough documentation, robust metadata, and a data
governance program
Data Quality (Sources of poor data quality)
• Aging Quality
– Has enough time passed that the validity of the information can no
longer be trusted?
• (1) Maintaining a former customer's address for more than five years is probably not
useful. If customers haven't been heard from in several years despite marketing
efforts, how can we be certain they still live at the same address?
• (2) Maintaining customer address information for a homeowner's insurance claim
may be necessary and even required by law.
– Decisions need to be made by the business owners
• Organizational Quality
– Can the same information be reconciled between two systems based
on the way the organization constructs and views the data?
– Less a technical issue than an organizational one
• Marketing tries to "tie" its calculations to finance, where the reporting systems of
the two departments are quite different
– The biggest challenge to reconciliation is getting the various departments to
agree that their A equals the other's B equals the other's C plus D.
Data Quality (Record Linkage – Example)

Data Profiling
• Process of examining data available from an existing information source
and collecting statistics or informative summaries about that data
OR
• A process of developing information about data instead of information from
data.
– Utilizes statistical variables
– Metadata

• Clarifies
– Structure, content, relationships, derivation rules of the data
– Metadata about data – to discover illegal values, misspellings, missing values,
varying value representation, duplicates
• Performed several times with varying intensity:
– (1) Soon after the candidate source systems are identified
– (2) Prior to the dimensional modeling process
– (3) After the data has been loaded into the staging area
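As a rough illustration of "information about data", the sketch below computes a few summary statistics for one column: row count, missing values, distinct values, and repeated values. The function name, the sample rows, and the choice of statistics are assumptions for illustration; dedicated profiling tools report far more.

```python
from collections import Counter


def profile_column(rows, column):
    """Return simple profiling statistics for one column of a list of dict rows."""
    values = [row.get(column) for row in rows]
    present = [v for v in values if v not in (None, "")]  # treat None and "" as missing
    counts = Counter(present)
    return {
        "rows": len(values),
        "missing": len(values) - len(present),
        "distinct": len(counts),
        "repeated_values": sorted(v for v, n in counts.items() if n > 1),
        "most_common": counts.most_common(1)[0][0] if counts else None,
    }


# Hypothetical staging rows with typical problems: a duplicate id,
# a missing state, and a varying representation ("WA" versus "wa").
customers = [
    {"id": 1, "state": "WA"},
    {"id": 2, "state": "wa"},
    {"id": 3, "state": ""},
    {"id": 3, "state": "WA"},
]

state_profile = profile_column(customers, "state")
id_profile = profile_column(customers, "id")
```

Here the state profile shows two distinct spellings for what is probably one value, and the id profile shows a repeated identifier, both of which would be candidates for cleaning.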
Data Profiling Tools
• MS SQL Server Data Profiling Tool
• Statistical Analysis System (SAS)
Initial Data Warehouse Load

[Figure: data quality problems: discover, then resolve]

• Major development activity
• More open ended than refresh, with time requirements that are difficult to estimate
• Use profiling tools to discover data quality problems
• The initial population process should be performed for each
major extension of the data warehouse.
Primary Objective of managing the
refresh process
• The primary objective in managing the refresh
process is to determine the refresh frequency for
each data source and set detailed refresh
schedules.

Refresh Processing Decision Making
• Data timeliness depends on the sensitivity of decision making to the
currency of the data.
• Some decisions are very time sensitive, such as inventory
decisions: minimize inventory carrying costs by stocking goods as
close as possible to the time needed.
• Other decisions are not so time sensitive. For example, the decision
to close a poor performing store would typically be done using data
over a long period of time.
• Net refresh benefit is defined as the value of data timeliness
minus the cost of refresh.

[Figure: refresh costs, timeliness importance, and constraints feed into managing refresh frequency and schedules]
Refresh Constraints
• Source access constraints can be due to legacy
technology with restricted scalability.
• Integration constraints often involve identification of
common entities.
• Consistency constraints involve usage of the same
time period in change data.
• Completeness constraints involve inclusion of changed
data from each data source.
• Availability constraints involve conflicts between
online availability and warehouse loading.

[Figure: satisfy constraints: source access, integration, consistency, completeness, availability]
Data Integration Concepts, Processes,
and Techniques

Change Data Concepts


Lesson Objectives
• Explain the types of data sources involved in data
integration
• Provide examples of typical data quality problems
encountered during data integration
• Reflect on the relationship between type of
change data and data quality

Basics of Change Data
• Derived from internal and external data sources
• Used to populate and refresh a data warehouse
– Insert rows in fact and dimension tables (common)
– Update rows in dimension tables (less common)
• Challenges
– Difficult to make changes to source systems, especially
external systems
– Lack of SQL access and descriptive (meta) data,
especially for legacy data
Cooperative Change Data

Logged Change Data

Queryable Change Data

• Queryable Change Data: comes directly from a data source via a query.
• Requires timestamping in the data source.
• Since few data sources contain timestamps for all data, queryable change
data usually is augmented with other kinds of change data.
• Queryable change data is most applicable for fact tables using fields such as
order date, shipment date, and hire date that are stored in operational data
sources.
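A minimal sketch of queryable change data, assuming a source table with a timestamp-like column. The table name, columns, rows, and last refresh date are hypothetical, and an in-memory SQLite database stands in for the operational source.

```python
import sqlite3

# In-memory stand-in for an operational data source (hypothetical schema).
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE orders (order_id INTEGER PRIMARY KEY, order_date TEXT, amount REAL)"
)
conn.executemany(
    "INSERT INTO orders VALUES (?, ?, ?)",
    [
        (1, "2024-03-01", 120.0),
        (2, "2024-03-05", 80.0),
        (3, "2024-03-09", 45.5),
    ],
)


def changed_since(connection, last_refresh):
    """Query the source for rows whose order date is after the last refresh."""
    cursor = connection.execute(
        "SELECT order_id, order_date, amount FROM orders "
        "WHERE order_date > ? ORDER BY order_date",
        (last_refresh,),
    )
    return cursor.fetchall()


# ISO-formatted date strings compare correctly as text.
change_data = changed_since(conn, "2024-03-03")
```

The query only works because the source stores an order date for every row, which is the timestamping requirement noted above.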
Snapshot Change Data
• Involves periodic dumps of
source data.
• To derive change data, a
difference operation uses the
two most recent snapshots.
• The result of a difference
operation is called a delta.
• Snapshots are the only form of
change data without
requirements on a source
system.
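The difference operation can be sketched as a comparison of two snapshots keyed by a primary key. The snapshot contents and the split of the delta into inserted, updated, and deleted rows are assumptions for illustration.

```python
def snapshot_delta(previous, current):
    """Compute the delta between two snapshots, each a dict keyed by primary key."""
    inserted = {k: current[k] for k in current.keys() - previous.keys()}
    deleted = {k: previous[k] for k in previous.keys() - current.keys()}
    updated = {
        k: current[k]
        for k in current.keys() & previous.keys()
        if current[k] != previous[k]
    }
    return {"inserted": inserted, "updated": updated, "deleted": deleted}


# Hypothetical snapshots of a customer table: key -> (name, city).
previous_snapshot = {
    101: ("Ann", "Redmond"),
    102: ("Bob", "Seattle"),
    103: ("Cy", "Tacoma"),
}
current_snapshot = {
    101: ("Ann", "Bothell"),   # changed city
    102: ("Bob", "Seattle"),   # unchanged
    104: ("Dee", "Kent"),      # new row
}

delta = snapshot_delta(previous_snapshot, current_snapshot)
```

Nothing is asked of the source system beyond producing the dumps; the cost is that both snapshots must be held and compared in full.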
Change Data Classification

Data Quality Problems

• Multiple identifiers
• Different units
• Missing values
• Text data with different components and formats
• Conflicting data
• Different update times

Data Integration Concepts, Processes,
and Techniques

Data Cleaning Tasks


Lesson Objectives
• Explain the three types of data cleaning tasks
• Provide examples depicting data cleaning tasks
• Reflect on the tedious nature of data cleaning

Parsing
• Locates and separates individual data elements
in text
• Studied in computer science for decades
• Regular expressions for pattern specification
• Natural Language Processing (NLP)
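A small sketch of parsing with a regular expression: it locates and separates the elements of a one-line US-style address. The pattern and the assumed layout (number, street, city, two-letter state, five-digit postal code) are illustrative only; real address data usually needs more patterns or a dedicated parsing service.

```python
import re

# Assumed layout: "<number> <street>, <city>, <ST> <5-digit postal code>".
ADDRESS_PATTERN = re.compile(
    r"^(?P<number>\d+)\s+(?P<street>[^,]+),\s*"
    r"(?P<city>[^,]+),\s*(?P<state>[A-Z]{2})\s+(?P<postal_code>\d{5})$"
)


def parse_address(text):
    """Return the address elements as a dict, or None if the text does not fit."""
    match = ADDRESS_PATTERN.match(text.strip())
    return match.groupdict() if match else None


parsed = parse_address("15580 NE 31st Street, Redmond, WA 98052")
unparsed = parse_address("Redmond WA")
```

Text that does not fit the assumed layout returns None, so it can be routed to an exception-handling step.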

Parsing Example

Correcting Values
• Missing values
– Default value for inapplicable values
• For example, missing values for an order without an
employee can be replaced with a default value indicating a
web order.
– Typical value: average or median for numeric fields, mode for
non-numeric fields
– Complex processing for predicting values using
relationships to other fields, for example with data mining algorithms
• Conflicting values
– More recent value
– More credible source, as judged by domain experts
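The simpler correction rules above can be sketched as follows: a default value, a typical value (median or mode), and the more recent value for a conflict. The order rows, field names, and the "WEB" default are hypothetical.

```python
from statistics import median, mode


def fill_missing(rows, column, strategy, default=None):
    """Return new rows with None in `column` replaced according to `strategy`."""
    present = [row[column] for row in rows if row[column] is not None]
    if strategy == "default":
        fill = default
    elif strategy == "median":
        fill = median(present)      # typical value for numeric fields
    elif strategy == "mode":
        fill = mode(present)        # typical value for non-numeric fields
    else:
        raise ValueError(f"unknown strategy: {strategy}")
    return [{**row, column: fill} if row[column] is None else dict(row) for row in rows]


def more_recent_value(candidates):
    """Resolve a conflict by taking the value with the latest ISO date."""
    return max(candidates, key=lambda pair: pair[1])[0]


# Hypothetical order rows with missing values.
orders = [
    {"order": 1, "employee": "E7", "amount": 100.0, "channel": "phone"},
    {"order": 2, "employee": None, "amount": None, "channel": "web"},
    {"order": 3, "employee": "E2", "amount": 60.0, "channel": None},
    {"order": 4, "employee": "E7", "amount": 80.0, "channel": "web"},
]

corrected = fill_missing(orders, "employee", "default", default="WEB")
corrected = fill_missing(corrected, "amount", "median")
corrected = fill_missing(corrected, "channel", "mode")

last_name = more_recent_value([("Parker", "2015-06-01"), ("Parker-Lewis", "2019-02-11")])
```

Choosing the more credible source instead of the more recent value would need a source ranking supplied by domain experts, which is not sketched here.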
Correction Example

Detailed investigations, possibly conducted using search services,
can resolve some cases of unknown values and conflicting values.
Standardization
• Applies conversion routines to transform data
into preferred formats
• Uses standard business rules; custom business rules
can also be developed.
• Common standardizations:
– Unit of measure transformations
– Standard abbreviations (state names, titles, street
types)
• In addition, data standardization services can be
purchased for names, addresses, and product
details, although customization may be
necessary.
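A sketch of the two common standardizations named above: a unit of measure transformation and standard abbreviations for street types. The abbreviation table is a small assumed sample, not a postal standard.

```python
# Conversion factors to kilograms (1 lb is exactly 0.45359237 kg).
KILOGRAMS_PER_UNIT = {"kg": 1.0, "g": 0.001, "lb": 0.45359237}

# Assumed sample of preferred street-type abbreviations.
STREET_TYPES = {
    "street": "St", "st": "St",
    "avenue": "Ave", "ave": "Ave",
    "place": "Pl", "pl": "Pl",
}


def to_kilograms(value, unit):
    """Convert a weight in the given unit to kilograms."""
    return value * KILOGRAMS_PER_UNIT[unit.lower()]


def standardize_street(text):
    """Replace street-type words with their preferred abbreviation."""
    words = text.split()
    return " ".join(STREET_TYPES.get(word.lower().rstrip("."), word) for word in words)


weight_kg = to_kilograms(10, "lb")
work_street = standardize_street("15580 NE 31st Street")
home_street = standardize_street("16517 78th Place NE")
```

Custom business rules would typically extend the lookup tables rather than the functions.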
Standardization Example

This example extends the previous corrected example with
standardization.
Data Integration Concepts, Processes,
and Techniques

Pattern Matching with Regular Expressions
Lesson Objectives
• Explain the three major elements of regular
expressions
• Practice with regular expressions
• Reflect on the complexity and limitations of
regular expressions

Regular Expressions (regex)
[Figure: a search expression is made up of literals, metacharacters, and escape sequences]

• A literal is any character used in a search expression or target string.
• A metacharacter is one or more special characters that have a unique meaning and
are NOT used as literals in a search expression. For example, the character ^
(circumflex or caret) is a metacharacter.
• An escape sequence turns off the special meaning of a metacharacter so that it is
matched as a literal. In a regular expression, an escape sequence involves placing the
metacharacter \ (backslash) in front of the metacharacter that we want to use as a
literal.
Pattern Matching

Search expression: ^[a-z]+\.com$
Target string: abc.com
Match result: abc.com

Meta characters: ^ (caret or circumflex), [, ], +, -, \, $
Literals: c, o, m, a, z, .
Escape sequence: \.
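The search expression above can be tried with Python's standard `re` module. This is only a sketch; the extra target strings are made up to show what each metacharacter and the escape sequence contribute.

```python
import re

# The search expression from the slide, with the period escaped.
escaped = re.compile(r"^[a-z]+\.com$")

# The same expression without the escape: the period matches any character.
unescaped = re.compile(r"^[a-z]+.com$")


def matches(pattern, target):
    """Return True if the pattern matches somewhere in the target string."""
    return pattern.search(target) is not None


results = {
    "abc.com": matches(escaped, "abc.com"),        # lowercase letters, literal period, com
    "abcXcom": matches(escaped, "abcXcom"),        # X is not a literal period
    "abc.com.au": matches(escaped, "abc.com.au"),  # $ requires com at the end
    "ABC.com": matches(escaped, "ABC.com"),        # [a-z] excludes uppercase letters
    ".com": matches(escaped, ".com"),              # + requires at least one letter
}
unescaped_result = matches(unescaped, "abcXcom")
```

Without the backslash, "abcXcom" is accepted, which is why the escape sequence matters.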
Common meta characters

Iteration or quantifier: ?  *  +  {n}  {n,m}
Position: .  ^  $
Other: [ ]  [^]  \  |

Search expression: ^[a-z]+\.com$
Meta Character Summary
Metacharacter  Type         Meaning
?              Iteration    Matches preceding character 0 or 1 time
*              Iteration    Matches preceding character 0 or more times
+              Iteration    Matches preceding character 1 or more times
{n}            Iteration    Matches preceding character exactly n times
{n,m}          Iteration    Matches preceding character at least n times and at most m times
[]             Range        Matches one of enclosed characters one time
^              Position     Matches at the beginning of the target string; only has meaning as the first character in a regular expression
^              Range        Negation of search pattern if ^ is inside []. Hyphen inside square brackets defines a range of characters.
$              Position     Matches at the end of the target string; only has meaning as the last character in a regular expression
.              Position     Matches any character except a newline character at the specified position only
|              Alternation  Matches either pattern to the left or right of the | character
()             Group        Groups for matching parts of target strings
Meta Character Examples I
• This table shows six examples with multiple target strings per example.
Search expression      Target strings                        Evaluation details
“colou?r”              “color”, “colour”                     Matches both target strings.
“tre*”                 “tree”, “tread”, “trough”             Matches all three target strings; matches the preceding character 0 times in the third target string.
“tre+”                 “tree”, “tread”, “trough”             Does not match the third target string because the third character is o.
“[abcd]”               “dog”, “fond”, “pen”                  Matches the first two strings but not the third, because the third string does not contain one of the letters inside the square brackets.
“[0-9]{3}-[0-9]{4}”    “123-4567”, “1234-567”                Matches the first string but not the second; the first range must be matched three times and the second range four times.
“ba{2,3}b”             “baab”, “baaab”, “bab”, “baaaab”      Matches the first two strings but not the last two; the preceding character a must be matched between two and three times.
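The six examples can be checked mechanically. The sketch below uses `re.search`, which looks for the pattern anywhere in the target string; the dictionary simply restates the table.

```python
import re

# Search expressions and target strings restated from the table.
examples = {
    r"colou?r": ["color", "colour"],
    r"tre*": ["tree", "tread", "trough"],
    r"tre+": ["tree", "tread", "trough"],
    r"[abcd]": ["dog", "fond", "pen"],
    r"[0-9]{3}-[0-9]{4}": ["123-4567", "1234-567"],
    r"ba{2,3}b": ["baab", "baaab", "bab", "baaaab"],
}


def matching_targets(expression, targets):
    """Return the target strings in which the expression is found."""
    return [target for target in targets if re.search(expression, target)]


matched = {
    expression: matching_targets(expression, targets)
    for expression, targets in examples.items()
}
```

Printing `matched` reproduces the "Evaluation details" column, which is a quick way to practice before writing expressions for real data.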
Meta Characters II
• This table shows search expressions using position, iteration, and alternation meta characters.

Search expression   Target strings                          Evaluation details
“^win”              “erwin”, “window”                       Matches the second string but not the first, because win does not appear at the beginning of the first target string.
“win$”              “erwin”, “window”                       Matches the first string but not the second, because win does not appear at the end of the second target string.
“[^0-9]+”           “123”, “abc”, “a456”                    Matches the second and third target strings; the caret inside the square brackets negates the enclosed character range, matching any non-digit.
“abc.e*”            “fabc”, “fabcd”, “fabcee”               Matches the second and third target strings; the period, a positional meta character, requires a character following abc, so the search expression does not match the first target string.
“dog|cat|frog”      “a dog”, “cat friend”, “frogman”        Matches all three target strings; with the vertical bars as alternation meta characters, each target string contains one of the choices dog, cat, or frog.
More Complex Examples

Field Search Expression


User name ^[a-z0-9_-]{3,16}$
Hex value ^#?([a-f0-9]{6}|[a-f0-9]{3})$
Email address ^([a-z0-9_\.-]+)@([\da-z\.-]+)\.([a-z\.]{2,6})$
Web address ^(https?:\/\/)?([\da-z\.-]+)\.([a-z\.]{2,6})([\/\w \.-]*)*\/?$
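A short sketch that applies three of the expressions above as validators and also shows one limitation: the email expression rejects some addresses that are commonly accepted, such as those containing a plus sign or uppercase letters. The sample inputs are made up for illustration.

```python
import re

# Expressions copied from the table above.
USER_NAME = re.compile(r"^[a-z0-9_-]{3,16}$")
HEX_VALUE = re.compile(r"^#?([a-f0-9]{6}|[a-f0-9]{3})$")
EMAIL = re.compile(r"^([a-z0-9_\.-]+)@([\da-z\.-]+)\.([a-z\.]{2,6})$")


def is_valid(pattern, text):
    """Return True if the whole text fits the anchored pattern."""
    return pattern.search(text) is not None


checks = {
    "user ok": is_valid(USER_NAME, "data_eng-01"),
    "user too short": is_valid(USER_NAME, "ab"),
    "hex six digits": is_valid(HEX_VALUE, "#a3c113"),
    "hex three digits": is_valid(HEX_VALUE, "#fff"),
    "hex five digits": is_valid(HEX_VALUE, "#a3c11"),
    "email ok": is_valid(EMAIL, "john.doe@example.com"),
    "email with plus": is_valid(EMAIL, "john+etl@example.com"),
    "email uppercase": is_valid(EMAIL, "John@example.com"),
}
```

The last two checks come back False even though such addresses are in everyday use, so an expression like this is best treated as a first filter rather than a definition of validity.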

Regular expression testing sites

- http://regex101.com/
- http://www.regexr.com/
- http://regexpal.com/
- http://rubular.com/
Data Integration Concepts, Processes,
and Techniques

Matching and Consolidation


Entity Matching
• Identifies common entities from separate data
sources when no reliable common identifier
exists.
• Difficult matching process: no common identifier
– Data mining problem
– Also known as record linkage, entity identification,
and entity resolution
– Many approaches
• Improve data quality for better matching results
Matching Example
Field         Source 1                 Source 2
First name    Aimee                    Aimee
Middle name   Christina                C.
Last name     Parker                   Parker-Lewis
Job title     Product Manager          Prod. Mgr.
Firm          Microsoft Corporation    Microsoft
Street        15580 NE 31st Street     16517 78th Place NE
City          Redmond                  Bothell
State         WA                       WA
Postal Code   98052                    98020
Country       USA                      USA

Source 1 holds the pre-marriage name and work address; Source 2 holds the marital name and home address.
Merging Example
This example shows a possible result of merging records from the previous matching
example.
Target
First name    Aimee
Middle name   Christina
Last name     Parker-Lewis    (use latest name, married)
Job title     Product Manager
Firm          Microsoft Corporation
Street        16517 78th Place NE
City          Bothell
State         WA
Postal Code   98020
Country       USA
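One possible set of merge rules that reproduces the target record is sketched below. The rules are assumptions, not a prescribed method: take the last name and address fields from the newer record (Source 2 is assumed to be newer because it carries the married name), and otherwise keep the longer, more complete value.

```python
source_1 = {
    "First name": "Aimee", "Middle name": "Christina", "Last name": "Parker",
    "Job title": "Product Manager", "Firm": "Microsoft Corporation",
    "Street": "15580 NE 31st Street", "City": "Redmond", "State": "WA",
    "Postal Code": "98052", "Country": "USA",
}
source_2 = {
    "First name": "Aimee", "Middle name": "C.", "Last name": "Parker-Lewis",
    "Job title": "Prod. Mgr.", "Firm": "Microsoft",
    "Street": "16517 78th Place NE", "City": "Bothell", "State": "WA",
    "Postal Code": "98020", "Country": "USA",
}

# Fields assumed to follow the newer record rather than the longer value.
PREFER_NEWER = ("Last name", "Street", "City", "State", "Postal Code", "Country")


def merge_records(older, newer):
    """Merge two matched records field by field using the assumed rules."""
    merged = {}
    for field in older:
        if field in PREFER_NEWER:
            merged[field] = newer[field]                           # latest value wins
        else:
            merged[field] = max(older[field], newer[field], key=len)  # most complete value wins
    return merged


target = merge_records(source_1, source_2)
```

In practice the choice between the work and home address is a business decision; the rule here is only one way to arrive at the target shown above.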
Entity Matching Applications

• Marketing: combine customers from different companies after a merger
• Law enforcement: link crimes to individuals
• Fraud detection: filing fraudulent tax returns with different SSNs
• Health care: combining health records from the same individuals treated at different clinics
Entity Matching Outcomes

                  Actual
Predicted         Match              Non Match
Match             True match         False match
Possible Match    Investigation      Investigation
Non Match         False non match    True non match

• The rows represent predictions and the columns represent actual results of
matching two records for duplication.
• A true match involves a predicted match and an actual match, allowing
the two records to be combined correctly.
• The possible match situations involve predictions without enough
certainty to indicate a match or non match.
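The outcome table can be read as a lookup on a prediction and the actual result. In the sketch below the prediction comes from a similarity score and two thresholds; the score scale and the threshold values 0.4 and 0.8 are hypothetical.

```python
def predict(score, lower=0.4, upper=0.8):
    """Turn a similarity score in [0, 1] into a prediction (assumed thresholds)."""
    if score >= upper:
        return "Match"
    if score >= lower:
        return "Possible Match"
    return "Non Match"


def outcome(predicted, actually_same):
    """Look up the cell of the outcome table for a prediction and the actual result."""
    if predicted == "Possible Match":
        return "Investigation"
    if predicted == "Match":
        return "True match" if actually_same else "False match"
    return "False non match" if actually_same else "True non match"


# Hypothetical scored record pairs: (similarity score, records actually refer to same entity).
pairs = [(0.93, True), (0.85, False), (0.55, True), (0.20, True), (0.10, False)]
outcomes = [outcome(predict(score), same) for score, same in pairs]
```

Moving the two thresholds closer together reduces the number of investigations but raises the risk of false matches and false non matches.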


Consolidation
• Matched entities can be merged or linked.
• Merging matched records
• Linking matched records
– Households: linking combines
individuals with family and other social relationships.
– Transactions: all accounts
and transactions are associated with the same person.
Household Consolidation

[Figure: George Smith, Janet Smith, Karen Smith, and Thomas Smith linked as one household]

Household consolidation involves linking records from individuals living in the same
household.
Transaction Linking

[Figure: Account No. 83451234, Policy No. ME309451-2, and Transaction B498/97 linked to one person]

In transaction linking, all accounts and transactions are associated with the same
person.
Data Integration Concepts, Processes,
and Techniques

Quasi Identifiers & Distance Functions for Entity Matching
Quasi Identifiers
• Entity matching algorithms use quasi identifiers to compensate for missing common
identifiers.
• Almost unique in combination
• In a study published in 2000, Sweeney demonstrated that
87% of the US population can be identified by a
combination of gender, birth date, and postal code.
• Examples
– Name components
– Location components
– Profession
– Birthdate
– Race
Distance Functions
• Poor data quality, such as missing values and unknown update
times, complicates choices for quasi identifiers.
• Entity matching approaches use distance functions to determine if
quasi identifiers in two entities indicate the same entity.
• Numeric quasi identifiers: determine the amount of space between
records or values
– Determine distance between combinations of quasi identifier values
– Determine distance between two quasi identifier values
• Text quasi identifiers: text distance
– Important for quasi identifiers containing text
– Examples: name and location components, which differ in spelling, length,
and context.
– Distance functions for text are used to compare quasi identifiers with these
differences.
– Have many applications outside of entity matching, such as spelling
correction.
Edit Distance
• Used for comparing relatively short text
values occurring in entity matching applications.
• Very common distance function for text
• Operations to transform one text value into another
– Delete a character
– Insert a character
– Substitute one character for another
• Edit distance is defined as the minimal number
of operations to transform a source text value
into a target text value.
Edit Distance Example

Source: Saturday, target: Sunday

First sequence (three operations):
1. Sturday (delete “a”)
2. Surday (delete “t”)
3. Sunday (substitute “n” for “r”)

Second sequence (four operations):
1. Suturday (substitute “u” for “a”)
2. Sunurday (substitute “n” for “t”)
3. Sunrday (delete “u”)
4. Sunday (delete “r”)

The first sequence is preferred because it contains fewer operations.
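The minimal number of operations can be computed with the standard dynamic programming formulation of edit distance (often called Levenshtein distance). This is a sketch of that well-known technique, not necessarily the procedure any particular tool uses; each operation is assumed to cost 1.

```python
def edit_distance(source, target):
    """Minimal number of deletes, inserts, and substitutions turning source into target."""
    # previous[j] holds the distance from the processed prefix of source
    # to the first j characters of target.
    previous = list(range(len(target) + 1))
    for i, source_char in enumerate(source, start=1):
        current = [i]  # deleting the first i source characters reaches an empty target
        for j, target_char in enumerate(target, start=1):
            substitution_cost = 0 if source_char == target_char else 1
            current.append(min(
                previous[j] + 1,                      # delete source_char
                current[j - 1] + 1,                   # insert target_char
                previous[j - 1] + substitution_cost,  # substitute or keep
            ))
        previous = current
    return previous[-1]


saturday_to_sunday = edit_distance("Saturday", "Sunday")
```

For the pair above the function returns 3, the length of the first sequence. The comparison is case sensitive, so values are usually normalized before matching.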
Quiz
• What is the edit distance between “Break”
and “Trick”?
• 5
• 2
• 3
• 4
Phonetic Distance Functions
• Many applications in law enforcement to account
for different name spellings with similar
pronunciations.
• Words with the same pronunciation should have
the same phonetic value.
• Phonetic distance functions mainly code words into
standard consonant sounds.
• Two phonetic distance functions are widely
available in DBMSs and data integration tools
– Soundex: 6 consonant sounds
– Metaphone: 16 consonant sounds
– Metaphone, with more consonant sounds, was
developed as an improvement to Soundex.
Phonetic Matching Examples
• Soundex
– Soundex(Assistance) = A223
– Soundex(Assistants) = A223
• Metaphone
– Metaphone(Assistance) = ASSTNS
– Metaphone(Assistants) = ASSTNTS

Examples from W3Schools

- Soundex examples from
http://www.w3schools.com/php/func_string_soundex.asp
- Metaphone examples from
http://www.w3schools.com/php/func_string_metaphone.asp
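A compact sketch of the commonly described American Soundex rules: keep the first letter, code the remaining consonants into six groups, skip repeated codes, and pad or truncate to four characters. Implementations in DBMSs can differ in edge cases, so treat this as illustrative. Metaphone is not sketched because its rule set is much longer.

```python
# The six consonant groups and their digits.
SOUNDEX_CODES = {}
for letters, digit in (
    ("BFPV", "1"), ("CGJKQSXZ", "2"), ("DT", "3"),
    ("L", "4"), ("MN", "5"), ("R", "6"),
):
    for letter in letters:
        SOUNDEX_CODES[letter] = digit


def soundex(word):
    """Return the four-character Soundex code of a word."""
    letters = [char for char in word.upper() if char.isalpha()]
    if not letters:
        return ""
    first = letters[0]
    digits = []
    previous = SOUNDEX_CODES.get(first, "")
    for letter in letters[1:]:
        digit = SOUNDEX_CODES.get(letter, "")
        if digit and digit != previous:
            digits.append(digit)      # new consonant sound
        if letter not in "HW":
            previous = digit          # vowels reset the previous code; H and W do not
    return (first + "".join(digits) + "000")[:4]


assistance_code = soundex("Assistance")
assistants_code = soundex("Assistants")
```

Both words code to A223, matching the example above, while names with different spellings and similar pronunciations, such as Robert and Rupert, also share one code.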
