Data Profiling PPT - How To
IDQ CONFERENCE
NOVEMBER 4, 2013
LITTLE ROCK, AR
Class Overview
• Meet your instructor, class demographics
• Data profiling: an overview
• Tools
• Where does data profiling fit?
• Modern data profiling techniques
• Basic Data Profiling
• Advanced Data Profiling Techniques
• Subjects and Subject Level Data Profiling
• Reporting
• Tools
• Discussion, Questions, Wrap-Up
• Mathematical
• Geometrical (your instructor has been described as having a ‘Roman Profile’)
• Racial
• Ignore other factors when deciding whom to stop, frisk, scan, etc.
• Data
• Simplest example: study one data element regardless of any other attributes
• We will not dwell on the tools, just give you a quick feel for how these
techniques have been implemented by the software development
community (in the cases where they have)
• We will demo by example using Talend since it’s free and you guys
can go ahead and download it yourselves if you like
• Key Constraints
• Frequency Distributions
• Outlier Study
• Frequent Values, Infrequent Values
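• To make these basic checks concrete before we touch the tools, here is a minimal sketch in Python/pandas. The file name and the column names (EMPLID, BIRTHDATE, ANNUAL_RT) are illustrative assumptions, not anything from the demo data.

# Minimal sketch of the basic profiling checks above, using pandas.
# The file name and column names are illustrative assumptions.
import pandas as pd

df = pd.read_csv("personal_data_extract.csv")

# Key constraint check: is the candidate key unique and non-null?
key = "EMPLID"
print("duplicate key values:", df[key].duplicated().sum())
print("null key values:", df[key].isna().sum())

# Frequency distribution of a single column
freq = df["BIRTHDATE"].value_counts(dropna=False)
print(freq.head(10))          # most frequent values
print(freq.tail(10))          # least frequent values

# Simple outlier study on a numeric column via the 1st/99th percentiles
low, high = df["ANNUAL_RT"].quantile([0.01, 0.99])
outliers = df[(df["ANNUAL_RT"] < low) | (df["ANNUAL_RT"] > high)]
print(len(outliers), "values outside the 1st-99th percentile band")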
• I have played around with three tools, but I can't make any claims about which is better than the others. I will talk about the three, compare them, and look at their pluses and minuses.
• Tools can operate as desktop versions and/or client/server installs, which can facilitate collaboration.
• I will demonstrate Talend and give you some handouts I have prepared, which should give you a flavor of Trillium and DataFlux.
• The bottom line is that the tool will provide you with a great deal of metadata; the art here is how you arrange and disseminate that metadata.
Reporting
• The situation in the ‘real world’ is no different. If you cannot take the output of
your data profile and create some simple and easy to swallow summaries, your
project sponsors will feel lost. This will lead to bad things.
• The best reporting tools start at a very high level and allow drill down so interested
parties can dig and see the detail, but the details are not provided until asked for.
• A lot of good work has been done with regard to this sort of drill-down reporting; the entire field of BI is essentially (as I understand it) designed around the careful extraction and dissemination of information.
• Anybody can press a button and create a bunch of meta-data, the art of this
business is preparing useful, usable, and actionable reports.
Reporting
• Anybody can press a button and create a bunch of meta-data, the art of this
business is preparing useful, usable, and actionable reports… how to do this?
• I find the simple approach best. Create a single table (or as few as possible) to hold all your results; this is especially easy at the subject level.
• This table itself can then be profiled to provide summaries and overviews, but since it contains all the metadata, it can also allow drill-down to the metadata and, if you're careful, to the data itself.
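• To make the "one results table" idea concrete, here is a small pandas sketch that stacks one row of metrics per column into a single running table. The metric names and the extract file names are my own assumptions.

# Sketch of the "one table holds all results" idea: each row is one
# column-level metric, so the result set can itself be profiled,
# summarized, or drilled into. Names are illustrative assumptions.
import pandas as pd

def profile_table(df: pd.DataFrame, table_name: str) -> pd.DataFrame:
    rows = []
    for col in df.columns:
        s = df[col]
        rows.append({
            "table_name": table_name,
            "column_name": col,
            "row_count": len(s),
            "null_count": int(s.isna().sum()),
            "distinct_count": int(s.nunique(dropna=True)),
            "min_value": s.min(skipna=True),
            "max_value": s.max(skipna=True),
        })
    return pd.DataFrame(rows)

# Append results for every table profiled into one running results table.
results = pd.concat(
    [profile_table(pd.read_csv(f), f) for f in ["table1224.csv", "table1216.csv"]],
    ignore_index=True,
)
print(results.head())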
Demonstration using Talend
At this point we will fire up Talend and run through some examples of the
data profiling techniques we discussed.
• Questions/feedback?
Dataflux, Talend, Trillium
We’ve attached some screen shots and notes for you to read later at the bar
or on the plane back home
The desktop looks like this; the product shown here is called the Data Management Studio.
Let’s use the PS_PERSONAL_DATA table since it has a lot of recognizable data fields.
From the main screen, select New -> Profile.
We can also profile the relationships between two or more fields in a table.
Start by selecting a field name from the list under Standard Metrics (we’re checking EMPLID vs EMPLID_INT here)
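• Outside the tool, the same two-field comparison can be approximated in a few lines of pandas; the extract file name below is an assumption.

# Rough equivalent of the two-field comparison outside the tool:
# how often do EMPLID and EMPLID_INT agree? (The extract file name
# is an assumption.)
import pandas as pd

df = pd.read_csv("ps_personal_data.csv", dtype=str)
both_present = df.dropna(subset=["EMPLID", "EMPLID_INT"])
matches = (both_present["EMPLID"] == both_present["EMPLID_INT"]).sum()
print(f"{matches} of {len(both_present)} rows have matching EMPLID / EMPLID_INT")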
• Finally, we can save any and all of these reports to Excel for easy distribution.
• In fact, you can set up a job to run this profile on a schedule, create the Excel report, and email it around.
• Start by selecting Export… from the file menu.
• Configure the next menu like this. Be careful: if you select all tables and fields, you'll get 1,500 Excel reports.
• It takes a few seconds, but you'll get a nice Excel report with some useful stats.
• I find this report useful for a file delivery; it provides a good overview of the structure of the data: maxes and mins, and the numbers of unique and null values.
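• If you want to approximate that Excel deliverable without the tool, a pandas sketch like the following produces similar structure stats (it assumes the openpyxl package is installed and uses an extract file name of my own choosing).

# A rough stand-in for the Excel deliverable described above:
# per-column structure stats written to a spreadsheet with pandas.
# (Requires the openpyxl package; file names are assumptions.)
import pandas as pd

df = pd.read_csv("ps_personal_data.csv")
summary = pd.DataFrame({
    "dtype": df.dtypes.astype(str),
    "non_null": df.notna().sum(),
    "nulls": df.isna().sum(),
    "distinct": df.nunique(),
    "min": df.min(numeric_only=True),
    "max": df.max(numeric_only=True),
})
summary.to_excel("ps_personal_data_profile.xlsx")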
The freeware product we will demo today is called Talend Open Studio for Data
Quality (TOS-DQ)
Oracle Tables:
PS_JOB – TABLE1214
PS_EMPLOYMENT – TABLE1216
PS_EARNINGS_BAL – TABLE1218
PS_PAY_CHECK – TABLE1220
PS_PERS_NID – TABLE1222
PS_PERSONAL_DATA – TABLE1224
Select TABLE1224.
This table has a great many columns; select 15 or 20 or so.
Now we need to tell Talend what to analyze, by default it will do nothing and complain that you
didn’t set the ‘indicators’.
Click the 'Select indicators for each column' hyperlink. This will allow you to analyze specific things for each column you selected in the prior step.
Run the job and review the results.
• Just for fun, let’s profile another table that we derived from the personal data table.
• We created a table that has two columns, one for SSN and one for FIRST_NAME | LAST_NAME | BIRTHDATE
• We can use Talend to look for duplicate entries in this table
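• Here is a rough pandas sketch of that duplicate check; the file name and the NAME_DOB column name for the concatenated field are assumptions based on the description above.

# Sketch of the duplicate check on the derived two-column table:
# flag SSNs tied to more than one FIRST_NAME|LAST_NAME|BIRTHDATE string,
# and composite strings tied to more than one SSN. (File/column names
# are assumptions based on the description above.)
import pandas as pd

df = pd.read_csv("ssn_vs_name_dob.csv", dtype=str)

ssn_dupes = df.groupby("SSN")["NAME_DOB"].nunique()
print("SSNs with >1 name/birthdate combo:", (ssn_dupes > 1).sum())

combo_dupes = df.groupby("NAME_DOB")["SSN"].nunique()
print("name/birthdate combos with >1 SSN:", (combo_dupes > 1).sum())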
Trillium operates as a client/server product, but for today's exercise we have the source databases, the server, and the client all running on the same box.
• Next, we will have to define the database connections we will use; you can also define flat-file connections here.
• Right click on Loader Connections and select Add Loader Connection…
• Now we will exit out of the Repository Manager and start the TSS-13 Control Center.
• We will now load some data. Trillium performs its analysis on the data as it is loaded.
The parameters of this analysis are set to default values that can be adjusted based on
your particular situation
• Select the ‘Entities’ tab, right click and select ‘Create Entity’
• IMPORTANT! To save time and space, we selected a 10% sample of the records from
the source table.
• This will take a few minutes to load and analyze the tables, so we'll jump to another repository that has already been created and had its data analyzed.
• You can track the status if you select Analysis -> Background Tasks.
• When the jobs are done, there is a ton of metadata collected; getting through it can seem daunting at first.
• Start by getting back to the main screen and selecting 'Entities', then pick a table (for example, Tdwi Owner Table1216) and click on it.
• Select 'Relationships' and you'll find the results of Trillium's key analysis.
• Here, the software has correctly identified Emplid Int and Emplid as table keys.
• One of the more interesting things Trillium uncovers is the relationships between different pairs of data elements; it picks up correlations.
• Like most of its results, Trillium overshoots and some of what it picks up can be thrown away, but in my example it did uncover some non-obvious relationships between data elements.
• Go to Relationship Summary and select Discovered under Dependencies.
• Go ahead and click around and see what else you can find, you can do no damage
here.
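• For the curious, here is a brute-force pandas sketch of the pairwise dependency idea that Trillium automates. Like the tool, it over-reports on small samples, so treat hits as leads to investigate; the file name is an assumption.

# Brute-force sketch of pairwise dependency discovery: column A
# "determines" column B if every distinct A value maps to exactly one
# B value among the non-null rows. (File name is an assumption.)
import pandas as pd
from itertools import permutations

df = pd.read_csv("table1216.csv")

for a, b in permutations(df.columns, 2):
    pairs = df[[a, b]].dropna().drop_duplicates()
    if len(pairs) and pairs[a].nunique() == len(pairs):
        print(f"{a} appears to determine {b}")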
• Click a field name on the left and wait a few seconds and you’ll get some interesting
breakdowns.
• Let's say you want to know what one of these breakdowns is telling you. Click on the diagram.
• Now, if you right click on the row listed, you can drill down to the data.
• The Soundex* analysis is interesting also. If you find a column with, say, a city name,
like in our Table1224, you can see how the algorithm is used during the match address
analysis.
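• For reference, here is a simplified sketch of the classic Soundex code; the matching engines use their own tuned variants, so treat this as illustrative only.

# A simplified sketch of the classic Soundex code (real matching engines
# use tuned variants; the h/w special case is ignored here).
def soundex(word: str) -> str:
    groups = {"bfpv": "1", "cgjkqsxz": "2", "dt": "3",
              "l": "4", "mn": "5", "r": "6"}
    codes = {ch: d for letters, d in groups.items() for ch in letters}
    word = "".join(ch for ch in word.upper() if ch.isalpha())
    if not word:
        return ""
    result = word[0]
    prev = codes.get(word[0].lower(), "")
    for ch in word[1:].lower():
        d = codes.get(ch, "")
        if d and d != prev:
            result += d
        prev = d
    return (result + "000")[:4]

print(soundex("Pittsburgh"), soundex("Pittsburg"))  # both code to P321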
• The Min is $0.01, which, even during our current economic situation, seems low for an annual rate. I assume this is a 'rate', not actual compensation received.
• So I’d like to create a business rule to see how many tiny annual rates we have.
• Configure thusly:
• Now the cool thing about Trillium is that you can quickly drill down and see the records
that passed or failed this business rule. This allows you to research the ‘bad’ data,
tweak your rule and so on.
• Notice anything?
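• The same business rule, with the pass/fail drill-down, can be sketched in pandas like this; the file name, column names, and the $100 cut-off are assumptions for illustration.

# Sketch of the tiny-annual-rate business rule outside the tool: flag
# annual rates below a threshold and keep the failing rows handy for
# drill-down. (Names and the $100 cut-off are assumptions.)
import pandas as pd

df = pd.read_csv("ps_job.csv")
failed = df[df["ANNUAL_RT"] < 100]

print(f"{len(failed)} of {len(df)} rows fail the tiny-annual-rate rule")
print(failed[["EMPLID", "ANNUAL_RT"]].head())   # drill down to the 'bad' data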
• As an example, we have a file of data containing electric usage data. The data contains a meter ID and a number of readings for
each meter.
• First, we profile this file and record the profile of each data attribute (AKA column or field).
• In doing so we see that the data spans 6 weeks or so
• Now we use a different product to do the subject-level profile and determine the breakdown of the number of reads per meter (here the meter is the subject)
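• The reads-per-meter breakdown is a simple groupby once the data is in a frame; the file and column names below are assumptions based on the description above.

# Subject-level sketch: reads per meter over the extract, done with a
# simple groupby. (File/column names are assumptions.)
import pandas as pd

reads = pd.read_csv("meter_reads.csv", parse_dates=["READ_TS"])
print("date span:", reads["READ_TS"].min(), "to", reads["READ_TS"].max())

reads_per_meter = reads.groupby("METER_ID").size()
print(reads_per_meter.describe())              # typical vs. sparse meters
print(reads_per_meter.nsmallest(10))           # meters reporting least often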
Looking now at the frequency distribution of the voltages, we can see these are all very low (within the bottom 0.1%), so we have discovered why these meters aren't giving out reads as often as they should. They have dead batteries.
• In order to study the time component, we need to look at subsequent records in the data, ordered by time, for a specific subject. (Subsequent records that cross into a different subject are meaningless.)
• I do not know how to do this with any of the tools we have here, but I suspect it is possible; I am investigating. Meanwhile, I have created a metadata table with a simple script (sketched at the end of this section) and then profiled it to understand the state transition behavior of the JOB table.
• Profiling this table of metadata yields the time-ordered pairs present in the data. Often these pairs differ from the 'allowed' values given to the DQ analyst by the IT guys; working through any discrepancies between the actual and the expected values is a useful exercise that can yield a few business rules.
• Again, we look for very frequent and very infrequent values; here we have no standouts.
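• One way to build the transition metadata table described above (ordering within each subject so that pairs never cross subjects) is sketched below; the file and column names (EMPLID, EFFDT, ACTION) are assumptions modeled on a PS_JOB-style table.

# Order each subject's rows by time, pair every status with the next one,
# then profile the pair frequencies. (File/column names are assumptions.)
import pandas as pd

job = pd.read_csv("ps_job.csv", parse_dates=["EFFDT"])
job = job.sort_values(["EMPLID", "EFFDT"])

# shift(-1) within each EMPLID so pairs never cross into another subject
job["NEXT_ACTION"] = job.groupby("EMPLID")["ACTION"].shift(-1)
transitions = job.dropna(subset=["NEXT_ACTION"])

pair_counts = transitions.groupby(["ACTION", "NEXT_ACTION"]).size().sort_values()
print(pair_counts.tail(10))   # most frequent transitions
print(pair_counts.head(10))   # rare transitions worth questioning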