Chapter 3
• Soon after publishing a result, you might realize that the output is not
exactly correct and that you need to apply additional transformations or
expose some additional profiling results.
Additional Aspects: Subsetting and Sampling
• Consider the case in which your dataset contains a heterogeneous set
of records, differing either in structure (e.g., some records contain
more or different fields from the rest) or in granularity (e.g., some
records correspond to customers, whereas others correspond to
accounts).
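One way to subset a heterogeneous dataset by structure is to group records by the set of fields they contain. The following is a minimal sketch in Python, not a prescribed method; the sample records and the helper name subset_by_structure are hypothetical and chosen only for illustration.

```python
from collections import defaultdict

# Hypothetical mixed records: some describe customers, others accounts.
records = [
    {"customer_id": 1, "name": "Ada"},
    {"customer_id": 2, "name": "Grace"},
    {"account_id": 10, "customer_id": 1, "balance": 250.0},
    {"customer_id": 3, "name": "Edsger", "email": "e@example.com"},
]


def subset_by_structure(rows):
    """Group rows by the sorted tuple of field names they contain."""
    groups = defaultdict(list)
    for row in rows:
        groups[tuple(sorted(row))].append(row)
    return dict(groups)


subsets = subset_by_structure(records)
for fields, rows in subsets.items():
    # Each distinct field signature becomes its own subset to wrangle.
    print(fields, len(rows))
```

Each subset can then be transformed and profiled separately. Subsetting by granularity usually needs domain knowledge as well, because two records can share the same fields and still describe different kinds of entities.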
• In this case, the iterative process of transforming and profiling your data
is materially hampered by the time required to compute and execute
transformations.
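If the slowdown comes from the volume of data, one common remedy is to iterate on a random sample and rerun the finished transformations on the full dataset afterward. The sketch below assumes the records fit in a Python list; the function name sample_records and the fixed seed are illustrative assumptions, not something the text specifies.

```python
import random


def sample_records(rows, k, seed=0):
    """Return up to k rows chosen uniformly at random, without replacement."""
    rng = random.Random(seed)  # fixed seed keeps the sample reproducible
    return rng.sample(rows, min(k, len(rows)))


# Hypothetical large dataset.
rows = [{"id": i, "value": i * 2} for i in range(100_000)]

# Develop and profile transformations against this smaller working set.
working_set = sample_records(rows, k=1_000)
```

A uniform sample can miss rare record types, so it is worth profiling the full dataset at least once before trusting results derived from the sample.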
• Chances are, you have some sense of what these ages should be: you
started your business 11 years ago, so no customer should show a
duration of more than 11 years.
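An expectation like this can be turned into a simple range check. The following sketch flags customers whose duration falls outside the plausible range; the field name years_as_customer, the sample values, and the lower bound of zero are assumptions added for illustration.

```python
MAX_YEARS = 11  # from the text: the business started 11 years ago

# Hypothetical customer records.
customers = [
    {"customer_id": 1, "years_as_customer": 3.5},
    {"customer_id": 2, "years_as_customer": 14.0},
    {"customer_id": 3, "years_as_customer": -1.0},
    {"customer_id": 4, "years_as_customer": 11.0},
]


def out_of_range(rows, field, low, high):
    """Return the rows whose value for `field` lies outside [low, high]."""
    return [r for r in rows if not (low <= r[field] <= high)]


# Assumed lower bound: a duration cannot be negative.
suspect = out_of_range(customers, "years_as_customer", 0, MAX_YEARS)
print([r["customer_id"] for r in suspect])
```

Flagged records are candidates for investigation, not automatic deletion; the check only shows that a value conflicts with what you know about the business.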
• Structuring primarily involves moving record field values around, and in some
cases summarizing those values.
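As a small illustration of the summarizing case, the sketch below collapses one record per order into one total per customer. The order data and the helper name total_by are hypothetical.

```python
from collections import defaultdict

# Hypothetical order-level records (one row per order).
orders = [
    {"customer_id": 1, "amount": 20.0},
    {"customer_id": 1, "amount": 5.0},
    {"customer_id": 2, "amount": 12.5},
]


def total_by(rows, key, value):
    """Sum `value` over all rows that share the same `key`."""
    totals = defaultdict(float)
    for r in rows:
        totals[r[key]] += r[value]
    return dict(totals)


# The result has customer granularity rather than order granularity.
totals = total_by(orders, "customer_id", "amount")
```

Note that summarizing changes the granularity of the data, so the original order-level detail is no longer recoverable from the output alone.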
• In either event, the explicit goal when loading raw data is to perform
the minimum number of transformations needed to make the data
available for metadata analysis and eventual refinement.
• The general objectives are “don’t lose any data” and “fixing quality
problems comes next.”
• Satisfying these objectives will require limited structuring
transformations and enough profiling to ensure that data was not lost
or corrupted in the ingestion process.
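A light ingestion check might compare the number of records parsed against the number of data lines in the raw input, and count empty values per field. The sketch below uses Python's csv module on a hypothetical CSV string; the function name load_and_profile is an assumption, and the line-count comparison only holds when no quoted field contains an embedded newline.

```python
import csv
import io

# Hypothetical raw input; the third record has an empty name.
raw_text = "id,name\n1,Ada\n2,Grace\n3,\n"


def load_and_profile(text):
    """Parse CSV text and return (rows, expected_row_count, empty_counts)."""
    # Expected count: every line except the header.
    # Assumes no embedded newlines inside quoted fields.
    expected = len(text.splitlines()) - 1
    rows = list(csv.DictReader(io.StringIO(text)))
    fields = rows[0].keys() if rows else []
    # Count empty strings per field as a basic corruption signal.
    empty_counts = {f: sum(1 for r in rows if r[f] == "") for f in fields}
    return rows, expected, empty_counts


rows, expected, empty_counts = load_and_profile(raw_text)
if len(rows) != expected:
    print("possible data loss:", expected - len(rows), "records missing")
```

No values are altered here: the records are loaded as strings and only counted, which is in keeping with deferring quality fixes to a later step.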
Describing Data
• Assessing the structure, granularity, accuracy, temporality, and scope
of your data is a profiling-heavy activity.
• As the new data is blended in with the old, set-based profiling will
provide the basic feedback on the quality of the blend.
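One simple form of set-based profiling for a blend is to compare the key sets of the two sources. The sketch below reports how many keys match and how many appear on only one side; the datasets, the choice of customer_id as the blend key, and the helper name profile_blend are hypothetical.

```python
# Hypothetical existing and incoming datasets sharing a customer_id key.
old_data = [{"customer_id": 1}, {"customer_id": 2}, {"customer_id": 3}]
new_data = [{"customer_id": 2}, {"customer_id": 3}, {"customer_id": 4}]


def profile_blend(old_rows, new_rows, key):
    """Summarize how the key sets of two datasets overlap."""
    old_keys = {r[key] for r in old_rows}
    new_keys = {r[key] for r in new_rows}
    return {
        "matched": len(old_keys & new_keys),
        "only_old": len(old_keys - new_keys),
        "only_new": len(new_keys - old_keys),
    }


blend_profile = profile_blend(old_data, new_data, "customer_id")
print(blend_profile)
```

A large count of unmatched keys on either side may point to a mismatched key format or differing scope between the sources, and is worth investigating before relying on the blended result.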