Chapter 3

Data Wrangling Dynamics


• Data wrangling is a generic phrase capturing the range of tasks
involved in preparing your data for analysis.

• Data wrangling begins with accessing your data.

• Sometimes, access is gated on getting appropriate permission and making the corresponding changes in your data infrastructure.

• Access also involves manipulating the locations and relationships between datasets.
• This kind of data wrangling involves everything from moving datasets around a folder hierarchy, to replicating datasets across warehouses for easier access, to analyzing differences between similar datasets and assessing overlaps and conflicts.
• After you have successfully accessed your data, the bulk of your data
wrangling work will involve transforming the data itself—
manipulating the structure, granularity, accuracy, temporality, and
scope of your data to better align with your analysis goals.

• All of these transformations are best performed with tools that provide meaningful feedback (so that the person performing the manipulations is assured they were successful).
• We refer to this feedback as profiling. In many cases, a predefined
(and, hence, somewhat generic) set of profiling feedback is sufficient
to determine whether an applied transformation was successful.

• In other cases, customized profiling is required to make this determination. In either event, the bulk of data wrangling involves frequent iterations between transforming and profiling your data.
• A final set of data wrangling tasks can be understood as publishing.
• Publishing is best understood from the perspective of what is published.
• In some cases, what is published is a transformed version of the input datasets (e.g., in the design and creation of “refined” datasets).
• In other cases, the published entity is the transformation logic itself (e.g., as a script that generates the range of statistics and insights in a regular report).
• A final kind of publishing involves creating profiling metadata about
the dataset.
• These profiling reports are critical for managing automated data
services and products.
• In addition to iterations between transforming and profiling the data,
there are less frequent iterations that return to accessing data.

• Soon after publishing a result, you might realize that the output is not
exactly correct and you need to apply additional transformations or
expose some additional profiling results.
Additional Aspects: Subsetting and Sampling
• consider the case in which your dataset contains a heterogeneous set
of records, differing either in structure (e.g., some records contain
more or different fields from the rest) or in granularity (e.g., some
records correspond to customers, whereas others correspond to
accounts).

• The best wrangling approach is to split up the original dataset and wrangle each subset separately; then, if necessary, merge the results together again.
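• A minimal sketch of this split/wrangle/merge pattern in pandas, assuming a hypothetical extract whose records carry a record_type field distinguishing customers from accounts:

```python
import pandas as pd

# Hypothetical mixed extract: some rows describe customers, others describe
# accounts, and the two record types do not share the same fields.
raw = pd.read_csv("mixed_records.csv")

# Split the dataset by record type so each subset can be wrangled on its own.
customers = raw[raw["record_type"] == "customer"].dropna(axis=1, how="all")
accounts = raw[raw["record_type"] == "account"].dropna(axis=1, how="all")

# ... apply customer-specific and account-specific transformations here ...

# If the analysis needs one table again, merge the results back together,
# e.g., by linking each account to the customer that owns it.
merged = accounts.merge(customers, on="customer_id", suffixes=("_acct", "_cust"))
```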
• Now consider the case in which your dataset is too large to manually
review each record or when your dataset is so large that even simple
transformations require prohibitively long timeframes to complete.

• In this case, the iterative process of transforming and profiling your data
is materially hampered by the time required to compute and execute
transformations.

• Suppose that you make a small change to a derived calculation or you change a rule to group a few customer segments into a wider segment. Now you apply this transformation and must wait a minute, or 10 minutes, or half a day to see what the results might look like.
• Understandably, data wrangling work will dominate your analysis
workflows and you won’t get through many analyses.
• The critical approach to speeding up your data wrangling is to work
with some samples of the entire dataset that you can transform and
profile at interactive time scales (ideally within 100 milliseconds, but
occasionally up to a few seconds).
• Unfortunately, working with samples to speed up your data wrangling
is not as straightforward as it sounds.
• To understand the importance of sampling, consider a simple
transformation involving the calculation to determine the length of
time each of your customers has been using your product or service
based on the date each one registered for it.

• Chances are, you have some sense of what these ages should be: you
started your business 11 years ago, so no customer should show a
duration of more than 11 years.

• You had a big increase in customers about 3 years ago, so you’d expect to see a corresponding bump in the overall age distribution around the 3-year mark. And so on.
• These expectations point to a couple of sampling techniques that
would be useful for assessing whether your transformation to
calculate customer age is working correctly.

• Namely, you want a sample that contains extreme values (customer records with the earliest and latest registration dates) and that randomly samples over the rest of the records (so that overall distributional trends are visible).
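• As a sketch, assuming a hypothetical customers table with a registration_date column, such a sample could keep the records with the earliest and latest registrations and draw a random sample over the rest:

```python
import pandas as pd

customers = pd.read_csv("customers.csv", parse_dates=["registration_date"])

# The derived field under test: how long each customer has been registered.
customers["tenure_years"] = (
    pd.Timestamp.today() - customers["registration_date"]
).dt.days / 365.25

# Keep the extreme values (earliest and latest registrations)...
extremes = customers.nsmallest(5, "registration_date").index.union(
    customers.nlargest(5, "registration_date").index
)
# ...and add a random sample over the remaining records.
rest = customers.drop(index=extremes).sample(n=1000, random_state=0)
sample = pd.concat([customers.loc[extremes], rest])

# No tenure should exceed the age of the business (11 years in the example),
# and the distribution should show the bump of customers from ~3 years ago.
assert sample["tenure_years"].max() <= 11
print(sample["tenure_years"].describe())
```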
• Consider a more complex situation that involves case-based
transformations based on record groups; for example, you need to
convert transaction amounts to US dollars, and your dataset contains
transactions in Euros, GB Pounds, and so on.
• Each reporting currency requires its own transformation.
• To assess that all of the currency transformations were applied
correctly, you need to profile results covering all of the currencies
occurring in your dataset.
• Samples that cover all groups, or “strata,” are often referred to as
stratified samples; they provide representation of each group, even
though they might bias the overall trends of the full dataset (by over
representing small groups relative to large ones, for example).

• There are numerous techniques for extracting different kinds of samples from large datasets (e.g., see Synopses for Massive Data: Samples, Histograms, Wavelets, Sketches by Cormode et al.), and some software packages and databases implement these methods for you.
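• As an illustration, a stratified sample in pandas might draw a fixed number of rows per currency (assuming a hypothetical transactions table with a currency column), so every currency appears in the profile even if it is rare overall:

```python
import pandas as pd

transactions = pd.read_csv("transactions.csv")

# Stratified sample: up to 100 rows from each currency "stratum".
stratified = transactions.groupby("currency", group_keys=False).apply(
    lambda grp: grp.sample(n=min(len(grp), 100), random_state=0)
)

# Every currency is represented, which is what profiling the currency
# conversions requires, even though rare currencies are now over-represented.
print(stratified["currency"].value_counts())
```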
• With an understanding of the basic steps in data wrangling (access, transformation, profiling, and publishing), and of how these steps can incorporate sampling to handle big datasets and split/fix/merge strategies for heterogeneous datasets, we turn our attention now to the core types of transformation and profiling.
Core Transformation and Profiling Actions
• The core tasks of data wrangling are transformation and profiling, and
the general workflow involves quick iterations (on the order of
seconds) between these tasks.

• Our intent in this section is to provide a basic description of the various types of transformation and profiling.

• Let’s begin our discussion by exploring the transformation tasks involved in data wrangling.
• The core types of transformation are structuring, enriching, and cleansing.

• Structuring primarily involves moving record field values around, and in some
cases summarizing those values.

• Structuring might be as simple as changing the order of fields within a record.

• More complex transformations that restructure each record independently include breaking record fields into smaller components or combining fields into complex structures.

• At the interrecord level, some structuring transformations remove subsets of records.
• Finally, the most complex interrecord structuring transformations
involve aggregations and pivots of the data.

• Aggregations enable a shift in the granularity of the dataset (e.g., moving from individual customers to segments of customers, or from individual sales transactions to monthly or quarterly net revenue calculations).

• Pivoting involves shifting records into fields or shifting fields into records.
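• A sketch of both operations in pandas, assuming a hypothetical sales table with transaction_date, customer_segment, and amount fields:

```python
import pandas as pd

sales = pd.read_csv("sales.csv", parse_dates=["transaction_date"])

# Aggregation: shift granularity from individual transactions to monthly revenue.
monthly_revenue = sales.groupby(
    sales["transaction_date"].dt.to_period("M")
)["amount"].sum()

# Pivot: shift records into fields, e.g., one column per customer segment and
# one row per month, with summed amounts in the cells.
segment_by_month = sales.pivot_table(
    index=sales["transaction_date"].dt.to_period("M"),
    columns="customer_segment",
    values="amount",
    aggfunc="sum",
)
```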

• The quintessential structuring transformations are joins and unions.
• Joins blend multiple datasets together by matching up records from two different datasets and concatenating them “horizontally” into a wider table that includes attributes from both sides of the match.
• Unions combine datasets by matching up fields and concatenating their records “vertically” into a longer table.
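• A sketch of both in pandas, assuming hypothetical customers and orders tables plus a second orders extract with the same fields:

```python
import pandas as pd

customers = pd.read_csv("customers.csv")
orders_jan = pd.read_csv("orders_january.csv")
orders_feb = pd.read_csv("orders_february.csv")

# Join: link records on a shared key and widen the table with attributes
# from both sides of the match.
orders_with_customers = orders_jan.merge(customers, on="customer_id", how="left")

# Union: stack datasets with matching fields vertically into a longer table.
all_orders = pd.concat([orders_jan, orders_feb], ignore_index=True)
```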

• Beyond joins and unions, a common class of enriching transformations inserts metadata into your dataset.

• The inserted metadata might be dataset independent (e.g., the current time or the username of the person transforming the data) or specific to the dataset (e.g., filenames or locations of each record within the dataset).
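• A sketch of metadata insertion, assuming the data arrives as a set of hypothetical CSV extracts; both dataset-independent metadata (load time, username) and dataset-specific metadata (source filename) are added as new fields:

```python
import getpass
import glob
from datetime import datetime, timezone

import pandas as pd

frames = []
for path in glob.glob("extracts/*.csv"):
    df = pd.read_csv(path)
    df["source_file"] = path                      # dataset-specific metadata
    df["loaded_at"] = datetime.now(timezone.utc)  # dataset-independent metadata
    df["loaded_by"] = getpass.getuser()
    frames.append(df)

enriched = pd.concat(frames, ignore_index=True)
```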
• Yet another class of enriching transformations involves the
computation of new data values from the existing data.

• In broad strokes, these kinds of transformations derive either generic metadata (e.g., time conversions, or geo-based calculations like latitude-longitude coordinates from a street address, or a sentiment score inferred from a customer support chat log) or custom metadata (e.g., mineral deposit volumes inferred from rock samples, or health outcomes inferred from treatment records).
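• As a sketch of deriving generic metadata, assume a hypothetical events table whose timestamps are recorded as local-time strings; a time conversion normalizes them to UTC and exposes a derived hour-of-day field:

```python
import pandas as pd

events = pd.read_csv("events.csv")

# Time conversion: parse local-time strings, attach a timezone, normalize to UTC.
events["event_time_utc"] = (
    pd.to_datetime(events["event_time"])
    .dt.tz_localize("America/New_York")
    .dt.tz_convert("UTC")
)

# A simple derived field computed from the converted value.
events["event_hour"] = events["event_time_utc"].dt.hour
```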
• The third type of transformation cleans a dataset by fixing quality and
consistency issues.

• Cleaning predominantly involves manipulating individual field values within records.

• The most common variant fixes missing (or NULL) values.
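• A sketch of the common missing-value fixes in pandas, with hypothetical field names:

```python
import pandas as pd

df = pd.read_csv("customers.csv")

# Inspect where the NULLs are before deciding how to fix them.
print(df.isna().sum())

# Fill missing values with a sensible default or a summary statistic...
df["country"] = df["country"].fillna("unknown")
df["monthly_spend"] = df["monthly_spend"].fillna(df["monthly_spend"].median())

# ...or drop records missing a value the analysis cannot do without.
df = df.dropna(subset=["customer_id"])
```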

• Switching gears, the core types of profiling are distinguishable by the unit of data they operate on: individual values or sets of values.
• Profiling on individual field values involves two kinds of constraints:
syntactic and semantic.

• Syntactic constraints focus on formatting; for example, a date value should be in MM-DD-YYYY format.

• Semantic constraints are rooted in context or proprietary business logic; for example, your company is closed for business on New Year’s Day, so no transactions should exist on January 1 of any year.
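• Both kinds of constraints can be expressed as simple value-level checks; a sketch with hypothetical field names:

```python
import pandas as pd

transactions = pd.read_csv("transactions.csv")

# Syntactic constraint: every date string should match MM-DD-YYYY.
bad_format = ~transactions["transaction_date"].astype(str).str.match(r"\d{2}-\d{2}-\d{4}$")
print(f"{bad_format.sum()} values violate the MM-DD-YYYY format")

# Semantic constraint: the business is closed on New Year's Day, so no
# transaction should fall on January 1 of any year.
dates = pd.to_datetime(
    transactions["transaction_date"], format="%m-%d-%Y", errors="coerce"
)
on_new_years = (dates.dt.month == 1) & (dates.dt.day == 1)
print(f"{on_new_years.sum()} transactions fall on January 1")
```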
• Set-based profiling focuses on the shape and extent of the
distribution of values found within a single record field, or in the
range of relationships between multiple record fields.

• For example, you might expect retail sales to be higher in holiday months than in non-holiday months; thus, you could construct a set-based profile to confirm that sales are distributed across months as expected.
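• A sketch of such a set-based profile, assuming a hypothetical sales table:

```python
import pandas as pd

sales = pd.read_csv("sales.csv", parse_dates=["transaction_date"])

# Distribution of total sales by calendar month across the whole dataset.
by_month = sales.groupby(sales["transaction_date"].dt.month)["amount"].sum()

# Set-based check: holiday months (say, November and December) should account
# for a noticeably larger share of sales than the other months.
holiday_share = by_month.loc[[11, 12]].sum() / by_month.sum()
print(f"Holiday-month share of sales: {holiday_share:.1%}")
```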
Data Wrangling in the Workflow Framework
Ingesting Data
• Ingesting data into the raw data stage can involve some amount of data wrangling.
• Loading the data into the raw data stage location might require some
nontrivial transformation of the data to ensure that it conforms to
basic structural requirements (e.g., records and field values encoded
in particular formats).
• The extent of the constraints to load the data will vary by the kind of
infrastructure of your raw data stage.
• Older data warehouses will likely require particular file formats and
value encodings, whereas more modern infrastructures like MongoDB
or HDFS will permit a wider variety of structures on the ingested data
(involving less data wrangling at this stage).

• In either event, the explicit goal when loading raw data is to perform
the minimal amount of transformations to the data to make it
available for metadata analysis and eventual refinement.
• The general objectives are “don’t lose any data” and “fixing quality
problems comes next.”
• Satisfying these objectives will require limited structuring
transformations and enough profiling to ensure that data was not lost
or corrupted in the ingestion process.
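• A minimal sketch of that kind of ingestion check, assuming a hypothetical CSV extract with simple one-line-per-record formatting: compare the record count in the raw file against what was loaded, and look for fields the load mangled.

```python
import pandas as pd

source_path = "raw/transactions_2023.csv"

# Count the records in the raw file (minus the header line).
with open(source_path) as f:
    source_records = sum(1 for _ in f) - 1

loaded = pd.read_csv(source_path)

# Profiling after ingestion: did every record survive, and did the key fields
# parse into the expected types without introducing unexpected NULLs?
assert len(loaded) == source_records, "records were lost or split during ingestion"
print(loaded.dtypes)
print(loaded.isna().sum())
```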
Describing Data
• Assessing the structure, granularity, accuracy, temporality, and scope
of your data is a profiling-heavy activity.

• The range of profiling views required to build a broad understanding of your data will, in turn, require an exploratory range of transformations.
• Most of the exploratory transformations will involve structuring:
• breaking out subcomponents of existing values to assess their quality and consistency,
• filtering the dataset down to subsets of records to assess scope and accuracy, and
• aggregating and pivoting the data to triangulate values against other internal and external references, and so on.
Assessing Data Utility
• Assessing the custom metadata of a dataset primarily involves
enriching and cleaning transformations.
• In particular, if the dataset is a new installment to prior datasets, you
will need to assess the ability to union the data.
• Additionally, you will likely want to join the new dataset to existing
ones.
• Attempting this join will likely reveal issues with linking records between the datasets: perhaps too few links are found, or, equally problematic, there are too many duplicative links.
• In either case, by treating your existing data as a baseline standard to which the new dataset must adhere or align, you are likely to spend a good amount of time cleaning and altering values in the new dataset to tune its overlap with existing data.

• As the new data is blended in with the old, set-based profiling will provide the basic feedback on the quality of the blend.
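• A sketch of that feedback, assuming hypothetical existing and new_batch tables that should link one-to-one on customer_id:

```python
import pandas as pd

existing = pd.read_csv("customers_existing.csv")
new_batch = pd.read_csv("customers_new.csv")

# Attempt the join, keeping an indicator of which side each record came from.
linked = existing.merge(new_batch, on="customer_id", how="outer", indicator=True)

# Too few links: many records appear on only one side of the match.
print(linked["_merge"].value_counts())

# Too many duplicative links: the same key matches multiple records.
dupes = new_batch["customer_id"].duplicated().sum()
print(f"{dupes} duplicated keys in the new dataset")
```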
