Chapter 3

Data Wrangling Dynamics


• Data wrangling is a generic phrase capturing the range of tasks
involved in preparing your data for analysis.

• Data wrangling begins with accessing your data.

• Sometimes, access is gated on getting appropriate permission and making the corresponding changes in your data infrastructure.

• Access also involves manipulating the locations and relationships between datasets.
• This kind of data wrangling involves everything from moving datasets around a folder hierarchy, to replicating datasets across warehouses for easier access, to analyzing differences between similar datasets and assessing overlaps and conflicts.
• After you have successfully accessed your data, the bulk of your data
wrangling work will involve transforming the data itself—
manipulating the structure, granularity, accuracy, temporality, and
scope of your data to better align with your analysis goals.

• All of these transformations are best performed with tools that provide meaningful feedback (so that the person performing the manipulations is assured they were successful).
• We refer to this feedback as profiling. In many cases, a predefined
(and, hence, somewhat generic) set of profiling feedback is sufficient
to determine whether an applied transformation was successful.

• In other cases, customized profiling is required to make this determination. In either event, the bulk of data wrangling involves frequent iterations between transforming and profiling your data.
• A final set of data wrangling tasks can be understood as publishing.
• Publishing is best understood from the perspective of what is published.
• In some cases, what is published is a transformed version of the input datasets (e.g., in the design and creation of “refined” datasets).
• In other cases, the published entity is the transformation logic itself (e.g., as a script that generates the range of statistics and insights in a regular report).
• A final kind of publishing involves creating profiling metadata about
the dataset.
• These profiling reports are critical for managing automated data
services and products.
• In addition to iterations between transforming and profiling the data,
there are less frequent iterations that return to accessing data.

• Soon after publishing a result, you might realize that the output is not
exactly correct and you need to apply additional transformations or
expose some additional profiling results.
Additional Aspects: Subsetting and Sampling
• consider the case in which your dataset contains a heterogeneous set
of records, differing either in structure (e.g., some records contain
more or different fields from the rest) or in granularity (e.g., some
records correspond to customers, whereas others correspond to
accounts).

• The best wrangling approach is to split up the original dataset and wrangle each subset separately; then, if necessary, merge the results together again.
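• A minimal sketch of this split/wrangle/merge pattern in pandas, assuming a hypothetical extract whose records carry a record_type field distinguishing customers from accounts:

```python
import pandas as pd

# Hypothetical mixed extract: some rows describe customers, others describe
# accounts, and the two record types do not share the same fields.
raw = pd.read_csv("mixed_records.csv")

# Split the dataset by record type so each subset can be wrangled on its own.
customers = raw[raw["record_type"] == "customer"].dropna(axis=1, how="all")
accounts = raw[raw["record_type"] == "account"].dropna(axis=1, how="all")

# ... apply customer-specific and account-specific transformations here ...

# If the analysis needs one table again, merge the results back together,
# e.g., by linking each account to the customer that owns it.
merged = accounts.merge(customers, on="customer_id", suffixes=("_acct", "_cust"))
```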
• Now consider the case in which your dataset is too large to manually
review each record or when your dataset is so large that even simple
transformations require prohibitively long timeframes to complete.

• In this case, the iterative process of transforming and profiling your data
is materially hampered by the time required to compute and execute
transformations.

• Suppose that you make a small change to a derived calculation or you change a rule to group a few customer segments into a wider segment. Now you apply this transformation and must wait a minute, or 10 minutes, or half a day to see what the results might look like.
• Understandably, data wrangling work will dominate your analysis
workflows and you won’t get through many analyses.
• The critical approach to speeding up your data wrangling is to work
with some samples of the entire dataset that you can transform and
profile at interactive time scales (ideally within 100 milliseconds, but
occasionally up to a few seconds).
• Unfortunately, working with samples to speed up your data wrangling
is not as straightforward as it sounds.
• To understand the importance of sampling, consider a simple
transformation involving the calculation to determine the length of
time each of your customers has been using your product or service
based on the date each one registered for it.

• Chances are, you have some sense of what these ages should be: you
started your business 11 years ago, so no customer should show a
duration of more than 11 years.

• You had a big increase in customers about 3 years ago, so you’d expect to see a corresponding bump in the overall age distribution around the 3-year mark. And so on.
• These expectations point to a couple of sampling techniques that
would be useful for assessing whether your transformation to
calculate customer age is working correctly.

• Namely, you want a sample that contains extreme values (customer records with the earliest and latest registration dates) and that randomly samples over the rest of the records (so that overall distributional trends are visible).
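• As a sketch, assuming a hypothetical customers table with a registration_date column, such a sample could keep the records with the earliest and latest registrations and draw a random sample over the rest:

```python
import pandas as pd

customers = pd.read_csv("customers.csv", parse_dates=["registration_date"])

# The derived field under test: how long each customer has been registered.
customers["tenure_years"] = (
    pd.Timestamp.today() - customers["registration_date"]
).dt.days / 365.25

# Keep the extreme values (earliest and latest registrations)...
extremes = customers.nsmallest(5, "registration_date").index.union(
    customers.nlargest(5, "registration_date").index
)
# ...and add a random sample over the remaining records.
rest = customers.drop(index=extremes).sample(n=1000, random_state=0)
sample = pd.concat([customers.loc[extremes], rest])

# No tenure should exceed the age of the business (11 years in the example),
# and the distribution should show the bump of customers from ~3 years ago.
assert sample["tenure_years"].max() <= 11
print(sample["tenure_years"].describe())
```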
• Consider a more complex situation that involves case-based
transformations based on record groups; for example, you need to
convert transaction amounts to US dollars, and your dataset contains
transactions in Euros, GB Pounds, and so on.
• Each reporting currency requires its own transformation.
• To assess that all of the currency transformations were applied
correctly, you need to profile results covering all of the currencies
occurring in your dataset.
• Samples that cover all groups, or “strata,” are often referred to as
stratified samples; they provide representation of each group, even
though they might bias the overall trends of the full dataset (by over
representing small groups relative to large ones, for example).

• There are numerous techniques for extracting different kinds of samples from large datasets (e.g., see Synopses for Massive Data: Samples, Histograms, Wavelets, Sketches by Cormode et al.), and some software packages and databases implement these methods for you.
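• As an illustration, a stratified sample in pandas might draw a fixed number of rows per currency (assuming a hypothetical transactions table with a currency column), so every currency appears in the profile even if it is rare overall:

```python
import pandas as pd

transactions = pd.read_csv("transactions.csv")

# Stratified sample: up to 100 rows from each currency "stratum".
stratified = transactions.groupby("currency", group_keys=False).apply(
    lambda grp: grp.sample(n=min(len(grp), 100), random_state=0)
)

# Every currency is represented, which is what profiling the currency
# conversions requires, even though rare currencies are now over-represented.
print(stratified["currency"].value_counts())
```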
• With an understanding of the basic steps in data wrangling (access, transformation, profiling, and publishing), and of how these steps can incorporate sampling to handle big datasets and split/fix/merge strategies for heterogeneous datasets, we turn our attention now to the core types of transformation and profiling.
Core Transformation and Profiling Actions
• The core tasks of data wrangling are transformation and profiling, and
the general workflow involves quick iterations (on the order of
seconds) between these tasks.

• Our intent in this section is to provide a basic description of the various types of transformation and profiling.

• Let’s begin our discussion by exploring the transformation tasks involved in data wrangling.
• The core types of transformation are structuring, enriching, and cleansing.

• Structuring primarily involves moving record field values around, and in some
cases summarizing those values.

• Structuring might be as simple as changing the order of fields within a record.

• More complex transformations that restructure each record independently include breaking record fields into smaller components or combining fields into complex structures.

• At the interrecord level, some structuring transformations remove subsets of records.
• Finally, the most complex interrecord structuring transformations
involve aggregations and pivots of the data.

• Aggregations enable a shift in the granularity of the dataset (e.g., moving from individual customers to segments of customers, or from individual sales transactions to monthly or quarterly net revenue calculations).

• Pivoting involves shifting records into fields or shifting fields into records.
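• A sketch of both operations in pandas, assuming a hypothetical sales table with transaction_date, customer_segment, and amount fields:

```python
import pandas as pd

sales = pd.read_csv("sales.csv", parse_dates=["transaction_date"])

# Aggregation: shift granularity from individual transactions to monthly revenue.
monthly_revenue = sales.groupby(
    sales["transaction_date"].dt.to_period("M")
)["amount"].sum()

# Pivot: shift records into fields, e.g., one column per customer segment and
# one row per month, with summed amounts in the cells.
segment_by_month = sales.pivot_table(
    index=sales["transaction_date"].dt.to_period("M"),
    columns="customer_segment",
    values="amount",
    aggfunc="sum",
)
```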

• The quintessential structuring transformations are joins and unions.
• Joins blend multiple datasets together by matching up records from two different datasets and concatenating them “horizontally” into a wider table that includes attributes from both sides of the match.
• Unions combine datasets by matching up fields and concatenating their records “vertically” into a longer table.
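• A sketch of both in pandas, assuming hypothetical customers and orders tables plus a second orders extract with the same fields:

```python
import pandas as pd

customers = pd.read_csv("customers.csv")
orders_jan = pd.read_csv("orders_january.csv")
orders_feb = pd.read_csv("orders_february.csv")

# Join: link records on a shared key and widen the table with attributes
# from both sides of the match.
orders_with_customers = orders_jan.merge(customers, on="customer_id", how="left")

# Union: stack datasets with matching fields vertically into a longer table.
all_orders = pd.concat([orders_jan, orders_feb], ignore_index=True)
```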

• Beyond joins and unions, a common class of enriching transformations inserts metadata into your dataset.

• The inserted metadata might be dataset independent (e.g., the current time or the username of the person transforming the data) or specific to the dataset (e.g., filenames or locations of each record within the dataset).
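• A sketch of metadata insertion, assuming the data arrives as a set of hypothetical CSV extracts; both dataset-independent metadata (load time, username) and dataset-specific metadata (source filename) are added as new fields:

```python
import getpass
import glob
from datetime import datetime, timezone

import pandas as pd

frames = []
for path in glob.glob("extracts/*.csv"):
    df = pd.read_csv(path)
    df["source_file"] = path                      # dataset-specific metadata
    df["loaded_at"] = datetime.now(timezone.utc)  # dataset-independent metadata
    df["loaded_by"] = getpass.getuser()
    frames.append(df)

enriched = pd.concat(frames, ignore_index=True)
```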
• Yet another class of enriching transformations involves the
computation of new data values from the existing data.

• In broad strokes, these kinds of transformations derive either generic metadata (e.g., time conversions, or geo-based calculations like latitude-longitude coordinates from a street address, or a sentiment score inferred from a customer support chat log) or custom metadata (e.g., mineral deposit volumes inferred from rock samples, or health outcomes inferred from treatment records).
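• As a sketch of deriving generic metadata, assume a hypothetical events table whose timestamps are recorded as local-time strings; a time conversion normalizes them to UTC and exposes a derived hour-of-day field:

```python
import pandas as pd

events = pd.read_csv("events.csv")

# Time conversion: parse local-time strings, attach a timezone, normalize to UTC.
events["event_time_utc"] = (
    pd.to_datetime(events["event_time"])
    .dt.tz_localize("America/New_York")
    .dt.tz_convert("UTC")
)

# A simple derived field computed from the converted value.
events["event_hour"] = events["event_time_utc"].dt.hour
```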
• The third type of transformation cleans a dataset by fixing quality and
consistency issues.

• Cleaning predominantly involves manipulating individual field values within records.

• The most common variant fixes missing (or NULL) values.
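• A sketch of the common missing-value fixes in pandas, with hypothetical field names:

```python
import pandas as pd

df = pd.read_csv("customers.csv")

# Inspect where the NULLs are before deciding how to fix them.
print(df.isna().sum())

# Fill missing values with a sensible default or a summary statistic...
df["country"] = df["country"].fillna("unknown")
df["monthly_spend"] = df["monthly_spend"].fillna(df["monthly_spend"].median())

# ...or drop records missing a value the analysis cannot do without.
df = df.dropna(subset=["customer_id"])
```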

• Switching gears, the core types of profiling are distinguishable by the unit of data they operate on: individual values or sets of values.
• Profiling on individual field values involves two kinds of constraints:
syntactic and semantic.

• Syntactic constraints focus on formatting; for example, a date value should be in MM-DD-YYYY format.

• Semantic constraints are rooted in context or proprietary business logic; for example, your company is closed for business on New Year’s Day, so no transactions should exist on January 1 of any year.
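• Both kinds of constraints can be expressed as simple value-level checks; a sketch with hypothetical field names:

```python
import pandas as pd

transactions = pd.read_csv("transactions.csv")

# Syntactic constraint: every date string should match MM-DD-YYYY.
bad_format = ~transactions["transaction_date"].astype(str).str.match(r"\d{2}-\d{2}-\d{4}$")
print(f"{bad_format.sum()} values violate the MM-DD-YYYY format")

# Semantic constraint: the business is closed on New Year's Day, so no
# transaction should fall on January 1 of any year.
dates = pd.to_datetime(
    transactions["transaction_date"], format="%m-%d-%Y", errors="coerce"
)
on_new_years = (dates.dt.month == 1) & (dates.dt.day == 1)
print(f"{on_new_years.sum()} transactions fall on January 1")
```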
• Set-based profiling focuses on the shape and extent of the
distribution of values found within a single record field, or in the
range of relationships between multiple record fields.

• For example, you might expect retail sales to be higher in holiday months than in non-holiday months; thus, you could construct a set-based profile to confirm that sales are distributed across months as expected.
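• A sketch of such a set-based profile, assuming a hypothetical sales table:

```python
import pandas as pd

sales = pd.read_csv("sales.csv", parse_dates=["transaction_date"])

# Distribution of total sales by calendar month across the whole dataset.
by_month = sales.groupby(sales["transaction_date"].dt.month)["amount"].sum()

# Set-based check: holiday months (say, November and December) should account
# for a noticeably larger share of sales than the other months.
holiday_share = by_month.loc[[11, 12]].sum() / by_month.sum()
print(f"Holiday-month share of sales: {holiday_share:.1%}")
```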
Data Wrangling in the Workflow Framework
Ingesting Data
• Ingesting data into the raw data stage can involve some amount of data wrangling.
• Loading the data into the raw data stage location might require some
nontrivial transformation of the data to ensure that it conforms to
basic structural requirements (e.g., records and field values encoded
in particular formats).
• The extent of the constraints to load the data will vary by the kind of
infrastructure of your raw data stage.
• Older data warehouses will likely require particular file formats and
value encodings, whereas more modern infrastructures like MongoDB
or HDFS will permit a wider variety of structures on the ingested data
(involving less data wrangling at this stage).

• In either event, the explicit goal when loading raw data is to perform
the minimal amount of transformations to the data to make it
available for metadata analysis and eventual refinement.
• The general objectives are “don’t lose any data” and “fixing quality
problems comes next.”
• Satisfying these objectives will require limited structuring
transformations and enough profiling to ensure that data was not lost
or corrupted in the ingestion process.
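• A minimal sketch of that kind of ingestion check, assuming a hypothetical CSV extract with simple one-line-per-record formatting: compare the record count in the raw file against what was loaded, and look for fields the load mangled.

```python
import pandas as pd

source_path = "raw/transactions_2023.csv"

# Count the records in the raw file (minus the header line).
with open(source_path) as f:
    source_records = sum(1 for _ in f) - 1

loaded = pd.read_csv(source_path)

# Profiling after ingestion: did every record survive, and did the key fields
# parse into the expected types without introducing unexpected NULLs?
assert len(loaded) == source_records, "records were lost or split during ingestion"
print(loaded.dtypes)
print(loaded.isna().sum())
```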
Describing Data
• Assessing the structure, granularity, accuracy, temporality, and scope
of your data is a profiling-heavy activity.

• The range of profiling views required to build a broad understanding of your data will, in turn, require an exploratory range of transformations.
• Most of the exploratory transformations will involve structuring:
• breaking out subcomponents of existing values to assess their quality and consistency,
• filtering the dataset down to subsets of records to assess scope and accuracy, and
• aggregating and pivoting the data to triangulate values against other internal and external references, and so on.
Assessing Data Utility
• Assessing the custom metadata of a dataset primarily involves
enriching and cleaning transformations.
• In particular, if the dataset is a new installment to prior datasets, you
will need to assess the ability to union the data.
• Additionally, you will likely want to join the new dataset to existing
ones.
• Attempting this join will likely reveal issues with linking records between the datasets: perhaps too few links are found, or, equally problematic, there are too many duplicative links.
• In either case, by treating your existing data as a baseline standard to which the new dataset must adhere or align, you are likely to spend a good amount of time cleaning and altering values in the new dataset to tune its overlap with existing data.

• As the new data is blended in with the old, set-based profiling will provide the basic feedback on the quality of the blend.
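• A sketch of that feedback, assuming hypothetical existing and new_batch tables that should link one-to-one on customer_id:

```python
import pandas as pd

existing = pd.read_csv("customers_existing.csv")
new_batch = pd.read_csv("customers_new.csv")

# Attempt the join, keeping an indicator of which side each record came from.
linked = existing.merge(new_batch, on="customer_id", how="outer", indicator=True)

# Too few links: many records appear on only one side of the match.
print(linked["_merge"].value_counts())

# Too many duplicative links: the same key matches multiple records.
dupes = new_batch["customer_id"].duplicated().sum()
print(f"{dupes} duplicated keys in the new dataset")
```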
