Chapter 5
Transformation: Structuring
Overview
• Structuring as a transformation action involves changing your
dataset’s structure or granularity.
• You can shift the granularity of the dataset, and the fields associated with
records, through aggregations and pivots.
Intrarecord Structuring: Extracting Values
• Extraction involves creating a new record field from an existing record
field.
• Example: the FEC bulk data (https://www.fec.gov/data/browse-data/?tab=bulk-data).
• In the Individual Contributions dataset, column 14 contains the transaction
date for each campaign contribution.
“From source column 14, extract the characters located from position 3 to position 4.”
• A more complex version of positional substring extraction plucks a
substring from a record field when the starting and ending positions
of the substring differ from record to record, as sketched below.
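A minimal sketch of both variants in pandas, assuming the dates are stored as digit strings and using hypothetical sample values ("position 3 to position 4" is 1-indexed, so it becomes the zero-based slice [2:4]):

    import pandas as pd

    # Hypothetical sample of column 14 (transaction date) values.
    df = pd.DataFrame({"TRANSACTION_DT": ["01152024", "03082024", "11302023"]})

    # Fixed-position extraction: characters at positions 3-4 (1-indexed).
    df["pos_3_4"] = df["TRANSACTION_DT"].str[2:4]

    # Variable-position extraction: when the substring's location differs
    # per record, match a pattern instead of a position. The "ID-<digits>"
    # rule here is purely illustrative.
    s = pd.Series(["order ID-42 shipped", "ID-7", "ref: ID-1234 (rush)"])
    ids = s.str.extract(r"ID-(\d+)", expand=False)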
• In each key-value pair, the key names a property and the value describes
that property.
• However, it’s more likely that some shopping carts only contain a subset of
possible properties.
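A minimal sketch of extracting values from such key-value records, assuming hypothetical shopping-cart dictionaries in plain Python:

    # Hypothetical cart records; not every record carries every property.
    carts = [
        {"cart_id": 1, "coupon": "SAVE10", "gift_wrap": True},
        {"cart_id": 2, "coupon": "FREESHIP"},
        {"cart_id": 3},
    ]

    # dict.get() returns a default instead of raising when a key is
    # absent, so missing properties become explicit placeholder values.
    for cart in carts:
        coupon = cart.get("coupon")          # None if missing
        wrap = cart.get("gift_wrap", False)  # False if missing
        print(cart["cart_id"], coupon, wrap)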
• When you are wrangling data, you might need to create a single field
that merges values from multiple related fields.
• We want to combine the data from these two columns into a single
column, separating the city and state with a comma.
• The desired output is a single column of values in the form "City, ST".
• Combining the data from these two fields can be useful if your
downstream analysis wants to treat this data as part of a single
record field, as sketched below.
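A minimal sketch of the merge, assuming hypothetical city and state columns:

    import pandas as pd

    df = pd.DataFrame({"city": ["Houston", "Portland"],
                       "state": ["TX", "OR"]})

    # Concatenate the two fields, separated by a comma.
    df["city_state"] = df["city"] + ", " + df["state"]
    # -> "Houston, TX", "Portland, OR"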
Interrecord Structuring: Filtering Records and Fields
• Filtering involves removing records or fields from a dataset.
• Based on the FEC data dictionary, this field contains seven distinct
values: CAN, CCM, COM, IND, ORG, PAC, and PTY.
• Based on this column, we could say that the granularity of the dataset
is fairly coarse.
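A minimal sketch of both kinds of filtering, assuming a hypothetical ENTITY_TP column holding the codes above:

    import pandas as pd

    df = pd.DataFrame({
        "ENTITY_TP": ["IND", "PAC", "IND", "ORG"],
        "TRANSACTION_AMT": [250, 5000, 100, 1200],
        "MEMO_TEXT": ["", "", "refund", ""],
    })

    # Filter records: keep only individual contributions.
    individuals = df[df["ENTITY_TP"] == "IND"]

    # Filter fields: drop a column the downstream analysis doesn't need.
    trimmed = individuals.drop(columns=["MEMO_TEXT"])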
• Aggregations and pivots are structuring operations that enable a shift in the
granularity of a dataset.
• E.g., you might start with a dataset of sales transactions and want total sales
amounts by week, by store, or by region, as sketched below.
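A minimal sketch of such an aggregation, assuming hypothetical store, date, and amount fields:

    import pandas as pd

    sales = pd.DataFrame({
        "store": ["A", "A", "B", "B"],
        "date": pd.to_datetime(["2024-01-01", "2024-01-03",
                                "2024-01-02", "2024-01-09"]),
        "amount": [120.0, 80.0, 200.0, 50.0],
    })

    # Coarsen granularity from one record per transaction to
    # one record per (store, week) by summing the amounts.
    weekly = (
        sales.groupby(["store", pd.Grouper(key="date", freq="W")])["amount"]
             .sum()
             .reset_index()
    )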
• A more complex pivot might involve extracting the items purchased out of
the transaction records and building a dataset in which each record
corresponds to an item.
• Consider a dataset composed of individual sales transactions, where
each transaction record contains a field listing the products that were
purchased.
• You can pivot this dataset such that each product becomes a record
with fields describing the product and an aggregated count field
indicating the number of transactions involving this product.
• Alternatively, you could pivot the same dataset to count the number
of transactions per product where the product was purchased alone,
with one additional product, with two additional products, and so on.
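A minimal sketch of both pivots, assuming each hypothetical transaction record carries a list-valued products field:

    import pandas as pd

    tx = pd.DataFrame({
        "tx_id": [1, 2, 3],
        "products": [["apple", "bread"], ["apple"],
                     ["bread", "milk", "apple"]],
    })

    # One record per product, with the number of transactions it appears in.
    per_product = (
        tx.explode("products")
          .groupby("products")["tx_id"]
          .nunique()
          .reset_index(name="n_transactions")
    )

    # Alternative pivot: transactions per product broken down by basket
    # size (bought alone, with one other product, and so on).
    tx["basket_size"] = tx["products"].str.len()
    by_size = (
        tx.explode("products")
          .groupby(["products", "basket_size"])
          .size()
          .reset_index(name="n_transactions")
    )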
Simple Aggregations
• In a simple aggregation, each input record maps to one and only one output
record.
• Each output record, however, can map back to many input records; for
example, every transaction in a region contributes to that region's total.
• A column-to-row pivot is particularly useful when your source data contains
multiple columns that represent the same type of data.
• For example, you may have a transactions file that contains the total
sales numbers per region, per year.
• The data could be laid out with one row per region and one column per
year, as sketched below:
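A minimal sketch of that layout and its column-to-row pivot (melt), with hypothetical regions and years:

    import pandas as pd

    # One column per year: the same type of data spread across columns.
    wide = pd.DataFrame({
        "region": ["North", "South"],
        "2022": [410, 380],
        "2023": [455, 402],
    })

    # Melt the year columns into a single (year, total_sales) pair
    # per record.
    long = wide.melt(id_vars="region", var_name="year",
                     value_name="total_sales")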
• Output record fields might involve simple aggregations (e.g., sum or max) or
more complex expansions based on the field values.
• In this case, we want to create one new column for each contribution type.
• A subset of the Individual Contributions dataset, with one record per
contribution and a field identifying the contribution type, can be pivoted
this way, as sketched below.
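A minimal sketch of the row-to-column pivot, assuming hypothetical TRANSACTION_TP (contribution type) and TRANSACTION_AMT fields:

    import pandas as pd

    contrib = pd.DataFrame({
        "CMTE_ID": ["C1", "C1", "C2", "C2"],
        "TRANSACTION_TP": ["15", "22Y", "15", "15"],
        "TRANSACTION_AMT": [100, 50, 250, 75],
    })

    # One new column per contribution type, holding the summed amount
    # per committee and type.
    by_type = contrib.pivot_table(
        index="CMTE_ID",
        columns="TRANSACTION_TP",
        values="TRANSACTION_AMT",
        aggfunc="sum",
        fill_value=0,
    )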