
Chapter 5.

Transformation: Structuring

Overview
• Structuring as a transformation action involves changing your dataset's
structure or granularity.

• Structuring consists of any actions that change the form or schema of your
data.

• There are two sets of structuring actions. The first group of structuring
transformations involves manipulating individual records and fields:
intrarecord structuring.

• The second group of structuring transformations involves operating on
multiple records and fields at once: interrecord structuring.
Intrarecord structuring
• Reordering record fields (moving columns)

• Creating new record fields through extracting values

• Combining multiple record fields into a single record field


Interrecord structuring
• Filtering datasets by removing sets of records.

• Shifting the granularity of the dataset and the fields associated with
records through aggregations and pivots.
Intrarecord Structuring: Extracting Values
• Extraction involves creating a new record field from an existing record
field.

• Frequently, this involves identifying a substring in an existing column and
placing that substring into a new column.
Positional Extraction
• Positional extraction works by specifying the starting position and ending
position that correspond to the substring you want to extract from a set of
record fields.

• https://www.fec.gov/data/browse-data/?tab=bulk-data
• In the Individual Contributions dataset, column 14 contains the transaction
date for each campaign contribution.

• We want to extract the day of the month for each individual contribution
into a new column.

• This will allow us to create a field that we can use to determine whether
individual campaign contributions were more frequent at certain times of
the month.
Positional Extraction

“From source column 14, extract the characters located from position 3 to position 4.”
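A minimal sketch of this extraction in pandas, assuming the FEC's MMDDYYYY
date format; the column name and sample dates are hypothetical:

import pandas as pd

# Hypothetical sample of column 14 (transaction date), formatted MMDDYYYY.
df = pd.DataFrame({"transaction_dt": ["01152020", "02282020", "03012020"]})

# Positions 3 to 4 (1-indexed) are the zero-based slice [2:4]: the day of month.
df["day_of_month"] = df["transaction_dt"].str.slice(2, 4)
print(df)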
• A more complex version of positional substring extraction must pluck a
substring from a record field when the starting and ending positions of the
substring differ from record to record.

• E.g., address fields are a case where complex positional extraction can be
utilized effectively.

• The approach is to use functions that can search for a particular sequence
of characters within the original record field.

• These functions return either the start position of the searched-for
sequence or the length of the sequence.
• In the Individual Contributions dataset, column 8 contains the name of the
person or organization that made each campaign contribution.

• The last name has an inconsistent length: in the first record it is 6
characters long (ARNOLD), while in the third record it is 7 characters long
(ROSSMAN).

• Simple positional extraction would not work in this case because the ending
positions of the last name differ from record to record.

• Complex positional extraction functions can identify the position of the
comma and then extract the appropriate substring, as in the sketch below.
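A minimal sketch in pandas; the full names are hypothetical stand-ins for
column 8 values in the "LAST, FIRST" format:

import pandas as pd

# Hypothetical column 8 values: "LAST, FIRST" with variable-length last names.
df = pd.DataFrame({"name": ["ARNOLD, ROBERT", "BAKER, SUSAN", "ROSSMAN, THOMAS"]})

# Search for the comma's position in each value, then slice up to it
# (assumes every value contains a comma).
df["last_name"] = df["name"].apply(lambda s: s[: s.find(",")])
print(df)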
Pattern Extraction
• This method uses rules to describe the sequence of characters that
you want to extract.

• Column 20 of the Individual Contributions dataset contains free text that
describes each contribution.

• You can often use regular expressions to represent patterns in code.

• Regular expressions are also supported by most data wrangling software
products.
• Consider a sample of free-text data from column 20. A pattern extraction
rule could be stated as: "From column 20, extract the first sequence of
digits, followed by a period, followed by another sequence of digits."
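Expressed as a regular expression, this rule might look like the following
pandas sketch (the memo strings are hypothetical):

import pandas as pd

# Hypothetical free-text samples resembling column 20.
df = pd.DataFrame({"memo": ["REFUND OF 15.50 OVERPAYMENT", "FEE 2.00 APPLIED", "NO AMOUNT"]})

# One or more digits, a literal period, then one or more digits.
df["extracted"] = df["memo"].str.extract(r"(\d+\.\d+)", expand=False)
print(df)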


Complex Structure Extraction
• Sometimes, when you are wrangling data, you need to extract elements from
within complex hierarchical structures.

• Commonly, this type of complex structure extraction is required when
wrangling JSON files or other semistructured formats.

• When wrangling JSON data, you will encounter two types of complex
structures: maps and arrays.

• These structures are common in semistructured data because they allow
datasets to include a variable number of records and fields.
• A JSON array represents an ordered sequence of values.

• Example array: ["Sally","Bob","Alon","Georgia"]

• A JSON map is a set of key-value pairs.

• In each key-value pair, the key represents the name of a property and the
value represents the value that describes that property.

• Example map: {"product":"Trifacta Wrangler","price":"free","category":"wrangling tool"}
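A small sketch of reading both structures with Python's standard json module,
combining the two examples above into one hypothetical record:

import json

# A record containing a map (key-value pairs) and an array (ordered values).
record = '{"product": "Trifacta Wrangler", "price": "free", "names": ["Sally", "Bob", "Alon", "Georgia"]}'
parsed = json.loads(record)

print(parsed["price"])      # map lookup by key -> "free"
print(parsed["names"][0])   # array access by position -> "Sally"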
• In a given dataset, an array in one record might be a different length
from an array in another record.
• E.g., customer orders, where each record represents a unique customer's
shopping cart.
• In the first record an array of orders might include two elements,
whereas in the next record, an array of orders might include three
elements.
• Maps also support variability across records.

• In the shopping cart example, each cart might contain a variety of possible
properties, say, "gift_wrapped", "shipping_address", "billing_address",
"billing_name", and "shipping_name".

• Ideally, every record will contain all of the possible properties.

• However, it’s more likely that some shopping carts only contain a subset of
possible properties.

• Representing the properties and their associated values in a JSON map
allows us to avoid creating a very sparsely populated table.
• The JSON format is ideal for storing data efficiently, but it is not
structured ideally for use in analytics tools.

• These tools commonly expect tabular data as input.

• We therefore need to convert JSON-formatted data into the rectangular
structure needed for downstream analytics, as sketched below.
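A minimal sketch of that conversion in pandas; the cart records and their
fields are hypothetical:

import pandas as pd

# Hypothetical shopping-cart records: maps with varying keys, arrays of varying length.
carts = [
    {"customer": "Sally", "orders": ["book", "pen"], "gift_wrapped": True},
    {"customer": "Bob", "orders": ["lamp", "rug", "mug"], "shipping_address": "12 Elm St"},
]

# json_normalize spreads map keys into columns; absent properties become NaN.
df = pd.json_normalize(carts)

# explode gives each array element its own row, shifting the granularity.
df = df.explode("orders")
print(df)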
Intrarecord Structuring: Combining Multiple
Record Fields
• Combining multiple fields is essentially the reverse of extraction.

• When you are wrangling data, you might need to create a single field
that merges values from multiple related fields.

• E.g., the Individual Contributions dataset contains two related columns:
column 9 (city) and column 10 (state).

• We want to combine the data from these two columns into a single column,
separating the city and state with a comma.

• Our desired output is a single column containing values in the form
"CITY, STATE".

• Combining the data from these two fields can be useful if your downstream
analysis wants to consider this data as part of a single record field.
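A minimal sketch of this combination in pandas; the column names and sample
values are hypothetical:

import pandas as pd

# Hypothetical samples of column 9 (city) and column 10 (state).
df = pd.DataFrame({"city": ["SEATTLE", "PORTLAND"], "state": ["WA", "OR"]})

# Concatenate the two fields, separated by a comma and a space.
df["city_state"] = df["city"] + ", " + df["state"]
print(df)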
Interrecord Structuring: Filtering Records and
Fields
• Filtering involves removing records or fields from a dataset.

• Although filtering is often utilized in cleaning transformations designed
to address dataset quality, you can also use it to alter the granularity of
a dataset by changing the types of records and fields represented in it.
• E.g., the Individual Contributions dataset contains a column that
represents the type of entity that made each donation.

• Based on the FEC data dictionary, this field contains seven distinct
values: CAN, CCM, COM, IND, ORG, PAC, and PTY.

• Based on this column, we could say that the granularity of the dataset is
fairly coarse.

• After all, records can belong to one of seven distinct groups.


• Let’s assume that we are interested in analyzing only campaign
contributions that originated from individuals (represented in the
entity column by “IND”).

• We will need to filter our dataset so that it includes only records that
contain the value "IND" in column 7.

• Performing this operation will produce a dataset with a finer granularity
because each record will now belong to only a single category of values
from the entity type column.

• This type of filtering is called record-based filtering.


• Another type of filtering that is commonly used as a structuring
operation is field-based filtering.

• This type of filtering affects the number of fields, or columns, in your
dataset. A sketch of both filtering types follows.
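A minimal sketch of record-based and field-based filtering in pandas; the
sample values and column names are hypothetical:

import pandas as pd

# Hypothetical subset of the Individual Contributions dataset.
df = pd.DataFrame({
    "entity_tp": ["IND", "PAC", "IND", "ORG"],
    "name": ["ARNOLD, ROBERT", "ACME PAC", "ROSSMAN, THOMAS", "ACME ORG"],
    "amount": [50, 500, 25, 1000],
})

# Record-based filtering: keep only individual contributions.
individuals = df[df["entity_tp"] == "IND"]

# Field-based filtering: keep only the columns needed downstream.
trimmed = individuals[["name", "amount"]]
print(trimmed)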
Interrecord Structuring: Aggregations and
Pivots

• Aggregations and pivots are structuring operations that enable a shift in the
granularity of a dataset.

• E.g., you might start with a dataset of sales transactions and want total
sales amounts by week, by store, or by region.

• This is a fairly straightforward aggregation involving the summation of
record fields.

• A more complex pivot might involve extracting the items purchased out of
the transaction records and building a dataset in which each record
corresponds to an item.
• Consider a dataset composed of individual sales transactions, where
each transaction record contains a field listing the products that were
purchased.

• You can pivot this dataset such that each product becomes a record
with fields describing the product and an aggregated count field
indicating the number of transactions involving this product.

• Alternatively, you could pivot the same dataset to count the number
of transactions per product where the product was purchased alone,
with one additional product, with two additional products, and so on.
Simple Aggregations
• In a simple aggregation, each input record maps to one and only one output
record, whereas each output record combines one or more input records.

• For simple aggregations, the output record fields are simple aggregations
(sum, mean, min, list concatenation, etc.) of the input record fields.
• We can perform a basic aggregation on the Individual Contributions dataset
that produces the following columns:

• One column that contains the average contribution made to each campaign
committee.

• One column that contains the sum of contributions made to each campaign
committee.

• One column that counts the number of contributions made to each campaign
committee.
• We will be performing this basic aggregation on a limited sample of data
from the Individual Contributions dataset.

• Based on the FEC's data dictionary, column 1 identifies the campaign
committee and column 15 contains the contribution amount.

• After aggregating, the output contains one record per committee, as in the
sketch below.
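A minimal sketch of this aggregation in pandas; the committee IDs and amounts
are hypothetical:

import pandas as pd

# Hypothetical sample: column 1 (committee ID) and column 15 (amount).
df = pd.DataFrame({
    "cmte_id": ["C001", "C001", "C002", "C002", "C002"],
    "amount": [100, 300, 50, 50, 200],
})

# One output record per committee: average, sum, and count of contributions.
result = df.groupby("cmte_id")["amount"].agg(["mean", "sum", "count"]).reset_index()
print(result)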
Column-to-Row Pivots
• In column-to-row pivots, each input record maps to multiple output records,
and each output record maps to one and only one input record.

• The output records contain a subset of the input record fields.

• This type of column-to-row pivot is commonly referred to as "unpivoting" or
"denormalizing" data.

• It is particularly useful when your source data contains multiple columns that
represent the same type of data.
• For example, you may have a transactions file that contains the total sales
numbers per region, per year, with one column for each year's sales.

• We want to restructure this dataset so that a single row contains the sales
for a single unique combination of region and year, as in the sketch below.
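A minimal sketch of this unpivot in pandas, with hypothetical regions and
sales figures:

import pandas as pd

# Hypothetical wide-format sales data: one column per year.
df = pd.DataFrame({
    "region": ["North", "South"],
    "sales_2019": [120, 90],
    "sales_2020": [150, 110],
})

# melt unpivots the year columns: one output row per (region, year) pair.
long_df = df.melt(id_vars="region", var_name="year", value_name="sales")
print(long_df)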
Row-to-Column Pivots
• In this type of pivot, output records are sourced from multiple input
records, and input records might support multiple output records.

• Output record fields might involve simple aggregations (e.g., sum or max)
or involve more complex expansions based on the field values.

• This type of pivot is called a row-to-column pivot.

• In the Individual Contributions dataset, we want to create a refined
dataset that shows the sum of contributions made to each campaign
committee, broken out by contribution type.

• In this case, we want to create one new column for each contribution type.
• A subset of the Individual Contributions dataset can illustrate this pivot,
as in the sketch below.
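A minimal sketch of this row-to-column pivot in pandas; the committee IDs,
contribution type codes, and amounts are hypothetical:

import pandas as pd

# Hypothetical subset: committee ID, contribution type, and amount.
df = pd.DataFrame({
    "cmte_id": ["C001", "C001", "C002", "C002"],
    "transaction_tp": ["15", "15E", "15", "24T"],
    "amount": [100, 250, 50, 200],
})

# pivot_table creates one new column per contribution type, summing amounts
# per committee; fill_value=0 covers committees without a given type.
wide = df.pivot_table(index="cmte_id", columns="transaction_tp",
                      values="amount", aggfunc="sum", fill_value=0)
print(wide)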
