
Chapter 5.

Transformation: Structuring

Overview
• Structuring as a transformation action involves changing your dataset's
structure or granularity.

• Structuring consists of any actions that change the form or schema of your
data.

• There are two sets of structuring actions. The first group of structuring
transformations involves manipulating individual records and fields:
intrarecord structuring.

• The second group of structuring transformations involves operating on
multiple records and fields at once: interrecord structuring.
Intrarecord structuring
• Reordering record fields (moving columns)

• Creating new record fields through extracting values

• Combining multiple record fields into a single record field


Interrecord structuring
• Filtering datasets by removing sets of records.

• Shifting the granularity of the dataset and the fields associated with
records through aggregations and pivots.
Intrarecord Structuring: Extracting Values
• Extraction involves creating a new record field from an existing record
field.

• Frequently, this involves identifying a substring in an existing column and
placing that substring into a new column.
Positional Extraction
• Positional extraction works by specifying the starting position and ending
position that correspond to the substring you want to extract from a set of
record fields.

• https://www.fec.gov/data/browse-data/?tab=bulk-data
• In the Individual Contributions dataset, column 14 contains the transaction
date for each campaign contribution.

• We want to extract the day of the month for each individual contribution
into a new column.

• This will allow us to create a field that we can use to determine whether
individual campaign contributions were more frequent at certain times of
the month.
Positional Extraction

“From source column 14, extract the characters located from position 3 to position 4.”
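A minimal sketch of this extraction in pandas, assuming the FEC's MMDDYYYY
date format; the column name and sample dates are hypothetical:

import pandas as pd

# Hypothetical sample of column 14 (transaction date), formatted MMDDYYYY.
df = pd.DataFrame({"transaction_dt": ["01152020", "02282020", "03012020"]})

# Positions 3 to 4 (1-indexed) are the zero-based slice [2:4]: the day of month.
df["day_of_month"] = df["transaction_dt"].str.slice(2, 4)
print(df)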
• A more complex version of positional substring extraction must pluck a
substring from a record field when the starting and ending positions of the
substring differ from record to record.

• E.g., address fields are a case where complex positional extraction can be
utilized effectively.

• The approach is to use functions that can search for a particular sequence
of characters within the original record field.

• These functions return either the start position of the searched-for
sequence or the length of the sequence.
• In the Individual Contributions dataset, column 8 contains the name of the
person or organization that made each campaign contribution.

• The last name has an inconsistent length: in the first record it is 6
characters long (ARNOLD), while in the third record it is 7 characters long
(ROSSMAN).

• Simple positional extraction would not work in this case because the ending
positions of the last name differ from record to record.

• Complex positional extraction functions can identify the position of the
comma and then extract the appropriate substring, as in the sketch below.
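A minimal sketch in pandas; the full names are hypothetical stand-ins for
column 8 values in the "LAST, FIRST" format:

import pandas as pd

# Hypothetical column 8 values: "LAST, FIRST" with variable-length last names.
df = pd.DataFrame({"name": ["ARNOLD, ROBERT", "BAKER, SUSAN", "ROSSMAN, THOMAS"]})

# Search for the comma's position in each value, then slice up to it
# (assumes every value contains a comma).
df["last_name"] = df["name"].apply(lambda s: s[: s.find(",")])
print(df)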
Pattern Extraction
• This method uses rules to describe the sequence of characters that
you want to extract.

• Column 20 of the Individual Contributions dataset contains free text that
describes each contribution.

• You can often use regular expressions to represent patterns in code.

• Regular expressions are also supported by most data wrangling software
products.
• Consider a sample of free-text data from column 20. A pattern extraction
rule could be stated as: "From column 20, extract the first sequence of
digits, followed by a period, followed by another sequence of digits."
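Expressed as a regular expression, this rule might look like the following
pandas sketch (the memo strings are hypothetical):

import pandas as pd

# Hypothetical free-text samples resembling column 20.
df = pd.DataFrame({"memo": ["REFUND OF 15.50 OVERPAYMENT", "FEE 2.00 APPLIED", "NO AMOUNT"]})

# One or more digits, a literal period, then one or more digits.
df["extracted"] = df["memo"].str.extract(r"(\d+\.\d+)", expand=False)
print(df)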


Complex Structure Extraction
• Sometimes, when you are wrangling data, you need to extract elements from
within complex hierarchical structures.

• Commonly, this type of complex structure extraction is required when
wrangling JSON files or other semistructured formats.

• When wrangling JSON data, you will encounter two types of complex
structures: maps and arrays.

• These structures are common in semistructured data because they allow
datasets to include a variable number of records and fields.
• A JSON array represents an ordered sequence of values.

• Example array: ["Sally","Bob","Alon","Georgia"]

• A JSON map is a set of key-value pairs.

• In each key-value pair, the key represents the name of a property and the
value represents the value that describes that property.

• Example map: {"product":"Trifacta Wrangler","price":"free","category":"wrangling tool"}
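A small sketch of reading both structures with Python's standard json module,
combining the two examples above into one hypothetical record:

import json

# A record containing a map (key-value pairs) and an array (ordered values).
record = '{"product": "Trifacta Wrangler", "price": "free", "names": ["Sally", "Bob", "Alon", "Georgia"]}'
parsed = json.loads(record)

print(parsed["price"])      # map lookup by key -> "free"
print(parsed["names"][0])   # array access by position -> "Sally"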
• In a given dataset, an array in one record might be a different length
from an array in another record.
• E.g., customer orders, where each record represents a unique customer's
shopping cart.
• In the first record an array of orders might include two elements,
whereas in the next record, an array of orders might include three
elements.
• Maps also support variability across records.

• In the shopping cart example, each cart might contain a variety of possible
properties, say, "gift_wrapped", "shipping_address", "billing_address",
"billing_name", and "shipping_name".

• Ideally, every record will contain all of the possible properties.

• However, it’s more likely that some shopping carts only contain a subset of
possible properties.

• Representing the properties and their associated values in a JSON map
allows us to avoid creating a very sparsely populated table.
• The JSON format is ideal for storing data efficiently, but it is not
structured ideally for use in analytics tools.

• These tools commonly expect tabular data as input.

• We therefore need to convert JSON-formatted data into the rectangular
structure needed for downstream analytics, as sketched below.
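A minimal sketch of that conversion in pandas; the cart records and their
fields are hypothetical:

import pandas as pd

# Hypothetical shopping-cart records: maps with varying keys, arrays of varying length.
carts = [
    {"customer": "Sally", "orders": ["book", "pen"], "gift_wrapped": True},
    {"customer": "Bob", "orders": ["lamp", "rug", "mug"], "shipping_address": "12 Elm St"},
]

# json_normalize spreads map keys into columns; absent properties become NaN.
df = pd.json_normalize(carts)

# explode gives each array element its own row, shifting the granularity.
df = df.explode("orders")
print(df)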
Intrarecord Structuring: Combining Multiple
Record Fields
• Combining multiple fields is essentially the reverse of extraction.

• When you are wrangling data, you might need to create a single field
that merges values from multiple related fields.

• E.g., the Individual Contributions dataset contains two related columns:
column 9 (city) and column 10 (state).

• We want to combine the data from these two columns into a single column,
separating the city and state with a comma.

• Our desired output is a single column containing values in the form
"CITY, STATE".

• Combining the data from these two fields can be useful if your downstream
analysis wants to consider this data as part of a single record field.
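A minimal sketch of this combination in pandas; the column names and sample
values are hypothetical:

import pandas as pd

# Hypothetical samples of column 9 (city) and column 10 (state).
df = pd.DataFrame({"city": ["SEATTLE", "PORTLAND"], "state": ["WA", "OR"]})

# Concatenate the two fields, separated by a comma and a space.
df["city_state"] = df["city"] + ", " + df["state"]
print(df)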
Interrecord Structuring: Filtering Records and
Fields
• Filtering involves removing records or fields from a dataset.

• Although filtering is often utilized in cleaning transformations designed
to address dataset quality, you can also use it to alter the granularity of
a dataset by changing the types of records and fields represented in it.
• E.g., the Individual Contributions dataset contains a column that
represents the type of entity that made each donation.

• Based on the FEC data dictionary, this field contains seven distinct
values: CAN, CCM, COM, IND, ORG, PAC, and PTY.

• Based on this column, we could say that the granularity of the dataset is
fairly coarse.

• After all, records can belong to one of seven distinct groups.


• Let’s assume that we are interested in analyzing only campaign
contributions that originated from individuals (represented in the
entity column by “IND”).

• We will need to filter our dataset so that it includes only records that
contain the value "IND" in column 7.

• Performing this operation will produce a dataset with a finer granularity
because each record will now belong to only a single category of values
from the entity type column.

• This type of filtering is called record-based filtering.


• Another type of filtering that is commonly used as a structuring
operation is field-based filtering.

• This type of filtering affects the number of fields, or columns, in your
dataset. A sketch of both filtering types follows.
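A minimal sketch of record-based and field-based filtering in pandas; the
sample values and column names are hypothetical:

import pandas as pd

# Hypothetical subset of the Individual Contributions dataset.
df = pd.DataFrame({
    "entity_tp": ["IND", "PAC", "IND", "ORG"],
    "name": ["ARNOLD, ROBERT", "ACME PAC", "ROSSMAN, THOMAS", "ACME ORG"],
    "amount": [50, 500, 25, 1000],
})

# Record-based filtering: keep only individual contributions.
individuals = df[df["entity_tp"] == "IND"]

# Field-based filtering: keep only the columns needed downstream.
trimmed = individuals[["name", "amount"]]
print(trimmed)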
Interrecord Structuring: Aggregations and
Pivots

• Aggregations and pivots are structuring operations that enable a shift in the
granularity of a dataset.

• E.g., you might start with a dataset of sales transactions and want total
sales amounts by week, by store, or by region.

• This is a fairly straightforward aggregation involving the summation of
record fields.

• A more complex pivot might involve extracting the items purchased out of
the transaction records and building a dataset in which each record
corresponds to an item.
• Consider a dataset composed of individual sales transactions, where
each transaction record contains a field listing the products that were
purchased.

• You can pivot this dataset such that each product becomes a record
with fields describing the product and an aggregated count field
indicating the number of transactions involving this product.

• Alternatively, you could pivot the same dataset to count the number
of transactions per product where the product was purchased alone,
with one additional product, with two additional products, and so on.
Simple Aggregations
• In a simple aggregation, each input record maps to one and only one output
record, whereas each output record combines one or more input records.

• For simple aggregations, the output record fields are simple aggregations
(sum, mean, min, list concatenation, etc.) of the input record fields.
• We can perform a basic aggregation on the Individual Contributions dataset
that produces the following columns:

• One column that contains the average contribution made to each campaign
committee.

• One column that contains the sum of contributions made to each campaign
committee.

• One column that counts the number of contributions made to each campaign
committee.
• We will be performing this basic aggregation on a limited sample of data
from the Individual Contributions dataset.

• Based on the FEC's data dictionary, column 1 identifies the campaign
committee and column 15 contains the contribution amount.

• After aggregating, the output contains one record per committee, as in the
sketch below.
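A minimal sketch of this aggregation in pandas; the committee IDs and amounts
are hypothetical:

import pandas as pd

# Hypothetical sample: column 1 (committee ID) and column 15 (amount).
df = pd.DataFrame({
    "cmte_id": ["C001", "C001", "C002", "C002", "C002"],
    "amount": [100, 300, 50, 50, 200],
})

# One output record per committee: average, sum, and count of contributions.
result = df.groupby("cmte_id")["amount"].agg(["mean", "sum", "count"]).reset_index()
print(result)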
Column-to-Row Pivots
• In column-to-row pivots, each input record maps to multiple output records,
and each output record maps to one and only one input record.

• The output records contain a subset of the input record fields.

• This type of column-to-row pivot is commonly referred to as "unpivoting" or
"denormalizing" data.

• It is particularly useful when your source data contains multiple columns that
represent the same type of data.
• For example, you may have a transactions file that contains the total sales
numbers per region, per year, with one column for each year's sales.

• We want to restructure this dataset so that a single row contains the sales
for a single unique combination of region and year, as in the sketch below.
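A minimal sketch of this unpivot in pandas, with hypothetical regions and
sales figures:

import pandas as pd

# Hypothetical wide-format sales data: one column per year.
df = pd.DataFrame({
    "region": ["North", "South"],
    "sales_2019": [120, 90],
    "sales_2020": [150, 110],
})

# melt unpivots the year columns: one output row per (region, year) pair.
long_df = df.melt(id_vars="region", var_name="year", value_name="sales")
print(long_df)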
Row-to-Column Pivots
• In this type of pivot, output records are sourced from multiple input
records, and input records might support multiple output records.

• Output record fields might involve simple aggregations (e.g., sum or max)
or involve more complex expansions based on the field values.

• This type of pivot is called a row-to-column pivot.

• In the Individual Contributions dataset, we want to create a refined
dataset that shows the sum of contributions made to each campaign
committee, broken out by contribution type.

• In this case, we want to create one new column for each contribution type.
• A subset of the Individual Contributions dataset can illustrate this pivot,
as in the sketch below.
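A minimal sketch of this row-to-column pivot in pandas; the committee IDs,
contribution type codes, and amounts are hypothetical:

import pandas as pd

# Hypothetical subset: committee ID, contribution type, and amount.
df = pd.DataFrame({
    "cmte_id": ["C001", "C001", "C002", "C002"],
    "transaction_tp": ["15", "15E", "15", "24T"],
    "amount": [100, 250, 50, 200],
})

# pivot_table creates one new column per contribution type, summing amounts
# per committee; fill_value=0 covers committees without a given type.
wide = df.pivot_table(index="cmte_id", columns="transaction_tp",
                      values="amount", aggfunc="sum", fill_value=0)
print(wide)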
