
Processing XML with AWS Glue and Databricks Spark-XML


A fast introduction to Glue and some tricks for XML processing, which has never been easy

Playing with unstructured data can sometimes be cumbersome and might involve mammoth tasks to gain control over the data if you have strict rules on its quality and structure.

In this article I will share my experience of processing XML files with Glue transforms versus the Databricks Spark-XML library.

AWS Glue is “the” ETL service provided by AWS. It has three main components: the Data Catalog, Crawlers and ETL Jobs. Crawlers help you extract information (schema and statistics) from your data, while the Data Catalog provides centralised metadata management. With ETL Jobs, you can process the data stored in AWS data stores with either Glue-proposed scripts or your own custom scripts with additional libraries and jars.

XML… Firstly, you can use a Glue crawler to explore the data schema. As XML data is usually nested over multiple levels, the crawled metadata table will have complex data types such as structs and arrays of structs, and you won’t be able to query the XML with Athena since it is not supported. So it is necessary to convert the XML into a flat format. To flatten the XML you can either take the easy route and use Glue’s magic(!), a simple trick of converting it to CSV, or you can use Glue transforms to flatten the data, which I will elaborate on shortly.

You need to be careful with the flattening, which might produce null values even though the data is available in the original structure.

I will give an example of each alternative approach, and it is up to you to choose which one fits your use case.

 Crawl XML
 Convert to CSV with Glue Job
 Using Glue PySpark Transforms to flatten the data
 An alternative: Use Databricks Spark-XML

Dataset: http://opensource.adobe.com/Spry/data/donuts.xml

Code & Snippets: https://github.com/elifinspace/GlueETL/tree/article-2

0. Upload dataset to S3:

Download the file from the link above and go to the S3 service in the AWS console.
Create a bucket with the “aws-glue-” prefix (I am leaving the settings as default for now)

Click on the bucket name and click on Upload. (This is the easiest way to do this; you can also set up the AWS CLI to interact with AWS services from your local machine, which requires a bit more work, incl. installing the AWS CLI, configuration, etc.)

Click on Add files, choose the file you would like to upload, and click Upload.
You can set up security/lifecycle configurations if you click Next.
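If you prefer to script this step rather than clicking through the console, a minimal boto3 sketch could look like the following (the bucket name, region and key are placeholders of my choosing, not from the article):

```python
import boto3

# Hypothetical bucket name (keeping the "aws-glue-" prefix) and region of my choosing
bucket = "aws-glue-xml-article-demo"
region = "eu-west-1"

s3 = boto3.client("s3", region_name=region)

# Create the bucket (LocationConstraint is required outside us-east-1)
s3.create_bucket(
    Bucket=bucket,
    CreateBucketConfiguration={"LocationConstraint": region},
)

# Upload the downloaded dataset
s3.upload_file("donuts.xml", bucket, "input/donuts.xml")
```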

1. Crawl XML Metadata

First of all, if you know which tag in the XML data to choose as the base level for schema exploration, you can create a custom classifier in Glue. Without the custom classifier, Glue will infer the schema from the top level.

In the example XML dataset above, I will choose “item” as my classifier’s row tag and create the classifier as easily as follows:

Go to the Glue UI and click on the Classifiers tab under the Data Catalog section.
“item” will be the root level for the schema exploration
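For reference, the same classifier could also be created programmatically with boto3; a sketch, with a classifier name and region I made up:

```python
import boto3

glue = boto3.client("glue", region_name="eu-west-1")  # assumed region

# Custom XML classifier: the row tag marks the record level for schema exploration
glue.create_classifier(
    XMLClassifier={
        "Name": "donuts-item-classifier",  # hypothetical name
        "Classification": "xml",
        "RowTag": "item",
    }
)
```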

I create the crawler with the classifier:

Give the crawler a name and select the classifier from the list

Leave everything as default for now, and browse to the sample data location (‘Include path’)

Add Another Data Store : No

You can use your IAM role with the relevant read/write permissions on the S3 bucket, or you can create a new one:

Frequency: Run On Demand


Choose the default db (or create a new one) and leave the settings as default

Review and click Finish.

Now we are ready to run the crawler: select the crawler and click on Run crawler. Once the status is ‘Ready’, visit the Databases section and see the tables in the database.

(Tables added: 1 means that our metadata table has been created.)

Go to Tables and filter by your DB:

Click on the table name and the output schema is as follows:

Now we have an idea of the schema, but we have complex data types and need to flatten the data.

2. Convert to CSV:

This step is simple, and we will use the script proposed by Glue:

Go to the Jobs section in the ETL menu and click Add job:

Name the job and choose the IAM role we created earlier (make sure that this role has permissions to read from the source and write to the target location)
Tick the option above, choose S3 as the target data store, CSV as the format, and set the target path

Now the magic step: (If we had selected Parquet as the format, we would have to do the flattening ourselves, as Parquet can hold complex types, but for CSV the mapping is revealed easily.)
You can rename, change the data types, and remove or add columns in the target. I want to point out that the array fields are mapped to string, which is not desirable from my point of view.

I leave everything as default, review, save and continue with editing the script.

The Glue-proposed script:


We can run the job immediately or edit the script in any way. Since it is fundamentally Python code, you have the option to convert the dynamic frame into a Spark dataframe, apply UDFs etc., then convert back to a dynamic frame and save the output. (You can stick to Glue transforms if you wish; they can be quite useful since the Glue Context provides extended Spark transformations.)

I have added some lines to the proposed script to generate a single CSV output; otherwise the output would be multiple small CSV files, one per partition.
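A minimal sketch of that tweak; `datasource0` and `glueContext` come from the Glue-generated script, and the target path is the hypothetical bucket from earlier:

```python
from awsglue.dynamicframe import DynamicFrame

# Convert the dynamic frame to a Spark DataFrame so we can control partitioning
df = datasource0.toDF()

# Collapse everything into a single partition so the job writes one CSV file
single_dyf = DynamicFrame.fromDF(df.repartition(1), glueContext, "single_dyf")

glueContext.write_dynamic_frame.from_options(
    frame=single_dyf,
    connection_type="s3",
    connection_options={"path": "s3://aws-glue-xml-article-demo/output/csv/"},  # hypothetical path
    format="csv",
)
```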

Save and click on Run job. This will bring up a configuration review, where you can set the DPU to 2 (the lowest it can be) and the timeout as follows:
Let’s run it and see the output. You can monitor the status in the Glue UI as follows:
Once the run status is Succeeded, go to your target S3 location:

Click on the file name and go to the Select From tab as below:
If you scroll down, you can preview and query small files easily by clicking Show File Preview/Run SQL (Athena in the background):

The struct fields were propagated, but the array fields remained; to explode array-type columns, we will use pyspark.sql’s explode in the coming stages.

3. Glue PySpark Transforms for Unnesting


There are two PySpark transforms provided by Glue:

 Relationalize: Unnests the nested columns, pivots array columns, generates join keys for relational operations (joins, etc.), and produces a list of frames.
 UnnestFrame: Unnests the frame, generates join keys for array-type columns, and produces a single frame with all fields, incl. the join key columns.

We will use a Glue DevEndpoint to visualize these transformations:

A Glue DevEndpoint is the connection point to your data stores, allowing you to debug your scripts and do exploratory analysis on the data using the Glue Context with a SageMaker or Zeppelin notebook.

Moreover, you can also access this endpoint from Cloud9, AWS’s cloud-based IDE for writing, running, and debugging your code. You just need to generate an SSH key on the Cloud9 instance and add the public SSH key while creating the endpoint. To connect to the endpoint, use the “SSH to Python REPL” command in the endpoint details (click on the endpoint name in the Glue UI), replacing the private key parameter with the location of your key on the Cloud9 instance.

 Create a Glue DevEndpoint and a SageMaker notebook:

I will also use this endpoint for the Databricks Spark-XML example, so download the jar file to your PC from https://mvnrepository.com/artifact/com.databricks/spark-xml_2.11/0.4.1, upload the jar to S3 and set the “Dependent jars path” accordingly:
Name it and choose the IAM role we used before. If you have a codebase you want to use, you can add its path to the Python library path.

You can leave every other configuration as default and click Finish. It takes approx. 6 minutes for the endpoint to become Ready.

Once the endpoint is ready, we are ready to create a notebook to connect to it.

Choose your endpoint and click Create SageMaker notebook from the Actions drop-down list. It will take a couple of minutes for the notebook to be ready once created.
Name it, leave the default settings, name the new IAM role, and click Create notebook

Open the notebook and create a new PySpark notebook:

You can copy and paste the boilerplate from the CSV job we created previously, change the glueContext line as below and comment out the job-related libraries and snippets:
You can either create the dynamic frame from the catalog, or use “from options”, with which you can point to a specific S3 location to read the data; without creating a classifier as we did before, you can just set the format options to read the data. A sketch of the “from options” variant follows.
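A minimal sketch of that notebook setup, assuming the hypothetical S3 path from earlier; the rowTag format option plays the role of the classifier’s row tag:

```python
from pyspark.context import SparkContext
from awsglue.context import GlueContext

# In a DevEndpoint notebook there is no Job to initialise; just build a GlueContext
glueContext = GlueContext(SparkContext.getOrCreate())

# Read the XML straight from S3 without a crawler or classifier
dyf = glueContext.create_dynamic_frame.from_options(
    connection_type="s3",
    connection_options={"paths": ["s3://aws-glue-xml-article-demo/input/"]},  # hypothetical path
    format="xml",
    format_options={"rowTag": "item"},
)

dyf.printSchema()
```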

You can find more about format options at https://docs.aws.amazon.com/glue/latest/dg/aws-glue-programming-etl-format.html

 Relationalize:

I used the frame created by “from options” for the following steps (the outputs will be the same even if you use the catalog option, as the catalog does not persist a static schema for the data).
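A sketch of applying Relationalize to the frame read above (the staging path is a hypothetical S3 location for intermediate storage):

```python
from awsglue.transforms import Relationalize

# Flatten nested structs and pivot arrays out into separate frames;
# "root" names the main frame in the returned collection
frames = Relationalize.apply(
    frame=dyf,
    staging_path="s3://aws-glue-xml-article-demo/tmp/",  # hypothetical staging path
    name="root",
    transformation_ctx="relationalize",
)

print(list(frames.keys()))              # e.g. root, root_fillings_filling, ...
frames.select("root").toDF().show(truncate=False)
```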
You can see that the transform returns a collection of frames, each with an id and index column for join keys and array elements respectively.

It will be clearer if you look at the root table. For example, the fillings field in the root table holds only an integer value, which matches the id column in the root_fillings_filling frame above.
An important point is that the “batters.batter” field propagated into multiple columns. For item 2 the “batters.batter” column is identified as a struct, however for item 3 this field is an array! Here comes the difficulty of working with Glue.

If you have a complicated, multilevel nested structure, this behaviour can cost you maintainability and control over the outputs and cause problems such as data loss, so alternative solutions should be considered.

 UnnestFrame:

Let’s see how this transform gives us a different output:
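A sketch of applying it to the same frame:

```python
from awsglue.transforms import UnnestFrame

# Unnest everything into a single frame; array-type fields get join key columns
unnested = UnnestFrame.apply(frame=dyf, transformation_ctx="unnest")

unnested.printSchema()
unnested.toDF().show(truncate=False)
```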


We can see that this time everything is in one frame, but again “batters.batter” resulted in multiple columns, which also brings uncertainty around the number of columns. Considering an ETL pipeline, each time a new file comes in, this structure will probably change.

The unnest transform can spread out the upper-level structs but is not effective at flattening arrays of structs. Since we cannot apply UDFs on dynamic frames, we need to convert the dynamic frame into a Spark dataframe and apply explode on the relevant columns to spread array-type columns into multiple rows. I will leave this part for your own investigation, but a small sketch of the idea follows.
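A minimal sketch of that conversion and explode step; the column name is only illustrative of an array-of-structs field, as the actual names depend on the flattened schema you end up with:

```python
from pyspark.sql.functions import col, explode_outer

# DynamicFrame -> Spark DataFrame so regular Spark SQL functions can be applied
df = unnested.toDF()

# Explode an array-of-structs column into one row per element;
# explode_outer keeps rows whose array is null/empty instead of dropping them
exploded = df.withColumn(
    "filling",
    explode_outer(col("`fillings.filling`")),  # illustrative column name
)

exploded.show(truncate=False)
```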

Moreover, I would not expect two different spreads of “batters.batter”; imho there could be a single “array of structs” column for this field, and “item 2” would simply have an array of length 1 containing its one struct.

And finally… Databricks Spark-XML:

It may not be the best solution, but this package is very useful in terms of control and accuracy. A nice feature is that unparseable records are also detected and a _corrupt_record column is added with the relevant information.
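A minimal sketch of reading the dataset with the package, reusing the GlueContext’s Spark session and the hypothetical S3 path from earlier (the jar must be on the classpath, e.g. via the endpoint’s “Dependent jars path”):

```python
# Reuse the Spark session from the GlueContext created earlier
spark = glueContext.spark_session

# The uploaded jar provides the "com.databricks.spark.xml" data source
df = (
    spark.read.format("com.databricks.spark.xml")
    .option("rowTag", "item")  # same row tag as before
    .load("s3://aws-glue-xml-article-demo/input/donuts.xml")  # hypothetical path
)

df.printSchema()  # batters.batter shows up as an array of structs
```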

Now here is the difference I expected :). You can see that “batters.batter” is an array of structs. For more reading options, you can have a look at https://github.com/databricks/spark-xml

Batters: No nulls, no probs

So you don’t need to consider whether a column is a struct or an array; you can write a generic function for exploding array columns by making use of the extracted schema.
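A sketch of such a generic helper, driven only by the DataFrame’s schema (the function name and details are my own, not from the article):

```python
from pyspark.sql import DataFrame
from pyspark.sql.functions import col, explode_outer
from pyspark.sql.types import ArrayType

def explode_array_columns(df: DataFrame) -> DataFrame:
    """Explode every top-level array column into rows, using the extracted schema."""
    for field in df.schema.fields:
        if isinstance(field.dataType, ArrayType):
            # Backticks guard against dots in column names
            df = df.withColumn(field.name, explode_outer(col(f"`{field.name}`")))
    return df

# Usage: flat_df = explode_array_columns(df)
```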
Just to mention, I used Databricks’ Spark-XML in the Glue environment; however, you can use it in a standalone Python script, since it is independent of Glue.

We saw that even though Glue provides one-line transforms for dealing with semi-structured and unstructured data, if we have complex data types we need to work with samples and see what fits our purpose.
