Processing XML With AWS Glue and Databricks Spark
AWS Glue is “the” ETL service provided by AWS. It has three main components: the Data Catalogue, Crawlers and ETL Jobs. Crawlers help you extract information (schema and statistics) from your data, while the Data Catalogue provides centralised metadata management. With ETL Jobs, you can process the data stored in AWS data stores with either the scripts Glue proposes or your own custom scripts with additional libraries and jars.
XML… First, you can use a Glue crawler to explore the data schema. As XML data is mostly multilevel nested, the crawled metadata table will have complex data types such as structs and arrays of structs, and you won’t be able to query the XML with Athena since these are not supported. So it is necessary to convert the XML into a flat format. To flatten the XML you can either take the easy way and use Glue’s magic(!), a simple trick of converting it to CSV, or you can use Glue transforms to flatten the data, which I will elaborate on shortly.
You need to be careful with the flattening, as it might produce null values even when the data is present in the original structure.
1. Crawl XML
2. Convert to CSV with a Glue Job
3. Use Glue PySpark Transforms to flatten the data
4. An alternative: use Databricks Spark-xml
Dataset: http://opensource.adobe.com/Spry/data/donuts.xml
Code & Snippets: https://github.com/elifinspace/GlueETL/tree/article-2
Download the file from the link above and go to the S3 service in the AWS console.
Create a bucket with the “aws-glue-” prefix (I am leaving the settings as default for now).
Click on Add files, choose the file you would like to upload, and click Upload.
You can set up security/lifecycle configurations if you click Next.
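If you prefer to script this step instead of using the console, a rough boto3 equivalent looks like the sketch below. The bucket name is a placeholder (keep the “aws-glue-” prefix so the default Glue service role can access it), and credentials/region are taken from your environment:

```python
import urllib.request
import boto3

# Download the sample dataset locally
urllib.request.urlretrieve(
    "http://opensource.adobe.com/Spry/data/donuts.xml", "donuts.xml"
)

# Upload it to a bucket with the "aws-glue-" prefix (bucket name is hypothetical)
boto3.client("s3").upload_file("donuts.xml", "aws-glue-xml-demo", "raw/donuts.xml")
```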
First of all, if you know which tag in the XML data to use as the base level for schema exploration, you can create a custom classifier in Glue. Without the custom classifier, Glue will infer the schema from the top level.
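For reference, a custom XML classifier can also be created programmatically with boto3 (a sketch; the classifier name is hypothetical, and “item” is assumed to be the repeating row element in donuts.xml):

```python
import boto3

glue = boto3.client("glue")

# Create a custom XML classifier that tells the crawler which tag marks a "row".
glue.create_classifier(
    XMLClassifier={
        "Name": "donuts-xml-classifier",  # hypothetical name
        "Classification": "xml",
        "RowTag": "item",                 # adjust RowTag for your data
    }
)
```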
Leave everything as default for now and browse to the sample data location (‘Include path’).
You can use an existing IAM role with the relevant read/write permissions on the S3 bucket, or you can create a new one:
Now we are ready to run the crawler: select the crawler and click on Run Crawler. Once the status is ‘Ready’, visit the Databases section and see the tables in the database.
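The same steps can be scripted with boto3 if you prefer (a sketch; the crawler and database names are placeholders for whatever you configured in the console):

```python
import boto3

glue = boto3.client("glue")

# Kick off the crawler created in the console (name is hypothetical)
glue.start_crawler(Name="donuts-xml-crawler")

# Once the crawler has finished, list the tables it registered in the Data Catalogue
tables = glue.get_tables(DatabaseName="donuts_db")["TableList"]
for table in tables:
    print(table["Name"], table["StorageDescriptor"]["Columns"])
```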
2. Convert to CSV:
Name the job and simply choose the IAM role we created earlier (make sure that this role has permissions to read from the source and write to the target location).
Tick the option above, choose S3 as the target data store, set the format to CSV and set the target path.
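For reference, the script Glue proposes for such a job looks roughly like the sketch below. The database, table and target path are placeholders, and the real generated script also includes mapping/resolveChoice steps that I have left out:

```python
import sys
from awsglue.transforms import *
from awsglue.utils import getResolvedOptions
from pyspark.context import SparkContext
from awsglue.context import GlueContext
from awsglue.job import Job

# Standard boilerplate of a generated Glue job
args = getResolvedOptions(sys.argv, ["JOB_NAME"])
sc = SparkContext()
glueContext = GlueContext(sc)
spark = glueContext.spark_session
job = Job(glueContext)
job.init(args["JOB_NAME"], args)

# Read the crawled XML table from the Data Catalogue as a DynamicFrame
datasource0 = glueContext.create_dynamic_frame.from_catalog(
    database="donuts_db",     # hypothetical database name
    table_name="donuts_xml",  # hypothetical table name
)

# Write the frame out to S3 as CSV
glueContext.write_dynamic_frame.from_options(
    frame=datasource0,
    connection_type="s3",
    connection_options={"path": "s3://aws-glue-xml-demo/csv-output/"},
    format="csv",
)

job.commit()
```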
Save and click on Run Job; this will bring up a configuration review, so you can set the DPU to 2 (the minimum) and the timeout as follows:
Let’s run it and see the output. You can monitor the status in the Glue UI as follows:
Once the Run Status is Succeeded, go to your target S3 location:
Click on the file name and go to the Select From tab as below:
If you scroll down, you can preview and query small files easily by clicking Show File Preview/Run SQL (Athena in the background):
The struct fields were propagated into columns, but the array fields remained; to explode array type columns, we will use pyspark.sql’s explode in the coming stages.
Moreover, you can also access this development endpoint from Cloud9, the cloud-based IDE environment for writing, running, and debugging your code. You just need to generate an SSH key on the Cloud9 instance and add the public SSH key while creating the endpoint. To connect to the endpoint, use the “SSH to Python REPL” command in the endpoint details (click on the endpoint name in the Glue UI), replacing the private key parameter with the location of yours on your Cloud9 instance.
You can copy and paste the boilerplate from the CSV job we created previously, change the glueContext line as below, and comment out the job-related libraries and snippets:
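Assuming the generated job boilerplate sketched earlier, the REPL version would look roughly like this:

```python
from pyspark.context import SparkContext
from awsglue.context import GlueContext
from awsglue.transforms import *

# In the REPL we reuse the SparkContext that is already running,
# so the job-related imports and calls are commented out:
# from awsglue.utils import getResolvedOptions
# from awsglue.job import Job
# args = getResolvedOptions(sys.argv, ["JOB_NAME"])
# job = Job(glueContext); job.init(args["JOB_NAME"], args); ...; job.commit()

glueContext = GlueContext(SparkContext.getOrCreate())
spark = glueContext.spark_session
```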
You can either create a dynamic frame from the catalog, or use “from options”, with which you can point to a specific S3 location to read the data; without creating a classifier as we did before, you can just set the format options to read the data.
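Both reads are sketched below; the database, table, path and row tag are the same assumptions as before:

```python
# Option 1: read via the Data Catalogue table the crawler created
dyf_catalog = glueContext.create_dynamic_frame.from_catalog(
    database="donuts_db",     # hypothetical database name
    table_name="donuts_xml",  # hypothetical table name
)

# Option 2: read the raw XML directly from S3,
# choosing the row tag via format options instead of a classifier
dyf_options = glueContext.create_dynamic_frame.from_options(
    connection_type="s3",
    connection_options={"paths": ["s3://aws-glue-xml-demo/raw/"]},
    format="xml",
    format_options={"rowTag": "item"},
)

dyf_options.printSchema()
```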
Relationalize:
I used the frame created by from options for the following steps (the outputs will be the same even if you use the catalog option, since the catalog does not persist a static schema for the data).
You can see that the transform returns a collection of frames, each of which has an id column for join keys and an index column for array elements.
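A sketch of the Relationalize call (the staging path is a placeholder, and the exact frame names depend on your schema):

```python
from awsglue.transforms import Relationalize

# Relationalize flattens nested structs and splits arrays into separate frames,
# using a staging path in S3 for intermediate data.
frames = Relationalize.apply(
    frame=dyf_options,
    staging_path="s3://aws-glue-xml-demo/staging/",  # hypothetical staging location
    name="root",
)

# The result is a DynamicFrameCollection keyed by frame name,
# e.g. "root", "root_fillings_filling", ...
for name in frames.keys():
    print(name)
    frames.select(name).toDF().show(truncate=False)
```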
It will be clearer if you look at the root table. For example, the fillings field holds only an integer value in the root table; this value matches the id column in the root_fillings_filling frame above.
An important thing to note is that the “batters.batter” field was propagated into multiple columns. For item 2 the “batters.batter” column is identified as a struct, yet for item 3 this field is an array! So here comes the difficulty of working with Glue.
If you have a complicated multilevel nested structure, this behaviour can cause a loss of maintainability and control over the outputs, and problems such as data loss, so alternative solutions should be considered.
Unnest Frame:
Unnest can spread out the upper level structs, but it is not effective at flattening arrays of structs. Since we cannot apply UDFs on dynamic frames, we need to convert the dynamic frame into a Spark dataframe and apply explode on the array type columns to spread them into multiple rows. I will leave this part for your own investigation.
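As a starting point for that investigation, here is a sketch; the “fillings.filling” path and column names are assumptions based on the donuts dataset and on that field resolving to an array of structs in your schema:

```python
from awsglue.transforms import UnnestFrame
from pyspark.sql.functions import explode

# Unnest spreads the upper-level struct fields into top-level columns,
# but array-of-struct columns are left as arrays.
unnested = UnnestFrame.apply(frame=dyf_options)
unnested.printSchema()

# To flatten the arrays, convert to a Spark DataFrame and explode the array column
# into one row per element (adjust the column path to your data).
df = dyf_options.toDF()
exploded = df.withColumn("filling", explode(df["fillings"]["filling"]))
exploded.select("name", "filling").show(truncate=False)
```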
Moreover, I would not expect two different representations of “batters.batter”; in my opinion there could be a single “array of structs” column for this field, and item 2 would then have an array of length 1 containing its one struct.
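As an alternative, you can read the XML with the Databricks spark-xml package instead (a sketch; the package must be available on your cluster or dev endpoint, and the path and row tag are the same assumptions as above):

```python
# Read the XML directly with the Databricks spark-xml package
df_xml = (
    spark.read.format("com.databricks.spark.xml")
    .option("rowTag", "item")
    .load("s3://aws-glue-xml-demo/raw/donuts.xml")
)

df_xml.printSchema()  # "batters.batter" shows up as an array of structs
```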
Now here is the difference I expected :). You can see that “batters.batter” is an array of structs.
Moreover, for more reading options, you can have a look at https://github.com/databricks/spark-xml.
We saw that even though Glue provides one-line transforms for dealing with semi-structured and unstructured data, if we have complex data types we need to work with samples and see what fits our purpose.