AWS Glue
AWS Glue is a serverless ETL (extract, transform, and load) service on the AWS cloud. It makes it easy to
prepare data for analytics.
Data catalog: The data catalog holds the metadata and the structure of the data.
Database: It is used to create or access the database for the sources and targets.
Table: Create one or more tables in the database that can be used by the source and target.
Crawler and Classifier: A crawler retrieves the schema of the data from the source using built-in or
custom classifiers. It creates/uses metadata tables that are pre-defined in the data catalog.
Job: A job is the business logic that carries out an ETL task. Internally, this business logic is written
in Apache Spark using Python or Scala.
Trigger: A trigger starts the ETL job execution on-demand or at a specific time.
Development endpoint: It creates a development environment where the ETL job script can be
tested, developed and debugged.
On the Amazon S3 console, click Create bucket to create a bucket where you can store files and folders.
Enter a bucket name, select a Region and click Next.
The remaining configuration settings for creating an S3 bucket are optional; click Next to create the S3
bucket.
Create a new folder in the bucket and upload the source CSV files.
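If you prefer to script this setup, the following is a minimal boto3 sketch; the bucket name, region, folder name and file names are placeholders rather than values from this walkthrough.

import boto3

s3 = boto3.client("s3", region_name="us-east-1")

# Create the bucket (bucket names must be globally unique).
s3.create_bucket(Bucket="my-glue-demo-bucket")

# Upload the source CSV files under a folder (key prefix) in the bucket.
s3.upload_file("tbl_syn_source_1.csv", "my-glue-demo-bucket", "read/tbl_syn_source_1.csv")
s3.upload_file("tbl_syn_source_2.csv", "my-glue-demo-bucket", "read/tbl_syn_source_2.csv")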
Prerequisite: You must have an existing cluster, a database name and a user for the database in Amazon
Redshift.
In the AWS Glue console, click Add Connection in the left pane and follow the wizard, supplying the
cluster, database name and user credentials.
The Amazon Redshift connection is now created and can be verified through Test Connection.
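The same connection can also be created programmatically. A hedged boto3 sketch follows; the connection name, JDBC URL, credentials and network settings are all placeholders for your own cluster details.

import boto3

glue = boto3.client("glue", region_name="us-east-1")

glue.create_connection(
    ConnectionInput={
        "Name": "redshift-demo-connection",
        "ConnectionType": "JDBC",
        "ConnectionProperties": {
            # Placeholder endpoint; use your own cluster's JDBC URL.
            "JDBC_CONNECTION_URL": "jdbc:redshift://my-cluster.abc123.us-east-1.redshift.amazonaws.com:5439/dev",
            "USERNAME": "awsuser",
            "PASSWORD": "********",
        },
        "PhysicalConnectionRequirements": {
            "SubnetId": "subnet-0123456789abcdef0",
            "SecurityGroupIdList": ["sg-0123456789abcdef0"],
            "AvailabilityZone": "us-east-1a",
        },
    }
)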
On the left pane in the AWS Glue console, click on Crawlers -> Add Crawler
Enter the crawler name in the dialog box and click Next
Choose S3 as the data store from the drop-down list
In the Include path field, select the folder where your CSV files are stored.
Select Choose an existing IAM role, pick the previously created role name from the drop-down list of IAM
roles and click Next.
Table prefixes are optional and left to the user to customize. The crawler creates the metadata tables
automatically after it runs. Click Next.
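For reference, the crawler can also be defined with boto3. A minimal sketch, assuming the db_demo1 catalog database used later in this walkthrough; the crawler name, IAM role and S3 path are placeholders.

import boto3

glue = boto3.client("glue", region_name="us-east-1")

glue.create_crawler(
    Name="csv-source-crawler",       # placeholder crawler name
    Role="AWSGlueServiceRole-demo",  # the IAM role selected above
    DatabaseName="db_demo1",         # catalog database referenced later in the script
    Targets={"S3Targets": [{"Path": "s3://my-glue-demo-bucket/read/"}]},
)

# Run the crawler to populate the metadata tables.
glue.start_crawler(Name="csv-source-crawler")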
Choose a data target table from the list of tables. You can either create new tables or choose an existing
one. If you haven't created any target table, select the Create tables in your data target option.
Enter a database name that must exist in the target data store. Click Next.
You can map the columns of the source table with those of the target table. Click Save job and edit script.
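In the generated script, this console column mapping appears as an ApplyMapping transform. A minimal sketch, using the datasource1 frame created later in the script; each tuple is (source column, source type, target column, target type), and the columns shown are illustrative.

mapped = ApplyMapping.apply(
    frame=datasource1,
    mappings=[
        ("statecode", "string", "statecode", "string"),
        ("name", "string", "name", "string"),  # hypothetical column
    ],
    transformation_ctx="mapped",
)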
Open the Python script by selecting the recently created job name, then click Action -> Edit Script.
The left pane shows a visual representation of the ETL process. The right-hand pane shows the script
code, and just below it you can see the logs of the running job.
The following script performs the extract, transform and load process on AWS Glue.
import sys
from awsglue.transforms import *
from awsglue.utils import getResolvedOptions
from pyspark.context import SparkContext
from awsglue.context import GlueContext
from awsglue.job import Job

# Resolve the job arguments passed in by AWS Glue.
args = getResolvedOptions(sys.argv, ['JOB_NAME'])

sc = SparkContext()
glueContext = GlueContext(sc)
spark = glueContext.spark_session
job = Job(glueContext)
job.init(args['JOB_NAME'], args)
Extract the data of the tbl_syn_source_1_csv and tbl_syn_source_2_csv tables from the data catalog. AWS
Glue represents the data as DynamicFrames.
datasource1 = glueContext.create_dynamic_frame.from_catalog(
    database="db_demo1",
    table_name="tbl_syn_source_1_csv",
    transformation_ctx="datasource1",
)
datasource2 = glueContext.create_dynamic_frame.from_catalog(
    database="db_demo1",
    table_name="tbl_syn_source_2_csv",
    transformation_ctx="datasource2",
)
Now, apply transformations on the source tables. You can join both tables on the statecode column of
tbl_syn_source_1_csv and the code column of tbl_syn_source_2_csv.
Several transformations are available within AWS Glue, such as RenameField, SelectFields, Join, etc.
Refer to https://docs.aws.amazon.com/glue/latest/dg/built-in-transforms.html.
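A minimal sketch of the join step using the built-in Join transform; the frame names follow the extract snippet above, and the result name is a placeholder.

joined = Join.apply(
    frame1=datasource1,
    frame2=datasource2,
    keys1="statecode",
    keys2="code",
    transformation_ctx="joined",
)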
Load the joined DynamicFrame into Amazon Redshift (Database=dev and Schema=shc_demo_1).
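A hedged sketch of the load step using write_dynamic_frame.from_jdbc_conf; the dev database and shc_demo_1 schema come from the step above, while the connection name, target table and temporary S3 directory are placeholders.

glueContext.write_dynamic_frame.from_jdbc_conf(
    frame=joined,
    catalog_connection="redshift-demo-connection",  # the Glue connection created earlier
    connection_options={"dbtable": "shc_demo_1.tbl_joined", "database": "dev"},
    redshift_tmp_dir="s3://my-glue-demo-bucket/temp/",  # staging area for the Redshift COPY
    transformation_ctx="datasink",
)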
job.commit()
AWS Glue's built-in transforms have a few limitations; for example, UNION, LEFT JOIN and RIGHT JOIN are
not directly available. To overcome this, you can use Spark.
Convert the AWS Glue DynamicFrame to a Spark DataFrame, and then you can apply Spark functions for the
various transformations.
Example: a Union transformation is not available in AWS Glue, but you can use Spark's union() to achieve
a union of two tables.
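A minimal sketch of that round trip; the frame names follow the extract snippet above, and the result name is a placeholder.

from awsglue.dynamicframe import DynamicFrame

# Convert both DynamicFrames to Spark DataFrames.
df1 = datasource1.toDF()
df2 = datasource2.toDF()

# union() matches columns by position, so both DataFrames must share the same schema.
unioned_df = df1.union(df2)

# Convert back to a DynamicFrame so Glue transforms and sinks can consume it.
unioned = DynamicFrame.fromDF(unioned_df, glueContext, "unioned")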
The complete ETL script is available at https://gist.github.com/nitinmlvya/ba4626e8ec40dc546119bb14a8349b45