Exploratory Data Analysis
• Merge
• Convert data type
• Handle missing and incorrect values
• Organize data
Diagram: Data Source, Data Source, Data Destination
Glue DynamicFrame
• Glue DynamicFrame is similar to a Spark DataFrame
• Optimized for ETL jobs
• Glue Catalog Integration
• Easy to read and write from AWS Data sources and destinations
• Support for mixed data types in a field
Diagram: Cluster (Server 1, Server 2, Server 3)
Copyright © 2023 ChandraMohan Lingam. All Rights Reserved.
Mixed Data Type
Price (original)    Price (cast to float)
100.00              100.00
“Three”             NULL
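The cast above can be sketched in plain Python. This is an illustration only, not how Glue implements the cast: the helper name `cast_to_float` is hypothetical, and `None` stands in for NULL.

```python
from typing import Any, Optional


def cast_to_float(value: Any) -> Optional[float]:
    """Return value as a float, or None (NULL) when it cannot be converted."""
    try:
        return float(value)
    except (TypeError, ValueError):
        # Text such as "Three" has no numeric form, so it becomes NULL
        return None


# A Price field holding mixed data types: a number and free text
prices = [100.00, "Three"]
cleaned = [cast_to_float(p) for p in prices]
print(cleaned)
```

The point to notice is that the invalid value is not dropped or raised as an error; it survives as NULL so that it can be handled in a later data quality step.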
toDF: Glue DynamicFrame → Spark DataFrame
fromDF: Spark DataFrame → Glue DynamicFrame
S3 Data Sources
• Bookmarks use the last modified time of each object to identify new data
State       Description
Enabled     Glue persists state information. The initial job execution
            handles all data, while subsequent runs process only new data
Disabled    Job processes the entire dataset during every job run
Paused      Job processes incremental data since the last successful
            run. However, the bookmark state is not updated
If your job is skipping data, verify the bookmark configuration and reset the
bookmark if necessary
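The three bookmark states can be illustrated with a small simulation. This is a simplified sketch, not the Glue API: the `Bookmark` class, the `select_objects` function, and the numeric timestamps are all hypothetical, and the real bookmark state holds more than a single timestamp.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass
class Bookmark:
    """Hypothetical stand-in for the persisted bookmark state."""
    last_processed: float = 0.0  # newest last-modified time already handled


def select_objects(objects: Dict[str, float], bookmark: Bookmark, state: str) -> List[str]:
    """Pick the S3 keys that one job run would process.

    objects maps key -> last-modified time; state is "enabled", "disabled" or "paused".
    """
    if state == "disabled":
        # No bookmark is consulted: every run handles the entire dataset
        return sorted(objects)
    # Enabled and paused both read only objects newer than the bookmark
    selected = sorted(k for k, t in objects.items() if t > bookmark.last_processed)
    if state == "enabled" and selected:
        # Only the enabled state moves the bookmark forward
        bookmark.last_processed = max(objects[k] for k in selected)
    return selected


bookmark = Bookmark()
objects = {"a.csv": 1.0, "b.csv": 2.0}
first_run = select_objects(objects, bookmark, "enabled")   # initial run: all data
objects["c.csv"] = 3.0                                      # a new object arrives
paused_run = select_objects(objects, bookmark, "paused")   # new data, bookmark unchanged
second_run = select_objects(objects, bookmark, "enabled")  # same new data, bookmark advances
third_run = select_objects(objects, bookmark, "enabled")   # nothing new to process
full_run = select_objects(objects, bookmark, "disabled")   # everything, every time
```

In this sketch the paused run and the following enabled run both return `c.csv`, because a paused run reads incrementally but does not record its progress.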
Job                  Purpose
Spark Jobs           ETL jobs run on a serverless Spark cluster. Suitable for
                     batch processing
Streaming ETL        ETL jobs for continuous processing of data from sources
                     like Kinesis and Kafka. Load data to a data lake or
                     database. Process and deliver data in minutes
Python Shell Jobs    For ETL jobs that do not require Apache Spark
Ray                  Ray (ray.io) is an open-source framework for distributed
                     Python and Machine Learning workloads
Technology                   Purpose
Python DataFrame (Pandas)    Small to medium datasets
                             Single machine processing
Apache Spark DataFrame       Distributed architecture and in-memory processing
                             Very large datasets
Glue DynamicFrame            Optimized for ETL workflows
                             Bookmarking support
Use Glue DynamicFrame for AWS integration touchpoints and Spark
DataFrame for data processing and transformation
▪ Athena
▪ Lazy Simple SerDe (CSV)
▪ Open CSV SerDe
▪ Handle Data Quality Issues
▪ SQL – Extract, Load, Transform
▪ Apache Spark – Extract, Transform, Load
▪ Views to Simplify Querying
▪ Visualize with Amazon QuickSight
Diagram (N. Virginia): Crawler, QuickSight, Athena, Glue Catalog, S3
Diagram: Athena, Create Table, Glue Catalog, S3
Diagram: SQL, Athena, Create Table, Glue Catalog, S3
Numeric comparison: 1005 is greater than 5
String comparison: “1005” is less than “5”, because strings are compared character by character and “1” sorts before “5”
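The difference matters when a numeric column is stored as text, because sorting and filtering then follow string order. The short Python sketch below reproduces both comparisons. Python is used here only as a stand-in for SQL, and the variable names are illustrative.

```python
# Numeric comparison: values are compared by magnitude
numeric_result = 1005 > 5

# String comparison: characters are compared left to right, and "1" sorts before "5"
string_result = "1005" < "5"

# Casting the text back to a number restores the numeric ordering
cast_result = int("1005") > int("5")

print(numeric_result, string_result, cast_result)
```

All three expressions evaluate to true, which is why a column holding numbers as strings should be cast to a numeric type before it is compared or sorted.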
Diagram: SQL, Athena, Create View, Create Table, Glue Catalog, S3
Diagram: QuickSight, Athena, Table, Glue Catalog, S3
• Highest Rank
• University Distribution
• Top-N universities in the World
• Regional Ranking
Diagram (N. Virginia): QuickSight, Athena, Table, Glue Catalog, S3
Chandra Lingam
7X AWS Certified
100K+ Students