0% found this document useful (0 votes)
7 views

Exploratory Data Analysis

Uploaded by

ramdotind
Copyright
© © All Rights Reserved
Available Formats
Download as PDF, TXT or read online on Scribd
0% found this document useful (0 votes)
7 views

Exploratory Data Analysis

Uploaded by

ramdotind
Copyright
© © All Rights Reserved
Available Formats
Download as PDF, TXT or read online on Scribd
You are on page 1/ 41

Exploratory Data Analysis

University Rankings Dataset

Copyright © 2023 ChandraMohan Lingam. All Rights Reserved.


Glue ETL

Copyright © 2023 ChandraMohan Lingam. All Rights Reserved.


Extract, Transform, and Load

Extract Transform Load

• Merge
• Convert data type
Data Data • Handle missing
and incorrect Data Destination
Source Source
values
• Organize data

Copyright © 2023 ChandraMohan Lingam. All Rights Reserved.


Glue ETL

Fully managed ETL service to run jobs

Serverless Apache Spark

Copyright © 2023 ChandraMohan Lingam. All Rights Reserved.


DataFrame - Python and R
DataFrame • Python (pandas) and R widely
used for data transformation
analysis
• Load to a DataFrame
• Fast local access
Computer memory • Suitable for small, medium
datasets
• Single machine analysis
Database File Share

Copyright © 2023 ChandraMohan Lingam. All Rights Reserved.


Apache Spark DataFrame

• Spark DataFrame is spread


across many machines
Spark DataFrame • Spark ETL - Code is moved
to the server where data
is located
• Python and R DataFrame –
Data is moved to the
server where ETL code is
located
• Spark is efficient with
Server 1 Server 2 Server 3
large datasets
Cluster
Copyright © 2023 ChandraMohan Lingam. All Rights Reserved.
Glue DynamicFrame

• Glue DynamicFrame is
similar to a Spark
Glue DynamicFrame DataFrame
• Optimized for ETL jobs
• Glue Catalog Integration
• Easy to read and write
from AWS Data sources and
destinations
• Support for mixed data
types in a field
Server 1 Server 2 Server 3

Cluster
Copyright © 2023 ChandraMohan Lingam. All Rights Reserved.
Mixed Data Type

Spark DataFrame stores Quantity as a string


Quantity type
25 Quantity: str
“Three”
Glue DynamicFrame stores Quantity as a
union type
Quantity: [str, integer]

Copyright © 2023 ChandraMohan Lingam. All Rights Reserved.


DynamicFrame - Handling Data Type Inconsistencies

• Cast column to a single data type


• Make new columns – one for each data type

Copyright © 2023 ChandraMohan Lingam. All Rights Reserved.


DynamicFrame - Cast to a single Type

Price Price
Cast to float
100.00 100.00
“Three” NULL

Cast to a single type


• Only valid values are converted
• Others are stored as NULL

Copyright © 2023 ChandraMohan Lingam. All Rights Reserved.


DynamicFrame - Make Column

Price Price_float Price_string


Make column
100.00 100.00 NULL
“Three” NULL “Three”

Create new columns – one for each type


Value for a row is stored in one of the columns and NULL in
other columns

Copyright © 2023 ChandraMohan Lingam. All Rights Reserved.


DynamicFrame and Spark DataFrame

toDF
Glue DynamicFrame Spark DataFrame

fromDF
Glue DynamicFrame Spark DataFrame

Copyright © 2023 ChandraMohan Lingam. All Rights Reserved.


Glue ETL Bookmark

Track processed data

Persist state between job runs

Prevent reprocessing of old data and process


only new data

Copyright © 2023 ChandraMohan Lingam. All Rights Reserved.


Bookmarks – Identifying New Data

S3 Data Sources
• Last modified time of Object

JDBC Data Sources


• Primary key of a table is used as Bookmark key
(default)
• Optional: Specify columns to use as Bookmark key
• Bookmark key needs to increase or decrease over time

Copyright © 2023 ChandraMohan Lingam. All Rights Reserved.


Bookmark Configuration

State Description
Enabled Glue persists state information. The initial job execution
handles all data, while subsequent runs process only new
data
Disabled Job processes the entire dataset during every job run
Paused Job processes incremental data since the last successful
run. However, new bookmark state is not updated

If your job is skipping data, verify bookmark configuration and reset them if
necessary

Copyright © 2023 ChandraMohan Lingam. All Rights Reserved.


Transformation Context

Transformation Context - Optional string parameter in Glue


DynamicFrame methods

If specified - persist state for that operation

If not specified – state is not tracked for that operation

Copyright © 2023 ChandraMohan Lingam. All Rights Reserved.


Glue Job Types

Job Purpose
Spark Jobs ETL Jobs run on serverless Spark Cluster. Suitable for
Batch processing
Streaming ETL ETL Jobs for continuous processing of data from sources
like Kinesis and Kafka. Load data to a data lake or
database. Process and deliver data in minutes
Python Shell Jobs For ETL Jobs that do not require Apache Spark
Ray Ray.io is a new framework for Machine Learning workflows

Copyright © 2023 ChandraMohan Lingam. All Rights Reserved.


Summary

Technology Purpose
Python DataFrame Small to medium datasets
(Pandas) Single machine processing
Apache Spark DataFrame Distributed architecture and in-memory
processing
Very large datasets
Glue DynamicFrame Optimized for ETL workflows
Bookmarking support
Use Glue DynamicFrame for AWS integration touchpoints and Spark
DataFrame for data processing and transformation

Pure Spark code is portable to other environments

Copyright © 2023 ChandraMohan Lingam. All Rights Reserved.


Lab – Iris CSV to Parquet

• Lower storage and ETL Job


query costs
Read Write
• Improve query CSV Parquet
performance
• Visual Editor to Glue Catalog Glue Catalog
build Glue ETL job
• Serverless Apache S3 S3
Spark
CSV Source Parquet Destination

Copyright © 2023 ChandraMohan Lingam. All Rights Reserved.


University Ranking Dataset - Objectives

Use Data Lake for Querying

Answer Common Queries


• Top Universities in the World
• Regional Ranking
• Differences between Public and Private Universities
• Number of Universities

Copyright © 2023 ChandraMohan Lingam. All Rights Reserved.


What you will learn

▪ Athena
▪ Lazy CSV SerDe
▪ Open CSV SerDe
▪ Handle Data Quality Issues
▪ SQL – Extract, Load, Transform
▪ Apache Spark – Extract, Transform, Load
▪ Views to Simplify Querying
▪ Visualize with Amazon QuickSight

Copyright © 2023 ChandraMohan Lingam. All Rights Reserved.


Copyright © 2023 ChandraMohan Lingam. All Rights Reserved.
Copyright © 2023 ChandraMohan Lingam. All Rights Reserved.
Lab – University Ranking (Lazy SerDe)

• Glue crawler to catalog University Rankings dataset


• Query with Athena

Crawler
QuickSight Athena Glue Catalog

N. Virginia S3

Copyright © 2023 ChandraMohan Lingam. All Rights Reserved.


Lab – University Ranking (Open CSV SerDe)

• Ensure data in CSV is correctly mapped to columns in


the table
• Handle double quotes and embedded comma

Create
Table
Athena Glue Catalog

S3

Copyright © 2023 ChandraMohan Lingam. All Rights Reserved.


Copyright © 2023 ChandraMohan Lingam. All Rights Reserved.
Copyright © 2023 ChandraMohan Lingam. All Rights Reserved.
Lab – Data Cleanup

• Convert from String type to correct data type


• Handle missing value and error conditions

SQL
Create
Table
Athena Glue Catalog

S3

Copyright © 2023 ChandraMohan Lingam. All Rights Reserved.


Why Correct Data Type is Important

String type and Numeric comparison – Don’t work!

Numeric comparison
1005 is greater than 5

String comparison
“1005” is less than “5”

Copyright © 2023 ChandraMohan Lingam. All Rights Reserved.


Function Recap

SPLIT to handle ranks with range of values


(like 401-410)

CAST for data type conversion

TRY for handling exceptions

COALESCE to replace NULL values

Copyright © 2023 ChandraMohan Lingam. All Rights Reserved.


Lab – Simplify Querying with Views

• Hide complexity of Query with Views

Create
SQL Athena
View

Create
Table
Glue Catalog

S3

Copyright © 2023 ChandraMohan Lingam. All Rights Reserved.


Explore Data

• Top-N universities in the World


• Regional Ranking
• Country-specific Ranking
• Distribution

View hid the complex data cleanup logic and


allowed us to explore data

Copyright © 2023 ChandraMohan Lingam. All Rights Reserved.


Benefits of a Data Lake

• Take source data as-is and enable SQL


querying
• Quickly build prototypes and proof-of-
concepts
• Extract, Load and Transform - Cleanup at
query time
• Query time cleanup can be inefficient

Copyright © 2023 ChandraMohan Lingam. All Rights Reserved.


Lab – Apache Spark ETL
• Query time cleanup is inefficient
• Optimize Solution – Clean up first and then load to a
data lake
• Improve performance and lower querying costs

Extract Transform Load

Apache Spark Serverless – Glue

Copyright © 2023 ChandraMohan Lingam. All Rights Reserved.


Object Oriented Interaction

Copyright © 2023 ChandraMohan Lingam. All Rights Reserved.


SQL-based Interaction

Copyright © 2023 ChandraMohan Lingam. All Rights Reserved.


Lab – QuickSight Business Intelligence

• Visually analyze data using plots, pivot tables


• No-code

QuickSight Athena

Table

Glue Catalog

S3

Copyright © 2023 ChandraMohan Lingam. All Rights Reserved.


QuickSight Analysis

• Highest Rank
• University Distribution
• Top-N universities in the World
• Regional Ranking

Copyright © 2023 ChandraMohan Lingam. All Rights Reserved.


Lab – QuickSight Business Intelligence

• Visually analyze data using plots, pivot tables


• No-code

QuickSight Athena
N. Virginia

Table

Glue Catalog
N. Virginia S3

Copyright © 2023 ChandraMohan Lingam. All Rights Reserved.


Copyright © 2023 ChandraMohan Lingam. All Rights Reserved.
Instructor, Course Developer

7X AWS Certified

For a list of courses, visit


https://www.computewithcloud.com/

Connect with me on LinkedIn


https://www.linkedin.com/in/chandralingam/

Chandra Lingam
100K+ Students

Copyright © 2023 ChandraMohan Lingam. All Rights Reserved.

You might also like