Print Notes
AWS services referenced: DMS, Lambda, SNS, IAM
1. Components of a Website
Client: The user’s device or browser.
Network: Pathway connecting client and server.
Server: Provides requested content to the client.
2. Components of a Server
Compute (CPU): Handles processing tasks.
Storage: Hard drives, SSDs, or databases.
Memory (RAM): Temporary storage for fast processing.
Network: Routers, switches, DNS for connectivity.
On-Premises Infrastructure (Traditional Approach)
Disadvantages
Limited Scaling: Expensive and slow to add capacity.
High Costs: Major upfront investment and ongoing maintenance.
Maintenance Requirements: Continuous hardware and software upkeep.
Disaster Risk: Susceptible to data loss.
Benefits of Cloud Computing
1. Managed Services: Cloud provider handles infrastructure management.
2. Data Security: Built-in protection and compliance support.
3. On-Demand Delivery: Scale resources based on real-time needs.
4. Geographic Redundancy: Data backed up across multiple locations.
Cloud Computing Basics
Definition: On-demand access to compute, storage, and networking resources.
Pricing: Pay-as-you-go; only pay for what you use.
Availability: Instantly available resources without long-term contracts.
Cloud Deployment Models
1. Public Cloud: Resources shared across multiple users (AWS, Azure, GCP).
2. Private Cloud: Dedicated to one organization, better for sensitive data.
3. Hybrid Cloud: Mix of public and private, often for secure and scalable needs.
4. Multi-Cloud: Use of multiple public clouds to avoid vendor lock-in.
Types of Cloud Services
1. IaaS (Infrastructure as a Service): Virtualized compute, storage, and networking (e.g., Amazon EC2).
2. PaaS (Platform as a Service): Managed platforms for building and deploying applications (e.g., AWS Elastic Beanstalk).
3. SaaS (Software as a Service): Ready-to-use applications delivered over the internet (e.g., Gmail).
ETL Basics
1. Definition
ETL (Extract, Transform, Load) is a data integration process that moves data from various sources to a target database or data warehouse, ensuring it is cleaned, transformed, and ready for analysis.
2. Components of ETL
A. Extract
o Data Extraction: Pulls data from source systems such as databases, files, APIs, or data lakes.
o Common sources include Salesforce, Data Lakes, and Data Warehouses.
o Change Data Capture (CDC): Captures only modified data to improve efficiency and minimize
load times.
B. Transform
o Data Cleaning: Identifies and corrects data errors and inconsistencies.
o Data Transformation: Converts and restructures data to be compatible with the target
database.
o Data Enrichment: Adds extra information or attributes to enhance data quality.
C. Load
o Data Staging: Stores transformed data temporarily before loading it into the target system.
o Data Loading: Inserts or updates data in the target database; often uses UPSERT for updating
existing records.
o Error Handling: Tracks and manages errors during the loading process.
3. ETL Process Flow
A. Extraction Phase
o Connect to Source Systems: Uses connections like JDBC/ODBC to access source data.
o Data Selection: Defines rules to select relevant data for extraction.
B. Transformation Phase
o Data Mapping: Maps data fields from source to target structure.
o Data Cleansing: Detects and corrects data quality issues.
o Data Validation: Ensures transformed data quality using frameworks like ABC (Audit-Balance-Control) validation.
o Aggregation: Combines and summarizes data for reporting.
C. Loading Phase
o Data Staging: Holds transformed data temporarily before final loading.
o Bulk Loading: Efficiently loads large volumes of data.
o Indexing: Optimizes data retrieval in the target database.
o Post-Load Verification: Confirms successful data loading and integrity.
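A minimal PySpark sketch of the extract-transform-load flow described above; the S3 paths, column names, and cleaning rules are hypothetical placeholders, not part of a specific project:
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("etl_sketch").getOrCreate()

# Extract: read raw source files (hypothetical path)
raw_df = spark.read.option("header", True).csv("s3://raw-bucket/sales/")

# Transform: clean, cast, and aggregate
clean_df = (raw_df
            .dropDuplicates()
            .filter(F.col("amount").isNotNull())
            .withColumn("amount", F.col("amount").cast("double")))
daily_sales = clean_df.groupBy("order_date").agg(F.sum("amount").alias("total_amount"))

# Load: write to a curated/staging location, then verify row counts post-load
daily_sales.write.mode("overwrite").parquet("s3://curated-bucket/daily_sales/")
loaded = spark.read.parquet("s3://curated-bucket/daily_sales/")
assert loaded.count() == daily_sales.count()  # simple post-load verification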
Popular ETL Tools
Apache NiFi, Talend, Informatica, Microsoft SSIS (SQL Server Integration Services), Apache Spark; cloud-based services: AWS Glue, AWS EMR, GCP Dataproc.
Other Data Processing Architectures
ELT (Extract, Load, Transform): For modern data warehouses; data is loaded first and then
transformed within the target system.
EtLT (Extract, Transform Lite, Load, Transform): Data is extracted, partially transformed, loaded, and
then fully transformed in the target warehouse.
Key Points for Interview Preparation
Understand ETL vs ELT: Be able to explain the difference and when to use each based on the data
architecture and tools.
Explain Key Transformations: Focus on how data cleaning, validation, and enrichment are handled,
especially in complex ETL flows.
ETL Tool Expertise: Be prepared to discuss your experience with specific tools like AWS Glue, Talend,
or Apache Spark.
Error Handling and Logging: Familiarity with handling, logging, and monitoring errors during ETL to
maintain data integrity.
Optimization Techniques: Discuss ways to optimize data retrieval and storage, like indexing and
partitioning.
AWS Glue Cheat Sheet
AWS Glue is a serverless data integration and ETL (Extract, Transform, Load) service designed for
discovering, preparing, and combining data from various sources.
Primarily used for data analysis, machine learning, and application development.
Key Features
1. Managed ETL Service:
o Serverless infrastructure—AWS handles most of the provisioning.
2. Data Catalog:
o Centralized metadata repository for storing information about data sources,
transformations, and target structures.
o Stores information about databases and tables logically for easy reference.
3. ETL Jobs:
o Automates the extraction, transformation, and loading of data between sources and targets.
4. Crawlers:
o Automatically discover schemas in data sources and create tables in the Data Catalog.
5. Triggers:
o Allows ETL jobs to be scheduled based on time or events (on-demand, scheduled, or event-driven).
6. Connections:
o Manages connections for database access used in ETL workflows.
7. Workflow:
o Orchestrates and automates ETL workflows by sequencing jobs, triggers, and crawlers.
8. Serverless Architecture:
o No need to manage infrastructure, facilitating scalable and reliable operations.
9. Scalability and Reliability:
o Built-in fault tolerance and retries; provides data durability for critical tasks.
10. Development Endpoints:
Supports interactive development, debugging, and testing via Zeppelin or Jupyter Notebooks.
Detailed Components
A. Glue Crawler
Data Sources: Supports S3, JDBC, DynamoDB, Redshift, RDS, MongoDB.
Crawling Options:
o Can crawl all sub-folders, new sub-folders only, or based on events.
Custom Classifiers: Defines schema for non-standard file formats (e.g., JSON, CSV).
o Supports grok, XML, JSON, CSV.
IAM Roles: Access control for Glue to interact with S3 and other services.
Scheduling: Can be set to run on-demand, time-based, or custom schedules.
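A minimal boto3 sketch of creating and starting a crawler like the one described above; the crawler name, IAM role, database, S3 path, and schedule are hypothetical values:
import boto3

glue = boto3.client("glue", region_name="us-east-1")

# Create a crawler over a hypothetical S3 prefix, writing tables to a catalog database
glue.create_crawler(
    Name="orders_crawler",
    Role="GlueServiceRole",  # IAM role with S3 and Glue access
    DatabaseName="raw_db",
    Targets={"S3Targets": [{"Path": "s3://raw-bucket/orders/"}]},
    Schedule="cron(0 2 * * ? *)",  # optional time-based schedule
)

glue.start_crawler(Name="orders_crawler")  # on-demand run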
B. Glue ETL Job
Job Creation Methods:
o Visual: Uses sources, transformations, and targets in a visual interface.
o Spark Script Editor: Provides the flexibility to code complex transformations.
Key Glue Concepts:
o GlueContext: Acts as a wrapper for Glue’s functionality.
o Dynamic Frame: Specialized data frame for Glue, supports easy transformation.
o Transformation functions: e.g., create_dynamic_frame, write_dynamic_frame.
Spark Script Example:
o Convert between dynamic frames and Spark DataFrames to leverage Spark transformations.
o Example functions:
spark_df = dynamic_frame.toDF() # Dynamic Frame to Spark DataFrame
dynamic_frame = DynamicFrame.fromDF(spark_df, glueContext, "dynamic_frame") # Spark DataFrame to Dynamic Frame
Important Note: Always place transformation code between job.init() and job.commit().
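A skeletal Glue Spark job illustrating the points above (GlueContext, dynamic frames, and keeping transformation code between job.init() and job.commit()); the catalog database, table, and output path are hypothetical:
import sys
from awsglue.utils import getResolvedOptions
from awsglue.context import GlueContext
from awsglue.job import Job
from awsglue.dynamicframe import DynamicFrame
from pyspark.context import SparkContext

args = getResolvedOptions(sys.argv, ["JOB_NAME"])
glueContext = GlueContext(SparkContext.getOrCreate())
job = Job(glueContext)
job.init(args["JOB_NAME"], args)

# Extract: read a catalog table into a dynamic frame (hypothetical database/table)
dyf = glueContext.create_dynamic_frame.from_catalog(database="raw_db", table_name="orders")

# Transform: switch to a Spark DataFrame for complex logic, then back to a dynamic frame
df = dyf.toDF().dropDuplicates()
dyf_out = DynamicFrame.fromDF(df, glueContext, "dyf_out")

# Load: write the result to S3 as Parquet
glueContext.write_dynamic_frame.from_options(
    frame=dyf_out,
    connection_type="s3",
    connection_options={"path": "s3://curated-bucket/orders/"},
    format="parquet",
)

job.commit()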
C. Glue Triggers
Trigger Types:
o Scheduled: Recurring jobs.
o Event-Driven: Triggers based on ETL job or Crawler events.
o On-Demand: Manual execution.
o EventBridge: Event-based automation.
Conditional Logic: Allows ALL or ANY conditions for resource triggers.
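A small boto3 sketch of a scheduled trigger and a conditional (event-driven) trigger; the job, crawler, and trigger names are hypothetical:
import boto3

glue = boto3.client("glue")

# Scheduled trigger: run a job every night at 01:00 UTC
glue.create_trigger(
    Name="nightly_orders_trigger",
    Type="SCHEDULED",
    Schedule="cron(0 1 * * ? *)",
    Actions=[{"JobName": "orders_etl_job"}],
    StartOnCreation=True,
)

# Conditional trigger: start a job only when a crawler run succeeds
glue.create_trigger(
    Name="after_crawler_trigger",
    Type="CONDITIONAL",
    Predicate={
        "Logical": "ANY",
        "Conditions": [{
            "LogicalOperator": "EQUALS",
            "CrawlerName": "orders_crawler",
            "CrawlState": "SUCCEEDED",
        }],
    },
    Actions=[{"JobName": "orders_etl_job"}],
    StartOnCreation=True,
)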
D. Glue Workflows
Automates ETL workflows with dependencies and execution order.
Components:
o Triggers: Define conditions.
o Nodes: Represent jobs or crawlers.
Workflow Blueprints: Predefined templates for common workflows.
Interview Preparation Tips
Understand Glue ETL Components: Be ready to explain the roles of Crawlers, Triggers, Workflows, and
the Data Catalog.
Explain Dynamic Frames: Highlight the differences between Spark DataFrames and Glue Dynamic
Frames, particularly for data integration.
Error Handling and Scalability: Be prepared to discuss how Glue manages retries, fault tolerance, and
large-scale data operations.
Serverless Benefits: Emphasize cost savings, scalability, and reduced infrastructure management.
Data Transformation Knowledge: Be familiar with transformation functions in Glue, such as schema
changes, joins, filtering, and aggregation.
Glue Use Cases: Describe scenarios for using Glue in real-world applications like data lake
management, machine learning pipeline prep, or reporting data integration.
AWS Elastic MapReduce (EMR)
Amazon EMR is a managed cluster platform used for big data processing and analytics, providing a
scalable solution for running frameworks like Apache Spark, Hadoop, Flink, and more.
1. EMR Architecture and Provisioning
A) Application Bundle:
o Preconfigured application sets provided by EMR for different big data frameworks.
o Examples: Spark, Core Hadoop, Flink, or Custom configurations.
B) AWS Glue Data Catalog Integration:
o Allows EMR to use AWS Glue as an external metastore for shared metadata management
across services.
C) Operating System Options:
o Custom AMIs can be configured for tailored OS requirements.
D) Cluster Configuration:
1. Instance Groups:
Each node group (primary, core, task) uses a single instance type.
2. Instance Fleets:
Allows up to 5 instance types per node group for flexible, cost-effective scaling.
E) Cluster Scaling and Provisioning:
o Manual Cluster Size: Fixed number of nodes specified by the user.
o EMR Managed Scaling: Sets min/max node limits, scaling handled by EMR.
o Custom Automatic Scaling: Min/max limits with user-defined scaling rules.
F) Steps:
o Defined jobs that can be submitted to the EMR cluster, such as ETL, analysis, or ML jobs.
G) Cluster Termination:
o Manual: Terminate through console or CLI.
o Automatic Termination: After a specified idle period (recommended).
o Termination Protection: Prevents accidental termination; needs to be disabled for manual
termination if enabled.
H) Bootstrap Actions:
o Custom scripts run at cluster startup for installing dependencies or configuration.
I) Cluster Logs:
o Configure an S3 location for storing EMR logs for monitoring and debugging.
J) IAM Roles:
o Service Role: EMR actions for cluster provisioning and management.
o EC2 Instance Profile: Access for EC2 nodes to AWS resources.
o Custom Scaling Role: For custom auto-scaling.
K) EBS Root Volume:
o Add extra storage for applications needing high-capacity storage on EMR nodes.
2. Submitting Applications to the Cluster
AWS Management Console: Submit steps or use EMR Notebooks for interactive analysis.
AWS CLI: Commands for cluster creation, step addition, termination, etc.
AWS SDK (e.g., boto3 for Python): API calls to interact with EMR programmatically.
command-runner.jar:
o Utility for executing custom commands and scripts on EMR, enabling task automation during
cluster initialization.
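A boto3 sketch of submitting a Spark step to an existing cluster via command-runner.jar; the cluster ID and script path are hypothetical:
import boto3

emr = boto3.client("emr", region_name="us-east-1")

response = emr.add_job_flow_steps(
    JobFlowId="j-XXXXXXXXXXXXX",  # hypothetical cluster ID
    Steps=[{
        "Name": "spark_etl_step",
        "ActionOnFailure": "CONTINUE",
        "HadoopJarStep": {
            "Jar": "command-runner.jar",
            "Args": ["spark-submit", "--deploy-mode", "cluster",
                     "s3://scripts-bucket/etl_job.py"],
        },
    }],
)
print(response["StepIds"])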
3. EMR Serverless
EMR Serverless: Abstracts infrastructure management, allowing focus on data processing and
analytics without server provisioning.
Simplifies big data workloads by automatically scaling resources as required.
4. EMR CLI Commands
create-cluster: Creates a new EMR cluster.
terminate-clusters: Terminates one or more running clusters.
add-steps: Adds one or more steps (jobs) to an existing cluster.
list-clusters: Lists active clusters and their details.
5. boto3 (AWS SDK for Python)
Overview: SDK for automating AWS service interactions through Python.
boto3 Client: Manages AWS resources using programmatically accessed credentials, useful for scripts
to control EMR and other AWS resources.
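A minimal boto3 client sketch mirroring the CLI commands above (listing clusters and terminating one); the cluster ID is hypothetical:
import boto3

emr = boto3.client("emr")

# Equivalent of list-clusters, filtered to active clusters
clusters = emr.list_clusters(ClusterStates=["RUNNING", "WAITING"])
for c in clusters["Clusters"]:
    print(c["Id"], c["Name"], c["Status"]["State"])

# Equivalent of terminate-clusters for a single (hypothetical) cluster ID
emr.terminate_job_flows(JobFlowIds=["j-XXXXXXXXXXXXX"])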
Interview Preparation Tips
Cluster Configuration Knowledge: Understand instance groups vs. fleets, scaling options, and
scenarios for each.
EMR Serverless: Describe how serverless reduces complexity by managing infrastructure
automatically.
Glue Data Catalog Integration: Explain how the Glue Data Catalog enhances metadata management in
a multi-service ecosystem.
Bootstrap and command-runner.jar: Highlight how these can install dependencies and run custom
scripts on cluster startup.
Use Cases for Managed Scaling vs. Custom Scaling: Discuss the cost-saving benefits and when custom
rules are preferable.
IAM Role Setup: Detail the roles needed to provision, access other AWS services, and control scaling.
Logging and Monitoring: Know how to set up logs in S3 for debugging and cost-monitoring purposes.
Amazon Redshift Overview:
A petabyte-scale data warehouse service designed for analyzing large amounts of data, emphasizing fast query performance via parallel query execution. Primarily used for DRL (data retrieval, i.e., SELECT) queries. Redshift mainly serves downstream teams such as data analysts and data scientists, who query the history tables; incremental data is first loaded into staging tables and then merged into the history tables (an upsert/merge operation based on business logic).
Default Redshift port: 5439
Architecture:
Cluster Nodes: Includes Leader Node and Compute Nodes (RA3, DC2, DS2) with a columnar storage
model for efficient data storage and retrieval.
Node Slices: Divides compute nodes for parallel data processing.
Redshift Managed Storage (RMS): Employs columnar storage and compression techniques.
Performance Features:
Massively parallel processing (MPP), columnar data storage, data compression, and query optimizer
with caching to enhance query speed.
Data Security:
Utilizes SSL/TLS for encryption in transit and AES-256 for encryption at rest. Options for both server-
side and client-side encryption are available.
Serverless Components (Workgroups and Namespaces):
Workgroup: Group of compute resources (RPUs; 1 RPU = 2 virtual CPUs and 16 GB of RAM) for processing.
Namespace: Organizes storage resources like schemas, tables, and encryption keys.
Data Loading and Query Execution (ways to load data):
COPY Command: Loads data from sources like S3 into Redshift managed storage, with customizable options for file format, compression, encoding (UTF8 by default), and error handling. Latency is lower because the data is physically attached to the cluster.
Redshift Spectrum: Queries external tables on S3 directly, integrating data without moving it into Redshift. Latency is higher because the data is network-attached.
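A sketch of issuing a COPY from S3 through the Redshift Data API with boto3; the cluster, database, user, table, S3 path, and IAM role are hypothetical values:
import boto3

rsd = boto3.client("redshift-data")

copy_sql = """
    COPY analytics.daily_sales
    FROM 's3://curated-bucket/daily_sales/'
    IAM_ROLE 'arn:aws:iam::123456789012:role/RedshiftCopyRole'
    FORMAT AS PARQUET;
"""

resp = rsd.execute_statement(
    ClusterIdentifier="hms-redshift-cluster",  # hypothetical cluster
    Database="analytics",
    DbUser="etl_user",
    Sql=copy_sql,
)
print(rsd.describe_statement(Id=resp["Id"])["Status"])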
Performance Optimization:
Use appropriate distribution styles (AUTO, EVEN, ALL, or KEY) and sort keys (COMPOUND or
INTERLEAVED) for efficient query performance.
Workload Management (WLM) and query caching help manage and prioritize queries in multi-user
environments.
Vacuum and Analyze Commands: Vacuum removes unused space and reorganizes data; Analyze
updates statistics, improving query plans.
Best Practices:
Use sort keys and distribution styles effectively, compress data files, use staging tables for merging
data, and leverage the COPY command with multiple files and bulk inserts.
Optimize data loading with sequential blocks and sort key alignment for faster performance.
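A sketch of the staging-table merge (upsert) pattern mentioned above, run through the Redshift Data API; the table names and join key are hypothetical:
import boto3

rsd = boto3.client("redshift-data")

# Delete matching rows from the history table, then insert the new batch.
# batch_execute_statement runs the statements within a single transaction.
rsd.batch_execute_statement(
    ClusterIdentifier="hms-redshift-cluster",  # hypothetical cluster
    Database="analytics",
    DbUser="etl_user",
    Sqls=[
        "DELETE FROM analytics.patients_history USING analytics.patients_staging s "
        "WHERE patients_history.patient_id = s.patient_id",
        "INSERT INTO analytics.patients_history SELECT * FROM analytics.patients_staging",
    ],
)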
End-to-End Automated Hospital Management System (HMS) Data Pipeline for AWS Data Engineer with
Experience
This project involves building an automated HMS data pipeline on AWS that handles the ingestion,
transformation, and loading of hospital data. The pipeline will use AWS services such as S3, Glue, Athena,
Redshift, Lambda, Airflow, and SNS for orchestration, monitoring, and notifications. As a 4-year experienced
AWS data engineer, you will focus on optimizing each step and ensuring the pipeline is robust, scalable, and
efficient.
The Glue transformation step creates a Spark session, e.g. spark = SparkSession.builder.appName('HMS_Transformation').getOrCreate() (see the sketch after the project summary below).
Project Summary:
1. Data Ingestion: Use Lambda to upload patient data from Oracle/SAP to S3.
2. Data Transformation: Use AWS Glue (PySpark) to clean and standardize data.
3. Data Querying: Use Athena for SQL-based queries on transformed data in S3.
4. Data Loading: Load the transformed data into Redshift using Glue and JDBC.
5. Orchestration: Manage the entire pipeline using Airflow, including task dependencies and scheduling.
6. Automation & Monitoring: Automate tasks with Lambda triggers and use SNS for notifications.
7. Security: Ensure data security with IAM roles and KMS encryption.
8. Visualization: Create interactive dashboards in QuickSight for hospital management insights.
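A minimal PySpark sketch of the Glue transformation step (step 2 above); the patient columns and S3 paths are hypothetical placeholders:
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName('HMS_Transformation').getOrCreate()

# Read raw patient records landed in S3 by the ingestion step (hypothetical path)
patients = spark.read.option("header", True).csv("s3://hms-raw/patients/")

# Clean and standardize: drop duplicates, normalize names, cast admission dates
patients_clean = (patients
                  .dropDuplicates(["patient_id"])
                  .withColumn("patient_name", F.initcap(F.trim(F.col("patient_name"))))
                  .withColumn("admission_date", F.to_date("admission_date", "yyyy-MM-dd")))

# Write curated output, partitioned by admission date, for Athena/Redshift consumption
patients_clean.write.mode("overwrite").partitionBy("admission_date").parquet("s3://hms-curated/patients/")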
Practical Considerations:
Performance Optimization: Regularly optimize Glue and EMR jobs to handle large volumes of data.
Cost Optimization: Leverage S3 lifecycle policies to archive data and reduce storage costs.
Scalability: Ensure that the pipeline can scale with growing data by partitioning datasets and
optimizing Redshift and Glue configurations.
This end-to-end automated pipeline streamlines the processing of hospital management data, making it
accessible for reporting and decision-making in real-time while ensuring data consistency, security, and
performance at scale.
PySpark Basics
PySpark is the Python API for Apache Spark, enabling Python developers to interact with Spark for big data
processing.
Key Concepts
1. SparkContext (sc):
o The entry point to Spark functionality. It coordinates the execution of Spark jobs.
o Example: sc = SparkContext("local", "AppName")
2. RDD (Resilient Distributed Dataset):
o A fundamental data structure of Spark that is immutable and distributed.
o Operations: map(), filter(), reduce(), collect(), count(), take().
o Transformation: Lazy evaluation (i.e., the computation doesn’t happen until an action is
triggered).
o Example: rdd = sc.parallelize([1, 2, 3])
3. DataFrame:
o A distributed collection of data organized into named columns (similar to a table in a
relational database).
o More optimized than RDDs and easier to work with for most data engineering tasks.
o Example: df = spark.read.csv("s3://path/to/data.csv")
4. Dataset:
o A distributed collection of data, similar to DataFrame, but strongly typed and provides
compile-time type safety.
5. SparkSession:
o The unified entry point to read data, process it, and perform other tasks.
o Example: spark = SparkSession.builder.appName("MyApp").getOrCreate()
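A short self-contained sketch tying the concepts above together (SparkSession, SparkContext, RDD, DataFrame); the CSV path is a hypothetical placeholder:
from pyspark.sql import SparkSession

# SparkSession is the unified entry point; it exposes the SparkContext as spark.sparkContext
spark = SparkSession.builder.appName("MyApp").getOrCreate()
sc = spark.sparkContext

# RDD: low-level, immutable, distributed collection with lazy transformations
rdd = sc.parallelize([1, 2, 3, 4])
squares = rdd.map(lambda x: x * x)  # transformation (lazy)
print(squares.collect())            # action -> [1, 4, 9, 16]

# DataFrame: named columns, optimized by Spark's query planner
df = spark.read.option("header", True).csv("s3://path/to/data.csv")  # hypothetical path
df.printSchema()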
PySpark Architecture
1. Driver Program:
o The main entry point of Spark. It runs the user’s Spark job, distributes tasks to worker nodes,
and handles the results.
o It’s where the SparkSession is created.
2. Cluster Manager:
o Manages the resources in the cluster. Examples: YARN, Mesos, or Kubernetes.
o Decides how to distribute the tasks across worker nodes.
3. Executor:
o Runs the individual tasks and stores the data in memory or disk as part of the computation.
o Multiple executors are distributed across the cluster.
4. Worker Nodes:
o These are the machines that execute the actual tasks.
o Each worker runs an executor that manages tasks for one job.
5. Task Scheduler:
o Responsible for scheduling tasks on executors.
o Tasks are broken down into smaller stages (e.g., map, reduce) and scheduled.
How PySpark Works
Initialization: A SparkSession is created to start a PySpark job. This is the entry point for interacting with Spark.
1. Data Loading:
o Data can be loaded from various sources, e.g., S3, Redshift, RDS, or even a local file system.
Spark can handle large-scale datasets in parquet, csv, json, etc.
o Example: df = spark.read.parquet("s3://my-bucket/data")
2. Transformations:
o Transformations are lazily evaluated, meaning Spark won’t execute them until an action is
triggered. Common transformations include map(), filter(), join(), and groupBy().
o These operations are applied in parallel across the distributed dataset.
3. Actions:
o Actions such as collect(), show(), and count() trigger the execution of the transformations
applied to the dataset.
4. Execution Plan:
o Spark constructs a DAG (Directed Acyclic Graph) of stages based on the transformations
applied.
o The DAG Scheduler breaks this plan into smaller stages and assigns them to workers for
execution.
5. Results:
o The results are either stored in a data store like Redshift, S3, or returned to the driver if the
data is small enough.
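A small sketch of the lazy-evaluation flow described above: transformations only build up the plan, and execution starts when an action runs. The input path and column names are hypothetical:
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("lazy_demo").getOrCreate()

df = spark.read.parquet("s3://my-bucket/data")  # hypothetical dataset

# Transformations only: Spark records them in the logical plan, no job runs yet
filtered = df.filter(F.col("status") == "ACTIVE")
summary = filtered.groupBy("region").agg(F.count("*").alias("active_count"))

summary.explain()  # inspect the physical plan Spark built from the DAG
summary.show()     # action: triggers execution across the cluster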
Conclusion
PySpark is a powerful tool for data engineering, especially in distributed environments like AWS.
Key components like S3, EMR, and Glue make it easy to process, store, and move large-scale data.
The combination of RDDs, DataFrames, and DAG execution provides scalability and efficiency for large
datasets.
In PySpark, both cache() and persist() are used to store intermediate results in memory or on disk to speed up
operations that reuse the same dataset. They help avoid recomputation, particularly in iterative or repetitive
processes. While similar, they have some important differences in terms of flexibility and control over storage
levels.
1. cache()
Definition: cache() is a shorthand method that stores the data in memory only by default.
Storage Level: By default, cache() uses MEMORY_ONLY, meaning it keeps the dataset in memory
without any backup on disk.
Use Case: Use cache() when the dataset can fit in memory and you don’t need custom storage levels.
Example:
df = spark.read.csv("s3://my-bucket/data.csv")
df.cache() # Caches the DataFrame in memory only
df.count() # Trigger an action to load data into memory
Pros: Quick and simple for storing datasets that fit entirely in memory.
Cons: Limited to memory; if the dataset is too large, some data may be evicted, and recomputation
will be required.
2. persist()
Definition: persist() provides control over storage levels, allowing the dataset to be stored in memory,
on disk, or a combination.
Storage Levels: Offers various options beyond MEMORY_ONLY, such as:
o MEMORY_ONLY: Stores data in memory; if it doesn’t fit, some data is recomputed.
o MEMORY_AND_DISK: Stores data in memory; spills to disk if there’s insufficient memory.
o DISK_ONLY: Stores data only on disk.
o MEMORY_ONLY_SER (serialized): Stores data in a serialized format to save memory.
o MEMORY_AND_DISK_SER (serialized): Serializes data, storing it in memory or spilling it to
disk as needed.
Use Case: Use persist() when you need specific storage control or have a dataset too large for memory
alone.
Example:
from pyspark import StorageLevel

df = spark.read.csv("s3://my-bucket/data.csv")
df.persist(StorageLevel.MEMORY_AND_DISK) # Custom storage level
df.count() # Trigger an action to load data into cache or persist
Pros: Greater flexibility with storage level options; can handle larger datasets by spilling to disk.
Cons: A bit more complex and may require more memory/disk tuning.
Key Differences Summary
Feature        | cache()                   | persist()
Storage Level  | MEMORY_ONLY (default)     | Customizable (e.g., MEMORY_AND_DISK)
Memory Usage   | Only in memory            | Memory and/or disk
Flexibility    | Less flexible             | More flexible
Use Case       | Simple caching in memory  | Large datasets, custom storage levels
When to Use Which
Use cache() for datasets that easily fit in memory and don’t need disk backup.
Use persist() if the dataset is too large for memory or requires specific storage handling, as in long-
running jobs or iterative processes.
Both methods significantly improve performance by avoiding recomputation, and choosing between them
depends on the dataset size and resource constraints.