
Elite Data Engineering Program Curriculum

The Elite Data Engineering Program, led by Sumit Mittal, offers a 21-week curriculum covering big data concepts, distributed storage, and distributed processing with Apache Spark. Key topics include data pipelines, Spark SQL, performance tuning, and a PySpark project implementation. The program also covers Git and CI/CD practices, Apache Hive, data modeling, Spark Structured Streaming, Kafka, and AWS cloud services, and closes with interview preparation.

Uploaded by

connect.shivam29

Elite Data Engineering Program

YOUR PATH TO DATA ENGINEERING SUCCESS

Newly Launched Elite Program

By Sumit Mittal
CURRICULUM
WEEK 1 : BIG DATA - THE BIG PICTURE

> INTRODUCTION TO BIG DATA


> COMPARISON BETWEEN MONOLITHIC AND DISTRIBUTED SYSTEMS.
> HADOOP: EVOLUTION, OVERVIEW AND CORE COMPONENTS
> CHALLENGES WITH HADOOP
> COMPARISON BETWEEN ON-PREMISE AND CLOUD
> ADVANTAGES OF CLOUD
> TYPES OF CLOUD
> INTRODUCTION TO APACHE SPARK
> DATABASE VS DATA WAREHOUSE VS DATA LAKE
> INTRODUCTION TO DATABASE
> INTRODUCTION TO DATA WAREHOUSE
> DATA ENGINEERING FLOW
> DATA PIPELINE ON HADOOP
> DATA PIPELINE WORKFLOW VISUALIZATION FOR ON-PREMISE
> DATA PIPELINE WORKFLOW VISUALIZATION FOR CLOUD
> CATEGORIES OF COMPUTATION
> SERVERLESS COMPUTING
> SERVERFUL COMPUTING
> HDFS ARCHITECTURE
> ROLE OF DATA ENGINEERS
> TRADITIONAL WAYS OF PROCESSING DATA AND ITS CHALLENGES

WEEK 2 : DISTRIBUTED STORAGE FUNDAMENTALS
> HDFS OVERVIEW
- READING FILE FROM HDFS
- BLOCK SIZE IN HDFS
- NAME NODE FEDERATION
- RACK MECHANISM
- FAULT TOLERANCE
> INTRODUCTION TO PRACTICE LAB
> LINUX COMMANDS
- ABSOLUTE VS RELATIVE PATH
- NAVIGATING THE FILE SYSTEM
- VIEWING THE FILE CONTENT
- WORKING WITH FILES, SEARCHING AND FILTERING
> HDFS COMMANDS
- LIST FILES WITH DIFFERENT PARAMETERS
- CREATE FOLDERS AND FILES IN HDFS
- MOVE DATA FROM LOCAL TO HDFS
- MOVE DATA FROM HDFS TO LOCAL
- MOVE DATA FROM ONE HDFS LOCATION TO ANOTHER
> HDFS VS CLOUD DATA LAKE
> DISTRIBUTED PROCESSING
- INTRODUCTION TO MAP REDUCE
- PRINCIPLE OF DATA LOCALITY

WEEK 3 : DISTRIBUTED PROCESSING FUNDAMENTALS - APACHE SPARK
> DISTRIBUTED PROCESSING
- MAP REDUCE
- CHANGING THE NUMBER OF REDUCERS
- WORKFLOW OF A MAPREDUCE JOB
- USE CASE : FINDING MAXIMUM TEMPERATURE USING MAPREDUCE
- WHAT IS SHUFFLE?
- WHAT IS SORT?
- WHAT IS PARTITION?
- ADVANTAGE OF LOCAL AGGREGATION
- USE CASE : CLASSICAL INDUSTRY USE CASE OF MAPREDUCE
- PRACTICAL : HOW TO RUN MAPREDUCE PROGRAM
> APACHE SPARK:
- CHALLENGES OF MAPREDUCE
- BRIEF INTRODUCTION OF APACHE SPARK
- UNDERSTANDING OF SPARK EXECUTION PLAN
- VISUALIZATION OF RDD
- WHAT IS DAG?
- ADVANTAGE OF SPARK BEING LAZY
- REAL TIME EXAMPLE : EXECUTING WORD COUNT PROGRAM ON PYSPARK

WEEK 4 : APACHE SPARK CORE API
> PYTHON BASICS
- NORMAL VS LAMBDA FUNCTION
- PYTHON MAP FUNCTION VS SPARK MAP TRANSFORMATION
- HIGHER ORDER FUNCTION
> PYSPARK USE CASE
> UNDERSTANDING MAP, REDUCE, REDUCEBYKEY, FILTER, SORTBY, DISTINCT, TAKE, COLLECT
> REAL TIME EXAMPLE : FINDING FREQUENCY OF EACH WORD IN FILE USING PYSPARK
> PARALLELIZE AND ITS USE
> COUNTBYVALUE
> UNDERSTANDING PARTITIONS IN AN RDD
> CATEGORIES OF SPARK TRANSFORMATION(WIDE & NARROW)
> VISUALIZATION OF SPARK JOBS ON HISTORY SERVER
- STAGE, TASK AND JOBS
- RELATION BETWEEN JOBS AND ACTIONS.
- RELATION BETWEEN STAGES AND WIDE TRANSFORMATION.
- RELATION BETWEEN TASK AND PARTITIONS.
> REDUCE VS REDUCEBYKEY
> REDUCEBYKEY VS GROUPBYKEY
> JOINS IN SPARK
> BROADCAST JOIN AND ITS WORKING
> REPARTITION, COALESCE, AND THEIR APPLICATIONS
> UNDERSTANDING OF CACHE

WEEK 5 : SPARK HIGHER LEVEL APIS - DATAFRAMES & SPARK SQL
> INTRODUCTION TO HIGHER LEVEL APIS IN APACHE SPARK
- DATAFRAMES
- SPARK SQL
> WHY ARE HIGHER LEVEL APIS MORE PERFORMANT
> WORKING OF DATAFRAMES
> CREATION OF SPARK SQL TABLE FROM DATAFRAME AND VICE VERSA
> CREATION OF SPARK TABLE
> TYPES OF TABLE
- MANAGED TABLE
- EXTERNAL TABLE
- MANAGED TABLE VS EXTERNAL TABLE
> USE CASE OF DATAFRAMES & SPARKSQL
> SPARK OPTIMIZATION
- APPLICATION CODE LEVEL OPTIMIZATION
- CLUSTER LEVEL OPTIMIZATION
- SPARK EXECUTORS
- THIN EXECUTORS
- FAT EXECUTORS
> RIGHT STRATEGY FOR CREATING CONTAINERS

WEEK 6 : SPARK DATAFRAME TRANSFORMATIONS
> CHALLENGES OF SCHEMA INFERENCE
> INTRODUCTION TO SCHEMA ENFORCEMENT
- SAMPLING RATIO
- WAYS TO ENFORCE SCHEMA(SCHEMA DDL & STRUCT TYPE)
> DIFFERENT WAYS OF HANDLING DATE FORMATS
> DATAFRAME READ MODES
> DIFFERENT WAYS OF CREATING A DATAFRAME
> CONVERSION OF RDD TO DATAFRAME AND ITS DIFFERENT APPROACHES
> HANDLING NESTED SCHEMA
> DATAFRAME TRANSFORMATION(SELECT VS SELECTEXPR)
> REMOVAL OF DUPLICATES FROM DATAFRAME
> SPARK SESSION
> DEPLOYMENT MODES(CLIENT MODE VS CLUSTER MODE)
WEEK 7 : APACHE SPARK - CACHING
> ACCESSING SPARK UI AND RESOURCE MANAGER
> UNDERSTANDING THE SPARK UI
> UNDERSTANDING CACHE AND PERSIST
> IMPORTANCE OF CACHE
> PRACTICAL APPLICATIONS OF CACHING
> SERIALIZED VS DESERIALIZED
> THE MECHANISM OF CACHING
> UNDERSTANDING SPARK CODE EXECUTION(PARSED, ANALYZED, OPTIMIZED LOGICAL PLAN)
> IN-MEMORY TABLE CACHE
> NODE_LOCAL VS PROCESS_LOCAL
> SIGNIFICANCE OF DYNAMIC ALLOCATION
> CACHING SPARK TABLE
> SPARK CATALOG, MANAGED & EXTERNAL TABLE
> CACHE PERFORMANCE
> UN-PERSIST AND ITS USE
> IMPORTANCE OF STORING CACHED DATA IN OTHER DATAFRAME
> PREDICATE PUSH DOWN
> DIFFERENT WAYS OF CACHING
> TYPES OF FILE FORMATS
> INTRODUCTION TO PERSIST
> VARIOUS STORAGE LEVELS PROVIDED BY THE PERSIST

WEEK 8 : SPARK ARCHITECTURE & AGGREGATE FUNCTIONS
> YARN ARCHITECTURE
- RESOURCE MANAGER
- APPLICATION MASTER
- NODE MANAGER
- CONTAINER
- UBER MODE
- YARN RESOURCE MANAGER UI
> WORKING OF SPARK ON YARN ARCHITECTURE
> SPARK ARCHITECTURE
> SPARK JOB IN CLIENT MODE
> SPARK JOB IN CLUSTER MODE
> WAYS OF ACCESSING COLUMNS IN PYSPARK
- COLUMN STRING
- COLUMN OBJECT
- COLUMN EXPRESSION
> OVERVIEW OF AGGREGATE FUNCTION
> SIMPLE AGGREGATE
> GROUPING AGGREGATE
> WINDOWING AGGREGATE
> WINDOWING FUNCTIONS
- RANK
- DENSE_RANK
- ROW_NUMBER
- LEAD
- LAG
> ANALYZING A LOG FILE
> PIVOT TABLE AND ITS CREATION
WEEK 9 : APACHE SPARK INTERNALS & DATAFRAME PARTITIONS
> READING DATAFRAMES
> DATAFRAME READ MODES
- PERMISSIVE
- DROPMALFORMED
- FAILFAST
> DATAFRAME WRITE MODES
- OVERWRITE
- IGNORE
- APPEND
- ERRORIFEXISTS
> PARTITIONBY CLAUSE
> UNDERSTANDING OF BUCKETING & ITS PERFORMANCE GAINS
> ACCESSING SPARK UI IN DATABRICKS COMMUNITY EDITION
> SPARK INTERNALS
> DISABLING DYNAMIC EXECUTOR ALLOCATION
> SPARK-SUBMIT AT A HIGH-LEVEL
> INITIAL NUMBER OF PARTITIONS IN A DATAFRAME
> CALCULATING THE INITIAL NUMBER OF PARTITIONS FOR A SINGLE NON-SPLITTABLE FILE
> CALCULATING THE INITIAL NUMBER OF PARTITIONS FOR MULTIPLE FILES

WEEK 10 : SPARK OPTIMIZATIONS & PERFORMANCE TUNING - 1
> INTERNALS OF GROUPBY
> NORMAL VS BROADCAST JOIN
> DIFFERENT TYPES OF JOINS
- INNER JOIN
- LEFT OUTER JOIN
- RIGHT OUTER JOIN
- FULL OUTER JOIN
- LEFT SEMI JOIN
- LEFT ANTI JOIN
> PARTITION SKEW
> 3 USE CASES : OPTIMIZATIONS
> SIGNIFICANCE OF AQE(ADAPTIVE QUERY EXECUTION)
> JOIN STRATEGIES IN APACHE SPARK
> BROADCAST HASH JOIN
> SORT MERGE JOIN
> SHUFFLE HASH JOIN
> OPTIMIZING JOIN OF 2 LARGE TABLES - BUCKETING

WEEK 11 : SPARK OPTIMIZATIONS & PERFORMANCE TUNING - 2
> MEMORY MANAGEMENT IN APACHE SPARK
> SORT AGGREGATE VS HASH AGGREGATE
> VARIOUS PLANS IN APACHE SPARK
- PARSED LOGICAL PLAN
- ANALYZED LOGICAL PLAN
- OPTIMIZED LOGICAL PLAN
- PHYSICAL PLAN
> CATALYST OPTIMIZER
> INTRODUCTION TO FILE FORMATS & COMPRESSION TECHNIQUES
- ROW BASED FILE FORMATS
- COLUMN BASED FILE FORMATS
> SPECIALIZED FILE FORMATS
- AVRO
- ORC
- PARQUET
> SCHEMA EVOLUTION
> COMPRESSION TECHNIQUES
- SNAPPY
- LZO
- GZIP
- BZIP

WEEK 12 - 13 : PYSPARK PROJECT IMPLEMENTATION AND BEST PRACTICES
> KEY ELEMENTS OF A BIG DATA PROJECT
> EXAMPLE PROBLEM STATEMENTS
> AGILE METHODOLOGY
> PYSPARK PROJECT
- FINANCE DOMAIN
- ARCHITECTURAL SOLUTION
- UNDERSTANDING THE DATASETS
- DATA CLEANING
> UNDERSTANDING THE PROJECT IMPLEMENTATION LOGIC
> PERMANENT TABLE CREATION ON CLEANED DATA
> ACCESS PATTERNS
> IDENTIFYING THE BAD DATA
> SEGREGATING THE IDENTIFIED BAD DATA FROM THE NORMAL DATA
> PROCESSING AND STORING THE FINAL RESULTS
> PROJECT STRUCTURING & EXECUTION
- VIRTUAL ENVIRONMENT SETUP
> UNIT TESTING
- IDENTIFYING AND WRITING UNIT TEST CASES
- FIXTURE
- TEARDOWN | YIELD
- FIXTURE TO CHECK IF THE CALCULATED RESULTS MATCH EXPECTED RESULTS
- MARKERS
- PARAMETERIZED GENERIC TEST CASES
> IMPLEMENTING LOGGING LEVEL IN APACHE SPARK
WEEK 14 : GIT | GITHUB | CICD
> OVERVIEW OF GIT & GITHUB
> SETUP
- GITHUB ACCOUNT CREATION
- GIT INSTALLATION
- VS CODE IDE INSTALLATION
> IMPORTANT GIT COMMANDS
> SCENARIO 1 : PROJECT CREATION THROUGH GITHUB (REMOTE)
> SCENARIO 2 : PROJECT CREATION THROUGH GIT (LOCAL)
> BRANCHES IN GIT
> REVERTING BACK TO THE PREVIOUS CODE BASE
> SCENARIO 3 : WORKING ON EXISTING PROJECT(FORK COMMAND)
> GIT STASH COMMAND
> HANDLING MERGE CONFLICTS
> CONTINUOUS INTEGRATION & CONTINUOUS DEPLOYMENT - CICD
- BRANCHING STRATEGY & STAGES OF CICD
- SETUP : DEPLOYING AND CONFIGURING JENKINS SERVER
- BRANCHING STRUCTURE
- JENKINS CONFIGURATIONS
- CREATING SAMPLE JENKINS PIPELINE
- BUILD | TEST | PACKAGE & DEPLOY : JENKINS PIPELINE FOR PROJECT

WEEK 15 : APACHE HIVE
> INTRODUCTION TO APACHE HIVE & PRACTICALS
> APACHE HIVE TABLES
- MANAGED
- EXTERNAL
> HIVE OPTIMIZATIONS
- PARTITIONING
- BUCKETING
- JOIN OPTIMIZATIONS
> HIVE TRANSACTIONAL TABLES
> ACID PROPERTIES
> SPARK-HIVE INTEGRATION
> HIVE MSCK REPAIR
WEEK 16 : DATA MODELING AND SYSTEM DESIGN
> WHAT IS DATA MODELING | NORMAL FORMS
> NORMALIZATION - OLTP SYSTEMS
> MODELING DATA WAREHOUSE (DWH)
- FACT TABLE
- DIMENSION TABLE
> SURROGATE KEY
> STEPS INVOLVED IN DIMENSIONAL MODELING
> OPTIMIZING THE DATA MODELING PROCESS
- CHOOSING THE RIGHT GRAIN
> DIMENSION TABLE VS ONE BIG TABLE (OBT)
> SLOWLY CHANGING DIMENSIONS (SCD TYPES)
> SCD TYPE-2 IMPLEMENTATION
WEEK 17 - 18 : APACHE SPARK STRUCTURED STREAMING
> KINDS OF PROCESSING
> WHAT IS REAL-TIME PROCESSING
> BATCH PROCESSING VS REAL-TIME STREAMING
> SPARK STREAMING DATA
> STRUCTURED STREAMING IN-DEPTH
> BENEFITS OF SPARK STRUCTURED STREAMING
> TYPES OF DATA SOURCES
> STREAMING JOINS
> STREAMING DATAFRAME
> SPARK DISCRETIZED STREAM : DSTREAM
> IS SPARK A REAL-TIME STREAMING ENGINE
> STREAM PROCESSING IN SPARK
> TRANSFORMED DSTREAM
> UNDERSTANDING PRODUCER AND CONSUMER
> PRACTICAL ON REAL-TIME PROCESSING
> STREAM TRANSFORMATION
> STATELESS TRANSFORMATIONS
> STATEFUL TRANSFORMATIONS
> WINDOW TRANSFORMATIONS
> UPDATESTATEBYKEY, REDUCEBYKEYANDWINDOW, REDUCEBYWINDOW, COUNTBYWINDOW
> TYPES OF WINDOWS - TUMBLING TIME WINDOW & SLIDING TIME WINDOW
> WINDOW OPERATIONS | BATCH INTERVAL
> WINDOW SIZE | SLIDING INTERVAL
> WHAT IS STRUCTURED STREAMING
> REQUIREMENT OF STRUCTURED STREAMING
> LIMITATIONS OF SPARK STREAMING
> BENEFITS OF SPARK STRUCTURED STREAMING
> DYNAMICALLY SETTING THE SHUFFLE PARTITIONS
> DATASTREAM WRITER OUTPUT MODES
> DATASTREAM OUTPUT MODES - APPEND, UPDATE & COMPLETE
> SPARK STREAMING GRACEFUL SHUTDOWN
> HOW DOES SPARK STREAMING CODE EXECUTE INTERNALLY
> TYPES OF TRIGGERS - UNSPECIFIED, TIME-INTERVAL, ONE-TIME & CONTINUOUS
> TYPES OF DATA SOURCES - SOCKET, RATE, FILE & KAFKA SOURCE
> TYPES OF SPARK STREAMING OUTPUT DATA OPTIONS
> TYPES OF AGGREGATIONS

WEEK 19 : KAFKA
> INTRODUCTION TO KAFKA - STREAMING PLATFORM
> KAFKA ARCHITECTURE
> KAFKA KEY CONCEPTS
> CLUSTER | NODES | BROKERS | TOPICS
> CONSUMER | PRODUCER | LOGS | PARTITIONS
> INSTALLING MULTI-NODE KAFKA CLUSTER
> WRITING KAFKA PRODUCER AND CONSUMER
> SCALING UP THE KAFKA CLUSTER
> CONCEPT OF PARTITION GROUPS
> LEADER AND FOLLOWER PARTITION
> COMMAND LINE PRODUCER AND CONSUMER
> REPLICATION CONCEPT FOR FAULT TOLERANCE
> HOW DATA IS STORED IN BROKERS
> LOG SEGMENTS, MESSAGE OFFSETS, MESSAGE INDEX
> WRITING KAFKA PRODUCER, CONSUMER
> INTEGRATING KAFKA WITH SPARK STRUCTURED STREAMING
> BUILDING STREAMING PIPELINE (STRUCTURED STREAMING WITH KAFKA)
> END-TO-END REAL-TIME STREAMING USE CASE USING KAFKA

WEEK 20 - 21 : IMPORTANT AWS CLOUD SERVICES
> AWS EMR (ELASTIC MAPREDUCE)
> LAUNCH EMR CLUSTER USING ADVANCED OPTIONS
> KINDS OF NODES IN CLUSTER
> HOW TO CREATE A VM
> TYPES OF EC2 INSTANCES
> RUNNING SPARK CODE ON EMR
> HOW TO TRACK YOUR JOB
> AWS S3
> AWS STORAGE
> COPY FILE FROM S3 TO LOCAL
> AWS COMMAND LINE INTERFACE
> AWS ATHENA
> WHEN DO WE REQUIRE ATHENA
> WHAT PROBLEM ATHENA SOLVES
> ATHENA PRICING
> ATHENA PRACTICAL DEMONSTRATION
> HOW TO MINIMIZE DATA SCANNING IN ATHENA
> AWS GLUE - DATA CATALOG | CRAWLERS
> INFERRING SCHEMA AUTOMATICALLY USING AWS GLUE
> CONNECTING TO DATA STORE
> USING CRAWLERS FOR CATALOG TABLES
> OVERVIEW OF WORKING WITH GLUE JOBS
> ADDING NEW JOBS IN GLUE
> TRIGGERING JOBS AND SCHEDULING
> AWS REDSHIFT
> BENEFITS AND USE CASES OF REDSHIFT
> REDSHIFT ARCHITECTURE
> TYPES OF NODES
> REDSHIFT SPECTRUM
> REDSHIFT FAULT TOLERANCE
> REDSHIFT SORT KEYS
> REDSHIFT DISTRIBUTION STYLES
> LAMBDA FUNCTIONS
> END-TO-END REAL-TIME USE CASE USING AWS CLOUD SERVICES

GET INTERVIEW READY
> RESUME | LINKEDIN | NAUKRI PROFILE BUILDING AND OPTIMIZATION
> SAMPLE INTERVIEW QUESTIONS

Contact
[email protected]
https://trendytech.in/
