2023713662-PythonSQLPyspark
Python (4 Days)
Day 1
• Installing and setting up Python
• Writing your first program in Python
o Printing Hello World
• Operators and Expressions
• Slicing
o Negative slicing
o Using step in slicing
o Slicing backwards
• Strings
o String operators
o String formatting
• Program Flow Control in Python
o if statement
o elif
o for loop
o continue and break
o while loop
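A minimal sketch tying the Day 1 topics together (the file name and sample values are invented for illustration):

    # day1_basics.py -- printing, slicing, string formatting and flow control
    text = "Hello World"
    print(text)                      # the classic first program
    print(text[-5:])                 # negative slicing -> "World"
    print(text[::2])                 # slicing with a step -> "HloWrd"
    print(text[::-1])                # slicing backwards -> "dlroW olleH"

    name = "Python"
    print(f"Welcome to {name}!")     # f-string formatting

    for n in range(10):              # for loop with continue and break
        if n % 2 == 0:
            continue                 # skip even numbers
        elif n > 7:
            break                    # stop once n exceeds 7
        print(n)                     # prints 1, 3, 5, 7

    count = 3
    while count > 0:                 # while loop counting down
        print(count)
        count -= 1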
Day 2
• Lists and Tuples
o Mutable vs. immutable objects
o Lists
o Sorting a list
o Removing items from a list
o Replacing items in a list
o What are tuples?
o Performing basic operations on a tuple
• Dictionaries and Sets
• Functions
o Defining a function
o Parameters and arguments
o Returning values
o Docstring
o *args
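A short sketch of the Day 2 topics (names and values are invented for illustration):

    # day2_collections.py -- lists, tuples, dicts, sets and functions
    scores = [42, 7, 19]             # lists are mutable
    scores.sort()                    # sorts in place -> [7, 19, 42]
    scores.remove(19)                # removing an item by value
    scores[0] = 10                   # replacing an item by index

    point = (3, 4)                   # tuples are immutable
    print(len(point), max(point))    # basic functions on a tuple

    ages = {"alice": 30, "bob": 25}  # a dictionary
    unique = set([1, 2, 2, 3])       # a set drops duplicates -> {1, 2, 3}

    def total(*args):
        """Return the sum of any number of arguments."""  # docstring
        return sum(args)

    print(total(1, 2, 3))            # -> 6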
Day 3
• Input and Output in Python
o Reading and writing to a text file
o Appending to a file
o Object persistence using shelve
• Exception handling in Python
• Generators, decorators and lambda expressions
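A compact sketch of the Day 3 topics; the file and shelf names are placeholders:

    # day3_io.py -- file I/O, shelve, exceptions, generators, decorators, lambdas
    import shelve

    with open("notes.txt", "w") as f:      # writing to a text file
        f.write("first line\n")
    with open("notes.txt", "a") as f:      # appending to a file
        f.write("second line\n")
    with open("notes.txt") as f:           # reading it back
        print(f.read())

    with shelve.open("store") as db:       # object persistence with shelve
        db["settings"] = {"debug": True}

    try:                                   # exception handling
        1 / 0
    except ZeroDivisionError as e:
        print("caught:", e)

    def squares(n):                        # a generator
        for i in range(n):
            yield i * i

    def shout(fn):                         # a simple decorator
        def wrapper(*args):
            return fn(*args).upper()
        return wrapper

    @shout
    def greet(name):
        return f"hello {name}"

    double = lambda x: x * 2               # a lambda expression
    print(list(squares(4)), greet("world"), double(5))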
Day 4
• Introduction to external libraries in Python
• Deep dive into libraries:
o NumPy, Pandas and Matplotlib
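One way the three libraries fit together (the column names and output file are arbitrary):

    # day4_libraries.py -- NumPy, Pandas and Matplotlib in one pipeline
    import numpy as np
    import pandas as pd
    import matplotlib.pyplot as plt

    arr = np.arange(1, 6)                         # NumPy array [1, 2, 3, 4, 5]
    df = pd.DataFrame({"x": arr, "y": arr ** 2})  # Pandas DataFrame from the array
    print(df.describe())                          # summary statistics

    df.plot(x="x", y="y", kind="line")            # Matplotlib chart via Pandas
    plt.title("y = x squared")
    plt.savefig("plot.png")                       # save the figure to disk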
Assessment
SQL (4 Days)
Fundamentals of SQL
Day 1
• Introduction to SQL
o Introduction
o Work with Schemas
o Explore the structure of SQL statements: DDL, DML and DCL
o Examine the SELECT statement
o Work with data types
o Handle NULLs
Hands-on: Work with SELECT statements
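A self-contained sketch of the SELECT material, using Python's built-in sqlite3 module so it runs anywhere (the course may target a different engine; the emp table and its columns are invented):

    # sql_day1.py -- SELECT, data types and NULL handling
    import sqlite3

    con = sqlite3.connect(":memory:")
    # DDL: create a table with mixed data types
    con.execute("CREATE TABLE emp (id INTEGER, name TEXT, salary REAL, bonus REAL)")
    # DML: insert rows, one with a NULL bonus
    con.execute("INSERT INTO emp VALUES (1, 'Ada', 90000, NULL)")
    con.execute("INSERT INTO emp VALUES (2, 'Lin', 75000, 5000)")

    # SELECT with a column alias; COALESCE handles the NULL bonus
    query = "SELECT name, salary + COALESCE(bonus, 0) AS total_pay FROM emp"
    for row in con.execute(query):
        print(row)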
Day 2
• Sort and filter results in SQL
o Sort your results
o Limit the sorted results
o Page results
o Remove duplicates
o Filter data with predicates
• Combine multiple tables with JOINs in SQL
o Understand join concepts and syntax
o Use Inner joins
o Use Outer joins
o Use Cross joins
o Use Self joins
• Write Subqueries in SQL
o Understand Subqueries
o Use scalar or multi-valued subqueries
o Use self-contained or correlated subqueries
Hands-on: Sort and filter query results
Hands-on: Query multiple tables with joins
Hands-on: Use Subqueries
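A sketch covering sorting, filtering, joins and a correlated subquery in one place (again via sqlite3; tables and data are made up):

    # sql_day2.py -- sorting, filtering, joins and subqueries
    import sqlite3

    con = sqlite3.connect(":memory:")
    con.executescript("""
        CREATE TABLE dept (id INTEGER, name TEXT);
        CREATE TABLE emp  (id INTEGER, name TEXT, dept_id INTEGER, salary REAL);
        INSERT INTO dept VALUES (1, 'Eng'), (2, 'Ops');
        INSERT INTO emp  VALUES (1, 'Ada', 1, 90000), (2, 'Lin', 1, 75000),
                                (3, 'Sam', 2, 60000);
    """)

    # Sort, remove duplicates, filter with a predicate, then limit and page
    print(con.execute(
        "SELECT DISTINCT name FROM emp WHERE salary > 50000 "
        "ORDER BY salary DESC LIMIT 2 OFFSET 0"
    ).fetchall())

    # Inner join plus a correlated subquery: who earns above their
    # department's average salary?
    print(con.execute("""
        SELECT e.name, d.name
        FROM emp e JOIN dept d ON e.dept_id = d.id
        WHERE e.salary > (SELECT AVG(salary) FROM emp
                          WHERE dept_id = e.dept_id)
    """).fetchall())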
Day 3
• Use built-in functions and GROUP BY in SQL
o Categorize built-in functions
▪ Use aggregate functions - AVG, SUM, MIN, MAX, COUNT
▪ Use mathematical functions - ABS, COS, SIN, ROUND, RAND
▪ Use ranking functions - RANK, DENSE_RANK
▪ Use analytical functions - LAG, LEAD, LAST_VALUE, PERCENTILE_CONT, PERCENTILE_DISC, PERCENT_RANK
▪ Use logical functions - CHOOSE, GREATEST, LEAST
o Summarize data with GROUP BY
o Filter groups with HAVING
• Modify data with SQL
o Insert data
o Generate automatic values
o Update data
o Delete data
o Merge data based on multiple tables
Hands-on: Use built-in functions
Hands-on: Modify data
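A sketch of aggregation and data modification (sqlite3 again; SQLite has no MERGE statement, so that topic is omitted here, and the sales table is invented):

    # sql_day3.py -- built-in functions, GROUP BY, HAVING and data modification
    import sqlite3

    con = sqlite3.connect(":memory:")
    con.executescript("""
        -- INTEGER PRIMARY KEY generates id values automatically
        CREATE TABLE sales (id INTEGER PRIMARY KEY, region TEXT, amount REAL);
        INSERT INTO sales (region, amount) VALUES
            ('north', 100), ('north', 250), ('south', 80);
    """)

    # Aggregate functions with GROUP BY; HAVING filters the groups
    print(con.execute("""
        SELECT region, COUNT(*), SUM(amount), AVG(amount),
               MIN(amount), MAX(amount)
        FROM sales
        GROUP BY region
        HAVING SUM(amount) > 100
    """).fetchall())

    # Modify data: UPDATE then DELETE
    con.execute("UPDATE sales SET amount = amount * 1.1 WHERE region = 'north'")
    con.execute("DELETE FROM sales WHERE amount < 90")
    print(con.execute("SELECT * FROM sales").fetchall())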
Day 4
• Triggers
• Stored Procedures
o Create
o Modify
o Delete
o Execute
o Specify parameters
• Indexes
o Heaps (Tables without Clustered Indexes)
o Clustered & Non-Clustered Indexes
Hands-on: Stored procedure
Hands-on: Indexes
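A sketch of a trigger and an index (sqlite3 for portability; SQLite does not support stored procedures, whose syntax is engine-specific, so only the trigger and index topics are shown):

    # sql_day4.py -- a trigger and a non-clustered-style secondary index
    import sqlite3

    con = sqlite3.connect(":memory:")
    con.executescript("""
        CREATE TABLE emp   (id INTEGER PRIMARY KEY, name TEXT, salary REAL);
        CREATE TABLE audit (emp_id INTEGER, old_salary REAL, new_salary REAL);

        -- Trigger: record every salary change in the audit table
        CREATE TRIGGER trg_salary AFTER UPDATE OF salary ON emp
        BEGIN
            INSERT INTO audit VALUES (OLD.id, OLD.salary, NEW.salary);
        END;

        -- Secondary index to speed lookups by name
        CREATE INDEX idx_emp_name ON emp (name);
    """)
    con.execute("INSERT INTO emp (name, salary) VALUES ('Ada', 90000)")
    con.execute("UPDATE emp SET salary = 95000 WHERE name = 'Ada'")
    print(con.execute("SELECT * FROM audit").fetchall())   # -> [(1, 90000.0, 95000.0)]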
Assessment
PySpark (4 Days)
PySpark
Day 1
• Fundamentals of PySpark
o A Brief Primer on PySpark
o Brief Introduction to Spark
o Apache Spark Stack
o Spark Execution Process
o Newest Capabilities of PySpark
o Cloning GitHub Repository
• Resilient Distributed Datasets
o Creating RDDs
o Schema of an RDD
o Understanding Lazy Execution
o Introducing Transformations – .map(…)
o Introducing Transformations – .filter(…)
o Introducing Transformations – .flatMap(…)
o Introducing Transformations – .distinct(…)
o Introducing Transformations – .sample(…)
o Introducing Transformations – .join(…)
o Introducing Transformations – .repartition(…)
o Project 1: Count Data Project (ingest the dataset, perform preprocessing and exploratory analysis of the data, then apply map, filter, flatMap, distinct, join and repartition)
o Project 2: Weather Temperature Crunch (ingest the dataset, perform preprocessing and exploratory analysis of the data, then apply map, filter, flatMap, distinct, join and repartition on in-stream data)
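A minimal sketch of the transformations listed above, on a local SparkContext (the input lines are invented):

    # rdd_transformations.py -- core RDD transformations
    from pyspark import SparkContext

    sc = SparkContext("local[*]", "day1-rdd")
    lines = sc.parallelize(["spark makes rdds", "rdds are lazy", "spark is fast"])

    words = lines.flatMap(lambda s: s.split())   # flatMap: line -> words
    clean = words.filter(lambda w: len(w) > 3)   # filter: keep longer words
    pairs = clean.map(lambda w: (w, 1))          # map: word -> (word, 1)
    uniq  = clean.distinct()                     # distinct: drop duplicates

    # Nothing has run yet (lazy execution); collect() triggers the job
    print(uniq.collect())
    print(pairs.repartition(2).getNumPartitions())   # repartition to 2 partitions

    # join: combine two pair RDDs on their keys
    other = sc.parallelize([("spark", "engine"), ("rdds", "datasets")])
    print(uniq.map(lambda w: (w, 1)).join(other).collect())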
Day 2
• Resilient Distributed Datasets and Actions
o Introducing Actions – .collect(…)
o Introducing Actions – .reduce(…) and .reduceByKey(…)
o Introducing Actions – .count()
o Introducing Actions – .foreach(…)
o Introducing Actions – .aggregate(…) and .aggregateByKey(…)
o Introducing Actions – .coalesce(…)
o Introducing Actions – .combineByKey(…)
o Introducing Actions – .histogram(…)
o Introducing Actions – .sortBy(…)
o Introducing Actions – Saving Data
o Introducing Actions – Descriptive Statistics
o Project 3: 10 Tasks on the Students/Professors University Dataset (ingest the dataset, perform preprocessing and exploratory analysis of the data, then apply RDD actions)
o Project 4: 8 Tasks on the Customer Data Dataset (ingest the dataset, perform preprocessing and exploratory analysis of the data, then apply the specified RDD actions)
o Project 5: Movie ratings
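A sketch of the actions above (sample data and the output path are placeholders):

    # rdd_actions.py -- common RDD actions
    from pyspark import SparkContext

    sc = SparkContext("local[*]", "day2-actions")
    nums  = sc.parallelize([1, 2, 3, 4, 5])
    pairs = sc.parallelize([("a", 1), ("b", 2), ("a", 3)])

    print(nums.count())                               # 5
    print(nums.reduce(lambda x, y: x + y))            # 15
    print(pairs.reduceByKey(lambda x, y: x + y).collect())  # [('a', 4), ('b', 2)]
    print(nums.histogram([0, 3, 6]))                  # (buckets, counts)
    print(nums.sortBy(lambda x: -x).collect())        # descending sort
    print(nums.stats())                               # descriptive statistics

    nums.foreach(lambda x: None)                      # runs on the executors
    nums.coalesce(1).saveAsTextFile("out_nums")       # saving data (dir must not exist)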
• DataFrames and Transformations
o Creating DataFrames
o Specifying Schema of a DataFrame
o Interacting with DataFrames
o The .agg(…) Transformation
o The .sql(…) Transformation
o Creating Temporary Tables
o Joining Two DataFrames
o Performing Statistical Transformations
o The .distinct(…) Transformation
o Project 6: CompanyMegaData (apply transformation logic, column-level logic, aggregation and exploratory data analysis)
o Project 7: University Data (end-to-end PySpark execution of insight delivery on University Data)
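A sketch of the DataFrame topics (the employee data and schema are invented):

    # df_transformations.py -- DataFrames, SQL and aggregation
    from pyspark.sql import SparkSession
    from pyspark.sql import functions as F

    spark = SparkSession.builder.appName("day2-df").getOrCreate()

    # Create a DataFrame with an explicit schema
    df = spark.createDataFrame(
        [("Ada", "Eng", 90000), ("Lin", "Eng", 75000), ("Sam", "Ops", 60000)],
        "name STRING, dept STRING, salary INT",
    )

    # .agg(...) after a groupBy, plus a statistical transformation
    df.groupBy("dept").agg(F.avg("salary").alias("avg_salary")).show()
    df.describe("salary").show()

    # A temporary table queried with spark.sql(...)
    df.createOrReplaceTempView("emp")
    spark.sql("SELECT DISTINCT dept FROM emp").show()

    # Joining two DataFrames
    depts = spark.createDataFrame([("Eng", "Building A")], "dept STRING, loc STRING")
    df.join(depts, "dept").show()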
Day 3
• Collaborative Filtering and Techniques
o Collaborative filtering
o Utility Matrix
o Explicit and implicit ratings
o Expected Results
o Dataset
o Joining DataFrames
o Train and Test Data
o ALS model
o Optimization: hyperparameter tuning and cross-validation
o Selecting the best model and evaluating predictions
o Project 8: IMDB Rating project (optimization logic focused on the project, with extensive PySpark logic and careful data-manipulation techniques)
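A sketch of the ALS workflow described above, on a toy explicit-rating utility matrix (all values invented; a real run would tune rank and regParam via cross-validation):

    # als_recommender.py -- collaborative filtering with ALS
    from pyspark.sql import SparkSession
    from pyspark.ml.recommendation import ALS
    from pyspark.ml.evaluation import RegressionEvaluator

    spark = SparkSession.builder.appName("day3-als").getOrCreate()

    ratings = spark.createDataFrame(
        [(0, 0, 4.0), (0, 1, 2.0), (1, 0, 5.0),
         (1, 2, 1.0), (2, 1, 3.0), (2, 2, 4.0)],
        "user INT, item INT, rating DOUBLE",
    )
    train, test = ratings.randomSplit([0.8, 0.2], seed=42)   # train/test split

    als = ALS(userCol="user", itemCol="item", ratingCol="rating",
              rank=5, maxIter=10, regParam=0.1, coldStartStrategy="drop")
    model = als.fit(train)                                   # fit the ALS model

    predictions = model.transform(test)                      # evaluate predictions
    rmse = RegressionEvaluator(metricName="rmse", labelCol="rating",
                               predictionCol="prediction").evaluate(predictions)
    print("RMSE:", rmse)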
• Spark Streaming
o Introduction to Spark Streaming
o Spark Streaming with RDDs
o Spark Streaming context
o Spark Streaming: reading data
o Spark Streaming: cluster restart
o Spark Streaming: RDD transformations
o Spark Streaming: DataFrames and display
o Spark Streaming: DataFrame aggregation
o Project 9: Streaming Crunch Dataset (orchestration of an end-to-end stream pipeline ingesting live data)
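A sketch of the DataFrame-based streaming path using Structured Streaming (assumes a text server on localhost:9999, e.g. started with `nc -lk 9999`; the RDD/DStream topics above follow the same read-transform-output shape):

    # stream_counts.py -- streaming word count with aggregation and display
    from pyspark.sql import SparkSession
    from pyspark.sql import functions as F

    spark = SparkSession.builder.appName("day3-stream").getOrCreate()

    lines = (spark.readStream.format("socket")        # reading stream data
             .option("host", "localhost")
             .option("port", 9999)
             .load())

    words  = lines.select(F.explode(F.split(lines.value, " ")).alias("word"))
    counts = words.groupBy("word").count()            # streaming DF aggregation

    query = (counts.writeStream
             .outputMode("complete")                  # display the full counts table
             .format("console")
             .start())
    query.awaitTermination()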
Day 4
• Spark ETL and Capstone Project
o Introduction to ETL
o ETL Pipeline
o Dataset
o Preprocessing, extraction, transformation
o Loading Data and cleaning
o RDS Networking
o Downloading PostgreSQL
o Configuration and execution
Project 10: Completion of the Capstone Project (a full end-to-end project on the Streaming Crunch Dataset spanning the entire set of PySpark concepts, from data exploration to applying techniques that meet the dataset's requirements, trying multiple solution approaches and selecting the most correct and efficient one)
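One possible shape for the ETL pipeline (the CSV path, amount column, JDBC URL and credentials are all placeholders to be replaced with the project's own values):

    # etl_pipeline.py -- extract, transform and load into PostgreSQL
    from pyspark.sql import SparkSession
    from pyspark.sql import functions as F

    spark = (SparkSession.builder.appName("capstone-etl")
             .config("spark.jars.packages", "org.postgresql:postgresql:42.7.3")
             .getOrCreate())

    # Extract: read the raw dataset
    raw = spark.read.csv("data/crunch.csv", header=True, inferSchema=True)

    # Transform: clean, cast and filter
    clean = (raw.dropna()
                .withColumn("amount", F.col("amount").cast("double"))
                .filter(F.col("amount") > 0))

    # Load: write to a PostgreSQL table over JDBC (e.g. an RDS instance)
    (clean.write.format("jdbc")
          .option("url", "jdbc:postgresql://<rds-endpoint>:5432/warehouse")
          .option("dbtable", "public.crunch_clean")
          .option("user", "etl_user")
          .option("password", "***")
          .option("driver", "org.postgresql.Driver")
          .mode("overwrite")
          .save())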