
Fuse AI Fellowship 2024

Data Wrangling with SQL


By: Pallavi Shrestha | Sudip Bala

April 25
CONTENT

SN Topics
Phase I (6:30 PM - 7:15 PM) - By Pallavi Shrestha
1 Introduction and Briefing of the Session

2 Importance of Data Wrangling in Data Science

3 Overview of RDBMS

4 Structured Query Language (SQL)

5 Basic SQL + Demo


6 Intermediate SQL + Demo

7 Some Advanced SQL + Demo

Break (7:15 PM - 7:25 PM)


CONTENT

SN Topics
Phase II (7:25 PM - 8:15 PM) - By Sudip Bala
8 Real World Data Wrangling Project with SQL

a. Exploratory Data Analysis

b. Handling missing and duplicate values

c. Date formats

d. Trimming, CASE, COALESCE, Window Functions

e. Dealing with outliers

f. Regular expressions for string manipulation

8:15 PM - 8:30 PM - Conclusion and QnA


Role of Data Wrangling in
Data Science & Machine Learning
Introduction
Data wrangling, or data munging, involves transforming and mapping raw data into a
structured format, ready for analysis and model building.

It goes beyond mere data cleaning: it more thoroughly transforms, reformats, and
prepares raw data for eventual downstream needs.

Purpose:
- Improve data quality
- Enable faster analysis
- Enhance model accuracy
SQL is the language for talking to databases.
SQL dialects refer to variations of SQL, such as T-SQL, PL/SQL, MySQL, and PostgreSQL.

Installation and Setup

RDBMS:
● MySQL
● PostgreSQL

Tools:
● DBeaver
● MySQL Workbench
For Linux Terminal (Ubuntu/Debian):

1. Install PostgreSQL
sudo apt install postgresql

2. Add and Install DBeaver


sudo apt install wget software-properties-common
wget -O - https://dbeaver.io/debs/dbeaver.gpg.key | sudo apt-key add -
sudo add-apt-repository "deb https://dbeaver.io/debs/dbeaver-ce /"

sudo apt update

sudo apt install dbeaver-ce


For Linux Terminal (Ubuntu/Debian):

1. Install MySQL Server


sudo apt install mysql-server
[During the installation process, you will be prompted to set a password for the MySQL root user]

2. Install MySQL Workbench


[Download the MySQL Workbench Debian package from the official MySQL website & install it using dpkg]
Example:
wget https://dev.mysql.com/get/Downloads/MySQLGUITools/mysql-workbench-community_8.0.28-1ubuntu21.10_amd64.deb

sudo dpkg -i mysql-workbench-community_8.0.28-1ubuntu21.10_amd64.deb


RDBMS Basics
Basics to get started
● Database: a shared collection of data that is stored and organised so that it can be
easily accessed and maintained.
● Database Management System: Collection of programs which enables users to
create and maintain a database

RDBMS examples
● Oracle
● MySQL
● SQLite
● PostgreSQL
● Microsoft SQL Server
● MariaDB
Schema is a logical container that holds a collection of database objects, such as tables,
views, indexes, and more
● defines the structure of the database and the organization of data within it
● helps in segregating different types of data and provides a way to manage permissions
and access control for different user roles

Table is a fundamental database object that stores data in rows and columns, forming a
two-dimensional structure
● each column corresponds to an attribute of the data
● each row represents an individual record or entry
● used to represent different entities in the real world, such as customers, orders, products,
employees, etc.
● relationships between tables are established through keys, such as primary keys and
foreign keys.
Basic SQL Commands
DB Implementation
● DDL (Data Definition Language)
○ for defining and managing the structure of the database, including tables,
relationships, indexes, and constraints
○ DDL commands are used to create, modify, and delete database objects
○ CREATE, DROP, ALTER, TRUNCATE, RENAME, COMMENT

● DML (Data Manipulation Language)


○ To manipulate data within the database present in tables
○ Includes commands for inserting, updating, and deleting records from
tables, as well as querying data.
○ INSERT, UPDATE, DELETE, SELECT
○ MERGE, CALL, EXPLAIN
● DCL (Data Control Language): used to control access to data within the database
○ GRANT: used to give specific privileges or permissions to database users
○ REVOKE: used to revoke previously granted privileges from users

● TCL (Transaction Control Language)


- used to manage transactions within the database
- allow us to control the beginning, ending, and execution of transactions
○ COMMIT: used to permanently save the changes made during the current
transaction to the database
○ ROLLBACK: used to undo all changes made during the current transaction. It
restores the database to its state before the transaction began
○ SAVEPOINT: used to set a savepoint within a transaction. Savepoints allow you to
divide a transaction into smaller parts and rollback to a specific point if necessary
Code demo: DDL, DML
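
A minimal sketch of what such a demo might cover (PostgreSQL-style syntax; the student table and its columns are illustrative, not from the original demo):

-- DDL: define the structure
CREATE TABLE student (
    id     SERIAL PRIMARY KEY,
    name   VARCHAR(100) NOT NULL,
    grade  NUMERIC(4, 2)
);
ALTER TABLE student ADD COLUMN enrolled_on DATE;

-- DML: manipulate the data
INSERT INTO student (name, grade, enrolled_on) VALUES ('Asha', 3.70, '2024-04-01');
UPDATE student SET grade = 3.80 WHERE name = 'Asha';
SELECT * FROM student;
DELETE FROM student WHERE id = 1;

-- DDL again: remove the structure
DROP TABLE student;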
Comparison Operators
● Used to compare two values
● Represented by symbols
● Essential in conditions that determine which records will be selected, updated, or
deleted

Equal to =
Not equal to <> or !=
Greater than >
Less than <
Greater than or equal to >=
Less than or equal to <=
Logical Operators
● Allow us to combine multiple comparison operators in one query
● To filter the data by more than one condition (a short example follows the list below)
○ LIKE - to match similar values, instead of exact values.
○ IN - to specify a list of values you'd like to include.
○ BETWEEN - to select only rows within a certain range.
○ IS NULL - to select rows that contain no data in a given column.
○ AND - to select only rows that satisfy two conditions.
○ OR - to select rows that satisfy either of two conditions.
○ NOT - to select rows that do not match a certain condition
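
A minimal sketch combining several of these operators (the employee table and its columns are hypothetical):

SELECT name, department, salary
FROM employee
WHERE department IN ('Sales', 'Marketing')
  AND salary BETWEEN 40000 AND 90000
  AND name LIKE 'S%'
  AND manager_id IS NOT NULL;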
Wild Cards
● Used to substitute one or more characters in a string
● Especially useful in database systems for pattern searching within strings
● Used with LIKE operator

Symbol | Description | Example

% | represents zero or more characters | bl% finds bl, black, blue, and blob
_ | represents a single character | h_t finds hot, hat, and hit
Intermediate SQL
Aggregate Functions
● An aggregate function performs a calculation on a set of values, and returns a single
value
● often used with the GROUP BY clause of the SELECT statement

Function Used for

COUNT counts the number of (non-null) rows in a particular column

SUM adds all values in a particular column

AVG calculates the average of a group of selected values

MIN returns the smallest value of the selected column

MAX returns the largest value of the selected column

● Note: consider which data types each aggregate function works on (SUM and AVG require numeric columns; COUNT, MIN, and MAX work on most comparable types)


GROUP BY
● allows us to separate data into groups, which can be aggregated
independently of one another
● often used with aggregate functions to group the result-set by one or more
columns
HAVING
● The HAVING clause was added to SQL because the WHERE keyword cannot
be used with aggregate functions

Query Clause Order

1. SELECT
2. FROM
3. WHERE
4. GROUP BY
5. HAVING
6. ORDER BY

SELECT column_name(s)
FROM table_name
WHERE condition
GROUP BY column_name(s)
HAVING condition
ORDER BY column_name(s);
Order of SQL Operations
helps in understanding the relationship between filtering, grouping, and aggregating

1. FROM: Source table(s)


2. WHERE: Row-level filtering
3. GROUP BY: Group rows by column values
4. Aggregate functions applied: SUM(), AVG(), etc.
5. HAVING: Filter on the results of aggregate calculations
6. SELECT: Define which columns (or calculations) to return
7. ORDER BY: Sort the result set
8. LIMIT: Limit the number of returned rows
Points to be noted
● Columns in the SELECT statement that are not aggregated must be listed in
the GROUP BY clause
● The WHERE clause filters rows before aggregation
● The HAVING clause is used to filter results after aggregation
● Keep the Order of SQL Operations above in mind when combining these clauses (see the example below)
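
A minimal sketch showing WHERE (before aggregation) together with GROUP BY and HAVING (after aggregation); the orders table and its columns are hypothetical:

SELECT customer_id, COUNT(*) AS order_count, SUM(amount) AS total_amount
FROM orders
WHERE order_date >= '2024-01-01'        -- filters rows before grouping
GROUP BY customer_id
HAVING SUM(amount) > 1000               -- filters groups after aggregation
ORDER BY total_amount DESC;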
DISTINCT
● Returns only unique values in a particular column
● Used in SELECT statement
SELECT DISTINCT column_name
FROM table_name;
● If 2 or more columns are included in SELECT DISTINCT clause, result will contain all of
the unique pairs of these columns
SELECT DISTINCT column1, column2, ...
FROM table_name;
● DISTINCT goes inside aggregate function
SELECT COUNT(DISTINCT column_name)
FROM table_name;
Aliases
● used to give a table, or a column, a temporary name
● often used to make column names more readable
● An alias only exists for the duration of that query
● created with the AS keyword

Syntax
SELECT column_name AS alias_name
FROM table_name;
JOINS
Used to combine rows from 2 or more tables based on related columns between them

Type Returns

INNER Rows that have matching values in both tables

LEFT / LEFT OUTER All rows from the left table + only matching rows from the right table

RIGHT / RIGHT OUTER All rows from right table + only matching rows from left table

CROSS JOIN Produces a Cartesian product of the two tables, returning all possible combinations of rows

SELF / Recursive Joins a table with itself to compare rows within the same table

NATURAL Joins two tables on all columns with the same name in both tables

FULL OUTER (postgres) Returns all records when there is a match in either the left or the right table

LATERAL (postgres) Allows a subquery in the FROM clause to refer to columns of the preceding table(s). Useful when
dealing with set-returning functions or other subqueries that need to reference the main query.
JOINS - example
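A minimal sketch contrasting an INNER JOIN with a LEFT JOIN (the customer and orders tables are hypothetical):

-- INNER JOIN: only customers that have at least one order
SELECT c.name, o.id AS order_id
FROM customer c
INNER JOIN orders o ON o.customer_id = c.id;

-- LEFT JOIN: all customers; order columns are NULL where no match exists
SELECT c.name, o.id AS order_id
FROM customer c
LEFT JOIN orders o ON o.customer_id = c.id;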
JOINS on multiple keys
● When we need to join tables based on more than one key or condition, we can use
multiple join conditions.
● Common in scenarios where a single pair of keys isn't unique enough to combine
the tables correctly.

SELECT A.*, B.*


FROM table1 A
JOIN table2 B
ON A.key1 = B.key1 AND A.key2 = B.key2;
CASE
● The SQL CASE is a conditional expression that allows you to introduce
decision-making logic in your SQL queries
● used to derive values based on specific conditions, effectively allowing for
"if-then-else" logic within SQL statements
Syntax
CASE
WHEN condition1 THEN result1
WHEN condition2 THEN result2
WHEN conditionN THEN resultN
ELSE result

END;
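
A minimal sketch of CASE in a SELECT list (reusing the covid table that appears in the window-function example later in the deck; the thresholds are illustrative):

SELECT state,
       total_cases,
       CASE
           WHEN total_cases >= 100000 THEN 'high'
           WHEN total_cases >= 10000  THEN 'medium'
           ELSE 'low'
       END AS case_load
FROM covid;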
UNION
● The UNION operator is used to combine the result sets of two or more SELECT statements.
● Each SELECT statement within the UNION must have
○ same number of columns,
○ columns must also have similar data types,
○ columns must be in the same order.
● By default, UNION removes duplicate rows. Use UNION ALL if you want to keep duplicates

SELECT column1, column2, ...


FROM table1
UNION
SELECT column1, column2, ...
FROM table2;
MINUS
● MySQL
○ does not support the MINUS operator
○ you can use an anti-join (LEFT JOIN ... IS NULL) or NOT EXISTS to achieve similar results (see the sketch after the EXCEPT example below)
● PostgreSQL
○ The EXCEPT operator returns all rows from the first SELECT statement that are not
returned by the second SELECT statement

SELECT column1, column2, ...


FROM table1
EXCEPT
SELECT column1, column2, ...
FROM table2;
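
A minimal NOT EXISTS sketch that mimics EXCEPT in MySQL, assuming both tables share column1 and column2 (note that NULL handling differs slightly: EXCEPT treats NULLs as equal, plain equality does not):

SELECT t1.column1, t1.column2
FROM table1 t1
WHERE NOT EXISTS (
    SELECT 1
    FROM table2 t2
    WHERE t2.column1 = t1.column1
      AND t2.column2 = t1.column2
);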
INDEX
● a database object designed to speed up data retrieval from a table by creating a
searchable physical structure
● enable quick data access without scanning every row each time a table is queried,
making them especially beneficial for large tables
● often used to enhance query performance for operations involving JOINs, WHERE
clauses, or ORDER BY clauses.

Syntax: CREATE INDEX index_name ON table_name (column1, column2, ..., columnN);

● Trade-offs: While indexes significantly improve query speed, they do take up


additional disk space and can slow down write operations (INSERT, UPDATE,
DELETE) because the index also needs to be updated. So it's crucial to choose
which columns to index wisely based on query patterns, as unnecessary indexes
can lead to wasted resources and poor insert/update performance.
Subqueries
● aka inner query or nested query
● a query embedded within another SQL statement
● a SELECT query that is used as part of the main query
● A subquery can return one or more values and can be used in various parts of
the main query, such as in the SELECT, WHERE, and FROM clauses
● Use:
○ used in situations where multiple retrievals are required before the final
result can be obtained
○ useful for breaking down complex problems into simpler, more
manageable pieces
Uses of subqueries
● Some common scenarios include:
○ Retrieving a single summarized value to be used in the main query (e.g.,
an average or maximum).
○ Fetching a list of values to be used as a filter in the main query's WHERE
clause.
○ Creating derived tables on-the-fly which can be utilized by the outer
query.
○ Checking the existence of specific records before performing an action.
○ Comparing each row against a set of rows to determine relationships
(e.g., greater than the average).
● A SQL statement can contain multiple subqueries (even subqueries inside of
subqueries)
Where can subqueries occur?
Subqueries can occur in various parts of an SQL statement, including:
● SELECT clause: Here, it can return a single value for each row processed by
the main query.
● WHERE clause: Used to filter results based on the outcome of the subquery.
Common operators used with subqueries in the WHERE clause include IN,
NOT IN, EXISTS, NOT EXISTS, <, >, <=, >=, =, and <>.
● FROM clause: When used here, the subquery acts as a derived table which
the main query can reference.
● HAVING clause: Useful for filtering results of group by clauses based on the
outcome of a subquery.
● INSERT, UPDATE, DELETE: To determine which records to modify or remove
based on the results of a subquery.
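Two minimal sketches, one subquery in the WHERE clause and one in the FROM clause (the employee table and its columns are hypothetical):

-- WHERE clause: employees earning above the company-wide average
SELECT name, salary
FROM employee
WHERE salary > (SELECT AVG(salary) FROM employee);

-- FROM clause: a derived table of per-department averages
SELECT d.department, d.avg_salary
FROM (
    SELECT department, AVG(salary) AS avg_salary
    FROM employee
    GROUP BY department
) AS d
WHERE d.avg_salary > 50000;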
Advanced SQL
Views & Materialized Views
VIEW
● a virtual table based on the result set of a SQL statement
● unlike tables, views do not store data physically
● they pull data from one or multiple tables every time they are queried
● used when data is to be accessed infrequently and the data in a table gets
updated on a frequent basis
● dynamic nature of views allow them to always provide up-to-date results based on
the current data in the underlying tables

CREATE VIEW view_name AS


SELECT column1, column2, ...
FROM table_name
WHERE condition;
Materialized View
● similar to a regular view in that it represents the result of a stored query
● unlike a regular view, a Materialized View stores the result of that query in a
physical table
● a snapshot of the data at the time the MV is created or refreshed
● Use:
○ to optimize query performance
○ For aggregations and computations over large datasets
○ When the underlying data doesn't change frequently but reads are
frequent
○ In data warehousing scenarios where query performance is critical
Syntax
1. Create MV
CREATE MATERIALIZED VIEW view_name AS
SELECT ...
FROM ...
WHERE ...;

2. Refresh MV
REFRESH MATERIALIZED VIEW view_name;
Views VS Materialized Views
● Regular Views:
○ When to Use: Useful when the data needs to be always up-to-date with the underlying
tables
○ Best for situations where the underlying data changes frequently but the view is not
queried often
○ Example: A report that displays current employee data, which is changing constantly
due to new hires, departures, or role changes.
● Materialized Views:
○ When to Use: Best for situations where the data doesn't change frequently but the view is
queried often, especially for complex aggregations or joins that are computationally
expensive.
○ Example: A monthly sales report that aggregates sales data over multiple tables. The
data might only be updated once a day or once a week, but the report might be
accessed frequently. Using a MV can make accessing this report much faster.
Common Table Expressions (CTEs)
Common Table Expression (CTE)
● temporary named result sets that exist for just one query
● can be referenced within a SELECT, INSERT, UPDATE, or DELETE statement
● can be self-referencing or even recursive
● defined using the WITH clause

Why are CTEs used?


● Improved readability, better visibility
● Decomposing Complicated Queries
● Recursive Queries
● Referencing the Same Dataset Multiple Times
Syntax
WITH cte_name AS (
-- CTE query here
)
-- Main query using the CTE
SELECT ...
FROM cte_name ...
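
A minimal sketch of a CTE that is then filtered by the main query (PostgreSQL-style DATE_TRUNC; the orders table and its columns are hypothetical):

WITH monthly_sales AS (
    SELECT customer_id,
           DATE_TRUNC('month', order_date) AS month,
           SUM(amount) AS total_amount
    FROM orders
    GROUP BY customer_id, DATE_TRUNC('month', order_date)
)
SELECT customer_id, month, total_amount
FROM monthly_sales
WHERE total_amount > 1000;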
Subqueries VS CTE
● CTEs can make queries more readable compared to subqueries, especially when
the same subquery would need to be used in multiple places
● CTEs must always be named. Only PostgreSQL insists that subqueries must have a
name
● Subqueries can be used in any part of the main query, while CTEs are defined at
the beginning and can then be referenced by the main query
● CTEs can be recursive, subqueries cannot
● Subqueries can be used in UPDATE clause or WHERE clause in conjunction with
the keywords IN or EXISTS, but you can't do this with CTEs
Window Functions
Window Function
● a type of function that performs a calculation across a set of table rows that are
somehow related to the current row
● this "set of related rows" is termed as a "window"
● Aka windowing functions, OVER functions or analytics functions

Why use window functions?


● Aggregation Without Grouping
● Flexible Calculations
● Complex Data Analysis Made Simpler
Window functions
1. Aggregation functions: SUM(), AVG(), COUNT(), MIN(), MAX()

2. Ranking functions: ROW_NUMBER(), RANK(), DENSE_RANK(), NTILE(n)

3. Value functions: LAG(), LEAD(), FIRST_VALUE(), LAST_VALUE(),

NTH_VALUE()

4. Statistical functions

5. Cumulative distribution and percentile functions


Basic windowing syntax
SELECT column_name1,
window_function(column_name2)
OVER([PARTITION BY column_name1] [ORDER BY column_name3]) AS
new_column
FROM table_name;

window_function = any aggregate or ranking function
column_name1 = column to be selected (here it is also the column used to partition the rows)
column_name2 = column on which the window function is applied
column_name3 = column by which rows are ordered within each partition
new_column = name of the new column
table_name = name of the table
OVER clause
● defines the window function
● determines the range or set of rows used in the window function's calculation for each
row
○ Without OVER: Aggregate functions like SUM, AVG, and COUNT return a single
value for the entire set of rows.
○ With OVER: These same functions return a value for each row based on the
window of rows defined by the OVER clause.
Example: Suppose you want to calculate the running total of total_cases for each state,
ordered by the number of total cases. Furthermore, you'd like to see this running total for
each region, or "partition", based on the active_ratio_percent.

SELECT
state,total_cases,active_ratio_percent,
SUM(total_cases) OVER (PARTITION BY active_ratio_percent
ORDER BY total_cases DESC) as running_total_per_active_ratio
FROM
covid
ORDER BY
active_ratio_percent, total_cases DESC;
PARTITION BY
● divides the query result set into partitions
● the window function is applied separately to each partition
● similar to the GROUP BY clause, but, instead of aggregating the data, it
retains the original rows and computes the function over each partition
● PARTITION BY column_name
● only those columns made available by the FROM clause can be used in
PARTITION BY
● aliases in the SELECT list can’t be used for partition
ORDER BY
● within the defined window, rows can be ordered using this clause
● i.e. it defines the logical order of rows within each partition of the result set
● crucial while calculating running totals or cumulative metrics
● if not specified, default order is ASC & the window function will use all rows in
the partition
● If ORDER BY is specified but frame is not specified, default is:
○ ROWS/ RANGE BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW
● like PARTITION BY, ORDER BY can also use only those columns specified by
the FROM clause
● an integer position can’t be used to refer to a column name or alias
Frame specification
Determines the range of rows to include in the calculation for the current row
Syntax components:
1. ROWS or RANGE
● ROWS defines the frame by a number of rows
● RANGE defines the frame by value intervals
2. Start and End boundaries
○ UNBOUNDED PRECEDING: from the first row of the partition
○ ‘X’ PRECEDING: ‘x’ number of rows before the current row
○ CURRENT ROW: current row
○ ‘X’ FOLLOWING: ‘x’ number of rows after the current row
○ UNBOUNDED FOLLOWING: from the last row (till the current or
preceding row)
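A minimal sketch of a 3-row moving average using a ROWS frame (reusing the covid table from the earlier example):

SELECT state,
       total_cases,
       AVG(total_cases) OVER (
           ORDER BY total_cases
           ROWS BETWEEN 2 PRECEDING AND CURRENT ROW
       ) AS moving_avg_3_rows
FROM covid;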
Aggregate window functions
● COUNT()
● SUM()
● AVG()
● MIN()
● MAX()

Benefits
● Allows for complex calculations that consider both the current row and a related set of rows.
● Facilitates the creation of running totals, moving averages, and other cumulative metrics.
● Provides insights without the need for self-joins or subqueries.
Aggregate functions VS aggregate window functions
● both perform operations over a set of rows
● the way they present the results and the context in which they operate makes
them distinct
1. Result Set Size:
● Regular aggregate functions reduce/condense multiple rows into a single
result row /per group (often used with GROUP BY clause)
● Aggregate window functions return a value for each row in the result, based
on the window defined by the OVER clause (no need to use GROUP BY) i.e.
they retain the original number of rows

2. Context of Operation
3. Scope of Operation
Ranking Window Functions: ROW_NUMBER( ), RANK( ), DENSE_RANK( ), NTILE(n)

1. ROW_NUMBER( )
● a window function that assigns a unique sequential integer to rows within a partition of a result
set
● starts at 1 and numbers the rows according to the ORDER BY part of the window statement
● within a partition, no two rows can have the same row number
● no need to specify a variable within the parentheses

Why is it used?
● To generate unique numbers for each row, even if they have the same values
● Useful for pagination in applications
● Helps in data deduplication by identifying duplicates and retaining only unique rows
More about ROW_NUMBER( )

● No Duplicates: Unlike other ranking functions, ROW_NUMBER() does not


account for duplicates. Two rows with identical values will still get different
row numbers

● Ordering: The order of the numbers depends on the ORDER BY clause. If no


order is defined, the numbering can be arbitrary

● Partitions: ROW_NUMBER() can be reset with different partitions using the


PARTITION BY clause. This allows for numbering within subsets of your data
2. RANK( )
● a window function that assigns a unique rank to each distinct row within
a partition of a result set

Why is it used?
● To rank items in a specific order (e.g., top N items).
● Useful for competitive scenarios where positions might be shared, such
as in leaderboards or sales performance
● Assigns the same rank to rows with identical values (ties) but skips the
subsequent rank(s)
More about RANK( )

● Handling Ties: If two (or more) rows have the same value, RANK() will
give them the same rank. The next rank, however, will be skipped. For
example, if two rows receive a rank of 2, the next row will be ranked 4, not
3
● Ordering: The rank numbers are determined based on the ORDER BY
clause
● Partitions: Using the PARTITION BY clause, RANK() can start over with
each partition, allowing for ranking within specific categories or groups
3. DENSE_RANK( )
● assigns a unique rank within the ordering of a result set
● just like rank function, first row is assigned rank 1 and rows having same
value have same rank
● difference between RANK() and DENSE_RANK() is that in
DENSE_RANK(), for the next rank after two same rank, consecutive
integer is used, no rank is skipped

Why is it used?
● Perfect for scenarios where you want consecutive ranking without gaps
● Essential for data analyses where you don’t want skipped rank numbers
4. NTILE(n)
● distributes the rows in an ordered result set into n approximately equal
parts
● returns the group number for each of the rows in the partition

Why is it used?
● Useful for dividing datasets into quartiles, deciles, percentiles, or any
other sub-partitions
● Helps to analyze the distribution of data
● Flexible Bucketing: NTILE() is a flexible function that can create any
number of buckets as defined
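A minimal sketch comparing the four ranking functions side by side (the exam_scores table and its columns are hypothetical):

SELECT student,
       score,
       ROW_NUMBER() OVER (ORDER BY score DESC) AS row_num,
       RANK()       OVER (ORDER BY score DESC) AS rnk,
       DENSE_RANK() OVER (ORDER BY score DESC) AS dense_rnk,
       NTILE(4)     OVER (ORDER BY score DESC) AS quartile
FROM exam_scores;

On tied scores, row_num still increases, rnk repeats and then skips, dense_rnk repeats without skipping, and quartile assigns each row to one of four roughly equal buckets.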
LAG( ) and LEAD()
Often, we might want to compare the value of the current row to that of the
preceding or succeeding row. LAG( ) and LEAD( ) window functions are used
for this purpose

Uses:
● Historical comparison ● Data smoothing
● Pattern / Trend identification ● Data anomaly detection
● Data gap analysis ● Shifting period analysis, etc.
● Sequential analysis
1. LAG( )
● provides access to a row at a specified physical offset which comes
before the current row
● simply, it allows us to get the value from a previous row in the result set
● the default offset for LAG() is 1, but we can specify other offsets
………………………………………
LAG(expression, [offset], [default])
OVER (
[PARTITION BY partition_expression, ... ]
ORDER BY sort_expression [ASC | DESC], ... )
………………………………………
2. LEAD( )
● provides access to a row at a specified physical offset that follows the
current row
● allows us "look forward" in our dataset without actually changing the
position of the current row

Arguments of this function:


1. Value expression: The column or expression whose value from a "leading"
row you want to retrieve
2. Offset (optional): The number of rows ahead of the current row from
which to retrieve a value. The default is 1
3. Default value (optional): A value to return if the function tries to look
beyond the last row in the dataset. If not specified, it returns NULL

● Syntax: LEAD(value_expression [,offset [,default_value]])


Basic Syntax of LEAD( ):

LEAD(column_name [, offset [, default_value]]) OVER (

[PARTITION BY partition_expression, ... ]

ORDER BY sort_expression [ASC | DESC], ...

)
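A minimal LAG()/LEAD() sketch on a hypothetical daily_sales table, comparing each day with its neighbours (the default value 0 is returned at the edges of the result set):

SELECT sale_date,
       revenue,
       LAG(revenue, 1, 0)  OVER (ORDER BY sale_date) AS previous_day,
       LEAD(revenue, 1, 0) OVER (ORDER BY sale_date) AS next_day,
       revenue - LAG(revenue, 1, 0) OVER (ORDER BY sale_date) AS change_from_previous
FROM daily_sales;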
Other window functions

1. PERCENT_RANK( ):
○ returns the relative rank of each row within the result set as a percentage
○ useful in determining the relative position of a value within a dataset
○ eg: PERCENT_RANK() OVER (ORDER BY score) AS percentile_rank

2. FIRST_VALUE( ):
○ returns the first value in the ordered partition
○ useful in comparing each row with the starting point in a dataset
○ eg: FIRST_VALUE(column_name) OVER (ORDER BY date) AS starting_score
3. LAST_VALUE( ):
○ returns the last value in the ordered partition
○ useful in comparing each row to the end point of a dataset
○ eg: LAST_VALUE(score) OVER (ORDER BY date) AS ending_score

4. NTH_VALUE( ):
○ returns the nth value in the ordered partition
○ useful to compare or retrieve a specific position's value within a dataset
○ eg: NTH_VALUE(score, 3) OVER (ORDER BY date) AS third_score

5. CUME_DIST( ):
○ returns the cumulative distribution of a value, as a percentage of all rows
○ helps to understand the relative position of a value in terms of the distribution of values
○ eg: CUME_DIST() OVER (ORDER BY score) AS cumulative_distribution
Key Takeaways

1. Window Functions allow advanced analytics on a set of rows related to the

current row without collapsing them

2. ROW_NUMBER() assigns unique sequential integers

3. RANK() and DENSE_RANK() handle ranking, with the latter avoiding gaps

4. NTILE(n) divides data into 'n' equal parts, ideal for quartiles or deciles

5. LAG() and LEAD() compare data between previous, current, and future rows,

enabling trend analysis


Procedures, Functions & Triggers
1. SQL STORED PROCEDURE
● a precompiled SQL statement which can be stored inside the database, so the code can

be reused over and over again

● aka stored procedure

● a procedure has a name, list of parameters, and compiled SQL statements

● In SQL, a procedure does not return any value, but executes certain tasks or operations

● so if you have an SQL query that you need to write over and over again, save it as a stored

procedure, and then just call it to execute it
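
A minimal PostgreSQL-style sketch (PostgreSQL 11 or later; the orders and orders_archive tables and the cutoff logic are hypothetical):

CREATE OR REPLACE PROCEDURE archive_old_orders(cutoff DATE)
LANGUAGE plpgsql
AS $$
BEGIN
    -- move old rows into an archive table, then remove them
    INSERT INTO orders_archive SELECT * FROM orders WHERE order_date < cutoff;
    DELETE FROM orders WHERE order_date < cutoff;
END;
$$;

CALL archive_old_orders('2023-01-01');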


2. SQL FUNCTIONS
● a function is a predefined routine that returns a value

Types of SQL Functions

1. Built-in Functions:
● Scalar Functions: return a single value based on input value [eg: UPPER(), LOWER(),
ROUND(), LEN(), ABS(), SUBSTRING()]
● Aggregate Functions: operate on many records and return a summarized result [eg:
SUM(), AVG(), MAX(), MIN(), COUNT()]
● Window Functions: operate over a set of table rows and return a single aggregated value
for each row [eg: ROW_NUMBER(), RANK(), LEAD(), LAG(), FIRST_VALUE(),
LAST_VALUE()]
● Table Functions: return a table as a result [eg: UNNEST]
2. User-Defined Functions (UDF)
Function VS Procedure

Definition: a function is used to calculate a result using given inputs; a procedure is used to perform a certain task in order.
Call: a function can be called by a procedure; a procedure cannot be called by a function.
DML: DML statements cannot be executed within a function; DML statements can be executed within a procedure.
Query: a function can be called within a query; a procedure cannot be called within a query.
Compilation: whenever a function is called, it is first compiled before being called; a procedure is compiled once and can be called multiple times without being recompiled.
Return: a function returns a value and control to the calling code; a procedure returns control but no value to the calling code.
try-catch: a function has no support for try-catch blocks; a procedure has support for try-catch blocks.
SELECT: a SELECT statement can have a function call; a SELECT statement can't have a procedure call.
Explicit transaction handling: a function cannot have explicit transaction handling; a procedure can use explicit transaction handling.
SQL UDFs :
● Apart from the built-in functions provided by SQL, we can create our own functions,

known as User-Defined Functions (UDFs)

● These functions encapsulate a series of SQL statements into a single compound

statement, so that the users can use them just like the built-in functions of SQL

CREATE OR REPLACE FUNCTION function_name(parameter_list)


RETURNS return_datatype AS $$
BEGIN
executable_section
END;
$$ LANGUAGE plpgsql;

SELECT function_name(argument_list);

SELECT * FROM function_name(argument_list);


Basic Syntax (with variable declaration and exception handling)
CREATE OR REPLACE FUNCTION function_name(parameter_list)
RETURNS return_datatype AS $$
DECLARE
declaration_section (optional)
BEGIN
executable_section
EXCEPTION
exception_handling (optional)
END;
$$ LANGUAGE plpgsql;
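A minimal concrete UDF sketch following the syntax above (the orderitem table and its columns are hypothetical):

CREATE OR REPLACE FUNCTION order_total(p_order_id INT)
RETURNS NUMERIC AS $$
BEGIN
    RETURN (SELECT COALESCE(SUM(quantity * unitprice), 0)
            FROM orderitem
            WHERE orderid = p_order_id);
END;
$$ LANGUAGE plpgsql;

SELECT order_total(42);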
3. SQL TRIGGERS
● a trigger is a function invoked automatically before or after an event on a table or view.
● purpose: maintain data integrity, automate tasks, and set up specific conditions for data
modifications

Types of Triggers (in PostgreSQL)


1. Based on Timing:
● BEFORE Triggers: Execute before the event.
● AFTER Triggers: Execute after the event.
● INSTEAD OF: Used for views in place of the actual operation.
2. Based on Event:
● INSERT Triggers: Activate on data insertion.
● UPDATE Triggers: Activate on data update.
● DELETE Triggers: Activate on data deletion.
● TRUNCATE Triggers: Activate on table truncation.
● when we define a trigger, it is always associated with a function

Basic Syntax to define the trigger function


CREATE OR REPLACE FUNCTION trigger_function_name() RETURNS TRIGGER AS $$
BEGIN
executable_section
END;
$$ LANGUAGE plpgsql;

● to remove a trigger and its function, drop the trigger first, then drop the function (PostgreSQL will not drop a function that a trigger still depends on unless CASCADE is used):

DROP FUNCTION trigger_function_name();


● the function gets executed when the trigger fires

Basic Syntax

CREATE TRIGGER trigger_name
[BEFORE | AFTER | INSTEAD OF]
[INSERT | UPDATE | DELETE | TRUNCATE]
ON table_name
[REFERENCING OLD AS old NEW AS new]
[FOR EACH ROW | FOR EACH STATEMENT]
EXECUTE FUNCTION trigger_function();

Example

CREATE TRIGGER trigger_name
BEFORE INSERT
ON table_name
FOR EACH ROW
EXECUTE FUNCTION trigger_function();

● the trigger itself can be deleted (before dropping its function) as:

DROP TRIGGER IF EXISTS trigger_name ON table_name;
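
A concrete sketch of the full flow (the customer table and its updated_at column are hypothetical):

CREATE OR REPLACE FUNCTION set_updated_at() RETURNS TRIGGER AS $$
BEGIN
    NEW.updated_at := NOW();   -- stamp the row being written
    RETURN NEW;
END;
$$ LANGUAGE plpgsql;

CREATE TRIGGER trg_set_updated_at
BEFORE UPDATE ON customer
FOR EACH ROW
EXECUTE FUNCTION set_updated_at();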


4. CRON
● CRON is a time-based job scheduler in Unix-like operating systems
● enables users to schedule jobs (commands or scripts) to run periodically at fixed times, dates, or
intervals

How it works:
● Uses a cron expression (* * * * *), whose five fields represent minute, hour, day of month, month, and day of week.
● Example: 30 4 * * * runs a job daily at 4:30 AM

Uses:
● Automating backups, sending emails, system maintenance, reporting, notifications, data sync,
auto-updating content, resource monitoring, batch jobs, automated testing, etc.
● In SQL,
○ CRON can be used to schedule recurring database tasks
○ eg: Regular backups, data cleanup, refreshing materialized views, generating reports
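A minimal sketch of a crontab entry that runs a SQL task on the schedule shown above (the mydb database and sales_summary_mv view are hypothetical, and real setups also need connection credentials):

# crontab entry: refresh a materialized view every day at 4:30 AM
30 4 * * * psql -d mydb -c "REFRESH MATERIALIZED VIEW sales_summary_mv;"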
Select Statement demonstrating various features

SELECT
    c.Id, c.FirstName, c.LastName,                          -- SELECT clause with aliases
    COUNT(o.Id) AS total_orders,                            -- aggregate functions
    SUM(oi.quantity * oi.UnitPrice) AS total_spent,
    AVG(oi.quantity) AS avg_items_per_order,
    MAX(o.OrderDate) AS latest_order_date
FROM customer c
JOIN "order" o ON c.Id = o.CustomerId                       -- joins ("order" quoted because ORDER is a reserved word; use backticks in MySQL)
LEFT JOIN OrderItem oi ON o.Id = oi.OrderId
WHERE c.country LIKE 'USA' AND o.OrderDate <= '2023-01-01'  -- WHERE clause with LIKE
GROUP BY c.Id, c.FirstName, c.LastName                      -- GROUP BY clause
--HAVING total_spent > 1000                                 -- commented-out HAVING clause
ORDER BY total_spent DESC                                   -- sorting
LIMIT 10;                                                   -- limit the number of returned rows


Online Resources
● RDBMS Design, Data Modeling, and Basic Concepts:
○ Overall notes on Database Design and Data Modeling
○ ERD:
■ https://www.lucidchart.com/pages/er-diagrams
■ https://creately.com/blog/diagrams/er-diagrams-tutorial/
■ https://www.guru99.com/er-diagram-tutorial-dbms.html
■ https://www.youtube.com/playlist?list=PL3R9-um41JsxPg4WAPeEZgH6oAk2oti0Q
■ https://drive.google.com/file/d/11JyZ3p2QCmxpJhthKtS9mZbgD37vRD65/view

● SQL
○ W3schools SQL tutorial: https://www.w3schools.com/sql/default.asp
○ W3schools MySQL tutorial: https://www.w3schools.com/mysql/
○ Mode SQL tutorial: https://mode.com/sql-tutorial
○ Difference between Views and Materialized Views | tutorialspoint
○ CTE in SQL | geeksforgeeks
○ CTE vs subquery | learnsql blog
○ Window functions in SQL | geeksforgeeks
○ SQL Window Function Example | learnsql blog
○ Window Functions - Analytics Vidhya
○ Top 5 Most Popular Window Functions | mode
○ Window functions | PostgreSQL documentation
○ SQL Procedural Language
○ Difference between Function and Procedure

● Data Wrangling with SQL:


○ Mode Tutorial: https://mode.com/sql-tutorial/data-wrangling-with-sql
○ Advanced Data Wrangling Techniques with SQL (blog): https://talent500.co/blog/advanced-data-wrangling-techniques-with-sql/
○ MY DATA WRANGLING PROJECT WITH SQL (MYSQL):
https://medium.com/@jennifer.ezeanyim/my-data-wrangling-project-with-sql-mysql-50b6e6e53ba2

● Practice SQL from: https://www.sql-practice.com/


Assignments
● Available in Fuse Classroom
PHASE II : Data Wrangling Project
Demonstration
THANK YOU
