Data Wrangling With SQL
April 25
CONTENT
SN Topics
Phase I (6:30 PM - 7:15 PM) - By Pallavi Shrestha
1 Introduction and Briefing of the Session
3 Overview of RDBMS
SN Topics
Phase II (7:25 PM - 8:15 PM) - By Sudip Bala
9 Real World Data Wrangling Project with SQL
c. Date formats
Purpose:
- Improve data quality
- Enable faster analysis
- Enhance model accuracy
SQL is the language for talking to databases.
SQL dialects are vendor-specific variations of SQL, such as T-SQL (Microsoft SQL Server), PL/SQL (Oracle), and the variants implemented by MySQL and PostgreSQL.
RDBMS:
● MySQL
● PostgreSQL
Tools:
● DBeaver
● MySQL Workbench
For Linux Terminal (Ubuntu/Debian):
1. Install PostgreSQL
sudo apt install postgresql
RDBMS examples
● Oracle
● MySQL
● SQLite
● PostgreSQL
● Microsoft SQL Server
● MariaDB
Schema is a logical container that holds a collection of database objects, such as tables,
views, indexes, and more
● defines the structure of the database and the organization of data within it
● helps in segregating different types of data and provides a way to manage permissions
and access control for different user roles
Table is a fundamental database object that stores data in rows and columns, forming a
two-dimensional structure
● each column corresponds to an attribute of the data
● each row represents an individual record or entry
● used to represent different entities in the real world, such as customers, orders, products,
employees, etc.
● relationships between tables are established through keys, such as primary keys and
foreign keys.
Basic SQL Commands
DB Implementation
● DDL (Data Definition Language)
○ for defining and managing the structure of the database, including tables,
relationships, indexes, and constraints
○ DDL commands are used to create, modify, and delete database objects
○ CREATE, DROP, ALTER, TRUNCATE, RENAME, COMMENT
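The DDL commands listed above can be tried out with a small sketch. It assumes Python's built-in sqlite3 module and an in-memory SQLite database; the table and column names are made up for illustration. SQLite has no TRUNCATE or COMMENT, so only CREATE, ALTER, RENAME, and DROP are shown.

```python
import sqlite3

# In-memory SQLite database; table and column names are illustrative.
conn = sqlite3.connect(":memory:")

# CREATE: define a new table with a primary key and a NOT NULL constraint.
conn.execute("CREATE TABLE customer (id INTEGER PRIMARY KEY, name TEXT NOT NULL)")

# ALTER: add a column to the existing table.
conn.execute("ALTER TABLE customer ADD COLUMN country TEXT")

# RENAME: in SQLite, renaming a table is a form of ALTER TABLE.
conn.execute("ALTER TABLE customer RENAME TO client")

# Inspect the resulting structure: PRAGMA table_info returns one row per column.
columns = [row[1] for row in conn.execute("PRAGMA table_info(client)")]
print(columns)

# DROP: remove the table and its data entirely.
conn.execute("DROP TABLE client")
tables = conn.execute(
    "SELECT name FROM sqlite_master WHERE type = 'table'"
).fetchall()
print(tables)
```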
Equal to =
Not equal to <> or !=
Greater than >
Less than <
Greater than or equal to >=
Less than or equal to <=
Logical Operators
● Allow us to use multiple comparison operators in one query
● To filter the data by more than one condition
○ LIKE - to match similar values, instead of exact values.
○ IN - to specify a list of values you'd like to include.
○ BETWEEN - to select only rows within a certain range.
○ IS NULL - to select rows that contain no data in a given column.
○ AND - to select only rows that satisfy two conditions.
○ OR - to select rows that satisfy either of two conditions.
○ NOT - to select rows that do not match a certain condition
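A minimal sketch of each logical operator in a WHERE clause, assuming Python's sqlite3 module and a hypothetical product table (the names and values are invented for this example):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE product (name TEXT, category TEXT, price REAL)")
conn.executemany(
    "INSERT INTO product VALUES (?, ?, ?)",
    [
        ("Laptop", "tech", 900.0),
        ("Lamp", "home", 25.0),
        ("Phone", "tech", 600.0),
        ("Sofa", "home", None),  # price unknown, stored as NULL
        ("Pen", "office", 2.0),
    ],
)

def names(where):
    """Return product names matching a WHERE condition, sorted by name."""
    sql = f"SELECT name FROM product WHERE {where} ORDER BY name"
    return [row[0] for row in conn.execute(sql)]

like_rows = names("name LIKE 'La%'")                 # names starting with 'La'
in_rows = names("category IN ('home', 'office')")    # value is in a list
between_rows = names("price BETWEEN 20 AND 700")     # inclusive range
null_rows = names("price IS NULL")                   # no data in the column
and_rows = names("category = 'tech' AND price > 700")
or_rows = names("category = 'office' OR price > 700")
not_rows = names("NOT category = 'tech'")
```

Note that the row with a NULL price is matched only by IS NULL; comparisons such as BETWEEN never match NULL.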
Wild Cards
● Used to substitute one or more characters in a string
● Especially useful in database systems for pattern searching within strings
● Used with LIKE operator
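As an illustration of the two standard LIKE wildcards, % (any sequence of characters, including none) and _ (exactly one character), here is a short sketch using Python's sqlite3 module. The city table is hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE city (name TEXT)")
conn.executemany(
    "INSERT INTO city VALUES (?)",
    [("Pokhara",), ("Patan",), ("Paris",), ("Bhaktapur",), ("Pune",)],
)

def match(pattern):
    """Return city names matching a LIKE pattern, sorted by name."""
    sql = "SELECT name FROM city WHERE name LIKE ? ORDER BY name"
    return [row[0] for row in conn.execute(sql, (pattern,))]

starts_with_p = match("P%")      # 'P' followed by anything
contains_ta = match("%ta%")      # 'ta' anywhere in the name
five_letters_p = match("P____")  # 'P' followed by exactly four characters
```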
Syntax
SELECT column_name AS alias_name
FROM table_name;
JOINS
Used to combine rows from 2 or more tables based on related columns between them
Type | Returns
LEFT / LEFT OUTER | All rows from the left table + only matching rows from the right table
RIGHT / RIGHT OUTER | All rows from the right table + only matching rows from the left table
CROSS JOIN | Produces a Cartesian product of the two tables, returning all possible combinations of rows
SELF / Recursive | Joins a table with itself to compare rows within the same table
NATURAL | Joins two tables on all columns with the same name in both tables
FULL OUTER (postgres) | All rows from both tables; where a row has no match on the other side, that side's columns are NULL
LATERAL (postgres) | Allows a subquery in the FROM clause to refer to columns of the preceding table(s). Useful when dealing with set-returning functions or other subqueries that need to reference the main query.
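A few of the join types above, sketched with Python's sqlite3 module. The customer and orders tables are hypothetical; FULL OUTER and LATERAL are left out because they depend on the database engine and version.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE customer (id INTEGER, name TEXT);
    CREATE TABLE orders (id INTEGER, customer_id INTEGER, amount REAL);
    INSERT INTO customer VALUES (1, 'Asha'), (2, 'Bikram'), (3, 'Chandra');
    INSERT INTO orders VALUES (10, 1, 50.0), (11, 1, 20.0), (12, 2, 75.0);
""")

# LEFT JOIN: every customer appears; customers without orders get NULL (None).
left_rows = conn.execute("""
    SELECT c.name, o.amount
    FROM customer c
    LEFT JOIN orders o ON o.customer_id = c.id
    ORDER BY c.name, o.amount
""").fetchall()

# CROSS JOIN: every customer paired with every order (3 x 3 = 9 rows).
cross_count = conn.execute(
    "SELECT COUNT(*) FROM customer CROSS JOIN orders"
).fetchone()[0]

# SELF JOIN: pair each customer with every customer that has a larger id.
self_rows = conn.execute("""
    SELECT a.name, b.name
    FROM customer a
    JOIN customer b ON a.id < b.id
    ORDER BY a.id, b.id
""").fetchall()
```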
JOINS - example
JOINS on multiple keys
● When we need to join tables based on more than one key or condition, we can use
multiple join conditions.
● Common in scenarios where a single pair of keys isn't unique enough to combine
the tables correctly.
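To see why a single key may not be enough, the sketch below joins two hypothetical tables (sales and targets, both keyed by store and month) first on one key and then on both. It assumes Python's sqlite3 module.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE sales (store_id INTEGER, month TEXT, revenue INTEGER);
    CREATE TABLE targets (store_id INTEGER, month TEXT, target INTEGER);
    INSERT INTO sales VALUES (1, '2024-01', 100), (1, '2024-02', 120), (2, '2024-01', 80);
    INSERT INTO targets VALUES (1, '2024-01', 90), (1, '2024-02', 130), (2, '2024-01', 85);
""")

# One key only: each store-1 sale matches BOTH store-1 targets, inflating the result.
one_key_count = conn.execute("""
    SELECT COUNT(*)
    FROM sales s
    JOIN targets t ON s.store_id = t.store_id
""").fetchone()[0]

# Two keys: each sale matches exactly the target for the same store and month.
two_key_rows = conn.execute("""
    SELECT s.store_id, s.month, s.revenue, t.target
    FROM sales s
    JOIN targets t
      ON s.store_id = t.store_id
     AND s.month = t.month
    ORDER BY s.store_id, s.month
""").fetchall()
```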
END;
UNION
● The UNION operator is used to combine the result sets of two or more SELECT statements.
● Each SELECT statement within the UNION must have
○ same number of columns,
○ columns must also have similar data types,
○ columns must be in the same order.
● By default, UNION removes duplicate rows. Use UNION ALL if you want to keep duplicates
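The difference between UNION and UNION ALL can be checked with a small sketch, again assuming Python's sqlite3 module and two made-up tables that share one email address:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE online_customers (email TEXT);
    CREATE TABLE store_customers (email TEXT);
    INSERT INTO online_customers VALUES ('a@example.com'), ('b@example.com');
    INSERT INTO store_customers VALUES ('b@example.com'), ('c@example.com');
""")

# UNION: the shared address appears once.
union_rows = [row[0] for row in conn.execute("""
    SELECT email FROM online_customers
    UNION
    SELECT email FROM store_customers
    ORDER BY email
""")]

# UNION ALL: duplicates are kept.
union_all_rows = [row[0] for row in conn.execute("""
    SELECT email FROM online_customers
    UNION ALL
    SELECT email FROM store_customers
    ORDER BY email
""")]
```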
2. Refresh MV
REFRESH MATERIALIZED VIEW view_name;
Views VS Materialized Views
● Regular Views:
○ When to Use: Useful when the data needs to be always up-to-date with the underlying
tables
○ Best for situations where the underlying data changes frequently but the view is not
queried often
○ Example: A report that displays current employee data, which is changing constantly
due to new hires, departures, or role changes.
● Materialized Views:
○ When to Use: Best for situations where the data doesn't change frequently but the view is
queried often, especially for complex aggregations or joins that are computationally
expensive.
○ Example: A monthly sales report that aggregates sales data over multiple tables. The
data might only be updated once a day or once a week, but the report might be
accessed frequently. Using a MV can make accessing this report much faster.
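The freshness trade-off described above can be imitated in a sketch. SQLite has no materialized views, so this example is an approximation: a regular view stands for itself, and a stored snapshot table (an assumption of this sketch, not a real MV) plays the role of the materialized view that must be refreshed explicitly.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE sales (amount INTEGER);
    INSERT INTO sales VALUES (100), (200);

    -- Regular view: the query runs every time the view is read.
    CREATE VIEW sales_total_view AS
        SELECT SUM(amount) AS total FROM sales;

    -- Stand-in for a materialized view: the result is stored once.
    CREATE TABLE sales_total_snapshot AS
        SELECT SUM(amount) AS total FROM sales;

    -- The underlying data changes after the snapshot was taken.
    INSERT INTO sales VALUES (50);
""")

view_total = conn.execute("SELECT total FROM sales_total_view").fetchone()[0]
snapshot_total = conn.execute("SELECT total FROM sales_total_snapshot").fetchone()[0]

# Manual "refresh": recompute and store the result again.
conn.executescript("""
    DELETE FROM sales_total_snapshot;
    INSERT INTO sales_total_snapshot SELECT SUM(amount) FROM sales;
""")
refreshed_total = conn.execute("SELECT total FROM sales_total_snapshot").fetchone()[0]
```

The view reflects the new row immediately, while the snapshot stays stale until it is refreshed, which is the behaviour REFRESH MATERIALIZED VIEW addresses in PostgreSQL.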
Common Table Expressions (CTEs)
● temporary named result sets that exist for just one query
● can be referenced within a SELECT, INSERT, UPDATE, or DELETE statement
● can be self-referencing or even recursive
● defined using the WITH clause
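A short sketch of both kinds of CTE mentioned above, a plain one and a recursive one, assuming Python's sqlite3 module and a hypothetical orders table:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE orders (customer TEXT, amount INTEGER);
    INSERT INTO orders VALUES ('Asha', 50), ('Asha', 70), ('Bikram', 30), ('Chandra', 200);
""")

# Plain CTE: name an intermediate result, then query it like a table.
big_spenders = conn.execute("""
    WITH customer_totals AS (
        SELECT customer, SUM(amount) AS total
        FROM orders
        GROUP BY customer
    )
    SELECT customer, total
    FROM customer_totals
    WHERE total > 100
    ORDER BY customer
""").fetchall()

# Recursive CTE: the second SELECT refers to the CTE itself until n reaches 5.
numbers = [row[0] for row in conn.execute("""
    WITH RECURSIVE counter(n) AS (
        SELECT 1
        UNION ALL
        SELECT n + 1 FROM counter WHERE n < 5
    )
    SELECT n FROM counter
""")]
```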
NTH_VALUE()
4. Statistical functions
SELECT
    state, total_cases, active_ratio_percent,
    SUM(total_cases) OVER (PARTITION BY active_ratio_percent
                           ORDER BY total_cases DESC) AS running_total_per_active_ratio
FROM
    covid
ORDER BY
    active_ratio_percent, total_cases DESC;
PARTITION BY
● divides the query result set into partitions
● the window function is applied separately to each partition
● similar to the GROUP BY clause, but, instead of aggregating the data, it
retains the original rows and computes the function over each partition
● PARTITION BY column_name
● only those columns made available by the FROM clause can be used in
PARTITION BY
● aliases in the SELECT list can’t be used for partition
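The contrast with GROUP BY can be sketched as follows. The example assumes Python's sqlite3 module with SQLite 3.25 or newer (needed for window functions) and a simplified covid table with a hypothetical region column.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE covid (state TEXT, region TEXT, total_cases INTEGER);
    INSERT INTO covid VALUES
        ('A', 'North', 100), ('B', 'North', 300),
        ('C', 'South', 50),  ('D', 'South', 150);
""")

# GROUP BY collapses each region into a single row.
grouped = conn.execute("""
    SELECT region, SUM(total_cases)
    FROM covid
    GROUP BY region
    ORDER BY region
""").fetchall()

# PARTITION BY keeps every row and attaches the region total to each one.
windowed = conn.execute("""
    SELECT state, region, total_cases,
           SUM(total_cases) OVER (PARTITION BY region) AS region_total
    FROM covid
    ORDER BY state
""").fetchall()
```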
ORDER BY
● within the defined window, rows can be ordered using this clause
● i.e. it defines the logical order of rows within each partition of the result set
● crucial while calculating running totals or cumulative metrics
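● if ORDER BY is omitted, the window function uses all rows in the partition; when
ORDER BY is given without a direction, the sort order defaults to ASC
● If ORDER BY is specified but frame is not specified, default is:
○ RANGE BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW (rows that tie with the current row on the ORDER BY value are included)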
● like PARTITION BY, ORDER BY can also use only those columns specified by
the FROM clause
● an integer can’t be used as a positional reference to a column or alias (unlike in a query-level ORDER BY)
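The effect of ORDER BY inside OVER can be seen in a small running-total sketch. It assumes Python's sqlite3 module (SQLite 3.25+) and a hypothetical daily_sales table in which day 2 appears twice, to show how tied rows behave under the default frame.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE daily_sales (day INTEGER, amount INTEGER);
    INSERT INTO daily_sales VALUES (1, 10), (2, 20), (2, 5), (3, 30);
""")

rows = conn.execute("""
    SELECT day, amount,
           SUM(amount) OVER ()             AS grand_total,   -- no ORDER BY: whole partition
           SUM(amount) OVER (ORDER BY day) AS running_total  -- default frame ends at current row and its ties
    FROM daily_sales
    ORDER BY day, amount
""").fetchall()

for row in rows:
    print(row)
```

Both day-2 rows show the same running total because, under the default frame, rows that tie on the ORDER BY value are included together.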
Frame specification
Determines the range of rows to include in the calculation for the current row
Syntax components:
1. ROWS or RANGE
● ROWS defines the frame by a number of rows
● RANGE defines the frame by value intervals
2. Start and End boundaries
○ UNBOUNDED PRECEDING: from the first row of the partition
○ ‘X’ PRECEDING: ‘X’ rows before the current row
○ CURRENT ROW: current row
○ ‘X’ FOLLOWING: ‘X’ rows after the current row
○ UNBOUNDED FOLLOWING: up to the last row of the partition
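Two explicit frames, sketched with Python's sqlite3 module (SQLite 3.25+) and a hypothetical readings table: a three-row moving average and a "remaining total" that runs from the current row to the end of the partition.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE readings (t INTEGER, value INTEGER);
    INSERT INTO readings VALUES (1, 10), (2, 20), (3, 30), (4, 40);
""")

rows = conn.execute("""
    SELECT t,
           -- previous row, current row, and next row (fewer at the edges)
           AVG(value) OVER (ORDER BY t
                            ROWS BETWEEN 1 PRECEDING AND 1 FOLLOWING) AS moving_avg,
           -- current row through the last row of the partition
           SUM(value) OVER (ORDER BY t
                            ROWS BETWEEN CURRENT ROW AND UNBOUNDED FOLLOWING) AS remaining
    FROM readings
    ORDER BY t
""").fetchall()
```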
Aggregate window functions
● COUNT()
● SUM()
● AVG()
● MIN()
● MAX()
Benefits
● Allows for complex calculations that consider both the
current row and a related set of rows.
● Facilitates the creation of running totals, moving
averages, and other cumulative metrics.
● Provides insights without the need for self-joins or
subqueries.
Aggregate functions VS aggregate window functions
● both perform operations over a set of rows
● the way they present the results and the context in which they operate makes
them distinct
1. Result Set Size:
● Regular aggregate functions condense multiple rows into a single result row, or
one row per group when used with a GROUP BY clause
● Aggregate window functions return a value for each row in the result, based
on the window defined by the OVER clause (no need to use GROUP BY) i.e.
they retain the original number of rows
2. Context of Operation
3. Scope of Operation
Ranking Window Functions: ROW_NUMBER( ), RANK( ), DENSE_RANK( ), NTILE(n)
1. ROW_NUMBER( )
● a window function that assigns a unique sequential integer to rows within a partition of a result
set
● starts at 1 and numbers the rows according to the ORDER BY part of the window statement
● within a partition, no two rows can have the same row number
● takes no arguments, so the parentheses stay empty
Why is it used?
● To generate unique numbers for each row, even if they have the same values
● Useful for pagination in applications
● Helps in data deduplication by identifying duplicates and retaining only unique rows
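One common deduplication pattern with ROW_NUMBER() is to number the rows within each group of duplicates and keep only row 1. The sketch below assumes Python's sqlite3 module (SQLite 3.25+) and a hypothetical contacts table where the most recent entry per email should survive.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE contacts (email TEXT, updated_at TEXT);
    INSERT INTO contacts VALUES
        ('a@example.com', '2024-01-01'),
        ('a@example.com', '2024-03-01'),
        ('b@example.com', '2024-02-01');
""")

latest = conn.execute("""
    SELECT email, updated_at
    FROM (
        SELECT email, updated_at,
               -- newest row per email gets rn = 1
               ROW_NUMBER() OVER (PARTITION BY email
                                  ORDER BY updated_at DESC) AS rn
        FROM contacts
    ) AS ranked
    WHERE rn = 1
    ORDER BY email
""").fetchall()
```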
More about ROW_NUMBER( )
2. RANK( )
Why is it used?
● To rank items in a specific order (e.g., top N items).
● Useful for competitive scenarios where positions might be shared, such
as in leaderboards or sales performance
● Assigns the same rank to rows with identical values (ties) but skips the
subsequent rank(s)
More about RANK( )
● Handling Ties: If two (or more) rows have the same value, RANK() will
give them the same rank. The next rank, however, will be skipped. For
example, if two rows receive a rank of 2, the next row will be ranked 4, not
3
● Ordering: The rank numbers are determined based on the ORDER BY
clause
● Partitions: Using the PARTITION BY clause, RANK() can start over with
each partition, allowing for ranking within specific categories or groups
3. DENSE_RANK( )
● assigns a rank to each row within the ordering of a result set, with no gaps
between rank values
● just like the RANK() function, the first row is assigned rank 1 and rows with the
same value share the same rank
● the difference between RANK() and DENSE_RANK() is that DENSE_RANK() continues
with the next consecutive integer after a tie, so no rank is skipped
Why is it used?
● Perfect for scenarios where you want consecutive ranking without gaps
● Essential for data analyses where you don’t want skipped rank numbers
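The three ranking functions discussed so far can be compared side by side on data with a tie. This sketch assumes Python's sqlite3 module (SQLite 3.25+); the scores table is invented for the example.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE scores (player TEXT, score INTEGER);
    INSERT INTO scores VALUES ('Asha', 90), ('Bikram', 85), ('Chandra', 85), ('Dipa', 70);
""")

rows = conn.execute("""
    SELECT player, score,
           ROW_NUMBER() OVER (ORDER BY score DESC, player) AS row_num,    -- always unique
           RANK()       OVER (ORDER BY score DESC)         AS rnk,        -- ties share, next rank skipped
           DENSE_RANK() OVER (ORDER BY score DESC)         AS dense_rnk   -- ties share, no gap
    FROM scores
    ORDER BY score DESC, player
""").fetchall()

for row in rows:
    print(row)
```

With two players tied at 85, RANK() gives 1, 2, 2, 4 while DENSE_RANK() gives 1, 2, 2, 3.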
4. NTILE(n)
● distributes the rows in an ordered result set into n approximately equal
parts
● returns the group number for each of the rows in the partition
Why is it used?
● Useful for dividing datasets into quartiles, deciles, percentiles, or any
other sub-partitions
● Helps to analyze the distribution of data
● Flexible Bucketing: NTILE() is a flexible function that can create any
number of buckets as defined
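A brief NTILE sketch, assuming Python's sqlite3 module (SQLite 3.25+) and a hypothetical table of seven values split into three buckets. Seven rows cannot be divided evenly by three, so the first bucket receives the extra row.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE samples (value INTEGER)")
conn.executemany(
    "INSERT INTO samples VALUES (?)",
    [(v,) for v in (10, 20, 30, 40, 50, 60, 70)],
)

# NTILE(3): bucket number (1 to 3) for each row, in ascending order of value.
buckets = [row[1] for row in conn.execute("""
    SELECT value, NTILE(3) OVER (ORDER BY value) AS bucket
    FROM samples
    ORDER BY value
""")]
print(buckets)
```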
LAG( ) and LEAD()
Often, we might want to compare the value of the current row to that of the
preceding or succeeding row. LAG( ) and LEAD( ) window functions are used
for this purpose
Uses:
● Historical comparison
● Pattern / Trend identification
● Data gap analysis
● Sequential analysis
● Data smoothing
● Data anomaly detection
● Shifting period analysis, etc.
1. LAG( )
● provides access to a row at a specified physical offset which comes
before the current row
● simply, it allows us to get the value from a previous row in the result set
● the default offset for LAG() is 1, but we can specify other offsets
LAG(expression, [offset], [default])
OVER (
    [PARTITION BY partition_expression, ... ]
    ORDER BY sort_expression [ASC | DESC], ... )
2. LEAD( )
● provides access to a row at a specified physical offset that follows the
current row
● allows us to "look forward" in our dataset without actually changing the
position of the current row
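LAG() and LEAD() used together for a month-over-month comparison. This is a sketch under the assumption of Python's sqlite3 module (SQLite 3.25+) and a made-up monthly revenue table.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE monthly (month TEXT, revenue INTEGER);
    INSERT INTO monthly VALUES ('2024-01', 100), ('2024-02', 120), ('2024-03', 90);
""")

rows = conn.execute("""
    SELECT month, revenue,
           LAG(revenue, 1, 0) OVER (ORDER BY month)          AS prev_or_zero,  -- default 0 when no previous row
           LEAD(revenue) OVER (ORDER BY month)               AS next_revenue,  -- NULL when no next row
           revenue - LAG(revenue) OVER (ORDER BY month)      AS change         -- NULL for the first month
    FROM monthly
    ORDER BY month
""").fetchall()

for row in rows:
    print(row)
```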
Other window functions
1. PERCENT_RANK( ):
○ returns the relative rank of each row within the result set as a fraction between 0 and 1, computed as (rank - 1) / (total rows - 1)
○ useful in determining the relative position of a value within a dataset
○ eg: PERCENT_RANK() OVER (ORDER BY score) AS percentile_rank
2. FIRST_VALUE( ):
○ returns the first value in the ordered partition
○ useful in comparing each row with the starting point in a dataset
○ eg: FIRST_VALUE(column_name) OVER (ORDER BY date) AS starting_score
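3. LAST_VALUE( ):
○ returns the last value in the window frame of the ordered partition
○ useful in comparing each row to the end point of a dataset
○ with ORDER BY and the default frame, the frame ends at the current row, so the frame must be extended to reach the partition's last row
○ eg: LAST_VALUE(score) OVER (ORDER BY date ROWS BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING) AS ending_score
4. NTH_VALUE( ):
○ returns the nth value in the window frame of the ordered partition (NULL if the frame has fewer than n rows)
○ useful to compare or retrieve a specific position's value within a dataset
○ eg: NTH_VALUE(score, 3) OVER (ORDER BY date) AS third_score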
5. CUME_DIST( ):
○ returns the cumulative distribution of a value: the fraction (between 0 and 1) of rows with a value less than or equal to the current row's value
○ helps to understand the relative position of a value in terms of the distribution of values
○ eg: CUME_DIST() OVER (ORDER BY score) AS cumulative_distribution
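Several of these functions can be run together in a sketch, assuming Python's sqlite3 module (SQLite 3.25+) and a hypothetical scores table. It also shows how LAST_VALUE() behaves with the default frame compared with a frame that spans the whole partition.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE scores (day INTEGER, score INTEGER);
    INSERT INTO scores VALUES (1, 50), (2, 80), (3, 65), (4, 90);
""")

rows = conn.execute("""
    SELECT day, score,
           FIRST_VALUE(score) OVER (ORDER BY day) AS first_score,
           -- default frame stops at the current row, so this just echoes score
           LAST_VALUE(score) OVER (ORDER BY day) AS last_default,
           -- full-partition frame gives the true final value
           LAST_VALUE(score) OVER (ORDER BY day
               ROWS BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING) AS last_full,
           PERCENT_RANK() OVER (ORDER BY score) AS pct_rank,
           CUME_DIST() OVER (ORDER BY score) AS cume
    FROM scores
    ORDER BY day
""").fetchall()

for row in rows:
    print(row)
```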
Key Takeaways
3. RANK() and DENSE_RANK() handle ranking, with the latter avoiding gaps
4. NTILE(n) divides data into 'n' approximately equal parts, ideal for quartiles or deciles
5. LAG() and LEAD() compare data between previous, current, and future rows,
● In SQL, a procedure does not return any value, but executes certain tasks or operations
● so if you have an SQL query that you need to write over and over again, save it as a stored procedure
1. Built-in Functions:
● Scalar Functions: return a single value based on input value [eg: UPPER(), LOWER(),
ROUND(), LEN(), ABS(), SUBSTRING()]
● Aggregate Functions: operate on many records and return a summarized result [eg:
SUM(), AVG(), MAX(), MIN(), COUNT()]
● Window Functions: operate over a set of table rows related to the current row and
return a value for each row, without collapsing the rows [eg: ROW_NUMBER(), RANK(),
LEAD(), LAG(), FIRST_VALUE(), LAST_VALUE()]
● Table Functions: return a table as a result [eg: UNNEST]
2. User-Defined Functions (UDF)
Key | Function | Procedure
Definition | A function is used to calculate result using given inputs. | A procedure is used to perform certain task in order.
DML | DML statements cannot be executed within a function. | DML statements can be executed within a procedure.
SQL, Query | A function can be called within a query. | A procedure cannot be called within a query.
SQL, Call | Whenever a function is called, it is first compiled before being called. | A procedure is compiled once and can be called multiple times without being compiled.
SQL, Return | A function returns a value and control to calling function or code. | A procedure returns the control but not any value to calling function or code.
try-catch | A function has no support for try-catch blocks. | A procedure has support for try-catch blocks.
SELECT | A select statement can have a function call. | A select statement can't have a procedure call.
Explicit Transaction Handling | A function cannot have explicit transaction handling. | A procedure can use explicit transaction handling.
SQL UDFs :
● Apart from the built-in functions provided by SQL, we can create our own functions,
statement, so that the users can use them just like the built-in functions of SQL
SELECT function_name(argument_list);
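How a UDF is created depends on the database engine. As a stand-in, the sketch below registers a Python function with SQLite through sqlite3's `Connection.create_function`, then calls it in a SELECT just like a built-in. This is a swapped-in technique for illustration, not the SQL-level function definition used by other engines; the function name c_to_f and the weather table are hypothetical.

```python
import sqlite3

def celsius_to_fahrenheit(celsius):
    """Hypothetical UDF body: convert a Celsius reading to Fahrenheit."""
    if celsius is None:
        return None  # propagate NULL, as most built-in scalar functions do
    return celsius * 9 / 5 + 32

conn = sqlite3.connect(":memory:")

# Register the function under the SQL name c_to_f, taking one argument.
conn.create_function("c_to_f", 1, celsius_to_fahrenheit)

conn.execute("CREATE TABLE weather (city TEXT, temp_c REAL)")
conn.executemany(
    "INSERT INTO weather VALUES (?, ?)",
    [("Kathmandu", 20.0), ("Pokhara", 25.0)],
)

# SELECT function_name(argument_list);
direct = conn.execute("SELECT c_to_f(100)").fetchone()[0]

# The UDF can also be applied to a column, row by row.
rows = conn.execute(
    "SELECT city, c_to_f(temp_c) FROM weather ORDER BY city"
).fetchall()
```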
Basic Syntax
How it works:
● Uses a Cron Expression (* * * * *) with five fields: minute, hour, day of month, month, and day of week.
● Example: 30 4 * * * runs a job daily at 4:30 AM
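To make the five fields concrete, here is a simplified sketch in Python. The helper names parse_cron and matches are hypothetical, and the matcher understands only `*` and plain integers; real cron also supports ranges, lists, and steps, and treats the two day fields specially when both are restricted.

```python
from datetime import datetime

FIELD_NAMES = ("minute", "hour", "day_of_month", "month", "day_of_week")

def parse_cron(expression):
    """Split a five-field cron expression into a dict of named fields."""
    parts = expression.split()
    if len(parts) != 5:
        raise ValueError("expected 5 space-separated fields")
    return dict(zip(FIELD_NAMES, parts))

def matches(expression, moment):
    """Return True if `moment` satisfies the expression ('*' and integers only)."""
    fields = parse_cron(expression)
    actual = {
        "minute": moment.minute,
        "hour": moment.hour,
        "day_of_month": moment.day,
        "month": moment.month,
        # Cron counts Sunday as 0; isoweekday() counts Sunday as 7.
        "day_of_week": moment.isoweekday() % 7,
    }
    return all(
        spec == "*" or int(spec) == actual[name]
        for name, spec in fields.items()
    )

daily_job = parse_cron("30 4 * * *")
runs_at_0430 = matches("30 4 * * *", datetime(2024, 4, 25, 4, 30))
runs_at_0500 = matches("30 4 * * *", datetime(2024, 4, 25, 5, 0))
```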
Uses:
● Automating backups, sending emails, system maintenance, reporting, notifications, data sync,
auto-updating content, resource monitoring, batch jobs, automated testing, etc.
● In SQL,
○ CRON can be used to schedule recurring database tasks
○ eg: Regular backups, data cleanup, refreshing materialized views, generating reports
Select Statement demonstrating various features
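SELECT
    c.Id, c.FirstName, c.LastName,                      -- SELECT clause with aliases
    COUNT(DISTINCT o.Id) AS total_orders,               -- DISTINCT: the OrderItem join repeats each order once per item
    SUM(oi.quantity * oi.UnitPrice) AS total_spent,
    AVG(oi.quantity) AS avg_items_per_order,
    MAX(o.OrderDate) AS latest_order_date               -- aggregate functions
FROM customer c
JOIN "order" o ON c.id = o.CustomerId                   -- joins ("order" is quoted because ORDER is a reserved word)
LEFT JOIN OrderItem oi ON o.Id = oi.OrderId
WHERE c.country LIKE 'USA' AND o.OrderDate <= '2023-01-01'   -- WHERE clause, LIKE
GROUP BY c.Id, c.FirstName, c.LastName;                 -- needed because non-aggregated columns are selected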
● SQL
○ W3schools SQL tutorial: https://www.w3schools.com/sql/default.asp
○ W3schools MySQL tutorial: https://www.w3schools.com/mysql/
○ Mode SQL tutorial: https://mode.com/sql-tutorial
○ Difference between Views and Materialized Views | tutorialspoint
○ CTE in SQL | geeksforgeeks
○ CTE vs subquery | learnsql blog
○ Window functions in SQL | geeksforgeeks
○ SQL Window Function Example | learnsql blog
○ Window Functions - Analytics Vidhya
○ Top 5 Most Popular Window Functions | mode
○ Window functions | PostgreSQL documentation
○ SQL Procedural Language
○ Difference between Function and Procedure