Fuse AI Fellowship 2024

Data Wrangling with SQL

By: Pallavi Shrestha | Sudip Bala

April 25
CONTENT

SN Topics
Phase I (6:30 PM - 7:15 PM) - By Pallavi Shrestha
1 Introduction and Briefing of the Session

2 Importance of Data Wrangling in Data Science

3 Overview of RDBMS

4 Structured Query Language (SQL)

5 Basic SQL + Demo

6 Intermediate SQL + Demo

7 Some Advanced SQL + Demo

Break (7:15 PM - 7:25 PM)

CONTENT

SN Topics
Phase II (7:25 PM - 8:15 PM) - By Sudip Bala
8 Real World Data Wrangling Project with SQL

a. Exploratory Data Analysis

b. Handling missing and duplicate values

c. Date formats

d. Trimming, CASE, COALESCE, Window Functions

e. Dealing with outliers

f. Regular expressions for string manipulation

8:15 PM - 8:30 PM - Conclusion and QnA

Role of Data Wrangling in Data Science & Machine Learning
Introduction
Data wrangling, or data munging, involves transforming and mapping raw data into a
structured format, ready for analysis and model building.

It goes beyond data cleaning: it transforms, reformats, and prepares the data more
thoroughly for downstream needs.

Purpose:
- Improve data quality
- Enable faster analysis
- Enhance model accuracy
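SQL (Structured Query Language) is the standard language for querying and managing relational databases.
SQL dialects are vendor-specific variations of SQL, such as T-SQL (Microsoft SQL Server), PL/SQL (Oracle), and the dialects used by MySQL and PostgreSQL.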

Installation and Setup

RDBMS:
● MySQL
● PostgreSQL

Tools:
● DBeaver
● MySQL Workbench

For Linux Terminal (Ubuntu/Debian):

1. Install PostgreSQL
sudo apt install postgresql

2. Add and Install DBeaver

sudo apt install wget software-properties-common
wget -O - https://dbeaver.io/debs/dbeaver.gpg.key | sudo apt-key add -
sudo add-apt-repository "deb https://dbeaver.io/debs/dbeaver-ce /"
sudo apt update
sudo apt install dbeaver-ce

For Linux Terminal (Ubuntu/Debian):

1. Install MySQL Server

sudo apt install mysql-server
[Depending on the MySQL and Ubuntu version, you may be prompted to set a password for the MySQL root user during installation]

2. Install MySQL Workbench

[Download the MySQL Workbench Debian package from the official MySQL website and install it using dpkg]
Example:
wget https://dev.mysql.com/get/Downloads/MySQLGUITools/mysql-workbench-community_8.0.28-1ubuntu21.10_amd64.deb
sudo dpkg -i mysql-workbench-community_8.0.28-1ubuntu21.10_amd64.deb

RDBMS Basics
Basics to get started
● Database: a shared collection of data that is stored and organised so that it can be
easily accessed and maintained.
● Database Management System: Collection of programs which enables users to
create and maintain a database

RDBMS examples
● Oracle
● MySQL
● SQLite
● PostgreSQL
● Microsoft SQL Server
● MariaDB
A schema is a logical container that holds a collection of database objects, such as tables,
views, indexes, and more
● defines the structure of the database and the organization of data within it
● helps in segregating different types of data and provides a way to manage permissions
and access control for different user roles

A table is a fundamental database object that stores data in rows and columns, forming a
two-dimensional structure
● each column corresponds to an attribute of the data
● each row represents an individual record or entry
● used to represent different entities in the real world, such as customers, orders, products,
employees, etc.
● relationships between tables are established through keys, such as primary keys and
foreign keys.
Basic SQL Commands
DB Implementation
● DDL (Data Definition Language)
○ for defining and managing the structure of the database, including tables,
relationships, indexes, and constraints
○ DDL commands are used to create, modify, and delete database objects
○ CREATE, DROP, ALTER, TRUNCATE, RENAME, COMMENT

● DML (Data Manipulation Language)

○ To manipulate data within the database present in tables
○ Includes commands for inserting, updating, and deleting records from
tables, as well as querying data.
○ INSERT, UPDATE, DELETE, SELECT
○ MERGE, CALL, EXPLAIN
● DCL (Data Control Language): used to control access to data within the database
○ GRANT: used to give specific privileges or permissions to database users
○ REVOKE: used to revoke previously granted privileges from users

● TCL (Transaction Control Language)
○ used to manage transactions within the database
○ allows us to control the beginning, ending, and execution of transactions
○ COMMIT: used to permanently save the changes made during the current
transaction to the database
○ ROLLBACK: used to undo all changes made during the current transaction. It
restores the database to its state before the transaction began
○ SAVEPOINT: used to set a savepoint within a transaction. Savepoints allow you to
divide a transaction into smaller parts and roll back to a specific point if necessary
Code demo: DDL, DML
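The DDL, DML, and TCL commands above can be tried out with a small sketch. It uses Python's built-in sqlite3 module as a stand-in for MySQL or PostgreSQL, so some syntax may differ on a server database. The employee table and its rows are hypothetical.

```python
import sqlite3

# isolation_level=None leaves transaction control to the explicit
# BEGIN / SAVEPOINT / ROLLBACK TO / COMMIT statements below.
conn = sqlite3.connect(":memory:", isolation_level=None)

# DDL: define and alter the structure
conn.execute("CREATE TABLE employee (id INTEGER PRIMARY KEY, name TEXT NOT NULL, salary REAL)")
conn.execute("ALTER TABLE employee ADD COLUMN department TEXT")

# DML: insert and update rows
conn.execute("INSERT INTO employee (name, salary, department) VALUES ('Asha', 50000, 'Data')")
conn.execute("INSERT INTO employee (name, salary, department) VALUES ('Bikash', 42000, 'Ops')")
conn.execute("UPDATE employee SET salary = 55000 WHERE name = 'Asha'")

# TCL: one transaction with a savepoint in the middle
conn.execute("BEGIN")
conn.execute("DELETE FROM employee WHERE name = 'Bikash'")
conn.execute("SAVEPOINT before_wipe")
conn.execute("DELETE FROM employee")        # a mistake: removes every row
conn.execute("ROLLBACK TO before_wipe")     # undoes only the second DELETE
conn.execute("COMMIT")                      # the first DELETE is kept

rows = conn.execute("SELECT name, salary, department FROM employee").fetchall()
print(rows)
```

Only the row for Asha should remain: the targeted DELETE is committed, while the accidental full DELETE is rolled back to the savepoint.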
Comparison Operators
● Used to compare two values
● Represented by symbols
● Essential in conditions that determine which records will be selected, updated, or
deleted

Operator | Symbol
Equal to | =
Not equal to | <> or !=
Greater than | >
Less than | <
Greater than or equal to | >=
Less than or equal to | <=
Logical Operators
● Allow us to combine multiple comparison operators in one query
● Used to filter the data by more than one condition
○ LIKE - to match similar values, instead of exact values.
○ IN - to specify a list of values you'd like to include.
○ BETWEEN - to select only rows within a certain range.
○ IS NULL - to select rows that contain no data in a given column.
○ AND - to select only rows that satisfy two conditions.
○ OR - to select rows that satisfy either of two conditions.
○ NOT - to select rows that do not match a certain condition
Wild Cards
● Used to substitute one or more characters in a string
● Especially useful in database systems for pattern searching within strings
● Used with LIKE operator

Symbol | Description | Example
% | Represents zero or more characters | bl% finds bl, black, blue, and blob
_ | Represents a single character | h_t finds hot, hat, and hit
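The operators and wildcards above can be sketched against a tiny table. This is an illustrative example only: it runs on Python's built-in sqlite3 module, and the product table, its columns, and the helper function names() are hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE product (name TEXT, colour TEXT, price REAL)")
conn.executemany("INSERT INTO product VALUES (?, ?, ?)", [
    ("hat", "black", 12.0),
    ("hot sauce", "red", 4.5),
    ("hut", None, 300.0),      # colour is missing (NULL)
    ("boots", "blue", 80.0),
])

def names(where):
    """Return product names matching a WHERE condition, sorted by name."""
    sql = f"SELECT name FROM product WHERE {where} ORDER BY name"
    return [row[0] for row in conn.execute(sql)]

like_percent = names("colour LIKE 'bl%'")            # % matches zero or more characters
like_underscore = names("name LIKE 'h_t'")           # _ matches exactly one character
in_list = names("colour IN ('red', 'blue')")
between = names("price BETWEEN 10 AND 100")          # both bounds are inclusive
is_null = names("colour IS NULL")
combined = names("price < 100 AND NOT colour = 'red'")

print(like_percent, like_underscore, in_list, between, is_null, combined)
```

Note that the row with a NULL colour never satisfies `NOT colour = 'red'`, because a comparison with NULL is neither true nor false; IS NULL is the only way to select it.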
Intermediate SQL
Aggregate Functions
● An aggregate function performs a calculation on a set of values, and returns a single
value
● often used with the GROUP BY clause of the SELECT statement

Function | Used for
COUNT | counts the number of (non-null) rows in a particular column
SUM | adds all values in a particular column
AVG | calculates the average of a group of selected values
MIN | returns the smallest value of the selected column
MAX | returns the largest value of the selected column

● Data types: COUNT works on any data type; MIN and MAX work on orderable types such as numbers, strings, and dates; SUM and AVG need numeric values

GROUP BY
● allows us to separate data into groups, which can be aggregated
independently of one another
● often used with aggregate functions to group the result-set by one or more
columns
HAVING
● The HAVING clause was added to SQL because the WHERE keyword cannot
be used with aggregate functions

Query Clause Order

SELECT column_name(s)
FROM table_name
WHERE condition
GROUP BY column_name(s)
HAVING condition
ORDER BY column_name(s);

1. SELECT
2. FROM
3. WHERE
4. GROUP BY
5. HAVING
6. ORDER BY
Order of SQL Operations
Helps in understanding the relationship between filtering, grouping, and aggregating

1. FROM: Source table(s)

2. WHERE: Row-level filtering
3. GROUP BY: Group rows by column values
4. Aggregate functions applied: SUM(), AVG(), etc.
5. HAVING: Filter on the results of aggregate calculations
6. SELECT: Define which columns (or calculations) to return
7. ORDER BY: Sort the result set
8. LIMIT: Limit the number of returned rows
Points to be noted
● Columns in the SELECT statement that are not aggregated must be listed in
the GROUP BY clause
● The WHERE clause filters rows before aggregation
● The HAVING clause is used to filter results after aggregation
● Both rules follow from the order of SQL operations: WHERE runs before GROUP BY, HAVING runs after it
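A short sketch of how WHERE, GROUP BY, and HAVING interact, using Python's built-in sqlite3 module. The sale table and its values are hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE sale (region TEXT, amount REAL)")
conn.executemany("INSERT INTO sale VALUES (?, ?)", [
    ("East", 100), ("East", 250),
    ("West", 40), ("West", 30),
    ("North", 500), ("North", 5),
])

query = """
SELECT region, COUNT(*) AS n_sales, SUM(amount) AS total
FROM sale
WHERE amount >= 10           -- row-level filter: runs before grouping
GROUP BY region
HAVING SUM(amount) > 100     -- group-level filter: runs after aggregation
ORDER BY total DESC
"""
result = conn.execute(query).fetchall()
print(result)
```

The 5-unit North sale is dropped by WHERE before counting, and the West group (total 70) is dropped by HAVING after summing, so only North and East remain.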
DISTINCT
● Returns only unique values in a particular column
● Used in SELECT statement
SELECT DISTINCT column_name
FROM table_name;
● If 2 or more columns are included in the SELECT DISTINCT clause, the result will contain all of
the unique combinations of values in those columns
SELECT DISTINCT column1, column2, ...
FROM table_name;
● To count unique values, DISTINCT goes inside the aggregate function
SELECT COUNT(DISTINCT column_name)
FROM table_name;
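The three DISTINCT forms above can be compared side by side. This is a small sketch on Python's built-in sqlite3 module; the visit table and its rows are hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE visit (city TEXT, device TEXT)")
conn.executemany("INSERT INTO visit VALUES (?, ?)", [
    ("Kathmandu", "mobile"), ("Kathmandu", "mobile"), ("Kathmandu", "desktop"),
    ("Pokhara", "mobile"), ("Pokhara", "mobile"),
])

# One column: unique values
cities = conn.execute("SELECT DISTINCT city FROM visit ORDER BY city").fetchall()

# Two columns: unique combinations of both values
pairs = conn.execute(
    "SELECT DISTINCT city, device FROM visit ORDER BY city, device"
).fetchall()

# DISTINCT inside an aggregate: number of unique devices
n_devices = conn.execute("SELECT COUNT(DISTINCT device) FROM visit").fetchone()[0]

print(cities, pairs, n_devices)
```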
Aliases
● used to give a table, or a column, a temporary name
● often used to make column names more readable
● An alias only exists for the duration of that query
● created with the AS keyword

Syntax
SELECT column_name AS alias_name
FROM table_name;
JOINS
Used to combine rows from 2 or more tables based on related columns between them

Type | Returns
INNER | Rows that have matching values in both tables
LEFT / LEFT OUTER | All rows from the left table + only matching rows from the right table
RIGHT / RIGHT OUTER | All rows from the right table + only matching rows from the left table
CROSS JOIN | Produces a Cartesian product of the two tables, returning all possible combinations of rows
SELF / Recursive | Joins a table with itself to compare rows within the same table
NATURAL | Joins two tables on all columns with the same name in both tables
FULL OUTER (postgres) | Returns all rows from both tables: matched where possible, with NULLs filling in the side that has no match
LATERAL (postgres) | Allows a subquery in the FROM clause to refer to columns of the preceding table(s). Useful when dealing with set-returning functions or other subqueries that need to reference the main query.
JOINS - example
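A minimal sketch of INNER, LEFT, and CROSS joins on Python's built-in sqlite3 module. The customer and orders tables are hypothetical. RIGHT and FULL OUTER joins are left out because older SQLite versions do not support them.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE customer (id INTEGER PRIMARY KEY, name TEXT)")
conn.execute("CREATE TABLE orders (id INTEGER PRIMARY KEY, customer_id INTEGER, amount REAL)")
conn.executemany("INSERT INTO customer VALUES (?, ?)",
                 [(1, "Asha"), (2, "Bikash"), (3, "Chandra")])
conn.executemany("INSERT INTO orders VALUES (?, ?, ?)",
                 [(10, 1, 99.0), (11, 1, 20.0), (12, 2, 5.0)])

# INNER JOIN: only customers that have at least one order
inner = conn.execute("""
    SELECT c.name, o.id
    FROM customer c
    JOIN orders o ON o.customer_id = c.id
    ORDER BY o.id
""").fetchall()

# LEFT JOIN: every customer; order columns are NULL when there is no match
left = conn.execute("""
    SELECT c.name, o.id
    FROM customer c
    LEFT JOIN orders o ON o.customer_id = c.id
    ORDER BY c.id, o.id
""").fetchall()

# CROSS JOIN: every customer paired with every order (3 x 3 rows)
cross_count = conn.execute(
    "SELECT COUNT(*) FROM customer CROSS JOIN orders"
).fetchone()[0]

print(inner, left, cross_count)
```

Chandra has no orders, so she is absent from the INNER JOIN result and appears once in the LEFT JOIN result with a NULL (None) order id.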
JOINS on multiple keys
● When we need to join tables based on more than one key or condition, we can use
multiple join conditions.
● Common in scenarios where a single pair of keys isn't unique enough to combine
the tables correctly.

SELECT A.*, B.*
FROM table1 A
JOIN table2 B
ON A.key1 = B.key1 AND A.key2 = B.key2;
CASE
● The SQL CASE is a conditional expression that allows you to introduce
decision-making logic in your SQL queries
● used to derive values based on specific conditions, effectively allowing for
"if-then-else" logic within SQL statements
Syntax
CASE
WHEN condition1 THEN result1
WHEN condition2 THEN result2
WHEN conditionN THEN resultN
ELSE result
END;
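The CASE syntax above can be used to bucket values into labels. A small sketch on Python's built-in sqlite3 module; the reading table, the 1000-case threshold, and the labels are hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE reading (state TEXT, total_cases INTEGER)")
conn.executemany("INSERT INTO reading VALUES (?, ?)",
                 [("A", 50), ("B", 5000), ("C", None)])

# Conditions are checked top to bottom; the first one that is true wins.
levels = conn.execute("""
    SELECT state,
           CASE
               WHEN total_cases IS NULL THEN 'unknown'
               WHEN total_cases >= 1000 THEN 'high'
               ELSE 'low'
           END AS level
    FROM reading
    ORDER BY state
""").fetchall()
print(levels)
```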
UNION
● The UNION operator is used to combine the result sets of two or more SELECT statements.
● Each SELECT statement within the UNION must have
○ same number of columns,
○ columns must also have similar data types,
○ columns must be in the same order.
● By default, UNION removes duplicate rows. Use UNION ALL if you want to keep duplicates

SELECT column1, column2, ...
FROM table1
UNION
SELECT column1, column2, ...
FROM table2;
MINUS
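● MINUS is Oracle's name for the set-difference operator; PostgreSQL and the SQL standard call it EXCEPT
● MySQL
○ does not support the MINUS operator (EXCEPT is available from MySQL 8.0.31)
○ on older versions, you can use a LEFT JOIN with an IS NULL check, or NOT EXISTS, to achieve similar results
● PostgreSQL
○ The EXCEPT operator returns all rows from the first SELECT statement that are not
returned by the second SELECT statement

SELECT column1, column2, ...
FROM table1
EXCEPT
SELECT column1, column2, ...
FROM table2;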
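The set operators can be compared on two tiny tables. This sketch uses Python's built-in sqlite3 module, which supports UNION, UNION ALL, and EXCEPT; the tables t1 and t2 are hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE t1 (x INTEGER)")
conn.execute("CREATE TABLE t2 (x INTEGER)")
conn.executemany("INSERT INTO t1 VALUES (?)", [(1,), (2,), (2,), (3,)])
conn.executemany("INSERT INTO t2 VALUES (?)", [(3,), (4,)])

def values(sql):
    """Run a query and return its first column as a list."""
    return [row[0] for row in conn.execute(sql)]

# UNION removes duplicate rows
union = values("SELECT x FROM t1 UNION SELECT x FROM t2 ORDER BY x")

# UNION ALL keeps every row from both inputs
union_all_count = values(
    "SELECT COUNT(*) FROM (SELECT x FROM t1 UNION ALL SELECT x FROM t2)"
)[0]

# EXCEPT keeps rows of the first query that the second query does not return
difference = values("SELECT x FROM t1 EXCEPT SELECT x FROM t2 ORDER BY x")

print(union, union_all_count, difference)
```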
INDEX
● a database object designed to speed up data retrieval from a table by creating a
searchable physical structure
● enable quick data access without scanning every row each time a table is queried,
making them especially beneficial for large tables
● often used to enhance query performance for operations involving JOINs, WHERE
clauses, or ORDER BY clauses.

Syntax: CREATE INDEX index_name ON table_name (column1, column2, ..., columnN);

● Trade-offs: While indexes significantly improve query speed, they do take up
additional disk space and can slow down write operations (INSERT, UPDATE,
DELETE) because the index also needs to be updated. So it's crucial to choose
which columns to index wisely based on query patterns, as unnecessary indexes
can lead to wasted resources and poor insert/update performance.
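One way to see an index at work is to compare the query plan before and after creating it. The sketch below does this with SQLite's EXPLAIN QUERY PLAN through Python's built-in sqlite3 module; MySQL and PostgreSQL use EXPLAIN with a different output format. The orders table, the index name, and the helper plan() are hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders (id INTEGER PRIMARY KEY, customer_id INTEGER, amount REAL)")
conn.executemany(
    "INSERT INTO orders (customer_id, amount) VALUES (?, ?)",
    [(i % 100, float(i)) for i in range(1000)],
)

query = "SELECT * FROM orders WHERE customer_id = 42"

def plan(sql):
    """Return the human-readable detail column of SQLite's query plan."""
    return " ".join(row[3] for row in conn.execute("EXPLAIN QUERY PLAN " + sql))

before = plan(query)   # no usable index yet: the table is scanned
conn.execute("CREATE INDEX idx_orders_customer ON orders (customer_id)")
after = plan(query)    # the planner can now search through the index

print("before:", before)
print("after: ", after)
```

The exact wording of the plan depends on the SQLite version, but the second plan should mention idx_orders_customer while the first should not.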
Subqueries
● aka inner query or nested query
● a query embedded within another SQL statement
● a SELECT query that is used as part of the main query
● A subquery can return one or more values and can be used in various parts of
the main query, such as in the SELECT, WHERE, and FROM clauses
● Use:
○ used in situations where multiple retrievals are required before the final
result can be obtained
○ useful for breaking down complex problems into simpler, more
manageable pieces
Uses of subqueries
● Some common scenarios include:
○ Retrieving a single summarized value to be used in the main query (e.g.,
an average or maximum).
○ Fetching a list of values to be used as a filter in the main query's WHERE
clause.
○ Creating derived tables on-the-fly which can be utilized by the outer
query.
○ Checking the existence of specific records before performing an action.
○ Comparing each row against a set of rows to determine relationships
(e.g., greater than the average).
● A SQL statement can contain multiple subqueries (even subqueries inside of
subqueries)
Where can subqueries occur?
Subqueries can occur in various parts of an SQL statement, including:
● SELECT clause: Here, it can return a single value for each row processed by
the main query.
● WHERE clause: Used to filter results based on the outcome of the subquery.
Common operators used with subqueries in the WHERE clause include IN,
NOT IN, EXISTS, NOT EXISTS, <, >, <=, >=, =, and <>.
● FROM clause: When used here, the subquery acts as a derived table which
the main query can reference.
● HAVING clause: Useful for filtering results of group by clauses based on the
outcome of a subquery.
● INSERT, UPDATE, DELETE: To determine which records to modify or remove
based on the results of a subquery.
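Three of the placements listed above (a scalar subquery in WHERE, a list subquery with IN, and a derived table in FROM) can be sketched as follows, using Python's built-in sqlite3 module. The employee table and its salaries are hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE employee (name TEXT, dept TEXT, salary INTEGER)")
conn.executemany("INSERT INTO employee VALUES (?, ?, ?)", [
    ("Asha", "Data", 90), ("Bikash", "Data", 60),
    ("Chandra", "Ops", 50), ("Dipa", "Ops", 70),
])

# WHERE with a scalar subquery: employees paid above the overall average (67.5)
above_avg = [r[0] for r in conn.execute("""
    SELECT name FROM employee
    WHERE salary > (SELECT AVG(salary) FROM employee)
    ORDER BY name
""")]

# WHERE ... IN with a subquery that returns a list of departments
in_strong_depts = [r[0] for r in conn.execute("""
    SELECT name FROM employee
    WHERE dept IN (SELECT dept FROM employee GROUP BY dept HAVING AVG(salary) > 65)
    ORDER BY name
""")]

# FROM with a derived table: highest departmental average
best_dept_avg = conn.execute("""
    SELECT MAX(avg_salary)
    FROM (SELECT dept, AVG(salary) AS avg_salary FROM employee GROUP BY dept) AS d
""").fetchone()[0]

print(above_avg, in_strong_depts, best_dept_avg)
```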
Advanced SQL
Views & Materialized Views
VIEW
● a virtual table based on the result set of a SQL statement
● unlike tables, views do not store data physically
● they pull data from one or multiple tables every time they are queried
● used when data is to be accessed infrequently and the data in a table gets
updated on a frequent basis
● the dynamic nature of views allows them to always provide up-to-date results based on
the current data in the underlying tables

CREATE VIEW view_name AS
SELECT column1, column2, ...
FROM table_name
WHERE condition;
Materialized View
● similar to a regular view in that it represents the result of a stored query
● unlike a regular view, a Materialized View stores the result of that query in a
physical table
● a snapshot of the data at the time the MV is created or refreshed
● Use:
○ to optimize query performance
○ For aggregations and computations over large datasets
○ When the underlying data doesn't change frequently but reads are
frequent
○ In data warehousing scenarios where query performance is critical
Syntax
1. Create MV
CREATE MATERIALIZED VIEW view_name AS
SELECT ...
FROM ...
WHERE ...;

2. Refresh MV
REFRESH MATERIALIZED VIEW view_name;
Views VS Materialized Views
● Regular Views:
○ When to Use: Useful when the data needs to be always up-to-date with the underlying
tables
○ Best for situations where the underlying data changes frequently but the view is not
queried often
○ Example: A report that displays current employee data, which is changing constantly
due to new hires, departures, or role changes.
● Materialized Views:
○ When to Use: Best for situations where the data doesn't change frequently but the view is
queried often, especially for complex aggregations or joins that are computationally
expensive.
○ Example: A monthly sales report that aggregates sales data over multiple tables. The
data might only be updated once a day or once a week, but the report might be
accessed frequently. Using a MV can make accessing this report much faster.
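The difference between the two can be sketched with Python's built-in sqlite3 module. SQLite has regular views but no CREATE MATERIALIZED VIEW, so this example emulates the materialized view with a plain snapshot table (mv_total) that is rebuilt by hand; that emulation, the sale table, and all names are assumptions made for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE sale (amount REAL)")
conn.executemany("INSERT INTO sale VALUES (?)", [(10,), (20,)])

# Regular view: stores only the query, re-evaluated on every read
conn.execute("CREATE VIEW v_total AS SELECT SUM(amount) AS total FROM sale")

# Stand-in for a materialized view: the query result is stored physically
conn.execute("CREATE TABLE mv_total AS SELECT SUM(amount) AS total FROM sale")

conn.execute("INSERT INTO sale VALUES (30)")   # underlying data changes

view_total = conn.execute("SELECT total FROM v_total").fetchone()[0]
stale_total = conn.execute("SELECT total FROM mv_total").fetchone()[0]

# Manual "refresh" (PostgreSQL: REFRESH MATERIALIZED VIEW view_name;)
conn.execute("DELETE FROM mv_total")
conn.execute("INSERT INTO mv_total SELECT SUM(amount) FROM sale")
refreshed_total = conn.execute("SELECT total FROM mv_total").fetchone()[0]

print(view_total, stale_total, refreshed_total)
```

The view reflects the new row immediately, while the stored snapshot stays at the old total until it is refreshed.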
Common Table Expressions (CTEs)
Common Table Expression (CTE)
● temporary named result sets that exist for just one query
● can be referenced within a SELECT, INSERT, UPDATE, or DELETE statement
● can be self-referencing or even recursive
● defined using the WITH clause

Why are CTEs used?

● Improved readability, better visibility
● Decomposing Complicated Queries
● Recursive Queries
● Referencing the Same Dataset Multiple Times
Syntax
WITH cte_name AS (
-- CTE query here
)
-- Main query using the CTE
SELECT ...
FROM cte_name ...
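A sketch of a plain CTE and a recursive CTE, run with Python's built-in sqlite3 module. The employee table, the CTE names dept_avg and counter, and the values are hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE employee (name TEXT, dept TEXT, salary INTEGER)")
conn.executemany("INSERT INTO employee VALUES (?, ?, ?)", [
    ("Asha", "Data", 90), ("Bikash", "Data", 60),
    ("Chandra", "Ops", 50), ("Dipa", "Ops", 70),
])

# Plain CTE: name the per-department averages, then join against them
above_dept_avg = [r[0] for r in conn.execute("""
    WITH dept_avg AS (
        SELECT dept, AVG(salary) AS avg_salary
        FROM employee
        GROUP BY dept
    )
    SELECT e.name
    FROM employee e
    JOIN dept_avg d ON d.dept = e.dept
    WHERE e.salary > d.avg_salary
    ORDER BY e.name
""")]

# Recursive CTE: the second SELECT refers back to the CTE itself
one_to_five = [r[0] for r in conn.execute("""
    WITH RECURSIVE counter(n) AS (
        SELECT 1
        UNION ALL
        SELECT n + 1 FROM counter WHERE n < 5
    )
    SELECT n FROM counter
""")]

print(above_dept_avg, one_to_five)
```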
Subqueries VS CTE
● CTEs can make queries more readable compared to subqueries, especially when
the same subquery would need to be used in multiple places
● CTEs must always be named. A subquery needs a name (alias) only when it appears in the
FROM clause, and only in some databases: MySQL requires it, and PostgreSQL required it before version 16
● Subqueries can be used in any part of the main query, while CTEs are defined at
the beginning and can then be referenced by the main query
● CTEs can be recursive, subqueries cannot
● A subquery can be written inline, for example inside a WHERE clause with IN or EXISTS;
a CTE cannot be defined inline, but once declared in the WITH clause it can be referenced there
Window Functions
Window Function
● a type of function that performs a calculation across a set of table rows that are
somehow related to the current row
● this "set of related rows" is termed as a "window"
● Aka windowing functions, OVER functions, or analytic functions

Why use window functions?

● Aggregation Without Grouping
● Flexible Calculations
● Complex Data Analysis Made Simpler
Window functions
1. Aggregation functions: SUM(), AVG(), COUNT(), MIN(), MAX()
2. Ranking functions: ROW_NUMBER(), RANK(), DENSE_RANK(), NTILE(n)
3. Value functions: LAG(), LEAD(), FIRST_VALUE(), LAST_VALUE(), NTH_VALUE()
4. Statistical functions
5. Cumulative distribution and percentile functions

Basic windowing syntax
SELECT column_name1,
window_function(column_name2)
OVER([PARTITION BY column_name1] [ORDER BY column_name3]) AS
new_column
FROM table_name;

window_function= any aggregate, ranking, or value function

column_name1= column to be selected (here also the column used to partition the rows)
column_name2= column on which window function is to be applied
column_name3= column by which rows are ordered within each partition
new_column= Name of new column
table_name= Name of table
OVER clause
● defines the window for a window function
● determines the range or set of rows used in the window function's calculation for each
row
○ Without OVER: Aggregate functions like SUM, AVG, and COUNT return a single
value for the entire set of rows.
○ With OVER: These same functions return a value for each row based on the
window of rows defined by the OVER clause.
Example: Suppose you want a running total of total_cases across states, ordered by the
number of total cases (highest first), and you want that running total to restart within
each "partition" defined by active_ratio_percent.

SELECT
state, total_cases, active_ratio_percent,
SUM(total_cases) OVER (PARTITION BY active_ratio_percent
ORDER BY total_cases DESC) AS running_total_per_active_ratio
FROM
covid
ORDER BY
active_ratio_percent, total_cases DESC;
PARTITION BY
● divides the query result set into partitions
● the window function is applied separately to each partition
● similar to the GROUP BY clause, but, instead of aggregating the data, it
retains the original rows and computes the function over each partition
● PARTITION BY column_name
● only those columns made available by the FROM clause can be used in
PARTITION BY
● aliases in the SELECT list can’t be used for partition
ORDER BY
● within the defined window, rows can be ordered using this clause
● i.e. it defines the logical order of rows within each partition of the result set
● crucial while calculating running totals or cumulative metrics
● if ASC or DESC is not specified, the default order is ASC
● if ORDER BY is omitted altogether, the window function uses all rows in the partition
● if ORDER BY is specified but no frame is specified, the default frame is:
○ RANGE BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW
● like PARTITION BY, ORDER BY can also use only those columns made available by
the FROM clause
● an integer position can’t be used to stand for a column name or alias
Frame specification
Determines the range of rows to include in the calculation for the current row
Syntax components:
1. ROWS or RANGE
● ROWS defines the frame by a number of rows
● RANGE defines the frame by value intervals
2. Start and End boundaries
○ UNBOUNDED PRECEDING: the frame starts at the first row of the partition
○ X PRECEDING: X rows before the current row
○ CURRENT ROW: the current row
○ X FOLLOWING: X rows after the current row
○ UNBOUNDED FOLLOWING: the frame ends at the last row of the partition
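The frame boundaries above can be sketched with a running total and a three-row moving average. This runs on Python's built-in sqlite3 module and assumes SQLite 3.25 or newer, the first version with window functions. The daily table and its values are hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE daily (day INTEGER, cases INTEGER)")
conn.executemany("INSERT INTO daily VALUES (?, ?)",
                 [(1, 10), (2, 20), (3, 30), (4, 40)])

frames = conn.execute("""
    SELECT day,
           -- everything from the first row up to the current row
           SUM(cases) OVER (
               ORDER BY day
               ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW
           ) AS running_total,
           -- previous row, current row, and next row (fewer at the edges)
           AVG(cases) OVER (
               ORDER BY day
               ROWS BETWEEN 1 PRECEDING AND 1 FOLLOWING
           ) AS moving_avg
    FROM daily
    ORDER BY day
""").fetchall()

for row in frames:
    print(row)
```

At the first and last rows the moving-average frame holds only two rows, because there is no preceding or following row to include.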
Aggregate window functions
● COUNT()
● SUM()
● AVG()
● MIN()
● MAX()

Benefits
● Allows for complex calculations that consider both the
current row and a related set of rows.
● Facilitates the creation of running totals, moving
averages, and other cumulative metrics.
● Provides insights without the need for self-joins or
subqueries.
Aggregate functions VS aggregate window functions
● both perform operations over a set of rows
● the way they present the results and the context in which they operate makes
them distinct
1. Result Set Size:
● Regular aggregate functions condense multiple rows into a single
result row, or one row per group when used with the GROUP BY clause
● Aggregate window functions return a value for each row in the result, based
on the window defined by the OVER clause (no need to use GROUP BY), i.e.
they retain the original number of rows

2. Context of Operation
3. Scope of Operation
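The difference in result set size can be seen by running the same SUM both ways. A sketch using Python's built-in sqlite3 module (window functions need SQLite 3.25 or newer); the employee table and its values are hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE employee (name TEXT, dept TEXT, salary INTEGER)")
conn.executemany("INSERT INTO employee VALUES (?, ?, ?)", [
    ("Asha", "Data", 90), ("Bikash", "Data", 60),
    ("Chandra", "Ops", 50), ("Dipa", "Ops", 70),
])

# Regular aggregate: one row per group
grouped = conn.execute("""
    SELECT dept, SUM(salary)
    FROM employee
    GROUP BY dept
    ORDER BY dept
""").fetchall()

# Aggregate window function: every original row is kept,
# each carrying the total of its own partition
windowed = conn.execute("""
    SELECT name, dept, SUM(salary) OVER (PARTITION BY dept) AS dept_total
    FROM employee
    ORDER BY name
""").fetchall()

print(grouped)
print(windowed)
```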
Ranking Window Functions: ROW_NUMBER( ), RANK( ), DENSE_RANK( ), NTILE(n)

1. ROW_NUMBER( )
● a window function that assigns a unique sequential integer to rows within a partition of a result
set
● starts at 1 and numbers the rows according to the ORDER BY part of the window definition
● within a partition, no two rows can have the same row number
● takes no arguments, so the parentheses stay empty

Why is it used?
● To generate unique numbers for each row, even if they have the same values
● Useful for pagination in applications
● Helps in data deduplication by identifying duplicates and retaining only unique rows
More about ROW_NUMBER( )

● No Duplicates: Unlike other ranking functions, ROW_NUMBER() does not
account for duplicates. Two rows with identical values will still get different
row numbers
● Ordering: The order of the numbers depends on the ORDER BY clause. If no
order is defined, the numbering can be arbitrary
● Partitions: ROW_NUMBER() can be reset with different partitions using the
PARTITION BY clause. This allows for numbering within subsets of your data
2. RANK( )
● a window function that assigns a rank to each row within a partition of a
result set; rows with equal values share the same rank

Why is it used?
● To rank items in a specific order (e.g., top N items).
● Useful for competitive scenarios where positions might be shared, such
as in leaderboards or sales performance
● Assigns the same rank to rows with identical values (ties) but skips the
subsequent rank(s)
More about RANK( )

● Handling Ties: If two (or more) rows have the same value, RANK() will
give them the same rank. The next rank, however, will be skipped. For
example, if two rows receive a rank of 2, the next row will be ranked 4, not 3
● Ordering: The rank numbers are determined based on the ORDER BY
clause
● Partitions: Using the PARTITION BY clause, RANK() can start over with
each partition, allowing for ranking within specific categories or groups
3. DENSE_RANK( )
● assigns a rank to each row within the ordering of a result set, without gaps
● just like the RANK() function, the first row is assigned rank 1 and rows having the same
value have the same rank
● the difference between RANK() and DENSE_RANK() is that DENSE_RANK() uses the
next consecutive integer after a tie, so no rank is skipped

Why is it used?
● Perfect for scenarios where you want consecutive ranking without gaps
● Essential for data analyses where you don’t want skipped rank numbers
4. NTILE(n)
● distributes the rows in an ordered result set into n approximately equal
parts
● returns the group number for each of the rows in the partition

Why is it used?
● Useful for dividing datasets into quartiles, deciles, percentiles, or any
other sub-partitions
● Helps to analyze the distribution of data
● Flexible Bucketing: NTILE() is a flexible function that can create any
number of buckets as defined
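The four ranking functions can be compared on one small table with a tie in it. A sketch using Python's built-in sqlite3 module (window functions need SQLite 3.25 or newer); the score table and its values are hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE score (player TEXT, points INTEGER)")
conn.executemany("INSERT INTO score VALUES (?, ?)",
                 [("A", 100), ("B", 90), ("C", 90), ("D", 80)])

# "player" is added as a tie-breaker where a deterministic order is needed
ranking = conn.execute("""
    SELECT player,
           ROW_NUMBER() OVER (ORDER BY points DESC, player) AS row_num,
           RANK()       OVER (ORDER BY points DESC)         AS rnk,
           DENSE_RANK() OVER (ORDER BY points DESC)         AS dense_rnk,
           NTILE(2)     OVER (ORDER BY points DESC, player) AS half
    FROM score
    ORDER BY points DESC, player
""").fetchall()

for row in ranking:
    print(row)
```

B and C tie on 90 points: ROW_NUMBER still gives them 2 and 3, RANK gives both 2 and then jumps to 4 for D, and DENSE_RANK gives both 2 and continues with 3.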
LAG( ) and LEAD()
Often, we might want to compare the value of the current row to that of the
preceding or succeeding row. LAG( ) and LEAD( ) window functions are used
for this purpose

Uses:
● Historical comparison
● Pattern / Trend identification
● Data gap analysis
● Sequential analysis
● Data smoothing
● Data anomaly detection
● Shifting period analysis, etc.
1. LAG( )
● provides access to a row at a specified physical offset which comes
before the current row
● simply, it allows us to get the value from a previous row in the result set
● the default offset for LAG() is 1, but we can specify other offsets

Basic Syntax of LAG( ):

LAG(expression [, offset [, default_value]]) OVER (
[PARTITION BY partition_expression, ... ]
ORDER BY sort_expression [ASC | DESC], ...
)
2. LEAD( )
● provides access to a row at a specified physical offset that follows the
current row
● allows us to "look forward" in our dataset without actually changing the
position of the current row

Arguments of this function:

1. Value expression: The column or expression whose value from a "leading"
row you want to retrieve
2. Offset (optional): The number of rows ahead of the current row from
which to retrieve a value. The default is 1
3. Default value (optional): A value to return if the function tries to look
beyond the last row of the partition. If not specified, it returns NULL

Basic Syntax of LEAD( ):

LEAD(value_expression [, offset [, default_value]]) OVER (
[PARTITION BY partition_expression, ... ]
ORDER BY sort_expression [ASC | DESC], ...
)
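LAG() and LEAD() can be sketched as a day-over-day comparison. The example uses Python's built-in sqlite3 module (window functions need SQLite 3.25 or newer); the daily table and its values are hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE daily (day INTEGER, cases INTEGER)")
conn.executemany("INSERT INTO daily VALUES (?, ?)", [(1, 10), (2, 25), (3, 15)])

comparison = conn.execute("""
    SELECT day,
           cases,
           LAG(cases) OVER (ORDER BY day)             AS prev_cases,  -- NULL on the first row
           LEAD(cases, 1, 0) OVER (ORDER BY day)      AS next_cases,  -- default 0 on the last row
           cases - LAG(cases) OVER (ORDER BY day)     AS change
    FROM daily
    ORDER BY day
""").fetchall()

for row in comparison:
    print(row)
```

The first row has no previous row, so LAG() returns NULL (None in Python) and the difference is NULL as well; LEAD() on the last row falls back to the supplied default of 0.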
Other window functions

1. PERCENT_RANK( ):
○ returns the relative rank of each row as a value between 0 and 1, computed as (rank - 1) / (rows in partition - 1)
○ useful in determining the relative position of a value within a dataset
○ eg: PERCENT_RANK() OVER (ORDER BY score) AS percentile_rank

2. FIRST_VALUE( ):
○ returns the first value in the window frame of the ordered partition
○ useful in comparing each row with the starting point in a dataset
○ eg: FIRST_VALUE(score) OVER (ORDER BY date) AS starting_score

3. LAST_VALUE( ):
○ returns the last value in the window frame of the ordered partition
○ with ORDER BY and no explicit frame, the frame ends at the current row, so the frame must be extended to reach the true last row
○ useful in comparing each row to the end point of a dataset
○ eg: LAST_VALUE(score) OVER (ORDER BY date ROWS BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING) AS ending_score

4. NTH_VALUE( ):
○ returns the nth value in the window frame (NULL if the frame holds fewer than n rows)
○ useful to compare or retrieve a specific position's value within a dataset
○ eg: NTH_VALUE(score, 3) OVER (ORDER BY date ROWS BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING) AS third_score

5. CUME_DIST( ):
○ returns the cumulative distribution of a value: the fraction of rows (between 0 and 1) whose value is less than or equal to the current row's value
○ helps to understand the relative position of a value in terms of the distribution of values
○ eg: CUME_DIST() OVER (ORDER BY score) AS cumulative_distribution
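One detail worth checking by hand is how the window frame affects LAST_VALUE(). With ORDER BY and no explicit frame, the frame stops at the current row, so LAST_VALUE() simply returns the current row's value. The sketch below shows this next to an explicit full-partition frame, using Python's built-in sqlite3 module (SQLite 3.25 or newer). The result table and its values are hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE result (day INTEGER, score INTEGER)")
conn.executemany("INSERT INTO result VALUES (?, ?)", [(1, 50), (2, 70), (3, 60)])

# Frame that spans the whole partition, from first row to last row
full = "ORDER BY day ROWS BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING"

values = conn.execute(f"""
    SELECT day,
           FIRST_VALUE(score) OVER ({full})          AS starting_score,
           LAST_VALUE(score)  OVER (ORDER BY day)    AS last_default_frame,
           LAST_VALUE(score)  OVER ({full})          AS ending_score,
           PERCENT_RANK()     OVER (ORDER BY score)  AS percentile_rank
    FROM result
    ORDER BY day
""").fetchall()

for row in values:
    print(row)
```

In the third column the value changes on every row (it is just the current score), while the fourth column shows the true last score of 60 on every row.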
Key Takeaways

1. Window Functions allow advanced analytics on a set of rows related to the
current row without collapsing them
2. ROW_NUMBER() assigns unique sequential integers
3. RANK() and DENSE_RANK() handle ranking, with the latter avoiding gaps
4. NTILE(n) divides data into 'n' approximately equal parts, ideal for quartiles or deciles
5. LAG() and LEAD() compare data between previous, current, and following rows,
enabling trend analysis

Procedures, Functions & Triggers
1. SQL STORED PROCEDURE
● a precompiled set of SQL statements which can be stored inside the database, so the code can
be reused over and over again
● often simply called a procedure
● a procedure has a name, a list of parameters, and compiled SQL statements
● In SQL, a procedure does not return a value directly (it can pass results back through OUT
parameters); it executes certain tasks or operations
● so if you have an SQL query that you need to write over and over again, save it as a stored
procedure, and then just call it to execute it

2. SQL FUNCTIONS
● a function is a predefined routine that returns a value

Types of SQL Functions

1. Built-in Functions:
● Scalar Functions: return a single value based on input value [eg: UPPER(), LOWER(),
ROUND(), LENGTH() (LEN() in SQL Server), ABS(), SUBSTRING()]
● Aggregate Functions: operate on many records and return a summarized result [eg:
SUM(), AVG(), MAX(), MIN(), COUNT()]
● Window Functions: operate over a set of table rows and return a value
for each row [eg: ROW_NUMBER(), RANK(), LEAD(), LAG(), FIRST_VALUE(),
LAST_VALUE()]
● Table Functions: return a table as a result [eg: UNNEST]
2. User-Defined Functions (UDF)
Key | Function | Procedure
Definition | A function is used to calculate a result using given inputs. | A procedure is used to perform certain tasks in order.
Call | A function can be called by a procedure. | A procedure cannot be called by a function.
DML | DML statements cannot be executed within a function. | DML statements can be executed within a procedure.
SQL, Query | A function can be called within a query. | A procedure cannot be called within a query.
SQL, Call | Whenever a function is called, it is first compiled before being called. | A procedure is compiled once and can be called multiple times without being compiled.
SQL, Return | A function returns a value and control to the calling function or code. | A procedure returns control, but not any value, to the calling function or code.
try-catch | A function has no support for try-catch blocks. | A procedure has support for try-catch blocks.
SELECT | A select statement can have a function call. | A select statement can't have a procedure call.
Explicit Transaction Handling | A function cannot have explicit transaction handling. | A procedure can use explicit transaction handling.

Note: several of these rules are dialect-specific. For example, PostgreSQL functions can execute DML statements.
SQL UDFs :
● Apart from the built-in functions provided by SQL, we can create our own functions,
known as User-Defined Functions (UDFs)
● These functions encapsulate a series of SQL statements into a single compound
statement, so that the users can use them just like the built-in functions of SQL

CREATE OR REPLACE FUNCTION function_name(parameter_list)
RETURNS return_datatype AS $$
BEGIN
executable_section
END;
$$ LANGUAGE plpgsql;

SELECT function_name(argument_list);

SELECT * FROM function_name(argument_list);

Basic Syntax (with variable declaration and exception handling)
CREATE OR REPLACE FUNCTION function_name(parameter_list)
RETURNS return_datatype AS $$
DECLARE
declaration_section (optional)
BEGIN
executable_section
EXCEPTION
exception_handling (optional)
END;
$$ LANGUAGE plpgsql;
3. SQL TRIGGERS
● a trigger is a function invoked automatically before or after an event on a table or view.
● purpose: maintain data integrity, automate tasks, and set up specific conditions for data
modifications

Types of Triggers (in PostgreSQL)

1. Based on Timing:
● BEFORE Triggers: Execute before the event.
● AFTER Triggers: Execute after the event.
● INSTEAD OF: Used for views in place of the actual operation.
2. Based on Event:
● INSERT Triggers: Activate on data insertion.
● UPDATE Triggers: Activate on data update.
● DELETE Triggers: Activate on data deletion.
● TRUNCATE Triggers: Activate on table truncation.
● when we define a trigger, it is always associated with a function

Basic Syntax to define the trigger function

CREATE OR REPLACE FUNCTION trigger_function_name() RETURNS TRIGGER AS $$
BEGIN
executable_section
RETURN NEW; -- a trigger function must return a row (NEW or OLD) or NULL
END;
$$ LANGUAGE plpgsql;

● the function gets executed when the trigger fires

Basic Syntax to define the trigger

CREATE TRIGGER trigger_name
{BEFORE | AFTER | INSTEAD OF} {INSERT | UPDATE | DELETE | TRUNCATE}
ON table_name
FOR EACH {ROW | STATEMENT}
EXECUTE FUNCTION trigger_function_name();

● to remove them, drop the trigger first and then the trigger function (PostgreSQL refuses to
drop a function that a trigger still depends on, unless CASCADE is used)

DROP TRIGGER IF EXISTS trigger_name ON table_name;

DROP FUNCTION trigger_function_name();
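A runnable sketch of an AFTER UPDATE trigger that writes an audit row. It uses Python's built-in sqlite3 module, and SQLite's trigger syntax differs from PostgreSQL's: the trigger body is written inline between BEGIN and END instead of living in a separate trigger function, and DROP TRIGGER takes no ON clause. The tables, the trigger name trg_salary_audit, and the values are hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE employee (id INTEGER PRIMARY KEY, name TEXT, salary REAL);
CREATE TABLE salary_audit (employee_id INTEGER, old_salary REAL, new_salary REAL);

-- Fires once per updated row, after the salary column changes
CREATE TRIGGER trg_salary_audit
AFTER UPDATE OF salary ON employee
FOR EACH ROW
BEGIN
    INSERT INTO salary_audit VALUES (OLD.id, OLD.salary, NEW.salary);
END;

INSERT INTO employee (name, salary) VALUES ('Asha', 50000);
UPDATE employee SET salary = 55000 WHERE name = 'Asha';
""")

audit = conn.execute("SELECT * FROM salary_audit").fetchall()

# After the trigger is dropped, further updates are no longer audited
conn.execute("DROP TRIGGER IF EXISTS trg_salary_audit")
conn.execute("UPDATE employee SET salary = 60000 WHERE name = 'Asha'")
audit_after_drop = conn.execute("SELECT COUNT(*) FROM salary_audit").fetchone()[0]

print(audit, audit_after_drop)
```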

4. CRON
● CRON is a time-based job scheduler in Unix-like operating systems
● enables users to schedule jobs (commands or scripts) to run periodically at fixed times, dates, or
intervals

How it works:
● Uses a cron expression with five fields (* * * * *): minute, hour, day of month, month, and day of week.
● Example: 30 4 * * * runs a job daily at 4:30 AM

Uses:
● Automating backups, sending emails, system maintenance, reporting, notifications, data sync,
auto-updating content, resource monitoring, batch jobs, automated testing, etc.
● In SQL,
○ CRON can be used to schedule recurring database tasks
○ eg: Regular backups, data cleanup, refreshing materialized views, generating reports
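
One way to schedule such tasks from inside PostgreSQL is the pg_cron extension, which the slides do not mention; the sketch below assumes it is installed and preloaded on the server. The job name and the materialized view daily_sales_summary are hypothetical.

```sql
-- Assumption: pg_cron is available on this server (it is an extension, not core PostgreSQL).
CREATE EXTENSION IF NOT EXISTS pg_cron;

-- Schedule a job with the cron expression from the example above: every day at 04:30.
SELECT cron.schedule(
    'refresh-daily-sales',                           -- hypothetical job name
    '30 4 * * *',                                    -- minute hour day-of-month month day-of-week
    'REFRESH MATERIALIZED VIEW daily_sales_summary'  -- hypothetical view
);

-- Inspect the scheduled jobs.
SELECT * FROM cron.job;

-- Remove the job when it is no longer needed.
SELECT cron.unschedule('refresh-daily-sales');
```

Where the extension is not available, the same expression can go in an operating-system crontab entry that runs a SQL script through a database client.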
Select Statement demonstrating various features
SELECT
c.Id, c.FirstName, c.LastName, -- SELECT clause with aliases
COUNT(DISTINCT o.Id) AS total_orders, -- aggregate functions (DISTINCT avoids counting an order once per item)
SUM(oi.quantity * oi.UnitPrice) AS total_spent,
AVG(oi.quantity) AS avg_quantity_per_item,
MAX(o.OrderDate) AS latest_order_date
FROM customer c
JOIN "order" o ON c.Id = o.CustomerId -- joins ("order" is a reserved word, so it is quoted; MySQL uses backticks)
LEFT JOIN OrderItem oi ON o.Id = oi.OrderId
WHERE c.country LIKE 'USA' AND o.OrderDate <= '2023-01-01' -- WHERE clause; LIKE without % or _ behaves like =
GROUP BY c.Id, c.FirstName, c.LastName -- GROUP BY clause
-- HAVING SUM(oi.quantity * oi.UnitPrice) > 1000 -- a comment (here, a disabled HAVING filter)
ORDER BY total_spent DESC -- sorting
LIMIT 10; -- selection limit (TOP in T-SQL)

Online Resources
● RDBMS Design, Data Modeling, and Basic Concepts:
○ Overall notes on Database Design and Data Modeling
○ ERD:
■ https://www.lucidchart.com/pages/er-diagrams
■ https://creately.com/blog/diagrams/er-diagrams-tutorial/
■ https://www.guru99.com/er-diagram-tutorial-dbms.html
■ https://www.youtube.com/playlist?list=PL3R9-um41JsxPg4WAPeEZgH6oAk2oti0Q
■ https://drive.google.com/file/d/11JyZ3p2QCmxpJhthKtS9mZbgD37vRD65/view

● SQL
○ W3schools SQL tutorial: https://www.w3schools.com/sql/default.asp
○ W3schools MySQL tutorial: https://www.w3schools.com/mysql/
○ Mode SQL tutorial: https://mode.com/sql-tutorial
○ Difference between Views and Materialized Views | tutorialspoint
○ CTE in SQL | geeksforgeeks
○ CTE vs subquery | learnsql blog
○ Window functions in SQL | geeksforgeeks
○ SQL Window Function Example | learnsql blog
○ Window Functions - Analytics Vidhya
○ Top 5 Most Popular Window Functions | mode
○ Window functions | PostgreSQL documentation
○ SQL Procedural Language
○ Difference between Function and Procedure

● Data Wrangling with SQL:
○ Mode Tutorial: https://mode.com/sql-tutorial/data-wrangling-with-sql
○ Advanced Data Wrangling Techniques with SQL (blog): https://talent500.co/blog/advanced-data-wrangling-techniques-with-sql/
○ My Data Wrangling Project with SQL (MySQL): https://medium.com/@jennifer.ezeanyim/my-data-wrangling-project-with-sql-mysql-50b6e6e53ba2

● Practice SQL from: https://www.sql-practice.com/

Assignments
● Available in Fuse Classroom
PHASE II: Data Wrangling Project Demonstration
THANK YOU
