
MODULE 3: Relational Databases

Chapter 1: Getting Started with SQL


SQL Basics for Data Analysis

SQL stands for Structured Query Language, and it is a domain-specific programming language used for managing and manipulating relational databases. With SQL, you can ask complex questions and get answers quickly, saving you time and energy.

SQL is important for data science because it is the most commonly used language
for working with relational databases. It allows data scientists to store, integrate,
clean, and analyze large volumes of data efficiently.
SQL provides a powerful set of tools for data manipulation and analysis, making it
an essential skill for any data scientist working with large and complex datasets.
By mastering SQL, data scientists can extract more insights from data and make
more informed data-driven decisions.
Setting up the SQL environment
Modern Relational Database Management Systems:
Modern Relational Database Management Systems (RDBMS) are software systems
designed to manage and manipulate large amounts of data stored in a structured
format.

They provide a way to organize, store, and retrieve data efficiently and are widely
used in industries such as finance, healthcare, and e-commerce.

Some examples of modern RDBMS include:

1. MySQL: An open-source RDBMS that is widely used for web applications and
data-driven websites.
2. PostgreSQL: Another open-source RDBMS that is known for its advanced
features, including support for JSON and geospatial data.
3. Oracle: A proprietary RDBMS that is widely used in enterprise environments
for managing large amounts of data.
4. Microsoft SQL Server: A proprietary RDBMS developed by Microsoft, which is
widely used in business environments for managing and analyzing data.
These modern RDBMS provide a variety of features and tools for managing and
manipulating data, such as support for transactions, indexing, backup and
recovery, and security. They are designed to handle large volumes of data and
provide fast and efficient data access, making them a critical component of
many modern data-driven applications.

Setting Up
Now we are going to set up a SQL environment in PostgreSQL:

1. You can download the latest version of PostgreSQL from the official
website and install it on your system. Click on
https://www.postgresql.org/download/ to download PostgreSQL.
2. Now in the Packages and Installers section, select your operating system,
and on the page that appears, select your required version.

3. PostgreSQL starts downloading; when the download is complete, run the
installer, give the necessary permissions, enter the password you want,
and select the location to install PostgreSQL.
4. Now open the Pgadmin4 application on your computer. You can find it
using the search bar.
5. Click on ‘Servers’ in the Browser bar of pgAdmin 4 and select ‘PostgreSQL 15’.
6. Right-click on the ‘Databases’ option -> ‘Create’ -> ‘Database’.
7. In the Create-Database dialogue box, enter the name of your database and
click Save. Here the database is named ‘dvdrental’; it will be loaded from
the ‘dvdrental.tar’ backup file in the next steps.
8. The database created appears in the ‘Databases’ section. Right now, the
database created is empty; you can load the data by right-clicking on the
Database and selecting ‘restore’.
9. The Restore dialogue box appears. Make sure that ‘Custom or Tar’ is shown
in the Format field, then select the backup file or paste its path into
the Filename field.
10. Go to the Data/Objects section and make sure you have turned on the Pre-
data, Data, and Post-Data options. Now click on Restore.
11. Now right-click on the database you have created and select Refresh.
12. To enter your Query, right-click on the database you have created and
select the option ‘Query Tool’; this is where you will enter your queries
throughout this course.

SQL enables Data Scientists to retrieve and analyze data with powerful query
statements, making it a valuable skill for extracting insights.
To establish a SQL environment for practice, one can install database
management systems like PostgreSQL or MySQL on their local machine.
Data manipulation tasks, such as filtering, sorting, and grouping, can be
efficiently accomplished in SQL using the WHERE, ORDER BY, and GROUP BY clauses.
In Data Science projects, SQL is often used to perform data cleaning, where
redundant information is eliminated, and data is organized for further analysis.

Basic SQL Commands


SELECT: The SELECT statement retrieves data from one or more tables. It allows
you to specify the columns you want to retrieve and can also include aggregate
functions to perform calculations on the data.
Syntax:
SELECT column1, column2, ...
FROM table_name;
Example: To see all the columns of the film table in our dvdrental database,
enter the command below:
Select *
From film;
Here the ‘*’ indicates that every column is selected, and the ‘;’ indicates
the termination of the query.

FROM: The FROM clause specifies the table or tables from which to retrieve the
data.
SELECT column1, column2, ...
FROM table_name;

WHERE: The WHERE clause filters the results based on a specific condition or set
of conditions.
SELECT column1, column2, ...
FROM table_name
WHERE condition;

Select *
From film
Where rental_duration = 6;

GROUP BY: The GROUP BY clause groups the results based on one or more
columns.
SELECT column1, column2, ...
FROM table_name
WHERE condition
GROUP BY column1, column2, ...;
To get the group of film titles based on their ratings where rental_duration is 6, see the sketch below.
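A minimal sketch of such a query against the dvdrental film table:
SELECT rating, COUNT(title) AS film_count
FROM film
WHERE rental_duration = 6
GROUP BY rating;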
HAVING: The HAVING clause filters the results of a GROUP BY query based on a
specific condition or set of conditions.
SELECT column1, column2, ...
FROM table_name
WHERE condition
GROUP BY column1, column2, ...
HAVING condition;
To get the group of film titles, rating, and rental_duration with rental_duration = 6
and having a rating of ‘R’, see the sketch below.
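A minimal sketch, again assuming the dvdrental film table:
SELECT rating, COUNT(title) AS film_count
FROM film
WHERE rental_duration = 6
GROUP BY rating
HAVING rating = 'R';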

ORDER BY: The ORDER BY clause is used to sort the results in either ascending or
descending order based on one or more columns.
SELECT column1, column2, ...
FROM table_name
WHERE condition
GROUP BY column1, column2, ...
HAVING condition
ORDER BY column1, column2, ... ASC|DESC;

You can use ORDER BY, GROUP BY, and HAVING with or without the other clauses, except SELECT and FROM, which are always required.
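For example, a sketch that lists films from the most to the least expensive rental (assuming the film table):
SELECT title, rental_rate
FROM film
ORDER BY rental_rate DESC;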

Creating and Deleting Databases and Tables


Creating tables
Steps to create a Table in PostgreSQL:

1. Right-click on the tables option in the Database dropdown menu and select
‘Create Table’.
2. The Create Table dialogue box appears. Enter the name of the table.
3. Go to the Columns section and add your column name, data type, scale,
etc. or you can inherit the format from an existing table.
4. Click on Save.

You can also create a table using SQL commands.

To create a table in SQL, you can use the CREATE TABLE statement, which
allows you to define the table's columns and their data types. Here's an example
of how to create a simple table with two columns:
CREATE TABLE users (
id INT PRIMARY KEY,
name VARCHAR(50) NOT NULL);
In this example, we are creating a table called "users" with two columns: "id" and
"name". The "id" column is an integer and is set as the table's primary key using
the PRIMARY KEY constraint. The "name" column is a variable-length string of up
to 50 characters and is set not to allow null values using the NOT NULL
constraint.
You can customize the column definitions to fit your specific needs, including
adding more columns, specifying default values, and setting constraints. Once
you have created your table, you can insert data into it using the INSERT INTO
statement.
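A minimal sketch of such an insert into the users table defined above (the values are illustrative):
INSERT INTO users (id, name)
VALUES (1, 'Alice');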
Deleting tables
To delete a database, right-click on the database name in the Servers dropdown
menu and select the ‘Delete/Drop’ option.
To delete a table, go to the Tables option in your database dropdown menu, right-
click on the table you want, and select Delete.
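The same can be done with SQL commands; a sketch, where users is the table created above and mydb is a hypothetical database name:
DROP TABLE users;
DROP DATABASE mydb;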

Importing and Exporting Data from CSV Files


PostgreSQL is a powerful relational database management system that provides
several ways to import and export data. One of the most popular formats for
exchanging data between different systems is the Comma Separated Values (CSV)
file format.

Importing Data from a CSV File:


To import data from a CSV file into a PostgreSQL database, you can use the COPY
command. The COPY command allows you to copy data from a file into a table in
your database

COPY mytable FROM '/path/to/myfile.csv' WITH (FORMAT csv);

In this example, we use the COPY command to import data into a table called
"mytable". The data is being read from a file located at '/path/to/myfile.csv'. The
"FORMAT csv" option specifies that the file is in CSV format.
If your CSV file has a header row, you can skip it by adding the "HEADER" option
to the COPY command:
COPY mytable FROM '/path/to/myfile.csv' WITH (FORMAT csv, HEADER);
This will skip the first row of the CSV file, which is assumed to contain column
names.

Exporting Data to a CSV File:


To export data from a PostgreSQL database to a CSV file, you can use the COPY
command again, but this time with the "TO" option. The "TO" option specifies the
file to which the data should be written.
Here's an example of how to use the COPY command to export data from a table
to a CSV file:
COPY mytable TO '/path/to/myfile.csv' WITH (FORMAT csv);

In this example, we are using the COPY command to export data from a table
called "mytable" to a file located at '/path/to/myfile.csv'. The "FORMAT csv" option
specifies that the data should be written in CSV format.
You can also include a header row in the exported CSV file by adding the
"HEADER" option to the COPY command:
COPY mytable TO '/path/to/myfile.csv' WITH (FORMAT csv, HEADER);

You have been hired as a data analyst for a small startup that sells organic fruits
and vegetables. Your task is to analyze the sales data from the past year and
identify the top-selling products and the most profitable regions. You need to use
SQL to query the data stored in a database to do this.
Follow the steps below to set up your SQL environment:
1. Install PGAdmin, which is a popular open-source administration and
management tool for PostgreSQL.
2. Open PGAdmin and create a new server connection by right-clicking
"Servers" and selecting "Create" > "Server".
3. In the "Create - Server" window, provide a name for the server in the
"Name" field.
4. In the "Connection" tab, provide the following details:
 Host: localhost
 Port: 5432
 Maintenance database: postgres
 Username: <your PostgreSQL username>
 Password: <your PostgreSQL password>
5. Click "Save" to save the server configuration.
6. Now, right-click the server name and select "Create" > "Database" to
create a new database.
7. Provide a name for the database and click "Save" to create it.
8. Next, right-click the database name and select "Query Tool" to open the
SQL query editor.

Task 1: Create the 'sales_data' table in the database using the SQL command
below. Keep the following columns inside the table: product_name, region,
sales_date, units_sold, revenue.

CREATE TABLE sales_data (
product_name VARCHAR(50),
region VARCHAR(50),
sales_date DATE,
units_sold INT,
revenue FLOAT);

Task 2: Insert the following data into the 'sales_data' table:

INSERT INTO sales_data (product_name, region, sales_date, units_sold, revenue)
VALUES
('Organic Apples', 'North', '2022-01-01', 100, 2000.00),
('Organic Bananas', 'North', '2022-01-01', 150, 3000.00),
('Organic Apples', 'South', '2022-01-01', 50, 1000.00),
('Organic Bananas', 'South', '2022-01-01', 200, 4000.00),
('Organic Apples', 'North', '2022-02-01', 75, 1500.00),
('Organic Bananas', 'North', '2022-02-01', 125, 2500.00),
('Organic Apples', 'South', '2022-02-01', 25, 500.00),
('Organic Bananas', 'South', '2022-02-01', 175, 3500.00);

Task 3: Find the total revenue generated by the company.

SELECT SUM(revenue) AS total_revenue
FROM sales_data;

Task 4: Find the total number of units sold.

SELECT SUM(units_sold) AS total_units_sold
FROM sales_data;

Task 5: Find the total revenue generated by product.

SELECT product_name, SUM(revenue) AS total_revenue
FROM sales_data
GROUP BY product_name;

Fundamentals of SQL Query


Anatomy of SQL Query:
SQL stands for Structured Query Language. It's a language that allows you to
communicate with databases and query them for information.

The anatomy of an SQL query can be broken down into several parts. The first part
is the SELECT statement. This is where you specify what columns you want to
retrieve from the database. The second part of the SQL query is the FROM
statement. This specifies which table you want to retrieve data from. The third
part of the SQL query is the WHERE statement. This allows you to filter the results
of your query based on specific conditions. The fourth part of the SQL query is the
ORDER BY statement. This allows you to sort the results of your query based on a
particular column. Finally, there's the LIMIT statement. This allows you to specify
a limit on the number of results returned by your query. For example, if you only
wanted to retrieve the first 10 customers in your table, you would use the LIMIT
statement to specify that.
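For instance, a minimal sketch of such a LIMIT query, assuming a customers table:
SELECT *
FROM customers
LIMIT 10;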

SQL offers a variety of data types, from basic types like integers and strings to
more complex ones like dates and times. Integers for storing whole numbers like
IDs or quantities. Strings for storing text like names or addresses. They also
include specialized types for temporal data, like dates and times, to keep track of
important events like when a document was last modified or when a customer
made a purchase. Spatial data types can also be used to store geographic
coordinates or other location-based data.

Datatypes in SQL

1. Integer: An integer data type is used for whole numbers, both positive and
negative, with no fractional part. Examples of integers are 1, -10, and 1000.
In SQL, the INT or INTEGER keyword is used to define this data type.
2. Decimal/Float: These data types are used for numbers with a fractional part.
Decimal and float data types differ in precision and storage size. Examples
of decimals and floats are 3.14, -0.005, and 1.23456. In SQL, the DECIMAL
or FLOAT keyword defines these data types.
3. Char/Varchar: Char and varchar data types store text values. Char data type
has a fixed length, while varchar has a variable length. Examples of char
and varchar values are "hello", "world", and "SQL is awesome". In SQL, the
CHAR and VARCHAR keywords are used to define these data types.
4. Date/Time: Date and time data types are used for storing date and time
values. Examples of date and time values are "2022-03-23" and "14:30:00".
In SQL, the DATE and TIME keywords are used to define these data types.
5. Boolean: The Boolean data type is used to store logical values. It can only
have two values: TRUE or FALSE. In SQL, the BOOLEAN keyword is used to
define this data type.
6. Blob: The Blob data type stores binary data, such as images, audio, or video
files. In SQL, the BLOB keyword is used to define this data type.
7. Text: The text data type is used for storing long text values, such as
comments or descriptions. In SQL, the TEXT keyword is used to define this
data type.

CREATE TABLE employees (
id INT,
name VARCHAR(50),
hire_date DATE,
salary DECIMAL(10, 2),
is_manager BOOLEAN,
photo BLOB,
bio TEXT);

Operators in SQL

Arithmetic Operators

SELECT price + tax AS total_price FROM myTable;


SELECT revenue - expenses AS net_income FROM myTable;
SELECT quantity * price AS total_cost FROM myTable;
SELECT revenue / number_of_customers AS avg_revenue_per_customer FROM
myTable;
SELECT order_number, order_number % 3 AS order_group FROM myTable;

Comparison Operators

SELECT * FROM myTable WHERE name = 'John';


SELECT * FROM myTable WHERE age <> 30; -- same as != (not equal to)

SELECT * FROM myTable WHERE price < 10;
SELECT * FROM myTable WHERE quantity > 100;
SELECT * FROM myTable WHERE date <= '2022-01-01';
SELECT * FROM myTable WHERE rating >= 4;

Logical Operators

SELECT * FROM myTable WHERE age < 30 AND gender = 'male';


SELECT * FROM myTable WHERE category = 'electronics' OR price < 50;
SELECT * FROM myTable WHERE NOT status = 'completed';
SELECT * FROM myTable WHERE age < 30 AND (gender = 'male' OR gender =
'female') AND salary > 50000;

String Operators

Concatenation Operator (||), LIKE Operator (with the % wildcard), Length Function (LENGTH),
Upper and Lower Functions (UPPER, LOWER), Substring Function (SUBSTRING)

SELECT first_name || ' ' || last_name AS full_name FROM myTable;


SELECT * FROM myTable WHERE email LIKE '%gmail.com';
SELECT * FROM myTable WHERE LENGTH(username) > 5;
SELECT * FROM myTable WHERE UPPER(city) = 'NEW YORK';
SELECT * FROM myTable WHERE SUBSTRING(phone_number, 1, 3) = '555';

Filtering data using the WHERE clause


1. The WHERE clause with an equal operator
2. The WHERE clause with AND, OR, and BETWEEN operators
3. The WHERE clause with LIKE and IN operators
4. The WHERE clause with comparison operators

select * from customer where first_name='LINDA';


select * from customer where address_id = 598 and store_id = 1;
select * from customer where store_id = 1 OR active = 0;
select * from customer where address_id between 560 and 570;
select * from customer where customer_id in (10, 11, 12);

Sorting data using ORDER BY clause


SELECT <COLUMN_1>,<COLUMN_2>,.. FROM <TABLE_NAME> ORDER BY <COLUMN_1> DESC
SELECT <COLUMN_1>,<COLUMN_2>,.. FROM <TABLE_NAME> ORDER BY <COLUMN_1> ASC

Aggregate functions in SQL:

COUNT(): This function is used to count the number of rows in a table or the
number of times a particular value appears in a column.
string_agg(): string_agg is an aggregate function in PostgreSQL that allows you to
concatenate multiple values from a column into a single string, with an optional
delimiter. It's commonly used to combine values from multiple rows into a single
value.
SUM(), AVG(), MIN(), MAX()
SELECT COUNT(*) FROM orders;

SELECT SUM(amount) FROM orders;
SELECT AVG(amount) FROM orders;
SELECT MIN(amount) FROM orders;
SELECT MAX(amount) FROM orders;
SELECT string_agg(customer_name, ', ') AS concatenated_names
FROM customers;

Dealing With Multiple Tables


In SQL, grouping data is a way to organize data based on one or more columns in
a table. It is a powerful technique that allows us to summarize data and derive
insights from it.
We use the GROUP BY clause in our SELECT statement to group data in SQL.
Example 1: Grouping data by a single column
SELECT customer_id, COUNT(order_id) as order_count
FROM orders
GROUP BY customer_id;
This will give us the total number of orders for each customer in the orders table.

Example 2: Grouping data by multiple columns


SELECT customer_id, product_id, SUM(total_price) as total_sales
FROM orders
GROUP BY customer_id, product_id;
This will give us the total sales revenue for each combination of customer_id and
product_id in the orders table.

Example 3: Grouping data using aggregate functions


We can also use aggregate functions like COUNT, SUM, AVG, MAX, and MIN to
summarize data within each group.
SELECT customer_id, AVG(total_price) as avg_order_value
FROM orders
GROUP BY customer_id;

Example 1: Filtering grouped data with HAVING


Suppose we have a table called orders that contains information about customer
orders, including the customer_id and the total price of each order. We want to
group the data by customer_id and find the total revenue generated by each
customer. However, we only want to include customers whose total revenue
exceeds $100.
SELECT customer_id, SUM(price) as total_revenue
FROM orders
GROUP BY customer_id
HAVING SUM(price) > 100;

Example 2: Using aggregate functions in HAVING


Suppose we have a table called sales that contains information about sales
transactions, including the product_id and the sale_price of each sale. We want to
group the data by product_id and find the average sale price for each product.
However, we only want to include products whose average sale price exceeds
$50.
SELECT product_id, AVG(sale_price) as avg_sale_price
FROM sales
GROUP BY product_id
HAVING AVG(sale_price) > 50;
This query will return only the products whose average sale price exceeds $50.

Subqueries:
A subquery is a query within a query.
Syntax for SQL subqueries:
SELECT column1, column2, ...
FROM table_name
WHERE columnX operator (SELECT columnY FROM table_name WHERE condition);

Example 1: Using a subquery to filter results


Suppose we have a table called orders that contains information about customer
orders, including the customer_id and the price of each order. We want to find all
orders that were placed by customers who live in a specific region, which is
defined in a separate table called customers.
SELECT order_id, customer_id, price
FROM orders
WHERE customer_id IN (SELECT customer_id FROM customers WHERE region =
'West');
Here, the subquery returns a list of customer_id values that match the condition region =
'West' from the customers table. This list is then used as the input for the IN
operator in the outer query's WHERE clause, which filters the orders table to only
include orders placed by customers in the West region.

Example 2: Using a subquery to perform calculations


Subqueries can also be used to perform calculations and return aggregate values.
Let's say we want to find the average order price for customers who have placed
more than 3 orders in the orders table.
SELECT AVG(price) as avg_order_price
FROM orders
WHERE customer_id IN (SELECT customer_id FROM orders GROUP BY customer_id
HAVING COUNT(*) > 3);

Example 3: Using a subquery in a join


Subqueries can also be used in conjunction with joins to create more complex
queries. Let's say we have a table called products that contains information about
products, including the product_id and the price of each product. We want to find all
orders that include products with a price greater than the average price of all
products.
SELECT o.order_id, o.customer_id, o.price
FROM orders o
CROSS JOIN (SELECT AVG(price) as avg_price FROM products) p
WHERE o.price > p.avg_price;

To join two or more tables in SQL, you need to use the JOIN keyword and specify the
columns on which the tables are related using the ON keyword. Here is an
example of how to join two tables using the INNER JOIN
SELECT *
FROM table1
INNER JOIN table2
ON table1.column_name = table2.column_name;

This statement selects all columns from both tables where there is a match
between the column_name in both tables.
There are different types of SQL joins available to perform this task, including:
1. INNER JOIN: Returns only the rows where there is a match between the
columns in both tables.
2. LEFT JOIN: Returns all the rows from the left table and the matching rows
from the right table. If there is no match, then NULL values are returned for
the right table.
3. RIGHT JOIN: Returns all the rows from the right table and the matching rows
from the left table. If there is no match, then NULL values are returned for
the left table.
4. FULL OUTER JOIN: Returns all the rows from both tables. If there is no
match, then NULL values are returned for the missing data.
5. CROSS JOIN: Returns the Cartesian product of both tables, which means all
the possible combinations of rows from both tables.

Alias in SQL queries


In SQL, an alias is a temporary name assigned to a table or column in a query. It is
used to make the query more readable or concise.
Column alias syntax:
SELECT column_name AS alias_name
FROM table_name;
In this syntax, "column_name" is the name of the column you want to query, and
"alias_name" is the temporary name you want to assign to it. The resulting output
will have this alias as the column header.

Table alias syntax:


SELECT column1, column2, ...
FROM table_name AS alias_name;
In this syntax, the "table_name" is the name of the table you want to query, and
"alias_name" is the temporary name you want to assign to it. You can then refer
to the table using the alias name in the rest of your query.
Aliases can also be used with aggregate functions, subqueries, and views in SQL.

Working with Multiple Tables Using Subqueries


Subqueries can be used to work with multiple tables in SQL using foreign keys.
A subquery is a query nested within another query and can be used to retrieve
data from one or more tables.

Retrieving data from two tables using a subquery:
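A minimal sketch, assuming customers and orders tables like the ones defined later in this chapter:
SELECT name
FROM customers
WHERE customer_id IN (SELECT customer_id FROM orders WHERE amount > 100);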

Using subqueries to aggregate data from multiple tables:
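A minimal sketch using a correlated subquery over the same assumed tables:
SELECT c.name,
(SELECT COUNT(*) FROM orders o WHERE o.customer_id = c.customer_id) AS order_count
FROM customers c;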

Using Set Operators (e.g. UNION, INTERSECT, EXCEPT):


SQL (Structured Query Language) supports several set operators that allow you to
combine the results of two or more SELECT statements. The set operators
include:
1. UNION: This operator combines the results of two SELECT statements and
eliminates duplicate rows.
2. UNION ALL: This operator combines the results of two SELECT statements
and includes duplicate rows.
3. INTERSECT: This operator returns only the common rows returned by two
SELECT statements.
4. EXCEPT or MINUS: This operator returns only the distinct rows from the first
SELECT statement that are not in the second SELECT statement.

UNION: To combine the count of films and the count of actors into one table, use
the query below. Note that to use UNION, both SELECT statements must return the
same number of columns, and the corresponding columns must be of the same datatype.
SELECT count(title) from film
union
SELECT count(first_name) from actor;

INTERSECT: To get the common first names of the Actors and Staff, we can use the
INTERSECT operator. Both columns again should be of the same datatype.
SELECT first_name from actor
intersect
SELECT first_name from staff;

EXCEPT or MINUS: To get the titles of all films except those with a rating ‘R’, we can
use the EXCEPT (or MINUS) operator as shown.
SELECT title from film
EXCEPT
SELECT title from film where rating='R';

Aggregating Data from Multiple Tables using GROUP BY and HAVING


When working with relational databases, it's often necessary to aggregate data
from multiple tables to gain insight into the data. The SQL GROUP BY and HAVING
clauses are powerful tools for doing this.
The GROUP BY clause groups rows together based on one or more columns. This is
useful when you want to perform an aggregate function, such as COUNT(), SUM(),
AVG(), etc., on each group of rows.

The HAVING clause is used to filter groups based on some condition. This is useful
when you want to limit the results only to include groups that meet certain
criteria.

Task 1:
From the dvdrental database, get the title, description, and category of the films.
Task 2:
Get the count of the films based on their category_id.
Task 3:
Get the average rental_duration of films where the film category is 3.
Task 4:
Get the average rent of films grouped by their rating. The output column name
must be ‘Avg_rent_based_on_rating’.
Task 5:
Get the titles, rating, and film_id of films except those with PG and PG-13 ratings.

Task 1: We used SQL inner joins on the film_id and category_id columns to connect
the film and category tables, as in the sketch below.
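A minimal sketch, assuming the dvdrental schema where film and category are linked through the film_category bridge table:
SELECT f.title, f.description, c.name AS category
FROM film f
INNER JOIN film_category fc ON f.film_id = fc.film_id
INNER JOIN category c ON fc.category_id = c.category_id;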

Task 2: By using subqueries, we aggregated the data from the film_category and
category tables to get the count of films in each category, as sketched below.
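A minimal sketch, assuming the dvdrental category and film_category tables:
SELECT c.category_id, c.name,
(SELECT COUNT(*) FROM film_category fc WHERE fc.category_id = c.category_id) AS film_count
FROM category c;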

Task 3: We have to aggregate the tables as shown below by using GROUP BY and HAVING.
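A minimal sketch, assuming the dvdrental film and film_category tables:
SELECT fc.category_id, AVG(f.rental_duration) AS avg_rental_duration
FROM film f
INNER JOIN film_category fc ON f.film_id = fc.film_id
GROUP BY fc.category_id
HAVING fc.category_id = 3;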

Task 4: Use the AS clause to create an alias for the AVG(rental_rate) column. To get
the average for each group, use the GROUP BY clause as shown below.
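A minimal sketch against the film table:
SELECT rating, AVG(rental_rate) AS "Avg_rent_based_on_rating"
FROM film
GROUP BY rating;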

Task 5: HINT: Use the EXCEPT operator of SQL to exclude films with rating PG and PG-13.
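One possible sketch:
SELECT title, rating, film_id FROM film
EXCEPT
SELECT title, rating, film_id FROM film WHERE rating IN ('PG', 'PG-13');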

Advanced SQL Joins

Structured Query Language (SQL) is a popular programming language for managing
and manipulating relational databases. 💻 One of the most important features of
SQL is the ability to join multiple tables together. 🔗 Joins allow you to
combine data from two or more tables into a single result set.

Self Join
A self join is a join in which a table is joined with itself. This is useful when you
have a table that contains hierarchical data, such as an organizational chart.

Table: employees

emp_id  name   manager_id
1       John   3
2       Jane   3
3       Bob    4
4       Alice  null
In this example, the "manager_id" column is a foreign key that references the
"emp_id" column of the same table. To get the name of the manager for each
employee, we can use a self join:

SELECT e.name as employee_name, m.name as manager_name
FROM employees e
INNER JOIN employees m
ON e.manager_id = m.emp_id;

This will give us the following result:

employee_name  manager_name
John           Bob
Jane           Bob
Bob            Alice

Cross Join
A cross join, also known as a Cartesian join, returns the Cartesian product of the
two tables. This means that every row from the first table is combined with every
row from the second table.

Table: colors
color
red
blue
green

Table: sizes
size
S
M
L

SELECT colors.color, sizes.size
FROM colors
CROSS JOIN sizes;

This will give us the following result:

color  size
red    S
red    M
red    L
blue   S
blue   M
blue   L
green  S
green  M
green  L
Natural Join
A natural join is a type of join that automatically matches columns with the same
name from two tables. This can be useful when the two tables have columns with
identical names and you want to join them based on those columns.

Table: students

id  name  age
1   John  18
2   Jane  19
3   Bob   20

Table: grades

id  grade
1   A
2   B
3   C

SELECT *
FROM students
NATURAL JOIN grades;

id  name  age  grade
1   John  18   A
2   Jane  19   B
3   Bob   20   C

Joining multiple tables


Table: employees

emp_id  name     dept_id
1       John     1
2       Jane     1
3       Bob      2
4       Alice    2
5       Michael  3

Table: departments

dept_id  dept_name
1        HR
2        IT
3        Finance

Table: salaries

emp_id  salary
1       50000
2       60000
3       70000
4       80000
5       90000

Using Inner Join


SELECT e.name AS employee_name, d.dept_name AS department_name, s.salary
FROM employees e
INNER JOIN departments d
ON e.dept_id = d.dept_id
INNER JOIN salaries s
ON e.emp_id = s.emp_id;

employee_name  department_name  salary
John           HR               50000
Jane           HR               60000
Bob            IT               70000
Alice          IT               80000
Michael        Finance          90000

Using Left Join


A LEFT JOIN returns all records from the left table (the table specified before the
LEFT JOIN keyword) and the matched records from the right table (the table
specified after the LEFT JOIN keyword). If there's no matching record in the right
table, the result will contain NULL values for the columns of that table.

SELECT e.name AS employee_name, d.dept_name AS department_name, s.salary
FROM employees e
LEFT JOIN departments d
ON e.dept_id = d.dept_id
LEFT JOIN salaries s
ON e.emp_id = s.emp_id;

employee_name  department_name  salary
John           HR               50000
Jane           HR               60000
Bob            IT               70000
Alice          IT               80000
Michael        Finance          90000

Using both Left and Inner Join


SELECT e.name AS employee_name, d.dept_name AS department_name, s.salary
FROM employees e
LEFT JOIN departments d
ON e.dept_id = d.dept_id
INNER JOIN salaries s
ON e.emp_id = s.emp_id;

employee_name  department_name  salary
John           HR               50000
Jane           HR               60000
Bob            IT               70000
Alice          IT               80000
Michael        Finance          90000

Handling duplicate records and eliminating duplicates


There are various ways to handle duplicate records in SQL, from identifying
duplicates using the GROUP BY clause and the COUNT() function, to eliminating
duplicates using the DISTINCT keyword and the UNION operator.

Creating a table with duplicate records


CREATE TABLE customers (
id INT PRIMARY KEY,
name VARCHAR(50),
email VARCHAR(50),
city VARCHAR(50));
INSERT INTO customers (id, name, email, city) VALUES
(1, 'John Smith', '[email protected]', 'New York'),
(2, 'Jane Doe', '[email protected]', 'London'),
(3, 'John Smith', '[email protected]', 'Los Angeles'),
(4, 'Sarah Johnson', '[email protected]', 'New York'),
(5, 'Jane Doe', '[email protected]', 'New York');

Using GROUP BY and COUNT()


One way to identify duplicates is to use the GROUP BY clause and the COUNT()
function. The GROUP BY clause groups records with the same value in a specific
column together, while the COUNT() function returns the number of rows in each
group.

To identify the duplicate records in the customers table, we can group the records
by the name and email columns and use the COUNT() function to return the
number of records in each group:

SELECT name, email, COUNT(*)
FROM customers
GROUP BY name, email
HAVING COUNT(*) > 1;

name        email              COUNT(*)
John Smith  [email protected]  2
Jane Doe    [email protected]  2

Using DISTINCT
The DISTINCT keyword can be used to eliminate duplicates from a result set. It
returns only unique values from the specified column(s)

SELECT DISTINCT name FROM customers;


name
John Smith
Jane Doe
Sarah Johnson

Using UNION
The UNION operator can be used to combine the results of two or more SELECT
statements, eliminating duplicates in the process.

SELECT name FROM customers
UNION
SELECT city FROM customers;

name
Jane Doe
John Smith
London
Los Angeles
New York
Sarah Johnson

Using UNION and UNION ALL to combine data from multiple tables
CREATE TABLE customers (
customer_id INT PRIMARY KEY,
name VARCHAR(50),
email VARCHAR(50));

CREATE TABLE orders (
order_id INT PRIMARY KEY,
customer_id INT,
order_date DATE,
amount DECIMAL(10, 2));

CREATE TABLE invoices (
invoice_id INT PRIMARY KEY,
customer_id INT,
invoice_date DATE,
amount DECIMAL(10, 2));

INSERT INTO customers VALUES (1, 'John Doe', '[email protected]');


INSERT INTO customers VALUES (2, 'Jane Doe', '[email protected]');
INSERT INTO customers VALUES (3, 'Bob Smith', '[email protected]');

INSERT INTO orders VALUES (1, 1, '2022-01-01', 100.00);
INSERT INTO orders VALUES (2, 1, '2022-02-01', 200.00);
INSERT INTO orders VALUES (3, 2, '2022-03-01', 300.00);
INSERT INTO orders VALUES (4, 3, '2022-04-01', 400.00);

INSERT INTO invoices VALUES (1, 1, '2022-01-15', 50.00);


INSERT INTO invoices VALUES (2, 1, '2022-02-15', 75.00);
INSERT INTO invoices VALUES (3, 2, '2022-03-15', 100.00);
INSERT INTO invoices VALUES (4, 2, '2022-04-15', 125.00);
INSERT INTO invoices VALUES (5, 3, '2022-05-15', 150.00);
Using UNION:
SELECT order_id AS transaction_id, customer_id, order_date AS
transaction_date, amount
FROM orders
UNION
SELECT invoice_id AS transaction_id, customer_id, invoice_date AS
transaction_date, amount
FROM invoices;

Using UNION ALL:


SELECT order_id, customer_name, order_date, product_name, quantity, price
FROM sales_2019
UNION ALL
SELECT order_id, customer_name, order_date, product_name, quantity, price
FROM sales_2020;
Unlike UNION, UNION ALL includes all rows from each table, including duplicates.
This can be useful when you want to combine data from multiple tables (here, two
tables sales_2019 and sales_2020 with identical columns) without eliminating duplicates.

CREATE TABLE employees (
id SERIAL PRIMARY KEY,
name VARCHAR(50),
dept_id INT,
salary NUMERIC(8,2),
manager_id INT);

CREATE TABLE departments (
id SERIAL PRIMARY KEY,
name VARCHAR(50),
manager_id INT);

INSERT INTO departments (name, manager_id) VALUES
('Sales', 1),
('Marketing', 2),
('Finance', 3),
('IT', 4),
('Human Resources', 5);

INSERT INTO employees (name, dept_id, salary, manager_id) VALUES
('John', 1, 5000.00, 1),
('Mary', 1, 6000.00, 1),
('Joe', 2, 5500.00, 2),
('Sarah', 2, 6500.00, 2),
('Tom', 3, 7000.00, 3),
('Bob', 3, 8000.00, 3),
('Lisa', 4, 4500.00, 4),
('Mike', 4, 5500.00, 4),
('Jane', 5, 6000.00, 5),
('Alex', 5, 7000.00, 5),
('David', 1, 5000.00, 1),
('James', 1, 6000.00, 1),
('Karen', 2, 5500.00, 2),
('Rachel', 2, 6500.00, 2),
('Tim', 3, 7000.00, 3),
('Alice', 3, 8000.00, 3),
('Julie', 4, 4500.00, 4),
('Peter', 4, 5500.00, 4),
('Mark', 5, 6000.00, 5),
('Michelle', 5, 7000.00, 5);

Questions:
1. How many employees are working in each department?
2. Who are the managers of each department?
3. Who are the employees who are reporting to a specific manager?
4. Which department has the highest-paid employees?
5. What is the average salary of employees in each department?
6. Which employees are reporting to themselves (self-join)?

SELECT dept_id, COUNT(*) as num_employees
FROM employees
GROUP BY dept_id;

dept_id  num_employees
1        4
2        4
3        4
4        4
5        4

SELECT d.name AS department, e.name AS manager
FROM departments d
JOIN employees e ON d.manager_id = e.id;

department       manager
Sales            John
Marketing        Mary
Finance          Joe
IT               Sarah
Human Resources  Tom

SELECT e.name as employee_name, m.name as manager_name
FROM employees e
JOIN employees m ON e.manager_id = m.id
WHERE m.name = 'John';

employee_name  manager_name
John           John
Mary           John
David          John
James          John

SELECT d.name as department_name, MAX(e.salary) as max_salary
FROM employees e
JOIN departments d ON e.dept_id = d.id
GROUP BY d.name;

department_name  max_salary
Finance          8000.00
Human Resources  7000.00
Marketing        6500.00
Sales            6000.00
IT               5500.00

SELECT d.name as department_name, AVG(e.salary) as avg_salary
FROM employees e
JOIN departments d ON e.dept_id = d.id
GROUP BY d.name;

department_name  avg_salary
Marketing        6000
Finance          7500
Human Resources  6500
Sales            5500
IT               5000

SELECT e.name AS employee_name
FROM employees e
JOIN employees m ON e.manager_id = m.id
WHERE e.id = m.manager_id;

employee_name
John

More techniques for handling and eliminating duplicates

Using INNER JOIN

SELECT DISTINCT table1.column1, table1.column2, table2.column3
FROM table1
INNER JOIN table2
ON table1.common_field = table2.common_field;
Here we use an inner join to combine data from table1 and table2 based on the
common_field column. The DISTINCT keyword is used to return only distinct rows in
the result set.

SELECT table1.column1, COUNT(*) AS count
FROM table1
INNER JOIN table2
ON table1.common_field = table2.common_field
GROUP BY table1.column1
HAVING COUNT(*) > 1;
Here we join table1 and table2 on the common_field column, and then group the result set
by table1.column1. The COUNT(*) function counts the number of rows in each group,
and the HAVING clause filters the result set to include only groups with more
than one row. This query can be used to identify duplicate values in
table1.column1.

Using GROUP BY and MAX()

This approach can be used when you have duplicate records with different
values in one or more columns, and you want to keep only the record with the
maximum value in a specific column.

SELECT name, MAX(id)
FROM customers
GROUP BY name;

name           MAX(id)
Jane Doe       5
John Smith     3
Sarah Johnson  4

Using ROW_NUMBER()

This function assigns a unique row number to each row in the result set,
which can be used to filter out duplicates. We can use a subquery to assign
row numbers to each record and then filter out the duplicates by selecting
only the rows with a row number of 1.

SELECT id, name, email, city
FROM (
SELECT id, name, email, city,
ROW_NUMBER() OVER (PARTITION BY name ORDER BY id DESC) AS rn
FROM customers
) t
WHERE rn = 1;

id  name           email              city
5   Jane Doe       [email protected]  New York
3   John Smith     [email protected]  Los Angeles
4   Sarah Johnson  [email protected]  New York

Chapter 2: In-Built Functions


Type Casting and Math Functions

SQL has a variety of mathematical functions that can be used to perform
calculations on numerical data. These functions are essential for manipulating
and analyzing data in a database.

ABS(): This function returns the absolute value of a number.


ABS(numeric_value)
Example: ABS(-5) returns 5.

CEILING(): This function returns the smallest integer that is greater than or equal
to a given number.
CEILING(numeric_value)
Example: CEILING(3.14) returns 4.

FLOOR(): This function returns the largest integer that is less than or equal to a
given number.
FLOOR(numeric_value)
Example: FLOOR(3.14) returns 3.

ROUND(): This function rounds a number to a specified number of decimal places.


ROUND(numeric_value, decimal_places)
Example: ROUND(3.14159, 2) returns 3.14.

In addition to these functions, SQL also has trigonometric functions like SIN(),
COS(), and TAN(), as well as logarithmic functions like LOG() and LN(). These
functions are useful for more advanced mathematical calculations.

Type Conversion Functions:


CAST() - converts a value from one datatype to another.
CAST(value AS datatype)
Example: CAST('123' AS INT) converts the string '123' to an integer datatype.

TO_CHAR() - converts a value to a string datatype.


TO_CHAR(value, optional_format)
Example: TO_CHAR(123) converts the number 123 to the string '123'.

TO_DATE() - converts a string to a date datatype.


TO_DATE(string_value, format)
Example: TO_DATE('2022-04-01', 'YYYY-MM-DD') converts the string '2022-04-01' to a
date datatype.

TO_NUMBER() - converts a string to a numeric datatype.

TO_NUMBER(string_value, optional_format)
Example: TO_NUMBER('123.45', '999D99') converts the string '123.45' to a numeric
datatype with two decimal places.

Related functions and operators to know:
 The SQL function LEN() finds the length of a string value; the PostgreSQL equivalent is CHAR_LENGTH().
 The SQL function SIN() returns the sine of an angle in radians; COS() returns the cosine.
 The SQL function POWER() raises a number to a specified power; EXP() raises e to a given power.
 The SQL function MOD() returns the remainder after dividing one number by another, equivalent to the % operator.
 The SQL function BITAND() performs a bitwise AND operation between two numeric values, equivalent to the & operator.
 The SQL function TRUNC() truncates a number to a specified number of decimal places, whereas ROUND() rounds it.

Using CASE Statements to Perform Conditional Operations:


In SQL, a CASE statement allows you to perform conditional operations based on
specific conditions.
The basic syntax of the CASE statement is

CASE
WHEN condition1 THEN result1
WHEN condition2 THEN result2
...
ELSE default_result
END
For example, under the new guidelines of the DVD Rental Corporation, the rental
rate for films with an ‘R’ rating has been increased by 10% and for all others by
50%. Now we need to calculate the penalty for each film in a new column called
penalty, as sketched below.
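A minimal sketch against the dvdrental film table, under the assumption that the penalty is the amount added to the rental rate:
SELECT title, rating, rental_rate,
CASE
WHEN rating = 'R' THEN rental_rate * 0.10
ELSE rental_rate * 0.50
END AS penalty
FROM film;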

The CASE statement in SQL is used to perform conditional operations based on
specific conditions. In the CASE statement, conditions are evaluated in order, and
the first true condition will determine the result. The ELSE clause in the CASE
statement returns a default result if none of the conditions are true. The CASE
statement is a standard SQL feature and is supported by most SQL databases,
including PostgreSQL.

Activity:
The database contains the following tables:
 Products (product_id, product_name, product_price)
 Customers (customer_id, customer_name, customer_email, customer_phone)
 Sales (sale_id, sale_date, sale_quantity, sale_price, customer_id, product_id)

CREATE TABLE Products (
product_id SERIAL PRIMARY KEY,
product_name VARCHAR(50) NOT NULL,
product_price NUMERIC(10,2) NOT NULL
);

CREATE TABLE Customers (
customer_id SERIAL PRIMARY KEY,
customer_name VARCHAR(50) NOT NULL,
customer_email VARCHAR(50),
customer_phone VARCHAR(20)
);

CREATE TABLE Sales (
sale_id SERIAL PRIMARY KEY,
sale_date DATE NOT NULL,
sale_quantity INTEGER NOT NULL,
sale_price NUMERIC(10,2) NOT NULL,
customer_id INTEGER REFERENCES Customers(customer_id),
product_id INTEGER REFERENCES Products(product_id)
);

INSERT INTO Products (product_name, product_price) VALUES
('Product A', 9.99),
('Product B', 14.99),
('Product C', 19.99),
('Product D', 24.99);

INSERT INTO Customers (customer_name, customer_email, customer_phone) VALUES
('John Doe', '[email protected]', '123-456-7890'),

('Jane Smith', '[email protected]', '555-555-5555'),
('Bob Johnson', '[email protected]', NULL);

INSERT INTO Sales (sale_date, sale_quantity, sale_price, customer_id, product_id) VALUES
('2022-01-01', 2, 9.99, 1, 1),
('2022-01-02', 1, 14.99, 2, 2),
('2022-01-03', 3, 19.99, 3, 3),
('2022-01-04', 1, 24.99, 1, 4),
('2022-01-05', 2, 14.99, 2, 2),
('2022-01-06', 1, 9.99, 3, 1),
('2022-01-07', 4, 19.99, 1, 3),
('2022-01-08', 2, 24.99, 2, 4),
('2022-01-09', 1, 14.99, 3, 2),
('2022-01-10', 3, 9.99, 1, 1);

What is the total revenue generated from sales transactions?


SELECT ROUND(SUM(s.sale_quantity * p.product_price), 2) AS "Total Revenue"
FROM sales s
JOIN products p ON s.product_id = p.product_id;

What is the average price of all products?


SELECT CEILING(AVG(product_price)) AS "Average Price"
FROM products;
Here the CEILING() function rounds up the average price to the nearest integer.

What is the maximum sale price for each product?


SELECT product_id, MAX(sale_price) AS max_sale_price
FROM sales
GROUP BY product_id;

What is the total revenue generated from sales transactions for each customer?
SELECT customer_id, ROUND(SUM(sale_quantity * sale_price), 2) AS
total_revenue
FROM sales
GROUP BY customer_id;

What is the total number of products sold to each customer?


SELECT customer_id, SUM(sale_quantity) AS total_products_sold
FROM Sales
GROUP BY customer_id;

What is the customer name, email, and phone number for the customer who made
the most purchases?
SELECT
c.customer_name,
c.customer_email,
c.customer_phone,
MAX(purchase_count) AS "Max Purchases"

FROM
customers c
JOIN (
SELECT
customer_id,
COUNT(*) AS purchase_count
FROM
sales
GROUP BY
customer_id
) p ON c.customer_id = p.customer_id
GROUP BY
c.customer_id
ORDER BY
"Max Purchases" DESC
LIMIT 1;

Mathematical functions: Mathematical functions in SQL are used to perform


operations on numeric data types. These functions include ABS (absolute value),
CEILING (rounds up to the nearest integer), FLOOR (rounds down to the nearest
integer), ROUND (rounds to a specified number of decimal places), and many
more.
Type conversion functions: Type conversion functions in SQL are used to convert
data from one data type to another. These functions include CAST (converts a
value to a specified data type), CONVERT (converts a value to a specified data
type with a specific style), and others.
String manipulation functions: String manipulation functions in SQL are used to
manipulate character data types. These functions include SUBSTRING (extracts a
part of a string), CONCAT (concatenates two or more strings), REPLACE (replaces
a string with another string), and more.
Date/time functions: Date/time functions in SQL are used to perform operations on
date and time data types. These functions include DATEADD (adds a specified
interval to a date), DATEDIFF (calculates the difference between two dates),
GETDATE (returns the current system date and time), and others.
Regular expressions: Regular expressions in SQL are used to search for patterns
in character data types. These patterns are defined using a specific syntax that
allows for complex search queries.
CASE statements: CASE statements in SQL are used to perform conditional
operations. They allow you to specify a condition and a corresponding action to
take if the condition is true. They are useful for creating custom calculations or
categorizing data based on specific criteria.

DateTime & String Functions

Working with date/time data in SQL:


The DATE data type represents a date in the format YYYY-MM-DD. For example,
the date March 29th, 2023 would be represented as '2023-03-29'.
The TIME data type represents a specific time of day in the format HH:MI:SS. For
example, the time 1:30 PM would be represented as '13:30:00'.

The TIMESTAMP data type represents a specific date and time in the format
YYYY-MM-DD HH:MI:SS. For example, the date and time March 29th, 2023 at 1:30
PM would be represented as '2023-03-29 13:30:00'.

Example 1: Selecting records from a specific date range


SELECT *
FROM customer_orders
WHERE order_date BETWEEN '2022-01-01' AND '2022-12-31';

Example 2: Extracting specific parts of a date/time value


SELECT EXTRACT(YEAR FROM order_date) AS order_year,
EXTRACT(MONTH FROM order_date) AS order_month,
COUNT(*) AS order_count
FROM customer_orders
GROUP BY order_year, order_month;

Example 3: Converting a date/time value to a different format


SELECT order_id,
TO_CHAR(order_date, 'Month DD, YYYY') AS formatted_date
FROM customer_orders;
Date/time functions:
PostgreSQL provides a wide range of date/time functions that allow you to work
with and manipulate date/time data. These functions can help you extract various
elements of a date/time, convert between different formats, calculate time
durations, and much more.
1. AGE
2. CURRENT_DATE
3. CURRENT_TIME
4. CURRENT_TIMESTAMP
5. EXTRACT
6. LOCALTIME
7. LOCALTIMESTAMP
8. DATE_PART

AGE Function:
The AGE function in PostgreSQL calculates the difference between two dates. It
takes two arguments: the earlier date and the later date. The function returns the
difference between the two dates as an interval.
Suppose we want to calculate a person's age in years, given their date of birth.
SELECT AGE(CURRENT_DATE, '1995-07-20');
This will return the age of the person as an interval. We can extract the number of
years from this interval by using the EXTRACT function
SELECT EXTRACT(YEAR FROM AGE(CURRENT_DATE, '1995-07-20' ));

CURRENT_DATE Function:
The CURRENT_DATE function in PostgreSQL is used to get the current date. It
returns the current date as a date data type.
SELECT CURRENT_DATE;

CURRENT_TIME Function:

The CURRENT_TIME function in PostgreSQL is used to get the current time. It
returns the current time as a time data type.
SELECT CURRENT_TIME;

CURRENT_TIMESTAMP Function:
The CURRENT_TIMESTAMP function in PostgreSQL is used to get the current date
and time. It returns the current date and time as a timestamp data type.
SELECT CURRENT_TIMESTAMP;

EXTRACT Function:
The EXTRACT function in PostgreSQL is used to extract a specific element from a
date/time. It takes two arguments: the element to extract and the date/time value.
SELECT EXTRACT(MONTH FROM TIMESTAMP '2022-05-18');

LOCALTIME Function:
The LOCALTIME function in PostgreSQL is used to get the current local time. It
returns the current local time as a time data type.
SELECT LOCALTIME;

CURRENT_TIMESTAMP function:
This function returns the current date and time, including fractional seconds.
SELECT CURRENT_TIMESTAMP;

Similarly, the CURRENT_TIME function returns the current time without the date,
and the LOCALTIMESTAMP function returns the current timestamp without time
zone information.

AGE function- This function takes two timestamps as arguments and returns the
difference between them as an interval.
SELECT AGE('2022-10-19 16:24:31', '2022-10-18 12:00:00');

The EXTRACT function is another useful function for working with dates and times.
It allows you to extract a specific field from a timestamp or interval.
SELECT EXTRACT(YEAR FROM TIMESTAMP '2022-10-19 16:24:31');

DATE_PART function, which is similar to the EXTRACT function but with a slightly
different syntax.
SELECT DATE_PART('YEAR', TIMESTAMP '2022-10-19 16:24:31');

Formatting date/time data:


PostgreSQL provides various built-in functions that allow us to easily manipulate
and format our date/time data.

Understanding Date/Time Formats:


The most common formats are ISO 8601 and the traditional American format. ISO
8601 is a standardized format widely used in computing and recognized
internationally, while the traditional American format is more commonly used in
the United States.
ISO 8601: 2023-04-06T13:45:30Z

Traditional American Format: 04/06/2023 01:45:30 PM

Using TO_CHAR for Formatting:


The TO_CHAR function in PostgreSQL allows us to convert date/time data into a
string representation using a specific format.
TO_CHAR(timestamp, format)
The timestamp argument is the date/time value we want to format, and the format
argument is a string that specifies the desired output format.
Suppose we have a table called orders that contains a timestamp column indicating
when each order was placed. We want to display the date in the ISO 8601 format.
We can use the TO_CHAR function to achieve this:
SELECT TO_CHAR(timestamp, 'YYYY-MM-DD"T"HH24:MI:SS"Z"') AS iso_date
FROM orders;

Using EXTRACT for Extracting Specific Components:


The EXTRACT function in PostgreSQL allows us to extract specific components
from a date/time value.
EXTRACT(field FROM timestamp)
The field argument specifies the specific component we want to extract, such as
year, month, day, hour, minute, or second.
Suppose we have a table called events that contains a timestamp column indicating
when each event occurred. We want to extract the year from each timestamp
value. We can use the EXTRACT function to achieve this:
SELECT EXTRACT(YEAR FROM timestamp) AS event_year
FROM events;

Using DATE_TRUNC for Truncating Date/Time Values:


The DATE_TRUNC function in PostgreSQL allows us to truncate date/time values
to a specific unit, such as year, month, day, hour, minute, or second.
DATE_TRUNC('unit', timestamp)
The unit argument specifies the specific unit we want to truncate to, and the
timestamp argument is the date/time value we want to truncate. Suppose we have
a sales table containing a timestamp column indicating when each sale was made.
We want to truncate the timestamp values to the nearest hour. We can use the
DATE_TRUNC function to achieve this.
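A minimal sketch, assuming the sales table's timestamp column described above:
SELECT DATE_TRUNC('hour', timestamp) AS sale_hour
FROM sales;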

String manipulation functions (e.g. UPPER, LOWER, LEFT, RIGHT, etc.)


to convert text to uppercase and lowercase
UPPER(string)
LOWER(string)

Example 1: Converting text to uppercase using UPPER


SELECT UPPER('hello, world!') AS uppercase_text;
Example 2: Converting text to lowercase using LOWER
SELECT LOWER('HeLLo, wOrLd!') AS lowercase_text;
The LEFT and RIGHT functions allow us to extract a specified number of
characters from the beginning or end of a string:

LEFT(string, length)

RIGHT(string, length)
Example 3: Extracting characters from the beginning of a string using LEFT
SELECT LEFT('apple', 3) AS extracted_text;
The result will be the string value "app".
Example 4: Extracting characters from the end of a string using RIGHT
SELECT RIGHT('banana', 3) AS extracted_text;
The result will be the string value "ana".

CONCAT(string1, string2, ...)


Example 5: Concatenating two strings together using CONCAT
SELECT CONCAT('hello', 'world') AS concatenated_text;

Regular expressions in SQL for string operations


Some of the most commonly used symbols in regular expressions include:
 . - matches any single character
 * - matches zero or more occurrences of the previous character
 + - matches one or more occurrences of the previous character
 ? - matches zero or one occurrence of the previous character
 [] - matches any single character within the brackets
 ^ - matches the beginning of a string
 $ - matches the end of a string

The most commonly used functions are REGEXP_LIKE and REGEXP_REPLACE.


REGEXP_LIKE: This function is used to determine if a string matches a regular
expression pattern. It returns TRUE if the string matches the pattern, and FALSE
otherwise.

Example 1: Matching a string using REGEXP_LIKE


SELECT REGEXP_LIKE('hello', 'h.*o') AS match_result;
In this example, we are using REGEXP_LIKE to match the string 'hello' against the
pattern 'h.*o'. This pattern matches any string that starts with 'h' and ends with
'o'. The result will be TRUE, since the string 'hello' matches this pattern.

REGEXP_REPLACE: In PostgreSQL, REGEXP_REPLACE is a function that allows you
to perform regular expression pattern matching and replacement within a string.
It's used to search for a specified pattern in a string and replace all occurrences
of that pattern with a specified replacement string. This function is particularly
useful when you need to modify or clean up text data based on specific patterns.
REGEXP_REPLACE(input_str, reg_exp, replace_str [, flags])

Example 1: Use REGEXP_REPLACE Function to Arrange Name


SELECT REGEXP_REPLACE('Anas Kumar', '(.*) (.*)', '\2, \1');
In this example, the code takes the input string "Anas Kumar" and uses a regular
expression to match two parts: the first part "Anas" and the second part "Kumar".
The regular expression (.*) (.*) is split into two groups: (.*) captures the first part
and (.*) captures the second part. The replacement string '\2, \1' reorganizes these
parts, swapping their positions and adding a comma between them. So, the output
of the code will be "Kumar, Anas".

Example 2: Use REGEXP_REPLACE Function to Remove Alphabets


SELECT REGEXP_REPLACE('PostgreSQL51214Database', '[[:alpha:]]','','g');

In this example, we use the REGEXP_REPLACE function to manipulate a given
input string, which is "PostgreSQL51214Database". By employing a regular
expression '[[:alpha:]]' – designed to identify alphabetical characters – and an
empty replacement string, the code effectively removes all the letters from the
input. The 'g' flag ensures this removal happens globally throughout the string. As
a result, the output of the code will be "51214", as all alphabetical characters
have been eliminated, leaving only the numeric portion intact.

Example 3: Use REGEXP_REPLACE Function to Remove Digits


SELECT REGEXP_REPLACE('PostgreSQL51214Database', '[[:digit:]]','','g');
In this example, we use the REGEXP_REPLACE function to manipulate the given
input string "PostgreSQL51214Database". By utilizing the regular expression
'[[:digit:]]' – designed to identify numeric characters – and specifying an empty
replacement string, the code effectively removes all the digits from the input. The
'g' flag ensures this replacement happens globally throughout the string.
Consequently, the output of the code will be "PostgreSQLDatabase", as all
numeric characters have been taken out, leaving only the letters and non-numeric
characters.

Using CONCAT_WS to concatenate strings with a separator


The CONCAT_WS function concatenates two or more strings together, with a
specified separator between each string
CONCAT_WS(separator, string1, string2, ..., stringN)

Example 1: Concatenating strings with a separator


SELECT CONCAT_WS('-', '2022', '03', '29') AS date_string;
In this example, we are using the CONCAT_WS function to concatenate the strings
'2022', '03', and '29' together, with a dash ('-') separator between each string. The
result will be the string value "2022-03-29".

Example 2: Handling NULL values


SELECT CONCAT_WS('-', '2022', NULL, '29') AS date_string;
In this example, we are using the CONCAT_WS function to concatenate the strings
'2022', NULL, and '29' together, with a dash ('-') separator between each string.
Notice that there is a NULL value in the second parameter. The result will be the
string value "2022-29". The separator is only included between non-null values.

Example 3: Concatenating multiple columns


SELECT CONCAT_WS(' ', first_name, last_name) AS full_name
FROM employees;
In this example, we are using the CONCAT_WS function to concatenate the
first_name and last_name columns from the employees table, with a space (' ')
separator between each column. The result will be a column of concatenated full
names.

You are working as a data analyst at a video streaming company. The company
has a database that contains information about the movies available on their
platform, including the id, title and release date.
The database schema is as follows:
CREATE TABLE movies (
id INTEGER PRIMARY KEY,
title VARCHAR(255),
release_date DATE);

INSERT INTO movies (id, title, release_date)


VALUES
(1, 'The Shawshank Redemption', '1994-10-14'),
(2, 'The Godfather', '1972-03-24'),
(3, 'The Godfather: Part II', '1974-12-20'),
(4, 'The Dark Knight', '2008-07-18'),
(5, '12 Angry Men', '1957-04-10'),
(6, 'The Godfather: Part III', '1990-12-25'),
(7, 'Pulp Fiction', '1994-10-14'),
(8,'The Lord of the Rings: The Fellowship of the Ring','2001-12-19'),
(9, 'The Dark Knight Rises', '2012-07-20'),
(10, 'The Lion King', '1994-06-15'),
(11, 'Forrest Gump', '1994-07-06'),
(12, 'The Silence of the Lambs', '1991-02-14'),
(13, 'Jurassic Park', '1993-06-11'),
(14, 'Titanic', '1997-12-19'),
(15, 'The Matrix', '1999-03-31'),
(16, 'The Avengers', '2012-05-04'),
(17, 'La La Land', '2016-12-09'),
(18, 'Joker', '2019-10-04'),
(19, 'Inception', '2010-07-16'),
(20, 'The Social Network', '2010-10-01');

Write a query to find all movies that were released in the year 1972.
SELECT title, release_date
FROM movies
WHERE EXTRACT(year FROM release_date) = 1972;

Write a query to find all movies that were released in the month of April.
SELECT title, release_date
FROM movies
WHERE EXTRACT(month FROM release_date) = 4;

Write a query to find the title and release date of the oldest movie in the
database.
SELECT title, release_date
FROM movies
ORDER BY release_date ASC
LIMIT 1;

Write a query to find the title and release date of the newest movie in the
database.
SELECT title, release_date
FROM movies
ORDER BY release_date DESC
LIMIT 1;

Write a query to find the title and release date of all movies released in the 21st
century.
SELECT title, release_date
FROM movies
WHERE release_date >= '2000-01-01' AND release_date < '2100-01-01';

Write a query to find the title and release date of all movies that were released on
a Friday.
SELECT title, release_date
FROM movies
WHERE EXTRACT(dow FROM release_date) = 5;

Write a query to find the title and release date of all movies that have a title
containing the word 'King'.
SELECT title, release_date
FROM movies
WHERE title LIKE '%King%';

Write a query to find the title and release date of all movies that have a title
consisting of exactly 3 words.
SELECT title, release_date
FROM movies
WHERE length(regexp_replace(title, E'\\S+', '', 'g')) = 2;
Here regexp_replace strips every run of non-space characters, so only the separating spaces remain; a three-word title leaves exactly two of them.

Window Functions

A window function is a way to apply a calculation across a set of rows that are
related to the current row. Think of it as a "window" of rows that can move up or
down depending on your function. Common window functions include calculating
running totals or averages, finding the minimum or maximum value within a set of
rows, and ranking rows based on a specific column.
Window functions can be incredibly powerful and useful tools in SQL, allowing you
to perform complex calculations and analyses easily.

Syntax of a Window Function:


1. The function name: This is the name of the window function you want to use,
such as SUM, AVG, COUNT, etc.
2. The OVER clause: This specifies the window or subset of rows over which the
function should be calculated. The OVER clause is typically composed of
three parts:
a. PARTITION BY: This specifies the columns by which the data should be
partitioned, i.e., grouped by. The function will be calculated independently
for each partition.
b. ORDER BY: This specifies the column(s) by which the data should be
sorted within each partition. This determines the order in which the
function is calculated.
c. RANGE or ROWS: This specifies the window frame or subset of rows over
which the function should be calculated. RANGE is used to specify a logical
interval based on the values in the ORDER BY column(s), while ROWS is
used to specify a physical interval based on the actual row numbers.
3. The function argument: This specifies the column or expression that should be
used as the input to the function. This is optional for some functions, such
as COUNT.

Here is the general form of a window function call, followed by a breakdown of each part:
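SELECT column_1,
window_function(column_2) OVER ([PARTITION BY column_1] [ORDER BY column_3]) AS new_col
FROM table_name;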


 SELECT column_1 - This selects the values of column_1 from the table.
 window_function(column_2) - This applies a window function to the values of
column_2. A window function is a function that computes a value for each
row in a partition of the table defined by the OVER clause.
 OVER([PARTITION BY column_1] [ORDER BY column_3]) - This defines the
partition of the table over which the window function will operate. The
PARTITION BY clause divides the table into partitions based on the distinct
values of column_1. The ORDER BY clause determines the order of rows
within each partition based on the values of column_3.
 AS new_col - This renames the result of the window function as new_col.
 FROM table_name - This specifies the table from which the data will be
selected.

Ranking functions (e.g. ROW_NUMBER, RANK, DENSE_RANK, etc.)

We will be using the below SalesData table for examples:
ProductName Sales
Product A 100
Product B 250
Product C 100
Product D 400
Product E 150

ROW_NUMBER(): This function assigns a unique number to each row within a result
set. The numbering starts at 1 and increments by 1 for each subsequent row.
SELECT ROW_NUMBER() OVER (ORDER BY Sales DESC) AS Rank, ProductName, Sales
FROM SalesData;

RANK(): This function assigns a rank to each row within a result set based on the
ordering specified in the ORDER BY clause. If two or more rows have the same
values, they receive the same rank, and the next rank is skipped.
SELECT RANK() OVER (ORDER BY Sales ASC) AS Rank, ProductName, Sales
FROM SalesData;

DENSE_RANK(): This function assigns a rank to each row within a result set based
on the ordering specified in the ORDER BY clause. If two or more rows have the
same values, they receive the same rank and the next rank is not skipped.
SELECT DENSE_RANK() OVER (ORDER BY Sales ASC) AS Rank, ProductName, Sales
FROM SalesData;
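For intuition, run RANK() and DENSE_RANK() on the SalesData above in ascending order of Sales. The tied rows (Sales = 100) get the same rank either way, but only RANK() skips a value afterwards:
Sales RANK DENSE_RANK
100 1 1
100 1 1
150 3 2
250 4 3
400 5 4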

NTILE(): This function divides a result set into a specified number of groups and
assigns a group number to each row. For example, if you specify NTILE(4), the
result set is divided into four groups, and each row is assigned a group number
from 1 to 4.
SELECT NTILE(4) OVER (ORDER BY Sales DESC) AS Quartile, ProductName, Sales
FROM SalesData;

PERCENT_RANK(): This function calculates the relative rank of each row within a
result set. The values range from 0 to 1, with 0 representing the lowest rank and 1
representing the highest rank.
SELECT PERCENT_RANK() OVER (ORDER BY Sales DESC) AS PercentRank,
ProductName, Sales
FROM SalesData;

Aggregate functions using windows (e.g. SUM, AVG, MAX, MIN, etc.):
In SQL, window functions are used to perform calculations on a specific subset of
rows within a result set. Aggregate functions can be used with window functions
to calculate aggregate values on that subset of rows.

We will be using the below RegionSalesData table for examples :

Product Name Region Sales


Product A Region 1 100
Product A Region 2 200
Product A Region 3 150
Product B Region 1 50
Product B Region 2 100
Product B Region 3 75
Product C Region 1 75
Product C Region 2 125
Product C Region 3 100

SUM(): This function calculates the sum of a column or expression for each row
within a window.
SELECT SUM(Sales) OVER (PARTITION BY Region) AS RegionTotal, ProductName,
Sales
FROM RegionSalesData;
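With the RegionSalesData above, every row in a region carries that region's total (225 = 100 + 50 + 75 for Region 1, 425 for Region 2, 325 for Region 3):
RegionTotal ProductName Sales
225 Product A 100
225 Product B 50
225 Product C 75
425 Product A 200
425 Product B 100
425 Product C 125
325 Product A 150
325 Product B 75
325 Product C 100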

AVG(): This function calculates the average of a column or expression for each
row within a window.
SELECT AVG(Sales) OVER (PARTITION BY Region) AS RegionAvg, ProductName,
Sales
FROM RegionSalesData;

MAX(): This function returns the maximum value of a column or expression for
each row within a window.
SELECT MAX(Sales) OVER (PARTITION BY Region) AS RegionMax, ProductName,
Sales
FROM RegionSalesData;

MIN(): This function returns the minimum value of a column or expression for each
row within a window.
SELECT MIN(Sales) OVER (PARTITION BY Region) AS RegionMin, ProductName,
Sales
FROM RegionSalesData;

COUNT(): This function returns the number of rows within a window.


SELECT COUNT(*) OVER (PARTITION BY Region) AS RegionCount, ProductName,
Sales
FROM RegionSalesData;

Partitioning data for window functions:
In SQL, partitioning data is a technique used to divide the data into logical groups
or partitions so that window functions can be applied to each partition separately.
The PARTITION BY clause is used to partition the data based on one or more
columns.
We will be using the below data :
Product Name Region Sales Year
Product A Region 1 100 2022
Product A Region 2 200 2022
Product A Region 3 150 2023
Product B Region 1 50 2022
Product B Region 2 100 2022
Product B Region 3 75 2023
Product C Region 1 75 2022
Product C Region 2 125 2022
Product C Region 3 100 2023

SELECT ProductName, Sales,
SUM(Sales) OVER (PARTITION BY Region) AS RegionTotal
FROM RegionSalesData;

In this example, the data is partitioned by the Region column. The SUM() function
is applied to the Sales column for each partition separately, and the result is
returned in a new column called RegionTotal.

SELECT ProductName, Sales,
SUM(Sales) OVER (PARTITION BY Region, Year) AS RegionYearTotal
FROM RegionSalesData;

In this example, the data is partitioned by both the Region and Year columns. The
SUM() function is applied to the Sales column for each combination of Region and
Year, and the result is returned in a new column called RegionYearTotal.

SELECT ProductName, Sales,
SUM(Sales) OVER (PARTITION BY Region ORDER BY Year) AS RegionTotalByYear
FROM RegionSalesData;

In this example, the data is partitioned by the Region column and then ordered by
the Year column within each partition. The SUM() function is applied to the Sales
column for each partition in the specified order, and the result is returned in a
new column called RegionTotalByYear.

Understanding the difference between row-based and aggregate-based window functions:

Row-based window functions: Row-based window functions operate on a single row
at a time within a window frame. They return a value for each row in the result
set based on the values of the other rows in the same window frame. Examples of
row-based window functions include ROW_NUMBER(), RANK(), and
DENSE_RANK().
SELECT ProductName, Sales,
ROW_NUMBER() OVER (PARTITION BY Region ORDER BY Sales DESC) AS
RowNum
FROM RegionSalesData;

In this example, ROW_NUMBER() is a row-based window function that assigns a
unique number to each row based on the order of Sales values within each
partition of the Region column.

Aggregate-based window functions: Aggregate-based window functions perform
calculations across multiple rows within a window frame and return a single
value for each row in the result set. Examples of aggregate-based window
functions include SUM(), AVG(), and MAX().
SELECT Region, Year, Sales,
SUM(Sales) OVER (PARTITION BY Region ORDER BY Year) AS
RegionYearTotal
FROM RegionSalesData;

In this example, SUM() is an aggregate-based window function that calculates the
total sales within each partition of the Region column and orders the result by the
Year column within each partition.

We have a table called "sales" containing information about store sales
transactions. The table has the following columns:
 id (integer)
 date (date)
 customer_name (text)
 product_name (text)
 quantity (integer)
 price (numeric)
CREATE TABLE sales (
id SERIAL PRIMARY KEY,
date DATE,
customer_name TEXT,
product_name TEXT,
quantity INTEGER,
price NUMERIC);

INSERT INTO sales (date, customer_name, product_name, quantity, price)
VALUES
('2022-03-01', 'John', 'Shoes', 2, 100),
('2022-03-02', 'Mike', 'Shirt', 3, 50),
('2022-03-02', 'Mike', 'Shoes', 1, 150),
('2022-03-03', 'Jane', 'Hat', 1, 25),
('2022-03-04', 'John', 'Shirt', 2, 60),
('2022-03-05', 'Jane', 'Shoes', 1, 120),
('2022-03-06', 'Mike', 'Shoes', 2, 150),
('2022-03-06', 'Jane', 'Shirt', 1, 30),
('2022-03-07', 'John', 'Hat', 3, 15),
('2022-03-08', 'Mike', 'Hat', 1, 20),
('2022-03-09', 'Jane', 'Shoes', 2, 120),
('2022-03-09', 'Mike', 'Shirt', 1, 50);

Display the total sales amount for each customer.


SELECT customer_name, SUM(price * quantity) AS total_sales_amount
FROM sales
GROUP BY customer_name

Rank the customers based on their total sales amount.


SELECT customer_name,
RANK() OVER(ORDER BY SUM(price * quantity) DESC) AS sales_rank
FROM sales
GROUP BY customer_name;
We use the RANK function in conjunction with the OVER clause to rank the
customers based on their total sales amount. The ORDER BY clause is used to
specify the ordering of the ranking, in this case descending order based on the
total sales amount. The GROUP BY clause groups the rows by customer name, so
we can compute each customer's total sales amount.

Display the daily sales amount for each product using a window function.
SELECT date, product_name,
SUM(price * quantity) OVER(PARTITION BY date, product_name) AS
daily_sales_amount
FROM sales;

Display the top-selling product for each customer.


SELECT customer_name, product_name, total_sales_amount
FROM (
SELECT customer_name, product_name, SUM(price * quantity) AS
total_sales_amount,
RANK() OVER(PARTITION BY customer_name ORDER BY SUM(price *
quantity) DESC) AS rank
FROM sales
GROUP BY customer_name, product_name
) t
WHERE rank = 1;
We first compute the total sales amount for each customer and product using the
SUM aggregate function in a subquery. Then we use the RANK window function
with the PARTITION BY clause to rank the products for each customer based on
their total sales amount. Finally, we select only the top-ranked product for each
customer by filtering the rows where the rank is equal to 1.

Display the average quantity of each product sold per day using a window
function.
SELECT date, product_name, ROUND(AVG(quantity) OVER(PARTITION BY
product_name ORDER BY date), 2) AS avg_quantity_per_day
FROM sales
ORDER BY date, product_name;
We use the AVG aggregate function as a window function to compute the average
quantity of each product sold per day. The PARTITION BY clause is used to partition
the data by product name, and the ORDER BY clause is used to order the rows by
sale date. This means that the average is computed over all preceding rows for
each product, up to and including the current row, based on the sale date. The
output includes the product name, sale date, and average quantity of each
product sold per day. The rows are ordered by product name and sale date.

Display the products that were sold above their average quantity for each day.
WITH daily_avg AS (
SELECT date, ROUND(AVG(quantity), 2) AS avg_quantity
FROM sales
GROUP BY date)
SELECT s.product_name, s.date, s.quantity, daily_avg.avg_quantity
FROM sales s
JOIN daily_avg ON s.date = daily_avg.date
WHERE s.quantity > daily_avg.avg_quantity
ORDER BY s.date, s.product_name;
We first compute the average quantity of sales for each day using the AVG
aggregate function and the GROUP BY clause in a common table expression (CTE)
called daily_avg. Then we join the sales table with the daily_avg CTE on the date
column, and compare the quantity of each product sold on each day to the daily
average using a WHERE clause. The output includes the product name, sale date,
quantity, and daily average for each product that was sold above its average
quantity on each day. The rows are ordered by sale date and product name.

Chapter 2: Data Preparation with SQL


Complex Queries using CTE and Pivoting

Common Table Expressions (CTEs) are a powerful tool in SQL that allows you to
define temporary result sets that can be referenced multiple times within a single
query. In this way, CTEs provide a more readable and manageable way to handle
complex queries that might otherwise be difficult to write or understand. One
common use case for CTEs is for dealing with hierarchical data, such as
organizational charts or tree structures. Recursive CTEs can be used to traverse
the hierarchy and retrieve information at different levels of the tree.
In addition to simplifying queries for hierarchical data, CTEs can also be used to
simplify complex queries in general. By breaking down a complicated query into
smaller, more manageable pieces, CTEs can help you write cleaner and more
efficient code. CTEs can also be combined with window functions and subqueries
to enhance their functionality further. By using CTEs in conjunction with these
other features, you can create even more powerful queries that can handle a wide
range of data analysis tasks.
However, it's important to note that CTEs can also have performance
implications. While they can be a powerful tool for simplifying queries, they can
also be resource-intensive and slow down your database if used incorrectly.

Common Table Expressions (CTE):


CTEs are temporary result sets that are defined within the execution of a single
SQL statement. They allow you to simplify complex queries by breaking them
down into smaller, more manageable pieces.
WITH cte_name (column_list) AS (
CTE_query_definition )
statement;
Now, let's examine the syntax provided above.
 Initially, we need to assign a name to the CTE along with an optional list of
columns.
 Next, we define a query that generates the desired result set within the
WITH clause. If not explicitly specified, the select list of the
CTE_query_definition will serve as the column list for the CTE.
 Lastly, the CTE can be utilized as a table or view within a statement, which
can be a SELECT, INSERT, UPDATE, or DELETE operation.

Here's the SQL code:


WITH dept_total_salary AS (
SELECT department_id, SUM(salary) AS total_salary
FROM employees
GROUP BY department_id)
SELECT departments.name, dept_total_salary.total_salary /
COUNT(employees.id) AS avg_salary
FROM departments
JOIN employees ON departments.department_id = employees.department_id
JOIN dept_total_salary ON departments.department_id =
dept_total_salary.department_id
GROUP BY departments.name,dept_total_salary.total_salary

We calculated the average salary of employees in each department by first
creating a Common Table Expression (CTE) named "dept_total_salary" that
calculates the total salary of each department using the SUM function on the
"salary" column of the "employees" table, grouped by the "department_id"
column.

Then, the main query joins the "departments" and "employees" tables on their
"department_id" columns, and also joins the "dept_total_salary" CTE on its
"department_id" column. The query groups the results by the "departments.name"
and "dept_total_salary.total_salary" columns.
Finally, the query calculates the average salary of employees in each department
by dividing the total salary of the department by the number of employees in the
department using the COUNT function on the "employees.id" column. The resulting
average salary is aliased as "avg_salary".

Recursive CTEs for Hierarchical Data


Recursive CTEs (Common Table Expressions) are a powerful feature in SQL that
allow you to query hierarchical data structures, such as trees or graphs.
WITH RECURSIVE cte_name AS(
CTE_query_definition -- non-recursive term
UNION [ALL]
CTE_query_definition -- recursive term
) SELECT * FROM cte_name;

A recursive common table expression (CTE) consists of three components:


1. Non-recursive term: This term refers to a CTE query definition that serves as
the initial result set for the CTE structure.
2. Recursive term: The recursive term comprises one or more CTE query
definitions that are combined with the non-recursive term using the UNION
or UNION ALL operator. The recursive term refers to the CTE name itself.
3. Termination check: The termination check determines when the recursion
should stop. It occurs when no rows are returned from the previous
iteration.
When executing a recursive CTE in PostgreSQL, the following steps are followed:
1. The non-recursive term is executed to generate the base result set (R0).
2. The recursive term is executed, taking Ri as input, to produce the output
result set Ri+1.
3. Step 2 is repeated until an empty set is returned, serving as the
termination check.
4. Finally, the final result set is returned, which is a UNION or UNION ALL of
the result sets R0, R1, ..., Rn.

Here's an example of how you can use a recursive CTE to query a tree structure:
Let's say you have a table called "categories" that has the following columns:
category_id, name, and parent_id. The parent_id column indicates the ID of the
parent category for each category. A category with a NULL parent_id is a root
category.
category_id name parent_id
1 Electronics NULL
2 Computers 1
3 Laptops 2
4 Desktops 2
5 Mobiles 1
6 Android 5
7 iOS 5

In this dataset, "Electronics" are root categories, as they have a NULL value in
the "parent_id" column. "Computers" and "Mobiles" are child categories of
"Electronics", with a "parent_id" of 1. "Laptops" and "Desktops" are child
categories of "Computers", with a "parent_id" of 2. And "Android" and "iOS" are
child categories of "Mobiles", with a "parent_id" of 5.

Below code creates the table “categories” and inserts data :

-- Create the categories table
CREATE TABLE categories (
category_id SERIAL PRIMARY KEY,
name VARCHAR(255),
parent_id INTEGER);
-- Insert data into the categories table
INSERT INTO categories (category_id, name, parent_id)
VALUES
(1, 'Electronics', NULL),
(2, 'Computers', 1),
(3, 'Laptops', 2),
(4, 'Desktops', 2),
(5, 'Mobiles', 1),
(6, 'Android', 5),
(7, 'iOS', 5);

Here's how you can use a recursive CTE to query all categories and their
subcategories:

WITH RECURSIVE category_tree(category_id, name, parent_id, level) AS (
-- Base case: select all root categories
SELECT category_id, name, parent_id, 0 FROM categories WHERE parent_id IS
NULL
UNION ALL
-- Recursive case: select all subcategories
SELECT c.category_id, c.name, c.parent_id, ct.level + 1
FROM categories c
JOIN category_tree ct ON c.parent_id = ct.category_id)
SELECT * FROM category_tree;

In this example, we define a CTE called "category_tree" that selects all root
categories (i.e., categories with a NULL parent_id) in the base case. Then, in the
recursive case, we join the "categories" table with the "category_tree" CTE on the
parent_id column to select all subcategories. We also increment the level column
by 1 for each recursive step.
Finally, we select all columns from the "category_tree" CTE to get the full tree
structure with all categories and their levels.
With recursive CTEs, you can easily query and manipulate hierarchical data
structures in SQL.
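With the sample data above, the query returns the following rows (row order may vary):
category_id name parent_id level
1 Electronics NULL 0
2 Computers 1 1
5 Mobiles 1 1
3 Laptops 2 2
4 Desktops 2 2
6 Android 5 2
7 iOS 5 2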

Combining CTEs with Window Functions and Subqueries:


Common Table Expressions (CTEs) can be combined with other SQL features such
as window functions and subqueries to solve complex analytical problems. Here's
an example of how you can combine CTEs with window functions and subqueries:
Let's say you have a table called "sales" that contains data on sales transactions,
including the date, product, and revenue. You want to calculate the cumulative
revenue for each product, as well as the percentage of total revenue that each
product represents.
date product revenue
2022-01-01 Product A 1000
2022-01-02 Product B 2000
2022-01-02 Product C 3000
2022-01-03 Product A 1500
2022-01-03 Product B 2500
2022-01-03 Product C 3500
2022-01-04 Product A 1200
2022-01-04 Product C 3200
2022-01-05 Product B 1800
2022-01-05 Product C 2800

This dataset contains sales data for different products on different dates. The
"date" column represents the date of the sale, the "product" column represents
the product sold, and the "revenue" column represents the revenue generated
from the sale.
-- Create the sales table
CREATE TABLE sales (
date DATE,
product VARCHAR(255),
revenue INTEGER);
-- Insert data into the sales table
INSERT INTO sales (date, product, revenue)
VALUES
('2022-01-01', 'Product A', 1000),
('2022-01-02', 'Product B', 2000),
('2022-01-02', 'Product C', 3000),
('2022-01-03', 'Product A', 1500),
('2022-01-03', 'Product B', 2500),
('2022-01-03', 'Product C', 3500),
('2022-01-04', 'Product A', 1200),
('2022-01-04', 'Product C', 3200),
('2022-01-05', 'Product B', 1800),
('2022-01-05', 'Product C', 2800);

Here's how you can use CTEs, window functions, and subqueries to solve this
problem:
WITH product_revenue AS (
SELECT
product,
date,
SUM(revenue) AS revenue
FROM sales
GROUP BY product, date),
product_cumulative_revenue AS (
SELECT
product,
date,
revenue,
SUM(revenue) OVER (PARTITION BY product ORDER BY date) AS
cumulative_revenue
FROM product_revenue),
total_revenue AS (
SELECT SUM(revenue) AS total_revenue
FROM product_cumulative_revenue)
SELECT
product,
date,
revenue,
cumulative_revenue,
ROUND(revenue::numeric / total_revenue * 100, 2) AS percentage_of_total
FROM product_cumulative_revenue
JOIN total_revenue ON 1 = 1;

1. The code uses a common table expression (CTE) named "product_revenue"
to calculate the sum of revenue for each product and date combination
from the "sales" table.
2. Another CTE named "product_cumulative_revenue" is defined to calculate
the cumulative revenue for each product over time. It uses the previously
defined "product_revenue" CTE and applies the window function
SUM(revenue) OVER (PARTITION BY product ORDER BY date) to calculate the
running total of revenue for each product.
3. The "total_revenue" CTE calculates the sum of revenue across all products
and dates from the "product_cumulative_revenue" CTE.
4. The final SELECT statement retrieves the product, date, revenue,
cumulative revenue, and percentage of total revenue for each product and
date. It joins the "product_cumulative_revenue" CTE with the
"total_revenue" CTE on a dummy condition "1 = 1" to ensure all rows are
returned.
5. Within the final SELECT statement, the percentage of total revenue is
calculated by dividing the revenue for each row by the total revenue (from
the "total_revenue" CTE) and multiplying by 100. The cast to numeric avoids
integer division, which would otherwise truncate every percentage to 0.

Understanding the Performance Implications of CTEs:


One of the main advantages of using CTEs is that they can help break down
complex queries into smaller, more manageable parts. This can make queries
easier to understand and maintain, especially if they involve multiple joins and
subqueries. Additionally, CTEs can help optimize query performance by allowing
the database engine to cache the results of each CTE and reuse them in
subsequent queries.
On the other hand, using CTEs can sometimes have negative performance
implications. Because CTEs may be materialized as temporary result sets, they can
incur additional overhead and processing time compared to using subqueries or
views. Additionally, CTEs can sometimes lead to less efficient execution plans,
particularly if they are used in conjunction with other query constructs like
correlated subqueries.
To minimize the performance impact of CTEs, it's important to carefully consider
their use and ensure that they are optimized for your specific query needs. This
may involve experimenting with different query constructs, rearranging the order
of CTEs and other query constructs, or optimizing the underlying data structures
to minimize the amount of processing required. Ultimately, the performance
implications of CTEs will depend on the specific query and database environment,
so it's important to carefully evaluate their use in each case.

You are appointed as a data manager in a DVD rental company, and you have
access to their dvdrental.tar file, which contains information about
films, actors, languages, staff, rentals, etc.

Make a list of the average rental rate of films with respect to their language.
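A sketch of one possible answer, assuming the standard dvdrental sample schema (film.rental_rate, film.language_id, language.name):
SELECT l.name AS language, ROUND(AVG(f.rental_rate), 2) AS avg_rental_rate
FROM film f
JOIN language l ON f.language_id = l.language_id
GROUP BY l.name;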

Calculate the average rental duration of films with respect to their language.
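Similarly, a sketch using film.rental_duration from the same assumed schema:
SELECT l.name AS language, ROUND(AVG(f.rental_duration), 2) AS avg_rental_duration
FROM film f
JOIN language l ON f.language_id = l.language_id
GROUP BY l.name;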

The boss wants a list of cumulative payments made by each customer. The output should
contain customer id, payment date, amount, and cumulative payment.
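A window-function sketch over the payment table (columns assumed from the dvdrental schema):
SELECT customer_id, payment_date, amount,
SUM(amount) OVER (PARTITION BY customer_id ORDER BY payment_date) AS cumulative_payment
FROM payment;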

Now return the percentage of total payments made by each customer.
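One possible sketch, dividing each customer's total by the grand total using a window over the grouped result:
SELECT customer_id,
ROUND(SUM(amount) / SUM(SUM(amount)) OVER () * 100, 2) AS pct_of_total_payments
FROM payment
GROUP BY customer_id;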

Problem Statement:
You are working as a data analyst for a retail company that operates in multiple
regions. The company has a database that contains two tables: "Sales" and
"Regions". The "Sales" table stores information about the sales transactions,
including the product name, region, and sales quantity. The "Regions" table
contains the mapping of region names to region IDs.

Sales Table:
| ProductName | RegionID | SalesQuantity |
|-------------|----------|---------------|
| ProductA | 1 | 10 |
| ProductA | 2 | 15 |
| ProductA | 3 | 20 |
| ProductB | 1 | 12 |
| ProductB | 2 | 18 |
| ProductB | 3 | 8 |
| ProductC | 1 | 5 |
| ProductC | 2 | 10 |
| ProductC | 3 | 15 |

Regions Table:
| RegionID | RegionName |
|----------|------------|
| 1 | Region1 |
| 2 | Region2 |
| 3 | Region3 |

Products Table:
| ProductID | ProductName |
|-----------|-------------|
| 1 | ProductA |
| 2 | ProductB |
| 3 | ProductC |

Your task is to write a SQL query that utilizes Common Table Expressions
(CTEs) and pivoting techniques to generate a consolidated sales report for a
specific time period. The report should display the total sales quantity for
each product, with columns representing different regions and rows
representing different products.

WITH SalesData AS (
SELECT s.ProductName, r.RegionName, s.SalesQuantity
FROM Sales s
JOIN Regions r ON s.RegionID = r.RegionID)
SELECT *
FROM SalesData
PIVOT (
SUM(SalesQuantity)
FOR RegionName IN ('Region1', 'Region2', 'Region3')
) AS PivotTable;
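Note that the PIVOT clause above is SQL Server/Oracle syntax; PostgreSQL has no PIVOT keyword. In PostgreSQL, the same report can be produced with conditional aggregation, as in this sketch:
WITH SalesData AS (
SELECT s.ProductName, r.RegionName, s.SalesQuantity
FROM Sales s
JOIN Regions r ON s.RegionID = r.RegionID)
SELECT ProductName,
SUM(CASE WHEN RegionName = 'Region1' THEN SalesQuantity ELSE 0 END) AS Region1,
SUM(CASE WHEN RegionName = 'Region2' THEN SalesQuantity ELSE 0 END) AS Region2,
SUM(CASE WHEN RegionName = 'Region3' THEN SalesQuantity ELSE 0 END) AS Region3
FROM SalesData
GROUP BY ProductName
ORDER BY ProductName;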

 CTEs provide a way to define temporary result sets that can be referenced multiple times within a single query, allowing for more readable and manageable code.
 Recursive CTEs are particularly useful for working with hierarchical data structures, allowing for easier traversal of complex tree structures, but it's important to understand their limitations and optimize them for performance.
 CTEs can make SQL code more readable and maintainable by breaking complex queries down into smaller, more manageable pieces.
 Combining CTEs with window functions and subqueries enhances their functionality and allows for even more complex data analysis tasks and more powerful query results.
 It's important to understand the potential performance implications and trade-offs of using CTEs and to use them judiciously to avoid negative impacts on database performance.
 Overall, learning how to use CTEs effectively can help developers write cleaner, more efficient SQL code that is easier to read and maintain over time.

Data Exploration With SQL and Python

Introduction to SQLite in Python:
Imagine you're working on a tight schedule and need a fast, reliable, and easy-to-use data storage
solution. MySQL or PostgreSQL come to mind, but wait, have you heard of SQLite? This embedded
database library, written in C, is a game-changer. Unlike other database technologies, SQLite is
integrated into your program, making it a serverless data storage solution. And the best part? All data
is stored in a single file, conventionally with a .db extension. But that's not all: SQLite's concurrent access
feature allows multiple processes or threads to access the same database, making it ideal for mobile
operating systems such as Android and iOS. So why juggle large CSV files when you can consolidate all
your data into a single SQLite database? And storing your application's setup data in SQLite can be
around 35% faster than a file-based solution such as a configuration file.

Python SQLite – Connecting to Database:

import sqlite3
connection = sqlite3.connect("aquarium.db")

This easy-to-use module allows Python programs to establish a connection to an SQLite database with
a single line of code. And if your database doesn't yet exist, don't worry—sqlite3.connect() can create
it on the fly. To check whether our connection object was successfully constructed, run the following
command.

print(connection.total_changes)

The total number of database rows that the connection has altered is represented
by connection.total_changes. A value of 0 is correct here, because no SQL operations have been run yet.
If you decide at any point that you'd like to restart the tutorial, you can delete the aquarium.db
file from your computer.
Note: By supplying the special string ":memory:" to sqlite3.connect, it is also possible to connect to
an SQLite database that exists only in memory (and not in a file).
Example: sqlite3.connect(":memory:"). When your Python program ends, a ":memory:" SQLite
database will vanish. This can be useful if you want to use SQLite as a temporary sandbox for testing
purposes and don't need to keep any data around after your program terminates.

SQLite Datatypes and its Corresponding Python Types:


SQLite is a library written in the C programming language that offers a serverless and portable SQL
database engine. It reads from and writes to a disc because of its file-based design. Since SQLite is a
zero-configuration database, there is no requirement for installation or setup before using it. The
sqlite3 module has been included in Python's standard library by default since Python 2.5.

Storage Class in SQLite


A group of related data types might be called a storage class. SQLite offers the following storage
classes:
Storage Class Value Stored
NULL NULL
BLOB (Binary Large Object) Data is stored exactly the way it was input, generally in binary form
INTEGER Signed Integer(1,2,3,4,5 or 8 bytes depending on magnitude)
REAL Floating Point value(8 byte IEEE floating point numbers)
TEXT TEXT string
Corresponding Python Datatypes

Storage Class Python Datatype
NULL None
INTEGER int
REAL float
TEXT str
BLOB bytes
In Python, the type() function can be used to determine a value's class. The example below uses
type() to print the class of each value we read back from a database.
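A minimal sketch, using an in-memory database, that demonstrates the mapping:
import sqlite3

conn = sqlite3.connect(":memory:")
# one column per storage class; the untyped column "e" will hold NULL
conn.execute("CREATE TABLE t (a INTEGER, b REAL, c TEXT, d BLOB, e)")
conn.execute("INSERT INTO t VALUES (1, 2.5, 'hi', x'00ff', NULL)")
row = conn.execute("SELECT * FROM t").fetchone()
for value in row:
    print(value, type(value))  # int, float, str, bytes, NoneType
conn.close()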

Python SQLite Queries:


Python has support for several databases because it is a high-level language. Without writing raw
queries in the terminal or shell of that specific database, we can connect to and execute queries for
that database using Python; we only need that database installed on our machine. The SQLite
database is integrated with Python via the SQLite3 module. It offers a basic and easy-to-use interface
for communicating with SQLite databases and is a standardized Python DBI API 2.0.

Cursor Object:
It is an object used to establish a connection to run SQL queries. It serves as middleware between the
SQL query and the SQLite database connection. After connecting to an SQLite database, it is
generated.
Syntax: cursor_object = connection_object.execute("sql query")

Example: Writing records into the hotel table using Python code to establish a hotel data
database.

# importing sqlite3 module
import sqlite3
# create a connection object to connect
# to the hotel_data database
connection = sqlite3.connect('hotel_data.db')
# query to create a table named hotel
connection.execute(''' CREATE TABLE hotel
(FIND INT PRIMARY KEY NOT NULL,
FNAME TEXT NOT NULL,
COST INT NOT NULL,
WEIGHT INT);
''')
# insert query to insert food details in
# the above table
connection.execute("INSERT INTO hotel VALUES (1, 'cakes',800,10 )")
connection.execute("INSERT INTO hotel VALUES (2, 'biscuits',100,20 )")
connection.execute("INSERT INTO hotel VALUES (3, 'chocos',1000,30 )")
print("All data in food table\n")
# create a cursor object for the select query
cursor = connection.execute("SELECT * from hotel ")
# display all data from hotel table
for row in cursor:
print(row)
All data in food table
(1, 'cakes', 800, 10)
(2, 'biscuits', 100, 20)
(3, 'chocos', 1000, 30)

Let’s explore using the sqlite3 module within a Python application to create tables, insert data and
access tables in an SQLite database.
Create Table
Syntax:
CREATE TABLE database_name.table_name(
column1 datatype PRIMARY KEY(one or more columns),
column2 datatype,
column3 datatype,
...
columnN datatype);

Insert Data
Now, let’s talk about utilizing the sqlite3 module with Python to insert data into a table in a SQLite
database. To add a new row to a table, use the SQL INSERT INTO statement. The INSERT
INTO statement can be used in one of two ways to insert rows:
Only values
In the first approach, the column names are omitted and only the values of the data to be entered
are specified.
Syntax:
INSERT INTO table_name VALUES (value1, value2, value3,…);
table_name: name of the table.
value1, value2,.. : value of first column, second column,… for the new record
Column names and values both
In the second approach, we will specify the columns we wish to fill as well as the values that go with
each one, as shown below:
INSERT INTO table_name (column1, column2, column3,..) VALUES ( value1, value2, value3,..);
table_name: name of the table.
column1: name of first column, second column …
value1, value2, value3 : value of first column, second column,… for the new record
Select Data from Table
This statement returns the data from the table and is used to obtain data from an SQLite table.
The syntax for a select statement in SQLite is:
SELECT * FROM table_name;
-- * means all the columns from the table
-- To select specific columns, replace * with the column name or column names:
SELECT column1, column2, columnN FROM table_name;

Example: Let us create a table called STUDENTS, insert values into it, and then use the select
statement to retrieve the data.
# Import module
import sqlite3
# Connecting to sqlite
conn = sqlite3.connect('almab.db')
# Creating a cursor object using the
# cursor() method
cursor = conn.cursor()
# Creating table
table ="""CREATE TABLE STUDENTS(NAME VARCHAR(255), CLASS VARCHAR(255),
SECTION VARCHAR(255));"""
cursor.execute(table)
# Queries to INSERT records.
cursor.execute(
'''INSERT INTO STUDENTS (CLASS, SECTION, NAME) VALUES ('7th', 'A',
'Raju')''')
cursor.execute(
'''INSERT INTO STUDENTS (SECTION, NAME, CLASS) VALUES ('B', 'Shyam',
'8th')''')
cursor.execute(
'''INSERT INTO STUDENTS (NAME, CLASS, SECTION ) VALUES ('Baburao', '9th',
'C')''')
cursor.execute(
'''INSERT INTO STUDENTS (NAME, CLASS, SECTION ) VALUES ('Ajith', '9th',
'C')''')
cursor.execute(
'''INSERT INTO STUDENTS (NAME, CLASS, SECTION ) VALUES ('Karnan', '8th',
'B')''')
cursor.execute(
'''INSERT INTO STUDENTS (NAME, CLASS, SECTION ) VALUES ('Abhishek',
'10th', 'A')''')
cursor.execute(
'''INSERT INTO STUDENTS (NAME, CLASS, SECTION ) VALUES ('Kishore', '9th',
'C')''')
# Display data inserted
print("Data Inserted in the table: ")
# * is used to refer to all columns
data=cursor.execute('''SELECT * FROM STUDENTS''')
for row in data:
print(row)
# Commit your changes in
# the database
conn.commit()
# Closing the connection
conn.close()
Data Inserted in the table:
('Raju', '7th', 'A')
('Shyam', '8th', 'B')
('Baburao', '9th', 'C')
('Ajith', '9th', 'C')
('Karnan', '8th', 'B')
('Abhishek', '10th', 'A')
('Kishore', '9th', 'C')
Retrieving data using Python
Any database operation entails retrieving some valuable data from the database. You can retrieve
data from SQLite using the fetch methods offered by the sqlite3 Python module.
Three methods are available through the sqlite3.Cursor class: fetchall(), fetchmany(), and fetchone().
 The fetchall() method pulls up every row in a query's result set and returns it as a list of tuples. (If we
perform this after retrieving a few rows, it returns the remaining ones.)
 The fetchone() method fetches the next row in a query's result set and returns it as a tuple.
 Similar to fetchone(), fetchmany() obtains the next group of rows from a query's result set rather than
just one item.

WHERE Clause in SQLite using Python:


The WHERE clause in SQL/SQLite allows us to specify particular conditions that must be met
while retrieving data from the database, narrowing down the scope of our results. The WHERE
clause can be used to retrieve, update, or delete a specific set of data. If your database tables don't
have values matching the condition, you simply won't receive any results.

WHERE Clause in SQL:

SELECT column_1, column_2, ..., column_N
FROM table_name
WHERE [search_condition]
-- In [search_condition] you can use comparison or logical operators to specify conditions,
-- for example: =, >, <, !=, LIKE, NOT, etc.
Now let us retrieve data from the STUDENTS table using the WHERE clause.
# Example 1 : To retrieve data whose class is 7th
import sqlite3
connection = sqlite3.connect('almab.db')
cursor = connection.cursor()
# WHERE CLAUSE TO RETRIEVE DATA
cursor.execute("SELECT * FROM STUDENTS WHERE Class = '7th'")
# printing the cursor data
print(cursor.fetchall())
connection.commit()
connection.close()

[('Raju', '7th', 'A')]


# Example 2 : Updating a student's information whose Student Section is B.
import sqlite3
connection = sqlite3.connect('almab.db')
cursor = connection.cursor()
# WHERE CLAUSE TO UPDATE DATA
cursor.execute("UPDATE STUDENTS SET Class ='10th' WHERE Section = 'B'")
# printing the cursor data
cursor.execute("SELECT * from STUDENTS")
print(cursor.fetchall())
connection.commit()
connection.close()

[('Raju', '7th', 'A'), ('Shyam', '10th', 'B'), ('Baburao', '9th', 'C'),
('Ajith', '9th', 'C'), ('Karnan', '10th', 'B'), ('Abhishek', '10th', 'A'),
('Kishore', '9th', 'C')]
# Example 3 : To delete a student's data whose name is Baburao.
connection = sqlite3.connect('almab.db')
cursor = connection.cursor()
# WHERE CLAUSE TO DELETE DATA
cursor.execute("DELETE from STUDENTS WHERE Name = 'Baburao'")
#printing the cursor data
cursor.execute("SELECT * from STUDENTS")
print(cursor.fetchall())
connection.commit()
connection.close()
[('Raju', '7th', 'A'), ('Shyam', '10th', 'B'), ('Ajith', '9th', 'C'),
('Karnan', '10th', 'B'), ('Abhishek', '10th', 'A'), ('Kishore', '9th',
'C')]

SQLite via Pandas :


Using both Pandas and SQLite is a nice option. With the help of Pandas' read_sql_query method, you
can obtain the data as a Pandas data frame. From there, manipulating the data in Pandas will be
simpler. Once more, we must first create a connection to our database. The output can then be saved
as a data frame called df_students using pd.read_sql_query.
# import libraries
import sqlite3
import pandas as pd
con = sqlite3.connect('almab.db')
df_students = pd.read_sql_query('select * from STUDENTS', con)
df_students
NAME CLASS SECTION
0 Raju 7th A
1 Shyam 10th B
2 Ajith 9th C
3 Karnan 10th B
4 Abhishek 10th A
5 Kishore 9th C

Let's imagine, for instance, that Kishore (the last row of the table) shouldn't be in the group and
that we, therefore, need to remove him.
df_new = df_students[:-1]
df_new

NAME CLASS SECTION
0 Raju 7th A
1 Shyam 10th B
2 Ajith 9th C
3 Karnan 10th B
4 Abhishek 10th A
We can replace our previous STUDENTS table by writing our new data frame back to SQLite. We use
the to_sql() method on the new data frame to write to SQLite.

df_new.to_sql("STUDENTS", con, if_exists="replace")


5
In this call, we pass the following three parameters:
 The SQL table's name
 The connection to the database
 What to do if the table already exists: "replace" drops the original table first, "fail" raises a
ValueError, and "append" inserts the data as new rows.
We can do another query on the table to see if Kishore has been taken out to make sure everything
went as planned. It has!
pd.read_sql_query('select * from STUDENTS', con)

NAME CLASS SECTION
0 Raju 7th A
1 Shyam 10th B
2 Ajith 9th C
3 Karnan 10th B
4 Abhishek 10th A
Don’t forget to close the connection once you’re done.
con.close()

PandaSQL:
pandasql is a Python library that allows users to execute SQL queries on Pandas data frames. It provides a
SQL-like interface for manipulating data frames, which is convenient when you already know SQL. In this
section, we will explore the benefits and drawbacks of using pandasql, how to set up an environment for
the library, and how to perform basic SQL queries on realistic data.

Benefits of pandasql: 👍🐼
1. Familiar Interface: If you are already familiar with SQL, using pandasql should be easy for you, as
it provides a similar interface for performing operations on data frames.
2. SQL on DataFrames: Under the hood, pandasql loads your data frames into an in-memory SQLite
database, so joins, filters, and aggregations can be expressed in plain SQL.
3. Data Manipulation: pandasql provides a variety of SQL constructs for manipulating data frames,
such as sorting, filtering, and grouping.

Drawbacks of pandasql: 👎🤔
1. Slower than Native Pandas Operations: pandasql is generally slower than native Pandas
operations, since each query copies the data into SQLite first.
2. SQL Syntax Limitations: pandasql does not support all SQL syntax, which means some complex
queries may not be possible.

Creating the Environment:

To start using PandasSQL, we need to install it using pip. Open your terminal and enter the following command:
pip install pandasql
pip install SQLAlchemy==1.4.46
from pandasql import sqldf
import pandas as pd

df = pd.DataFrame({
'Name': ['John', 'Jane', 'Bob', 'Mary'],
'Age': [25, 30, 35, 40],
'City': ['New York', 'London', 'Paris', 'Sydney']})
Now that we have our data frame, we can use PandasSQL to execute SQL queries on it. Here's an
example of a simple query to select all the data from our table:
query = "SELECT * FROM df"
result = sqldf(query)
print(result)

Name Age City
0 John 25 New York
1 Jane 30 London
2 Bob 35 Paris
3 Mary 40 Sydney
We can also use SQL to filter our data using the WHERE clause. For example, to select all rows where Age is
greater than 30:
query = """ SELECT * FROM df WHERE Age > 30 """
result = sqldf(query)
print(result)

Name Age City
0 Bob 35 Paris
1 Mary 40 Sydney

Problem Statement: You are working for a coffee shop and have been given a dataset containing
information about the sales transactions. The dataset contains information such as the date and time
of each transaction, the items purchased, each item's price, and the transaction's total amount. Your
task is to use SQL queries to analyze the data and provide insights that can help improve the sales and
operations of the coffee shop.
Dataset: Here is a sample dataset containing information about the sales transactions:
| transaction_id | date       | time  | item          | price | quantity | total_amount |
|----------------|------------|-------|---------------|-------|----------|--------------|
| 1              | 2022-01-01 | 9:00  | Coffee        | 2.50  | 1        | 2.50         |
| 2              | 2022-01-01 | 10:30 | Croissant     | 1.75  | 2        | 3.50         |
| 3              | 2022-01-01 | 11:15 | Cappuccino    | 3.00  | 1        | 3.00         |
| 4              | 2022-01-02 | 8:45  | Latte         | 3.50  | 2        | 7.00         |
| 5              | 2022-01-02 | 10:00 | Muffin        | 2.25  | 1        | 2.25         |
| 6              | 2022-01-02 | 12:15 | Espresso      | 2.75  | 1        | 2.75         |
| 7              | 2022-01-03 | 9:30  | Croissant     | 1.75  | 3        | 5.25         |
| 8              | 2022-01-03 | 11:00 | Iced Coffee   | 3.50  | 2        | 7.00         |
| 9              | 2022-01-03 | 12:45 | Hot Chocolate | 3.25  | 1        | 3.25         |
| 10             | 2022-01-04 | 10:15 | Cappuccino    | 3.00  | 2        | 6.00         |

# Importing libraries
import pandas as pd
import sqlite3
from pandasql import sqldf
mysql = lambda q: sqldf(q, globals())
# Connect to the database; keep the connection open for the
# queries below and call conn.close() only when you are done
conn = sqlite3.connect('coffee_shop.db')
c = conn.cursor()

Create a new table named "transactions" in the database with the columns: "transaction_id", "date",
"time", "item", "price", "quantity", and "total_amount".

# Query 1: Create table
c.execute('''DROP TABLE IF EXISTS transactions;''')
c.execute('''
CREATE TABLE transactions (
transaction_id INTEGER PRIMARY KEY,
date DATE,
time TIME,
item TEXT,
price REAL,
quantity INTEGER,
total_amount REAL
);
''')

Insert the data from the sample dataset into the "transactions" table.

# Query 2: Insert data into table
c.execute('''
INSERT INTO transactions (
transaction_id, date, time, item, price, quantity, total_amount)
VALUES
(1, '2022-01-01', '09:00', 'Coffee', 2.50, 1, 2.50),
(2, '2022-01-01', '10:30', 'Croissant', 1.75, 2, 3.50),
(3, '2022-01-01', '11:15', 'Cappuccino', 3.00, 1, 3.00),
(4, '2022-01-02', '08:45', 'Latte', 3.50, 2, 7.00),
(5, '2022-01-02', '10:00', 'Muffin', 2.25, 1, 2.25),
(6, '2022-01-02', '12:15', 'Espresso', 2.75, 1, 2.75),
(7, '2022-01-03', '09:30', 'Croissant', 1.75, 3, 5.25),
(8, '2022-01-03', '11:00', 'Iced Coffee', 3.50, 2, 7.00),
(9, '2022-01-03', '12:45', 'Hot Chocolate', 3.25, 1, 3.25),
(10, '2022-01-04', '10:15', 'Cappuccino', 3.00, 2, 6.00);
''')

Write a query to find the total number of transactions in the dataset.

# Query 3: Find the total number of transactions in the dataset
pd.read_sql_query('''
SELECT COUNT(*) AS total_transactions
FROM transactions;
''',conn)

Write a query to find the total amount of sales for each day in the dataset.

# Query 4: Find the total amount of sales for each day in the dataset
pd.read_sql_query('''
SELECT date, SUM(total_amount) AS total_sales
FROM transactions
GROUP BY date;''',conn)

date total_sales
0 2022-01-01 9.00
1 2022-01-02 12.00
2 2022-01-03 15.50
3 2022-01-04 6.00

Database Management and Schema Design

A Database Management System (DBMS) is a software system that allows users
to create, store, and manage data in a structured and organized manner. It
provides an interface for users to interact with the data stored in the system and
includes tools for data querying, reporting, and analysis.
Examples of DBMS include:
1. MySQL: A popular open-source DBMS that's widely used for web
applications and content management systems.
2. Microsoft SQL Server: A powerful DBMS that's commonly used for enterprise-
level applications and data warehousing.
3. Oracle Database: A widely-used DBMS that's known for its scalability,
reliability, and high performance.
4. PostgreSQL: A robust and feature-rich open-source DBMS that's often used
in scientific and research applications.
5. MongoDB: A popular NoSQL DBMS that's designed to handle unstructured
data such as documents, images, and videos.

Understanding The Relational Model and Database Schema Design:


Imagine you have a bunch of puzzle pieces that you need to put together to create
a beautiful picture. That's similar to how the relational model works in database
schema design.
The relational model is a way of organizing data into tables, where each table
represents a different type of information. Each table consists of rows and
columns, and the relationships between these tables are defined by the data they
share.

Think of the schema as the blueprint for your database. It outlines the structure
of the tables, the columns within them, and how they relate to each other. It's like
a map that tells you where everything is located and how to get there.
Now, let's talk about database schema design. It's like creating a puzzle that's
not only beautiful but also efficient and effective. To do this, you need to carefully
consider how the tables and their columns are organized, what type of data is
stored in each column, and how the tables relate to each other.

Let's consider an example of a database schema design for an e-commerce website that sells
products to customers. The schema will consist of several tables, each representing a different
aspect of the business.
Table 1: Customers. Columns: customer_id, first_name, last_name, email, address, city, state, zip_code
Table 2: Orders. Columns: order_id, customer_id, order_date, total_amount
Table 3: Order_Items. Columns: order_item_id, order_id, product_id, quantity, price
Table 4: Products. Columns: product_id, product_name, description, price, category_id
Table 5: Categories. Columns: category_id, category_name, description
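As a sketch, the schema above could be created in PostgreSQL as follows (the column types are assumptions, since only the column names are given):
-- Parent tables first, so the foreign keys below can reference them
CREATE TABLE categories (
category_id SERIAL PRIMARY KEY,
category_name VARCHAR(255),
description TEXT);
CREATE TABLE products (
product_id SERIAL PRIMARY KEY,
product_name VARCHAR(255),
description TEXT,
price NUMERIC(10, 2),
category_id INTEGER REFERENCES categories(category_id));
CREATE TABLE customers (
customer_id SERIAL PRIMARY KEY,
first_name VARCHAR(100),
last_name VARCHAR(100),
email VARCHAR(255),
address VARCHAR(255),
city VARCHAR(100),
state VARCHAR(100),
zip_code VARCHAR(20));
CREATE TABLE orders (
order_id SERIAL PRIMARY KEY,
customer_id INTEGER REFERENCES customers(customer_id),
order_date DATE,
total_amount NUMERIC(10, 2));
CREATE TABLE order_items (
order_item_id SERIAL PRIMARY KEY,
order_id INTEGER REFERENCES orders(order_id),
product_id INTEGER REFERENCES products(product_id),
quantity INTEGER,
price NUMERIC(10, 2));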
By structuring the data in this way, we can efficiently query and analyze the data
to answer important business questions, such as:
 What is the total revenue generated by each product category?
 How many orders were placed by each customer?
 What are the top-selling products by category?
 How many orders were placed in a particular time period?
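For example, the first question (total revenue by product category) could be answered with a join across Order_Items, Products, and Categories; a sketch:
SELECT c.category_name,
SUM(oi.quantity * oi.price) AS total_revenue
FROM order_items oi
JOIN products p ON oi.product_id = p.product_id
JOIN categories c ON p.category_id = c.category_id
GROUP BY c.category_name
ORDER BY total_revenue DESC;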

ERD:
ERD stands for Entity-Relationship Diagram. It visually represents the entities and
the relationships between them in a database schema. ERD diagrams are used in
the design phase of database development to identify and represent the entities
and their attributes, as well as the relationships between them.

In an ERD diagram, entities are represented by rectangles, and the relationships
between them are represented by lines connecting the rectangles. The attributes
of each entity are listed inside the rectangle. There are three types of
relationships between entities: one-to-one, one-to-many, and many-to-many.
These relationships are represented by different types of lines connecting the
entities in the ERD diagram.

Now let's create an ERD for the e-commerce website database schema
STEP 1: Create a new Database in PostgreSQL called e-commerce.

STEP 2: Create the Customers, Orders, Order_items, Products, and Categories tables with columns
as mentioned above in the real-world example.

STEP 3: Now right-click on the e-commerce database and select ‘ERD For Database’

STEP 4: The ERD editor interface appears, in which:
 You can select and drag the rectangles to change their position on the sheet.
 You can edit a table by selecting it and clicking the pencil option.
 You can add one-to-many or many-to-many relations by clicking the 1M and MM options.
 You can customize colors by using the fill color and text color options.

STEP 5: Let's add relationships to our tables. Select the Orders table and click the one-to-many
option. A dialogue box appears, in which we connect the Orders and Customers tables. Since the
foreign key is customer_id, we select it.

STEP 6: Click on Save. You can now see that the Orders table and Customers table are connected
by a 1M line.

STEP 7: Similarly, create relations between the other tables using their foreign keys. All five tables
should now be connected in the diagram.

Normalization and Denormalization of Database Tables
Normalization and denormalization are techniques used in database design to
optimize the structure of tables and improve performance.
Normalization is the process of organizing data in a database so that it is
structured efficiently, without any redundant or duplicated data. It involves
dividing a table into smaller, more manageable tables and establishing
relationships between them. Normalization aims to reduce data redundancy
and improve data consistency and integrity.
On the other hand, denormalization is the process of intentionally adding
redundant data to a table to improve query performance. It involves
combining tables or adding columns to a table to reduce the number of joins
required to retrieve data. The goal of denormalization is to optimize query
performance by reducing the amount of time and resources needed to retrieve
data

To understand the concept of normalization and denormalization more intuitively,
consider the example of a table that contains information about customers and their
orders:
CustomerID | CustomerName | OrderID | OrderDate | ProductID | ProductName | Quantity
-----------------------------------------------------------------------------------
1 | John Smith | 1 | 01/01/2023 | 101 | Widget A | 2
1 | John Smith | 2 | 02/01/2023 | 102 | Widget B | 1
2 | Jane Doe | 3 | 03/01/2023 | 101 | Widget A | 3
2 | Jane Doe | 4 | 04/01/2023 | 103 | Widget C | 2
This table contains redundant data, as the customer information is duplicated
for each order they make. To normalize the table, we could split it into two
separate tables: one for customers and one for orders. The tables would be
related through a foreign key in the orders table that references the customer
ID in the customer table. The resulting tables might look something like this:
Customers Table
---------------
CustomerID | CustomerName
-------------------------
1 | John Smith
2 | Jane Doe
Orders Table
------------
OrderID | OrderDate | CustomerID | ProductID | Quantity
--------------------------------------------------------
1 | 01/01/2023 | 1 | 101 | 2
2 | 02/01/2023 | 1 | 102 | 1
3 | 03/01/2023 | 2 | 101 | 3
4 | 04/01/2023 | 2 | 103 | 2
By splitting the table into two separate tables, we have eliminated redundant
customer information, resulting in a more efficient and manageable structure.
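In SQL, this normalized design could be sketched roughly as follows (the column types are assumed):

CREATE TABLE Customers (
    CustomerID   INT PRIMARY KEY,
    CustomerName VARCHAR(100)
);

CREATE TABLE Orders (
    OrderID    INT PRIMARY KEY,
    OrderDate  DATE,
    CustomerID INT REFERENCES Customers (CustomerID), -- foreign key back to Customers
    ProductID  INT,
    Quantity   INT
);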
Now, we want to retrieve all orders made by a particular customer, including
the name of the product ordered. With the normalized structure, we would
need to perform a join between the two tables to retrieve the necessary data:
SELECT Customers.CustomerName, Orders.OrderID, Orders.OrderDate,
Orders.ProductID, Orders.Quantity, Products.ProductName
FROM Customers
INNER JOIN Orders ON Customers.CustomerID = Orders.CustomerID
INNER JOIN Products ON Orders.ProductID = Products.ProductID
WHERE Customers.CustomerID = 1;
This query would retrieve all orders made by the customer with ID 1, including
the name of the product ordered. However, the join operation can be
resource-intensive and slow down query performance, especially when
dealing with large tables.
To optimize query performance, we can use denormalization by adding the
ProductName column to the Orders table. This way, we can retrieve all
necessary data in a single table without the need for joins:
Orders Table
------------
OrderID | OrderDate  | CustomerID | ProductID | ProductName | Quantity
-----------------------------------------------------------------------
1       | 01/01/2023 | 1          | 101       | Widget A    | 2
2       | 02/01/2023 | 1          | 102       | Widget B    | 1
3       | 03/01/2023 | 2          | 101       | Widget A    | 3
4       | 04/01/2023 | 2          | 103       | Widget C    | 2
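With this denormalized structure, the earlier question can be answered from a single table, with no join (a sketch, assuming the table above):

SELECT OrderID, OrderDate, ProductName, Quantity
FROM Orders
WHERE CustomerID = 1;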

Database Administration Tasks (e.g. backup and recovery, security, performance tuning, etc.)
Security: Ensuring the safety and privacy of your data is crucial. Think of
database security as a lock on your diary - you want to keep your thoughts
private and secure from unauthorized access. Database security tasks
include setting up user access controls, encryption, and firewalls to protect
against cyber attacks.
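In PostgreSQL, for instance, basic access control might be set up as follows (the role and table names here are assumptions for illustration):

-- Create a role that can log in, and give it read-only access to one table
CREATE ROLE analyst LOGIN PASSWORD 'change_me';
GRANT SELECT ON orders TO analyst;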
Backup and Recovery: Like making a photocopy of important documents,
backing up your database is important to prevent data loss due to hardware
failures or disasters. Recovery tasks involve restoring the database to a
previous state or point in time to recover data in case of unexpected data
loss. Think of backup and recovery as a "time machine" for your data.
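As a sketch, a PostgreSQL database can be backed up and later restored from the command line with the pg_dump and pg_restore tools (the database and file names here are assumptions):

pg_dump -U postgres -F c -f ecommerce.backup ecommerce
pg_restore -U postgres -d ecommerce ecommerce.backup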
Performance Tuning: The speed at which your database performs can affect the
overall performance of your application. Performance tuning tasks involve
optimizing the database configuration, indexes, and queries to improve
response time and throughput. Think of performance tuning as giving your car
a tune-up before a long road trip.
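In PostgreSQL, for example, EXPLAIN ANALYZE runs a query and reports how it was executed and where time was spent; the query below is purely illustrative:

EXPLAIN ANALYZE
SELECT OrderID, OrderDate
FROM Orders
WHERE CustomerID = 1;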
Database Maintenance: Like regular maintenance on your car, database
maintenance involves routine tasks to keep your database running smoothly.
This includes tasks such as monitoring database health, updating software,
and removing unnecessary data. Think of database maintenance as a regular
check-up with your doctor.
Reporting and Analytics: Database administrators may also be responsible for
generating reports and analyzing data to provide insights into business
operations. Think of reporting and analytics as a dashboard in your car that
shows you real-time information on your vehicle's performance.

Implementing Indexes and Constraints for Data Integrity
Indexes and constraints are important tools in SQL for maintaining data
integrity and optimizing query performance.
Creating Indexes: Indexes are used to speed up queries by allowing the database to locate the data being searched for quickly. To create an index on a table, you can use the CREATE INDEX statement, followed by the name of the index, the table name, and the columns to be indexed. For example:
CREATE INDEX idx_customer ON customer (last_name, first_name);
This creates an index called "idx_customer" on the "customer" table, indexing
the "last_name" and "first_name" columns.
Adding Constraints: Constraints are used to ensure data integrity by enforcing rules on the data in a table.
In SQL, different types of constraints can be used to specify rules for the data
that can be entered into a table:
NOT NULL Constraint: This constraint ensures that a column does not accept
null values, which means that every row must have a value for that column.
UNIQUE Constraint: This constraint ensures that the values in a column are
unique across all the rows in the table. A table can have multiple unique
constraints on different columns.
PRIMARY KEY Constraint: This constraint combines the NOT NULL and UNIQUE
constraints. It ensures that the values in a column or a combination of
columns uniquely identify each row in the table.
FOREIGN KEY Constraint: This constraint ensures that the values in a column or
a combination of columns in one table match the values in a column or a
combination of columns in another table. This is used to enforce referential
integrity between related tables.
CHECK Constraint: This constraint specifies a condition that must be met for a column's value to be inserted or updated in a table. The condition can be any Boolean expression that evaluates to true or false.
DEFAULT Constraint: This constraint specifies a default value for a column when no value is explicitly provided during an insert operation.
These constraints are used to maintain data integrity and prevent data
inconsistencies in a database. The database management system enforces
them and ensures that only valid data is stored in the database.
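Several of these constraints can be combined in a single table definition; a minimal sketch (the table and column names are assumptions, and a departments table is assumed to already exist):

CREATE TABLE employees (
    employee_id   INT PRIMARY KEY,                  -- NOT NULL and UNIQUE combined
    email         VARCHAR(100) NOT NULL UNIQUE,     -- required and unique
    salary        DECIMAL(10,2) CHECK (salary > 0), -- condition must evaluate to true
    hire_date     DATE DEFAULT CURRENT_DATE,        -- used when no value is provided
    department_id INT REFERENCES departments (department_id) -- foreign key
);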
To add a constraint to a table, you can use the ALTER TABLE statement,
followed by the name of the table and the constraint to be added. For
example:
ALTER TABLE orders
ADD CONSTRAINT fk_customer FOREIGN KEY (customer_id) REFERENCES customer (id);
This adds a foreign key constraint called "fk_customer" to the "orders" table,
referencing the "id" column in the "customer" table.
By using indexes and constraints, you can ensure that your data is accurate
and accessible and that your queries run as efficiently as possible.

Designing Efficient Database Queries for Performance Optimization
Designing efficient database queries is essential for optimizing performance
in SQL. Here are some tips to help you design efficient queries:
Use Indexes: 🔍 Indexes can speed up queries by allowing the database to locate the data being searched quickly. Use indexed columns in your WHERE and JOIN clauses when designing your queries.

Avoid SELECT *: 🚫 Avoid using SELECT * in your queries, as this can cause unnecessary overhead by returning all columns, even those that aren't needed. Instead, explicitly list the columns you need in your SELECT statement.

Use JOINs: 🔗 JOINs are a powerful tool for combining data from multiple tables. When using JOINs, be sure to join on indexed columns and use INNER JOIN instead of OUTER JOIN where possible.

Use Subqueries: 🔍 Subqueries can be used to break down complex queries into smaller, more manageable pieces. When using subqueries, try to use EXISTS or IN instead of NOT EXISTS or NOT IN, as the latter can be less efficient.

Limit Results: 🔢 Limiting the number of results returned by your queries can help improve performance. Use LIMIT or TOP to limit the number of results returned, and use WHERE clauses to filter out unnecessary data.
By following these tips, you can design efficient database queries that
optimize performance and improve the overall user experience.
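Putting several of these tips together, a query against an orders/customers schema might look like the following (the table and column names are illustrative):

SELECT o.OrderID, o.OrderDate, c.LastName               -- explicit columns, no SELECT *
FROM Orders o
INNER JOIN Customers c ON c.CustomerID = o.CustomerID   -- INNER JOIN on an indexed key
WHERE o.OrderDate >= '2022-01-01'                       -- filter out unneeded rows
ORDER BY o.OrderDate DESC
LIMIT 10;                                               -- cap the result set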
You are working in an online retail store as a data manager, and you are provided with the following dataset that contains four tables:
Customers:
CustomerID FirstName LastName Email Phone
1 John Doe [email protected] 555-1234
2 Jane Smith [email protected] 555-5678
3 Bob Johnson [email protected] 555-9012
Products
ProductID ProductName Description Category Price
1 Product A This is product A Category 1 10.00
2 Product B This is product B Category 1 20.00
3 Product C This is product C Category 2 30.00
4 Product D This is product D Category 2 40.00
5 Product E This is product E Category 3 50.00
Orders
OrderID CustomerID OrderDate ShipDate TotalAmount
1 1 2022-01-01 2022-01-05 50.00
2 2 2022-01-02 2022-01-06 70.00
3 3 2022-01-03 2022-01-07 90.00
OrderDetails
OrderID ProductID Quantity Price
1 1 2 10.00
1 3 1 30.00
2 2 3 20.00
2 4 2 40.00
3 1 1 10.00
3 2 2 20.00
3 3 1 30.00
3 4 1 40.00
3 5 1 50.00
Upload the tables into a database in SQL, and describe what you understand from the tables.
CREATE TABLE Customers (
    CustomerID INT PRIMARY KEY,
    FirstName VARCHAR(50),
    LastName VARCHAR(50),
    Email VARCHAR(50),
    Phone VARCHAR(20)
);

CREATE TABLE Products (
    ProductID INT PRIMARY KEY,
    ProductName VARCHAR(50),
    Description TEXT,
    Category VARCHAR(50),
    Price DECIMAL(10,2)
);

CREATE TABLE Orders (
    OrderID INT PRIMARY KEY,
    CustomerID INT,
    OrderDate DATE,
    ShipDate DATE,
    TotalAmount DECIMAL(10,2),
    FOREIGN KEY (CustomerID) REFERENCES Customers(CustomerID)
);

CREATE TABLE OrderDetails (
    OrderID INT,
    ProductID INT,
    Quantity INT,
    Price DECIMAL(10,2),
    FOREIGN KEY (OrderID) REFERENCES Orders(OrderID),
    FOREIGN KEY (ProductID) REFERENCES Products(ProductID)
);
-- Insert sample data
INSERT INTO Customers (CustomerID, FirstName, LastName, Email, Phone)
VALUES
(1, 'John', 'Doe', '[email protected]', '555-1234'),
(2, 'Jane', 'Smith', '[email protected]', '555-5678'),
(3, 'Bob', 'Johnson', '[email protected]', '555-9012');

INSERT INTO Products (ProductID, ProductName, Description, Category, Price)
VALUES
(1, 'Product A', 'This is product A', 'Category 1', 10.00),
(2, 'Product B', 'This is product B', 'Category 1', 20.00),
(3, 'Product C', 'This is product C', 'Category 2', 30.00),
(4, 'Product D', 'This is product D', 'Category 2', 40.00),
(5, 'Product E', 'This is product E', 'Category 3', 50.00);

INSERT INTO Orders (OrderID, CustomerID, OrderDate, ShipDate, TotalAmount)
VALUES
(1, 1, '2022-01-01', '2022-01-05', 50.00),
(2, 2, '2022-01-02', '2022-01-06', 70.00),
(3, 3, '2022-01-03', '2022-01-07', 90.00);

INSERT INTO OrderDetails (OrderID, ProductID, Quantity, Price) VALUES
(1, 1, 2, 10.00),
(1, 3, 1, 30.00),
(2, 2, 3, 20.00),
(2, 4, 2, 40.00),
(3, 1, 1, 10.00),
(3, 2, 2, 20.00),
(3, 3, 1, 30.00),
(3, 4, 1, 40.00),
(3, 5, 1, 50.00);
Display the orders placed by each customer.
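One possible query (a sketch against the tables created above) joins Orders to Customers and lists each customer's orders:

SELECT c.FirstName, c.LastName, o.OrderID, o.OrderDate, o.TotalAmount
FROM Customers c
INNER JOIN Orders o ON o.CustomerID = c.CustomerID
ORDER BY c.CustomerID, o.OrderDate;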
Ensure that all customer emails are unique.
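A UNIQUE constraint enforces this at the database level (the constraint name is an assumption):

ALTER TABLE Customers ADD CONSTRAINT uq_customers_email UNIQUE (Email);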
It is important to know that your data is secure. So which SQL server do you think offers the best security features?
The level of security offered by SQL servers can depend on many factors,
including the version of the server, the security features that are
implemented, and the specific configuration of the server. With that said, here
are a few SQL servers that are often considered to offer high levels of
security:
1. Microsoft SQL Server: Microsoft SQL Server is a popular relational
database management system that provides several security features,
including advanced encryption options, fine-grained access controls,
and auditing and compliance tools. Additionally, Microsoft regularly
releases security updates and patches to address known vulnerabilities
and exploits.
2. Oracle Database: Oracle Database is another widely-used database
management system that provides a range of security features,
including secure data transmission, access controls, and database
auditing. Oracle also offers several security tools and features that help
administrators to identify and respond to security threats.
3. PostgreSQL: PostgreSQL is an open-source relational database
management system that provides advanced security features,
including SSL/TLS encryption, strong authentication mechanisms, and
flexible access controls. PostgreSQL also has a strong security
community that is dedicated to identifying and addressing security
issues.
4. IBM Db2: IBM Db2 is a database management system that provides
several security features, including advanced encryption options,
secure user authentication, and fine-grained access controls. Db2 also
provides auditing and compliance features to help administrators to
monitor and manage security risks.
Ultimately, the choice of SQL server will depend on various factors, including
the organization's specific security requirements, the administrators'
expertise, and the available resources. However, all of the SQL servers
mentioned above can provide strong security features and protections for
sensitive data.