Module 3
SQL is important for data science because it is the most commonly used language
for working with relational databases. It allows data scientists to store, integrate,
clean, and analyze large volumes of data efficiently.
SQL provides a powerful set of tools for data manipulation and analysis, making it
an essential skill for any data scientist working with large and complex datasets.
By mastering SQL, data scientists can extract more insights from data and make
more informed data-driven decisions.
Setting up the SQL environment
Modern Relational Database Management Systems:
Modern Relational Database Management Systems (RDBMS) are software systems
designed to manage and manipulate large amounts of data stored in a structured
format.
They provide a way to organize, store, and retrieve data efficiently and are widely
used in industries such as finance, healthcare, and e-commerce.
1. MySQL: An open-source RDBMS that is widely used for web applications and
data-driven websites.
2. PostgreSQL: Another open-source RDBMS that is known for its advanced
features, including support for JSON and geospatial data.
3. Oracle: A proprietary RDBMS that is widely used in enterprise environments
for managing large amounts of data.
4. Microsoft SQL Server: A proprietary RDBMS developed by Microsoft, which is
widely used in business environments for managing and analyzing data.
These modern RDBMS provide a variety of features and tools for managing and
manipulating data, such as support for transactions, indexing, backup and
recovery, and security. They are designed to handle large volumes of data and
provide fast and efficient data access, making them a critical component of
many modern data-driven applications.
Setting Up
Now we are going to set up a SQL environment in PostgreSQL:
1. You can download the latest version of PostgreSQL from the official
website and install it on your system. Visit
https://www.postgresql.org/download/ to download PostgreSQL.
2. Now in the Packagers and Installers section, select your operating system.
And in the page that appears, select your required version.
3. PostgreSQL starts downloading. When the download is complete,
run the installer, grant the necessary permissions, enter the password you want,
and select the location to install PostgreSQL.
4. Now open the pgAdmin 4 application on your computer. You can find it
using the search bar.
5. Click on ‘Servers’ in the Browser bar of pgAdmin 4 and select ‘PostgreSQL
15’.
6. Right-click on the ‘Databases’ option -> ‘Create’ -> select ‘Database’.
7. In the Create-Database dialogue box, enter the name of your database and
click Save. Here, the database is named ‘dvdrental’, loaded from the ‘dvdrental.tar’ backup file.
8. The database created appears in the ‘Databases’ section. Right now, the
database created is empty; you can load the data by right-clicking on the
Database and selecting ‘restore’.
9. The Restore dialogue box appears. Make sure that ‘Custom or Tar’ is
shown in the Format field, then select or copy the file path and paste it
into the Filename field.
10. Go to the Data/Objects section and make sure you have turned on the
Pre-data, Data, and Post-data options. Now click on Restore.
11. Now right-click on the database you have created and select Refresh.
12. To enter your Query, right-click on the database you have created and
select the option ‘Query Tool’; this is where you will enter your queries
throughout this course.
SQL enables Data Scientists to retrieve and analyze data with powerful query
statements, making it a valuable skill for extracting insights.
To establish a SQL environment for practice, one can install database
management systems like PostgreSQL or MySQL on their local machine.
Data manipulation tasks, such as filtering, grouping, and sorting, can be
efficiently accomplished in SQL using clauses such as WHERE, GROUP BY, and ORDER BY.
In Data Science projects, SQL is often used to perform data cleaning, where
redundant information is eliminated, and data is organized for further analysis.
FROM: The FROM clause specifies the table or tables from which to retrieve the
data.
SELECT column1, column2, ...
FROM table_name;
WHERE: The WHERE clause filters the results based on a specific condition or set
of conditions.
SELECT column1, column2, ...
FROM table_name
WHERE condition;
SELECT *
FROM film
WHERE rental_duration = 6;
GROUP BY: The GROUP BY clause groups the results based on one or more
columns.
SELECT column1, column2, ...
FROM table_name
WHERE condition
GROUP BY column1, column2, ...;
To group film titles by their rating where rental_duration is 6 days:
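A sketch of such a query against the dvdrental film table (the title and rating columns exist in that schema):

```sql
-- Count film titles per rating for films with a rental_duration of 6
SELECT rating, COUNT(title)
FROM film
WHERE rental_duration = 6
GROUP BY rating;
```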
HAVING: The HAVING clause filters the results of a GROUP BY query based on a
specific condition or set of conditions.
SELECT column1, column2, ...
FROM table_name
WHERE condition
GROUP BY column1, column2, ...
HAVING condition;
To get the group of film titles, rating, and rental_duration with rental_duration = 6
having a rating of R
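One way to write this, sketched against the dvdrental film table:

```sql
-- HAVING filters the groups after aggregation; here we keep the 'R' group
SELECT rating, COUNT(title), rental_duration
FROM film
WHERE rental_duration = 6
GROUP BY rating, rental_duration
HAVING rating = 'R';
```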
ORDER BY: The ORDER BY clause is used to sort the results in either ascending or
descending order based on one or more columns.
SELECT column1, column2, ...
FROM table_name
WHERE condition
GROUP BY column1, column2, ...
HAVING condition
ORDER BY column1, column2, ... ASC|DESC;
You can use ORDER BY, GROUP BY, and HAVING independently of one another, but every query still requires SELECT and FROM.
1. Right-click on the tables option in the Database dropdown menu and select
‘Create Table’.
2. The Create Table dialogue box appears. Enter the name of the table.
3. Go to the Columns section and add your column name, data type, scale,
etc. or you can inherit the format from an existing table.
4. Click on Save.
To create a table in SQL, you can use the CREATE TABLE statement, which
allows you to define the table's columns and their data types. Here's an example
of how to create a simple table with two columns:
CREATE TABLE users (
id INT PRIMARY KEY,
name VARCHAR(50) NOT NULL);
In this example, we are creating a table called "users" with two columns: "id" and
"name". The "id" column is an integer and is set as the table's primary key using
the PRIMARY KEY constraint. The "name" column is a variable-length string of up
to 50 characters and is set not to allow null values using the NOT NULL
constraint.
You can customize the column definitions to fit your specific needs, including
adding more columns, specifying default values, and setting constraints. Once
you have created your table, you can insert data into it using the INSERT INTO
statement.
Deleting tables
To delete a database, right-click on the database name in the Servers dropdown
menu and select the ‘Delete/Drop’ option.
To delete a table, go to the tables option in your Database dropdown menu, right-
click on the table you want, and select delete.
In this example, we use the COPY command to import data into a table called
"mytable". The data is being read from a file located at '/path/to/myfile.csv'. The
"FORMAT csv" option specifies that the file is in CSV format.
If your CSV file has a header row, you can skip it by adding the "HEADER" option
to the COPY command:
COPY mytable FROM '/path/to/myfile.csv' WITH (FORMAT csv, HEADER);
This will skip the first row of the CSV file, which is assumed to contain column
names.
In this example, we are using the COPY command to export data from a table
called "mytable" to a file located at '/path/to/myfile.csv'. The "FORMAT csv" option
specifies that the data should be written in CSV format.
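The export command described above would look like this (same placeholders):

```sql
COPY mytable TO '/path/to/myfile.csv' WITH (FORMAT csv);
```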
You can also include a header row in the exported CSV file by adding the
"HEADER" option to the COPY command:
COPY mytable TO '/path/to/myfile.csv' WITH (FORMAT csv, HEADER);
You have been hired as a data analyst for a small startup that sells organic fruits
and vegetables. Your task is to analyze the sales data from the past year and
identify the top-selling products and the most profitable regions. To do this, you
will use SQL to query the data stored in a database.
Follow the steps below to set up your SQL environment:
1. Install PGAdmin, which is a popular open-source administration and
management tool for PostgreSQL.
2. Open PGAdmin and create a new server connection by right-clicking
"Servers" and selecting "Create" > "Server".
3. In the "Create - Server" window, provide a name for the server in the
"Name" field.
4. In the "Connection" tab, provide the following details:
Host: localhost
Port: 5432
Maintenance database: postgres
Username: <your PostgreSQL username>
Password: <your PostgreSQL password>
1. Click "Save" to save the server configuration.
2. Now, right-click the server name and select "Create" > "Database" to
create a new database.
3. Provide a name for the database and click "Save" to create it.
4. Next, right-click the database name and select "Query Tool" to open the
SQL query editor.
Task 1: Create the 'sales_data' table in the database using a CREATE TABLE
statement. Include the following columns: product_name, region,
sales_date, units_sold, revenue.
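A possible CREATE TABLE statement for this task; the data types are assumptions based on the sample rows that follow:

```sql
CREATE TABLE sales_data (
    product_name VARCHAR(100),
    region VARCHAR(50),
    sales_date DATE,
    units_sold INT,
    revenue DECIMAL(10, 2));
```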
-- INSERT header reconstructed from the task's column list
INSERT INTO sales_data (product_name, region, sales_date, units_sold, revenue)
VALUES
('Organic Apples', 'North', '2022-02-01', 75, 1500.00),
('Organic Bananas', 'North', '2022-02-01', 125, 2500.00),
('Organic Apples', 'South', '2022-02-01', 25, 500.00),
('Organic Bananas', 'South', '2022-02-01', 175, 3500.00);
The anatomy of an SQL query can be broken down into several parts. The first part
is the SELECT statement. This is where you specify what columns you want to
retrieve from the database. The second part of the SQL query is the FROM
statement. This specifies which table you want to retrieve data from. The third
part of the SQL query is the WHERE statement. This allows you to filter the results
of your query based on specific conditions. The fourth part of the SQL query is the
ORDER BY statement. This allows you to sort the results of your query based on a
particular column. Finally, there's the LIMIT statement. This allows you to specify
a limit on the number of results returned by your query. For example, if you only
wanted to retrieve the first 10 customers in your table, you would use the LIMIT
statement to specify that.
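Putting the five parts together in one query (the table and column names are illustrative):

```sql
-- SELECT + FROM + WHERE + ORDER BY + LIMIT in one statement
SELECT customer_id, customer_name
FROM customers
WHERE city = 'New York'
ORDER BY customer_name ASC
LIMIT 10;
```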
SQL offers a variety of data types, from basic types like integers and strings to
more complex ones like dates and times. Integers for storing whole numbers like
IDs or quantities. Strings for storing text like names or addresses. They also
include specialized types for temporal data, like dates and times, to keep track of
important events like when a document was last modified or when a customer
made a purchase. Spatial data types can also be used to store geographic
coordinates or other location-based data.
Datatypes in SQL
1. Integer: An integer data type is used for whole numbers, both positive and
negative, with no fractional part. Examples of integers are 1, -10, and 1000.
In SQL, the INT or INTEGER keyword is used to define this data type.
2. Decimal/Float: These data types are used for numbers with a fractional part.
Decimal and float data types differ in precision and storage size. Examples
of decimals and floats are 3.14, -0.005, and 1.23456. In SQL, the DECIMAL
or FLOAT keyword defines these data types.
3. Char/Varchar: Char and varchar data types store text values. Char data type
has a fixed length, while varchar has a variable length. Examples of char
and varchar values are "hello", "world", and "SQL is awesome". In SQL, the
CHAR and VARCHAR keywords are used to define these data types.
4. Date/Time: Date and time data types are used for storing date and time
values. Examples of date and time values are "2022-03-23" and "14:30:00".
In SQL, the DATE and TIME keywords are used to define these data types.
5. Boolean: The Boolean data type is used to store logical values. It can only
have two values: TRUE or FALSE. In SQL, the BOOLEAN keyword is used to
define this data type.
6. Blob: The Blob data type stores binary data, such as images, audio, or video
files. In SQL, the BLOB keyword is used to define this data type.
7. Text: The text data type is used for storing long text values, such as
comments or descriptions. In SQL, the TEXT keyword is used to define this
data type.
-- Table name assumed; the CREATE TABLE line was not shown in the original
CREATE TABLE employees (
id INT,
name VARCHAR(50),
hire_date DATE,
salary DECIMAL(10, 2),
is_manager BOOLEAN,
photo BLOB,
bio TEXT);
Operators in SQL
Arithmetic Operators
Comparison Operators
SELECT * FROM myTable WHERE price < 10;
SELECT * FROM myTable WHERE quantity > 100;
SELECT * FROM myTable WHERE date <= '2022-01-01';
SELECT * FROM myTable WHERE rating >= 4;
Logical Operators (AND, OR, NOT)
String Operators
Concatenation operator (||), LIKE operator (with the % and _ wildcards), LENGTH operator (LENGTH), UPPER and
LOWER operators (UPPER, LOWER), Substring operator (SUBSTRING)
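A few sketches of these string operators against the dvdrental tables (actor has first_name/last_name, film has title):

```sql
SELECT first_name || ' ' || last_name FROM actor;  -- concatenation
SELECT title FROM film WHERE title LIKE 'A%';      -- pattern matching
SELECT LENGTH(title) FROM film;                    -- string length
SELECT UPPER(title), LOWER(title) FROM film;       -- case conversion
SELECT SUBSTRING(title FROM 1 FOR 5) FROM film;    -- substring extraction
```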
COUNT(): This function is used to count the number of rows in a table or the
number of times a particular value appears in a column.
string_agg(): string_agg is an aggregate function in PostgreSQL that allows you to
concatenate multiple values from a column into a single string, with an optional
delimiter. It's commonly used to combine values from multiple rows into a single
value.
SUM(), AVG(), MIN(), MAX()
SELECT COUNT(*) FROM orders;
SELECT SUM(amount) FROM orders;
SELECT AVG(amount) FROM orders;
SELECT MIN(amount) FROM orders;
SELECT MAX(amount) FROM orders;
SELECT string_agg(customer_name, ', ') AS concatenated_names
FROM customers;
-- SELECT line reconstructed; the original line was not shown
SELECT product_id, AVG(sale_price)
FROM sales
GROUP BY product_id
HAVING AVG(sale_price) > 50;
This query will return only the products whose average sale price exceeds $50.
Subqueries:
A subquery is a query nested inside another query.
Syntax for SQL Subqueries
SELECT column1, column2, ...
FROM table_name
WHERE columnX operator (SELECT columnY FROM table_name WHERE condition);
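For example, in the dvdrental database a subquery can find the films priced above the average rental rate (a sketch):

```sql
SELECT title, rental_rate
FROM film
WHERE rental_rate > (SELECT AVG(rental_rate) FROM film);
```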
To join two or more tables in SQL, you need to use the JOIN keyword and specify the
columns on which the tables are related using the ON keyword. Here is an
example of how to join two tables using the INNER JOIN
SELECT *
FROM table1
INNER JOIN table2
ON table1.column_name = table2.column_name;
This statement selects all columns from both tables where there is a match
between the column_name in both tables.
There are different types of SQL joins available to perform this task, including:
1. INNER JOIN: Returns only the rows where there is a match between the
columns in both tables.
2. LEFT JOIN: Returns all the rows from the left table and the matching rows
from the right table. If there is no match, then NULL values are returned for
the right table.
3. RIGHT JOIN: Returns all the rows from the right table and the matching rows
from the left table. If there is no match, then NULL values are returned for
the left table.
4. FULL OUTER JOIN: Returns all the rows from both tables. If there is no
match, then NULL values are returned for the missing data.
5. CROSS JOIN: Returns the Cartesian product of both tables, which means all
the possible combinations of rows from both tables.
Using subqueries to aggregate data from multiple tables:
UNION: To combine the count of films and the count of actors into one result, use the
query below. Note that to use UNION, both SELECT statements must return the same
number of columns, and the corresponding columns must have compatible datatypes.
SELECT count(title) FROM film
UNION
SELECT count(first_name) FROM actor;
INTERSECT: To get the common first names of the actors and staff, we can use the
INTERSECT operator. Again, both columns should be of the same datatype.
SELECT first_name FROM actor
INTERSECT
SELECT first_name FROM staff;
EXCEPT or MINUS: To get the titles of all films except those with a rating of ‘R’, we can
use the EXCEPT operator (called MINUS in Oracle), as shown.
SELECT title FROM film
EXCEPT
SELECT title FROM film WHERE rating = 'R';
The HAVING clause is used to filter groups based on some condition. This is useful
when you want to limit the results only to include groups that meet certain
criteria.
Task 1:
From the dvdrental database, get the title, description, and category of the films.
Task 2:
Get the count of the films based on their category_id.
Task 3:
Get the average rental_duration of films where the film category is 3.
Task 4:
Get the average rent of films grouped by their rating. The output column name
must be ‘Avg_rent_based_on_rating’.
Task 5:
Get the title, rating, and film_id of films, excluding those rated PG and PG-13.
Task 1: We used SQL inner join on the film_id and category_id columns of film and category
tables to accomplish this task.
Task 2: Using subqueries, we aggregated the data from the film_category and category tables to get the
count of films in each category.
Task 3: We have to aggregate the tables by using GROUP BY and HAVING.
Task 4: Use the AS clause to create an alias for the avg(rental_rate) column. To get the
average of each group, use the GROUP BY clause.
Task 5: HINT: Use the EXCEPT clause of SQL to exclude films with the PG and PG-13 ratings.
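A possible solution for Task 5, following the hint:

```sql
SELECT title, rating, film_id FROM film
EXCEPT
SELECT title, rating, film_id FROM film
WHERE rating IN ('PG', 'PG-13');
```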
One of the most powerful features of SQL is the ability to join multiple tables together. 🔗 Joins allow you to
combine data from two or more tables into a single result set.
Self Join
A self join is a join in which a table is joined with itself. This is useful when you
have a table that contains hierarchical data, such as an organizational chart.
Table: employees
emp_id name manager_id
1 John 3
2 Jane 3
3 Bob 4
4 Alice null
In this example, the "manager_id" column is a foreign key that references the
"emp_id" column of the same table. To get the name of the manager for each
employee, we can use a self join:
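A sketch of that self join against the employees table above; a LEFT JOIN keeps Alice, whose manager_id is null:

```sql
-- Join the table to itself: e is the employee row, m is the manager row
SELECT e.name AS employee, m.name AS manager
FROM employees e
LEFT JOIN employees m ON e.manager_id = m.emp_id;
```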
Cross Join
A cross join, also known as a Cartesian join, returns the Cartesian product of the
two tables. This means that every row from the first table is combined with every
row from the second table.
Table: colors
red
blue
green
Table: sizes
S
M
L
SELECT colors.color, sizes.size
FROM colors
CROSS JOIN sizes;
This will give us the following result:
color size
red S
red M
red L
blue S
blue M
blue L
green S
green M
green L
Natural Join
A natural join is a type of join that automatically matches columns with the same
name from two tables. This can be useful when the two tables have columns with
identical names and you want to join them based on those columns.
Table: students
id name age
1 John 18
2 Jane 19
3 Bob 20
Table: grades
id grade
1 A
2 B
3 C
SELECT *
FROM students
NATURAL JOIN grades;
id name age grade
1 John 18 A
2 Jane 19 B
3 Bob 20 C
Table: employees
emp_id name dept_id
1 John 1
2 Jane 1
3 Bob 2
4 Alice 2
5 Michael 3
Table: departments
dept_id dept_name
1 HR
2 IT
3 Finance
Table: salaries
emp_id salary
1 50000
2 60000
3 70000
4 80000
5 90000
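A query joining all three tables above, which would produce a result like the one shown below (a sketch):

```sql
SELECT e.name, d.dept_name, s.salary
FROM employees e
JOIN departments d ON e.dept_id = d.dept_id
JOIN salaries s ON e.emp_id = s.emp_id;
```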
name dept_name salary
Jane HR 60000
Bob IT 70000
Alice IT 80000
Michael Finance 90000
To identify the duplicate records in the customers table, we can group the records
by the name and email columns and use the COUNT() function to return the
number of records in each group:
-- SELECT line reconstructed; the original line was not shown
SELECT name, email, COUNT(*)
FROM customers
GROUP BY name, email
HAVING COUNT(*) > 1;
name email COUNT(*)
John Smith [email protected] 2
Jane Doe [email protected] 2
Using DISTINCT
The DISTINCT keyword can be used to eliminate duplicates from a result set. It
returns only unique values from the specified column(s)
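For example, assuming a customers table with a city column:

```sql
-- Each city appears at most once in the result
SELECT DISTINCT city FROM customers;
```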
Using UNION
The UNION operator can be used to combine the results of two or more SELECT
statements, eliminating duplicates in the process.
Using UNION and UNION ALL to combine data from multiple tables
CREATE TABLE customers (
customer_id INT PRIMARY KEY,
name VARCHAR(50),
email VARCHAR(50));
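The orders table used by the INSERT statements that follow is not shown; a plausible definition, with column names assumed from the four inserted values (id, customer, date, amount):

```sql
-- Column names are assumptions; only the values were shown in the original
CREATE TABLE orders (
    order_id INT PRIMARY KEY,
    customer_id INT REFERENCES customers(customer_id),
    order_date DATE,
    amount DECIMAL(10, 2));
```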
INSERT INTO orders VALUES (1, 1, '2022-01-01', 100.00);
INSERT INTO orders VALUES (2, 1, '2022-02-01', 200.00);
INSERT INTO orders VALUES (3, 2, '2022-03-01', 300.00);
INSERT INTO orders VALUES (4, 3, '2022-04-01', 400.00);
-- INSERT header reconstructed; column names assumed from the questions that follow
INSERT INTO employees (name, dept_id, salary, manager_id) VALUES
('Mary', 1, 6000.00, 1),
('Joe', 2, 5500.00, 2),
('Sarah', 2, 6500.00, 2),
('Tom', 3, 7000.00, 3),
('Bob', 3, 8000.00, 3),
('Lisa', 4, 4500.00, 4),
('Mike', 4, 5500.00, 4),
('Jane', 5, 6000.00, 5),
('Alex', 5, 7000.00, 5),
('David', 1, 5000.00, 1),
('James', 1, 6000.00, 1),
('Karen', 2, 5500.00, 2),
('Rachel', 2, 6500.00, 2),
('Tim', 3, 7000.00, 3),
('Alice', 3, 8000.00, 3),
('Julie', 4, 4500.00, 4),
('Peter', 4, 5500.00, 4),
('Mark', 5, 6000.00, 5),
('Michelle', 5, 7000.00, 5);
Questions:
1. How many employees are working in each department?
2. Who are the managers of each department?
3. Who are the employees who are reporting to a specific manager?
4. Which department has the highest-paid employees?
5. What is the average salary of employees in each department?
6. Which employees are reporting to themselves (self-join)?
-- SELECT line reconstructed from the result columns shown below
SELECT e.name AS employee_name, m.name AS manager_name
FROM employees e
JOIN employees m ON e.manager_id = m.id
WHERE m.name = 'John';
employee_name manager_name
John John
Mary John
David John
James John
-- SELECT line reconstructed; the original line was not shown
SELECT DISTINCT *
FROM table1
INNER JOIN table2
ON table1.common_field = table2.common_field;
Here, we use an inner join to combine data from table1 and table2 based on the
common_field column. The DISTINCT keyword is used to return only distinct rows in
the result set.
This approach can be used when you have duplicate records with different
values in one or more columns, and you want to keep only the record with the
maximum value in a specific column.
Using ROW_NUMBER()
This function assigns a unique row number to each row in the result set,
which can be used to filter out duplicates. We can use a subquery to assign
row numbers to each record and then filter out the duplicates by selecting
only the rows with a row number of 1.
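A sketch of this approach on the customers table; the partitioning columns are assumed:

```sql
-- Number duplicate rows within each (name, email) group, then keep row 1
SELECT *
FROM (
    SELECT c.*,
           ROW_NUMBER() OVER (PARTITION BY name, email
                              ORDER BY customer_id) AS rn
    FROM customers c
) t
WHERE rn = 1;
```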
3 John Smith [email protected] Los Angeles
4 Sarah Johnson [email protected] New York
CEILING(): This function returns the smallest integer that is greater than or equal
to a given number.
CEILING(numeric_value)
Example: CEILING(3.14) returns 4.
FLOOR(): This function returns the largest integer that is less than or equal to a
given number.
FLOOR(numeric_value)
Example: FLOOR(3.14) returns 3.
In addition to these functions, SQL also has trigonometric functions like SIN(),
COS(), and TAN(), as well as logarithmic functions like LOG() and LN(). These
functions are useful for more advanced mathematical calculations.
TO_NUMBER(string_value, optional_format)
Example: TO_NUMBER('123.45', '999D99') converts the string '123.45' to a numeric
datatype with two decimal places.
The SQL function LEN() can be used to find the length of a string value (in
PostgreSQL, use CHAR_LENGTH or LENGTH).
The SQL function SIN() returns the sine value of an angle in radians; COS()
returns the cosine.
The SQL function POWER() is used to raise a number to a specified power; EXP()
raises e to a given power.
The SQL function MOD() returns the remainder after division of one number by
another; the % operator does the same.
The SQL function BITAND() performs a bitwise AND operation between two
numeric values; PostgreSQL uses the & operator instead.
The SQL function TRUNC() truncates a number to a specified number of decimal
places, while ROUND() rounds rather than truncating.
CASE
WHEN condition1 THEN result1
WHEN condition2 THEN result2
...
ELSE default_result
END
For example, under the new guidelines of the DVD Rental Corporation, the rental
rate for films with an ‘R’ rating has been increased by 10% and for all others by 50%.
We now need to calculate the new rate for each film in a new column called
penalty.
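A sketch of that calculation on the dvdrental film table:

```sql
SELECT title, rating,
       CASE
           WHEN rating = 'R' THEN rental_rate * 1.10  -- 10% increase for R
           ELSE rental_rate * 1.50                    -- 50% increase otherwise
       END AS penalty
FROM film;
```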
The CASE statement in SQL is used to perform conditional operations based on
specific conditions. In the CASE statement, conditions are evaluated in order, and
the first true condition will determine the result. The ELSE clause in the CASE
statement returns a default result if none of the conditions are true. The CASE
statement is a standard SQL feature and is supported by most SQL databases,
including PostgreSQL.
Activity:
The database contains the following tables:
Products (product_id, product_name, product_price)
Customers (customer_id, customer_name, customer_email,
customer_phone)
Sales (sale_id, sale_date, sale_quantity, sale_price, customer_id,
product_id)
-- INSERT header reconstructed; earlier value rows were not shown in the original
INSERT INTO customers (customer_name, customer_email, customer_phone) VALUES
('Jane Smith', '[email protected]', '555-555-5555'),
('Bob Johnson', '[email protected]', NULL);
What is the total revenue generated from sales transactions for each customer?
SELECT customer_id, ROUND(SUM(sale_quantity * sale_price), 2) AS
total_revenue
FROM sales
GROUP BY customer_id;
What is the customer name, email, and phone number for the customer who made
the most purchases?
SELECT
c.customer_name,
c.customer_email,
c.customer_phone,
MAX(purchase_count) AS "Max Purchases"
FROM
customers c
JOIN (
SELECT
customer_id,
COUNT(*) AS purchase_count
FROM
sales
GROUP BY
customer_id
) p ON c.customer_id = p.customer_id
GROUP BY
c.customer_id
ORDER BY
"Max Purchases" DESC
LIMIT 1;
The TIMESTAMP data type represents a specific date and time in the format
YYYY-MM-DD HH:MI:SS. For example, the date and time March 29th, 2023 at 1:30
PM would be represented as '2023-03-29 13:30:00'.
AGE Function:
The AGE function in PostgreSQL calculates the difference between two dates. It
takes two arguments: the earlier date and the later date. The function returns the
difference between the two dates as an interval.
Suppose we want to calculate a person's age in years, given their date of birth.
SELECT AGE(CURRENT_DATE, '1995-07-20');
This will return the age of the person as an interval. We can extract the number of
years from this interval by using the EXTRACT function
SELECT EXTRACT(YEAR FROM AGE(CURRENT_DATE, '1995-07-20' ));
CURRENT_DATE Function:
The CURRENT_DATE function in PostgreSQL is used to get the current date. It
returns the current date as a date data type.
SELECT CURRENT_DATE;
CURRENT_TIME Function:
The CURRENT_TIME function in PostgreSQL is used to get the current time. It
returns the current time as a time data type.
SELECT CURRENT_TIME;
CURRENT_TIMESTAMP Function:
The CURRENT_TIMESTAMP function in PostgreSQL is used to get the current date
and time. It returns the current date and time as a timestamp data type.
SELECT CURRENT_TIMESTAMP;
EXTRACT Function:
The EXTRACT function in PostgreSQL is used to extract a specific element from a
date/time. It takes two arguments: the element to extract and the date/time value.
SELECT EXTRACT(MONTH FROM TIMESTAMP '2022-05-18');
LOCALTIME Function:
The LOCALTIME function in PostgreSQL is used to get the current local time. It
returns the current local time as a time data type.
SELECT LOCALTIME;
CURRENT_TIMESTAMP function:
This function returns the current date and time, including fractional seconds.
SELECT CURRENT_TIMESTAMP;
Similarly, the CURRENT_TIME function returns the current time without the date,
and the LOCALTIMESTAMP function returns the current timestamp without time
zone information.
AGE function- This function takes two timestamps as arguments and returns the
difference between them as an interval.
SELECT AGE('2022-10-19 16:24:31', '2022-10-18 12:00:00');
The EXTRACT function is another useful function for working with dates and times.
It allows you to extract a specific field from a timestamp or interval.
SELECT EXTRACT(YEAR FROM TIMESTAMP '2022-10-19 16:24:31');
The DATE_PART function is similar to the EXTRACT function but has a slightly
different syntax.
SELECT DATE_PART('YEAR', TIMESTAMP '2022-10-19 16:24:31');
Traditional American Format: 04/06/2023 01:45:30 PM
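In PostgreSQL, a format like this can be produced with the TO_CHAR function (a sketch):

```sql
SELECT TO_CHAR(TIMESTAMP '2023-04-06 13:45:30',
               'MM/DD/YYYY HH12:MI:SS PM');
-- 04/06/2023 01:45:30 PM
```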
LEFT(string, length)
RIGHT(string, length)
Example 3: Extracting characters from the beginning of a string using LEFT
SELECT LEFT('apple', 3) AS extracted_text;
The result will be the string value "app".
Example 4: Extracting characters from the end of a string using RIGHT
SELECT RIGHT('banana', 3) AS extracted_text;
The result will be the string value "ana".
In this example, we use the REGEXP_REPLACE function to manipulate a given
input string, which is "PostgreSQL51214Database". By employing a regular
expression '[[:alpha:]]' – designed to identify alphabetical characters – and an
empty replacement string, the code effectively removes all the letters from the
input. The 'g' flag ensures this removal happens globally throughout the string. As
a result, the output of the code will be "51214", as all alphabetical characters
have been eliminated, leaving only the numeric portion intact.
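The query described above:

```sql
-- Strip all alphabetical characters, globally, leaving the digits
SELECT REGEXP_REPLACE('PostgreSQL51214Database', '[[:alpha:]]', '', 'g');
-- returns '51214'
```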
You are working as a data analyst at a video streaming company. The company
has a database that contains information about the movies available on their
platform, including the id, title and release date.
The database schema is as follows:
CREATE TABLE movies (
id INTEGER PRIMARY KEY,
title VARCHAR(255),
release_date DATE);
Write a query to find all movies that were released in the year 1972.
SELECT title, release_date
FROM movies
WHERE EXTRACT(year FROM release_date) = 1972;
Write a query to find all movies that were released in the month of April.
SELECT title, release_date
FROM movies
WHERE EXTRACT(month FROM release_date) = 4;
Write a query to find the title and release date of the oldest movie in the
database.
SELECT title, release_date
FROM movies
ORDER BY release_date ASC
LIMIT 1;
Write a query to find the title and release date of the newest movie in the
database.
SELECT title, release_date
FROM movies
ORDER BY release_date DESC
LIMIT 1;
Write a query to find the title and release date of all movies released in the 21st
century.
SELECT title, release_date
FROM movies
WHERE release_date >= '2000-01-01' AND release_date < '2100-01-01';
Write a query to find the title and release date of all movies that were released on
a Friday.
SELECT title, release_date
FROM movies
WHERE EXTRACT(dow FROM release_date) = 5;
Write a query to find the title and release date of all movies that have a title
containing the word 'King'.
SELECT title, release_date
FROM movies
WHERE title LIKE '%King%';
Write a query to find the title and release date of all movies that have a title
consisting of only 3 words.
SELECT title, release_date
FROM movies
WHERE length(regexp_replace(title, E'\\S+', '', 'g')) = 2;
Window Functions
A window function is a way to apply a calculation across a set of rows that are
related to the current row. Think of it as a "window" of rows that can move up or
down depending on your function. Common window functions include calculating
running totals or averages, finding the minimum or maximum value within a set of
rows, and ranking rows based on a specific column.
Window functions can be incredibly powerful and useful tools in SQL, allowing you
to perform complex calculations and analyses easily.
2. The window frame: RANGE is used to specify a logical
interval based on the values in the ORDER BY column(s), while ROWS is
used to specify a physical interval based on the actual row numbers.
3. The function argument: This specifies the column or expression that should be
used as the input to the function. This is optional for some functions, such
as COUNT.
ROW_NUMBER(): This function assigns a unique number to each row within a result
set. The numbering starts at 1 and increments by 1 for each subsequent row.
SELECT ROW_NUMBER() OVER (ORDER BY Sales DESC) AS Rank, ProductName, Sales
FROM SalesData;
RANK(): This function assigns a rank to each row within a result set based on the
ordering specified in the ORDER BY clause. If two or more rows have the same
values, they receive the same rank, and the next rank is skipped.
SELECT RANK() OVER (ORDER BY Sales ASC) AS Rank, ProductName, Sales
FROM SalesData;
DENSE_RANK(): This function assigns a rank to each row within a result set based
on the ordering specified in the ORDER BY clause. If two or more rows have the
same values, they receive the same rank and the next rank is not skipped.
SELECT DENSE_RANK() OVER (ORDER BY Sales ASC) AS Rank, ProductName, Sales
FROM SalesData;
NTILE(): This function divides a result set into a specified number of groups and
assigns a group number to each row. For example, if you specify NTILE(4), the
result set is divided into four groups, and each row is assigned a group number
from 1 to 4.
SELECT NTILE(4) OVER (ORDER BY Sales DESC) AS Quartile, ProductName, Sales
FROM SalesData;
PERCENT_RANK(): This function calculates the relative rank of each row within a
result set. The values range from 0 to 1, with 0 representing the lowest rank and 1
representing the highest rank.
SELECT PERCENT_RANK() OVER (ORDER BY Sales DESC) AS PercentRank,
ProductName, Sales
FROM SalesData;
Aggregate functions using windows (e.g. SUM, AVG, MAX, MIN, etc.):
In SQL, window functions are used to perform calculations on a specific subset of
rows within a result set. Aggregate functions can be used with window functions
to calculate aggregate values on that subset of rows.
SUM(): This function calculates the sum of a column or expression for each row
within a window.
SELECT SUM(Sales) OVER (PARTITION BY Region) AS RegionTotal, ProductName, Sales
FROM RegionSalesData;
AVG(): This function calculates the average of a column or expression for each
row within a window.
SELECT AVG(Sales) OVER (PARTITION BY Region) AS RegionAvg, ProductName, Sales
FROM RegionSalesData;
MAX(): This function returns the maximum value of a column or expression for
each row within a window.
SELECT MAX(Sales) OVER (PARTITION BY Region) AS RegionMax, ProductName, Sales
FROM RegionSalesData;
MIN(): This function returns the minimum value of a column or expression for each
row within a window.
SELECT MIN(Sales) OVER (PARTITION BY Region) AS RegionMin, ProductName, Sales
FROM RegionSalesData;
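These four aggregate window functions can be tried out together with Python's sqlite3 module; a minimal sketch with invented RegionSalesData contents:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE RegionSalesData (ProductName TEXT, Region TEXT, Sales INTEGER)")
conn.executemany(
    "INSERT INTO RegionSalesData VALUES (?, ?, ?)",
    [("Product A", "East", 100), ("Product B", "East", 300),
     ("Product A", "West", 200), ("Product B", "West", 200)],
)

# Each row keeps its own Sales value but also carries the aggregates
# computed over its Region partition alongside it.
rows = conn.execute("""
    SELECT ProductName, Region, Sales,
           SUM(Sales) OVER (PARTITION BY Region) AS RegionTotal,
           AVG(Sales) OVER (PARTITION BY Region) AS RegionAvg,
           MAX(Sales) OVER (PARTITION BY Region) AS RegionMax,
           MIN(Sales) OVER (PARTITION BY Region) AS RegionMin
    FROM RegionSalesData
""").fetchall()
for row in rows:
    print(row)
conn.close()
```

Unlike a GROUP BY, the window versions do not collapse the rows: all four input rows come back, each annotated with its region's total, average, maximum, and minimum.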
Partitioning data for window functions:
In SQL, partitioning data is a technique used to divide the data into logical groups
or partitions so that window functions can be applied to each partition separately.
The PARTITION BY clause is used to partition the data based on one or more
columns.
We will be using the data below:
| Product Name | Region   | Sales | Year |
|--------------|----------|-------|------|
| Product A    | Region 1 | 100   | 2022 |
| Product A    | Region 2 | 200   | 2022 |
| Product A    | Region 3 | 150   | 2023 |
| Product B    | Region 1 | 50    | 2022 |
| Product B    | Region 2 | 100   | 2022 |
| Product B    | Region 3 | 75    | 2023 |
| Product C    | Region 1 | 75    | 2022 |
| Product C    | Region 2 | 125   | 2022 |
| Product C    | Region 3 | 100   | 2023 |
When the data is partitioned by the Region column, the SUM() function
is applied to the Sales column for each partition separately, and the result is
returned in a new column called RegionTotal.
When the data is partitioned by both the Region and Year columns, the
SUM() function is applied to the Sales column for each combination of Region and
Year, and the result is returned in a new column called RegionYearTotal.
When the data is partitioned by the Region column and then ordered by
the Year column within each partition, the SUM() function is applied to the Sales
column for each partition in the specified order, producing a running total, and
the result is returned in a new column called RegionTotalByYear.
ROW_NUMBER() can also be applied per partition: as a row-based window function,
it assigns a unique number to each row based on the order of Sales values within
each partition of the Region column.
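The four partitioning variants just described can be run against the sample data with Python's sqlite3 module (SQLite 3.25+); a self-contained sketch:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE sales (ProductName TEXT, Region TEXT, Sales INTEGER, Year INTEGER)")
conn.executemany("INSERT INTO sales VALUES (?, ?, ?, ?)", [
    ("Product A", "Region 1", 100, 2022), ("Product A", "Region 2", 200, 2022),
    ("Product A", "Region 3", 150, 2023), ("Product B", "Region 1", 50, 2022),
    ("Product B", "Region 2", 100, 2022), ("Product B", "Region 3", 75, 2023),
    ("Product C", "Region 1", 75, 2022), ("Product C", "Region 2", 125, 2022),
    ("Product C", "Region 3", 100, 2023),
])

# 1) Partition by Region: each row carries its region's total sales.
region_totals = conn.execute("""
    SELECT ProductName, Region, Sales,
           SUM(Sales) OVER (PARTITION BY Region) AS RegionTotal
    FROM sales""").fetchall()

# 2) Partition by Region and Year: a total per (Region, Year) combination.
region_year_totals = conn.execute("""
    SELECT ProductName, Region, Year, Sales,
           SUM(Sales) OVER (PARTITION BY Region, Year) AS RegionYearTotal
    FROM sales""").fetchall()

# 3) Partition by Region, ordered by Year: a running total within each region.
running = conn.execute("""
    SELECT ProductName, Region, Year, Sales,
           SUM(Sales) OVER (PARTITION BY Region ORDER BY Year) AS RegionTotalByYear
    FROM sales""").fetchall()

# 4) ROW_NUMBER within each region, ordered by Sales descending.
numbered = conn.execute("""
    SELECT ProductName, Region, Sales,
           ROW_NUMBER() OVER (PARTITION BY Region ORDER BY Sales DESC) AS rn
    FROM sales""").fetchall()
conn.close()
```

With this data, Region 1 totals 225, Region 2 totals 425, and Region 3 totals 325; in Region 2, Product A (200) receives row number 1.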
CREATE TABLE sales (
date DATE,
customer_name TEXT,
product_name TEXT,
quantity INTEGER,
price NUMERIC);
Display the daily sales amount for each product using a window function.
SELECT date, product_name,
SUM(price * quantity) OVER(PARTITION BY date, product_name) AS daily_sales_amount
FROM sales;
Display the product with the highest total sales amount for each customer.
SELECT customer_name, product_name, total_sales
FROM (
SELECT customer_name, product_name,
SUM(price * quantity) AS total_sales,
RANK() OVER (PARTITION BY customer_name
ORDER BY SUM(price * quantity) DESC) AS rank
FROM sales
GROUP BY customer_name, product_name
) t
WHERE rank = 1;
We first compute the total sales amount for each customer and product using the
SUM aggregate function in a subquery. Then we use the RANK window function
with the PARTITION BY clause to rank the products for each customer based on
their total sales amount. Finally, we select only the top-ranked product for each
customer by filtering the rows where the rank is equal to 1.
Display the average quantity of each product sold per day using a window
function.
SELECT date, product_name,
ROUND(AVG(quantity) OVER(PARTITION BY product_name ORDER BY date), 2) AS avg_quantity_per_day
FROM sales
ORDER BY date, product_name;
We use the AVG aggregate function as a window function to compute the average
quantity of each product sold per day. The PARTITION BY clause is used to partition
the data by product name, and the ORDER BY clause is used to order the rows by
sale date. This means that the average is computed over all preceding rows for
each product, up to and including the current row, based on the sale date. The
output includes the product name, sale date, and average quantity of each
product sold per day. The rows are ordered by sale date and product name.
Display the products that were sold above their average quantity for each day.
WITH daily_avg AS (
SELECT date, ROUND(AVG(quantity), 2) AS avg_quantity
FROM sales
GROUP BY date)
SELECT s.product_name, s.date, s.quantity, daily_avg.avg_quantity
FROM sales s
JOIN daily_avg ON s.date = daily_avg.date
WHERE s.quantity > daily_avg.avg_quantity
ORDER BY s.date, s.product_name;
We first compute the average quantity of sales for each day using the AVG
aggregate function and the GROUP BY clause in a common table expression (CTE)
called daily_avg. Then we join the sales table with the daily_avg CTE on the date
column, and compare the quantity of each product sold on each day to the daily
average using a WHERE clause. The output includes the product name, sale date,
quantity, and daily average for each product that was sold above its average
quantity on each day. The rows are ordered by sale date and product name.
Common Table Expressions (CTEs) are a powerful tool in SQL that allows you to
define temporary result sets that can be referenced multiple times within a single
query. In this way, CTEs provide a more readable and manageable way to handle
complex queries that might otherwise be difficult to write or understand. One
common use case for CTEs is for dealing with hierarchical data, such as
organizational charts or tree structures. Recursive CTEs can be used to traverse
the hierarchy and retrieve information at different levels of the tree.
In addition to simplifying queries for hierarchical data, CTEs can also be used to
simplify complex queries in general. By breaking down a complicated query into
smaller, more manageable pieces, CTEs can help you write cleaner and more
efficient code. CTEs can also be combined with window functions and subqueries
to enhance their functionality further. By using CTEs in conjunction with these
other features, you can create even more powerful queries that can handle a wide
range of data analysis tasks.
However, it's important to note that CTEs can also have performance
implications. While they can be a powerful tool for simplifying queries, they can
also be resource-intensive and slow down your database if used incorrectly.
Then, the main query joins the "departments" and "employees" tables on their
"department_id" columns, and also joins the "dept_total_salary" CTE on its
"department_id" column. The query groups the results by the "departments.name"
and "dept_total_salary.total_salary" columns.
Finally, the query calculates the average salary of employees in each department
by dividing the total salary of the department by the number of employees in the
department using the COUNT function on the "employees.id" column. The resulting
average salary is aliased as "avg_salary".
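Putting that description together, the full query might look like the following sketch, runnable against invented departments and employees data (total_salary is multiplied by 1.0 so the division happens in floating point):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE departments (department_id INTEGER PRIMARY KEY, name TEXT);
    CREATE TABLE employees (id INTEGER PRIMARY KEY, name TEXT,
                            department_id INTEGER, salary INTEGER);
    INSERT INTO departments VALUES (1, 'Engineering'), (2, 'Sales');
    INSERT INTO employees VALUES
        (1, 'Asha', 1, 90000), (2, 'Ben', 1, 70000), (3, 'Chad', 2, 50000);
""")

# CTE: total salary per department; main query: total / head count = average.
rows = conn.execute("""
    WITH dept_total_salary AS (
        SELECT department_id, SUM(salary) AS total_salary
        FROM employees
        GROUP BY department_id)
    SELECT departments.name,
           dept_total_salary.total_salary,
           dept_total_salary.total_salary * 1.0 / COUNT(employees.id) AS avg_salary
    FROM departments
    JOIN employees ON departments.department_id = employees.department_id
    JOIN dept_total_salary
         ON dept_total_salary.department_id = departments.department_id
    GROUP BY departments.name, dept_total_salary.total_salary
""").fetchall()
conn.close()
```

For this data, Engineering's total of 160000 over two employees yields an average of 80000.0.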
Here's an example of how you can use a recursive CTE to query a tree structure:
Let's say you have a table called "categories" that has the following columns:
category_id, name, and parent_id. The parent_id column indicates the ID of the
parent category for each category. A category with a NULL parent_id is a root
category.
| category_id | name        | parent_id |
|-------------|-------------|-----------|
| 1           | Electronics | NULL      |
| 2           | Computers   | 1         |
| 3           | Laptops     | 2         |
| 4           | Desktops    | 2         |
| 5           | Mobiles     | 1         |
| 6           | Android     | 5         |
| 7           | iOS         | 5         |
In this dataset, "Electronics" is the only root category, as it has a NULL value in
the "parent_id" column. "Computers" and "Mobiles" are child categories of
"Electronics", with a "parent_id" of 1. "Laptops" and "Desktops" are child
categories of "Computers", with a "parent_id" of 2. And "Android" and "iOS" are
child categories of "Mobiles", with a "parent_id" of 5.
Here's how you can use a recursive CTE to query all categories and their
subcategories:
In this example, we define a CTE called "category_tree" that selects all root
categories (i.e., categories with a NULL parent_id) in the base case. Then, in the
recursive case, we join the "categories" table with the "category_tree" CTE on the
parent_id column to select all subcategories. We also increment the level column
by 1 for each recursive step.
Finally, we select all columns from the "category_tree" CTE to get the full tree
structure with all categories and their levels.
With recursive CTEs, you can easily query and manipulate hierarchical data
structures in SQL.
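Here is a runnable reconstruction of the recursive query described above (the level numbering, starting at 1 for root categories, is an assumption consistent with the description):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE categories (category_id INTEGER PRIMARY KEY,
                             name TEXT, parent_id INTEGER);
    INSERT INTO categories VALUES
        (1, 'Electronics', NULL), (2, 'Computers', 1), (3, 'Laptops', 2),
        (4, 'Desktops', 2), (5, 'Mobiles', 1), (6, 'Android', 5), (7, 'iOS', 5);
""")

rows = conn.execute("""
    WITH RECURSIVE category_tree AS (
        -- Base case: root categories (NULL parent_id) start at level 1.
        SELECT category_id, name, parent_id, 1 AS level
        FROM categories
        WHERE parent_id IS NULL
        UNION ALL
        -- Recursive case: attach each child one level below its parent.
        SELECT c.category_id, c.name, c.parent_id, ct.level + 1
        FROM categories c
        JOIN category_tree ct ON c.parent_id = ct.category_id)
    SELECT category_id, name, parent_id, level
    FROM category_tree
    ORDER BY level, category_id
""").fetchall()
for row in rows:
    print(row)
conn.close()
```

"Electronics" comes back at level 1, "Computers" and "Mobiles" at level 2, and the four leaf categories at level 3.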
This dataset contains sales data for different products on different dates. The
"date" column represents the date of the sale, the "product" column represents
the product sold, and the "revenue" column represents the revenue generated
from the sale.
-- Create the sales table
CREATE TABLE sales (
date DATE,
product VARCHAR(255),
revenue INTEGER);
-- Insert data into the sales table
INSERT INTO sales (date, product, revenue)
VALUES
('2022-01-01', 'Product A', 1000),
('2022-01-02', 'Product B', 2000),
('2022-01-02', 'Product C', 3000),
('2022-01-03', 'Product A', 1500),
('2022-01-03', 'Product B', 2500),
('2022-01-03', 'Product C', 3500),
('2022-01-04', 'Product A', 1200),
('2022-01-04', 'Product C', 3200),
('2022-01-05', 'Product B', 1800),
('2022-01-05', 'Product C', 2800);
Here's how you can use CTEs, window functions, and subqueries to compute each
product's daily revenue, its running (cumulative) revenue, and its percentage of
total revenue:
WITH product_revenue AS (
SELECT
product,
date,
SUM(revenue) AS revenue
FROM sales
GROUP BY product, date),
product_cumulative_revenue AS (
SELECT
product,
date,
revenue,
SUM(revenue) OVER (PARTITION BY product ORDER BY date) AS
cumulative_revenue
FROM product_revenue),
total_revenue AS (
SELECT SUM(revenue) AS total_revenue
FROM product_cumulative_revenue)
SELECT
product,
date,
revenue,
cumulative_revenue,
revenue * 100.0 / total_revenue AS percentage_of_total
FROM product_cumulative_revenue
JOIN total_revenue ON 1 = 1;
1. The "product_revenue" CTE aggregates revenue for each product and date
using the SUM aggregate function and GROUP BY.
2. The "product_cumulative_revenue" CTE builds on the previously
defined "product_revenue" CTE and applies the window function
SUM(revenue) OVER (PARTITION BY product ORDER BY date) to calculate the
running total of revenue for each product.
3. The "total_revenue" CTE calculates the sum of revenue across all products
and dates from the "product_cumulative_revenue" CTE.
4. The final SELECT statement retrieves the product, date, revenue,
cumulative revenue, and percentage of total revenue for each product and
date. It joins the "product_cumulative_revenue" CTE with the
"total_revenue" CTE on a dummy condition "1 = 1" to ensure all rows are
returned.
5. Within the final SELECT statement, the percentage of total revenue is
calculated by dividing the revenue for each row by the total revenue (from
the "total_revenue" CTE) and multiplying by 100.
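The whole pipeline can be executed with Python's sqlite3 module. In this sketch the percentage is computed as revenue * 100.0 / total_revenue, multiplying before dividing so that integer division cannot truncate the result to zero:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE sales (date DATE, product VARCHAR(255), revenue INTEGER);
    INSERT INTO sales VALUES
        ('2022-01-01', 'Product A', 1000), ('2022-01-02', 'Product B', 2000),
        ('2022-01-02', 'Product C', 3000), ('2022-01-03', 'Product A', 1500),
        ('2022-01-03', 'Product B', 2500), ('2022-01-03', 'Product C', 3500),
        ('2022-01-04', 'Product A', 1200), ('2022-01-04', 'Product C', 3200),
        ('2022-01-05', 'Product B', 1800), ('2022-01-05', 'Product C', 2800);
""")

# Three chained CTEs: per-day revenue, running total per product, grand total.
rows = conn.execute("""
    WITH product_revenue AS (
        SELECT product, date, SUM(revenue) AS revenue
        FROM sales GROUP BY product, date),
    product_cumulative_revenue AS (
        SELECT product, date, revenue,
               SUM(revenue) OVER (PARTITION BY product ORDER BY date)
                   AS cumulative_revenue
        FROM product_revenue),
    total_revenue AS (
        SELECT SUM(revenue) AS total_revenue FROM product_cumulative_revenue)
    SELECT product, date, revenue, cumulative_revenue,
           revenue * 100.0 / total_revenue AS percentage_of_total
    FROM product_cumulative_revenue
    JOIN total_revenue ON 1 = 1
    ORDER BY product, date
""").fetchall()
conn.close()
```

Product A's cumulative revenue grows 1000, 2500, 3700 across its three sale dates, and each percentage is taken against the grand total of 22500.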
You are appointed as a data manager in a DVD rental Company, and you have
access to their dvdrental.tar file, which contains information about
films, actors, languages, staff, rentals, etc.
Make a list of average rents of films with respect to their language.
Calculate the average rental duration of films with respect to their language.
The boss wants a list of cumulative payments made by each customer. The output should
contain customer id, payment date, amount, and cumulative payment.
Now return the percentage of total payments made by each customer.
Problem Statement:
You are working as a data analyst for a retail company that operates in multiple
regions. The company has a database that contains two tables: "Sales" and
"Regions". The "Sales" table stores information about the sales transactions,
including the product name, region, and sales quantity. The "Regions" table
contains the mapping of region names to region IDs.
Sales Table:
| ProductName | RegionID | SalesQuantity |
|-------------|----------|---------------|
| ProductA | 1 | 10 |
| ProductA | 2 | 15 |
| ProductA | 3 | 20 |
| ProductB | 1 | 12 |
| ProductB | 2 | 18 |
| ProductB | 3 | 8 |
| ProductC | 1 | 5 |
| ProductC | 2 | 10 |
| ProductC | 3 | 15 |
Regions Table:
| RegionID | RegionName |
|----------|------------|
| 1 | Region1 |
| 2 | Region2 |
| 3 | Region3 |
Products Table:
| ProductID | ProductName |
|-----------|-------------|
| 1 | ProductA |
| 2 | ProductB |
| 3 | ProductC |
Your task is to write a SQL query that utilizes Common Table Expressions
(CTEs) and pivoting techniques to generate a consolidated sales report for a
specific time period. The report should display the total sales quantity for
each product, with columns representing different regions and rows
representing different products.
WITH SalesData AS (
SELECT s.ProductName, r.RegionName, s.SalesQuantity
FROM Sales s
JOIN Regions r ON s.RegionID = r.RegionID)
SELECT *
FROM SalesData
PIVOT (
SUM(SalesQuantity)
FOR RegionName IN ('Region1', 'Region2', 'Region3')
) AS PivotTable;
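Note that the PIVOT clause used above is SQL Server / Oracle syntax; PostgreSQL and SQLite do not support it. The same report can be produced portably with conditional aggregation, sketched here with Python's sqlite3 module:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE Sales (ProductName TEXT, RegionID INTEGER, SalesQuantity INTEGER);
    CREATE TABLE Regions (RegionID INTEGER PRIMARY KEY, RegionName TEXT);
    INSERT INTO Regions VALUES (1, 'Region1'), (2, 'Region2'), (3, 'Region3');
    INSERT INTO Sales VALUES
        ('ProductA', 1, 10), ('ProductA', 2, 15), ('ProductA', 3, 20),
        ('ProductB', 1, 12), ('ProductB', 2, 18), ('ProductB', 3, 8),
        ('ProductC', 1, 5),  ('ProductC', 2, 10), ('ProductC', 3, 15);
""")

# One SUM(CASE ...) column per region plays the role of the PIVOT clause.
rows = conn.execute("""
    WITH SalesData AS (
        SELECT s.ProductName, r.RegionName, s.SalesQuantity
        FROM Sales s
        JOIN Regions r ON s.RegionID = r.RegionID)
    SELECT ProductName,
           SUM(CASE WHEN RegionName = 'Region1' THEN SalesQuantity ELSE 0 END) AS Region1,
           SUM(CASE WHEN RegionName = 'Region2' THEN SalesQuantity ELSE 0 END) AS Region2,
           SUM(CASE WHEN RegionName = 'Region3' THEN SalesQuantity ELSE 0 END) AS Region3
    FROM SalesData
    GROUP BY ProductName
    ORDER BY ProductName
""").fetchall()
conn.close()
```

The drawback of this approach is that the region columns must be listed explicitly, exactly as they must be in the FOR ... IN list of a PIVOT clause.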
CTEs provide a way to define temporary result sets that can be referenced
multiple times within a single query, making complex queries more readable
and maintainable by breaking them down into smaller, more manageable pieces.
Recursive CTEs are particularly useful for hierarchical data structures,
allowing easier traversal of complex tree structures, though they should be
optimized carefully for performance.
CTEs can be combined with window functions and subqueries to enhance their
functionality and handle an even wider range of data analysis tasks.
It's important to understand the potential performance implications and
trade-offs of CTEs and to use them judiciously, as they can be
resource-intensive and slow down your database if used incorrectly.
Overall, learning how to use CTEs effectively can help developers write
cleaner, more efficient SQL code that is easier to read and maintain over
time.
Introduction to SQLite in Python:
Imagine you're working on a tight schedule and need a fast, reliable, and easy-to-use data
storage solution. MySQL or PostgreSQL come to mind, but have you heard of SQLite? This
embedded database library, written in C, is a game-changer. Unlike other database technologies,
SQLite is integrated into your program, making it a serverless data storage solution. And the best
part? All data is stored in a single file, conventionally with a .db extension. But that's not all. SQLite supports
concurrent access, allowing multiple processes or threads to read the same database, which makes it ideal for mobile
operating systems such as Android and iOS. So why juggle large CSV files when you can consolidate all
your data into a single SQLite database? And let's not forget: the SQLite project reports that reading small
blobs from an SQLite database can be around 35% faster than reading them from individual files.
import sqlite3
connection = sqlite3.connect("aquarium.db")
This easy-to-use module allows Python programs to establish a connection to an SQLite database with
a single line of code. And if your database doesn't yet exist, don't worry—sqlite3.connect() can create
it on the fly. To check whether our connection object was successfully constructed, run the following
command.
print(connection.total_changes)
The total number of database rows that the connection has altered is represented
by connection.total_changes. A value of 0 is correct here, because no SQL operations have been run yet.
You can remove the aquarium.db file from your computer if you decide at any point that you'd like to
restart the tutorial.
*Note*: By supplying the specific string ":memory:" to sqlite3.connect, it is also possible to connect to
an SQLite database that only exists in memory (and not in a file).
*Example*: sqlite3.connect (":memory:"). When your Python program ends, a ":memory:" SQLite
database will vanish. This could be useful if you want to use SQLite as a temporary sandbox for testing
purposes and don't need to keep any data around after your program terminates.
| Storage Class | Python Datatype |
|---------------|-----------------|
| NULL          | None            |
| INTEGER       | int             |
| REAL          | float           |
| TEXT          | str             |
| BLOB          | bytes           |
In Python, the type() function can be used to determine a value's class. The program below
uses type() to print the class of each value we store in a database.
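The original listing does not appear in the text; a minimal stand-in stores one value of each storage class and prints the Python class it comes back as:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Columns deliberately left untyped; SQLite stores each value
# with the storage class of the value itself.
conn.execute("CREATE TABLE samples (n, i, r, t, b)")
conn.execute("INSERT INTO samples VALUES (?, ?, ?, ?, ?)",
             (None, 42, 3.14, "hello", b"\x00\x01"))

# Read the row back and inspect the Python class of each stored value.
row = conn.execute("SELECT n, i, r, t, b FROM samples").fetchone()
classes = [type(value) for value in row]
for value, cls in zip(row, classes):
    print(repr(value), "->", cls)
conn.close()
```

The round trip maps NULL to None, INTEGER to int, REAL to float, TEXT to str, and BLOB to bytes, matching the table above.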
Cursor Object:
It is an object used to establish a connection to run SQL queries. It serves as middleware between the
SQL query and the SQLite database connection. After connecting to an SQLite database, it is
generated.
Syntax: cursor_object = connection_object.execute("sql query")
Example: Writing records into the hotel table using Python code to establish a hotel data
database.
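The hotel example referenced above is not reproduced in the text; here is a hedged reconstruction (the hotel table's columns room_no, guest, and rate are assumptions):

```python
import sqlite3

# An in-memory stand-in for the hotel database; column names are invented.
connection = sqlite3.connect(":memory:")
cursor = connection.execute(
    "CREATE TABLE hotel (room_no INTEGER PRIMARY KEY, guest TEXT, rate REAL)")
cursor = connection.execute(
    "INSERT INTO hotel VALUES (101, 'Anita', 2500.0), (102, 'Ravi', 1800.0)")
connection.commit()

# execute() returns a cursor object, which mediates between the SQL
# query and the connection; iterate it (or call fetchall) to read rows.
cursor = connection.execute("SELECT room_no, guest, rate FROM hotel ORDER BY room_no")
records = cursor.fetchall()
for record in records:
    print(record)
connection.close()
```

Connection.execute() is a shortcut that creates a cursor and runs the query on it; connection.cursor() followed by cursor.execute() is equivalent.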
Let’s explore using the sqlite3 module within a Python application to create tables, insert data and
access tables in an SQLite database.
*Create Table*
Syntax:
CREATE TABLE database_name.table_name(
column1 datatype PRIMARY KEY,
column2 datatype,
column3 datatype,
…..
columnN datatype);
A PRIMARY KEY can also be declared over a combination of columns with a separate PRIMARY KEY(column1, column2, ...) clause.
Insert Data
Now, let’s talk about utilizing the sqlite3 module with Python to insert data into a table in a SQLite
database. To add a new row to a table, use the SQL INSERT INTO statement. The INSERT
INTO statement can be used in one of two ways to insert rows:
*Only values*
In the first approach, the column names are omitted and only the value of the data to be entered is
specified.
Syntax:
INSERT INTO table_name VALUES (value1, value2, value3,…);
table_name: name of the table.
value1, value2,.. : value of first column, second column,… for the new record
*Column names and values both*
In the second approach, we will specify the columns we wish to fill as well as the values that go with
each one, as shown below:
INSERT INTO table_name (column1, column2, column3,..) VALUES ( value1, value2, value3,..);
table_name: name of the table.
column1: name of first column, second column …
value1, value2, value3 : value of first column, second column,… for the new record
*Select Data from Table*
This statement returns the data from the table and is used to obtain data from an SQLite table.
The syntax for a select statement in SQLite is:
SELECT * FROM table_name;
# * : means all the column from the table
# *To select specific column replace * with the column name or column names.*
SELECT column1, column2, columnN FROM table_name;
**Example:** Let us create a table called Student and insert values into it and then use the select
statement to retrieve data .
# Import module
import sqlite3
# Connecting to sqlite
conn = sqlite3.connect('almab.db')
# Creating a cursor object using the
# cursor() method
cursor = conn.cursor()
# Creating table
table ="""CREATE TABLE STUDENTS(NAME VARCHAR(255), CLASS VARCHAR(255),
SECTION VARCHAR(255));"""
cursor.execute(table)
# Queries to INSERT records.
cursor.execute(
'''INSERT INTO STUDENTS (CLASS, SECTION, NAME) VALUES ('7th', 'A',
'Raju')''')
cursor.execute(
'''INSERT INTO STUDENTS (SECTION, NAME, CLASS) VALUES ('B', 'Shyam',
'8th')''')
cursor.execute(
'''INSERT INTO STUDENTS (NAME, CLASS, SECTION ) VALUES ('Baburao', '9th',
'C')''')
cursor.execute(
'''INSERT INTO STUDENTS (NAME, CLASS, SECTION ) VALUES ('Ajith', '9th',
'C')''')
cursor.execute(
'''INSERT INTO STUDENTS (NAME, CLASS, SECTION ) VALUES ('Karnan', '8th',
'B')''')
cursor.execute(
'''INSERT INTO STUDENTS (NAME, CLASS, SECTION ) VALUES ('Abhishek',
'10th', 'A')''')
cursor.execute(
'''INSERT INTO STUDENTS (NAME, CLASS, SECTION ) VALUES ('Kishore', '9th',
'C')''')
# Display data inserted
print("Data Inserted in the table: ")
# * is used to refer to all columns
data=cursor.execute('''SELECT * FROM STUDENTS''')
for row in data:
print(row)
# Commit your changes in
# the database
conn.commit()
# Closing the connection
conn.close()
Data Inserted in the table:
('Raju', '7th', 'A')
('Shyam', '8th', 'B')
('Baburao', '9th', 'C')
('Ajith', '9th', 'C')
('Karnan', '8th', 'B')
('Abhishek', '10th', 'A')
('Kishore', '9th', 'C')
**Retrieving data using python**
Any database READ operation entails retrieving some valuable data from the database. Using the fetch
methods offered by the sqlite3 module, you can retrieve data from SQLite.
Three methods are available through the sqlite3.Cursor class: fetchall(), fetchmany(), and fetchone().
The fetchall() method retrieves every remaining row in a query's result set and returns them as a list of
tuples. (If called after some rows have already been fetched, it returns only the remaining ones.)
The next row in a query's result is fetched and returned as a tuple using the fetchone() method.
Similar to fetchone(), fetchmany(size) retrieves the next group of rows from a query's result set rather
than just one row.
connection = sqlite3.connect('almab.db')
cursor = connection.cursor()
# WHERE CLAUSE TO RETRIEVE DATA
cursor.execute("SELECT * FROM STUDENTS WHERE Class = '7th'")
# printing the cursor data
print(cursor.fetchall())
connection.commit()
connection.close()
Let's imagine, for instance, that Kishore (the last row in the table) wasn't supposed to be included and
that we, therefore, need to remove him from the group.
df_new = df_students[:-1]
df_new
PandasSQL:
PandasSQL (the pandasql package) is a Python library that allows users to execute SQL queries on Pandas data
frames. It provides a SQL-like interface for manipulating data frames, which is convenient when you already
know SQL. In this section, we will explore the benefits and drawbacks of PandasSQL, how to set up an
environment to use the library, and how to perform basic and advanced SQL queries on realistic data.
Benefits of PandasSQL: 👍🐼
1. Familiar Interface: If you are already familiar with SQL, using PandasSQL should be easy for you as
it provides a similar interface to perform operations on data frames.
2. Expressiveness: Some operations, such as multi-table joins and grouped filters, can be easier to
express in a single SQL query than in a chain of Pandas method calls. (Note that PandasSQL copies
the data frame into an in-memory SQLite database, so it does not help with datasets larger than
memory.)
3. Data Manipulation: PandasSQL provides a variety of functions to manipulate data frames such as
sorting, filtering, and grouping.
Drawbacks of PandasSQL:👎🤔
1. Slower than Native Pandas Operations: While PandasSQL is convenient, it is generally slower than
native Pandas operations, because each query copies the data frame into SQLite and back.
2. SQL Syntax Limitations: PandasSQL runs queries through SQLite, so it does not support every SQL
dialect feature, which means some complex queries may not be possible.
To start using PandasSQL, we need to install it using pip. Open your terminal and enter the following command:
pip install pandasql
pip install SQLAlchemy==1.4.46
from pandasql import sqldf
import pandas as pd
df = pd.DataFrame({
'Name': ['John', 'Jane', 'Bob', 'Mary'],
'Age': [25, 30, 35, 40],
'City': ['New York', 'London', 'Paris', 'Sydney']})
Now that we have our data frame, we can use PandasSQL to execute SQL queries on it. Here's an
example of a simple query to select all the data from our table:
query = "SELECT * FROM df"
result = sqldf(query)
print(result)
Problem Statement: You are working for a coffee shop and have been given a dataset containing
information about the sales transactions. The dataset contains information such as the date and time
of each transaction, the items purchased, each item's price, and the transaction's total amount. Your
task is to use SQL queries to analyze the data and provide insights that can help improve the sales and
operations of the coffee shop.
Dataset: Here is a sample dataset containing information about the sales transactions:
| transaction_id | date       | time  | item          | price | quantity | total_amount |
|----------------|------------|-------|---------------|-------|----------|--------------|
| 1              | 2022-01-01 | 9:00  | Coffee        | 2.50  | 1        | 2.50         |
| 2              | 2022-01-01 | 10:30 | Croissant     | 1.75  | 2        | 3.50         |
| 3              | 2022-01-01 | 11:15 | Cappuccino    | 3.00  | 1        | 3.00         |
| 4              | 2022-01-02 | 8:45  | Latte         | 3.50  | 2        | 7.00         |
| 5              | 2022-01-02 | 10:00 | Muffin        | 2.25  | 1        | 2.25         |
| 6              | 2022-01-02 | 12:15 | Espresso      | 2.75  | 1        | 2.75         |
| 7              | 2022-01-03 | 9:30  | Croissant     | 1.75  | 3        | 5.25         |
| 8              | 2022-01-03 | 11:00 | Iced Coffee   | 3.50  | 2        | 7.00         |
| 9              | 2022-01-03 | 12:45 | Hot Chocolate | 3.25  | 1        | 3.25         |
| 10             | 2022-01-04 | 10:15 | Cappuccino    | 3.00  | 2        | 6.00         |
# Importing Libraries
import pandas as pd
import sqlite3
from pandasql import sqldf
pysqldf = lambda q: sqldf(q, globals())
# Connection to DataBase
conn = sqlite3.connect('coffee_shop.db')
c = conn.cursor()
# Note: don't close the connection yet; it is still needed by the
# queries below. Call conn.close() after the last query has run.
Create a new table named "transactions" in the database with the columns: "transaction_id", "date",
"time", "item", "price", "quantity", and "total_amount".
Insert the data from the sample dataset into the "transactions" table.
(9, '2022-01-03', '12:45', 'Hot Chocolate', 3.25, 1, 3.25),
(10, '2022-01-04', '10:15', 'Cappuccino', 3.00, 2, 6.00);
''')
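Pieced together, the table-creation and insert step might look like the following self-contained sketch (using an in-memory database in place of coffee_shop.db):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
c = conn.cursor()
c.execute("""CREATE TABLE transactions (
    transaction_id INTEGER PRIMARY KEY,
    date TEXT, time TEXT, item TEXT,
    price REAL, quantity INTEGER, total_amount REAL)""")
c.executescript("""
    INSERT INTO transactions VALUES
        (1,  '2022-01-01', '9:00',  'Coffee',        2.50, 1, 2.50),
        (2,  '2022-01-01', '10:30', 'Croissant',     1.75, 2, 3.50),
        (3,  '2022-01-01', '11:15', 'Cappuccino',    3.00, 1, 3.00),
        (4,  '2022-01-02', '8:45',  'Latte',         3.50, 2, 7.00),
        (5,  '2022-01-02', '10:00', 'Muffin',        2.25, 1, 2.25),
        (6,  '2022-01-02', '12:15', 'Espresso',      2.75, 1, 2.75),
        (7,  '2022-01-03', '9:30',  'Croissant',     1.75, 3, 5.25),
        (8,  '2022-01-03', '11:00', 'Iced Coffee',   3.50, 2, 7.00),
        (9,  '2022-01-03', '12:45', 'Hot Chocolate', 3.25, 1, 3.25),
        (10, '2022-01-04', '10:15', 'Cappuccino',    3.00, 2, 6.00);
""")
conn.commit()

# A quick sanity check: total sales per day.
daily = conn.execute(
    "SELECT date, SUM(total_amount) FROM transactions GROUP BY date ORDER BY date"
).fetchall()
conn.close()
```

The per-day totals (9.00, 12.00, 15.50, 6.00) match the query output shown later in this section.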
Write a query to find the total amount of sales for each day in the dataset.
# Query 4: Find the total amount of sales for each day in the dataset
pd.read_sql_query('''
SELECT date, SUM(total_amount) AS total_sales
FROM transactions
GROUP BY date;''',conn)
        date  total_sales
0 2022-01-01         9.00
1 2022-01-02        12.00
2 2022-01-03        15.50
3 2022-01-04         6.00
Think of the schema as the blueprint for your database. It outlines the structure
of the tables, the columns within them, and how they relate to each other. It's like
a map that tells you where everything is located and how to get there.
Now, let's talk about database schema design. It's like creating a puzzle that's
not only beautiful but also efficient and effective. To do this, you need to carefully
consider how the tables and their columns are organized, what type of data is
stored in each column, and how the tables relate to each other.
ERD:
ERD stands for Entity-Relationship Diagram. It visually represents the entities and
the relationships between them in a database schema. ERD diagrams are used in
the design phase of database development to identify and represent the entities
and their attributes, as well as the relationships between them.
In an ERD diagram, entities are represented by rectangles, and the relationships
between them are represented by lines connecting the rectangles. The attributes
of each entity are listed inside the rectangle. There are three types of
relationships between entities: one-to-one, one-to-many, and many-to-many.
These relationships are represented by different types of lines connecting the
entities in the ERD diagram.
Now let's create an ERD for the e-commerce website database schema
STEP 1: Create a new Database in PostgreSQL called e-commerce.
STEP 3: Now right-click on the e-commerce database and select ‘ERD For Database’
You can edit a table by selecting it and clicking the pencil option.
Add one-to-many or many-to-many relations by clicking the 1M and MM options.
Customize by adding colors using the fill color and text color options.
STEP 5: Let's add relationships to our tables. Select the Orders table and click the
one-to-many option. A dialogue box appears as shown below; we are connecting the Orders and
Customers tables, so, as the foreign key is customers_id, we have selected it.
STEP 6: Click on Save. You can now see that the Orders table and Customers table are
connected by a 1M line.
STEP 7: Similarly, create relations between others using the foreign keys. You should see
output below.
To understand the concept of normalization and denormalization more intuitively,
consider the example of a table that contains information about customers and their
orders:
CustomerID | CustomerName | OrderID | OrderDate | ProductID | ProductName | Quantity
-----------------------------------------------------------------------------------
1 | John Smith | 1 | 01/01/2023 | 101 | Widget A | 2
1 | John Smith | 2 | 02/01/2023 | 102 | Widget B | 1
2 | Jane Doe | 3 | 03/01/2023 | 101 | Widget A | 3
2 | Jane Doe | 4 | 04/01/2023 | 103 | Widget C | 2
This table contains redundant data, as the customer information is duplicated
for each order they make. To normalize the table, we could split it into two
separate tables: one for customers and one for orders. The tables would be
related through a foreign key in the orders table that references the customer
ID in the customer table. The resulting tables might look something like this:
Customers Table
---------------
CustomerID | CustomerName
-------------------------
1 | John Smith
2 | Jane Doe
Orders Table
------------
OrderID | OrderDate | CustomerID | ProductID | Quantity
--------------------------------------------------------
1 | 01/01/2023 | 1 | 101 | 2
2 | 02/01/2023 | 1 | 102 | 1
3 | 03/01/2023 | 2 | 101 | 3
4 | 04/01/2023 | 2 | 103 | 2
By splitting the table into two separate tables, we have eliminated redundant
customer information, resulting in a more efficient and manageable structure.
Now, we want to retrieve all orders made by a particular customer, including
the name of the product ordered. With the normalized structure, we would
need to perform a join between the two tables to retrieve the necessary data:
SELECT Customers.CustomerName, Orders.OrderID, Orders.OrderDate,
Orders.ProductID, Orders.Quantity, Products.ProductName
FROM Customers
INNER JOIN Orders ON Customers.CustomerID = Orders.CustomerID
INNER JOIN Products ON Orders.ProductID = Products.ProductID
WHERE Customers.CustomerID = 1
This query would retrieve all orders made by the customer with ID 1, including
the name of the product ordered. However, the join operation can be
resource-intensive and slow down query performance, especially when
dealing with large tables.
To optimize query performance, we can use denormalization by adding the
ProductName column to the Orders table. This way, we can retrieve all
necessary data in a single table without the need for joins:
Orders Table
------------
OrderID | OrderDate | CustomerID | ProductID | ProductName | Quantity
----------------------------------------------------------------------
1 | 01/01/2023 | 1 | 101 | Widget A | 2
2 | 02/01/2023 | 1 | 102 | Widget B | 1
3 | 03/01/2023 | 2 | 101 | Widget A | 3
4 | 04/01/2023 | 2 | 103 | Widget C | 2
In SQL, different types of constraints can be used to specify rules for the data
that can be entered into a table:
NOT NULL Constraint: This constraint ensures that a column does not accept
null values, which means that every row must have a value for that column.
UNIQUE Constraint: This constraint ensures that the values in a column are
unique across all the rows in the table. A table can have multiple unique
constraints on different columns.
PRIMARY KEY Constraint: This constraint combines the NOT NULL and UNIQUE
constraints. It ensures that the values in a column or a combination of
columns uniquely identify each row in the table.
FOREIGN KEY Constraint: This constraint ensures that the values in a column or
a combination of columns in one table match the values in a column or a
combination of columns in another table. This is used to enforce referential
integrity between related tables.
CHECK Constraint: This constraint is used to specify a condition that must be met for a column's
value to be inserted or updated in a table. The condition can be any Boolean
expression that evaluates to true or false.
DEFAULT Constraint: This constraint is used to specify a default value for a column when no
value is explicitly provided during an insert operation.
These constraints are used to maintain data integrity and prevent data
inconsistencies in a database. The database management system enforces
them and ensures that only valid data is stored in the database.
To add a constraint to a table, you can use the ALTER TABLE statement,
followed by the name of the table and the constraint to be added. For
example:
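In PostgreSQL this might look like ALTER TABLE employees ADD CONSTRAINT age_check CHECK (age >= 18); (the table and constraint names here are invented). SQLite's ALTER TABLE cannot add constraints after the fact, so the runnable sketch below declares the same kinds of constraints at CREATE TABLE time and shows them being enforced:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")  # FK enforcement is off by default in SQLite
conn.executescript("""
    CREATE TABLE departments (dept_id INTEGER PRIMARY KEY, name TEXT NOT NULL UNIQUE);
    CREATE TABLE employees (
        emp_id  INTEGER PRIMARY KEY,
        name    TEXT NOT NULL,
        age     INTEGER CHECK (age >= 18),
        status  TEXT DEFAULT 'active',
        dept_id INTEGER REFERENCES departments(dept_id));
    INSERT INTO departments (dept_id, name) VALUES (1, 'HR');
    INSERT INTO employees (emp_id, name, age, dept_id) VALUES (1, 'Mira', 30, 1);
""")

# The DEFAULT constraint filled in the status column automatically.
status = conn.execute("SELECT status FROM employees WHERE emp_id = 1").fetchone()[0]

# Violating the CHECK constraint raises sqlite3.IntegrityError.
try:
    conn.execute("INSERT INTO employees (emp_id, name, age) VALUES (2, 'Tom', 12)")
    check_enforced = False
except sqlite3.IntegrityError:
    check_enforced = True
conn.close()
```

The same pattern (attempting an invalid insert and catching IntegrityError) works for NOT NULL, UNIQUE, and FOREIGN KEY violations as well.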
Use Indexes: 🔍 Indexes can speed up queries by allowing the database to locate
the data being searched quickly. Use indexed columns in your WHERE and
JOIN clauses when designing your queries.
Avoid SELECT *: 🚫 Avoid using SELECT * in your queries, as this can cause
unnecessary overhead by returning all columns, even those that aren't
needed. Instead, explicitly list the columns you need in your SELECT
statement.
Use JOINs: 🔗 JOINs are a powerful tool for combining data from multiple tables.
When using JOINs, be sure to join on indexed columns and use INNER JOIN
instead of OUTER JOIN where possible.
Use Subqueries: 🔍 Subqueries can be used to break down complex queries into
smaller, more manageable pieces. When using subqueries, try to use EXISTS
or IN instead of NOT EXISTS or NOT IN, as the latter can be less efficient.
Limit Results: 🔢 Limiting the number of results returned by your queries can help
improve performance. Use LIMIT or TOP to limit the number of results
returned, and use WHERE clauses to filter out unnecessary data.
By following these tips, you can design efficient database queries that
optimize performance and improve the overall user experience.
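Several of the tips above can be seen together in one query: an explicit column list instead of SELECT *, an INNER JOIN on an indexed column, a WHERE filter, and a LIMIT on the result size. A small sketch using SQLite through Python's `sqlite3` module (the table, column, and index names are illustrative):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE Customers (CustomerID INTEGER PRIMARY KEY, FirstName TEXT);
CREATE TABLE Orders (OrderID INTEGER PRIMARY KEY, CustomerID INTEGER, TotalAmount REAL);
CREATE INDEX idx_orders_customer ON Orders(CustomerID);  -- index the JOIN column
INSERT INTO Customers VALUES (1, 'John'), (2, 'Jane');
INSERT INTO Orders VALUES (1, 1, 50.0), (2, 2, 70.0), (3, 1, 30.0);
""")

# Explicit columns instead of SELECT *, INNER JOIN on the indexed column,
# a WHERE clause to filter early, and LIMIT to cap the result size.
rows = conn.execute("""
    SELECT c.FirstName, o.TotalAmount
    FROM Customers AS c
    INNER JOIN Orders AS o ON o.CustomerID = c.CustomerID
    WHERE o.TotalAmount > 40
    ORDER BY o.TotalAmount DESC
    LIMIT 5
""").fetchall()
print(rows)  # → [('Jane', 70.0), ('John', 50.0)]
```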
You are working as a data manager in an online retail store, and you are
provided with the following dataset that contains four tables:
Customers:
CustomerID FirstName LastName Email Phone
1 John Doe [email protected] 555-1234
2 Jane Smith [email protected] 555-5678
3 Bob Johnson [email protected] 555-9012
Products
ProductID ProductName Description Category Price
1 Product A This is product A Category 1 10.00
2 Product B This is product B Category 1 20.00
3 Product C This is product C Category 2 30.00
4 Product D This is product D Category 2 40.00
5 Product E This is product E Category 3 50.00
Orders
OrderID CustomerID OrderDate ShipDate TotalAmount
1 1 2022-01-01 2022-01-05 50.00
2 2 2022-01-02 2022-01-06 70.00
3 3 2022-01-03 2022-01-07 90.00
OrderDetails
OrderID ProductID Quantity Price
1 1 2 10.00
1 3 1 30.00
2 2 3 20.00
2 4 2 40.00
3 1 1 10.00
3 2 2 20.00
3 3 1 30.00
3 4 1 40.00
3 5 1 50.00
Load the tables into a SQL database, and describe what you understand of the
tables.
CREATE TABLE Customers (
CustomerID INT PRIMARY KEY,
FirstName VARCHAR(50),
LastName VARCHAR(50),
Email VARCHAR(50),
Phone VARCHAR(20)
);
INSERT INTO Orders (OrderID, CustomerID, OrderDate, ShipDate,
TotalAmount) VALUES
(1, 1, '2022-01-01', '2022-01-05', 50.00),
(2, 2, '2022-01-02', '2022-01-06', 70.00),
(3, 3, '2022-01-03', '2022-01-07', 90.00);
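Once the tables are loaded, a useful sanity check is to recompute each order's total from its line items and compare it with the stored TotalAmount. A sketch using SQLite through Python's `sqlite3` module (column types are illustrative):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE Orders (OrderID INTEGER PRIMARY KEY, CustomerID INTEGER,
                     OrderDate TEXT, ShipDate TEXT, TotalAmount REAL);
CREATE TABLE OrderDetails (OrderID INTEGER, ProductID INTEGER,
                           Quantity INTEGER, Price REAL);
INSERT INTO Orders VALUES
    (1, 1, '2022-01-01', '2022-01-05', 50.00),
    (2, 2, '2022-01-02', '2022-01-06', 70.00),
    (3, 3, '2022-01-03', '2022-01-07', 90.00);
INSERT INTO OrderDetails VALUES
    (1, 1, 2, 10.00), (1, 3, 1, 30.00),
    (2, 2, 3, 20.00), (2, 4, 2, 40.00),
    (3, 1, 1, 10.00), (3, 2, 2, 20.00), (3, 3, 1, 30.00),
    (3, 4, 1, 40.00), (3, 5, 1, 50.00);
""")

# Sum Quantity * Price per order and put it next to the stored TotalAmount.
rows = conn.execute("""
    SELECT o.OrderID, o.TotalAmount, SUM(d.Quantity * d.Price) AS Computed
    FROM Orders AS o
    INNER JOIN OrderDetails AS d ON d.OrderID = o.OrderID
    GROUP BY o.OrderID
    ORDER BY o.OrderID
""").fetchall()
for order_id, stored, computed in rows:
    print(order_id, stored, computed)
```

Note that in this sample data the recomputed totals for orders 2 and 3 (140.00 and 170.00) differ from the stored amounts, which is exactly the kind of inconsistency such a cross-check surfaces.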
It is important for you to know that your data is secure. So which SQL server
do you think offers the best security features?
The level of security offered by SQL servers can depend on many factors,
including the version of the server, the security features that are
implemented, and the specific configuration of the server. With that said, here
are a few SQL servers that are often considered to offer high levels of
security:
1. Microsoft SQL Server: Microsoft SQL Server is a popular relational
database management system that provides several security features,
including advanced encryption options, fine-grained access controls,
and auditing and compliance tools. Additionally, Microsoft regularly
releases security updates and patches to address known vulnerabilities
and exploits.
2. Oracle Database: Oracle Database is another widely used database
management system that provides a range of security features,
including secure data transmission, access controls, and database
auditing. Oracle also offers several security tools and features that help
administrators to identify and respond to security threats.
3. PostgreSQL: PostgreSQL is an open-source relational database
management system that provides advanced security features,
including SSL/TLS encryption, strong authentication mechanisms, and
flexible access controls. PostgreSQL also has a strong security
community that is dedicated to identifying and addressing security
issues.
4. IBM Db2: IBM Db2 is a database management system that provides
several security features, including advanced encryption options,
secure user authentication, and fine-grained access controls. Db2 also
provides auditing and compliance features to help administrators to
monitor and manage security risks.
Ultimately, the choice of SQL server will depend on various factors, including
the organization's specific security requirements, the administrators'
expertise, and the available resources. However, all of the SQL servers
mentioned above can provide strong security features and protections for
sensitive data.