SQL for Data Science
Database - Types
DBMS vs RDBMS
ER Diagram
Relational Database Schema
SQL - ACID Properties, Commands
Data Types
Constraints
Errors in SQL
Keys
Normalisation - 1NF, 2NF, 3NF, BCNF
Operators
Clauses
Alias
Case Statement
Data Blogs Follow Tajamul Khan
Data is valid and consistent. It is used to restrict invalid data from entering the table.
It can be achieved by:
Data Types
Constraints
Emp ID   Name
101      Zac
Tom      23
Data Container
Non-Relational
1960
The data is stored in the form of tables. These tables are connected to each other with the help of
primary and foreign keys, which greatly reduces data duplication.
For structured data, this model is generally more effective and efficient than the earlier database models.
independent
dependent
We can't add a NOT NULL constraint to a column that already contains NULL values, but we can add it to an empty column.
1 = True
0 = False
Precision is the total number of digits in a number. Scale is the number of digits to the right of the
decimal point in a number. For example, the number 123.45 has a precision of 5 and a scale of 2.
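The digit counting can be checked with a minimal Python sketch. The helper name precision_and_scale is hypothetical, and the sketch assumes plain decimal literals (no exponent notation).

```python
from decimal import Decimal

def precision_and_scale(text):
    """Return (precision, scale) for a plain decimal literal given as a string."""
    sign, digits, exponent = Decimal(text).as_tuple()
    scale = max(-exponent, 0)             # digits to the right of the decimal point
    precision = max(len(digits), scale)   # total number of digits
    return precision, scale

print(precision_and_scale("123.45"))  # (5, 2)
```

A column declared as DECIMAL(5, 2) would therefore be just large enough to hold 123.45.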
An identity column is a column whose value increases automatically. It can be
used to uniquely identify the rows in the table.
For example, with a seed value of 1 and an increment of 1, the ‘ID’ column starts from 1
and is incremented by 1 at each row.
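A minimal sketch of an auto-incrementing ID, run through Python's built-in sqlite3 module. SQLite is an assumption here: it spells the feature INTEGER PRIMARY KEY AUTOINCREMENT, whereas SQL Server uses IDENTITY(1,1). The employees table and its rows are hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE employees (id INTEGER PRIMARY KEY AUTOINCREMENT, name TEXT)")

# No id is supplied; the database assigns 1, 2, 3 in insertion order.
conn.executemany("INSERT INTO employees (name) VALUES (?)", [("Zac",), ("Tom",), ("Ann",)])

ids = [row[0] for row in conn.execute("SELECT id FROM employees ORDER BY id")]
print(ids)  # [1, 2, 3]
```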
Constraints are rules used to limit the type of data entering the columns. They ensure
the accuracy and reliability of the data.
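As a rough illustration, the sketch below declares three constraints (PRIMARY KEY, NOT NULL, CHECK) and shows the database rejecting rows that break them. It assumes SQLite via Python's sqlite3 module; the table, columns and the age rule are hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE employees (
        emp_id INTEGER PRIMARY KEY,
        name   TEXT NOT NULL,
        age    INTEGER CHECK (age >= 18)
    )
""")
conn.execute("INSERT INTO employees VALUES (101, 'Zac', 23)")  # valid row

rejected = []
# Missing name, age below the CHECK limit, and a duplicate primary key.
for row in [(102, None, 30), (103, "Tom", 12), (101, "Ann", 40)]:
    try:
        conn.execute("INSERT INTO employees VALUES (?, ?, ?)", row)
    except sqlite3.IntegrityError:
        rejected.append(row[0])

row_count = conn.execute("SELECT COUNT(*) FROM employees").fetchone()[0]
print(rejected, row_count)  # [102, 103, 101] 1
```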
A syntax error occurs when SQL statements do not follow the correct syntax and structure of the language.
"Key" refers to a column or set of columns in a table that uniquely identify each
row within that table
Keys are used:
To uniquely identify a row
To enforce data integrity and constraints
To establish relationships between multiple tables in the database
Types of keys:
Primary Key
Foreign Key
Unique Key
Composite Key
Alternate Key
Candidate Key
Super Key
A foreign key establishes a link between two tables, by referencing the primary key in another.
It ensures referential integrity by enforcing a relationship between the tables.
CREATE TABLE orders ( order_id INT PRIMARY KEY, customer_id INT,
FOREIGN KEY (customer_id) REFERENCES customers(customer_id) );
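A unique key ensures that all values in a column or a group of columns are unique.
Unlike primary keys, unique keys can contain NULL values. Two NULLs are not considered equal, so most
databases accept several NULLs in a unique column (SQL Server permits only one).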
CREATE TABLE students ( student_id INT UNIQUE, ... );
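A minimal sketch of both behaviours, assuming SQLite through Python's sqlite3 module (SQLite only enforces foreign keys once PRAGMA foreign_keys is switched on). The email column and the sample rows are hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")  # SQLite enforces foreign keys only when enabled
conn.execute("CREATE TABLE customers (customer_id INTEGER PRIMARY KEY, email TEXT UNIQUE)")
conn.execute("""
    CREATE TABLE orders (
        order_id    INTEGER PRIMARY KEY,
        customer_id INTEGER,
        FOREIGN KEY (customer_id) REFERENCES customers(customer_id)
    )
""")
conn.execute("INSERT INTO customers VALUES (1, 'zac@example.com')")
conn.execute("INSERT INTO customers VALUES (2, NULL)")
conn.execute("INSERT INTO customers VALUES (3, NULL)")  # a second NULL is accepted by SQLite
conn.execute("INSERT INTO orders VALUES (10, 1)")       # customer 1 exists

def is_rejected(sql):
    """Run a statement and report whether a constraint blocked it."""
    try:
        conn.execute(sql)
        return False
    except sqlite3.IntegrityError:
        return True

orphan_rejected = is_rejected("INSERT INTO orders VALUES (11, 99)")  # no customer 99
duplicate_rejected = is_rejected("INSERT INTO customers VALUES (4, 'zac@example.com')")
print(orphan_rejected, duplicate_rejected)  # True True
```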
A composite key consists of multiple columns that together uniquely identify a record in a
table.
It's useful when a single column cannot uniquely identify records, but a combination of columns
can.
CREATE TABLE orders ( order_id INT, product_id INT,
PRIMARY KEY (order_id, product_id) );
An alternate key is a candidate key that is not selected as the primary key.
It could serve as a unique identifier if the primary key didn't exist.
Candidate key: minimal
Super key: not minimal
A super key is a set of columns that uniquely identifies each row in a table. It may contain more
columns than necessary for uniqueness.
{EmployeeID, Email, SSN} also uniquely identifies each employee. This set is a super key because it
goes beyond the minimal requirement for uniqueness.
In summary, while both candidate keys and super keys ensure uniqueness in a table, candidate keys
are minimal sets of columns fulfilling this requirement, while super keys can include more columns
than necessary.
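The difference can be sketched in Python by testing uniqueness and minimality on sample rows. The rows, the Dept column and both helper names are hypothetical, and a real key is a property of the schema, so a check on sample data can only suggest candidates.

```python
from itertools import combinations

# Hypothetical sample rows; EmployeeID, Email and SSN follow the example above.
employees = [
    {"EmployeeID": 1, "Email": "zac@example.com", "SSN": "111", "Dept": "Sales"},
    {"EmployeeID": 2, "Email": "tom@example.com", "SSN": "222", "Dept": "Sales"},
    {"EmployeeID": 3, "Email": "ann@example.com", "SSN": "333", "Dept": "HR"},
]

def is_super_key(rows, columns):
    """True if no two rows share the same values for the given columns."""
    seen = {tuple(row[c] for c in columns) for row in rows}
    return len(seen) == len(rows)

def is_candidate_key(rows, columns):
    """True if columns form a super key and no proper subset does (minimality)."""
    if not is_super_key(rows, columns):
        return False
    return not any(
        is_super_key(rows, subset)
        for size in range(1, len(columns))
        for subset in combinations(columns, size)
    )

print(is_super_key(employees, ["EmployeeID", "Email", "SSN"]))      # True
print(is_candidate_key(employees, ["EmployeeID", "Email", "SSN"]))  # False, not minimal
print(is_candidate_key(employees, ["EmployeeID"]))                  # True
```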
1NF
2NF
3NF
BCNF (every determinant is a super key)
4NF
DKNF
Atomic Value
In 1NF, all the rows in a column must have atomic values with consistent data types.
No Partial Dependency
A B C D
AB: Composite Key (PRIME ATTRIBUTES)
CD: NON PRIME ATTRIBUTES
Here D is dependent on the candidate key ✅
But C is dependent on only B, which is a subset of the candidate key, and that is known
as partial dependency.
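A partial dependency can be spotted on sample data by testing which functional dependencies hold. This is a minimal sketch: the helper holds and the rows are hypothetical, chosen so that AB is the key, D needs the whole key, and C depends on B alone.

```python
def holds(rows, determinant, dependent):
    """True if the functional dependency determinant -> dependent holds in rows."""
    mapping = {}
    for row in rows:
        key = tuple(row[c] for c in determinant)
        value = tuple(row[c] for c in dependent)
        # The same determinant value must always map to the same dependent value.
        if mapping.setdefault(key, value) != value:
            return False
    return True

# Hypothetical rows for a table with columns A, B, C, D and candidate key AB.
rows = [
    {"A": 1, "B": "x", "C": "red",  "D": 10},
    {"A": 2, "B": "x", "C": "red",  "D": 20},
    {"A": 1, "B": "y", "C": "blue", "D": 30},
    {"A": 2, "B": "y", "C": "blue", "D": 40},
]

key_determines_all = holds(rows, ["A", "B"], ["C", "D"])
partial_dependency = holds(rows, ["B"], ["C"])  # C depends on part of the key
d_needs_full_key = not holds(rows, ["A"], ["D"]) and not holds(rows, ["B"], ["D"])
print(key_determines_all, partial_dependency, d_needs_full_key)  # True True True
```

Moving B and C into their own table would remove the partial dependency and bring the design to 2NF.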
No Transitive Dependency
A B C D
AB: Composite Key (PRIME ATTRIBUTES)
CD: NON PRIME ATTRIBUTES
Here C is dependent on the candidate key ✅
But D is dependent on C, which is also a non-prime attribute, and that is known as
transitive dependency.
A B C
C → B (alpha → beta)
C: NON PRIME ATTRIBUTE
B: PRIME ATTRIBUTE
Operators are the SQL reserved words and characters used with a WHERE clause in a SQL query.
A CASE statement is like an if-elif-else statement in programming, allowing different results to be
returned based on different conditions.
CASE
WHEN condition1 THEN result1
WHEN condition2 THEN result2
...
WHEN conditionN THEN resultN
ELSE default_result
END
Often more compact than chained IF-ELSE logic ✅
Needs care with NULL: a NULL never equals a WHEN value, so NULL rows fall through to ELSE
SELECT customer_id,
CASE amount
WHEN 500 THEN 'Prime Customer'
WHEN 100 THEN 'Plus Customer'
ELSE 'Regular Customer'
END AS CustomerStatus
FROM payment
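A minimal runnable sketch of the query above, assuming SQLite through Python's sqlite3 module. The sample payments are hypothetical, and the searched form of CASE is used so that a NULL amount can be handled explicitly with IS NULL.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE payment (customer_id INTEGER, amount INTEGER)")
conn.executemany("INSERT INTO payment VALUES (?, ?)",
                 [(1, 500), (2, 100), (3, 250), (4, None)])

statuses = dict(conn.execute("""
    SELECT customer_id,
           CASE
               WHEN amount IS NULL THEN 'Unknown'          -- NULL must be tested with IS NULL
               WHEN amount = 500   THEN 'Prime Customer'
               WHEN amount = 100   THEN 'Plus Customer'
               ELSE 'Regular Customer'
           END AS CustomerStatus
    FROM payment
""").fetchall())
print(statuses)
```

Without the IS NULL branch, customer 4 would be labelled 'Regular Customer', because a NULL amount matches none of the WHEN values.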
Joins
Set Operations
Group By and Having clause
Order of Execution
Functions - Aggregate, Datetime, String
Window Functions
Sub-Query
CTE table
In-built functions
Views
Indexes
Stored Procedures
Triggers
Temporary Tables
A join is an operation that combines rows from two or more tables based on a related column between them.
SELECT *
FROM customer AS c
INNER JOIN payment AS p
ON c.customer_id = p.customer_id
Returns all rows from the left table and the matched rows from the right table. If there is no match,
NULL values are returned for the missing values.
SELECT *
FROM customer AS c
LEFT JOIN payment AS p
ON c.customer_id = p.customer_id
Returns all rows from the right table and the matched rows from the left table. If there is no match,
NULL values are returned for the missing values.
SELECT *
FROM customer AS c
RIGHT JOIN payment AS p
ON c.customer_id = p.customer_id
Returns all records when there is a match in either left or right table. If there is no match, NULL
values are returned for the missing values in the corresponding columns.
SELECT *
FROM customer AS c
FULL OUTER JOIN payment AS p
ON c.customer_id = p.customer_id
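To see the difference in the rows returned, here is a minimal sketch comparing INNER JOIN and LEFT JOIN on hypothetical data, assuming SQLite through Python's sqlite3 module. RIGHT and FULL OUTER JOIN are left out because older SQLite builds do not support them.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE customer (customer_id INTEGER PRIMARY KEY, name TEXT)")
conn.execute("CREATE TABLE payment (payment_id INTEGER PRIMARY KEY, customer_id INTEGER, amount INTEGER)")
conn.executemany("INSERT INTO customer VALUES (?, ?)", [(1, "Zac"), (2, "Tom"), (3, "Ann")])
conn.executemany("INSERT INTO payment VALUES (?, ?, ?)", [(10, 1, 500), (11, 1, 100), (12, 2, 250)])

query = """
    SELECT c.name, p.amount
    FROM customer AS c
    {join} payment AS p ON c.customer_id = p.customer_id
    ORDER BY c.customer_id, p.payment_id
"""
inner_rows = conn.execute(query.format(join="INNER JOIN")).fetchall()
left_rows = conn.execute(query.format(join="LEFT JOIN")).fetchall()

print(inner_rows)  # Ann has no payment, so she is absent
print(left_rows)   # Ann appears with amount None (NULL)
```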
A cross join returns the Cartesian product of the two tables involved: every row of the first table
is paired with every row of the second table.
SELECT *
FROM customers
CROSS JOIN orders;
SELECT
member.Id,
member.FullName,
member.teamleadId,
teamlead.FullName AS teamleadName
FROM members member
JOIN members teamlead
ON member.teamleadId = teamlead.Id
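The self join above can be tried on a small hypothetical members table. This sketch assumes SQLite through Python's sqlite3 module; the names are made up.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE members (Id INTEGER PRIMARY KEY, FullName TEXT, teamleadId INTEGER)")
conn.executemany("INSERT INTO members VALUES (?, ?, ?)",
                 [(1, "Ann Lee", None), (2, "Zac Roy", 1), (3, "Tom Ray", 1)])

# The same table is used twice under two aliases: member and teamlead.
pairs = conn.execute("""
    SELECT member.FullName, teamlead.FullName AS teamleadName
    FROM members member
    JOIN members teamlead ON member.teamleadId = teamlead.Id
    ORDER BY member.Id
""").fetchall()
print(pairs)  # [('Zac Roy', 'Ann Lee'), ('Tom Ray', 'Ann Lee')]
```

Ann Lee has no team lead, so the inner join drops her row; a LEFT JOIN would keep it with a NULL teamleadName.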
Set operations are used to combine or manipulate the result sets of multiple SELECT queries.
Set operations compare entire result sets, while joins compare specific columns based on
relationships between tables.
UNION ALL combines all the rows, including duplicates, from the result sets of two or more SELECT queries.
UNION combines all the rows, except duplicates, from the result sets of two or more SELECT queries.
INTERSECT returns the common rows that exist in the result sets of two or more SELECT queries.
It only returns distinct rows that appear in all result sets.
SELECT CustomerName FROM Customers
INTERSECT
SELECT SupplierName FROM Suppliers;
EXCEPT returns the distinct rows that are present in the result set of the first SELECT query but not
in the result set of the second SELECT query.
SELECT CustomerName FROM Customers
EXCEPT
SELECT SupplierName FROM Suppliers;
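The four set operations can be compared side by side on the same two tables. This is a minimal sketch assuming SQLite through Python's sqlite3 module; the names in the tables are hypothetical, and 'Bolt' is deliberately duplicated.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE Customers (CustomerName TEXT)")
conn.execute("CREATE TABLE Suppliers (SupplierName TEXT)")
conn.executemany("INSERT INTO Customers VALUES (?)", [("Acme",), ("Bolt",), ("Bolt",)])
conn.executemany("INSERT INTO Suppliers VALUES (?)", [("Bolt",), ("Core",)])

def run(operator):
    """Combine the two SELECTs with the given set operator and return sorted names."""
    sql = (f"SELECT CustomerName FROM Customers {operator} "
           f"SELECT SupplierName FROM Suppliers ORDER BY 1")
    return [row[0] for row in conn.execute(sql)]

results = {op: run(op) for op in ("UNION ALL", "UNION", "INTERSECT", "EXCEPT")}
for op, names in results.items():
    print(op, names)
```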
(JOIN)
Aggregate functions give one output row per aggregation; with window functions, rows maintain their
separate identities.
Note: In a window function the OVER clause is mandatory, whereas PARTITION BY and ORDER BY are optional.
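A minimal sketch of the contrast, assuming SQLite 3.25 or newer (the first release with window functions) through Python's sqlite3 module. The sales table and its rows are hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE sales (region TEXT, amount INTEGER)")
conn.executemany("INSERT INTO sales VALUES (?, ?)",
                 [("East", 50), ("East", 50), ("West", 50), ("West", 50)])

# Aggregate with GROUP BY: the four rows collapse to one row per region.
aggregated = conn.execute(
    "SELECT region, SUM(amount) FROM sales GROUP BY region ORDER BY region"
).fetchall()

# Window function: every row is kept and carries its region's total.
windowed = conn.execute(
    "SELECT region, amount, SUM(amount) OVER (PARTITION BY region) "
    "FROM sales ORDER BY region"
).fetchall()

print(aggregated)  # [('East', 100), ('West', 100)]
print(windowed)    # four rows, each with its region total of 100
```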
Partitioning: The first step in using a window function is to partition the data into groups based on
certain criteria. This partitioning is done using the optional PARTITION BY clause; without it, the
whole result set is treated as a single partition. Rows within each partition are treated as a
separate group for the window function.
Ordering: Once the data is partitioned, it's often helpful to order the rows within each partition.
Ordering is done using the ORDER BY clause, which is optional in general but needed for order-dependent
functions such as RANK, LEAD and LAG to give meaningful results. This determines the order in which
the window function will process the rows within each partition.
Window Frame: The window frame defines the subset of rows within each partition that the window
function will operate on. It is specified with a ROWS or RANGE clause and is evaluated relative to the
current row of the ordered partition. You can specify whether the window frame includes all rows in
the partition, a fixed number of rows preceding or following the current row, or rows between a
specified range.
Applying the Function: Once the window frame is defined, the window function is applied to the
rows within the frame. The function calculates a result for each row based on the values within its
window frame.
Result: Finally, the result of the window function is returned for each row in the query result set.
The result is typically displayed as an additional column alongside the original data.
Skip
UNBOUNDED PRECEDING: combined with CURRENT ROW, this refers to all rows from the beginning of the
partition up to and including the current row.
UNBOUNDED FOLLOWING: combined with CURRENT ROW, this refers to all rows from the current row up to
the end of the partition.
Both are used with RANGE or ROWS.
SELECT
employee_id,
salary,
SUM(salary) OVER (ORDER BY employee_id ROWS BETWEEN UNBOUNDED PRECEDING
AND CURRENT ROW) AS running_total
FROM
employees;
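Both frame directions can be compared on a tiny hypothetical employees table. This sketch assumes SQLite 3.25 or newer through Python's sqlite3 module; the remaining_total column is an addition for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE employees (employee_id INTEGER PRIMARY KEY, salary INTEGER)")
conn.executemany("INSERT INTO employees VALUES (?, ?)", [(1, 100), (2, 200), (3, 300)])

rows = conn.execute("""
    SELECT employee_id,
           SUM(salary) OVER (ORDER BY employee_id
                             ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW) AS running_total,
           SUM(salary) OVER (ORDER BY employee_id
                             ROWS BETWEEN CURRENT ROW AND UNBOUNDED FOLLOWING) AS remaining_total
    FROM employees
    ORDER BY employee_id
""").fetchall()
print(rows)  # [(1, 100, 600), (2, 300, 500), (3, 600, 300)]
```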
SELECT new_id,
ROW_NUMBER() OVER(ORDER BY new_id) AS "ROW_NUMBER",
RANK() OVER(ORDER BY new_id) AS "RANK",
DENSE_RANK() OVER(ORDER BY new_id) AS "DENSE_RANK",
PERCENT_RANK() OVER(ORDER BY new_id) AS "PERCENT_RANK"
FROM test_data
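The ranking functions only differ when there are ties, so this minimal sketch uses a hypothetical test_data table with a repeated value. It assumes SQLite 3.25 or newer through Python's sqlite3 module and leaves out PERCENT_RANK for brevity.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE test_data (new_id INTEGER)")
conn.executemany("INSERT INTO test_data VALUES (?)", [(100,), (200,), (200,), (300,)])

ranked = conn.execute("""
    SELECT new_id,
           ROW_NUMBER() OVER (ORDER BY new_id) AS row_num,
           RANK()       OVER (ORDER BY new_id) AS rnk,
           DENSE_RANK() OVER (ORDER BY new_id) AS dense_rnk
    FROM test_data
    ORDER BY new_id, row_num
""").fetchall()
for row in ranked:
    print(row)
```

After the tie on 200, RANK skips to 4 while DENSE_RANK continues with 3, and ROW_NUMBER never repeats.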
SELECT new_id,
FIRST_VALUE(new_id) OVER( ORDER BY new_id) AS "FIRST_VALUE",
LAST_VALUE(new_id) OVER( ORDER BY new_id) AS "LAST_VALUE",
LEAD(new_id) OVER( ORDER BY new_id) AS "LEAD",
LAG(new_id) OVER( ORDER BY new_id) AS "LAG"
FROM test_data
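A minimal sketch of the value functions on hypothetical data, assuming SQLite 3.25 or newer through Python's sqlite3 module. It also shows a common surprise: with only ORDER BY, the default frame ends at the current row, so LAST_VALUE returns the current row's value unless the frame is widened explicitly.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE test_data (new_id INTEGER)")
conn.executemany("INSERT INTO test_data VALUES (?)", [(10,), (20,), (30,)])

values = conn.execute("""
    SELECT new_id,
           FIRST_VALUE(new_id) OVER (ORDER BY new_id) AS first_val,
           LAST_VALUE(new_id)  OVER (ORDER BY new_id) AS last_val_default,
           LAST_VALUE(new_id)  OVER (ORDER BY new_id
               ROWS BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING) AS last_val_full,
           LEAD(new_id) OVER (ORDER BY new_id) AS next_id,
           LAG(new_id)  OVER (ORDER BY new_id) AS prev_id
    FROM test_data
    ORDER BY new_id
""").fetchall()
for row in values:
    print(row)
```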
Convert Formats
A subquery, also known as a nested query, is a query nested within another query. It allows us to
retrieve data based on the results of another query.
Subquery syntax involves two SELECT statements.
It can be added after keywords like WHERE or ON, with comparison operators (>, <, =).
SELECT *
FROM table_name
WHERE column_name <=
(SELECT column_name FROM table_name WHERE ...);
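As a concrete case, the sketch below uses a subquery to find employees paid above the average. It assumes SQLite through Python's sqlite3 module; the employees table and salaries are hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE employees (name TEXT, salary INTEGER)")
conn.executemany("INSERT INTO employees VALUES (?, ?)",
                 [("Zac", 100), ("Tom", 200), ("Ann", 300)])

# The inner SELECT runs first and yields the average (200); the outer query filters on it.
above_average = [row[0] for row in conn.execute("""
    SELECT name
    FROM employees
    WHERE salary > (SELECT AVG(salary) FROM employees)
    ORDER BY name
""")]
print(above_average)  # ['Ann']
```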
A CTE (Common Table Expression), introduced with the WITH keyword, is a named temporary result set
that can be referenced multiple times within a single query.
Unlike a temporary table, a CTE exists only for the duration of the query; whether it is more
efficient than a temporary table depends on the database engine and the query.
It allows us to break down complex queries into smaller, more manageable parts.
WITH Sales_CTE AS (
SELECT
customer_id,
SUM(amount) AS total_sales
FROM
sales
GROUP BY
customer_id
)
SELECT customer_id, total_sales FROM Sales_CTE WHERE total_sales > 1000;
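The CTE above can be run against a few hypothetical sales rows. This sketch assumes SQLite through Python's sqlite3 module; an ORDER BY is added only to make the output order predictable.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE sales (customer_id INTEGER, amount INTEGER)")
conn.executemany("INSERT INTO sales VALUES (?, ?)",
                 [(1, 600), (1, 700), (2, 300), (3, 1500)])

top_customers = conn.execute("""
    WITH Sales_CTE AS (
        SELECT customer_id, SUM(amount) AS total_sales
        FROM sales
        GROUP BY customer_id
    )
    SELECT customer_id, total_sales
    FROM Sales_CTE
    WHERE total_sales > 1000
    ORDER BY customer_id
""").fetchall()
print(top_customers)  # [(1, 1300), (3, 1500)]
```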
A view is like a virtual table that is based on the result set of a SELECT query.
A standard view does not store any data: a table stores data, whereas a view only stores
the query that defines it.
Because a view holds no data of its own, changes made through an updatable view are applied to the
underlying base table.
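A minimal sketch of a view reading straight from its base table, assuming SQLite through Python's sqlite3 module. The payment table and the big_payments view are hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE payment (customer_id INTEGER, amount INTEGER)")
conn.executemany("INSERT INTO payment VALUES (?, ?)", [(1, 500), (2, 100)])
conn.execute("""
    CREATE VIEW big_payments AS
    SELECT customer_id, amount FROM payment WHERE amount >= 500
""")

before = conn.execute("SELECT * FROM big_payments ORDER BY customer_id").fetchall()

# Only the base table is changed; the view re-runs its query on every read.
conn.execute("INSERT INTO payment VALUES (3, 900)")
after = conn.execute("SELECT * FROM big_payments ORDER BY customer_id").fetchall()

print(before)  # [(1, 500)]
print(after)   # [(1, 500), (3, 900)]
```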
Once executed, it can be re-executed multiple times without any error.
Lookup
Indexing in SQL organizes data using a data structure, typically a B-tree, to swiftly locate
information in tables, enhancing performance for read-intensive operations.
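The effect of an index on a lookup can be observed in the query plan. This is a minimal sketch assuming SQLite through Python's sqlite3 module; the customers table and the index name idx_customers_name are hypothetical, and the exact plan wording varies between SQLite versions.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE customers (customer_id INTEGER PRIMARY KEY, name TEXT, city TEXT)")
conn.executemany("INSERT INTO customers VALUES (?, ?, ?)",
                 [(1, "Zac", "Pune"), (2, "Tom", "Delhi"), (3, "Ann", "Goa")])

def plan(sql):
    """Return the human-readable detail column of EXPLAIN QUERY PLAN."""
    return " ".join(row[3] for row in conn.execute("EXPLAIN QUERY PLAN " + sql))

lookup = "SELECT * FROM customers WHERE name = 'Zac'"
before = plan(lookup)   # full table scan
conn.execute("CREATE INDEX idx_customers_name ON customers (name)")
after = plan(lookup)    # search using the new index

print(before)
print(after)
```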
A stored procedure is a precompiled collection of SQL statements stored in the database and
executed as a single unit.
Improved Performance: Precompiled and stored on the server, they reduce parsing time and enhance
execution speed.
Enhanced Security: Users execute procedures without direct table access, reducing SQL injection risks.
Code Reusability: Encapsulate business logic for reuse across applications.
Reduced Network Traffic: Execute with minimal data transfer, sending only procedure names and
parameters.
Temporary tables, also known as temp tables, are on-the-fly tables used to do complex calculations
without storing data permanently in the database. In SQL Server they are stored in tempdb.
CREATE TABLE: Creates a new table.
CREATE TABLE table_name (id INT PRIMARY KEY, name VARCHAR(50));
ALTER TABLE: Modifies an existing table.
ALTER TABLE table_name ADD column2 INT;
DROP TABLE: Deletes a table.
DROP TABLE table_name;
CREATE INDEX: Creates an index on a table.
CREATE INDEX idx_name ON table_name (column1);
DROP INDEX: Removes an index.
DROP INDEX idx_name ON table_name;
CREATE VIEW: Creates virtual table based on query.
CREATE VIEW view_name AS SELECT column1, column2 FROM table_name;
DROP VIEW: Deletes a view.
DROP VIEW view_name;
RENAME TABLE: Renames an existing table.
RENAME TABLE old_table_name TO new_table_name;

SELECT: Retrieves specific columns from a table.
SELECT column1, column2 FROM table_name;
DISTINCT: Removes duplicate rows from the result.
SELECT DISTINCT column1 FROM table_name;
WHERE: Filters rows based on a condition.
SELECT * FROM table_name WHERE column1 = 'v1';
ORDER BY: Sorts result set by one or more columns.
SELECT * FROM table_name ORDER BY column1 ASC;
LIMIT / FETCH: Limits the number of rows returned.
SELECT * FROM table_name LIMIT 10;
LIKE: Searches for patterns in text columns.
SELECT * FROM table_name WHERE col1 LIKE 'A%';
IN: Filters rows with specific values.
SELECT * FROM table_name WHERE col1 IN ('v1', 'v2');
BETWEEN: Filters rows within a range of values.
SELECT * FROM table_name WHERE c1 BETWEEN 1 AND 20;
@Tajamulkhann
COUNT(): Returns the number of rows.
SELECT COUNT(*) FROM table_name;
SUM(): Calculates the sum of a numeric column.
SELECT SUM(column1) FROM table_name;
AVG(): Calculates the average of a numeric column.
SELECT AVG(column1) FROM table_name;
MIN(): Returns the smallest value in a column.
SELECT MIN(column1) FROM table_name;
MAX(): Returns the largest value in a column.
SELECT MAX(column1) FROM table_name;
GROUP BY: Groups rows for aggregation.
SELECT col1, COUNT(*) FROM t1 GROUP BY col1;
HAVING: Filters grouped rows based on a condition.
SELECT column1, COUNT(*) FROM t1 GROUP BY column1 HAVING COUNT(*) > 5;
DISTINCT COUNT(): Counts unique values in column.
SELECT COUNT(DISTINCT col1) FROM table_name;

INSERT INTO: Adds new rows to a table.
INSERT INTO table_name (column1, column2) VALUES ('value1', 'value2');
UPDATE: Updates existing rows in a table.
UPDATE table_name SET col1 = 'value' WHERE id = 1;
DELETE: Removes rows from a table.
DELETE FROM table_name WHERE column1 = 'value';
MERGE: Combines INSERT, UPDATE, and DELETE based on a condition.
MERGE INTO table_name USING source_table ON condition WHEN MATCHED THEN UPDATE SET
column1 = value WHEN NOT MATCHED THEN INSERT (columns) VALUES (values);
TRUNCATE: Removes all rows from a table without logging individual row deletions.
TRUNCATE TABLE table_name;
REPLACE: Deletes existing rows and inserts new rows (MySQL-specific).
REPLACE INTO table_name VALUES (value1, value2);
Commit Transaction: Finalizes changes when all operations succeed.
START TRANSACTION;
UPDATE accounts SET balance = 1000 WHERE id = 1;
UPDATE accounts SET balance = ... WHERE id = 2;
COMMIT;
Rollback Transaction: Undoes changes if an error occurs or the transaction is not committed.
START TRANSACTION;
UPDATE accounts SET balance = 1000 WHERE id = 1;
ROLLBACK;
Using Savepoints: Set a rollback point within a transaction, allowing partial rollback without
affecting the whole transaction.
START TRANSACTION;
UPDATE accounts SET balance = 1000 WHERE id = 1;
SAVEPOINT sp1;
UPDATE accounts SET balance = 2000 WHERE id = 3;
-- Simulate failure
ROLLBACK TO SAVEPOINT sp1;
UPDATE accounts SET balance = 1000 WHERE id = 2;
COMMIT;

UNION: Combines results from two queries, removing duplicates.
SELECT column1 FROM table1 UNION SELECT column1 FROM table2;
UNION ALL: Combines results from two queries, including duplicates.
SELECT column1 FROM table1 UNION ALL SELECT column1 FROM table2;
INTERSECT: Returns common rows from both queries.
SELECT column1 FROM table1 INTERSECT SELECT column1 FROM table2;
EXCEPT (or MINUS): Returns rows from the first query that are not in the second query.
SELECT column1 FROM table1 EXCEPT SELECT column1 FROM table2;
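The savepoint flow can be replayed end to end in a small sketch. It assumes SQLite through Python's sqlite3 module, where BEGIN is the spelling of START TRANSACTION; the starting balances of 0 are hypothetical.

```python
import sqlite3

# isolation_level=None puts the connection in autocommit mode,
# so the transaction boundaries below are exactly the ones written out.
conn = sqlite3.connect(":memory:", isolation_level=None)
conn.execute("CREATE TABLE accounts (id INTEGER PRIMARY KEY, balance INTEGER)")
conn.executemany("INSERT INTO accounts VALUES (?, ?)", [(1, 0), (2, 0), (3, 0)])

conn.execute("BEGIN")
conn.execute("UPDATE accounts SET balance = 1000 WHERE id = 1")
conn.execute("SAVEPOINT sp1")
conn.execute("UPDATE accounts SET balance = 2000 WHERE id = 3")
conn.execute("ROLLBACK TO SAVEPOINT sp1")  # undoes only the update to id = 3
conn.execute("UPDATE accounts SET balance = 1000 WHERE id = 2")
conn.execute("COMMIT")

balances = dict(conn.execute("SELECT id, balance FROM accounts"))
print(balances)  # {1: 1000, 2: 1000, 3: 0}
```

The update made before the savepoint and the one made after the partial rollback both survive the COMMIT; only the work between SAVEPOINT and ROLLBACK TO is discarded.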
LEFT JOIN: Returns all rows from the left table and matching rows from the right table.
SELECT * FROM table1 LEFT JOIN table2 ON table1.id = table2.id;
RIGHT JOIN: Returns all rows from the right table and matching rows from the left table.
SELECT * FROM table1 RIGHT JOIN table2 ON table1.id = table2.id;
FULL OUTER JOIN: Returns rows when there is a match in either table.
SELECT * FROM table1 FULL OUTER JOIN table2 ON table1.id = table2.id;
CROSS JOIN: Cartesian product of both tables.
SELECT * FROM table1 CROSS JOIN table2;
SELF JOIN: Joins a table with itself.
SELECT a.column1, b.column1 FROM table_name a, table_name b WHERE a.id = b.parent_id;

SUBSTRING(): Extracts a substring from a string.
SELECT SUBSTRING(column1, 1, 5) FROM table_name;
LENGTH(): Returns the length of a string.
SELECT LENGTH(column1) FROM table_name;
ROUND(): Rounds a number to a specified number of decimal places.
SELECT ROUND(column1, 2) FROM table_name;
NOW(): Returns the current timestamp.
SELECT NOW();
DATE_ADD(): Adds a time interval to a date.
SELECT DATE_ADD(NOW(), INTERVAL 7 DAY);
COALESCE(): Returns the first non-null value.
SELECT COALESCE(column1, column2) FROM table_name;
IFNULL(): Replaces NULL values with desired value.
SELECT IFNULL(col1, 'default') FROM table_name;
Tajamul Khan