Module_6
Joins, including:
INNER JOIN
LEFT JOIN
RIGHT JOIN
FULL JOIN
EQUI JOIN
NON-EQUI JOIN
SELF JOIN
Feature engineering
Building master tables
Performing analysis across related tables
🔍 1. INNER JOIN (Most Common)
✅ Definition:
Returns only the matching rows from both tables based on the join
condition.
✅ Syntax:
✅ Example:
✅ Visualization:
📊 Relevance:
Used to combine related data — e.g., merging user info with transaction
history.
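A minimal sketch of the user/transaction case described above, run through Python's built-in sqlite3 module. The table names, column names, and sample rows are assumptions made up for illustration.

```python
import sqlite3

# Hypothetical tables: users and their transactions.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE users (user_id INTEGER PRIMARY KEY, name TEXT);
    CREATE TABLE transactions (txn_id INTEGER PRIMARY KEY, user_id INTEGER, amount REAL);
    INSERT INTO users VALUES (1, 'Asha'), (2, 'Ben'), (3, 'Chen');
    INSERT INTO transactions VALUES (10, 1, 25.0), (11, 1, 40.0), (12, 2, 15.5), (13, 9, 99.0);
""")

# INNER JOIN keeps only rows whose user_id appears in both tables.
inner_rows = conn.execute("""
    SELECT u.name, t.amount
    FROM users AS u
    INNER JOIN transactions AS t ON u.user_id = t.user_id
    ORDER BY t.txn_id
""").fetchall()
print(inner_rows)
conn.close()
```

Chen (no transactions) and transaction 13 (no matching user) both drop out of the result.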
2. LEFT JOIN
✅ Definition:
Returns all rows from the left table, plus the matched rows from the right
table. Right-table columns are NULL where no match is found.
✅ Syntax:
✅ Example:
✅ Visualization:
📊 Relevance:
Useful to find:
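One typical use is finding left-table rows that have no match on the right. The sketch below assumes hypothetical customers and orders tables; the names and data are illustrative only.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE customers (customer_id INTEGER PRIMARY KEY, name TEXT);
    CREATE TABLE orders (order_id INTEGER PRIMARY KEY, customer_id INTEGER);
    INSERT INTO customers VALUES (1, 'Asha'), (2, 'Ben'), (3, 'Chen');
    INSERT INTO orders VALUES (100, 1), (101, 1), (102, 3);
""")

# Every customer is kept; order_id is NULL (None in Python) when there is no order.
left_rows = conn.execute("""
    SELECT c.name, o.order_id
    FROM customers AS c
    LEFT JOIN orders AS o ON c.customer_id = o.customer_id
    ORDER BY c.customer_id, o.order_id
""").fetchall()

# Filtering on the NULL side isolates customers who never ordered.
no_orders = conn.execute("""
    SELECT c.name
    FROM customers AS c
    LEFT JOIN orders AS o ON c.customer_id = o.customer_id
    WHERE o.order_id IS NULL
""").fetchall()
print(left_rows)
print(no_orders)
conn.close()
```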
3. RIGHT JOIN
✅ Definition:
Returns all rows from the right table, plus the matched rows from the left
table. Left-table columns are NULL where no match is found.
✅ Syntax:
✅ Example:
✅ Visualization:
📊 Relevance:
Use when the right table contains the base data, such as ensuring all
departments are included even if no employees exist.
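The department case can be sketched as follows. Older SQLite builds do not accept RIGHT JOIN, so this example runs the equivalent LEFT JOIN with the tables swapped and shows the RIGHT JOIN form in a comment. The schema and rows are assumptions for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE departments (dept_id INTEGER PRIMARY KEY, dept_name TEXT);
    CREATE TABLE employees (emp_id INTEGER PRIMARY KEY, name TEXT, dept_id INTEGER);
    INSERT INTO departments VALUES (1, 'Sales'), (2, 'Research'), (3, 'Legal');
    INSERT INTO employees VALUES (1, 'Asha', 1), (2, 'Ben', 1), (3, 'Chen', 2);
""")

# Intended query on engines that support it:
#   SELECT d.dept_name, e.name
#   FROM employees AS e
#   RIGHT JOIN departments AS d ON e.dept_id = d.dept_id;
# Portable equivalent: swap the tables and use LEFT JOIN.
dept_rows = conn.execute("""
    SELECT d.dept_name, e.name
    FROM departments AS d
    LEFT JOIN employees AS e ON e.dept_id = d.dept_id
    ORDER BY d.dept_id, e.emp_id
""").fetchall()
print(dept_rows)
conn.close()
```

Legal appears with a NULL employee because it has no staff yet.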
4. FULL JOIN
✅ Definition:
Returns all rows from both tables, with NULL in the columns of whichever
side has no matching row.
✅ Syntax:
✅ Example:
✅ Visualization:
📊 Relevance:
Used to identify:
✅ 5. EQUI JOIN
✅ Definition:
A join based on an equality condition (=). Technically, any join whose
condition uses the = operator is an equi join.
✅ Syntax:
✅ Equivalent to:
📊 Relevance:
❌ 6. NON-EQUI JOIN
✅ Definition:
✅ Use Case:
📊 Relevance:
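As an illustration of the range-based matching listed in the summary table (income classification), here is a sketch of a non-equi join that uses BETWEEN instead of =. The income_bands table and its boundaries are assumptions invented for this example.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE customers (name TEXT, income INTEGER);
    CREATE TABLE income_bands (label TEXT, low INTEGER, high INTEGER);
    INSERT INTO customers VALUES ('Asha', 25000), ('Ben', 60000), ('Chen', 120000);
    INSERT INTO income_bands VALUES
        ('Low', 0, 39999), ('Middle', 40000, 99999), ('High', 100000, 999999999);
""")

# The join condition is a range test, not an equality test.
banded = conn.execute("""
    SELECT c.name, b.label
    FROM customers AS c
    JOIN income_bands AS b ON c.income BETWEEN b.low AND b.high
    ORDER BY c.income
""").fetchall()
print(banded)
conn.close()
```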
🔁 7. SELF JOIN
✅ Definition:
Joins a table to itself. Think of the same table used twice with aliases.
✅ Syntax:
✅ Use Case:
📊 Relevance:
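A sketch of the manager/staff case mentioned in the summary table: the same employees table is given two aliases, one for the employee and one for the manager. The manager_id column and the sample rows are assumptions for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE employees (emp_id INTEGER PRIMARY KEY, name TEXT, manager_id INTEGER);
    INSERT INTO employees VALUES
        (1, 'Dana', NULL), (2, 'Eli', 1), (3, 'Fay', 1), (4, 'Gus', 2);
""")

# e = the employee row, m = the same table read as the manager row.
pairs = conn.execute("""
    SELECT e.name AS employee, m.name AS manager
    FROM employees AS e
    JOIN employees AS m ON e.manager_id = m.emp_id
    ORDER BY e.emp_id
""").fetchall()
print(pairs)
conn.close()
```

Dana has no manager, so the inner self join omits her; a LEFT JOIN would keep her with a NULL manager.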
🧠 Summary Table
Join Type      Matched?    Unmatched Rows   Use Case Example
INNER JOIN     ✅          ❌               Only matched customers with orders
LEFT JOIN      ✅          Left side kept   All customers, even if no orders
RIGHT JOIN     ✅          Right side kept  All products, even if never sold
FULL JOIN      ✅          Both sides kept  Full data, including orphans
EQUI JOIN      ✅ (on =)   ❌               Most standard join
NON-EQUI JOIN  ✅ (≠)      ❌               Ranges, bands (e.g., income classification)
SELF JOIN      ✅          ❌               Table referencing itself (manager ↔ staff)
Data standardization
Calculations for reports
Modeling support (e.g., normalization, scaling)
Business logic (e.g., rounding, limits)
🟩 1. SQRT() – Square Root
✅ Purpose:
✅ Syntax:
✅ Example:
Used in:
✅ Syntax:
✅ Example:
Useful for:
✅ Syntax:
✅ Example:
Used in:
✅ Syntax:
✅ Example:
Returns the smallest integer greater than or equal to the given number.
✅ Syntax:
✅ Example:
VARCHAR to INT
FLOAT to DECIMAL
DATE to STRING
etc.
These conversions are explicit or implicit, depending on how they are
written and processed by the SQL engine.
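The difference between explicit and implicit conversion can be sketched in SQLite as below. Implicit coercion rules differ between engines, so the third expression shows SQLite's behavior only and should not be assumed elsewhere.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
explicit_int, explicit_text, implicit_sum = conn.execute("""
    SELECT CAST('45' AS INTEGER),  -- explicit: text to integer
           CAST(20 AS TEXT),       -- explicit: integer to text
           '45' + 5                -- implicit: SQLite coerces the text operand to a number
""").fetchone()
print(explicit_int, repr(explicit_text), implicit_sum)
conn.close()
```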
✅ Use Cases:
✅ Example:
✅ Notes:
✅ Example:
✅ Use:
When you need localized parsing for date and number formats
🟦 4. TRY_CAST() and TRY_CONVERT()
These are safe versions of CAST() and CONVERT().
✅ Example:
✅ Use Cases:
id price_str date_str
1 "45.67" "2024-03-01"
2 "32.50" "01-03-2024"
3 "error" "invalid"
Query (Safe Conversion):
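TRY_CAST() is T-SQL and is not available in SQLite, so the sketch below only imitates its "NULL instead of an error" behavior with two hypothetical helper functions registered on the connection. The table name raw_sales is an assumption, and the date helper accepts only the YYYY-MM-DD format, which is stricter than SQL Server's locale-dependent parsing.

```python
import sqlite3
from datetime import datetime

def try_cast_float(value):
    """Hypothetical stand-in for TRY_CAST(value AS FLOAT): None instead of an error."""
    try:
        return float(value)
    except (TypeError, ValueError):
        return None

def try_cast_date(value):
    """Hypothetical stand-in for TRY_CAST(value AS DATE); accepts YYYY-MM-DD only."""
    try:
        return datetime.strptime(value, "%Y-%m-%d").date().isoformat()
    except (TypeError, ValueError):
        return None

conn = sqlite3.connect(":memory:")
conn.create_function("try_cast_float", 1, try_cast_float)
conn.create_function("try_cast_date", 1, try_cast_date)
conn.executescript("""
    CREATE TABLE raw_sales (id INTEGER, price_str TEXT, date_str TEXT);
    INSERT INTO raw_sales VALUES
        (1, '45.67', '2024-03-01'),
        (2, '32.50', '01-03-2024'),
        (3, 'error', 'invalid');
""")

# Bad values become NULL (None) instead of aborting the whole query.
safe_rows = conn.execute("""
    SELECT id, try_cast_float(price_str), try_cast_date(date_str)
    FROM raw_sales
    ORDER BY id
""").fetchall()
print(safe_rows)
conn.close()
```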
⚠️ Important Notes
Always validate input before converting (TRY_CAST is preferred in
pipelines).
Be cautious when converting large decimals to INT: the fractional part is
lost, and out-of-range values can overflow.
Conversion from string to date is locale-sensitive.
🧠 Summary
Function        Safe?  Standard?   Notes
CAST()          ❌     ✅          Cross-platform, fails on bad data
CONVERT()       ❌     ❌ (T-SQL)  Good for formatting, MS SQL only
PARSE()         ❌     ❌          Locale-sensitive, slower
TRY_CAST()      ✅     ❌ (T-SQL)  Avoids crashes, returns NULL on error
TRY_CONVERT()   ✅     ❌ (T-SQL)  Like TRY_CAST but with formatting
🔹 1. COALESCE()
✅ Purpose:
✅ Example:
✅ Syntax:
✅ Example:
✅ Real Use:
🔹 3. NULLIF()
✅ Purpose:
Returns NULL if both expressions are equal; otherwise, returns the first
expression.
✅ Syntax:
✅ Example:
✅ Use Case:
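A common pattern is using NULLIF() to avoid division by zero, with COALESCE() supplying a fallback for missing values. The sketch below assumes a hypothetical campaign table; the column names and rows are illustrative.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE campaign (name TEXT, clicks INTEGER, impressions INTEGER, nickname TEXT);
    INSERT INTO campaign VALUES ('A', 50, 1000, NULL), ('B', 0, 0, 'beta');
""")

null_rows = conn.execute("""
    SELECT name,
           COALESCE(nickname, name) AS label,             -- first non-NULL value
           clicks * 1.0 / NULLIF(impressions, 0) AS ctr   -- NULL instead of dividing by 0
    FROM campaign
    ORDER BY name
""").fetchall()
print(null_rows)
conn.close()
```

Campaign B has zero impressions, so NULLIF turns the divisor into NULL and the ratio comes back as NULL rather than raising an error or returning a misleading number.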
🔹 1. IF Expression (Non-standard SQL)
✅ Available in:
✅ Example:
🔹 2. CASE Expression
Used across all major databases (MySQL, PostgreSQL, SQL Server, Oracle,
etc.)
✅ Syntax:
🔸 Simple CASE:
🔸 Searched CASE:
✅ Example:
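Both CASE forms can be sketched together as follows. The students table, the plan codes, and the score thresholds are assumptions chosen only to show label encoding and tiering.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE students (name TEXT, plan TEXT, score INTEGER);
    INSERT INTO students VALUES ('Asha', 'P', 82), ('Ben', 'B', 50), ('Chen', 'X', 35);
""")

case_rows = conn.execute("""
    SELECT name,
           -- Simple CASE: compares one expression against fixed values
           CASE plan WHEN 'P' THEN 'Premium'
                     WHEN 'B' THEN 'Basic'
                     ELSE 'Unknown' END AS plan_label,
           -- Searched CASE: evaluates conditions in order, first true one wins
           CASE WHEN score >= 75 THEN 'High'
                WHEN score >= 50 THEN 'Medium'
                ELSE 'Low' END AS tier
    FROM students
    ORDER BY name
""").fetchall()
print(case_rows)
conn.close()
```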
🔹 3. NULL in Conditions
✅ Remember: NULL is unknown in SQL.
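Because NULL means "unknown", comparing with = never evaluates to true, and IS NULL is needed instead. A small sketch with a hypothetical contacts table:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE contacts (id INTEGER, email TEXT);
    INSERT INTO contacts VALUES (1, 'a@example.com'), (2, NULL);
""")

# NULL = NULL is itself NULL (unknown); NULL IS NULL is true (1 in SQLite).
eq_null, is_null = conn.execute("SELECT NULL = NULL, NULL IS NULL").fetchone()

# The = comparison matches nothing; IS NULL finds the missing value.
wrong_filter = conn.execute("SELECT id FROM contacts WHERE email = NULL").fetchall()
right_filter = conn.execute("SELECT id FROM contacts WHERE email IS NULL").fetchall()
print(eq_null, is_null, wrong_filter, right_filter)
conn.close()
```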
❌ GOTO in SQL?
GOTO is NOT a part of standard SQL.
It's available only in procedural extensions (like PL/SQL, T-SQL, or
SQL scripts).
Not used in typical data science SQL workflows.
🧠 Summary Table
Expression  Supported In                            Purpose                    Example Use
IF()        MySQL                                   Simple conditional logic   IF(score>50,'Pass','Fail')
CASE        All SQL                                 Multi-condition logic      Label encoding, tiering, rule-based logic
NULL        All SQL                                 Handle missing values      IS NULL, IS NOT NULL
GOTO        Procedural extensions (PL/SQL, T-SQL)   Control flow in procedures Not used in analytical SQL
Example:
🔹 2. NOW()
🔹 3. DATEPART() / EXTRACT()
SQL Server:
🔹 4. DATEDIFF()
MySQL/PostgreSQL:
SQL Server:
🔹 5. DATE_ADD() / DATE_SUB()
MySQL:
🔹 7. DATE_TRUNC()
🔹 8. EXTRACT()
🔹 9. AGE() (PostgreSQL)
Returns the difference between two dates in years, months, and days.
🔎 Sample Query
🧬 Summary Table
Function      Purpose                               Example Usage
NOW()         Current date + time                   NOW()
CURDATE()     Current date                          CURDATE()
DATEDIFF()    Days between two dates                DATEDIFF('2025-04-14', '2025-04-01')
DATE_ADD()    Add days/months/years                 DATE_ADD(date, INTERVAL 5 DAY)
EXTRACT()     Extract part (year, month)            EXTRACT(MONTH FROM order_date)
DATE_TRUNC()  Truncate to a time unit               DATE_TRUNC('month', date)
FORMAT()      Format dates                          FORMAT(date, 'yyyy-MM')
AGE()         Get age/time difference (PostgreSQL)  AGE('2025-04-14', '2000-04-14')
🔚 Final Thoughts
Time-based data is critical in forecasting, churn prediction,
behavioral analysis, and anomaly detection.
SQL date/time functions provide everything needed to preprocess,
analyze, and aggregate time data effectively before feeding it into
visualization tools or ML models.
These functions are essential for data analysis, reporting, and
preprocessing, especially in data science, where numerical accuracy and
transformations are key for statistical computations and machine learning
tasks.
🔹 Data Science use: Log transformations help deal with skewed data,
e.g., in regression or scaling.
9. MOD(n, m) – Modulus (Remainder)
SIGN(n) returns:
1 for positive
-1 for negative
0 for zero
🔍 This example:
🧬 Final Thoughts
Numeric functions are:
📌 Used for case-insensitive matching, data normalization.
This query:
🧬 Final Thoughts
String functions are foundational tools in SQL for:
🔍 What is a Subquery?
A subquery (or inner query) is a SQL query nested inside another
query (called the outer query). Subqueries are used to:
They are enclosed in parentheses and can appear in the SELECT, FROM, or
WHERE clause.
✅ Why Use Subqueries?
Subqueries:
🧠 Types of Subqueries
Type                 Description
Single-row subquery  Returns one value (e.g., =, <, >)
Multi-row subquery   Returns multiple rows (e.g., IN, ANY, ALL)
Correlated subquery  References columns from outer query; runs row-by-row
Scalar subquery      Returns a single value; often used in SELECT or WHERE
Nested subqueries    Subqueries inside another subquery (multi-level)
📌 Syntax
1. In WHERE Clause
2. In FROM Clause
✅ Calculates average salaries per department and filters those above 80,000.
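A sketch of a FROM-clause subquery matching that description. The employees table, its columns, and the salary figures are assumptions for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE employees (name TEXT, department TEXT, salary INTEGER);
    INSERT INTO employees VALUES
        ('Asha', 'Sales', 100000), ('Ben', 'Sales', 90000),
        ('Chen', 'Support', 70000), ('Dev', 'Support', 60000),
        ('Eve', 'Research', 85000);
""")

# The inner query builds a derived table of per-department averages;
# the outer query filters that derived table.
high_avg = conn.execute("""
    SELECT department, avg_salary
    FROM (
        SELECT department, AVG(salary) AS avg_salary
        FROM employees
        GROUP BY department
    ) AS dept_avg
    WHERE avg_salary > 80000
    ORDER BY department
""").fetchall()
print(high_avg)
conn.close()
```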
3. In SELECT Clause
🔁 Correlated Subqueries
A correlated subquery uses a value from the outer query. It executes
once per row of the outer query.
✅ Returns employees who earn more than the average salary in their
department.
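The query described above can be sketched like this, again with a hypothetical employees table. The inner query refers to e.department from the outer row, which is what makes it correlated.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE employees (name TEXT, department TEXT, salary INTEGER);
    INSERT INTO employees VALUES
        ('Asha', 'Sales', 100000), ('Ben', 'Sales', 90000),
        ('Chen', 'Support', 70000), ('Dev', 'Support', 60000),
        ('Eve', 'Research', 85000);
""")

above_avg = conn.execute("""
    SELECT e.name
    FROM employees AS e
    WHERE e.salary > (
        SELECT AVG(e2.salary)
        FROM employees AS e2
        WHERE e2.department = e.department  -- value taken from the outer row
    )
    ORDER BY e.name
""").fetchall()
print(above_avg)
conn.close()
```

Eve is the only person in Research, so her salary equals the department average and she is not returned.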
🔄 Nested Subqueries
⚡ Subquery vs JOIN
Feature      Subquery                               JOIN
Use case     Filtering, computed conditions         Combine multiple tables
Performance  Can be slower (especially correlated)  Generally faster with large datasets
Readability  Easier for nested logic                Better for flat structure
🧾 Summary Table
Type        Example Use Case                  Syntax Example
Single-row  Compare to one value (e.g., max)  salary = (SELECT MAX(salary)...)
Multi-row   Compare to list                   WHERE id IN (SELECT emp_id FROM...)
Scalar      Add a value column                SELECT name, (SELECT COUNT(...)...)
Correlated  Depends on outer row              WHERE salary > (SELECT AVG(...) WHERE...)
Nested      Complex multi-level filters       SELECT ... WHERE id = (SELECT ... (SELECT ...))
🧬 Final Thoughts
Subqueries are a powerful tool in SQL and are:
🔑 Keyword: OVER()
🔹 Explanation:
PARTITION BY: Splits the data into groups (like GROUP BY, but doesn't
collapse rows).
ORDER BY: Defines how ranking or calculations are performed within
each partition.
✅ Example 1: RANK()
✅ Example 2: DENSE_RANK()
✅ Example 3: ROW_NUMBER()
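The three ranking functions are easiest to compare side by side on data that contains a tie. This sketch assumes a hypothetical sales table and needs SQLite 3.25 or newer for window function support. ROW_NUMBER() is given a tie-breaker (name) so its output is deterministic.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE sales (dept TEXT, name TEXT, amount INTEGER);
    INSERT INTO sales VALUES
        ('Sales', 'Asha', 300), ('Sales', 'Ben', 300), ('Sales', 'Chen', 200),
        ('Support', 'Dev', 150), ('Support', 'Eve', 100);
""")

ranked = conn.execute("""
    SELECT dept, name, amount,
           RANK()       OVER (PARTITION BY dept ORDER BY amount DESC)       AS rnk,
           DENSE_RANK() OVER (PARTITION BY dept ORDER BY amount DESC)       AS dense_rnk,
           ROW_NUMBER() OVER (PARTITION BY dept ORDER BY amount DESC, name) AS row_num
    FROM sales
    ORDER BY dept, row_num
""").fetchall()
for row in ranked:
    print(row)
conn.close()
```

In the Sales partition, Asha and Ben tie at 300: RANK() gives 1, 1, 3 (a gap), DENSE_RANK() gives 1, 1, 2 (no gap), and ROW_NUMBER() gives 1, 2, 3.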
⚠️ Common Mistakes
Mistake                            Correction
Using RANK() with no PARTITION BY  Ranks across all rows, not grouped
Confusing RANK() vs ROW_NUMBER()   Ties vs unique numbers
Forgetting ORDER BY in OVER()      No order = undefined behavior
🔚 Summary Table
Function      Handles Ties?  Gaps in Rank?  Description
RANK()        Yes            Yes            Assigns same rank to ties
DENSE_RANK()  Yes            No             Same rank, but next rank not skipped
ROW_NUMBER()  No             No             Unique rank for each row
NTILE(n)      N/A            N/A            Divides into n quantile groups
💡 Final Thoughts
Window functions like RANK(), ROW_NUMBER(), and LAG() are
indispensable for:
This enables:
✅ Best Practices
Practice                         Why?
Use parameterized queries        Prevents SQL injection
Use context managers             Auto close connections
Use SQLAlchemy or Pandas         Cleaner and more readable
Limit data pulled from SQL       Avoid memory overload in Python
Test queries before automation   Prevent data loss or logical errors
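Three of these practices (parameterized queries, context managers, limiting rows) can be sketched with the standard sqlite3 module. The orders table and its data are assumptions for illustration. Note that sqlite3's own `with conn:` only manages the transaction, so contextlib.closing is used here to make sure the connection is actually closed.

```python
import sqlite3
from contextlib import closing

# closing() guarantees conn.close() runs even if a query raises.
with closing(sqlite3.connect(":memory:")) as conn:
    conn.execute("CREATE TABLE orders (order_id INTEGER, customer TEXT, amount REAL)")
    conn.executemany(
        "INSERT INTO orders VALUES (?, ?, ?)",
        [(1, "Asha", 120.0), (2, "Ben", 35.0), (3, "O'Neil", 80.0)],
    )
    conn.commit()

    customer = "O'Neil"  # the quote would break a query built by string concatenation
    # Placeholders (?) pass values separately from the SQL text,
    # and LIMIT caps how many rows are pulled into Python.
    big_orders = conn.execute(
        "SELECT order_id, amount FROM orders "
        "WHERE customer = ? AND amount >= ? LIMIT ?",
        (customer, 50, 10),
    ).fetchall()

print(big_orders)
```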
🧠 Summary
Task            SQL                          Python Tool
Connect to DB   CREATE DATABASE, CONNECT     sqlite3, mysql.connector, psycopg2
Read Data       SELECT                       pandas.read_sql()
Write Data      INSERT, UPDATE               df.to_sql()
Query Analysis  Joins, Aggregations          Pandas, Matplotlib, Scikit-learn
Data Modeling   Star schema, normalization   SQL + Pandas
💡 Final Thought
Python + SQL is a powerful combo in Data Science. It allows you to: