Module_6

The document provides a comprehensive analysis of various SQL joins, including INNER, LEFT, RIGHT, FULL, EQUI, NON-EQUI, and SELF JOIN, detailing their definitions, syntax, examples, and relevance in data science. It also covers SQL mathematical functions, conversion functions, general functions, conditional expressions, and date/time functions, emphasizing their importance in data manipulation and analysis. Each section includes practical use cases and highlights how these SQL features are applied in data science workflows.

Uploaded by

Harish Karthik
Copyright
© © All Rights Reserved

Here is a concise analysis of the different types of SQL joins, including:

• INNER JOIN
• LEFT JOIN
• RIGHT JOIN
• FULL JOIN
• EQUI JOIN
• NON-EQUI JOIN
• SELF JOIN

Each is explained with syntax, visual behavior, examples, and Data Science relevance.

🔗 What Are Joins in SQL?


Joins combine rows from two or more tables based on a related column between them, usually a primary/foreign key relationship.

Joins are crucial in relational databases, especially for:

• Feature engineering
• Building master tables
• Performing analysis across related tables

🔍 1. INNER JOIN (Most Common)
✅ Definition:

Returns only the matching rows from both tables based on the join
condition.

✅ Syntax:

✅ Example:
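
The original example is not reproduced here. As a stand-in, the following minimal sketch uses Python's built-in sqlite3 module with an in-memory database; the users and transactions tables and their rows are made up for illustration.

```python
import sqlite3

# In-memory database with two small, hypothetical tables.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE users (user_id INTEGER PRIMARY KEY, name TEXT);
    CREATE TABLE transactions (txn_id INTEGER PRIMARY KEY, user_id INTEGER, amount REAL);
    INSERT INTO users VALUES (1, 'Asha'), (2, 'Ben'), (3, 'Chen');
    INSERT INTO transactions VALUES (10, 1, 25.0), (11, 1, 40.0), (12, 2, 15.5), (13, 9, 99.0);
""")

# INNER JOIN keeps only rows whose user_id appears in both tables:
# Chen (no transactions) and transaction 13 (unknown user 9) are dropped.
inner_rows = conn.execute("""
    SELECT u.name, t.amount
    FROM users AS u
    INNER JOIN transactions AS t ON u.user_id = t.user_id
    ORDER BY t.txn_id
""").fetchall()

print(inner_rows)  # [('Asha', 25.0), ('Asha', 40.0), ('Ben', 15.5)]
conn.close()
```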
✅ Visualization:

📊 Relevance:

Used to combine related data — e.g., merging user info with transaction
history.

📥 2. LEFT JOIN (LEFT OUTER JOIN)
✅ Definition:

Returns all rows from the left table and the matched rows from the right table. Where no match is found, the right-table columns are NULL.

✅ Syntax:
✅ Example:
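
The original example is not reproduced here. The sketch below is a stand-in that assumes hypothetical customers and orders tables in an in-memory SQLite database. It also shows the common follow-up of filtering on NULL to find left-table rows with no match.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE customers (customer_id INTEGER PRIMARY KEY, name TEXT);
    CREATE TABLE orders (order_id INTEGER PRIMARY KEY, customer_id INTEGER, total REAL);
    INSERT INTO customers VALUES (1, 'Asha'), (2, 'Ben'), (3, 'Chen');
    INSERT INTO orders VALUES (100, 1, 20.0), (101, 3, 35.0);
""")

# Every customer is kept; Ben has no order, so order_id comes back as NULL (None).
left_rows = conn.execute("""
    SELECT c.name, o.order_id
    FROM customers AS c
    LEFT JOIN orders AS o ON c.customer_id = o.customer_id
    ORDER BY c.customer_id
""").fetchall()

# Keeping only the unmatched rows lists customers without orders.
no_orders = conn.execute("""
    SELECT c.name
    FROM customers AS c
    LEFT JOIN orders AS o ON c.customer_id = o.customer_id
    WHERE o.order_id IS NULL
""").fetchall()

print(left_rows)  # [('Asha', 100), ('Ben', None), ('Chen', 101)]
print(no_orders)  # [('Ben',)]
conn.close()
```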

✅ Visualization:

📊 Relevance:

Useful to:

• Find customers with or without orders
• Retain all data from the main table

📤 3. RIGHT JOIN (RIGHT OUTER JOIN)
✅ Definition:

Returns all rows from the right table and the matched rows from the left table. Where no match is found, the left-table columns are NULL.

✅ Syntax:

✅ Example:
✅ Visualization:

📊 Relevance:

Use when the right table contains the base data, such as ensuring all
departments are included even if no employees exist.

🌐 4. FULL JOIN (FULL OUTER JOIN)
✅ Definition:

Returns all rows from both tables, with NULL for unmatched rows on
either side.

✅ Syntax:
✅ Example:

✅ Visualization:

📊 Relevance:

Used to identify:

• The complete set of records from both tables
• Orphan records in either table

✅ 5. EQUI JOIN
✅ Definition:

A join whose condition uses the equality operator (=). Technically, any join that matches rows with = is an equi join.
✅ Syntax:

✅ Equivalent to:

📊 Relevance:

Most commonly used join type. Every INNER/LEFT/RIGHT JOIN with = is an equi join.

❌ 6. NON-EQUI JOIN
✅ Definition:

Joins based on non-equality conditions, like <, >, BETWEEN, etc.


✅ Syntax:

✅ Use Case:

 Range-based matching (e.g., salary brackets, age groups)

📊 Relevance:

Used in interval mapping, binning, categorical grouping, and graded classification.
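
As an illustration of range-based matching, here is a minimal sketch using Python's sqlite3 module. The employees and salary_bands tables, and the band boundaries, are assumptions made up for the example.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE employees (name TEXT, salary INTEGER);
    CREATE TABLE salary_bands (band TEXT, min_salary INTEGER, max_salary INTEGER);
    INSERT INTO employees VALUES ('Asha', 42000), ('Ben', 68000), ('Chen', 95000);
    INSERT INTO salary_bands VALUES
        ('Low', 0, 49999), ('Mid', 50000, 79999), ('High', 80000, 999999);
""")

# The join condition is a range test (BETWEEN), not an equality,
# so each salary is matched to the band whose interval contains it.
band_rows = conn.execute("""
    SELECT e.name, b.band
    FROM employees AS e
    JOIN salary_bands AS b
      ON e.salary BETWEEN b.min_salary AND b.max_salary
    ORDER BY e.salary
""").fetchall()

print(band_rows)  # [('Asha', 'Low'), ('Ben', 'Mid'), ('Chen', 'High')]
conn.close()
```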

🔁 7. SELF JOIN
✅ Definition:

Joins a table to itself. Think of the same table used twice with aliases.
✅ Syntax:

✅ Use Case:

• Represent hierarchical relationships (manager ↔ employee, parent ↔ child)
• Detect patterns in the same table

📊 Relevance:

Important in organizational structures, tree traversal, and anomaly detection.
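
A manager ↔ employee lookup can be sketched as follows with Python's sqlite3 module. The staff table and its rows are hypothetical; the point is that the same table appears twice under two aliases.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE staff (emp_id INTEGER PRIMARY KEY, name TEXT, manager_id INTEGER);
    INSERT INTO staff VALUES
        (1, 'Alice', NULL), (2, 'Bob', 1), (3, 'Carol', 1), (4, 'Dan', 2);
""")

# Alias e is the employee row, alias m is that employee's manager row.
# Alice has no manager (NULL), so the inner join leaves her out.
pairs = conn.execute("""
    SELECT e.name AS employee, m.name AS manager
    FROM staff AS e
    JOIN staff AS m ON e.manager_id = m.emp_id
    ORDER BY e.emp_id
""").fetchall()

print(pairs)  # [('Bob', 'Alice'), ('Carol', 'Alice'), ('Dan', 'Bob')]
conn.close()
```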

🧠 Summary Table
Join Type | Matched? | Unmatched Rows | Use Case Example
INNER JOIN | ✅ | ❌ | Only matched customers with orders
LEFT JOIN | ✅ | Left side kept | All customers, even if no orders
RIGHT JOIN | ✅ | Right side kept | All products, even if never sold
FULL JOIN | ✅ | Both sides kept | Full data, including orphans
EQUI JOIN | ✅ (on =) | ❌ | Most standard join
NON-EQUI JOIN | ✅ (on <, >, BETWEEN) | ❌ | Ranges, bands (e.g., income classification)
SELF JOIN | ✅ | ❌ | Table referencing itself (manager ↔ staff)

📊 Joins in Data Science: Why They Matter
Purpose | Example
Feature Engineering | Merge transactions with user info
Label Enrichment | Join predictions with actual values
Data Cleaning | Check mismatches across tables
Aggregation | Group by + join for summary statistics
Advanced Analytics | Join multiple sources (sales, marketing, feedback, etc.)

📐 What Are SQL Mathematical Functions?
Mathematical functions in SQL are predefined operations that can be used to manipulate or analyze numeric data. These functions help with:

• Data standardization
• Calculations for reports
• Modeling support (e.g., normalization, scaling)
• Business logic (e.g., rounding, limits)

🟩 1. SQRT() – Square Root
✅ Purpose:

Returns the square root of a number.

✅ Syntax:

✅ Example:

✅ Use in Data Science:

Used in:

• Normalizing skewed data
• Applying statistical operations
• Feature transformation (e.g., root transformations)

🟩 2. PI() – Mathematical Constant Pi
✅ Purpose:

Returns the value of π (pi) — approximately 3.1415926.

✅ Syntax:

✅ Example:

✅ Use in Data Science:

Useful for:

• Geometry-based data (e.g., maps, distances)
• Trigonometric transformations
• Algorithms involving radial distances (e.g., clustering, geospatial analysis)

🟩 3. SQUARE() – Square of a Number
✅ Purpose:

Returns the square of a given number (n^2). SQUARE() is specific to SQL Server (T-SQL); other engines typically use POWER(n, 2) instead.

✅ Syntax:

✅ Example:

✅ Use in Data Science:

Used in:

• Variance/Standard Deviation (needs squared values)
• Error computation (e.g., squared error in regression)
• Scoring metrics

🟩 4. ROUND() – Rounding a Number
✅ Purpose:

Rounds a numeric value to a specified number of decimal places.

✅ Syntax:

✅ Example:

✅ Use in Data Science:

• Used in reporting and display formatting
• Useful for discretizing continuous values
• Hides floating-point noise in displayed results (it does not remove the underlying precision error)

🟩 5. CEILING() – Round Up
✅ Purpose:

Returns the smallest integer greater than or equal to the given number.

✅ Syntax:

✅ Example:

✅ Use in Data Science:

• Useful when ensuring minimum thresholds
• Bin definitions where you need upper limits
• Rounding up scores, grades, or evaluations

📊 Summary Table of SQL Math Functions
Function | Description | Example | Output
SQRT(x) | Square root of x | SQRT(16) | 4.0
PI() | Returns Pi (π) | PI() | 3.14159
SQUARE(x) | Returns x² | SQUARE(3) | 9
ROUND(x,n) | Rounds x to n decimal places | ROUND(3.5678, 2) | 3.57
CEILING(x) | Smallest integer ≥ x (round up) | CEILING(2.3) | 3

🧠 Relevance in Data Science
Area | Use Case Example
Feature Engineering | Square, root, round transformations
Normalization & Scaling | Apply SQRT/SQUARE for distribution adjustment
Clustering/Geo Analysis | PI, SQUARE, SQRT for distance formulas
Reporting/BI Dashboards | ROUND, CEILING for human-friendly formats
Evaluation Metrics | Mean squared error (uses SQUARE)

🔄 What Are Conversion Functions in SQL?
Conversion functions in SQL are used to change a value from one data type to another, such as:

• VARCHAR to INT
• FLOAT to DECIMAL
• DATE to STRING
• etc.

A conversion is explicit when the query requests it with a function such as CAST(), and implicit when the SQL engine applies it automatically (for example, when comparing a string to a number).

🧰 Common Conversion Functions
Function | Description
CAST() | Explicitly converts one data type to another
CONVERT() | Converts a value from one data type to another (T-SQL/MS SQL)
PARSE() | Converts string to date/numeric using culture formatting (MS SQL)
TRY_CAST() | Like CAST, but returns NULL on failure
TRY_CONVERT() | Like CONVERT, but safer (returns NULL if invalid)

🟦 1. CAST() – Standard SQL Type Conversion
✅ Syntax:
✅ Example:

✅ Use Cases:

• Convert dates to strings for reporting
• Convert strings to numbers for calculations
• Convert float to integer for rounding logic

🟦 2. CONVERT() – SQL Server-Specific


✅ Syntax:

✅ Example:
✅ Notes:

• The style parameter is often used for date formatting
• The three-argument CONVERT(type, value, style) form is specific to MS SQL Server (T-SQL)

🟦 3. PARSE() – String to Date/Number (MS SQL)
✅ Syntax:

✅ Example:

✅ Use:

 When you need localized parsing for date and number formats
🟦 4. TRY_CAST() and TRY_CONVERT()
These are safe versions of CAST() and CONVERT().

✅ Example:

✅ Use Cases:

• Data cleaning with inconsistent formats
• Avoiding query errors in production pipelines
• Validating conversion before processing

📊 Conversion Examples in Real Data Scenarios
Example Table:

id | price_str | date_str
1 | "45.67" | "2024-03-01"
2 | "32.50" | "01-03-2024"
3 | "error" | "invalid"
Query (Safe Conversion):
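
The original query is not reproduced here. As a rough analogue of TRY_CAST() semantics (return NULL instead of failing), the sketch below applies the same idea in plain Python to the rows of the example table. The helper names try_cast_float and try_cast_date are hypothetical, and the assumption that only ISO dates (YYYY-MM-DD) are accepted is made for the example.

```python
from datetime import date, datetime
from typing import Optional


def try_cast_float(text: str) -> Optional[float]:
    """Return text as a float, or None when it cannot be converted."""
    try:
        return float(text)
    except (TypeError, ValueError):
        return None


def try_cast_date(text: str, fmt: str = "%Y-%m-%d") -> Optional[date]:
    """Return text as a date in the given format, or None when it does not match."""
    try:
        return datetime.strptime(text, fmt).date()
    except (TypeError, ValueError):
        return None


# Rows copied from the example table above.
raw_rows = [
    (1, "45.67", "2024-03-01"),
    (2, "32.50", "01-03-2024"),
    (3, "error", "invalid"),
]

# Bad values become None instead of aborting the whole batch.
clean_rows = [
    (row_id, try_cast_float(price), try_cast_date(day))
    for row_id, price, day in raw_rows
]

for row in clean_rows:
    print(row)
```

Row 2 shows why the locale note matters: "01-03-2024" is a plausible date, but it does not match the assumed format, so it is converted to None rather than silently misread.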

📈 Relevance to Data Science
Use Case | Relevance
ETL (Extract, Transform, Load) | Convert text to proper types during ingestion
Feature Engineering | Convert categorical types to numeric for ML modeling
Data Cleaning | Handle inconsistent formats or nulls
Aggregation/Analysis | Numeric types needed for SUM, AVG, etc.
Date Calculations | Need date/time formats for trend analysis, time series
Dashboard & Reporting | String formatting for readable outputs

⚠️Important Notes
• Always validate input before converting (TRY_CAST preferred in pipelines).
• Be cautious when converting large decimals to INT: the fractional part is lost, and very large values can overflow.
• Conversion from string to date is locale-sensitive.

🧠 Summary
Function | Safe? | Standard? | Notes
CAST() | ❌ | ✅ | Cross-platform, fails on bad data
CONVERT() | ❌ | ❌ (T-SQL) | Good for formatting, MS SQL only
PARSE() | ❌ | ❌ | Locale-sensitive, slower
TRY_CAST() | ✅ | ❌ (T-SQL) | Avoids crashes, returns NULL on error
TRY_CONVERT() | ✅ | ❌ (T-SQL) | Like TRY_CAST but with formatting

🌟 Why Are General Functions Important?
In real-world datasets:

• Missing values (NULLs) are common.
• You often need default substitutes or need to avoid errors from unexpected NULLs.
• These functions help control how NULLs behave in your SQL logic and analysis.

🔹 1. COALESCE()
✅ Purpose:

Returns the first non-null value in a list of arguments.


✅ Syntax:

✅ Example:

✅ Real Table Example:

✅ Use in Data Science:

• Filling missing values
• Providing default fallbacks
• Used in feature engineering to ensure no NULLs during modeling

🔹 2. NVL() (Oracle SQL Specific)
✅ Purpose:

Replaces NULL with a specified replacement value.

✅ Syntax:

✅ Example:

✅ Real Use:

✅ Use in Data Science:

• Used in Oracle-based systems to avoid NULLs in aggregations
• Essential for default value imputation

🔸 Note: Equivalent in other databases:

• MySQL, PostgreSQL: use COALESCE() instead

🔹 3. NULLIF()
✅ Purpose:

Returns NULL if both expressions are equal; otherwise, returns the first
expression.

✅ Syntax:

✅ Example:

✅ Use Case:

Avoiding division by zero:


✅ Use in Data Science:

• Avoiding zero division errors
• Used in data transformation pipelines when checking redundant values
• Helps handle conditional replacements in queries

🧠 Summary Comparison Table
Function | Description | Usage Example | Output
COALESCE() | Returns the first non-null value | COALESCE(NULL, NULL, 'SQL') | 'SQL'
NVL() | Replaces NULL with a given value (Oracle only) | NVL(NULL, 0) | 0
NULLIF() | Returns NULL if both values are equal | NULLIF(5, 5) | NULL

📊 Relevance to Data Science
Area | Use Case Example
Data Cleaning | Replace NULLs with fallback values (COALESCE, NVL)
Feature Engineering | Fill empty values with alternate sources
Error Handling | Prevent divide-by-zero or logic errors using NULLIF
ETL/Preprocessing | Defaulting and conditional logic during transformation
Analytics/Reporting | Ensure no NULLs in outputs for dashboards or ML models

🧪 Example Query in a Data Pipeline
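
The original query is not reproduced here. A minimal stand-in, assuming a hypothetical campaign_stats table in an in-memory SQLite database, could combine COALESCE() for defaults with NULLIF() to guard a division:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE campaign_stats (campaign TEXT, clicks INTEGER, impressions INTEGER, region TEXT);
    INSERT INTO campaign_stats VALUES
        ('spring', 50, 1000, 'EU'),
        ('summer', 0, 0, NULL),
        ('autumn', NULL, 400, 'US');
""")

# COALESCE supplies defaults for missing values.
# NULLIF(impressions, 0) turns a zero denominator into NULL, so the
# division yields NULL instead of a divide-by-zero problem.
stats = conn.execute("""
    SELECT campaign,
           COALESCE(region, 'unknown') AS region,
           COALESCE(clicks, 0) AS clicks,
           COALESCE(clicks, 0) * 1.0 / NULLIF(impressions, 0) AS click_rate
    FROM campaign_stats
    ORDER BY rowid
""").fetchall()

for row in stats:
    print(row)
conn.close()
```

The summer row has zero impressions, so its click_rate comes back as None (NULL) rather than an error or a misleading zero.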

🌟 Why Use Conditional Expressions?
In real-world datasets:

• You often need conditional logic: e.g., apply a discount if the purchase amount is high.
• Conditional expressions help in transforming, labeling, handling NULLs, and customizing outputs for reports and models.

🔹 1. IF Expression (Non-standard SQL)
✅ Available in:

• MySQL supports it directly.
• Not available as a function in PostgreSQL; SQL Server offers IIF() instead.
✅ Syntax (MySQL):

✅ Example:

✅ Use in Data Science:

• Labeling or bucketing values (e.g., salary levels)
• Creating derived features from existing columns

🔹 2. CASE Statement – Most Powerful & Standard
✅ Standard SQL Conditional Expression

Used across all major databases (MySQL, PostgreSQL, SQL Server, Oracle, etc.)
✅ Syntax:

🔸 Simple CASE:

🔸 Searched CASE:
✅ Example:
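
The original example is not reproduced here. As a stand-in, this sketch runs a searched CASE through Python's sqlite3 module to turn scores into letter grades, then reuses CASE inside SUM() for a conditional count. The scores table and the grade cut-offs are assumptions for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE scores (student TEXT, score INTEGER);
    INSERT INTO scores VALUES ('Asha', 91), ('Ben', 78), ('Chen', 64), ('Dee', 45);
""")

# Searched CASE: conditions are checked top to bottom, first match wins.
graded = conn.execute("""
    SELECT student,
           CASE
               WHEN score >= 90 THEN 'A'
               WHEN score >= 75 THEN 'B'
               WHEN score >= 60 THEN 'C'
               ELSE 'F'
           END AS grade
    FROM scores
    ORDER BY score DESC
""").fetchall()

# Conditional aggregation: count only the rows that satisfy a rule.
passed_count = conn.execute("""
    SELECT SUM(CASE WHEN score >= 60 THEN 1 ELSE 0 END) FROM scores
""").fetchone()[0]

print(graded)        # [('Asha', 'A'), ('Ben', 'B'), ('Chen', 'C'), ('Dee', 'F')]
print(passed_count)  # 3
conn.close()
```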

✅ Use in Data Science:

• Label encoding (e.g., converting numeric scores into letter grades)
• Binning/Bucketing
• Custom categorization for classification problems
• Conditional aggregations in queries

🔹 3. NULL in Conditions
✅ Remember: NULL is unknown in SQL.

• Any comparison with NULL using = or != results in NULL, not TRUE or FALSE.
• You must use IS NULL or IS NOT NULL to check for NULL values.
✅ Example:
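
The original example is not reproduced here. The following sketch, using Python's sqlite3 module and a hypothetical profiles table, shows the difference between comparing with = NULL and testing with IS NULL.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE profiles (user_id INTEGER, age INTEGER);
    INSERT INTO profiles VALUES (1, 34), (2, NULL), (3, 27);
""")

# NULL = NULL evaluates to NULL (None in Python), not TRUE.
eq_result = conn.execute("SELECT NULL = NULL").fetchone()[0]

# A WHERE clause keeps only rows where the condition is TRUE,
# so "age = NULL" never matches anything.
wrong = conn.execute("SELECT user_id FROM profiles WHERE age = NULL").fetchall()

# IS NULL is the correct test for missing values.
right = conn.execute("SELECT user_id FROM profiles WHERE age IS NULL").fetchall()

# A 0/1 missing-value flag that could feed a model as an input feature.
flags = conn.execute("""
    SELECT user_id, age IS NULL AS age_missing
    FROM profiles
    ORDER BY user_id
""").fetchall()

print(eq_result)  # None
print(wrong)      # []
print(right)      # [(2,)]
print(flags)      # [(1, 0), (2, 1), (3, 0)]
conn.close()
```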

✅ Use in Data Science:

• Handling missing values (feature engineering)
• Data profiling and audit checks for data quality
• Creating flags for nulls as model inputs

❌ GOTO in SQL?
• GOTO is NOT a part of standard SQL.
• It's available only in procedural extensions (like PL/SQL, T-SQL, or SQL scripts).
• Not used in typical data science SQL workflows.

🧠 Summary Table
Expression | Supported In | Purpose | Example Use
IF() | MySQL | Simple conditional logic | IF(score>50,'Pass','Fail')
CASE | All SQL | Multi-condition logic | Label encoding, tiering, rule-based logic
NULL | All SQL | Handle missing values | IS NULL, IS NOT NULL
GOTO | Procedural extensions only (PL/SQL, T-SQL) | Control flow in procedures | Not used in analytical SQL

📊 Relevance in Data Science
Application Area | How Conditional Expressions Help
Feature Engineering | Create custom variables based on business logic
Missing Value Handling | Detect or flag NULLs for imputation
Label Encoding | Translate raw numeric/text values into buckets or categories
Data Transformation | Apply rule-based formatting for better analytics or visuals
Custom Aggregations | Conditional logic in aggregations (e.g., count only premium users)
Preprocessing for ML | Prepare structured inputs for machine learning models

📌 Advanced Example

✅ This is an example of segmentation, a common data science technique.

📅 Why Do Date and Time Functions Matter?
In data science, time-series data is everywhere:

• Web traffic logs
• Sales over time
• User activity
• Financial transactions

✅ Date/Time functions allow you to:

• Extract parts (like year, month)
• Perform arithmetic (e.g., difference in days)
• Format dates for reporting
• Filter or group data by time periods

🧩 Common Date and Time Functions (Across SQL Engines)
🔹 1. CURRENT_DATE / GETDATE() / SYSDATE

Returns the current system date. Note that GETDATE() and SYSDATE also include the time of day.

SQL Engine | Function
PostgreSQL | CURRENT_DATE
MySQL | CURDATE()
SQL Server | GETDATE()
Oracle | SYSDATE

Example:

🔹 2. NOW()

Returns current date and time.

🔹 3. DATEPART() / EXTRACT()

Returns specific components from a date.


PostgreSQL:

SQL Server:

🔹 4. DATEDIFF()

Returns the difference between two dates in days, months, or other units.

MySQL (PostgreSQL has no DATEDIFF(); subtract one date from the other instead):

SQL Server:

🔹 5. DATE_ADD() / DATE_SUB()

Used for adding/subtracting days/months/years.

MySQL:

🔹 6. TO_CHAR() and FORMAT()

Used for formatting date values as strings.

🔹 7. DATE_TRUNC()

Truncates the date to a specific part (year, month, day).


PostgreSQL:

Great for grouping by time intervals.

🔹 8. EXTRACT()

Pulls a specific part from a date (standard SQL).

🔹 9. AGE() (PostgreSQL)

Returns the difference between two dates in years, months, and days.

🧠 Use Cases in Data Science
Scenario | SQL Function Example
Calculate user age | DATEDIFF(YEAR, birthdate, GETDATE()) (approximate: counts year boundaries crossed)
Find users inactive for 30+ days | WHERE DATEDIFF(day, last_login, GETDATE()) > 30
Group sales by month | GROUP BY DATE_TRUNC('month', sale_date)
Filter for records in last week | WHERE sale_date >= DATE_SUB(CURDATE(), INTERVAL 7 DAY)
Extract hour of the day | EXTRACT(HOUR FROM login_time)
Format for reporting | TO_CHAR(date_column, 'Month YYYY')
Create time series windows | DATE_TRUNC('day', timestamp) + GROUP BY

🔎 Sample Query
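
The original query is not reproduced here. The sketch below is a stand-in run through Python's sqlite3 module. SQLite has no DATEDIFF() or EXTRACT(), so it uses SQLite's own strftime(), julianday(), and date() functions instead. The orders table, the 30-day window, and the fixed reference date AS_OF (used in place of the current date so the output is reproducible) are all assumptions.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE orders (order_id INTEGER, order_date TEXT);
    INSERT INTO orders VALUES
        (1, '2025-03-28'), (2, '2025-04-02'), (3, '2025-04-10'), (4, '2025-02-01');
""")

AS_OF = "2025-04-14"  # fixed "today" so the example is deterministic

recent_orders = conn.execute("""
    SELECT order_id,
           CAST(strftime('%Y', order_date) AS INTEGER) AS order_year,
           CAST(strftime('%m', order_date) AS INTEGER) AS order_month,
           CAST(julianday(:as_of) - julianday(order_date) AS INTEGER) AS days_since_order
    FROM orders
    WHERE order_date >= date(:as_of, '-30 days')
    ORDER BY order_date
""", {"as_of": AS_OF}).fetchall()

for row in recent_orders:
    print(row)
conn.close()
```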

✅ This retrieves recent orders, breaks them down by month/year, and computes days since each order.

🧬 Summary Table
Function | Purpose | Example Usage
NOW() | Current date + time | NOW()
CURDATE() | Current date | CURDATE()
DATEDIFF() | Days between two dates | DATEDIFF('2025-04-14', '2025-04-01')
DATE_ADD() | Add days/months/years | DATE_ADD(date, INTERVAL 5 DAY)
EXTRACT() | Extract part (year, month) | EXTRACT(MONTH FROM order_date)
DATE_TRUNC() | Truncate to a time unit | DATE_TRUNC('month', date)
FORMAT() | Format dates | FORMAT(date, 'yyyy-MM')
AGE() | Get age/time difference (PostgreSQL) | AGE('2025-04-14', '2000-04-14')

🔚 Final Thoughts
• Time-based data is critical in forecasting, churn prediction, behavioral analysis, and anomaly detection.
• SQL date/time functions cover most of what is needed to preprocess, analyze, and aggregate time data before feeding it into visualization tools or ML models.

🔢 What Are Numeric Functions in SQL?
Numeric functions are built-in SQL functions that perform mathematical operations on numeric data types (like INT, FLOAT, DECIMAL, etc.).

These functions are essential for data analysis, reporting, and preprocessing, especially in data science, where numerical accuracy and transformations are key for statistical computations and machine learning tasks.

✅ Common SQL Numeric Functions

Let’s explore the most frequently used numeric functions with examples.

1. ABS() – Absolute Value

Returns the absolute (positive) value of a number.


🔹 Use in Data Science: Used when calculating distance, error margin, or
loss values where only magnitude matters.

2. CEILING() / CEIL() – Round Up

Rounds a decimal up to the nearest integer.

🔹 Useful in allocating resources (e.g., rounding up to nearest hour, container, etc.).

3. FLOOR() – Round Down

Rounds a decimal down to the nearest integer.

🔹 Used in binning values, age groups, floor-level pricing.


4. ROUND(number, decimal_places) – Rounding

Rounds a number to a specified number of decimal places.

🔹 Important in finance and reporting, especially when rounding values for dashboards or visualizations.

5. POWER(base, exponent) – Exponentiation

Raises a number to a power.

🔹 Used in formulas like variance, square, exponential decay models.

6. SQRT() – Square Root

Returns the square root of a number.


🔹 Crucial in machine learning: standard deviation, Euclidean distance,
normalization.

7. EXP() – Exponential Function

Returns e raised to the power of the number (e^x).

🔹 Used in exponential growth modeling, compound interest, or decay curves.

8. LOG() / LOG10() / LN()

• LOG10() → Base 10 logarithm
• LN() → Natural logarithm (base e)
• LOG() → Engine-dependent: with one argument it is the natural logarithm in MySQL and SQL Server, but base 10 in PostgreSQL

🔹 Data Science use: Log transformations help deal with skewed data, e.g., in regression or scaling.

9. MOD(n, m) – Modulus (Remainder)

Returns the remainder of n ÷ m.

🔹 Common in grouping operations, hashing, time bucketing.

10. SIGN() – Sign of a Number

Returns:

 1 for positive
 -1 for negative
 0 for zero

🔹 Used in analytics to detect trends, changes, or outliers.


📊 Practical Example
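
The original query is not reproduced here. The stand-in below runs through Python's sqlite3 module with a hypothetical order_lines table. Because FLOOR() and CEILING() are not available in every SQLite build, the sketch registers two helper functions, py_floor and py_ceil (hypothetical names), from Python's math module, and uses SQLite's % operator in place of MOD().

```python
import math
import sqlite3

conn = sqlite3.connect(":memory:")

# Register Python helpers as SQL functions (portable across SQLite builds).
conn.create_function("py_floor", 1, math.floor)
conn.create_function("py_ceil", 1, math.ceil)

conn.executescript("""
    CREATE TABLE order_lines (order_id INTEGER, unit_price REAL, quantity INTEGER, profit REAL);
    INSERT INTO order_lines VALUES
        (1, 19.99, 3, 12.5), (2, 5.25, 4, -3.75), (3, 7.4, 2, 1.0), (4, 2.5, 10, -8.0);
""")

numeric_rows = conn.execute("""
    SELECT order_id,
           ROUND(unit_price * quantity, 1) AS total_price,   -- rounded total
           py_floor(unit_price)            AS price_floor,   -- round down
           py_ceil(unit_price)             AS price_ceil,    -- round up
           ABS(profit)                     AS loss_magnitude -- size of the loss
    FROM order_lines
    WHERE order_id % 2 = 0                                    -- even order IDs only
    ORDER BY order_id
""").fetchall()

print(numeric_rows)  # [(2, 21.0, 5, 6, 3.75), (4, 25.0, 2, 3, 8.0)]
conn.close()
```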

🔍 This example:

• Calculates total price with rounding
• Floors and ceils prices
• Uses ABS to get loss value
• Uses MOD to pick even order IDs

📈 Numeric Functions in Data Science
Function | Data Science Use Case
ABS() | Used in MAE (Mean Absolute Error), price deviation
SQRT() | Root mean square error (RMSE), normalization
LOG() | Log transform for normalizing skewed distributions
POWER() | Calculating squares/cubes for distance metrics or regression formulas
ROUND() | Prepping data for display, reporting, or precision control
CEILING() | Rounding resource usage (e.g., CPU hours, billing)
MOD() | Grouping by intervals, filtering even/odd samples
SIGN() | Identifying positive/negative trends, e.g., stock prices, sales growth

🧠 Summary Table
Function | Description | Example
ABS() | Absolute value | ABS(-10) → 10
ROUND() | Round to decimal places | ROUND(3.456, 2)
CEILING() | Round up | CEILING(3.2) → 4
FLOOR() | Round down | FLOOR(3.8) → 3
POWER() | Exponent | POWER(2, 3)
SQRT() | Square root | SQRT(9)
EXP() | e raised to power | EXP(1)
LOG10() | Log base 10 | LOG10(100)
LN() | Natural logarithm | LN(2.71)
MOD() | Remainder | MOD(10, 3)
SIGN() | Sign of number | SIGN(-5) → -1

🧬 Final Thoughts
Numeric functions are:

• The foundation for arithmetic and statistical analysis in SQL
• Often used in data cleaning, engineering, reporting, and ML pipelines
• A great way to perform in-database computations before exporting to Python or R

🔤 What Are String Functions in SQL?

String functions in SQL are used to manipulate, analyze, or transform string/textual data stored in tables. These functions are critical for:

• Cleaning text data
• Parsing and formatting strings
• Extracting features from text (e.g., names, IDs)
• Prepping data for NLP tasks in Data Science

✅ Common SQL String Functions
Let’s explore the most widely used string functions with examples and use cases.

1. LENGTH() or LEN() – Get Length

Returns the number of characters in a string.

📌 Use in Data Science:

• Text feature engineering (e.g., tweet length)
• Detect unusually long/short entries

2. UPPER() and LOWER() – Case Conversion

• UPPER(): Converts all characters to uppercase
• LOWER(): Converts all characters to lowercase

📌 Used for case-insensitive matching, data normalization.

3. SUBSTRING() or SUBSTR() – Extract Substring

Extracts a part of the string starting from a position.

📌 Useful for extracting fields, such as:

• Year from date string
• Domain from email

4. TRIM() – Remove Spaces

Removes leading and trailing spaces (or other characters).

📌 Essential for data cleaning: removes whitespace that causes false mismatches.

5. LTRIM() and RTRIM() – Trim Left/Right

• LTRIM() → Removes leading spaces
• RTRIM() → Removes trailing spaces

📌 Used in column cleanup before joins or comparisons.

6. REPLACE() – Replace Substrings

Replaces occurrences of a substring.

📌 Very helpful in normalizing data or fixing typos.

7. CONCAT() – Combine Strings

Combines two or more strings.

📌 Combine fields like first + last names, or email + domain.


8. INSTR() – Find Position of Substring

Returns position of first occurrence of a substring.

📌 Helpful in parsing or validating strings (e.g., domain in email).

9. LEFT() and RIGHT() – Get Substring from Ends

• LEFT(string, N) – First N characters
• RIGHT(string, N) – Last N characters

📌 Used in extracting identifiers, codes, prefixes.

10. REVERSE() – Reverse a String

Reverses the input string.


📌 Rare in analytics, but useful in some text problems or puzzles.

11. ASCII() – ASCII Value of First Character

Returns ASCII code of the first character.

📌 Useful for encoding, classification, or ordering characters.

12. CHARINDEX() – Find Index of Character/Substring

Returns index of a substring in a string.

📌 For pattern extraction, validations.


🧠 Real-World Example:
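
The original query is not reproduced here. As a stand-in, this sketch uses Python's sqlite3 module and SQLite's spellings of the functions above (SUBSTR, INSTR, TRIM, UPPER, LOWER). The feedback table and its rows are hypothetical, and the first three digits of the phone number are assumed to be the area code.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE feedback (name TEXT, email TEXT, phone TEXT, comment TEXT);
    INSERT INTO feedback VALUES
        ('  asha rao ', 'Asha@Example.COM', '415-555-0101', 'Please process my Refund'),
        ('BEN lee', 'ben@mail.org ', '212-555-0199', 'Great service'),
        ('chen wu', 'chen@example.com', '646-555-0142', 'refund not received');
""")

cleaned = conn.execute("""
    SELECT UPPER(TRIM(name))  AS name,        -- standardized name
           LOWER(TRIM(email)) AS email,       -- standardized email
           SUBSTR(LOWER(TRIM(email)), INSTR(TRIM(email), '@') + 1) AS domain,
           SUBSTR(phone, 1, 3) AS area_code   -- first three digits
    FROM feedback
    WHERE INSTR(LOWER(comment), 'refund') > 0 -- case-insensitive keyword filter
    ORDER BY rowid
""").fetchall()

for row in cleaned:
    print(row)
conn.close()
```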

This query:

• Cleans and standardizes names/emails
• Extracts area codes
• Filters comments that mention “refund”

📈 String Functions in Data Science
Function | Use in Data Science
LENGTH() | Text feature extraction (e.g., comment length, tweet size)
TRIM() | Cleaning user input
REPLACE() | Normalize inconsistent labels
SUBSTRING() | Extract year, domain, product code
UPPER()/LOWER() | Case normalization for accurate matching
CONCAT() | Building identifiers, URLs, full names
INSTR()/CHARINDEX() | Detect keyword presence in text
LEFT()/RIGHT() | Extract codes or indicators from string IDs

📋 Summary Table
Function | Purpose | Example Result
LENGTH() | Length of string | LENGTH('abc') → 3
TRIM() | Remove whitespace | ' abc ' → 'abc'
REPLACE() | Replace substring | 'AI' → 'ML'
SUBSTRING() | Extract part of string | 'Hello' → 'ell'
CONCAT() | Combine strings | 'Data' + 'Science'
UPPER()/LOWER() | Change case | 'SQL' → 'sql'
LEFT()/RIGHT() | Left/right characters | '12345' → '12'
INSTR()/CHARINDEX() | Find substr index | 'DataSci' → 5

🧬 Final Thoughts
String functions are foundational tools in SQL for:

✅ Text data cleaning
✅ Standardization before analysis
✅ Feature engineering for NLP
✅ Searching and filtering text

🔍 What is a Subquery?
A subquery (or inner query) is a SQL query nested inside another query (called the outer query). Subqueries are used to:

• Retrieve intermediate results
• Filter data based on computed values
• Perform comparisons, aggregations, and joins indirectly

They are enclosed in parentheses and can appear in the SELECT, FROM, or WHERE clause.

✅ Why Use Subqueries?
Subqueries:

• Break complex problems into simpler steps
• Allow you to use aggregated data as filters
• Are essential in Data Science for data wrangling, feature extraction, nested filtering, etc.

🧠 Types of Subqueries
Type | Description
Single-row subquery | Returns one row; used with operators such as =, <, >
Multi-row subquery | Returns multiple rows; used with IN, ANY, ALL
Correlated subquery | References columns from the outer query; evaluated row by row
Scalar subquery | Returns a single value; often used in SELECT or WHERE
Nested subqueries | Subqueries inside another subquery (multi-level)

📌 Syntax
1. In WHERE Clause
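
The original query is not reproduced here. A minimal stand-in, run through Python's sqlite3 module against hypothetical departments and employees tables, looks like this:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE departments (dept_id INTEGER PRIMARY KEY, dept_name TEXT);
    CREATE TABLE employees (name TEXT, dept_id INTEGER);
    INSERT INTO departments VALUES (1, 'Data Science'), (2, 'Finance');
    INSERT INTO employees VALUES ('Asha', 1), ('Ben', 2), ('Chen', 1);
""")

# The inner query finds the department id; the outer query filters on it.
ds_staff = conn.execute("""
    SELECT name
    FROM employees
    WHERE dept_id IN (
        SELECT dept_id FROM departments WHERE dept_name = 'Data Science'
    )
    ORDER BY name
""").fetchall()

print(ds_staff)  # [('Asha',), ('Chen',)]
conn.close()
```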

✅ Filters employees working in the Data Science department.

2. In FROM Clause

✅ Calculates average salaries per department and filters those above 80,000.
3. In SELECT Clause

✅ Adds a column with total projects for each employee.

🔁 Correlated Subqueries
A correlated subquery uses a value from the outer query. Conceptually, it is evaluated once per row of the outer query, although the query optimizer may rewrite it.
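
The original query is not reproduced here. The sketch below is a stand-in using Python's sqlite3 module; the employees table and salaries are made up. The inner query refers to e.department from the outer query, which is what makes it correlated.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE employees (name TEXT, department TEXT, salary INTEGER);
    INSERT INTO employees VALUES
        ('Asha', 'Data Science', 90000), ('Ben', 'Data Science', 70000),
        ('Chen', 'Finance', 60000), ('Dee', 'Finance', 80000), ('Eli', 'Finance', 61000);
""")

# For each outer row e, the subquery averages salaries in e's own department.
above_avg = conn.execute("""
    SELECT e.name
    FROM employees AS e
    WHERE e.salary > (
        SELECT AVG(e2.salary)
        FROM employees AS e2
        WHERE e2.department = e.department
    )
    ORDER BY e.name
""").fetchall()

print(above_avg)  # [('Asha',), ('Dee',)]
conn.close()
```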

✅ Returns employees who earn more than the average salary in their
department.
🔄 Nested Subqueries

✅ Get all employees who work under the manager "Alice".

⚡ Subquery vs JOIN
Feature | Subquery | JOIN
Use case | Filtering, computed conditions | Combine multiple tables
Performance | Can be slower (especially correlated) | Often faster with large datasets, depending on the optimizer
Readability | Easier for nested logic | Better for flat structure

📈 Subqueries in Data Science

Subqueries are highly relevant in:

• Data preprocessing: Extracting filtered groups or features
• Anomaly detection: Finding outliers (e.g., values greater than group avg)
• Feature engineering: Adding computed metrics (e.g., total transactions, max ratings)
• Nested filters: Filter with dynamic thresholds or rules

🧪 Example in Data Science Workflow

✅ A subquery is used to calculate and compare group-level aggregates, a common pattern in data analysis.

🧾 Summary Table
Type | Example Use Case | Syntax Example
Single-row | Compare to one value (e.g., max) | salary = (SELECT MAX(salary)...)
Multi-row | Compare to list | WHERE id IN (SELECT emp_id FROM...)
Scalar | Add a value column | SELECT name, (SELECT COUNT(...)...)
Correlated | Depends on outer row | WHERE salary > (SELECT AVG(...) WHERE...)
Nested | Complex multi-level filters | SELECT ... WHERE id = (SELECT ... (SELECT ...))

🧬 Final Thoughts
Subqueries are a powerful tool in SQL. They are:

• Vital for data exploration and preprocessing
• Frequently used in reporting and data science pipelines
• Helpful for extracting complex data patterns without creating intermediate tables

📘 What are Window Functions?


Window functions perform calculations across a set of table rows that are related to the current row. Unlike aggregate functions, they do not collapse rows: they return a value for each row.

🔑 Keyword: OVER()

These are especially useful in analytics, data summarization, and trend analysis, which are core tasks in Data Science.

🎯 Use Case of Window Functions in Data Science
• Ranking customers based on purchases
• Finding moving averages, running totals
• Calculating percentiles or lead/lag comparisons
• Feature generation for machine learning (e.g., trends, change over time)

🔢 Types of Ranking Functions (Window Functions)
Function | Description
RANK() | Assigns a rank to each row; ties get the same rank, and the following ranks are skipped
DENSE_RANK() | Like RANK(), but no rank gaps after ties
ROW_NUMBER() | Assigns a unique number to each row, no duplicates, even for ties
NTILE(n) | Divides rows into n roughly equal buckets (e.g., quartiles when n = 4)

🧠 Basic Syntax of a Window Function

🔹 Explanation:

• PARTITION BY: Splits the data into groups (like GROUP BY, but doesn't collapse rows).
• ORDER BY: Defines how ranking or calculations are performed within each partition.

✅ Example 1: RANK()

📌 Ranks employees by salary within each department.

✅ Example 2: DENSE_RANK()

📌 Similar to RANK() but no skipped values for ties.

✅ Example 3: ROW_NUMBER()

📌 Assigns a unique row number, even if two salaries are equal.


✅ Example 4: NTILE()

📌 Splits the employees into 4 salary quartiles.
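
The four ranking functions can be compared side by side with a small sketch. The original example queries are not reproduced here; this stand-in uses Python's sqlite3 module (window functions need SQLite 3.25 or newer) and a hypothetical employees table with one salary tie.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE employees (name TEXT, department TEXT, salary INTEGER);
    INSERT INTO employees VALUES
        ('Asha', 'Eng', 120), ('Ben', 'Eng', 120), ('Chen', 'Eng', 100),
        ('Dee', 'Ops', 90), ('Eli', 'Ops', 80);
""")

# RANK and DENSE_RANK order by salary only, so Asha and Ben tie.
# ROW_NUMBER and NTILE add name as a tie-breaker to keep the output deterministic.
ranked = conn.execute("""
    SELECT name, department,
           RANK()       OVER (PARTITION BY department ORDER BY salary DESC)       AS rnk,
           DENSE_RANK() OVER (PARTITION BY department ORDER BY salary DESC)       AS dense_rnk,
           ROW_NUMBER() OVER (PARTITION BY department ORDER BY salary DESC, name) AS row_num,
           NTILE(4)     OVER (ORDER BY salary DESC, name)                         AS quartile
    FROM employees
    ORDER BY department, salary DESC, name
""").fetchall()

for row in ranked:
    print(row)
conn.close()
```

In the Eng partition, the tie at 120 gives both rows rank 1; RANK() then jumps to 3 for Chen while DENSE_RANK() continues with 2.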

🔁 Analytical Window Functions (with OVER)
Other non-ranking window functions:

Function | Purpose | Example Use Case
SUM() | Running total | Cumulative sales
AVG() | Moving average | 7-day rolling mean
LAG(), LEAD() | Compare current vs previous/next row | Price change detection
FIRST_VALUE() | First value in partition | Identify initial status
LAST_VALUE() | Last value in partition | End state of a process

✅ Example 5: LAG() and LEAD()
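
The original query is not reproduced here. As a stand-in, this sketch uses Python's sqlite3 module (SQLite 3.25 or newer) with a hypothetical hires table.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE hires (name TEXT, hire_date TEXT);
    INSERT INTO hires VALUES
        ('Asha', '2023-01-15'), ('Ben', '2023-03-01'), ('Chen', '2023-06-20');
""")

# LAG looks one row back and LEAD one row ahead in hire_date order.
# The first row has no predecessor and the last has no successor, so those are NULL.
hire_sequence = conn.execute("""
    SELECT name,
           hire_date,
           LAG(hire_date)  OVER (ORDER BY hire_date) AS previous_hire,
           LEAD(hire_date) OVER (ORDER BY hire_date) AS next_hire
    FROM hires
    ORDER BY hire_date
""").fetchall()

for row in hire_sequence:
    print(row)
conn.close()
```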

📌 Tracks previous and next hire dates in sequence.

📈 Use in Data Science
Use Case | Description
Feature Engineering | Rank, moving averages, lead-lag variables
Time Series Analysis | Windowing over dates, detecting trends
Customer Segmentation | NTILE or RANK based on revenue/engagement
Fraud Detection | Compare current and previous transactions
Churn Prediction | Track last purchase date or account activity

🧪 Example in Data Science Scenario

Goal: Rank customers by purchase amount within each month

📌 Helps identify top spenders per month – useful for marketing or loyalty campaigns.

⚠️Common Mistakes
Mistake | Correction
Using RANK() with no PARTITION BY | Ranks across all rows, not grouped
Confusing RANK() vs ROW_NUMBER() | Ties vs unique numbers
Forgetting ORDER BY in OVER() | Without an order, the ranking is arbitrary (some engines reject the query)

🔚 Summary Table
Function | Handles Ties? | Gaps in Rank? | Description
RANK() | Yes | Yes | Assigns same rank to ties
DENSE_RANK() | Yes | No | Same rank, but next rank not skipped
ROW_NUMBER() | No | No | Unique rank for each row
NTILE(n) | N/A | N/A | Divides into n quantile groups

💡 Final Thoughts
Window functions like RANK(), ROW_NUMBER(), and LAG() are indispensable for:

• Ranking, comparison, and analytics
• Temporal and hierarchical analysis
• Data transformations in pipelines

They are heavily used in SQL-backed data science platforms like Snowflake, Redshift, BigQuery, and PostgreSQL.

🔗 What Does It Mean to Integrate Python with SQL?
Integrating Python with SQL refers to using Python code to connect to, query, and manipulate data stored in SQL databases like MySQL, PostgreSQL, SQLite, etc.

This enables:

• Automating data retrieval
• Combining Python’s data processing power (Pandas, NumPy) with SQL’s querying power
• Building data pipelines and dashboards
• Performing advanced analytics and machine learning using SQL data

📦 Common Python Libraries for SQL Integration
Library | Description | Use Case
sqlite3 | Built-in Python module for SQLite databases | Lightweight, local storage
mysql-connector-python / PyMySQL | For connecting to MySQL databases | Web apps, backend integration
psycopg2 | Popular PostgreSQL adapter for Python | Scalable data applications
SQLAlchemy | ORM + SQL toolkit for Python | Abstract database interactions
Pandas | Read/write SQL tables with read_sql, to_sql | Data analysis & transformation

🧱 SQL Integration Structure
1. Connect to the database
2. Write/Execute SQL queries
3. Fetch and store results
4. Manipulate data using Python (e.g., Pandas)
5. Close the connection

✅ Example 1: Using sqlite3
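
The original example is not reproduced here. The five steps listed above can be sketched as follows with an in-memory database. The sales table and its rows are made up. contextlib.closing is used because a sqlite3 connection's own with-block only commits or rolls back the transaction and does not close the connection.

```python
import sqlite3
from contextlib import closing

# 1. Connect (closing() guarantees step 5, closing the connection).
with closing(sqlite3.connect(":memory:")) as conn:
    # 2. Write/execute SQL.
    conn.execute("CREATE TABLE sales (region TEXT, amount REAL)")
    conn.executemany(
        "INSERT INTO sales VALUES (?, ?)",
        [("North", 120.0), ("South", 80.0), ("North", 30.0)],
    )
    conn.commit()

    # 3. Fetch and store results.
    rows = conn.execute(
        "SELECT region, SUM(amount) FROM sales GROUP BY region ORDER BY region"
    ).fetchall()

# 4. Manipulate the data in Python (here: a plain dict instead of Pandas).
totals = dict(rows)
print(totals)  # {'North': 150.0, 'South': 80.0}
```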


✅ Example 2: Using Pandas with SQL

⚙️With MySQL or PostgreSQL


You’ll use:
MySQL Example:

🔍 Why Is SQL-Python Integration Important in Data Science?
Use Case | Benefit
ETL Pipelines | Extract data from SQL → Transform in Python → Load
Feature Engineering | SQL to filter/join + Python to create features
Model Deployment | Pull data → Predict → Save predictions back to SQL
Reporting and Visualization | Query + clean data → Visualize using Matplotlib/Seaborn
Handling Large Datasets | Use SQL for filtering → Python for analysis
🧠 Use Case Example: ML Workflow
1. Extract data from SQL using pandas.read_sql()
2. Clean & transform data in Pandas
3. Train model using Scikit-learn
4. Store predictions back in the SQL database using to_sql()

🧱 Using SQLAlchemy (Advanced Integration)
SQLAlchemy lets you write ORM-based queries:

This is scalable, flexible, and supports multiple DB engines.

✅ Best Practices
Practice | Why?
Use parameterized queries | Prevents SQL injection
Use context managers | Auto close connections
Use SQLAlchemy or Pandas | Cleaner and more readable
Limit data pulled from SQL | Avoid memory overload in Python
Test queries before automation | Prevent data loss or logical errors
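
To illustrate the first practice, here is a minimal sketch with Python's sqlite3 module. The users table and the malicious input string are hypothetical. The point is that a ? placeholder passes the input as data, while string formatting lets it change the query itself.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE users (name TEXT, role TEXT)")
conn.executemany(
    "INSERT INTO users VALUES (?, ?)",
    [("alice", "admin"), ("bob", "analyst")],
)

user_input = "bob' OR '1'='1"  # hostile value pretending to be a user name

# Unsafe: the input is pasted into the SQL text and rewrites the WHERE clause.
unsafe_sql = f"SELECT name FROM users WHERE name = '{user_input}' ORDER BY name"
unsafe_rows = conn.execute(unsafe_sql).fetchall()

# Safe: the driver binds the value, so it is compared as a literal string.
safe_rows = conn.execute(
    "SELECT name FROM users WHERE name = ? ORDER BY name", (user_input,)
).fetchall()

print(unsafe_rows)  # [('alice',), ('bob',)]  every row leaks
print(safe_rows)    # []  no user has that literal name
conn.close()
```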

🧠 Summary
Task | SQL | Python Tool
Connect to DB | CREATE DATABASE, CONNECT | sqlite3, mysql.connector, psycopg2
Read Data | SELECT | pandas.read_sql()
Write Data | INSERT, UPDATE | df.to_sql()
Query Analysis | Joins, Aggregations | Pandas, Matplotlib, Scikit-learn
Data Modeling | Star schema, normalization | SQL + Pandas

💡 Final Thought
Python + SQL is a powerful combo in Data Science. It allows you to:

• Seamlessly move between data storage and analysis
• Build reliable, end-to-end pipelines
• Scale from small scripts to enterprise-grade systems
