
Module 6

The document provides a comprehensive analysis of various SQL joins, including INNER, LEFT, RIGHT, FULL, EQUI, NON-EQUI, and SELF JOIN, detailing their definitions, syntax, examples, and relevance in data science. It also covers SQL mathematical functions, conversion functions, general functions, conditional expressions, and date/time functions, emphasizing their importance in data manipulation and analysis. Each section includes practical use cases and highlights how these SQL features are applied in data science workflows.

Uploaded by

Harish Karthik


Here is a concise analysis of the different types of SQL joins, including:

 INNER JOIN
 LEFT JOIN
 RIGHT JOIN
 FULL JOIN
 EQUI JOIN
 NON-EQUI JOIN
 SELF JOIN

Each is explained with syntax, visual behavior, examples, and Data Science relevance.

🔗 What Are Joins in SQL?

Joins are used to combine rows from two or more tables based on a
related column between them — usually a primary/foreign key
relationship.

Joins are crucial in relational databases, especially for:

 Feature engineering
 Building master tables
 Performing analysis across related tables
🔍 1. INNER JOIN (Most Common)
✅ Definition:

Returns only the matching rows from both tables based on the join
condition.

✅ Syntax:

✅ Example:
✅ Visualization:

📊 Relevance:

Used to combine related data — e.g., merging user info with transaction
history.
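
A minimal sketch of an INNER JOIN, run through Python's built-in sqlite3 module so it is easy to try. The users and transactions tables and their rows are hypothetical, invented only for illustration.

```python
import sqlite3

# Hypothetical tables and rows, invented for illustration.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE users (user_id INTEGER PRIMARY KEY, name TEXT);
    CREATE TABLE transactions (txn_id INTEGER PRIMARY KEY, user_id INTEGER, amount REAL);
    INSERT INTO users VALUES (1, 'Asha'), (2, 'Ben'), (3, 'Chen');
    INSERT INTO transactions VALUES (10, 1, 25.0), (11, 1, 40.0), (12, 2, 15.5), (13, 9, 99.0);
""")

# Only rows with a match on both sides survive the join.
inner_rows = conn.execute("""
    SELECT u.name, t.amount
    FROM users AS u
    INNER JOIN transactions AS t ON u.user_id = t.user_id
    ORDER BY t.txn_id
""").fetchall()
conn.close()

print(inner_rows)
```

Chen has no transactions and transaction 13 points at a missing user, so neither appears in the result.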

📥 2. LEFT JOIN (LEFT OUTER JOIN)
✅ Definition:

Returns all rows from the left table and the matching rows from the right
table. Where no match exists, the right-table columns are NULL.

✅ Syntax:
✅ Example:

✅ Visualization:

📊 Relevance:

Useful to:

 List customers with or without orders

 Retain all rows from the main table
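
A small sketch of a LEFT JOIN using sqlite3, assuming hypothetical customers and orders tables. The second query adds an IS NULL filter, a common way to keep only the left-side rows that found no match.

```python
import sqlite3

# Hypothetical tables and rows, invented for illustration.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE customers (customer_id INTEGER PRIMARY KEY, name TEXT);
    CREATE TABLE orders (order_id INTEGER PRIMARY KEY, customer_id INTEGER, total REAL);
    INSERT INTO customers VALUES (1, 'Asha'), (2, 'Ben'), (3, 'Chen');
    INSERT INTO orders VALUES (100, 1, 30.0), (101, 3, 12.5);
""")

# Every customer is kept; order columns are NULL (None in Python) when unmatched.
left_rows = conn.execute("""
    SELECT c.name, o.order_id
    FROM customers AS c
    LEFT JOIN orders AS o ON c.customer_id = o.customer_id
    ORDER BY c.customer_id
""").fetchall()

# Anti-join pattern: customers that have no order at all.
no_orders = conn.execute("""
    SELECT c.name
    FROM customers AS c
    LEFT JOIN orders AS o ON c.customer_id = o.customer_id
    WHERE o.order_id IS NULL
""").fetchall()
conn.close()

print(left_rows, no_orders)
```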
📤 3. RIGHT JOIN (RIGHT OUTER JOIN)
✅ Definition:

Returns all rows from the right table and the matching rows from the left
table. Where no match exists, the left-table columns are NULL.

✅ Syntax:

✅ Example:
✅ Visualization:

📊 Relevance:

Use when the right table contains the base data, such as ensuring all
departments are included even if no employees exist.

🌐 4. FULL JOIN (FULL OUTER JOIN)
✅ Definition:

Returns all rows from both tables, with NULL for unmatched rows on
either side.

✅ Syntax:
✅ Example:

✅ Visualization:

📊 Relevance:

Used to identify:

 All records from both tables, matched where possible

 Orphan records in either table
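
A sketch of FULL JOIN behavior using sqlite3. Because FULL JOIN is only available in newer SQLite builds (3.39 and later), this example emulates it with a LEFT JOIN plus a UNION ALL of the unmatched right-side rows; engines such as PostgreSQL accept FULL OUTER JOIN directly. The tables and rows are hypothetical.

```python
import sqlite3

# Hypothetical tables and rows, invented for illustration.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE employees (emp_id INTEGER PRIMARY KEY, name TEXT, dept_id INTEGER);
    CREATE TABLE departments (dept_id INTEGER PRIMARY KEY, dept_name TEXT);
    INSERT INTO employees VALUES (1, 'Asha', 10), (2, 'Ben', NULL);
    INSERT INTO departments VALUES (10, 'Analytics'), (20, 'Research');
""")

# Part 1: all employees, with their department when one matches.
# Part 2: departments that no employee references (right-side orphans).
full_rows = conn.execute("""
    SELECT e.name, d.dept_name
    FROM employees AS e
    LEFT JOIN departments AS d ON e.dept_id = d.dept_id
    UNION ALL
    SELECT NULL, d.dept_name
    FROM departments AS d
    WHERE NOT EXISTS (SELECT 1 FROM employees AS e WHERE e.dept_id = d.dept_id)
""").fetchall()
conn.close()

print(full_rows)
```

Ben (no department) and Research (no employees) both appear, each padded with NULL on the missing side.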

✅ 5. EQUI JOIN
✅ Definition:

A join whose condition uses equality (=). Technically, any join that compares
columns with the = operator is an equi join.
✅ Syntax:

✅ Equivalent to:

📊 Relevance:

Most commonly used join type. Every INNER/LEFT/RIGHT JOIN with = is an equi join.

❌ 6. NON-EQUI JOIN
✅ Definition:

Joins based on non-equality conditions, like <, >, BETWEEN, etc.

✅ Syntax:

✅ Use Case:

 Range-based matching (e.g., salary brackets, age groups)

📊 Relevance:

Used in interval mapping, binning, categorical grouping, and graded classification.
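
A minimal sketch of a non-equi join for salary brackets, using sqlite3. The employees and salary_brackets tables, and the bracket boundaries, are assumptions made up for the example.

```python
import sqlite3

# Hypothetical tables and rows, invented for illustration.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE employees (name TEXT, salary INTEGER);
    CREATE TABLE salary_brackets (label TEXT, low INTEGER, high INTEGER);
    INSERT INTO employees VALUES ('Asha', 42000), ('Ben', 75000), ('Chen', 120000);
    INSERT INTO salary_brackets VALUES
        ('low', 0, 49999), ('mid', 50000, 99999), ('high', 100000, 999999);
""")

# The join condition is a range test (BETWEEN), not an equality.
bracket_rows = conn.execute("""
    SELECT e.name, b.label
    FROM employees AS e
    JOIN salary_brackets AS b ON e.salary BETWEEN b.low AND b.high
    ORDER BY e.salary
""").fetchall()
conn.close()

print(bracket_rows)
```

The brackets must not overlap; otherwise one employee would match several rows.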

🔁 7. SELF JOIN
✅ Definition:

Joins a table to itself. Think of the same table used twice with aliases.
✅ Syntax:

✅ Use Case:

 Represent hierarchical relationships (manager ↔ employee, parent ↔ child)
 Detect patterns in the same table

📊 Relevance:

Important in organizational structures, tree traversal, and anomaly detection.
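
A short sketch of a self join for a manager/employee hierarchy, using sqlite3. The employees table and its manager_id column are hypothetical. A LEFT JOIN is used here (an assumption of this sketch) so the top-level employee, who has no manager, is still listed.

```python
import sqlite3

# Hypothetical table and rows, invented for illustration.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE employees (emp_id INTEGER PRIMARY KEY, name TEXT, manager_id INTEGER);
    INSERT INTO employees VALUES
        (1, 'Asha', NULL), (2, 'Ben', 1), (3, 'Chen', 1), (4, 'Dev', 2);
""")

# The same table appears twice under two aliases: e (employee) and m (manager).
hierarchy = conn.execute("""
    SELECT e.name AS employee, m.name AS manager
    FROM employees AS e
    LEFT JOIN employees AS m ON e.manager_id = m.emp_id
    ORDER BY e.emp_id
""").fetchall()
conn.close()

print(hierarchy)
```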

🧠 Summary Table
Join Type | Matched? | Unmatched Rows | Use Case Example
INNER JOIN | ✅ | ❌ | Only matched customers with orders
LEFT JOIN | ✅ | Left side kept | All customers, even if no orders
RIGHT JOIN | ✅ | Right side kept | All products, even if never sold
FULL JOIN | ✅ | Both sides kept | Full data, including orphans
EQUI JOIN | ✅ (on =) | ❌ | Most standard join
NON-EQUI JOIN | ✅ (on <, >, BETWEEN, etc.) | ❌ | Ranges, bands (e.g., income classification)
SELF JOIN | ✅ | ❌ | Table referencing itself (manager ↔ staff)

📊 Joins in Data Science: Why They Matter
Purpose | Example
Feature Engineering | Merge transactions with user info
Label Enrichment | Join predictions with actual values
Data Cleaning | Check mismatches across tables
Aggregation | Group by + join for summary statistics
Advanced Analytics | Join multiple sources (sales, marketing, feedback, etc.)

📐 What Are SQL Mathematical Functions?
Mathematical functions in SQL are predefined operations that can be used
to manipulate or analyze numeric data. These functions help with:

 Data standardization
 Calculations for reports
 Modeling support (e.g., normalization, scaling)
 Business logic (e.g., rounding, limits)
🟩 1. SQRT() – Square Root
✅ Purpose:

Returns the square root of a number.

✅ Syntax:

✅ Example:

✅ Use in Data Science:

Used in:

 Normalizing skewed data

 Applying statistical operations
 Feature transformation (e.g., root transformations)
🟩 2. PI() – Mathematical Constant Pi
✅ Purpose:

Returns the value of π (pi) — approximately 3.1415926.

✅ Syntax:

✅ Example:

✅ Use in Data Science:

Useful for:

 Geometry-based data (e.g., maps, distances)

 Trigonometric transformations
 Algorithms involving radial distances (e.g., clustering, geospatial
analysis)
🟩 3. SQUARE() – Square of a Number
✅ Purpose:

Returns the square of a given number (n^2). SQUARE() is specific to SQL Server; most other engines use POWER(x, 2) or x * x instead.

✅ Syntax:

✅ Example:

✅ Use in Data Science:

Used in:

 Variance/Standard Deviation (needs squared values)

 Error computation (e.g., squared error in regression)
 Scoring metrics
🟩 4. ROUND() – Rounding a Number
✅ Purpose:

Rounds a numeric value to a specified number of decimal places.

✅ Syntax:

✅ Example:

✅ Use in Data Science:

 Used in reporting and display formatting

 Useful for discretizing continuous values
 Helps keep floating-point noise out of displayed values
🟩 5. CEILING() – Round Up
✅ Purpose:

Returns the smallest integer greater than or equal to the given number.

✅ Syntax:

✅ Example:

✅ Use in Data Science:

 Useful when ensuring minimum thresholds

 Bin definitions where you need upper limits
 Rounding up scores, grades, or evaluations
📊 Summary Table of SQL Math Functions
Function | Description | Example | Output
SQRT(x) | Square root of x | SQRT(16) | 4.0
PI() | Returns Pi (π) | PI() | 3.14159
SQUARE(x) | Returns x² | SQUARE(3) | 9
ROUND(x,n) | Rounds x to n decimal places | ROUND(3.5678, 2) | 3.57
CEILING(x) | Smallest integer ≥ x (round up) | CEILING(2.3) | 3

🧠 Relevance in Data Science

Area | Use Case Example
Feature Engineering | Square, root, round transformations
Normalization & Scaling | Apply SQRT/SQUARE for distribution adjustment
Clustering/Geo Analysis | PI, SQUARE, SQRT for distance formulas
Reporting/BI Dashboards | ROUND, CEILING for human-friendly formats
Evaluation Metrics | Mean squared error (uses SQUARE)
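
As an illustration of the evaluation-metrics use case, here is a sketch that computes MSE and RMSE inside SQL through sqlite3. Two assumptions: the predictions table is hypothetical, and SQRT is registered from Python with create_function because SQLite's built-in math functions depend on how the library was compiled. Squaring is written as x * x since SQUARE() is not portable.

```python
import math
import sqlite3

conn = sqlite3.connect(":memory:")
# Register SQRT ourselves so the sketch does not depend on SQLite build options.
conn.create_function("SQRT", 1, math.sqrt)

# Hypothetical actual vs. predicted values, invented for illustration.
conn.executescript("""
    CREATE TABLE predictions (actual REAL, predicted REAL);
    INSERT INTO predictions VALUES (3.0, 2.0), (5.0, 5.0), (7.0, 9.0);
""")

# Squared errors are 1, 0, and 4, so MSE = 5/3 and RMSE = sqrt(5/3).
mse, rmse = conn.execute("""
    SELECT ROUND(AVG((actual - predicted) * (actual - predicted)), 4),
           ROUND(SQRT(AVG((actual - predicted) * (actual - predicted))), 4)
    FROM predictions
""").fetchone()
conn.close()

print(mse, rmse)
```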

🔄 What Are Conversion Functions in SQL?
Conversion functions in SQL are used to change a data type from one
form to another, such as:

 VARCHAR to INT
 FLOAT to DECIMAL
 DATE to STRING
 etc.
Conversions can be explicit (requested with a function such as CAST()) or
implicit (applied automatically by the SQL engine).

🧰 Common Conversion Functions

Function | Description
CAST() | Explicitly converts one data type to another
CONVERT() | Converts a value from one data type to another (T-SQL/MS SQL)
PARSE() | Converts string to date/numeric using culture formatting (MS SQL)
TRY_CAST() | Like CAST, but returns NULL on failure
TRY_CONVERT() | Like CONVERT, but safer (returns NULL if invalid)

🟦 1. CAST() – Standard SQL Type Conversion
✅ Syntax:
✅ Example:

✅ Use Cases:

 Convert dates to strings for reporting

 Convert strings to numbers for calculations
 Convert float to integer for rounding logic

🟦 2. CONVERT() – SQL Server-Specific

✅ Syntax:

✅ Example:
✅ Notes:

 style parameter is often used for date formatting

 This style-based form is specific to MS SQL Server (MySQL has its own CONVERT() with different syntax)

🟦 3. PARSE() – String to Date/Number (MS SQL)
✅ Syntax:

✅ Example:

✅ Use:

 When you need localized parsing for date and number formats
🟦 4. TRY_CAST() and TRY_CONVERT()
These are safe versions of CAST() and CONVERT().

✅ Example:

✅ Use Cases:

 Data cleaning with inconsistent formats

 Avoiding query errors in production pipelines
 Validating conversion before processing

📊 Conversion Examples in Real Data Scenarios
Example Table:

id price_str date_str
1 "45.67" "2024-03-01"
2 "32.50" "01-03-2024"
3 "error" "invalid"
Query (Safe Conversion):
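
The same "return NULL instead of failing" idea can be sketched on the Python side of a pipeline. This is an analogue of TRY_CAST, not T-SQL itself: the helper names try_cast_float and try_cast_iso_date are hypothetical, and the date helper assumes only the YYYY-MM-DD format counts as valid.

```python
from datetime import date, datetime

def try_cast_float(text):
    """Return text as a float, or None when it cannot be converted."""
    try:
        return float(text)
    except (TypeError, ValueError):
        return None

def try_cast_iso_date(text):
    """Return text as a date if it is in YYYY-MM-DD form, else None."""
    try:
        return datetime.strptime(text, "%Y-%m-%d").date()
    except (TypeError, ValueError):
        return None

# The rows of the example table above.
rows = [
    (1, "45.67", "2024-03-01"),
    (2, "32.50", "01-03-2024"),
    (3, "error", "invalid"),
]

cleaned = [(i, try_cast_float(p), try_cast_iso_date(d)) for i, p, d in rows]
print(cleaned)
```

Row 2 keeps its price but loses its date because the date is not in the assumed format; row 3 loses both values, and no exception stops the run.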

📈 Relevance to Data Science

Use Case | Relevance
ETL (Extract, Transform, Load) | Convert text to proper types during ingestion
Feature Engineering | Convert categorical types to numeric for ML modeling
Data Cleaning | Handle inconsistent formats or nulls
Aggregation/Analysis | Numeric types needed for SUM, AVG, etc.
Date Calculations | Need date/time formats for trend analysis, time series
Dashboard & Reporting | String formatting for readable outputs

⚠️Important Notes
 Always validate input before converting (TRY_CAST preferred in
pipelines).
 Be cautious when converting large decimals to INT — data may be
lost.
 Conversion from string to date is locale-sensitive.
🧠 Summary
Function | Safe? | Standard? | Notes
CAST() | ❌ | ✅ | Cross-platform, fails on bad data
CONVERT() | ❌ | ❌ (T-SQL) | Good for formatting, MS SQL only
PARSE() | ❌ | ❌ | Locale-sensitive, slower
TRY_CAST() | ✅ | ❌ (T-SQL) | Avoids crashes, returns NULL on error
TRY_CONVERT() | ✅ | ❌ (T-SQL) | Like TRY_CAST but with formatting

🌟 Why Are General Functions Important?
In real-world datasets:

 Missing values (NULLs) are common.

 You often need default substitutes or need to avoid errors from
unexpected NULLs.
 These functions help control how NULLs behave in your SQL logic
and analysis.

🔹 1. COALESCE()
✅ Purpose:

Returns the first non-null value in a list of arguments.

✅ Syntax:

✅ Example:

✅ Real Table Example:

✅ Use in Data Science:

 Filling missing values

 Providing default fallbacks
 Used in feature engineering to ensure no NULLs during modeling
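
A minimal COALESCE sketch using sqlite3. The contacts table and its columns are hypothetical; the query falls back from mobile to landline to a fixed default.

```python
import sqlite3

# Hypothetical table and rows, invented for illustration.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE contacts (name TEXT, mobile TEXT, landline TEXT);
    INSERT INTO contacts VALUES
        ('Asha', '555-0101', NULL),
        ('Ben',  NULL,       '555-0202'),
        ('Chen', NULL,       NULL);
""")

# COALESCE returns the first argument that is not NULL, scanning left to right.
phones = conn.execute("""
    SELECT name, COALESCE(mobile, landline, 'no phone') AS best_phone
    FROM contacts
    ORDER BY name
""").fetchall()
conn.close()

print(phones)
```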
🔹 2. NVL() (Oracle SQL Specific)
✅ Purpose:

Replaces NULL with a specified replacement value.

✅ Syntax:

✅ Example:

✅ Real Use:

✅ Use in Data Science:

 Used in Oracle-based systems to avoid NULLs in aggregations

 Essential for default value imputation

🔸 Note: Equivalent in other databases:

 MySQL, PostgreSQL: use COALESCE() instead (MySQL also offers IFNULL(); SQL Server offers ISNULL())

🔹 3. NULLIF()
✅ Purpose:

Returns NULL if both expressions are equal; otherwise, returns the first
expression.

✅ Syntax:

✅ Example:

✅ Use Case:

Avoiding division by zero:

✅ Use in Data Science:

 Avoiding zero division errors

 Used in data transformation pipelines when checking redundant
values
 Helps handle conditional replacements in queries
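
A sketch of NULLIF as a division guard, using sqlite3 with a hypothetical campaigns table. One caveat: SQLite itself returns NULL for division by zero, so the guard matters most on engines that raise an error instead (PostgreSQL and SQL Server, for example). The query pattern is the same either way.

```python
import sqlite3

# Hypothetical table and rows, invented for illustration.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE campaigns (name TEXT, clicks INTEGER, impressions INTEGER);
    INSERT INTO campaigns VALUES ('A', 50, 1000), ('B', 0, 0);
""")

# NULLIF(x, y) is NULL when x = y, otherwise x.
same, different = conn.execute("SELECT NULLIF(5, 5), NULLIF(5, 3)").fetchone()

# A zero denominator becomes NULL, so the ratio is NULL rather than an error.
ctr_rows = conn.execute("""
    SELECT name, clicks * 1.0 / NULLIF(impressions, 0) AS ctr
    FROM campaigns
    ORDER BY name
""").fetchall()
conn.close()

print(same, different, ctr_rows)
```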

🧠 Summary Comparison Table

Function | Description | Usage Example | Output
COALESCE() | Returns the first non-null value | COALESCE(NULL, NULL, 'SQL') | 'SQL'
NVL() | Replaces NULL with a given value (Oracle only) | NVL(NULL, 0) | 0
NULLIF() | Returns NULL if both values are equal | NULLIF(5, 5) | NULL

📊 Relevance to Data Science

Area | Use Case Example
Data Cleaning | Replace NULLs with fallback values (COALESCE, NVL)
Feature Engineering | Fill empty values with alternate sources
Error Handling | Prevent divide-by-zero or logic errors using NULLIF
ETL/Preprocessing | Defaulting and conditional logic during transformation
Analytics/Reporting | Ensure no NULLs in outputs for dashboards or ML models

🧪 Example Query in a Data Pipeline

🌟 Why Use Conditional Expressions?

In real-world datasets:

 You often need conditional logic: e.g., apply a discount if the purchase amount is high.
 Conditional expressions help in transforming, labeling, handling
NULLs, and customizing outputs for reports and models.

🔹 1. IF Expression (Non-standard SQL)
✅ Available in:

 MySQL supports it directly.

 Not available as a function in PostgreSQL or SQL Server (SQL Server offers IIF() instead).
✅ Syntax (MySQL):

✅ Example:

✅ Use in Data Science:

 Labeling or bucketing values (e.g., salary levels)

 Creating derived features from existing columns

🔹 2. CASE Statement – Most Powerful & Standard
✅ Standard SQL Conditional Expression

Used across all major databases (MySQL, PostgreSQL, SQL Server, Oracle,
etc.)
✅ Syntax:

🔸 Simple CASE:

🔸 Searched CASE:
✅ Example:

✅ Use in Data Science:

 Label encoding (e.g., converting numeric scores into letter grades)

 Binning/Bucketing
 Custom categorization for classification problems
 Conditional aggregations in queries
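
A sketch of a searched CASE used for binning and for conditional aggregation, run through sqlite3. The students table and the grade cut-offs (85 and 60) are assumptions made up for the example.

```python
import sqlite3

# Hypothetical table and rows, invented for illustration.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE students (name TEXT, score INTEGER);
    INSERT INTO students VALUES ('Asha', 92), ('Ben', 71), ('Chen', 48);
""")

# Binning: the first WHEN branch that is true decides the label.
grades = conn.execute("""
    SELECT name,
           CASE
               WHEN score >= 85 THEN 'A'
               WHEN score >= 60 THEN 'B'
               ELSE 'C'
           END AS grade
    FROM students
    ORDER BY score DESC
""").fetchall()

# Conditional aggregation: count only the rows that satisfy a rule.
passed = conn.execute("""
    SELECT SUM(CASE WHEN score >= 60 THEN 1 ELSE 0 END) FROM students
""").fetchone()[0]
conn.close()

print(grades, passed)
```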

🔹 3. NULL in Conditions
✅ Remember: NULL is unknown in SQL.

 Any comparison with NULL using = or != results in NULL, not TRUE or FALSE.
 You must use IS NULL or IS NOT NULL to check for NULL values.
✅ Example:

✅ Use in Data Science:

 Handling missing values (feature engineering)

 Data profiling and audit checks for data quality
 Creating flags for nulls as model inputs
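
A short sketch of how NULL behaves in conditions, using sqlite3 and a hypothetical users table with one missing age.

```python
import sqlite3

# Hypothetical table and rows, invented for illustration.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE users (name TEXT, age INTEGER);
    INSERT INTO users VALUES ('Asha', 30), ('Ben', NULL);
""")

# NULL = NULL evaluates to NULL (None in Python), not TRUE.
null_equals_null = conn.execute("SELECT NULL = NULL").fetchone()[0]

# '= NULL' never matches a row; 'IS NULL' does.
wrong_count = conn.execute("SELECT COUNT(*) FROM users WHERE age = NULL").fetchone()[0]
right_count = conn.execute("SELECT COUNT(*) FROM users WHERE age IS NULL").fetchone()[0]

# A 0/1 missing-value flag that could be used as a model input.
flags = conn.execute(
    "SELECT name, age IS NULL AS age_missing FROM users ORDER BY name"
).fetchall()
conn.close()

print(null_equals_null, wrong_count, right_count, flags)
```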

❌ GOTO in SQL?
 GOTO is NOT a part of standard SQL.
 It's available only in procedural extensions (like PL/SQL, T-SQL, or
SQL scripts).
 Not used in typical data science SQL workflows.

🧠 Summary Table
Expression | Supported In | Purpose | Example Use
IF() | MySQL | Simple conditional logic | IF(score>50,'Pass','Fail')
CASE | All SQL | Multi-condition logic | Label encoding, tiering, rule-based logic
NULL | All SQL | Handle missing values | IS NULL, IS NOT NULL
GOTO | Procedural extensions only (PL/SQL, T-SQL) | Control flow in procedures | Not used in analytical SQL

📊 Relevance in Data Science

Application Area | How Conditional Expressions Help
Feature Engineering | Create custom variables based on business logic
Missing Value Handling | Detect or flag NULLs for imputation
Label Encoding | Translate raw numeric/text values into buckets or categories
Data Transformation | Apply rule-based formatting for better analytics or visuals
Custom Aggregations | Conditional logic in aggregations (e.g., count only premium users)
Preprocessing for ML | Prepare structured inputs for machine learning models
📌 Advanced Example

✅ This is an example of segmentation, a common data science technique.

📅 Why Date and Time Functions Matter?
In data science, time-series data is everywhere:

 Web traffic logs

 Sales over time
 User activity
 Financial transactions

✅ Date/Time functions allow you to:

 Extract parts (like year, month)

 Perform arithmetic (e.g., difference in days)
 Format dates for reporting
 Filter or group data by time periods
🧩 Common Date and Time Functions (Across SQL Engines)
🔹 1. CURRENT_DATE / GETDATE() / SYSDATE

Return the current system date. Note that GETDATE() and SYSDATE also include the time of day.

SQL Engine Function

PostgreSQL CURRENT_DATE
MySQL CURDATE()
SQL Server GETDATE()
Oracle SYSDATE

Example:

🔹 2. NOW()

Returns current date and time.

🔹 3. DATEPART() / EXTRACT()

Returns specific components from a date.

PostgreSQL:

SQL Server:

🔹 4. DATEDIFF()

Returns the difference between two dates in days, months, or other units.

MySQL (PostgreSQL has no DATEDIFF(); subtract the dates or use AGE() instead):

SQL Server:
🔹 5. DATE_ADD() / DATE_SUB()

Used for adding/subtracting days/months/years.

MySQL:

🔹 6. TO_CHAR() and FORMAT()

Used for formatting date values as strings.

🔹 7. DATE_TRUNC()

Truncates the date to a specific part (year, month, day).

PostgreSQL:

Great for grouping by time intervals.

🔹 8. EXTRACT()

Pulls a specific part from a date (standard SQL).

🔹 9. AGE() (PostgreSQL)

Returns the difference between two dates in years, months, and days.

🧠 Use Cases in Data Science

Scenario | SQL Function Example
Calculate user age | DATEDIFF(YEAR, birthdate, GETDATE()) (approximate: counts year boundaries crossed)
Find users inactive for 30+ days | WHERE DATEDIFF(day, last_login, GETDATE()) > 30
Group sales by month | GROUP BY DATE_TRUNC('month', sale_date)
Filter for records in last week | WHERE sale_date >= DATE_SUB(CURDATE(), INTERVAL 7 DAY)
Extract hour of the day | EXTRACT(HOUR FROM login_time)
Format for reporting | TO_CHAR(date_column, 'Month YYYY')
Create time series windows | DATE_TRUNC('day', timestamp) + GROUP BY

🔎 Sample Query

✅ This retrieves recent orders, breaks them down by month/year, and computes days since each order.
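
A sketch of a query in that spirit, written for SQLite so it can run through Python's sqlite3 module. SQLite uses strftime(), date(), and julianday() rather than the engine-specific functions listed earlier. The orders table is hypothetical, "recent" is assumed to mean the last 30 days, and a fixed reference date is passed in so the output is reproducible.

```python
import sqlite3

# Hypothetical table and rows, invented for illustration.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE orders (order_id INTEGER PRIMARY KEY, order_date TEXT);
    INSERT INTO orders VALUES (1, '2025-02-10'), (2, '2025-03-20'), (3, '2025-04-01');
""")

# Fixed reference date instead of 'now' so results do not change from day to day.
params = {"today": "2025-04-14"}

recent_orders = conn.execute("""
    SELECT order_id,
           CAST(strftime('%Y', order_date) AS INTEGER) AS order_year,
           CAST(strftime('%m', order_date) AS INTEGER) AS order_month,
           CAST(julianday(:today) - julianday(order_date) AS INTEGER) AS days_since
    FROM orders
    WHERE order_date >= date(:today, '-30 days')
    ORDER BY order_date
""", params).fetchall()
conn.close()

print(recent_orders)
```

Comparing the ISO-formatted text dates with >= works because YYYY-MM-DD strings sort in chronological order.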

🧬 Summary Table
Function | Purpose | Example Usage
NOW() | Current date + time | NOW()
CURDATE() | Current date | CURDATE()
DATEDIFF() | Days between two dates | DATEDIFF('2025-04-14', '2025-04-01')
DATE_ADD() | Add days/months/years | DATE_ADD(date, INTERVAL 5 DAY)
EXTRACT() | Extract part (year, month) | EXTRACT(MONTH FROM order_date)
DATE_TRUNC() | Truncate to a time unit | DATE_TRUNC('month', date)
FORMAT() | Format dates | FORMAT(date, 'yyyy-MM')
AGE() | Get age/time difference (PostgreSQL) | AGE('2025-04-14', '2000-04-14')

🔚 Final Thoughts
 Time-based data is critical in forecasting, churn prediction,
behavioral analysis, and anomaly detection.
 SQL date/time functions provide everything needed to preprocess,
analyze, and aggregate time data effectively before feeding it into
visualization tools or ML models.

🔢 What Are Numeric Functions in SQL?
Numeric functions are built-in SQL functions that perform mathematical
operations on numeric data types (like INT, FLOAT, DECIMAL, etc.).

These functions are essential for data analysis, reporting, and preprocessing,
especially in data science, where numerical accuracy and transformations are
key for statistical computations and machine learning tasks.

✅ Common SQL Numeric Functions

Let’s explore the most frequently used numeric functions with examples.

1. ABS() – Absolute Value

Returns the absolute (positive) value of a number.

🔹 Use in Data Science: Used when calculating distance, error margin, or
loss values where only magnitude matters.

2. CEILING() / CEIL() – Round Up

Rounds a decimal up to the nearest integer.

🔹 Useful in allocating resources (e.g., rounding up to nearest hour, container, etc.).

3. FLOOR() – Round Down

Rounds a decimal down to the nearest integer.

🔹 Used in binning values, age groups, floor-level pricing.

4. ROUND(number, decimal_places) – Rounding

Rounds a number to a specified number of decimal places.

🔹 Important in finance and reporting, especially when rounding values for dashboards or visualizations.

5. POWER(base, exponent) – Exponentiation

Raises a number to a power.

🔹 Used in formulas like variance, square, exponential decay models.

6. SQRT() – Square Root

Returns the square root of a number.

🔹 Crucial in machine learning: standard deviation, Euclidean distance,
normalization.

7. EXP() – Exponential Function

Returns e raised to the power of the number (e^x).

🔹 Used in exponential growth modeling, compound interest, or decay curves.

8. LOG() / LOG10() / LN()

 LOG10() → Base 10 logarithm

 LN() → Natural logarithm (base e)

 LOG() → Base depends on the engine: natural logarithm in MySQL and SQL Server, base 10 in PostgreSQL

🔹 Data Science use: Log transformations help deal with skewed data,
e.g., in regression or scaling.
9. MOD(n, m) – Modulus (Remainder)

Returns the remainder of n ÷ m.

🔹 Common in grouping operations, hashing, time bucketing.

10. SIGN() – Sign of a Number

Returns:

 1 for positive
 -1 for negative
 0 for zero

🔹 Used in analytics to detect trends, changes, or outliers.

📊 Practical Example

🔍 This example:

 Calculates total price with rounding

 Floors and ceils prices
 Uses ABS to get loss value
 Uses MOD to pick even-order IDs
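
A partial sketch of such a query, run through sqlite3 with a hypothetical orders table. It covers ROUND, ABS, and the modulus test (written with the % operator, which SQLite uses in place of MOD()). FLOOR and CEILING are left out because their availability in SQLite depends on build options.

```python
import sqlite3

# Hypothetical table and rows, invented for illustration.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE orders (order_id INTEGER, unit_price REAL, quantity INTEGER, cost REAL);
    INSERT INTO orders VALUES
        (2, 2.5, 4, 12.0),
        (3, 9.99, 1, 5.0),
        (4, 1.25, 3, 5.0),
        (6, 19.99, 3, 50.0);
""")

numeric_rows = conn.execute("""
    SELECT order_id,
           ROUND(unit_price * quantity, 2)             AS total_price,
           ROUND(ABS(unit_price * quantity - cost), 2) AS abs_margin
    FROM orders
    WHERE order_id % 2 = 0      -- keep even order IDs only
    ORDER BY order_id
""").fetchall()
conn.close()

print(numeric_rows)
```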

📈 Numeric Functions in Data Science

Function | Data Science Use Case
ABS() | Used in MAE (Mean Absolute Error), price deviation
SQRT() | Root mean square error (RMSE), normalization
LOG() | Log transform for normalizing skewed distributions
POWER() | Calculating squares/cubes for distance metrics or regression formulas
ROUND() | Prepping data for display, reporting, or precision control
CEILING() | Rounding resource usage (e.g., CPU hours, billing)
MOD() | Grouping by intervals, filtering even/odd samples
SIGN() | Identifying positive/negative trends, e.g., stock prices, sales growth
🧠 Summary Table
Function Description Example
ABS() Absolute value ABS(-10) → 10
ROUND() Round to decimal places ROUND(3.456, 2)
CEILING() Round up CEILING(3.2) → 4
FLOOR() Round down FLOOR(3.8) → 3
POWER() Exponent POWER(2, 3)
SQRT() Square root SQRT(9)
EXP() e raised to power EXP(1)
LOG10() Log base 10 LOG10(100)
LN() Natural logarithm LN(2.71)
MOD() Remainder MOD(10, 3)
SIGN() Sign of number SIGN(-5) → -1

🧬 Final Thoughts
Numeric functions are:

 The foundation for arithmetic and statistical analysis in SQL

 Often used in data cleaning, engineering, reporting, and ML
pipelines
 A great way to perform in-database computations before exporting
to Python or R

🔤 What Are String Functions in SQL?

String functions in SQL are used to manipulate, analyze, or transform
string/textual data stored in tables. These functions are critical for:

 Cleaning text data

 Parsing and formatting strings
 Extracting features from text (e.g., names, IDs)
 Prepping data for NLP tasks in Data Science
✅ Common SQL String Functions
Let’s explore the most widely used string functions with examples and use cases.

1. LENGTH() or LEN() – Get Length

Returns the number of characters in a string.

📌 Use in Data Science:

 Text feature engineering (e.g., tweet length)

 Detect unusually long/short entries

2. UPPER() and LOWER() – Case Conversion

 UPPER(): Converts all characters to uppercase

 LOWER(): Converts all characters to lowercase

📌 Used for case-insensitive matching, data normalization.

3. SUBSTRING() or SUBSTR() – Extract Substring

Extracts a part of the string starting from a position.

📌 Useful for extracting fields, such as:

 Year from date string

 Domain from email

4. TRIM() – Remove Spaces

Removes leading and trailing spaces (or other characters).

📌 Essential for data cleaning: removes whitespace that causes false mismatches.

5. LTRIM() and RTRIM() – Trim Left/Right

 LTRIM() → Removes leading spaces

 RTRIM() → Removes trailing spaces

📌 Used in column cleanup before joins or comparisons.

6. REPLACE() – Replace Substrings

Replaces occurrences of a substring.

📌 Very helpful in normalizing data or fixing typos.

7. CONCAT() – Combine Strings

Combines two or more strings.

📌 Combine fields like first + last names, or email + domain.

8. INSTR() – Find Position of Substring

Returns position of first occurrence of a substring.

📌 Helpful in parsing or validating strings (e.g., domain in email).

9. LEFT() and RIGHT() – Get Substring from Ends

 LEFT(string, N) – First N characters

 RIGHT(string, N) – Last N characters

📌 Used in extracting identifiers, codes, prefixes.

10. REVERSE() – Reverse a String

Reverses the input string.

📌 Rare in analytics, but useful in some text problems or puzzles.

11. ASCII() – ASCII Value of First Character

Returns ASCII code of the first character.

📌 Useful for encoding, classification, or ordering characters.

12. CHARINDEX() – Find Index of Character/Substring

Returns index of a substring in a string.

📌 For pattern extraction, validations.

🧠 Real-World Example:

This query:

 Cleans and standardizes names/emails

 Extracts area codes
 Filters comments that mention “refund”
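
A sketch of a query along those lines, written for SQLite and run through sqlite3. The feedback table, its columns, and the assumption that the area code is the first three characters of the phone number are all hypothetical. SQLite uses SUBSTR() and INSTR(); other engines may need SUBSTRING() or CHARINDEX().

```python
import sqlite3

# Hypothetical table and rows, invented for illustration.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE feedback (name TEXT, email TEXT, phone TEXT, comment TEXT)")
conn.executemany(
    "INSERT INTO feedback VALUES (?, ?, ?, ?)",
    [
        ("  Asha Rao ", " ASHA@Example.com", "415-555-0101", "Please REFUND my order"),
        ("Ben", "ben@example.com", "212-555-0199", "Great service"),
    ],
)

cleaned_rows = conn.execute("""
    SELECT TRIM(name)                AS clean_name,
           LOWER(TRIM(email))        AS clean_email,
           SUBSTR(TRIM(phone), 1, 3) AS area_code,
           SUBSTR(LOWER(TRIM(email)),
                  INSTR(LOWER(TRIM(email)), '@') + 1) AS email_domain
    FROM feedback
    WHERE LOWER(comment) LIKE '%refund%'
""").fetchall()
conn.close()

print(cleaned_rows)
```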

📈 String Functions in Data Science

Function | Use in Data Science
LENGTH() | Text feature extraction (e.g., comment length, tweet size)
TRIM() | Cleaning user input
REPLACE() | Normalize inconsistent labels
SUBSTRING() | Extract year, domain, product code
UPPER()/LOWER() | Case normalization for accurate matching
CONCAT() | Building identifiers, URLs, full names
INSTR()/CHARINDEX() | Detect keyword presence in text
LEFT()/RIGHT() | Extract codes or indicators from string IDs
📋 Summary Table
Function | Purpose | Example → Result
LENGTH() | Length of string | LENGTH('abc') → 3
TRIM() | Remove whitespace | TRIM(' abc ') → 'abc'
REPLACE() | Replace substring | REPLACE('AI', 'AI', 'ML') → 'ML'
SUBSTRING() | Extract part of string | SUBSTRING('Hello', 2, 3) → 'ell'
CONCAT() | Combine strings | CONCAT('Data', 'Science') → 'DataScience'
UPPER()/LOWER() | Change case | LOWER('SQL') → 'sql'
LEFT()/RIGHT() | Left/right characters | LEFT('12345', 2) → '12'
INSTR()/CHARINDEX() | Find substr index | INSTR('DataSci', 'Sci') → 5

🧬 Final Thoughts
String functions are foundational tools in SQL for:

✅ Text data cleaning

✅ Standardization before analysis
✅ Feature engineering for NLP
✅ Searching and filtering text

🔍 What is a Subquery?
A subquery (or inner query) is a SQL query nested inside another
query (called the outer query). Subqueries are used to:

 Retrieve intermediate results

 Filter data based on computed values
 Perform comparisons, aggregations, and joins indirectly

They are enclosed in parentheses and can appear in the SELECT, FROM, or
WHERE clause.
✅ Why Use Subqueries?
Subqueries:

 Break complex problems into simpler steps

 Allow you to use aggregated data as filters
 Are essential in Data Science for data wrangling, feature
extraction, nested filtering, etc.

🧠 Types of Subqueries
Type | Description
Single-row subquery | Returns one value (used with =, <, >)
Multi-row subquery | Returns multiple rows (used with IN, ANY, ALL)
Correlated subquery | References columns from outer query; runs row-by-row
Scalar subquery | Returns a single value; often used in SELECT or WHERE
Nested subqueries | Subqueries inside another subquery (multi-level)
📌 Syntax
1. In WHERE Clause

✅ Filters employees working in the Data Science department.

2. In FROM Clause

✅ Calculates average salaries per department and filters those above 80,000.
3. In SELECT Clause

✅ Adds a column with total projects for each employee.
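
The three placements can be sketched with sqlite3 as follows. The departments, employees, and projects tables and their rows are hypothetical, chosen so the results match the three descriptions above.

```python
import sqlite3

# Hypothetical tables and rows, invented for illustration.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE departments (dept_id INTEGER PRIMARY KEY, dept_name TEXT);
    CREATE TABLE employees (emp_id INTEGER PRIMARY KEY, name TEXT, dept_id INTEGER, salary INTEGER);
    CREATE TABLE projects (project_id INTEGER PRIMARY KEY, emp_id INTEGER);
    INSERT INTO departments VALUES (1, 'Data Science'), (2, 'Sales');
    INSERT INTO employees VALUES
        (1, 'Asha', 1, 95000), (2, 'Ben', 1, 85000), (3, 'Chen', 2, 60000);
    INSERT INTO projects VALUES (1, 1), (2, 1), (3, 2);
""")

# 1. WHERE clause: the subquery supplies the list of matching department IDs.
in_where = conn.execute("""
    SELECT name FROM employees
    WHERE dept_id IN (SELECT dept_id FROM departments WHERE dept_name = 'Data Science')
    ORDER BY name
""").fetchall()

# 2. FROM clause: the subquery acts as a derived table that can be filtered.
in_from = conn.execute("""
    SELECT dept_id, avg_salary
    FROM (SELECT dept_id, AVG(salary) AS avg_salary FROM employees GROUP BY dept_id) AS dept_avg
    WHERE avg_salary > 80000
""").fetchall()

# 3. SELECT clause: a scalar subquery adds one computed value per row.
in_select = conn.execute("""
    SELECT e.name,
           (SELECT COUNT(*) FROM projects AS p WHERE p.emp_id = e.emp_id) AS total_projects
    FROM employees AS e
    ORDER BY e.emp_id
""").fetchall()
conn.close()

print(in_where, in_from, in_select)
```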

🔁 Correlated Subqueries
A correlated subquery uses a value from the outer query. Conceptually, it is
evaluated once per row of the outer query, although the optimizer may rewrite it.

✅ Returns employees who earn more than the average salary in their
department.
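
A minimal sketch of that correlated subquery, run through sqlite3 on a hypothetical employees table.

```python
import sqlite3

# Hypothetical table and rows, invented for illustration.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE employees (name TEXT, dept_id INTEGER, salary INTEGER);
    INSERT INTO employees VALUES
        ('Asha', 1, 95000), ('Ben', 1, 85000),
        ('Chen', 2, 60000), ('Dev', 2, 70000);
""")

# The inner query refers to e.dept_id from the outer row, which makes it correlated.
above_avg = conn.execute("""
    SELECT e.name, e.salary
    FROM employees AS e
    WHERE e.salary > (
        SELECT AVG(e2.salary)
        FROM employees AS e2
        WHERE e2.dept_id = e.dept_id
    )
    ORDER BY e.name
""").fetchall()
conn.close()

print(above_avg)
```

Department 1 averages 90,000 and department 2 averages 65,000, so Asha and Dev are returned.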
🔄 Nested Subqueries

✅ Get all employees who work under the manager "Alice".

⚡ Subquery vs JOIN
Feature | Subquery | JOIN
Use case | Filtering, computed conditions | Combine multiple tables
Performance | Can be slower (especially correlated) | Often faster with large datasets, though many optimizers rewrite subqueries as joins
Readability | Easier for nested logic | Better for flat structure

📈 Subqueries in Data Science

Subqueries are highly relevant in:

 Data preprocessing: Extracting filtered groups or features

 Anomaly detection: Finding outliers (e.g., values greater than group
avg)
 Feature engineering: Adding computed metrics (e.g., total
transactions, max ratings)
 Nested filters: Filter with dynamic thresholds or rules

🧪 Example in Data Science Workflow

✅ A subquery is used to calculate and compare group-level aggregates, a common pattern in data analysis.

🧾 Summary Table
Type | Example Use Case | Syntax Example
Single-row | Compare to one value (e.g., max) | salary = (SELECT MAX(salary)...)
Multi-row | Compare to list | WHERE id IN (SELECT emp_id FROM...)
Scalar | Add a value column | SELECT name, (SELECT COUNT(...)...)
Correlated | Depends on outer row | WHERE salary > (SELECT AVG(...) WHERE...)
Nested | Complex multi-level filters | SELECT ... WHERE id = (SELECT ... (SELECT ...))

🧬 Final Thoughts
Subqueries are a powerful tool in SQL and are:

 Vital for data exploration and preprocessing

 Frequently used in reporting and data science pipelines
 Help extract complex data patterns without creating intermediate
tables

📘 What are Window Functions?

Window functions perform calculations across a set of table rows
that are somehow related to the current row. Unlike aggregate
functions, they do not collapse rows—they return a value for each row.

🔑 Keyword: OVER()

These are especially useful in analytics, data summarization, and trend analysis, which are core tasks in Data Science.

🎯 Use Cases of Window Functions in Data Science
 Ranking customers based on purchases
 Finding moving averages, running totals
 Calculating percentiles or lead/lag comparisons
 Feature generation for machine learning (e.g., trends, change over
time)
🔢 Types of Ranking Functions (Window Functions)
Function | Description
RANK() | Assigns a rank to each row; ties get same rank, next rank is skipped
DENSE_RANK() | Like RANK(), but no rank gaps after ties
ROW_NUMBER() | Assigns a unique number to each row, no duplicates, even for ties
NTILE(n) | Divides rows into n roughly equal buckets (e.g., quartiles when n = 4)

🧠 Basic Syntax of a Window Function

🔹 Explanation:

 PARTITION BY: Splits the data into groups (like GROUP BY, but doesn't
collapse rows).
 ORDER BY: Defines how ranking or calculations are performed within
each partition.
✅ Example 1: RANK()

📌 Ranks employees by salary within each department.

✅ Example 2: DENSE_RANK()

📌 Similar to RANK() but no skipped values for ties.

✅ Example 3: ROW_NUMBER()

📌 Assigns a unique row number, even if two salaries are equal.

✅ Example 4: NTILE()

📌 Splits the employees into 4 salary quartiles.
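
The four ranking functions can be compared side by side in one query. This sketch uses sqlite3 and assumes a SQLite build with window-function support (3.25 or later). The employees table is hypothetical; with only four rows it uses NTILE(2) rather than NTILE(4), and ROW_NUMBER and NTILE add name as a tie-breaker so the output is repeatable.

```python
import sqlite3

# Hypothetical table and rows, invented for illustration; Asha and Ben tie on salary.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE employees (name TEXT, dept TEXT, salary INTEGER);
    INSERT INTO employees VALUES
        ('Asha', 'DS', 100), ('Ben', 'DS', 100), ('Chen', 'DS', 90), ('Dev', 'DS', 80);
""")

rank_rows = conn.execute("""
    SELECT name, salary,
           RANK()       OVER (PARTITION BY dept ORDER BY salary DESC)       AS rnk,
           DENSE_RANK() OVER (PARTITION BY dept ORDER BY salary DESC)       AS dense_rnk,
           ROW_NUMBER() OVER (PARTITION BY dept ORDER BY salary DESC, name) AS row_num,
           NTILE(2)     OVER (PARTITION BY dept ORDER BY salary DESC, name) AS half
    FROM employees
    ORDER BY salary DESC, name
""").fetchall()
conn.close()

for row in rank_rows:
    print(row)
```

After the tie at the top, RANK jumps to 3 while DENSE_RANK continues with 2, and ROW_NUMBER never repeats.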

🔁 Analytical Window Functions (with OVER)
Other non-ranking window functions:

Function | Purpose | Example Use Case
SUM() | Running total | Cumulative sales
AVG() | Moving average | 7-day rolling mean
LAG(), LEAD() | Compare current vs previous/next row | Price change detection
FIRST_VALUE() | First value in partition | Identify initial status
LAST_VALUE() | Last value in partition (needs a frame that extends to the end of the partition) | End state of a process
✅ Example 5: LAG() and LEAD()

📌 Tracks previous and next hire dates in sequence.
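
A small LAG()/LEAD() sketch using sqlite3, again assuming window-function support (SQLite 3.25 or later). The employees table and hire dates are hypothetical.

```python
import sqlite3

# Hypothetical table and rows, invented for illustration.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE employees (name TEXT, hire_date TEXT);
    INSERT INTO employees VALUES
        ('Asha', '2021-01-10'), ('Ben', '2022-06-01'), ('Chen', '2023-03-15');
""")

# LAG looks one row back and LEAD one row ahead, in hire_date order.
hire_rows = conn.execute("""
    SELECT name, hire_date,
           LAG(hire_date)  OVER (ORDER BY hire_date) AS previous_hire,
           LEAD(hire_date) OVER (ORDER BY hire_date) AS next_hire
    FROM employees
    ORDER BY hire_date
""").fetchall()
conn.close()

for row in hire_rows:
    print(row)
```

The first row has no previous hire and the last row has no next hire, so those cells are NULL.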

📈 Use in Data Science

Use Case Description
Feature Engineering Rank, moving averages, lead-lag variables
Time Series Analysis Windowing over dates, detecting trends
Customer Segmentation NTILE or RANK based on revenue/engagement
Fraud Detection Compare current and previous transactions
Churn Prediction Track last purchase date or account activity

🧪 Example in Data Science Scenario

Goal: Rank customers by purchase amount within each month
📌 Helps identify top spenders per month – useful for marketing or loyalty
campaigns.
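
One way to sketch this goal with sqlite3 (window functions require SQLite 3.25 or later). The purchases table is hypothetical, and the month is derived with SQLite's strftime(); other engines would use DATE_TRUNC() or a similar function.

```python
import sqlite3

# Hypothetical table and rows, invented for illustration.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE purchases (customer TEXT, purchase_date TEXT, amount REAL);
    INSERT INTO purchases VALUES
        ('Asha', '2025-01-05', 100), ('Asha', '2025-01-20', 50),
        ('Ben',  '2025-01-11', 120), ('Ben',  '2025-02-02', 30),
        ('Chen', '2025-02-09', 80);
""")

# Step 1 (CTE): total spend per customer per month.
# Step 2: rank customers inside each month by that total.
monthly_ranks = conn.execute("""
    WITH monthly AS (
        SELECT strftime('%Y-%m', purchase_date) AS purchase_month,
               customer,
               SUM(amount) AS total
        FROM purchases
        GROUP BY purchase_month, customer
    )
    SELECT purchase_month, customer, total,
           RANK() OVER (PARTITION BY purchase_month ORDER BY total DESC) AS spend_rank
    FROM monthly
    ORDER BY purchase_month, spend_rank
""").fetchall()
conn.close()

for row in monthly_ranks:
    print(row)
```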

⚠️Common Mistakes
Mistake Correction
Using RANK() with no PARTITION BY Ranks across all rows, not grouped
Confusing RANK() vs ROW_NUMBER() Ties vs unique numbers
Forgetting ORDER BY in OVER() No order = arbitrary, non-repeatable ranking

🔚 Summary Table
Function | Handles Ties? | Gaps in Rank? | Description
RANK() | Yes | Yes | Assigns same rank to ties
DENSE_RANK() | Yes | No | Same rank, but next rank not skipped
ROW_NUMBER() | No | No | Unique rank for each row
NTILE(n) | N/A | N/A | Divides into n quantile groups

💡 Final Thoughts
Window functions like RANK(), ROW_NUMBER(), and LAG() are
indispensable for:

 Ranking, comparison, and analytics

 Temporal and hierarchical analysis
 Data transformations in pipelines

They are heavily used in SQL-backed data science platforms like Snowflake, Redshift, BigQuery, and PostgreSQL.
🔗 What Does It Mean to Integrate Python with SQL?
Integrating Python with SQL refers to using Python code to connect
to, query, and manipulate data stored in SQL databases like MySQL,
PostgreSQL, SQLite, etc.

This enables:

 Automating data retrieval

 Combining Python’s data processing power (Pandas, NumPy) with
SQL’s querying power
 Building data pipelines and dashboards
 Performing advanced analytics and machine learning using SQL data

📦 Common Python Libraries for SQL Integration
Library | Description | Use Case
sqlite3 | Built-in Python module for SQLite databases | Lightweight, local storage
mysql-connector-python / PyMySQL | For connecting to MySQL databases | Web apps, backend integration
psycopg2 | Popular PostgreSQL adapter for Python | Scalable data applications
SQLAlchemy | ORM + SQL toolkit for Python | Abstract database interactions
Pandas | Read/write SQL tables with read_sql, to_sql | Data analysis & transformation
🧱 SQL Integration Structure
1. Connect to the database
2. Write/Execute SQL queries
3. Fetch and store results
4. Manipulate data using Python (e.g., Pandas)
5. Close the connection
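
The five steps above can be sketched with the built-in sqlite3 module. The sales table is hypothetical and lives in an in-memory database, and plain Python stands in for Pandas in step 4 to keep the example dependency-free.

```python
import sqlite3
from contextlib import closing

# 1. Connect (closing() guarantees step 5 even if a query fails).
with closing(sqlite3.connect(":memory:")) as conn:
    # Hypothetical table and rows, invented for illustration.
    conn.execute("CREATE TABLE sales (region TEXT, amount REAL)")
    conn.executemany(
        "INSERT INTO sales VALUES (?, ?)",
        [("North", 120.0), ("South", 80.0), ("North", 30.0)],
    )
    conn.commit()

    # 2. Execute a query and 3. fetch the results.
    rows = conn.execute(
        "SELECT region, SUM(amount) FROM sales GROUP BY region ORDER BY region"
    ).fetchall()

# 4. Manipulate the data in Python.
totals = dict(rows)

# 5. The connection is already closed at this point.
print(totals)
```

Note that `with sqlite3.connect(...) as conn:` on its own only commits or rolls back the transaction; it does not close the connection, which is why closing() is used here.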

✅ Example 1: Using sqlite3

✅ Example 2: Using Pandas with SQL

⚙️With MySQL or PostgreSQL

You’ll use:
MySQL Example:

🔍 Why Is SQL-Python Integration Important in Data Science?
Use Case | Benefit
ETL Pipelines | Extract data from SQL → Transform in Python → Load
Feature Engineering | SQL to filter/join + Python to create features
Model Deployment | Pull data → Predict → Save predictions back to SQL
Reporting and Visualization | Query + clean data → Visualize using Matplotlib/Seaborn
Handling Large Datasets | Use SQL for filtering → Python for analysis
🧠 Use Case Example: ML Workflow
1. Extract data from SQL using pandas.read_sql()
2. Clean & transform data in Pandas
3. Train model using Scikit-learn
4. Store predictions back in the SQL database using to_sql()

🧱 Using SQLAlchemy (Advanced Integration)
SQLAlchemy lets you write ORM-based queries:
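This approach is flexible and supports multiple database engines: the same model and query code can typically run against SQLite, MySQL, or PostgreSQL by changing the connection URL.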

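✅ Best Practices
Practice | Why?
Use parameterized queries | Prevents SQL injection
Use context managers | Releases connections and cursors automatically (with sqlite3, "with conn:" only commits or rolls back, so wrap the connection in contextlib.closing to close it)
Use SQLAlchemy or Pandas | Cleaner and more readable
Limit data pulled from SQL | Avoid memory overload in Python
Test queries before automation | Prevent data loss or logical errors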
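A minimal sketch of the first two practices with sqlite3. The users table and the find_user helper are hypothetical.

```python
import sqlite3
from contextlib import closing


def find_user(conn, name):
    """Look up a user by name with a parameterized query."""
    # The ? placeholder keeps user input out of the SQL text
    with closing(conn.cursor()) as cur:
        cur.execute("SELECT id, name FROM users WHERE name = ?", (name,))
        return cur.fetchall()


# closing() guarantees conn.close() runs when the block exits
with closing(sqlite3.connect(":memory:")) as conn:
    with conn:  # commits on success, rolls back on error
        conn.execute("CREATE TABLE users (id INTEGER PRIMARY KEY, name TEXT)")
        conn.execute("INSERT INTO users (name) VALUES (?)", ("alice",))

    found = find_user(conn, "alice")
    # An injection attempt is treated as a literal string and matches nothing
    injected = find_user(conn, "alice' OR '1'='1")

print(found, injected)
```

Placeholder syntax depends on the driver: sqlite3 uses ?, while mysql-connector-python and psycopg2 use %s.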

🧠 Summary
Task | SQL | Python Tool
Connect to DB | CREATE DATABASE, CONNECT | sqlite3, mysql.connector, psycopg2
Read Data | SELECT | pandas.read_sql()
Write Data | INSERT, UPDATE | df.to_sql()
Query Analysis | Joins, Aggregations | Pandas, Matplotlib, Scikit-learn
Data Modeling | Star schema, normalization | SQL + Pandas

💡 Final Thought
Python + SQL is a powerful combo in Data Science. It allows you to:

 Seamlessly move between data storage and analysis

 Build reliable, end-to-end pipelines
 Scale from small scripts to enterprise-grade systems
