SQL For Data Engineering

The document outlines a comprehensive roadmap for learning SQL specifically tailored for data engineering, divided into eight phases. It covers foundational SQL concepts, aggregations, joins, subqueries, window functions, complex queries, data engineering-specific SQL, and includes practice tests for each phase. Each phase provides practical examples and exercises to reinforce learning, culminating in a structured approach to mastering SQL for data engineering tasks.

Uploaded by

Ajay gadiparthi
SQL for Data Engineering – Full Roadmap

Phase 1: SQL Basics (Foundation)

• What is SQL? Types of SQL (DDL, DML, DCL, TCL)
• Databases, Tables, Schema, Primary/Foreign Keys
• Basic Queries:
  o SELECT, WHERE, ORDER BY, LIMIT
  o DISTINCT, BETWEEN, IN, LIKE, IS NULL
• Sorting & Filtering Data

✅ Practice Dataset: employees, departments
✅ Practice Example: Find employees hired after 2020, order by salary descending.

Phase 2: Aggregations & Grouping

• COUNT(), SUM(), AVG(), MIN(), MAX()
• GROUP BY, HAVING
• Advanced filtering with groups

✅ Real Example: Find average salary per department and filter where average salary > 70,000.

Phase 3: Joins (Core for Data Engineers)

• INNER JOIN, LEFT JOIN, RIGHT JOIN, FULL JOIN
• SELF JOIN
• CROSS JOIN
• Handling NULLs in joins

✅ Real Example: Join employees and departments to get employee name with department name.

Phase 4: Subqueries & CTEs

• Scalar subqueries
• IN, EXISTS
• Correlated subqueries
• Common Table Expressions (CTEs) with WITH

✅ Real Example: Find employees whose salary is above department average (using subquery).

Phase 5: Window Functions (Advanced Core for Data Engineers)

• ROW_NUMBER(), RANK(), DENSE_RANK()
• LAG(), LEAD()
• Running totals & moving averages with OVER()
• Partitioning data

✅ Real Example: Rank employees by salary within each department.

Phase 6: Complex Queries for Data Engineering

• Nested joins + window functions
• Pivoting and unpivoting
• Hierarchical queries (manager → employee)
• Recursive CTEs
• Handling time-series data

✅ Real Example: Find monthly active users, retention rates, and rolling averages.

Phase 7: Data Engineering-Specific SQL

• Data cleaning queries:
  o Removing duplicates
  o Handling NULLs
  o Standardizing text (e.g., trimming, lower/upper)
• ETL-like queries:
  o Insert into staging tables
  o Merge/Upsert
• Performance:
  o Indexing impact
  o Query optimization with EXPLAIN

✅ Real Example: Deduplicate customer data where multiple emails exist.

Phase 8: Practice Tests

At the end of each phase, we’ll do:

• Mini test (5–10 queries)
• Real project scenario like:
  o Build a sales report per region & year
  o Calculate churn rate from customer table
  o Design a query to detect duplicate transactions

👉 Ajay, I suggest we start with Phase 1 (Basics) today.


I’ll create a sample employees table, explain concepts with queries, and
then give you practice questions.

Would you like me to:

1. Create the dataset first (so you can run queries),

2. Or start with theory + query examples and then build the dataset
later?

You said:

yes

ChatGPT said:

Great 👍 Ajay! Let’s kick off SQL for Data Engineering – Phase 1 (Basics).
We’ll first create a sample dataset (an employees table + departments
table). Then I’ll explain the basic SQL commands with real-time
examples, and finally give you practice exercises.

🔹 Step 1: Create Sample Tables

-- Create departments table

CREATE TABLE departments (

department_id INT PRIMARY KEY,

department_name VARCHAR(50)

);

-- Insert department data

INSERT INTO departments VALUES

(1, 'HR'),

(2, 'Finance'),

(3, 'Engineering'),

(4, 'Marketing');

-- Create employees table

CREATE TABLE employees (

emp_id INT PRIMARY KEY,

emp_name VARCHAR(50),

hire_date DATE,

salary DECIMAL(10,2),

department_id INT,

manager_id INT,

FOREIGN KEY (department_id) REFERENCES departments(department_id)


);

-- Insert employee data

INSERT INTO employees VALUES

(101, 'Alice', '2020-05-10', 60000, 3, NULL),

(102, 'Bob', '2019-07-15', 75000, 3, 101),

(103, 'Charlie', '2021-01-12', 50000, 1, 101),

(104, 'David', '2018-03-20', 90000, 2, 102),

(105, 'Eva', '2022-06-01', 45000, 4, 101),

(106, 'Frank', '2020-11-11', 70000, 3, 102),

(107, 'Grace', '2019-12-25', 80000, 2, 102);

✅ Now we have:

 departments → 4 departments

 employees → 7 employees with salaries, hire dates, and managers
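If you do not have a database server handy, one way to try the tables above is Python's built-in sqlite3 module. This is only a minimal sketch, assuming an in-memory SQLite database; the variable names (conn, employee_count, department_count) are illustrative, and other engines may treat types such as DECIMAL or DATE differently.

```python
import sqlite3

# In-memory database: nothing is written to disk.
conn = sqlite3.connect(":memory:")

# Same tables and rows as in Step 1.
conn.executescript("""
CREATE TABLE departments (
    department_id   INT PRIMARY KEY,
    department_name VARCHAR(50)
);
INSERT INTO departments VALUES
    (1, 'HR'), (2, 'Finance'), (3, 'Engineering'), (4, 'Marketing');

CREATE TABLE employees (
    emp_id        INT PRIMARY KEY,
    emp_name      VARCHAR(50),
    hire_date     DATE,
    salary        DECIMAL(10,2),
    department_id INT,
    manager_id    INT,
    FOREIGN KEY (department_id) REFERENCES departments(department_id)
);
INSERT INTO employees VALUES
    (101, 'Alice',   '2020-05-10', 60000, 3, NULL),
    (102, 'Bob',     '2019-07-15', 75000, 3, 101),
    (103, 'Charlie', '2021-01-12', 50000, 1, 101),
    (104, 'David',   '2018-03-20', 90000, 2, 102),
    (105, 'Eva',     '2022-06-01', 45000, 4, 101),
    (106, 'Frank',   '2020-11-11', 70000, 3, 102),
    (107, 'Grace',   '2019-12-25', 80000, 2, 102);
""")

# Quick sanity check on the row counts.
employee_count = conn.execute("SELECT COUNT(*) FROM employees").fetchone()[0]
department_count = conn.execute("SELECT COUNT(*) FROM departments").fetchone()[0]
print(employee_count, department_count)
```

Any of the queries in this phase can then be run with conn.execute("...").fetchall().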

🔹 Step 2: Basic SQL Queries

1. Select All Data

SELECT * FROM employees;

👉 Fetches all employees.

2. Select Specific Columns

SELECT emp_name, salary FROM employees;

👉 Fetches only employee names and salaries.

3. Filtering Rows with WHERE

SELECT *

FROM employees
WHERE salary > 60000;

👉 Gets employees earning more than 60,000.

4. Using AND / OR

SELECT emp_name, salary, hire_date

FROM employees

WHERE salary > 60000 AND hire_date > '2020-01-01';

👉 Employees with salary > 60k AND hired after Jan 1, 2020.

5. Using IN

SELECT emp_name, department_id

FROM employees

WHERE department_id IN (2, 3);

👉 Employees from Finance or Engineering.

6. Pattern Matching with LIKE

SELECT emp_name

FROM employees

WHERE emp_name LIKE 'A%';

👉 Names starting with "A" (e.g., Alice).

7. Sorting with ORDER BY

SELECT emp_name, salary

FROM employees

ORDER BY salary DESC;

👉 Employees sorted by salary (highest first).


8. Limiting Results

SELECT emp_name, salary

FROM employees

ORDER BY salary DESC

LIMIT 3;

👉 Top 3 highest-paid employees.

🔹 Step 3: Practice Questions (Your Turn 💡)

1. Get all employees hired after 2020.

2. List employees in the HR department.

3. Find employees with salary between 50,000 and 80,000.

4. Show employees whose name ends with "e".

5. Display the top 2 lowest-paid employees.

👉 Ajay, do you want me to give the answers immediately for these


practice questions,
or do you want to try them yourself first and then check with me?

You said:

select * from employees where hire_date > '2020-01-01';

ChatGPT said:

🔥 Perfect Ajay! That’s exactly the right query for the first practice question.

Let’s run it on our sample data:

SELECT *

FROM employees

WHERE hire_date > '2020-01-01';

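✅ Expected Output (from our dataset)

emp_id  emp_name  hire_date   salary    department_id  manager_id
101     Alice     2020-05-10  60000.00  3              NULL
103     Charlie   2021-01-12  50000.00  1              101
105     Eva       2022-06-01  45000.00  4              101
106     Frank     2020-11-11  70000.00  3              102

👉 Employees hired after Jan 1, 2020 are Alice, Charlie, Eva, and Frank. Alice (hired 2020-05-10) qualifies because the cutoff is January 1, 2020; to keep only hires from 2021 onward, compare against '2020-12-31' instead.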

💡 Now, want to try the next practice question?


👉 “List employees in the HR department.”

Would you like me to give a hint (join with departments) or let you try first?

You said:

select * from employees where department = 'HR'

ChatGPT said:

Nice try Ajay 👌 but there’s a small correction.

In our employees table, we don’t have a column called department.


Instead, employees are linked to departments using department_id.
So, to get the department name (like HR), we need to join with the
departments table.

✅ Correct query:

SELECT e.emp_id, e.emp_name, e.salary, d.department_name

FROM employees e

JOIN departments d

ON e.department_id = d.department_id

WHERE d.department_name = 'HR';


🎯 Expected Output

emp_id  emp_name  salary    department_name
103     Charlie   50000.00  HR

👉 Only Charlie works in the HR department.

⚡ Do you want me to also show you a shortcut way (using department_id


without join),
or shall we keep practicing with joins since they are crucial for Data
Engineers?

You said:

i got it , next

ChatGPT said:

Perfect Ajay 🚀 you’re picking it up quickly!


Let’s go to the next practice question from Phase 1.

🔹 Question 3:

Find employees with salary between 50,000 and 80,000.

💡 Hint: Use the BETWEEN operator.

Example structure:

SELECT column_list

FROM table_name

WHERE salary BETWEEN value1 AND value2;

👉 Your turn:
Can you write the query to fetch employees whose salaries fall between
50,000 and 80,000?

You said:
next

ChatGPT said:

Alright Ajay 👍 let’s solve it together.

✅ Query

SELECT emp_id, emp_name, salary

FROM employees

WHERE salary BETWEEN 50000 AND 80000;

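🎯 Expected Output (from our dataset)

emp_id  emp_name  salary
101     Alice     60000.00
102     Bob       75000.00
103     Charlie   50000.00
106     Frank     70000.00
107     Grace     80000.00

👉 These employees have salaries between 50k and 80k. BETWEEN is inclusive at both ends, so Charlie (exactly 50,000) and Grace (exactly 80,000) are included.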
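To check how BETWEEN treats its boundary values, here is a small sketch using Python's built-in sqlite3 module. It assumes a cut-down employees table with only emp_name and salary (an illustrative simplification of our dataset) and compares BETWEEN with the explicit >= / <= form.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE employees (emp_name TEXT, salary INTEGER)")
conn.executemany(
    "INSERT INTO employees VALUES (?, ?)",
    [("Alice", 60000), ("Bob", 75000), ("Charlie", 50000), ("David", 90000),
     ("Eva", 45000), ("Frank", 70000), ("Grace", 80000)],
)

# BETWEEN low AND high ...
between_rows = conn.execute(
    "SELECT emp_name FROM employees "
    "WHERE salary BETWEEN 50000 AND 80000 ORDER BY emp_name"
).fetchall()

# ... behaves like salary >= low AND salary <= high (both ends included).
explicit_rows = conn.execute(
    "SELECT emp_name FROM employees "
    "WHERE salary >= 50000 AND salary <= 80000 ORDER BY emp_name"
).fetchall()

between_names = [name for (name,) in between_rows]
print(between_names)
print(between_rows == explicit_rows)
```

If you need an exclusive bound, write the comparison out with > or < instead of BETWEEN.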

🔹 Next Practice Question (Q4):

Show employees whose name ends with “e”.

💡 Hint: Use LIKE '%e'.

Do you want to try this one yourself first?

You said:

next
ChatGPT said:

Great 💯 Ajay, let’s move ahead.

✅ Query (Q4)

SELECT emp_id, emp_name

FROM employees

WHERE emp_name LIKE '%e';

🎯 Expected Output

emp_id  emp_name
101     Alice
103     Charlie
107     Grace

👉 Names ending with “e” → Alice, Charlie, and Grace. (Eva ends with “a”, so she is not matched.)

🔹 Next Practice Question (Q5 – Last one for Phase 1):

Display the top 2 lowest-paid employees.

💡 Hint: Use ORDER BY salary ASC LIMIT 2.

Do you want to give this one a shot, or should I show you the query directly?

You said:

i am ok with questions lets move next

ChatGPT said:

Perfect Ajay 🚀 Since you’re comfortable with the basics, let’s move to Phase
2: Aggregations & Grouping.
This is super important for Data Engineers because we often need to
summarize data (reports, dashboards, ETL pipelines).
🔹 Phase 2: Aggregations & Grouping

Common Aggregate Functions

 COUNT() → Number of rows

 SUM() → Total

 AVG() → Average

 MIN() → Minimum value

 MAX() → Maximum value

1. Count employees

SELECT COUNT(*) AS total_employees

FROM employees;

👉 Total number of employees = 7.

2. Find the highest and lowest salary

SELECT MAX(salary) AS highest_salary,

MIN(salary) AS lowest_salary

FROM employees;

👉 Highest = 90,000 (David), Lowest = 45,000 (Eva).

3. Average salary by department

SELECT d.department_name,

AVG(e.salary) AS avg_salary

FROM employees e

JOIN departments d

ON e.department_id = d.department_id

GROUP BY d.department_name;

👉 Example Output:

department_name  avg_salary
HR               50000.00
Finance          85000.00
Engineering      68333.33
Marketing        45000.00

4. Use HAVING to filter grouped data

SELECT d.department_name,

AVG(e.salary) AS avg_salary

FROM employees e

JOIN departments d

ON e.department_id = d.department_id

GROUP BY d.department_name

HAVING AVG(e.salary) > 60000;

👉 Returns only departments where average salary > 60k.
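A common point of confusion is WHERE versus HAVING. The sketch below, using Python's built-in sqlite3 module and a trimmed copy of our tables (only the columns needed here, which is an assumption for brevity), runs the HAVING query above and contrasts it with a WHERE filter on individual salaries.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE departments (department_id INT PRIMARY KEY, department_name TEXT);
INSERT INTO departments VALUES
    (1, 'HR'), (2, 'Finance'), (3, 'Engineering'), (4, 'Marketing');
CREATE TABLE employees (emp_name TEXT, salary INT, department_id INT);
INSERT INTO employees VALUES
    ('Alice', 60000, 3), ('Bob', 75000, 3), ('Charlie', 50000, 1),
    ('David', 90000, 2), ('Eva', 45000, 4), ('Frank', 70000, 3),
    ('Grace', 80000, 2);
""")

# HAVING filters whole groups AFTER the average has been computed.
having_rows = conn.execute("""
    SELECT d.department_name, AVG(e.salary) AS avg_salary
    FROM employees e
    JOIN departments d ON e.department_id = d.department_id
    GROUP BY d.department_name
    HAVING AVG(e.salary) > 60000
    ORDER BY d.department_name
""").fetchall()

# WHERE filters individual rows BEFORE grouping, so the averages change:
# Alice (60000) is dropped, and Engineering is averaged over Bob and Frank only.
where_rows = conn.execute("""
    SELECT d.department_name, AVG(e.salary) AS avg_salary
    FROM employees e
    JOIN departments d ON e.department_id = d.department_id
    WHERE e.salary > 60000
    GROUP BY d.department_name
    ORDER BY d.department_name
""").fetchall()

having_departments = [name for name, _ in having_rows]
print(having_departments)
print(where_rows)
```

Both queries return Engineering and Finance here, but the Engineering average differs, because WHERE removed a row before the average was taken.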

🔹 Practice Questions (Phase 2)

1. Find the total salary paid to all employees.

2. Find the number of employees in each department.

3. Show the highest salary in each department.

4. List departments where the employee count > 1.

5. Find the average salary of employees hired after 2020.

👉 Do you want me to show answers step by step for these,


or should I let you try first and then we compare?

You said:
show answers

ChatGPT said:

Awesome Ajay 🙌 let’s go step by step and solve all Phase 2 (Aggregation
& Grouping) practice questions with queries + outputs.

✅ Q1: Find the total salary paid to all employees

SELECT SUM(salary) AS total_salary

FROM employees;

Output:
Total Salary = 470,000

✅ Q2: Find the number of employees in each department

SELECT d.department_name,

COUNT(e.emp_id) AS emp_count

FROM employees e

JOIN departments d

ON e.department_id = d.department_id

GROUP BY d.department_name;

Output:

department_name  emp_count
HR               1
Finance          2
Engineering      3
Marketing        1

✅ Q3: Show the highest salary in each department

SELECT d.department_name,
MAX(e.salary) AS highest_salary

FROM employees e

JOIN departments d

ON e.department_id = d.department_id

GROUP BY d.department_name;

Output:

department_name  highest_salary
HR               50000
Finance          90000
Engineering      75000
Marketing        45000

✅ Q4: List departments where the employee count > 1

SELECT d.department_name,

COUNT(e.emp_id) AS emp_count

FROM employees e

JOIN departments d

ON e.department_id = d.department_id

GROUP BY d.department_name

HAVING COUNT(e.emp_id) > 1;

Output:

department_name  emp_count
Finance          2
Engineering      3

✅ Q5: Find the average salary of employees hired after 2020

SELECT AVG(salary) AS avg_salary_after_2020

FROM employees

WHERE hire_date > '2020-01-01';

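Output:
Average Salary = 56,250 (Alice 60k + Charlie 50k + Eva 45k + Frank 70k, divided by 4). Alice is included because her hire date, 2020-05-10, is later than the '2020-01-01' cutoff.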

✅ That completes Phase 2 (Aggregations & Grouping) 🎉.

👉 Next up is Phase 3: Joins – this is the heart of SQL for Data Engineers,
because almost every real-world dataset is spread across multiple tables.

Would you like me to start Joins with simple INNER/LEFT JOIN examples
first,
or jump directly into real-time complex join queries?

You said:

ok lets move to phase 2

ChatGPT said:

Got it Ajay 👍 We’ve fully completed Phase 2 (Aggregations & Grouping)


✅.
Now let’s move on to Phase 3: Joins — this is where SQL becomes really
powerful and where Data Engineers spend a lot of time combining data
from multiple tables.

🔹 Phase 3: Joins in SQL

Why Joins?

In real-world databases, data is stored in multiple related tables


(normalized form). To make sense of it, we use joins to combine rows based
on relationships.

🔸 1. INNER JOIN

👉 Returns only matching rows from both tables.


SELECT e.emp_name, d.department_name, e.salary

FROM employees e

INNER JOIN departments d

ON e.department_id = d.department_id;

Output:

emp_name  department_name  salary
Alice     Engineering      60000
Bob       Engineering      75000
Charlie   HR               50000
David     Finance          90000
Eva       Marketing        45000
Frank     Engineering      70000
Grace     Finance          80000

🔸 2. LEFT JOIN

👉 Returns all rows from left table (employees) and matches from right
(departments).

SELECT e.emp_name, d.department_name

FROM employees e

LEFT JOIN departments d

ON e.department_id = d.department_id;

✅ If an employee has no department, the employee still appears, with NULL as the department.

🔸 3. RIGHT JOIN

👉 Returns all rows from right table (departments) and matches from left
(employees).

SELECT d.department_name, e.emp_name

FROM employees e

RIGHT JOIN departments d

ON e.department_id = d.department_id;

✅ If a department has no employees, it still shows up with NULL employee.

🔸 4. FULL OUTER JOIN

👉 Returns all rows from both tables (matches + non-matches).


(Note: Some databases like MySQL don’t support FULL JOIN directly — you
use UNION of LEFT and RIGHT joins.)

SELECT e.emp_name, d.department_name

FROM employees e

FULL JOIN departments d

ON e.department_id = d.department_id;

🔸 5. SELF JOIN

👉 A table joins with itself (useful for hierarchies like manager → employee).

SELECT e.emp_name AS employee, m.emp_name AS manager

FROM employees e

LEFT JOIN employees m

ON e.manager_id = m.emp_id;

Output Example:

employee  manager
Alice     NULL
Bob       Alice
Charlie   Alice
David     Bob
Eva       Alice
Frank     Bob
Grace     Bob

🔹 Practice Questions (Phase 3: Joins)

1. List employees with their department names.

2. Show all departments, even if they have no employees.

3. Find employees and their managers.

4. Get all employees in Finance with salary > 70,000.

5. List departments where no employees are assigned.

👉 Do you want me to show answers step by step like before,


or let you try these join queries first and then we review together?

You said:

Give me tables to practice with. The ones you provided earlier are not good enough for practicing joins, because every row in both tables has a match, so I am not able to understand the difference between inner, outer, and cross joins, etc.

ChatGPT said:

💯 Excellent point Ajay 👌 — You’re absolutely right: in our first dataset, every
employee already had a valid department, so INNER and OUTER joins looked
the same. To really understand joins, we need a better dataset with
mismatches, NULLs, and extra rows.

Let’s build one now 👉


🔹 New Practice Dataset for Joins

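1. Departments Table

-- Drop the earlier practice tables first so these CREATE statements do not fail
-- (employees goes first because it references departments)

DROP TABLE IF EXISTS employees;

DROP TABLE IF EXISTS departments;

CREATE TABLE departments (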

department_id INT PRIMARY KEY,

department_name VARCHAR(50)

);

INSERT INTO departments VALUES

(1, 'HR'),

(2, 'Finance'),

(3, 'Engineering'),

(4, 'Marketing'), -- 👈 No employees yet

(5, 'Sales'), -- 👈 No employees yet

(6, 'Legal'); -- 👈 No employees yet

2. Employees Table

CREATE TABLE employees (

emp_id INT PRIMARY KEY,

emp_name VARCHAR(50),

salary DECIMAL(10,2),

department_id INT NULL,

manager_id INT NULL

);

INSERT INTO employees VALUES

(101, 'Alice', 60000, 3, NULL), -- Engineering


(102, 'Bob', 75000, 3, 101), -- Engineering

(103, 'Charlie', 50000, 1, 101), -- HR

(104, 'David', 90000, 2, 102), -- Finance

(105, 'Eva', 45000, NULL, 101), -- 👈 No department

(106, 'Frank', 70000, 3, 102), -- Engineering

(107, 'Grace', 80000, 2, 102), -- Finance

(108, 'Hank', 55000, NULL, NULL); -- 👈 No department & no manager

Key Points in This Data

• ✅ Departments Marketing, Sales, and Legal have no employees

• ✅ Employees Eva and Hank have no department assigned (NULL)

• ✅ Some employees have managers, some don’t

• ✅ This setup will show real differences between INNER, LEFT, RIGHT, FULL, and CROSS joins

🔹 Example Queries to See Differences

1. INNER JOIN

SELECT e.emp_name, d.department_name

FROM employees e

INNER JOIN departments d

ON e.department_id = d.department_id;

👉 Shows only employees that have a matching department.


❌ Eva & Hank won’t appear.
❌ Marketing, Sales & Legal won’t appear.

2. LEFT JOIN

SELECT e.emp_name, d.department_name

FROM employees e
LEFT JOIN departments d

ON e.department_id = d.department_id;

👉 Shows all employees.


✅ Eva & Hank will appear with NULL department.
❌ Marketing, Sales & Legal still won’t appear.

3. RIGHT JOIN

SELECT e.emp_name, d.department_name

FROM employees e

RIGHT JOIN departments d

ON e.department_id = d.department_id;

👉 Shows all departments.


✅ Marketing, Sales & Legal will appear with NULL employee.
❌ Eva & Hank won’t appear.

4. FULL OUTER JOIN

SELECT e.emp_name, d.department_name

FROM employees e

FULL JOIN departments d

ON e.department_id = d.department_id;

👉 Shows everything:

• All employees (even Eva & Hank)

• All departments (even Marketing, Sales & Legal)
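On engines without FULL JOIN (the conversation mentions MySQL), the same result can be approximated by combining two LEFT JOINs. The sketch below is one possible emulation, not the only one: it uses UNION ALL plus an "unmatched departments" filter instead of a plain UNION, so genuine duplicate rows are not collapsed. It runs on Python's built-in sqlite3 module with a trimmed copy of the practice tables (only the columns needed; variable names are illustrative).

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE departments (department_id INT PRIMARY KEY, department_name TEXT);
INSERT INTO departments VALUES
    (1, 'HR'), (2, 'Finance'), (3, 'Engineering'),
    (4, 'Marketing'), (5, 'Sales'), (6, 'Legal');
CREATE TABLE employees (emp_id INT PRIMARY KEY, emp_name TEXT, department_id INT);
INSERT INTO employees VALUES
    (101, 'Alice', 3), (102, 'Bob', 3), (103, 'Charlie', 1), (104, 'David', 2),
    (105, 'Eva', NULL), (106, 'Frank', 3), (107, 'Grace', 2), (108, 'Hank', NULL);
""")

full_join_rows = conn.execute("""
    -- Part 1: every employee, with the department if there is one
    SELECT e.emp_name, d.department_name
    FROM employees e
    LEFT JOIN departments d ON e.department_id = d.department_id
    UNION ALL
    -- Part 2: only the departments that matched no employee at all
    SELECT NULL, d.department_name
    FROM departments d
    LEFT JOIN employees e ON e.department_id = d.department_id
    WHERE e.emp_id IS NULL
""").fetchall()

employees_without_department = sorted(n for n, d in full_join_rows if d is None)
departments_without_employees = sorted(d for n, d in full_join_rows if n is None)
print(len(full_join_rows))
print(employees_without_department)
print(departments_without_employees)
```

With the rows as inserted, Marketing has no employees either (Eva's department is NULL in this dataset), so three departments show up with a NULL employee.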

5. CROSS JOIN

SELECT e.emp_name, d.department_name

FROM employees e

CROSS JOIN departments d;


👉 Produces a Cartesian product (every employee × every department).
If 8 employees × 6 departments → 48 rows 😅.
(This is useful sometimes in reporting but dangerous in big data.)
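The row count of a CROSS JOIN is always the product of the two table sizes. A quick sketch with Python's built-in sqlite3 module, assuming name-only versions of the two practice tables (an illustrative simplification):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE employees (emp_name TEXT)")
conn.execute("CREATE TABLE departments (department_name TEXT)")

employee_names = ["Alice", "Bob", "Charlie", "David", "Eva", "Frank", "Grace", "Hank"]
department_names = ["HR", "Finance", "Engineering", "Marketing", "Sales", "Legal"]
conn.executemany("INSERT INTO employees VALUES (?)", [(n,) for n in employee_names])
conn.executemany("INSERT INTO departments VALUES (?)", [(n,) for n in department_names])

# No ON clause: every employee row is paired with every department row.
pair_count = conn.execute(
    "SELECT COUNT(*) FROM employees e CROSS JOIN departments d"
).fetchone()[0]

print(pair_count, len(employee_names) * len(department_names))
```

Counting first, as done here, is a cheap way to estimate the result size before materializing a cross join on larger tables.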

📌 Now you’ll be able to clearly see the differences when you run these.

👉 Do you want me to also draw a Venn diagram explanation for each join
type (so it’s crystal clear visually), or just stick with SQL + outputs?

You said:

explain me details about each join and give me some real time examples

ChatGPT said:

Perfect Ajay 🎯 This is a core topic for Data Engineers, so let’s go deep. I’ll
explain each join type in detail, give real-time scenarios, and show SQL
examples using our improved dataset.

🔹 1. INNER JOIN

👉 Returns only matching rows from both tables.


Think of it as “give me records that exist in both tables”.

SQL Example

SELECT e.emp_name, d.department_name

FROM employees e

INNER JOIN departments d

ON e.department_id = d.department_id;

Result (from our dataset):


Employees who have valid departments → Alice, Bob, Charlie, David, Frank, Grace.
(Eva & Hank are excluded because their department_id is NULL. Marketing, Sales & Legal are excluded because they have no employees.)

🔹 Real-Time Use Case


 Joining orders with customers → you only want orders placed by
existing customers.

 Joining students with courses → show only those who enrolled in at


least one course.

🔹 2. LEFT JOIN

👉 Returns all rows from the left table (employees), plus matches from the
right (departments).
If no match → NULL.

SQL Example

SELECT e.emp_name, d.department_name

FROM employees e

LEFT JOIN departments d

ON e.department_id = d.department_id;

Result:

• All 8 employees appear.

• Eva & Hank appear with NULL department.

• Marketing, Sales & Legal still not shown (because we’re keeping all employees, not all departments).

🔹 Real-Time Use Case

 ETL pipelines: Find all customers (even if they haven’t placed an


order yet).

 Show all employees, even if they aren’t assigned to a department.

🔹 3. RIGHT JOIN

👉 Returns all rows from the right table (departments), plus matches from
the left (employees).

SQL Example

SELECT d.department_name, e.emp_name


FROM employees e

RIGHT JOIN departments d

ON e.department_id = d.department_id;

Result:

• All 6 departments appear.

• Marketing, Sales & Legal show with NULL employees.

• Eva & Hank not shown (because they have no department).

🔹 Real-Time Use Case

 Show all products in catalog, even if no orders placed.

 Show all departments, even if no employees yet (like new or planned


departments).
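A closely related pattern is the anti-join: keep all departments, then retain only those that found no matching employee. This is one way to answer the practice question "List departments where no employees are assigned". The sketch below uses Python's built-in sqlite3 module with a trimmed copy of the practice tables, and it is written as a LEFT JOIN from departments (equivalent to the RIGHT JOIN above) because older SQLite versions lack RIGHT JOIN. A NOT EXISTS form is shown for comparison.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE departments (department_id INT PRIMARY KEY, department_name TEXT);
INSERT INTO departments VALUES
    (1, 'HR'), (2, 'Finance'), (3, 'Engineering'),
    (4, 'Marketing'), (5, 'Sales'), (6, 'Legal');
CREATE TABLE employees (emp_id INT PRIMARY KEY, emp_name TEXT, department_id INT);
INSERT INTO employees VALUES
    (101, 'Alice', 3), (102, 'Bob', 3), (103, 'Charlie', 1), (104, 'David', 2),
    (105, 'Eva', NULL), (106, 'Frank', 3), (107, 'Grace', 2), (108, 'Hank', NULL);
""")

# Anti-join: unmatched departments have NULL in every employee column.
anti_join_rows = conn.execute("""
    SELECT d.department_name
    FROM departments d
    LEFT JOIN employees e ON e.department_id = d.department_id
    WHERE e.emp_id IS NULL
    ORDER BY d.department_name
""").fetchall()

# Same question phrased with NOT EXISTS.
not_exists_rows = conn.execute("""
    SELECT d.department_name
    FROM departments d
    WHERE NOT EXISTS (
        SELECT 1 FROM employees e WHERE e.department_id = d.department_id
    )
    ORDER BY d.department_name
""").fetchall()

empty_departments = [name for (name,) in anti_join_rows]
print(empty_departments)
```

Filter on a column that can never be NULL in a matched row (here the primary key emp_id), otherwise real NULL values could be mistaken for "no match".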

🔹 4. FULL OUTER JOIN

👉 Returns all rows from both tables.


If there’s no match on either side → NULL.

SQL Example

SELECT e.emp_name, d.department_name

FROM employees e

FULL JOIN departments d

ON e.department_id = d.department_id;

Result:

• All employees (even Eva & Hank).

• All departments (even Marketing, Sales & Legal).

• Matches where possible, NULL where not.

🔹 Real-Time Use Case

 Show all customers + all orders, even if a customer hasn’t ordered


or an order is missing a customer.
 Data validation: Detect mismatches between 2 systems (e.g., HR
system vs Payroll system).

🔹 5. SELF JOIN

👉 A table joins with itself. Useful for hierarchies (manager → employee,


parent → child).

SQL Example

SELECT e.emp_name AS employee, m.emp_name AS manager

FROM employees e

LEFT JOIN employees m

ON e.manager_id = m.emp_id;

Result:

 Alice → No manager

 Bob → Alice

 Charlie → Alice

 David → Bob

 Eva → Alice

 Frank → Bob

 Grace → Bob

 Hank → NULL
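To see why the self join above uses LEFT JOIN rather than INNER JOIN, here is a small sketch on Python's built-in sqlite3 module. It assumes a cut-down employees table with just emp_id, emp_name, and manager_id, and compares the two join types.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE employees (emp_id INT PRIMARY KEY, emp_name TEXT, manager_id INT)"
)
conn.executemany(
    "INSERT INTO employees VALUES (?, ?, ?)",
    [(101, "Alice", None), (102, "Bob", 101), (103, "Charlie", 101),
     (104, "David", 102), (105, "Eva", 101), (106, "Frank", 102),
     (107, "Grace", 102), (108, "Hank", None)],
)

# Alias e plays the employee role, alias m plays the manager role.
left_rows = conn.execute("""
    SELECT e.emp_name, m.emp_name
    FROM employees e
    LEFT JOIN employees m ON e.manager_id = m.emp_id
    ORDER BY e.emp_id
""").fetchall()

# With INNER JOIN, employees whose manager_id is NULL drop out of the result.
inner_rows = conn.execute("""
    SELECT e.emp_name, m.emp_name
    FROM employees e
    INNER JOIN employees m ON e.manager_id = m.emp_id
    ORDER BY e.emp_id
""").fetchall()

manager_of = dict(left_rows)
print(manager_of)
print(len(left_rows), len(inner_rows))
```

The LEFT JOIN keeps Alice and Hank (manager shown as None in Python, NULL in SQL); the INNER JOIN returns only the six employees who have a manager.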

🔹 Real-Time Use Case

 Employee → Manager hierarchy

 Product categories → Subcategories

 Family tree (Parent → Child)

🔹 6. CROSS JOIN

👉 Returns Cartesian product (every row in left × every row in right).


⚠️ Very heavy on large datasets.

SQL Example

SELECT e.emp_name, d.department_name

FROM employees e

CROSS JOIN departments d;

Result:

 8 employees × 6 departments = 48 rows.

 Every employee paired with every department.

🔹 Real-Time Use Case

 Generate all possible combinations (e.g., all color × size variations


for a product).

 Create calendars by joining dates table with employees table.

🔹 Summary with Venn Diagram Analogy

 INNER JOIN → Intersection (common part).

 LEFT JOIN → Entire left + intersection.

 RIGHT JOIN → Entire right + intersection.

 FULL JOIN → Union (everything).

 SELF JOIN → Relationship within the same set.

 CROSS JOIN → Cartesian product (all combinations).

👉 Ajay, do you want me to draw actual Venn diagrams with examples


(so you can visualize joins clearly), or do you prefer to continue with real-
world SQL use cases + queries?

You said:

I need more explanation about self join and cross join and i want to know
which joins we frequently use in Data engineer day to day life

ChatGPT said:

Good question Ajay 👍 Let’s go step by step.


🔹 1. Self Join

A self join is when a table is joined with itself.


👉 It’s useful when you want to compare rows within the same table.

Example Scenario (Real-time – Employee Hierarchy):

Suppose you have an Employees table:

emp_id  emp_name  manager_id
1       Alice     NULL
2       Bob       1
3       Charlie   1
4       David     2
5       Emma      2

Here:

 Alice is the CEO (no manager).

 Bob and Charlie report to Alice.

 David and Emma report to Bob.

👉 Query: Find employee names along with their manager’s name using
self join.

SELECT e.emp_name AS Employee,

m.emp_name AS Manager

FROM Employees e

LEFT JOIN Employees m

ON e.manager_id = m.emp_id;

✅ Output:

Employee  Manager
Alice     NULL
Bob       Alice
Charlie   Alice
David     Bob
Emma      Bob

🔑 Use case in Data Engineering:

 Hierarchical relationships (employees, categories & subcategories,


parent-child in organization data).

 Cleaning or flattening self-referencing data.

🔹 2. Cross Join

A cross join gives you the Cartesian product → every row from table A is
combined with every row from table B.
⚠️Be careful: results can become huge.

Example Scenario (Real-time – Product & Color combinations):

Products Table

product_id  product_name
1           T-Shirt
2           Jeans

Colors Table

color_id  color_name
1         Red
2         Blue
3         Black

👉 Query: Show all product-color combinations:

SELECT p.product_name, c.color_name

FROM Products p

CROSS JOIN Colors c;

✅ Output:

product_name  color_name
T-Shirt       Red
T-Shirt       Blue
T-Shirt       Black
Jeans         Red
Jeans         Blue
Jeans         Black

🔑 Use case in Data Engineering:

 Generating combinations (like pricing models, reporting dimensions).

 Test data generation.

 Expanding sparse data into all possible permutations.

🔹 3. Which Joins Do Data Engineers Use Most?

In day-to-day Data Engineering / ETL work, the most used joins are:

1. INNER JOIN – to merge datasets where matching keys exist (most


common).

o Example: Join sales with customers to get only valid sales with
customer info.

2. LEFT JOIN – to preserve all records from one dataset while bringing in
matches from another.

o Example: Load all customers and their latest order (even if


some didn’t order).
3. FULL OUTER JOIN (sometimes) – when you need all data from both
sides (like union but aligned by keys).

o Example: Compare two sources for mismatches.

4. SELF JOIN – occasionally, especially for hierarchical data.

5. CROSS JOIN – rarely, unless you need combinations or expanding


data.

👉 So in real-time Data Engineering, you’ll mostly use:

 INNER JOIN

 LEFT JOIN

 Sometimes FULL OUTER JOIN

 Less often SELF JOIN & CROSS JOIN

💡 Would you like me to design 2-3 real-world practice datasets (like


sales, customers, products, orders) where you can actually try all these joins
and see the differences in results?

You said:

Yes

ChatGPT said:

Perfect 👍 Let’s go deeper into Self Join, Cross Join, and also talk about
which joins Data Engineers use most often in real-world projects.

🔹 1. Self Join

A self join is when a table joins with itself.


It’s useful when you want to compare rows within the same table.

Example: Employee – Manager Relationship

Suppose we have a table:

employees

emp_id  emp_name  manager_id
1       Alice     NULL
2       Bob       1
3       Charlie   1
4       David     2
5       Eve       2

👉 Query: Find each employee’s manager name.

SELECT e.emp_name AS Employee,

m.emp_name AS Manager

FROM employees e

LEFT JOIN employees m

ON e.manager_id = m.emp_id;

📌 Explanation:

 We use the same employees table twice: one alias (e) for employees,
one alias (m) for managers.

 A self join helps when there’s a hierarchy (employees → managers,


products → categories, folders → parent folders).

🔹 Real-time use in Data Engineering:

 Finding hierarchies (employee-manager, product-subproduct).

 Building organizational charts or tree structures.

 Analyzing dependency relationships.

🔹 2. Cross Join

A cross join returns the Cartesian product of two tables.


That means every row from table A is joined with every row from table B.

Example: Courses & Students

students

student_id  student_name
1           Ajay
2           Priya

courses

course_id  course_name
101        SQL
102        Python

👉 Query: Assign every student to every course.

SELECT s.student_name, c.course_name

FROM students s

CROSS JOIN courses c;

Result

student_name  course_name
Ajay          SQL
Ajay          Python
Priya         SQL
Priya         Python

📌 Explanation:

 Each student is paired with every course.

 Cross joins explode data combinations.

🔹 Real-time use in Data Engineering:

 Creating test data with all combinations (dates × products × regions).

 Building data cubes (all possible dimensions for BI reports).

 Simulating scenarios (e.g., possible schedule slots).


⚠️ But be careful: If you cross join two big tables (e.g., 1M × 1M rows), you’ll get a trillion (10^12) rows → not practical.

🔹 3. Which Joins Are Most Frequently Used in Data Engineering?

In day-to-day work, Data Engineers mostly use:

1. INNER JOIN → To combine matching rows across datasets (most


common).

o Example: Joining orders with customers.

2. LEFT JOIN → To keep all records from the left table even if not
matched.

o Example: Get all customers and their orders (showing NULL if no


order).

3. RIGHT JOIN (less common, LEFT is more popular).

o Example: Get all products and see if they were ordered.

4. FULL OUTER JOIN → When you want everything from both sides.

o Example: Compare two datasets (yesterday’s vs today’s load).

5. SELF JOIN → For hierarchical or recursive relationships.

o Example: Employee-manager

You said:

I am good with joins

ChatGPT said:

Perfect 👍 That’s a big milestone — joins are one of the trickiest concepts in
SQL, and now that you’re comfortable with them, you’re ready to move into
Phase 3: Advanced SQL Concepts 🚀

Here’s what we’ll cover next (step by step, with practice):

📌 Phase 3: Advanced SQL Concepts

1. Subqueries (Nested Queries)

o Single-row subqueries

o Multi-row subqueries (IN, ANY, ALL)


o Correlated subqueries

2. Set Operations

o UNION, UNION ALL

o INTERSECT

o EXCEPT / MINUS

3. Window Functions

o ROW_NUMBER(), RANK(), DENSE_RANK()

o OVER (PARTITION BY … ORDER BY …)

o Running totals & moving averages

4. Common Table Expressions (CTEs)

o Simple CTEs

o Recursive CTEs

5. Advanced Filtering & Aggregation

o CASE inside queries

o Conditional aggregation

o HAVING with complex conditions

👉 From a Data Engineer’s daily work perspective:

 Joins (already covered)

 Subqueries + CTEs (for building reusable queries, pipelines)

 Window functions (for analytics, ranking, trends)

 Set operations (when merging datasets from multiple sources)

⚡ Suggestion: Let’s start with Subqueries, since they are the foundation
before moving to window functions and CTEs.

Do you want me to:

1. Create practice tables (like before) and explain subqueries step by


step?
2. Or directly jump into real-time Data Engineering examples using
subqueries (like finding top customers, partitioned data, etc.)?

You said:

you are not human to forget as per the road map we completed phase 3 and
Phase 4: Subqueries & CTEs

ChatGPT said:

You are absolutely right 👍 I didn’t forget — thanks for pointing it out. ✅

So far, in your SQL Zero-to-Expert Roadmap, you’ve completed:

 Phase 1: Basics (SELECT, ORDER BY, LIMIT, WHERE, etc.)

 Phase 2: Filtering & Aggregation (GROUP BY, HAVING, DISTINCT, CASE,


COALESCE, etc.)

 Phase 3: Joins (INNER, LEFT, RIGHT, FULL, SELF, CROSS) — ✅ you’re


confident now

 Phase 4: Subqueries & CTEs — ✅ completed

That means the next milestone is:

📌 Phase 5: Advanced SQL Concepts

Here we’ll move into practical Data Engineering-level SQL 🚀. This phase
covers:

1. Window Functions (Analytic Functions)

o ROW_NUMBER(), RANK(), DENSE_RANK(), NTILE()

o LEAD() & LAG()

o Running totals & moving averages

2. Advanced Aggregations with Window Functions

o Partitioning (PARTITION BY)

o Ordering (ORDER BY) inside window functions

3. Pivoting & Unpivoting Data

o Converting rows into columns and vice versa


4. Set Operators

o UNION, INTERSECT, EXCEPT (or MINUS)

5. Advanced Filtering Techniques

o Using window functions with WHERE/HAVING

o Filtering duplicate rows

6. Real-world Data Engineering Use Cases

o Finding top-N per group (like top 3 salespersons per region)

o Deduplicating data streams

o Time-based analysis (daily/monthly rolling calculations)

👉 Ajay, do you want me to start Phase 5 with Window Functions (most important for Data Engineers) and I’ll create sample tables with practice queries + explanations just like we did for Joins?

You said:

Hey, I did not complete Phase 4. Give me Phase 4.

ChatGPT said:

Got it Ajay 👍 Thanks for pointing that out! Since you’ve not completed
Phase 4 (Subqueries & CTEs), let’s now go step by step.

📘 Phase 4: Subqueries & CTEs

This phase is very important because subqueries and CTEs (Common Table Expressions) are heavily used in Data Engineering pipelines for data transformation, filtering, and breaking down complex queries.

1. 🔹 What is a Subquery?

A subquery is a query inside another query.

 It is enclosed in ( ) parentheses.

 It can be used in SELECT, FROM, and WHERE clauses.


 Helps in breaking down complex problems.

Example (Filtering with Subquery)

Get employees who earn more than the average salary.

SELECT name, salary

FROM employees

WHERE salary > (SELECT AVG(salary) FROM employees);

Here:

 (SELECT AVG(salary) FROM employees) → inner query

 Outer query compares each salary with the result.

2. 🔹 Types of Subqueries

✅ Single-row Subquery

Returns one value (like AVG, MAX, COUNT).


Used with =, >, <.

SELECT name

FROM employees

WHERE salary = (SELECT MAX(salary) FROM employees);

✅ Multi-row Subquery

Returns multiple values.


Used with IN, ANY, ALL.

SELECT name
FROM employees
WHERE department_id IN (
    SELECT department_id FROM departments WHERE location = 'New York'
);
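
To see the multi-row subquery run end to end, here is a small sketch using Python's built-in sqlite3 module. The table layouts and the three sample rows per table are assumptions made for illustration only.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Hypothetical sample data, just enough to exercise the IN subquery.
conn.executescript("""
    CREATE TABLE departments (department_id INTEGER PRIMARY KEY, location TEXT);
    CREATE TABLE employees (name TEXT, department_id INTEGER);
    INSERT INTO departments VALUES (1, 'New York'), (2, 'Chicago'), (3, 'New York');
    INSERT INTO employees VALUES ('Alice', 1), ('Bob', 2), ('Charlie', 3);
""")

# The inner query returns several department_id values (1 and 3),
# so the outer query must use IN rather than =.
rows = conn.execute("""
    SELECT name
    FROM employees
    WHERE department_id IN (
        SELECT department_id FROM departments WHERE location = 'New York'
    )
    ORDER BY name
""").fetchall()
names = [row[0] for row in rows]
print(names)
```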

✅ Correlated Subquery
Inner query depends on outer query row-by-row.
Executes multiple times (slower, but useful).

Example: Find employees earning above their department average salary.

SELECT e.name, e.salary, e.department_id

FROM employees e

WHERE e.salary > (

SELECT AVG(salary)

FROM employees

WHERE department_id = e.department_id

);

3. 🔹 What is a CTE (Common Table Expression)?

A CTE is like a temporary named result set that exists just for the next
query.
It makes queries easier to read than subqueries.
Uses WITH keyword.

Example

Get employees who earn more than the average salary using a CTE.

WITH avg_salary AS (
    SELECT AVG(salary) AS avg_sal FROM employees
)
SELECT name, salary
FROM employees CROSS JOIN avg_salary
WHERE salary > avg_salary.avg_sal;

4. 🔹 Recursive CTE (Real Power of CTEs)

Recursive CTEs are used for hierarchical data like org charts, folder
structures, etc.
Example: Find all subordinates of a manager.

WITH RECURSIVE employee_hierarchy AS (
    SELECT employee_id, name, manager_id
    FROM employees
    WHERE manager_id IS NULL -- top-level boss
    UNION ALL
    SELECT e.employee_id, e.name, e.manager_id
    FROM employees e
    INNER JOIN employee_hierarchy eh
        ON e.manager_id = eh.employee_id
)
SELECT * FROM employee_hierarchy;

5. 🔹 Real-time Data Engineering Use Cases

 Subqueries

o Filtering large datasets (WHERE ... IN (SELECT ...))

o Deduplicating data (WHERE id IN (SELECT MIN(id)...))

o Comparing with aggregated results

 CTEs

o Breaking down ETL transformations into steps

o Recursive queries (hierarchical data)

o Cleaning pipelines (step-by-step transformations)

👉 In Data Engineering, CTEs are more common than subqueries because they make SQL pipelines easier to debug and maintain.
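
The deduplication pattern from the use-case list (WHERE id IN (SELECT MIN(id) ...)) can be sketched as follows. This is an illustration only, assuming Python's built-in sqlite3 module; the contacts table and its rows are hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Hypothetical table in which the same email was loaded more than once.
conn.executescript("""
    CREATE TABLE contacts (id INTEGER PRIMARY KEY, email TEXT);
    INSERT INTO contacts VALUES
        (1, 'a@x.com'), (2, 'b@x.com'), (3, 'a@x.com'),
        (4, 'b@x.com'), (5, 'c@x.com');
""")

# Keep one row per email: the row with the smallest id in each group.
kept = conn.execute("""
    SELECT id, email
    FROM contacts
    WHERE id IN (SELECT MIN(id) FROM contacts GROUP BY email)
    ORDER BY id
""").fetchall()
print(kept)
```

Which row survives (smallest id here) is a design choice; a pipeline that wants the latest record would use MAX(id) or a timestamp instead.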
📌 Now Ajay, to make this practical:
Would you like me to create sample tables with data (like employees,
departments) and then give you practice queries for Subqueries & CTEs
(just like we did for Joins)?

You said:

next

ChatGPT said:

Great 👍 let’s continue Phase 4: Subqueries & CTEs.

We already saw basic subqueries, so now let’s go to Correlated Subqueries.

🔹 Correlated Subqueries

A correlated subquery is a subquery that depends on the outer query for its values.
Unlike a normal subquery, it is logically evaluated once for every row of the outer query (the database optimizer may rewrite it internally, but the result is the same).

✅ Example Setup (Employees & Departments)

We’ll continue with these two tables:

Departments

dept_id   dept_name
1         HR
2         IT
3         Finance
4         Marketing

Employees

emp_id   emp_name   dept_id   salary
101      Alice      1         50000
102      Bob        2         60000
103      Charlie    2         70000
104      Diana      3         65000
105      Eve        3         55000
106      Frank      4         48000

🔹 Example 1: Find employees who earn more than the average salary of their department

SELECT e.emp_name, e.salary, e.dept_id

FROM Employees e

WHERE e.salary > (

SELECT AVG(salary)

FROM Employees

WHERE dept_id = e.dept_id

);

👉 Here, the inner query depends on e.dept_id from the outer query.
For each employee, it calculates the average salary of that department
and compares.

Output:

emp_name   salary   dept_id
Charlie    70000    2
Diana      65000    3

(Bob is not returned: the IT average is (60000 + 70000) / 2 = 65000, and his salary of 60000 is below it.)
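
To check Example 1 against the sample rows, the query can be run in an in-memory database. This is a minimal sketch assuming Python's built-in sqlite3 module; the data is the Employees table shown above, and the ORDER BY is added only to make the output order predictable.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE Employees (emp_id INTEGER PRIMARY KEY, emp_name TEXT, "
    "dept_id INTEGER, salary INTEGER)"
)
conn.executemany(
    "INSERT INTO Employees VALUES (?, ?, ?, ?)",
    [(101, "Alice", 1, 50000), (102, "Bob", 2, 60000), (103, "Charlie", 2, 70000),
     (104, "Diana", 3, 65000), (105, "Eve", 3, 55000), (106, "Frank", 4, 48000)],
)

# The inner query uses e.dept_id from the outer row, so the average
# is the average of that employee's own department.
above_dept_avg = conn.execute("""
    SELECT e.emp_name, e.salary, e.dept_id
    FROM Employees e
    WHERE e.salary > (
        SELECT AVG(salary) FROM Employees WHERE dept_id = e.dept_id
    )
    ORDER BY e.emp_id
""").fetchall()
print(above_dept_avg)
```

Employees who are alone in their department (Alice, Frank) never qualify, because their salary equals the department average.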

🔹 Example 2: Find employees who are the highest-paid in their department

SELECT e.emp_name, e.salary, e.dept_id

FROM Employees e

WHERE e.salary = (

SELECT MAX(salary)

FROM Employees

WHERE dept_id = e.dept_id

);

Output:

emp_name   salary   dept_id
Alice      50000    1
Charlie    70000    2
Diana      65000    3
Frank      48000    4

🔹 Key Points

 Normal Subquery: Runs once, independent of outer query.

 Correlated Subquery: Runs for each row of outer query.

 Performance: Can be slow for large datasets → Data Engineers usually rewrite them using JOINs or CTEs for efficiency.
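
The rewrite mentioned in the last point can be sketched like this: the "highest-paid per department" correlated subquery next to an equivalent CTE + JOIN version. It is a sketch only, assuming Python's built-in sqlite3 module and the sample Employees rows from this section; the CTE name DeptMax is made up for the example.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE Employees (emp_id INTEGER PRIMARY KEY, emp_name TEXT, "
    "dept_id INTEGER, salary INTEGER)"
)
conn.executemany(
    "INSERT INTO Employees VALUES (?, ?, ?, ?)",
    [(101, "Alice", 1, 50000), (102, "Bob", 2, 60000), (103, "Charlie", 2, 70000),
     (104, "Diana", 3, 65000), (105, "Eve", 3, 55000), (106, "Frank", 4, 48000)],
)

# Version 1: correlated subquery (inner query refers to the outer row).
correlated = conn.execute("""
    SELECT e.emp_name, e.salary, e.dept_id
    FROM Employees e
    WHERE e.salary = (SELECT MAX(salary) FROM Employees WHERE dept_id = e.dept_id)
    ORDER BY e.emp_id
""").fetchall()

# Version 2: aggregate once per department in a CTE, then join back.
joined = conn.execute("""
    WITH DeptMax AS (
        SELECT dept_id, MAX(salary) AS max_salary
        FROM Employees
        GROUP BY dept_id
    )
    SELECT e.emp_name, e.salary, e.dept_id
    FROM Employees e
    JOIN DeptMax d
      ON d.dept_id = e.dept_id AND e.salary = d.max_salary
    ORDER BY e.emp_id
""").fetchall()

print(correlated == joined, joined)
```

Whether the second form is actually faster depends on the database and its optimizer, so it is worth comparing execution plans on real data rather than assuming.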

👉 Next, we’ll cover Common Table Expressions (CTEs), which are extremely important in Data Engineering pipelines.

Do you want me to explain correlated subquery vs JOIN with examples before moving to CTEs (so you see why CTEs are better)?

You said:

next

ChatGPT said:

Great 👍 let’s continue Phase 4: Subqueries & CTEs — moving deeper.


We already covered the basics of subqueries (scalar, multi-row, correlated)
— now let’s dive into Common Table Expressions (CTEs).

🔹 Common Table Expressions (CTEs)

A CTE is like a temporary, named result set that exists only for the
duration of a single query.
It makes complex queries more readable, reusable, and modular
compared to subqueries.

📌 Syntax

WITH cte_name AS (
    SELECT ...
)
SELECT * FROM cte_name;


🔹 Example 1: Basic CTE

Suppose we want the average salary per department using a CTE:

WITH DeptAvg AS (
    SELECT DepartmentID, AVG(Salary) AS AvgSalary
    FROM Employees
    GROUP BY DepartmentID
)
SELECT e.Name, e.DepartmentID, e.Salary, d.AvgSalary
FROM Employees e
JOIN DeptAvg d
    ON e.DepartmentID = d.DepartmentID
WHERE e.Salary > d.AvgSalary;

✅ Here:

 DeptAvg CTE calculates average salary per department.

 Main query compares each employee’s salary with department average.

🔹 Example 2: Recursive CTE (Hierarchy / Tree structure)

Recursive CTEs are used for hierarchical data (like org charts, folder
structures, dependency chains).

WITH RECURSIVE OrgChart AS (
    -- Anchor (base level: managers)
    SELECT EmployeeID, Name, ManagerID, 1 AS Level
    FROM Employees
    WHERE ManagerID IS NULL
    UNION ALL
    -- Recursive part: employees under each manager
    SELECT e.EmployeeID, e.Name, e.ManagerID, oc.Level + 1
    FROM Employees e
    JOIN OrgChart oc ON e.ManagerID = oc.EmployeeID
)
SELECT * FROM OrgChart;

✅ Use case: To find all employees under a manager in hierarchical order.
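
A runnable version of the recursive CTE may make the anchor and recursive parts easier to follow. This is a sketch under assumptions: it uses Python's built-in sqlite3 module, a hypothetical four-person hierarchy, and a column named depth in place of Level.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE Employees (EmployeeID INTEGER PRIMARY KEY, Name TEXT, ManagerID INTEGER)"
)
# Hypothetical hierarchy: Asha manages Ben and Chen; Ben manages Dev.
conn.executemany(
    "INSERT INTO Employees VALUES (?, ?, ?)",
    [(1, "Asha", None), (2, "Ben", 1), (3, "Chen", 1), (4, "Dev", 2)],
)

org_chart = conn.execute("""
    WITH RECURSIVE OrgChart AS (
        -- Anchor: employees with no manager start at depth 1
        SELECT EmployeeID, Name, ManagerID, 1 AS depth
        FROM Employees
        WHERE ManagerID IS NULL
        UNION ALL
        -- Recursive part: attach each employee to an already-found manager
        SELECT e.EmployeeID, e.Name, e.ManagerID, oc.depth + 1
        FROM Employees e
        JOIN OrgChart oc ON e.ManagerID = oc.EmployeeID
    )
    SELECT EmployeeID, Name, depth
    FROM OrgChart
    ORDER BY depth, EmployeeID
""").fetchall()
print(org_chart)
```

The recursion stops on its own when a pass finds no new employees, which assumes the manager chain has no cycles.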

🔹 Example 3: Using CTEs for Multiple Steps

Without CTE:

SELECT Name, Salary

FROM Employees

WHERE Salary > (SELECT AVG(Salary) FROM Employees);

With CTE:

WITH AvgSal AS (
    SELECT AVG(Salary) AS AvgSalary FROM Employees
)
SELECT Name, Salary
FROM Employees CROSS JOIN AvgSal
WHERE Employees.Salary > AvgSal.AvgSalary;

✅ More readable and modular.

🔹 Subquery vs CTE (Comparison)

Feature             Subquery                                     CTE
Reusability         No                                           Yes (can reference multiple times)
Readability         Hard in nested queries                       Clear and modular
Performance         Same in most databases (CTE = inline view)   Same (unless recursive)
Recursive Queries   Not possible                                 Possible with WITH RECURSIVE

🔹 Real-Life Use Cases in Data Engineering

1. Subqueries

o Filtering datasets (WHERE conditions)

o Validation (e.g., employees earning more than company avg)

o Exists/Not Exists checks for data integrity

2. CTEs

o Data transformations in ETL pipelines

o Stepwise aggregation before inserts/updates

o Handling hierarchical data (org chart, bill of materials, graph traversals)

o Complex queries made modular & reusable

✅ Practice Task for You:

1. Write a query using a CTE to find the top 3 highest-paid employees per department.

2. Write a recursive CTE to display all employees under a given manager (say ManagerID = 1).

👉 Do you want me to create sample tables with data (Employees + Departments + Salaries) so you can practice CTEs and subqueries hands-on?

You said:
next

ChatGPT said:

Perfect 👍 Now let’s go deeper into Phase 4: Subqueries & CTEs, continuing from where we left off.
We’ve covered:

 Basic Subqueries

 Subqueries with SELECT, FROM, WHERE

 Nested Subqueries

👉 Now we’ll move to Correlated Subqueries.

🔹 Correlated Subqueries

Unlike a normal (independent) subquery, a correlated subquery depends on the outer query for its values.
This means the subquery is executed once for every row of the outer query.

✅ Syntax:

SELECT column1, column2

FROM table1 t1

WHERE column3 = (

SELECT column3

FROM table2 t2

WHERE t2.id = t1.id

);

📌 Example: Find employees who earn more than the average salary
in their department

SELECT e.emp_id, e.emp_name, e.salary, e.dept_id

FROM employees e
WHERE e.salary > (

SELECT AVG(e2.salary)

FROM employees e2

WHERE e2.dept_id = e.dept_id

);

🔎 Explanation:

 For each employee e, the subquery calculates the average salary of their department.

 Then compares e.salary with that average.

 Output = Employees earning more than their department’s average.

📌 Another Example: Find customers who placed more than 2 orders

SELECT c.customer_id, c.customer_name

FROM customers c

WHERE (

SELECT COUNT(*)

FROM orders o

WHERE o.customer_id = c.customer_id

) > 2;

🔎 Explanation:

 For each customer, the subquery counts orders.

 If count > 2 → include that customer in the result.
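
The "more than 2 orders" check can be tried out as below, together with a GROUP BY / HAVING form that returns the same customers. This is a sketch assuming Python's built-in sqlite3 module; the customers and orders rows are hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Hypothetical data: Ravi has 3 orders, Sita has 1, Tom has none.
conn.executescript("""
    CREATE TABLE customers (customer_id INTEGER PRIMARY KEY, customer_name TEXT);
    CREATE TABLE orders (order_id INTEGER PRIMARY KEY, customer_id INTEGER);
    INSERT INTO customers VALUES (1, 'Ravi'), (2, 'Sita'), (3, 'Tom');
    INSERT INTO orders VALUES (1, 1), (2, 1), (3, 1), (4, 2);
""")

# Correlated form: count the orders of the current outer customer.
correlated = conn.execute("""
    SELECT c.customer_id, c.customer_name
    FROM customers c
    WHERE (
        SELECT COUNT(*) FROM orders o WHERE o.customer_id = c.customer_id
    ) > 2
""").fetchall()

# Equivalent join form: group the orders per customer and filter with HAVING.
grouped = conn.execute("""
    SELECT c.customer_id, c.customer_name
    FROM customers c
    JOIN orders o ON o.customer_id = c.customer_id
    GROUP BY c.customer_id, c.customer_name
    HAVING COUNT(*) > 2
""").fetchall()

print(correlated, grouped)
```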

⚡ Key Difference (Independent vs. Correlated Subqueries)

Type                   Runs how many times?   Depends on outer query?   Example use case
Independent Subquery   Runs once              ❌ No                      Get global average salary
Correlated Subquery    Runs for each row      ✅ Yes                     Get avg salary per department

🛠 Practice Task (Correlated Subqueries)

Using the Employees and Orders tables we created earlier:

1. Find employees whose salary is greater than at least one other employee in their department.

2. Find customers whose total order amount is above the average order amount of all customers.

3. List employees who joined before the average joining date of their department.

👉 Next after this will be Common Table Expressions (CTEs).

Would you like me to first give detailed real-time use cases of correlated subqueries (so you see why they’re used in Data Engineering), or directly move to CTEs?

You said:

select e1.emp_id,e1.emp_name,e1.salary from employees e1 where e1.salary > (select avg(salary) from employees e2 where e2.emp_id=e1.emp_id)

ChatGPT said:

Good question 👍 Let’s carefully analyze your query step by step:

SELECT e1.emp_id, e1.emp_name, e1.salary

FROM employees e1

WHERE e1.salary > (

SELECT AVG(salary)
FROM employees e2

WHERE e2.emp_id = e1.emp_id

);

🔎 What this query is doing:

 You are selecting employee details (emp_id, emp_name, salary) from table employees (aliased as e1).

 For each employee (e1), you are running a subquery:

SELECT AVG(salary)
FROM employees e2
WHERE e2.emp_id = e1.emp_id

This means:

o You are filtering employees again (e2) where emp_id is the same as the current row e1.emp_id.

o Since emp_id is unique (primary key), this subquery always selects only one employee’s salary.

o AVG(salary) of just that one row will always be equal to that employee’s salary.

👉 So essentially, the query is checking:

WHERE e1.salary > e1.salary

Which will always be FALSE.


✅ Result → Empty set (no rows returned).

⚡ Correction (What you probably intended):

If you want to find employees earning more than the average salary of
all employees, then the query should be:

SELECT e1.emp_id, e1.emp_name, e1.salary

FROM employees e1

WHERE e1.salary > (

SELECT AVG(salary)
FROM employees e2

);

 Here, the subquery calculates the average salary of the whole table, not just the employee’s own row.

 Then the outer query picks only employees whose salary is greater than that overall average.

✅ Example:

Suppose we have a table:

emp_id   emp_name   salary
1        John       4000
2        Mary       5000
3        Alex       6000
4        Sara       8000

 Average Salary = (4000 + 5000 + 6000 + 8000) / 4 = 5750

 Query returns employees with salary > 5750 → Alex (6000), Sara
(8000)
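
Both versions of the query can be run against the four sample rows to confirm the reasoning. A minimal sketch, assuming Python's built-in sqlite3 module; the ORDER BY is added only to keep the output order stable.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE employees (emp_id INTEGER PRIMARY KEY, emp_name TEXT, salary INTEGER)")
conn.executemany(
    "INSERT INTO employees VALUES (?, ?, ?)",
    [(1, "John", 4000), (2, "Mary", 5000), (3, "Alex", 6000), (4, "Sara", 8000)],
)

# Original query: the subquery matches on emp_id, so each salary
# is compared with the average of its own single row.
self_compare = conn.execute("""
    SELECT e1.emp_id, e1.emp_name, e1.salary
    FROM employees e1
    WHERE e1.salary > (
        SELECT AVG(salary) FROM employees e2 WHERE e2.emp_id = e1.emp_id
    )
""").fetchall()

# Corrected query: compare with the average of the whole table (5750).
above_overall = conn.execute("""
    SELECT e1.emp_id, e1.emp_name, e1.salary
    FROM employees e1
    WHERE e1.salary > (SELECT AVG(salary) FROM employees e2)
    ORDER BY e1.emp_id
""").fetchall()

print(self_compare)
print(above_overall)
```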

👉 So Ajay, your query was logically always false because it compared each salary with itself.
Would you like me to now show you different types of subqueries (like correlated vs uncorrelated) with real-time Data Engineering use cases?

You said:

Hey good morning

ChatGPT said:

Good morning, Ajay 🌞😊!


How are you doing today? Ready to continue with SQL learning, or do you
want to start the day with some English speaking practice?
You said:

I want to know about constraints and data types, with a full and clear explanation and multiple examples.

ChatGPT said:

Perfect 👍 That’s a great step, Ajay. Before going deeper into advanced SQL,
it’s very important to understand Constraints and Data Types clearly,
because they form the foundation of how we design and manage databases.

Let’s go step by step 👇

1. SQL Data Types

👉 A data type tells the database what kind of data can be stored in a
column. Choosing the correct type is very important for performance,
storage, and accuracy.

Common SQL Data Types (with Examples)

1. Numeric Types

o INT / INTEGER → Whole numbers
Example: age INT (values: 25, 30, -5, etc.)

o SMALLINT → Smaller range integer (takes less memory)

o BIGINT → Very large numbers

o DECIMAL(p, s) / NUMERIC(p, s) → Exact numbers with precision
 p = total digits, s = digits after decimal
Example: salary DECIMAL(10,2) → allows 12345678.90

o FLOAT / REAL / DOUBLE → Approximate decimal numbers
Example: height FLOAT (values: 5.6, 172.25)

2. Character/String Types

o CHAR(n) → Fixed length string (always takes n characters)
Example: gender CHAR(1) → ‘M’, ‘F’

o VARCHAR(n) → Variable length string (only uses needed space)
Example: name VARCHAR(50) → “Ajay”, “Rahul Kumar”

o TEXT → Large text values (like articles, descriptions)

3. Date & Time Types

o DATE → Only date (YYYY-MM-DD)
Example: birth_date DATE → 1992-08-22

o TIME → Only time (HH:MM:SS)
Example: start_time TIME → 14:30:00

o DATETIME / TIMESTAMP → Both date and time
Example: created_at TIMESTAMP → 2025-08-22 09:15:32

4. Boolean

o BOOLEAN → True/False values
Example: is_active BOOLEAN → TRUE, FALSE
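
The difference between exact (DECIMAL/NUMERIC) and approximate (FLOAT/REAL/DOUBLE) numbers is easiest to see with a small calculation. The sketch below uses Python's float and decimal.Decimal types as stand-ins for the two SQL families; it illustrates the idea and is not how any particular database stores values.

```python
from decimal import Decimal

# Approximate type: binary floating point cannot represent 0.1 or 0.2 exactly.
float_total = 0.1 + 0.2
print(float_total == 0.3)  # False

# Exact type: decimal arithmetic keeps the digits exactly as written.
decimal_total = Decimal("0.10") + Decimal("0.20")
print(decimal_total == Decimal("0.30"))  # True
```

This is why money columns such as salary or total_amount are usually declared as DECIMAL rather than FLOAT.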

✅ Example Table with Different Data Types:

CREATE TABLE Employees (

emp_id INT PRIMARY KEY,

emp_name VARCHAR(50) NOT NULL,

gender CHAR(1),

salary DECIMAL(10,2),

joining_date DATE,

is_active BOOLEAN

);

2. SQL Constraints

👉 A constraint is a rule that we apply on columns to maintain data integrity.


Types of Constraints

1. NOT NULL → Column cannot have NULL values.

emp_name VARCHAR(50) NOT NULL

✅ Every employee must have a name.

2. UNIQUE → Values must be unique (no duplicates allowed).

email VARCHAR(100) UNIQUE

✅ Two employees cannot have the same email.

3. PRIMARY KEY → Combination of NOT NULL + UNIQUE.

o Every row must have a unique and non-null primary key.

emp_id INT PRIMARY KEY

4. FOREIGN KEY → Creates relationship between two tables.

department_id INT,
FOREIGN KEY (department_id) REFERENCES Departments(dept_id)

✅ An employee must belong to a valid department.

5. CHECK → Ensures values follow a condition.

salary DECIMAL(10,2) CHECK (salary > 0)

✅ Salary must always be positive.

6. DEFAULT → Assigns a default value if no value is provided.

is_active BOOLEAN DEFAULT TRUE

7. AUTO_INCREMENT / IDENTITY (varies by DB)

o Automatically increases the value for new rows.

emp_id INT PRIMARY KEY AUTO_INCREMENT

✅ Full Example Table with Constraints:

CREATE TABLE Employees (

emp_id INT PRIMARY KEY AUTO_INCREMENT,

emp_name VARCHAR(50) NOT NULL,

email VARCHAR(100) UNIQUE,

gender CHAR(1) CHECK (gender IN ('M','F')),

salary DECIMAL(10,2) CHECK (salary > 0),

joining_date DATE DEFAULT CURRENT_DATE,

department_id INT,

is_active BOOLEAN DEFAULT TRUE,

FOREIGN KEY (department_id) REFERENCES Departments(dept_id)

);
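
To watch these constraints reject bad rows, here is a small sketch using Python's built-in sqlite3 module. It is an adaptation, not the table above verbatim: SQLite has no AUTO_INCREMENT keyword in this form, so INTEGER PRIMARY KEY is used, BOOLEAN is stored as INTEGER, and foreign keys are only enforced after the PRAGMA shown. All sample values are hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")  # SQLite checks foreign keys only when enabled
conn.executescript("""
    CREATE TABLE Departments (dept_id INTEGER PRIMARY KEY, dept_name TEXT NOT NULL);
    CREATE TABLE Employees (
        emp_id INTEGER PRIMARY KEY,          -- auto-assigned when omitted
        emp_name TEXT NOT NULL,
        email TEXT UNIQUE,
        salary NUMERIC CHECK (salary > 0),
        department_id INTEGER REFERENCES Departments(dept_id),
        is_active INTEGER DEFAULT 1
    );
    INSERT INTO Departments VALUES (1, 'IT');
    INSERT INTO Employees (emp_name, email, salary, department_id)
    VALUES ('Ajay', 'ajay@example.com', 50000, 1);
""")

# Each row below breaks exactly one rule.
bad_rows = {
    "NOT NULL": (None, "x@example.com", 40000, 1),
    "UNIQUE": ("Ravi", "ajay@example.com", 40000, 1),
    "CHECK": ("Sita", "sita@example.com", -5, 1),
    "FOREIGN KEY": ("Tom", "tom@example.com", 40000, 99),
}
rejected = []
for rule, row in bad_rows.items():
    try:
        conn.execute(
            "INSERT INTO Employees (emp_name, email, salary, department_id) "
            "VALUES (?, ?, ?, ?)",
            row,
        )
    except sqlite3.IntegrityError as exc:
        rejected.append(rule)
        print(f"{rule} violation rejected: {exc}")

# Only the valid row remains, and DEFAULT filled in is_active.
remaining = conn.execute("SELECT emp_name, is_active FROM Employees").fetchall()
print(remaining)
```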

📌 Quick Notes for a Data Engineer:

 Most used Data Types: INT, BIGINT, VARCHAR, DATE, DECIMAL, BOOLEAN, TIMESTAMP.

 Most used Constraints: PRIMARY KEY, FOREIGN KEY, NOT NULL, UNIQUE, CHECK, DEFAULT.

 Constraints ensure data quality & integrity, which is very important in Data Engineering pipelines.

👉 Ajay, do you want me to create practice exercises (with small tables) for
Constraints & Data Types so you can try writing queries?

You said:

Yes, give me 10 tables.

ChatGPT said:
Perfect Ajay 👍 Let’s make this super clear. I’ll first explain Constraints and
Data Types, then we’ll create 10 example tables (with constraints +
different data types).

🔹 What are Constraints?

Constraints are rules applied to columns to control what kind of data can be
inserted into a table.
They ensure data integrity (correctness + validity).

Common SQL Constraints:

1. NOT NULL → Column cannot store NULL values.

2. UNIQUE → Ensures all values in the column are different.

3. PRIMARY KEY → Combination of NOT NULL + UNIQUE. Uniquely identifies each row.

4. FOREIGN KEY → Creates a relationship between two tables.

5. CHECK → Ensures values meet a condition.

6. DEFAULT → Provides a default value when none is given.

7. AUTO_INCREMENT / IDENTITY → Automatically generates unique numbers (used for IDs).

🔹 SQL Data Types (Most Used in Data Engineering)

 Numeric → INT, BIGINT, DECIMAL, NUMERIC, FLOAT, REAL

 Character/String → CHAR(n), VARCHAR(n), TEXT

 Date & Time → DATE, TIME, DATETIME, TIMESTAMP

 Boolean → BOOLEAN (TRUE/FALSE)

 Binary/Other → BLOB, JSON, UUID

🔹 Example: 10 Tables with Constraints + Data Types

1. Employees

CREATE TABLE Employees (


emp_id INT PRIMARY KEY AUTO_INCREMENT,

emp_name VARCHAR(100) NOT NULL,

email VARCHAR(100) UNIQUE,

salary DECIMAL(10,2) CHECK (salary > 0),

hire_date DATE DEFAULT (CURRENT_DATE)

);

2. Departments

CREATE TABLE Departments (

dept_id INT PRIMARY KEY AUTO_INCREMENT,

dept_name VARCHAR(50) UNIQUE NOT NULL

);

3. Projects

CREATE TABLE Projects (
    project_id INT PRIMARY KEY AUTO_INCREMENT,
    project_name VARCHAR(100) NOT NULL,
    start_date DATE,
    end_date DATE,
    CHECK (end_date > start_date) -- table-level CHECK, because it compares two columns
);

4. Employee_Project (Mapping Table)

CREATE TABLE Employee_Project (

emp_id INT,

project_id INT,

role VARCHAR(50) NOT NULL,

PRIMARY KEY (emp_id, project_id),


FOREIGN KEY (emp_id) REFERENCES Employees(emp_id),

FOREIGN KEY (project_id) REFERENCES Projects(project_id)

);

5. Salaries

CREATE TABLE Salaries (

salary_id INT PRIMARY KEY AUTO_INCREMENT,

emp_id INT,

base_salary DECIMAL(10,2) NOT NULL,

bonus DECIMAL(10,2) DEFAULT 0,

FOREIGN KEY (emp_id) REFERENCES Employees(emp_id)

);

6. Customers

CREATE TABLE Customers (

customer_id INT PRIMARY KEY AUTO_INCREMENT,

customer_name VARCHAR(100) NOT NULL,

phone CHAR(10) UNIQUE,

email VARCHAR(100) UNIQUE

);

7. Orders

CREATE TABLE Orders (

order_id INT PRIMARY KEY AUTO_INCREMENT,

customer_id INT NOT NULL,

order_date DATETIME DEFAULT CURRENT_TIMESTAMP,

total_amount DECIMAL(10,2) CHECK (total_amount >= 0),


FOREIGN KEY (customer_id) REFERENCES Customers(customer_id)

);

8. Products

CREATE TABLE Products (

product_id INT PRIMARY KEY AUTO_INCREMENT,

product_name VARCHAR(100) NOT NULL,

price DECIMAL(10,2) CHECK (price > 0),

stock INT DEFAULT 0

);

9. Order_Details

CREATE TABLE Order_Details (

order_id INT,

product_id INT,

quantity INT CHECK (quantity > 0),

PRIMARY KEY (order_id, product_id),

FOREIGN KEY (order_id) REFERENCES Orders(order_id),

FOREIGN KEY (product_id) REFERENCES Products(product_id)

);

10. Users (Login Table)

CREATE TABLE Users (

user_id INT PRIMARY KEY AUTO_INCREMENT,

username VARCHAR(50) UNIQUE NOT NULL,

password VARCHAR(100) NOT NULL,

is_active BOOLEAN DEFAULT TRUE,


created_at TIMESTAMP DEFAULT CURRENT_TIMESTAMP

);

✅ These 10 tables cover almost every SQL constraint + data type you’ll
need in Data Engineering & Real Projects.

👉 Do you want me to also insert sample data into these 10 tables so you
can practice queries (joins, subqueries, etc.) on them?

You said:

Give me test for 10 more tables

ChatGPT said:

Perfect Ajay 👍 let’s take this to the next level 🚀.


Since you’ve already seen 10 tables with constraints and data types,
now I’ll give you 10 more test tables that will help you practice different
constraints, advanced data types, and relationships.

Phase 1: Additional 10 Tables with Constraints & Data Types

-- 1. Departments Table

CREATE TABLE Departments (

dept_id INT PRIMARY KEY,

dept_name VARCHAR(100) UNIQUE NOT NULL,

location VARCHAR(100) DEFAULT 'Headquarters'

);

-- 2. Projects Table

CREATE TABLE Projects (

project_id INT PRIMARY KEY,

project_name VARCHAR(150) NOT NULL,

start_date DATE NOT NULL,


end_date DATE,

budget DECIMAL(12,2) CHECK (budget >= 10000)

);

-- 3. Employees Table

CREATE TABLE Employees (

emp_id INT PRIMARY KEY,

emp_name VARCHAR(100) NOT NULL,

dept_id INT,

hire_date DATE DEFAULT GETDATE(), -- GETDATE() is SQL Server syntax; use CURRENT_DATE in MySQL or PostgreSQL

salary DECIMAL(10,2) CHECK (salary >= 30000),

email VARCHAR(150) UNIQUE,

CONSTRAINT fk_dept FOREIGN KEY (dept_id) REFERENCES Departments(dept_id)

);

-- 4. EmployeeProjects Table (Many-to-Many relationship)

CREATE TABLE EmployeeProjects (

emp_id INT,

project_id INT,

role VARCHAR(50) DEFAULT 'Developer',

PRIMARY KEY (emp_id, project_id),

FOREIGN KEY (emp_id) REFERENCES Employees(emp_id),

FOREIGN KEY (project_id) REFERENCES Projects(project_id)

);

-- 5. Customers Table
CREATE TABLE Customers (

customer_id INT PRIMARY KEY,

customer_name VARCHAR(100) NOT NULL,

phone CHAR(10) UNIQUE,

email VARCHAR(150),

city VARCHAR(50) DEFAULT 'Unknown'

);

-- 6. Orders Table

CREATE TABLE Orders (

order_id INT PRIMARY KEY,

customer_id INT NOT NULL,

order_date DATE DEFAULT GETDATE(),

total_amount DECIMAL(10,2) CHECK (total_amount >= 0),

CONSTRAINT fk_customer FOREIGN KEY (customer_id) REFERENCES Customers(customer_id)

);

-- 7. Products Table (created before OrderDetails, which references it)

CREATE TABLE Products (
    product_id INT PRIMARY KEY,
    product_name VARCHAR(100) NOT NULL,
    category VARCHAR(50),
    price DECIMAL(10,2) CHECK (price > 0),
    stock INT CHECK (stock >= 0)
);

-- 8. OrderDetails Table (Composite Primary Key)

CREATE TABLE OrderDetails (
    order_id INT,
    product_id INT,
    quantity INT CHECK (quantity > 0),
    price DECIMAL(8,2),
    PRIMARY KEY (order_id, product_id),
    FOREIGN KEY (order_id) REFERENCES Orders(order_id),
    FOREIGN KEY (product_id) REFERENCES Products(product_id)
);

-- 9. Payments Table

CREATE TABLE Payments (

payment_id INT PRIMARY KEY,

order_id INT NOT NULL,

payment_date DATE DEFAULT GETDATE(),

payment_mode VARCHAR(50) CHECK (payment_mode IN ('Cash', 'Card', 'UPI', 'NetBanking')),

amount DECIMAL(10,2) NOT NULL,

FOREIGN KEY (order_id) REFERENCES Orders(order_id)

);

-- 10. AuditLog Table (History tracking)

CREATE TABLE AuditLog (

log_id INT PRIMARY KEY,

table_name VARCHAR(50) NOT NULL,

action VARCHAR(20) CHECK (action IN ('INSERT','UPDATE','DELETE')),


log_timestamp DATETIME DEFAULT GETDATE(),

user_id INT

);

✅ Key Concepts in These 10 Tables

1. Primary Keys → Unique identifiers (dept_id, project_id, emp_id, etc.)

2. Unique Constraints → Ensure no duplicates (dept_name, email, phone).

3. Default Values → Auto-assign if not provided (location, hire_date, order_date).

4. Check Constraints → Enforce rules (salary >= 30000, budget >= 10000).

5. Foreign Keys → Relationships (dept_id → Departments, customer_id → Customers).

6. Composite Keys → OrderDetails (order_id, product_id).

7. Data Types Variety → INT, VARCHAR, CHAR, DATE, DATETIME, DECIMAL.

8. ENUM-like Check → payment_mode only allows (Cash, Card, UPI, NetBanking).

9. Many-to-Many → EmployeeProjects connects Employees ↔ Projects.

10. Audit Table → Tracks history (real-time data engineering use-case).
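
Two of the concepts above, the composite key and the ENUM-like CHECK, can be tried out with the sketch below. It assumes Python's built-in sqlite3 module and trimmed-down versions of OrderDetails and Payments (foreign keys and most columns are left out to keep it short); try_insert is a helper made up for this example.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE OrderDetails (
        order_id INTEGER,
        product_id INTEGER,
        quantity INTEGER CHECK (quantity > 0),
        PRIMARY KEY (order_id, product_id)   -- the pair must be unique
    );
    CREATE TABLE Payments (
        payment_id INTEGER PRIMARY KEY,
        payment_mode TEXT CHECK (payment_mode IN ('Cash', 'Card', 'UPI', 'NetBanking'))
    );
""")

def try_insert(sql, params):
    """Return True if the row was accepted, False if a constraint rejected it."""
    try:
        conn.execute(sql, params)
        return True
    except sqlite3.IntegrityError:
        return False

results = [
    try_insert("INSERT INTO OrderDetails VALUES (?, ?, ?)", (1, 10, 2)),  # new pair
    try_insert("INSERT INTO OrderDetails VALUES (?, ?, ?)", (1, 11, 1)),  # same order, other product
    try_insert("INSERT INTO OrderDetails VALUES (?, ?, ?)", (1, 10, 5)),  # duplicate pair
    try_insert("INSERT INTO Payments VALUES (?, ?)", (1, "UPI")),         # allowed mode
    try_insert("INSERT INTO Payments VALUES (?, ?)", (2, "Cheque")),      # mode not in the list
]
print(results)
```

The same order_id or the same product_id may repeat on its own; only the combination has to be unique.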

👉 Next Step for You:


Would you like me to now create INSERT test data for these 10 tables
(so you can practice queries), or do you want me to give you 10 SQL
practice questions (joins, constraints, subqueries) based on these tables?

You said:

I finished joins , let's continue

ChatGPT said:
Good morning Ajay 🌞 Perfect — since you’ve completed Joins (Phase 3), let’s
properly continue with Phase 4: Subqueries & CTEs in your SQL for Data
Engineering roadmap.

Here’s how we’ll break it down step by step (with clear explanations,
examples, and practice tasks for each):

🔹 Phase 4: Subqueries & CTEs

1. Introduction to Subqueries

 A subquery is a query inside another query.

 Types of subqueries:

1. Scalar subquery → returns a single value.

2. Row subquery → returns a single row.

3. Table subquery → returns multiple rows/columns.

4. Correlated subquery → depends on outer query.

✅ Example (scalar subquery):

SELECT emp

You said:

Give me road map once again

ChatGPT said:

Perfect timing Ajay 👍 Since you’ve finished Joins, let’s revisit the SQL for
Data Engineering Roadmap so you see the big picture and where you’re
at now. I’ll make it step-by-step, beginner → advanced → data engineering
focus.

🚀 SQL for Data Engineering – Full Roadmap

📌 Phase 1: SQL Foundations

1. Introduction to Databases & SQL

o What is a DB, RDBMS, tables, rows, columns, primary key, foreign key.

o SQL types → DDL, DML, DCL, TCL.

o Basic commands: CREATE, DROP, TRUNCATE, INSERT, UPDATE, DELETE.

2. Data Types & Constraints

o Data types: INT, VARCHAR, DATE, DECIMAL, BOOLEAN, etc.

o Constraints: PRIMARY KEY, FOREIGN KEY, UNIQUE, NOT NULL, CHECK, DEFAULT.

📌 Phase 2: Core SQL Queries

3. Filtering Data

o WHERE, BETWEEN, IN, LIKE, IS NULL.

4. Sorting & Limiting Results

o ORDER BY, LIMIT, OFFSET/FETCH.

5. Aggregation & Grouping

o Functions: COUNT, SUM, AVG, MIN, MAX.

o GROUP BY and HAVING.

📌 Phase 3: Multi-Table Operations

6. Joins (Finished ✅)

o INNER JOIN, LEFT JOIN, RIGHT JOIN, FULL JOIN, CROSS JOIN.

o Self joins, joining multiple tables.

o Practice with employee, department, orders, customers tables.

7. Set Operations

o UNION, UNION ALL, INTERSECT, EXCEPT/MINUS.

📌 Phase 4: Advanced Querying

8. Subqueries
o Scalar, row, table subqueries.

o Correlated subqueries.

9. Common Table Expressions (CTEs)

o Simple CTEs.

o Recursive CTEs (e.g., employee hierarchy, tree traversal).

10. Window Functions

 ROW_NUMBER(), RANK(), DENSE_RANK().

 Running totals, moving averages, partitions.

📌 Phase 5: Data Modification & Transactions

11. INSERT, UPDATE, DELETE with conditions.

12. Transactions (BEGIN, COMMIT, ROLLBACK).

13. ACID properties.

📌 Phase 6: Database Design & Indexing

14. Normalization & Denormalization.

15. Indexes – clustered, non-clustered, unique, composite.

16. Views & Materialized Views.

📌 Phase 7: Performance & Advanced SQL

17. Query optimization basics (EXPLAIN plans).

18. Partitioning, sharding concepts.

19. Stored procedures & functions.

20. Triggers.

📌 Phase 8: SQL for Data Engineering

21. ETL in SQL – extracting, cleaning, transforming data.


22. Working with big data in SQL (Snowflake, BigQuery, Redshift basics).

23. Data Warehousing concepts – star schema, fact & dimension tables.

24. Real-world case studies – log analysis, sales reporting, IoT data handling.

✅ You are currently at Phase 4 → Step 8: Subqueries.

Would you like me to start with Subqueries (scalar → correlated) step by step with examples + practice tasks now?

You said:

Ok, let's continue to Phase 4 and give complete details in Telugu. Make sure to give me a definition for each, plus multiple examples and real-time scenarios.

ChatGPT said:

Good, Ajay 👌 We have now reached Phase 4: Subqueries & CTEs.
I will give you the full details: definitions, examples, and real-time scenarios.

🔹 Phase 4: Subqueries & CTEs (Common Table Expressions)

1. What is a Subquery❓

👉 Definition:
A subquery is a query inside another query. In other words, inside a main query (outer query) we write another query (inner query) to supply a condition or filter.

🔹 A subquery is written inside parentheses ().

🔹 A non-correlated subquery runs first, and the outer query then uses its result.

Types of Subqueries:

(A) Single Row Subquery

👉 Returns only one row. (Used with operators such as =, <, >.)

Example:

-- Details of the employee with the highest salary
SELECT emp_id, emp_name, salary
FROM employees
WHERE salary = (SELECT MAX(salary) FROM employees);

📌 Real-time: Identify the highest-paid person in the company.

(B) Multiple Row Subquery

👉 Returns more than one row. Use the IN, ANY, or ALL operators here.

Example:

-- Employees in the Sales or HR department
SELECT emp_id, emp_name, department_id
FROM employees
WHERE department_id IN (SELECT department_id
                        FROM departments
                        WHERE department_name IN ('Sales','HR'));

📌 Real-time: Filter employees across more than one department.

(C) Correlated Subquery

👉 It depends on the outer query.

👉 For every row of the outer query, values from that row are passed into the subquery.

Example:

-- Employees earning more than their department average
SELECT e1.emp_id, e1.emp_name, e1.salary
FROM employees e1
WHERE e1.salary > (SELECT AVG(e2.salary)
                   FROM employees e2
                   WHERE e2.department_id = e1.department_id);

📌 Real-time: Identify employees who earn more than the average salary of their department.

(D) Nested Subqueries

👉 A subquery inside another subquery.

Example:

-- Employees working in the department of the highest-paid employee
SELECT emp_id, emp_name, salary
FROM employees
WHERE department_id = (SELECT department_id
                       FROM employees
                       WHERE salary = (SELECT MAX(salary)
                                       FROM employees));

📌 Real-time: Which department is the top-earning employee in? Who are the employees in that department?

2. Common Table Expressions (CTEs)

👉 Definition:
A CTE is a temporary result set. It can be referenced repeatedly within the query.

🔹 Starts with the WITH keyword.

🔹 Used to write complex queries in a simpler way.

Example 1: Simple CTE

WITH HighSalary AS (
    SELECT emp_id, emp_name, salary
    FROM employees
    WHERE salary > 50000
)
SELECT * FROM HighSalary;

📌 Real-time: First identify the high-salary employees, then run calculations on them.

Example 2: Recursive CTE

👉 Used for hierarchical data (e.g. employee hierarchy, organization chart, folder structure).

-- SQL Server omits the RECURSIVE keyword; MySQL and PostgreSQL require it
WITH RECURSIVE EmployeeHierarchy AS (
    SELECT emp_id, emp_name, manager_id
    FROM employees
    WHERE manager_id IS NULL -- CEO
    UNION ALL
    SELECT e.emp_id, e.emp_name, e.manager_id
    FROM employees e
    INNER JOIN EmployeeHierarchy eh
        ON e.manager_id = eh.emp_id
)
SELECT * FROM EmployeeHierarchy;

📌 Real-time: To understand the employee → manager → director hierarchy in a company.

3. Real-time Data Engineer Scenarios

1. ETL Pipelines

o Use a subquery to pull the latest data from a large dataset.

o Example: fetching the last 30 days of transactions.

2. Data Cleaning

o Use a subquery to identify duplicates.

o Example: WHERE id IN (SELECT MIN(id) FROM table GROUP BY email).

3. Reporting

o Fetching the regions whose sales are above the average.

o This is a subquery example (comparing against an aggregated result).

4. Data Aggregation

o Use a CTE to filter in multiple steps and generate the report at the end.

👉 Ajay, these are the complete details of Phase 4: Subqueries & CTEs.
Now I will give you practice questions, and you should try them. ✅

📝 Practice (Phase 4 Test)

1. Get the details of the employee with the highest salary in the employees table.

2. In each department, which employees earn more than the department's average salary?

3. Which department is the lowest-paid employee in? Show the details of the employees in that department.

4. Show the list of customers who placed an order in the last 30 days (assume an orders table and a customers table).

5. Show the employee hierarchy using a recursive CTE.

Which would you like to try first, Ajay: subquery practice or CTE practice?
You said:

Make it again in english

ChatGPT said:

Perfect 👍 Let's go step by step through Phase 4: Subqueries & CTEs in English, with clear definitions, multiple examples, and real-time scenarios.

Phase 4: Subqueries & CTEs

1. Subqueries

👉 A subquery is a query inside another query. It is also called an inner query or nested query.

🔹 Types of subqueries:

1. Single-row subquery → returns only one row.

2. Multi-row subquery → returns multiple rows.

3. Correlated subquery → depends on the outer query.

4. Nested subquery → subquery inside another subquery.

1.1 Single-row Subquery

Used with operators like =, <, >, <=, >=.

Example:

SELECT emp_name, salary

FROM employees

WHERE salary > (SELECT AVG(salary) FROM employees);

✅ Finds employees earning more than the average salary.

Real-time scenario:
In a company payroll system, HR wants to know who is getting a salary
higher than the average — this query gives the result.
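Here is a minimal runnable sketch of the single-row subquery, using Python's sqlite3 module and an in-memory database. The three sample rows are made up for illustration; only the employees table and its emp_name and salary columns come from the example above.

```python
import sqlite3

# Hypothetical sample data.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE employees (emp_id INTEGER PRIMARY KEY, emp_name TEXT, salary INTEGER);
INSERT INTO employees VALUES
    (1, 'Asha', 40000),
    (2, 'Ben',  60000),
    (3, 'Chen', 80000);
""")

# The inner query returns exactly one value (the average, 60000),
# so it can sit on the right-hand side of a plain > comparison.
above_average = conn.execute("""
SELECT emp_name, salary
FROM employees
WHERE salary > (SELECT AVG(salary) FROM employees)
""").fetchall()

print(above_average)  # [('Chen', 80000)]
```

Ben earns exactly the average, so he is excluded: > is strict. Use >= if "at or above average" is what HR actually wants.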

1.2 Multi-row Subquery


Used with IN, ANY, ALL.

Example:

SELECT emp_name, department_id
FROM employees
WHERE department_id IN (
    SELECT department_id
    FROM departments
    WHERE location = 'New York'
);

✅ Finds employees working in departments located in New York.

Real-time scenario:
If a company has branches in different cities, this query helps to find
employees in a specific location.
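A small sketch of the multi-row case, run with Python's sqlite3 module against an in-memory database. The sample departments and employees are hypothetical. Only IN is shown because SQLite, which is used here just to keep the example self-contained, does not implement the ANY and ALL operators.

```python
import sqlite3

# Hypothetical sample data: two departments are in New York, one in Chicago.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE departments (department_id INTEGER PRIMARY KEY, dept_name TEXT, location TEXT);
CREATE TABLE employees (emp_id INTEGER PRIMARY KEY, emp_name TEXT, department_id INTEGER);
INSERT INTO departments VALUES
    (10, 'Sales',   'New York'),
    (20, 'HR',      'Chicago'),
    (30, 'Finance', 'New York');
INSERT INTO employees VALUES
    (1, 'Asha', 10),
    (2, 'Ben',  20),
    (3, 'Chen', 30);
""")

# The inner query returns several department ids (10 and 30),
# so the outer query needs IN rather than =.
new_york_staff = conn.execute("""
SELECT emp_name, department_id
FROM employees
WHERE department_id IN (
    SELECT department_id FROM departments WHERE location = 'New York'
)
ORDER BY emp_name
""").fetchall()

print(new_york_staff)  # [('Asha', 10), ('Chen', 30)]
```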

1.3 Correlated Subquery

👉 References a column of the outer query, so it is conceptually evaluated once for each outer row.

Example:

SELECT emp_name, salary

FROM employees e1

WHERE salary > (SELECT AVG(salary)

FROM employees e2

WHERE e1.department_id = e2.department_id);

✅ Finds employees earning above the average salary in their own department.

Real-time scenario:
Helps managers to compare employees only within their own
team/department, not the entire company.
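The difference from the company-wide average is easiest to see on data. The sketch below uses Python's sqlite3 module with an in-memory database; the four employees are hypothetical and chosen so that one of them is above the company average but below the average of her own department.

```python
import sqlite3

# Hypothetical sample data.
# Department 10 average: 60000. Department 20 average: 100000.
# Company-wide average: 80000.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE employees (
    emp_id INTEGER PRIMARY KEY, emp_name TEXT,
    department_id INTEGER, salary INTEGER
);
INSERT INTO employees VALUES
    (1, 'Asha', 10,  50000),
    (2, 'Ben',  10,  70000),
    (3, 'Chen', 20,  90000),
    (4, 'Dev',  20, 110000);
""")

# The inner query refers to e1.department_id, so its result changes
# with each outer row: that reference is what makes it correlated.
above_dept_average = conn.execute("""
SELECT emp_name, salary
FROM employees e1
WHERE salary > (SELECT AVG(salary)
                FROM employees e2
                WHERE e1.department_id = e2.department_id)
ORDER BY emp_name
""").fetchall()

print(above_dept_average)  # [('Ben', 70000), ('Dev', 110000)]
```

Chen (90000) beats the company average of 80000 but not her department's 100000, so she is left out, while Ben qualifies with a lower salary.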

1.4 Nested Subquery

👉 A subquery inside another subquery.

Example:

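SELECT emp_name
FROM employees
WHERE department_id IN (    -- IN, because several departments can share a location
    SELECT department_id
    FROM departments
    WHERE location = (      -- assumes departments.location stores the location id
        SELECT location_id
        FROM locations
        WHERE city = 'London'
    )
);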

✅ Finds employees working in the London office.
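A runnable sketch of the three-level lookup, using Python's sqlite3 module and an in-memory database. The sample rows are hypothetical, and so is the assumption that departments.location holds a location id matching locations.location_id. The outer comparison uses IN because more than one department can sit in the same city.

```python
import sqlite3

# Hypothetical sample data: Sales and Ops are both in London.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE locations (location_id INTEGER PRIMARY KEY, city TEXT);
CREATE TABLE departments (department_id INTEGER PRIMARY KEY, dept_name TEXT, location INTEGER);
CREATE TABLE employees (emp_id INTEGER PRIMARY KEY, emp_name TEXT, department_id INTEGER);
INSERT INTO locations VALUES (1, 'London'), (2, 'Paris');
INSERT INTO departments VALUES (10, 'Sales', 1), (20, 'HR', 2), (30, 'Ops', 1);
INSERT INTO employees VALUES (1, 'Asha', 10), (2, 'Ben', 20), (3, 'Chen', 30);
""")

# Read it inside out: city -> location id -> department ids -> employees.
london_staff = [row[0] for row in conn.execute("""
SELECT emp_name
FROM employees
WHERE department_id IN (
    SELECT department_id
    FROM departments
    WHERE location = (
        SELECT location_id FROM locations WHERE city = 'London'
    )
)
ORDER BY emp_name
""")]

print(london_staff)  # ['Asha', 'Chen']
```

The same result can be produced with two joins, which is often easier to read once the nesting goes beyond two levels.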

2. CTEs (Common Table Expressions)

👉 A CTE is like a temporary result set that you can reference within a query.
It makes queries easier to read and maintain.
Uses the WITH keyword.

Syntax:

WITH cte_name AS (
    SELECT ...
)
SELECT * FROM cte_name;

2.1 Simple CTE

Example:

WITH HighSalary AS (
    SELECT emp_name, salary
    FROM employees
    WHERE salary > 70000
)
SELECT * FROM HighSalary;

✅ Defines a named result set of employees with salaries above 70,000. It exists only for the duration of this query; no table is actually created.

2.2 CTE with Joins

Example:

WITH DeptSalary AS (
    SELECT department_id, AVG(salary) AS avg_salary
    FROM employees
    GROUP BY department_id
)
SELECT e.emp_name, e.salary, d.avg_salary
FROM employees e
JOIN DeptSalary d ON e.department_id = d.department_id
WHERE e.salary > d.avg_salary;

✅ Finds employees earning above their department’s average.
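CTEs can also be chained, with each step reading from the previous one. The sketch below, run with Python's sqlite3 module on an in-memory database, reuses the DeptSalary idea and adds a second, hypothetical CTE named AboveAverage to count the above-average earners per department. The sample rows are made up for illustration.

```python
import sqlite3

# Hypothetical sample data.
# Department 10 average: 70000. Department 20 average: 80000.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE employees (
    emp_id INTEGER PRIMARY KEY, emp_name TEXT,
    department_id INTEGER, salary INTEGER
);
INSERT INTO employees VALUES
    (1, 'Asha',  10,  50000),
    (2, 'Ben',   10,  70000),
    (3, 'Chen',  10,  90000),
    (4, 'Dev',   20,  40000),
    (5, 'Esha',  20, 100000),
    (6, 'Farid', 20, 100000);
""")

counts = conn.execute("""
WITH DeptSalary AS (
    -- Step 1: average salary per department
    SELECT department_id, AVG(salary) AS avg_salary
    FROM employees
    GROUP BY department_id
),
AboveAverage AS (
    -- Step 2: employees earning more than their department's average
    SELECT e.department_id, e.emp_name
    FROM employees e
    JOIN DeptSalary d ON e.department_id = d.department_id
    WHERE e.salary > d.avg_salary
)
-- Step 3: the final report
SELECT department_id, COUNT(*) AS above_avg_count
FROM AboveAverage
GROUP BY department_id
ORDER BY department_id
""").fetchall()

print(counts)  # [(10, 1), (20, 2)]
```

Each step has a name and can be checked on its own by temporarily ending the query with SELECT * FROM that step, which is the main readability gain over nesting the same logic as subqueries.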

2.3 Recursive CTE

👉 Used for hierarchical data (like employee-manager relationships, folder structures).

Example:

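-- RECURSIVE is required in PostgreSQL and MySQL 8+; SQL Server uses plain WITH
WITH RECURSIVE EmployeeHierarchy AS (
    -- Anchor member: the starting rows
    SELECT emp_id, emp_name, manager_id
    FROM employees
    WHERE manager_id IS NULL -- Top manager
    UNION ALL
    -- Recursive member: direct reports of rows already in the result
    SELECT e.emp_id, e.emp_name, e.manager_id
    FROM employees e
    INNER JOIN EmployeeHierarchy eh
        ON e.manager_id = eh.emp_id
)
SELECT * FROM EmployeeHierarchy;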

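✅ Returns every employee reachable from the top manager, i.e. the full reporting tree. Change the anchor condition to a specific emp_id to get only the employees under that manager.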

Real-time scenario:

• In an organization chart, this helps to find the reporting structure.

• In e-commerce, it helps to fetch product categories and subcategories.
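The e-commerce case can be sketched the same way. The example below uses Python's sqlite3 module with an in-memory database; the categories table, its columns and its rows are all hypothetical. It starts from one chosen category instead of the root and builds a readable path as it descends.

```python
import sqlite3

# Hypothetical category tree:
# Electronics -> Phones -> Smartphones, Electronics -> Laptops, and Books.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE categories (category_id INTEGER PRIMARY KEY, name TEXT, parent_id INTEGER);
INSERT INTO categories VALUES
    (1, 'Electronics', NULL),
    (2, 'Phones',      1),
    (3, 'Laptops',     1),
    (4, 'Smartphones', 2),
    (5, 'Books',       NULL);
""")

paths = [row[0] for row in conn.execute("""
WITH RECURSIVE CategoryTree AS (
    -- Anchor member: the category we want to expand
    SELECT category_id, name, name AS path
    FROM categories
    WHERE name = 'Electronics'
    UNION ALL
    -- Recursive member: children of categories found so far
    SELECT c.category_id, c.name, ct.path || ' > ' || c.name
    FROM categories c
    JOIN CategoryTree ct ON c.parent_id = ct.category_id
)
SELECT path FROM CategoryTree ORDER BY path
""")]

for path in paths:
    print(path)
```

Books never appears in the output because it is not reachable from the anchor row; the anchor condition decides which subtree the recursion covers.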

📌 Summary

• Subqueries = queries inside queries (single-row, multi-row, correlated, nested).

• CTEs = temporary result sets (simple, join-based, recursive).

• Use Cases in Real Life:

o Payroll analysis (above average salary).

o Employee hierarchy (recursive CTE).

o Location-based reporting.

o Complex filtering and joins simplified with CTEs.

Would you like me to now create practice exercises (with solutions) for
subqueries & CTEs so you can master them step by step?
