Module_5

SQL (Structured Query Language) is a standard language for managing relational databases, enabling CRUD operations and ensuring data integrity. Data modeling is crucial in organizing data elements and their relationships, which aids in efficient database design and data quality. Key concepts include normalization to reduce redundancy, star schemas for analytical processing, and ACID properties to ensure transaction reliability.

📘 What is an SQL Database?

✅ SQL (Structured Query Language) is:

 A standard language for managing and querying relational databases


 Used to create, read, update, and delete (CRUD) data
 The foundation of data storage and retrieval in many organizations

✅ Relational Database (RDBMS):

 Stores data in tables (relations) with rows and columns


 Enforces data integrity, relationships, and constraints
 Examples: MySQL, PostgreSQL, SQLite, SQL Server, Oracle

💡 Why SQL is Crucial in Data Science


| Data Science Task | Role of SQL |
|---|---|
| Data Collection | Extract structured data from relational databases using queries |
| Data Cleaning | Filter, group, and preprocess large datasets before analysis |
| Feature Engineering | Join multiple tables, calculate aggregates for modeling |
| Model Training | Load only relevant data for efficiency |
| Dashboard/Reporting | Query real-time or stored data for visualization tools |
| Big Data Pipelines | SQL integrates with tools like Spark, Hive, BigQuery |

🧱 Common SQL Database Systems


| Database | Key Features | Best For |
|---|---|---|
| MySQL | Widely used, open-source, fast | Web apps, analytics |
| PostgreSQL | Advanced features, strong data types | Complex queries, geospatial |
| SQLite | File-based, lightweight, no server needed | Prototyping, small projects |
| SQL Server | Microsoft’s RDBMS, enterprise-grade | Enterprise BI, large companies |
| Oracle | High performance, paid | Banking, large data warehouses |
SQL Database: Installation Guide (Step-by-Step)

Let’s focus on MySQL (most commonly used) and SQLite (simplest to start with).

🔹 Option 1: Install MySQL

🔧 1. Download MySQL

Visit: https://dev.mysql.com/downloads/

 Choose MySQL Community Edition


 Select the appropriate version for Windows/macOS/Linux

🔧 2. Install MySQL

 Use the MySQL Installer


 Choose Developer Default setup
 Set root password (remember it!)
 Start MySQL Server as a Windows service (or daemon on Linux)

🔧 3. Verify Installation

 Open terminal/command prompt:

 Login using the password


 You should see the MySQL prompt:
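A minimal check, assuming the mysql client is on your PATH (exact output varies by OS and version):

```bash
mysql -u root -p
# Enter the root password you set during installation.
# On success you land at the prompt:
#   mysql>
```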

✅ Install Python Connector
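One common option is the official connector, installed via pip:

```bash
pip install mysql-connector-python
```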


✅ Sample Python Connection Code
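A minimal sketch, assuming a local server and the root password set above (host, user, and database names are placeholders):

```python
import mysql.connector

# Connect to a local MySQL server (credentials are placeholders)
conn = mysql.connector.connect(
    host="localhost",
    user="root",
    password="your_password",
    database="test_db",
)
cursor = conn.cursor()
cursor.execute("SELECT VERSION()")
print(cursor.fetchone())  # e.g., ('8.0.36',)
conn.close()
```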

🔹 Option 2: Install SQLite (No Server Required)

📦 Install SQLite via Python
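sqlite3 ships with the Python standard library, so no separate install is usually needed:

```python
import sqlite3
print(sqlite3.sqlite_version)  # version of the bundled SQLite engine
```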

🧪 Connect and Use
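A small end-to-end sketch (the file name example.db and the users table are illustrative):

```python
import sqlite3

conn = sqlite3.connect("example.db")   # creates the file if it doesn't exist
cur = conn.cursor()
cur.execute("CREATE TABLE IF NOT EXISTS users (id INTEGER PRIMARY KEY, name TEXT)")
cur.execute("INSERT INTO users (name) VALUES (?)", ("Alice",))
conn.commit()
print(cur.execute("SELECT * FROM users").fetchall())
conn.close()
```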


✅ Works for quick testing, small datasets, or local projects.

🧠 Core SQL Concepts

| Concept | Example | Purpose |
|---|---|---|
| SELECT | SELECT * FROM table | Retrieve data |
| WHERE | WHERE price > 500 | Filter rows |
| JOIN | INNER JOIN orders ON ... | Combine data from tables |
| GROUP BY | GROUP BY product | Aggregate (sum, avg, etc.) |
| ORDER BY | ORDER BY revenue DESC | Sort results |
| INSERT, UPDATE | INSERT INTO table ... | Add or modify data |
| CREATE TABLE | CREATE TABLE ... | Define table schema, create new tables |

📈 Relevance in Data Science

| Use Case | SQL Role |
|---|---|
| Preprocessing for ML | Query only needed features, rows, date ranges |
| Data Pipelines | Use SQL queries in ETL/ELT stages |
| Joins and Merges | Combine data from customers, transactions, feedback, etc. |
| Exploratory Data Analysis | Aggregate sales, user behavior, visits from raw tables |
| Business Intelligence (BI) | Backend for dashboards using Tableau, Power BI, Looker |
| Big Data Access | Connect SQL with SparkSQL, Hive, Google BigQuery |
✅ Summary

| Feature | MySQL | SQLite |
|---|---|---|
| Server needed | ✅ Yes | ❌ No |
| Speed | 🔁 High | 🚀 Fast for small data |
| Setup difficulty | Medium | Very Easy |
| Best for | Production DBs, Analysis | Testing, Local Projects |
| Python support | mysql.connector, sqlalchemy | sqlite3, sqlalchemy |

📌 What is Data Modeling?

Data Modeling is the process of defining and organizing data elements, their
relationships, and structure to represent how data will be stored, accessed, and
updated in a database system.

🧠 Think of data modeling as creating a blueprint for your data, just like how an
architect creates a blueprint for a building.

🎯 Purpose of Data Modeling

 ✅ Ensures data consistency


 ✅ Helps in designing efficient databases
 ✅ Supports data quality and validation
 ✅ Clarifies data relationships
 ✅ Makes databases easier to query and scale

🧱 Components of a Data Model

| Component | Description |
|---|---|
| Entities | Things or objects (e.g., Customer, Product, Order) |
| Attributes | Properties of entities (e.g., CustomerName, OrderDate) |
| Relationships | How entities are linked (e.g., one-to-many, many-to-many) |
| Primary Key | Unique identifier for an entity |
| Foreign Key | Connects rows across different tables |
🧭 Types of Data Models

| Type | Description |
|---|---|
| Conceptual Model | High-level view of data, focuses on business entities and relationships |
| Logical Model | Adds structure: tables, columns, data types, keys, constraints |
| Physical Model | Maps to actual database implementation (SQL, NoSQL), with indexes, storage |

Example: E-Commerce Data Model

🧾 Entities:

 Customer: CustomerID, Name, Email


 Product: ProductID, Name, Price
 Order: OrderID, CustomerID, Date
 OrderItem: OrderItemID, OrderID, ProductID, Quantity

🧩 Relationships:

 One Customer → many Orders


 One Order → many OrderItems
 Each OrderItem → one Product

This structure is typically implemented in a relational database using foreign keys.

Data Modeling Tools

 ER Diagrams (Entity-Relationship Diagrams): Visualize entities and relationships
 Tools like:
o dbdiagram.io
o Lucidchart
o MySQL Workbench
o ERDPlus
o Microsoft Visio
🧪 Data Modeling in Python / Pandas Example

Even in pandas, you mimic data models using DataFrames and their
relationships:
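A minimal sketch with made-up toy data, mirroring the Customer → Order relationship with a merge:

```python
import pandas as pd

# Toy "tables" mimicking the Customer and Order entities (illustrative data)
customers = pd.DataFrame({"CustomerID": [1, 2], "Name": ["Alice", "Bob"]})
orders = pd.DataFrame({"OrderID": [10, 11, 12], "CustomerID": [1, 1, 2]})

# A foreign-key style join: each order row points at exactly one customer
order_details = orders.merge(customers, on="CustomerID", how="left")
print(order_details)
```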

📊 Data Modeling in Data Science


| Use Case | How Data Modeling Helps |
|---|---|
| 🧼 Data Cleaning | Understand data dependencies and integrity rules |
| Feature Engineering | Build features across related tables (joins, aggregates) |
| 🧠 Machine Learning Pipelines | Identify entity relationships and time dependencies |
| 📈 Data Warehousing | Define star/snowflake schemas for analytical processing |
| 📊 Business Intelligence | Design clear models for Tableau, Power BI, Looker dashboards |
| 🔌 ETL Pipelines | Mapping from source → transformed → warehouse structure |
🧠 Best Practices in Data Modeling

| Practice | Benefit |
|---|---|
| 🔍 Define clear relationships | Better joins, queries, and aggregations |
| 🧼 Normalize your data | Reduce redundancy and improve consistency |
| 📅 Include time dimension | Enables time-based analysis (e.g., cohort, seasonality) |
| 🔐 Use appropriate keys | Maintain referential integrity |
| 📏 Design for scalability | Avoid over-nesting or overly complex joins in large datasets |

✅ Summary

| Aspect | Key Idea |
|---|---|
| What | Blueprint for how data is structured and related |
| Why | Ensures clean, consistent, and usable data for analysis and modeling |
| Types | Conceptual, Logical, Physical |
| Used In | ML pipelines, BI dashboards, database design, ETL processes |
| Tools | ER diagrams, MySQL Workbench, dbdiagram.io, pandas (for lightweight work) |

🧠 Data Modeling in SQL — Complete Guide

🔍 What is Data Modeling?

Data Modeling is the process of creating a conceptual, logical, and physical structure for how data is stored, related, and accessed within a database system.

In SQL, data modeling defines:

 Tables (Entities)
 Columns (Attributes)
 Relationships (Foreign Keys)
 Constraints (Rules to ensure data integrity)

👉 Think of it as blueprinting your database—just like designing a building before construction.
🧭 3 Stages of Data Modeling

| Stage | Description | Output |
|---|---|---|
| 1. Conceptual | High-level view of business entities and their relationships | ER Diagram |
| 2. Logical | Tables, columns, data types, primary & foreign keys, normalization | Schema Definition |
| 3. Physical | Actual implementation in SQL with indexes, partitioning, performance tweaks | SQL DDL scripts |

🧱 Key Components of SQL Data Modeling

| Component | Description |
|---|---|
| Entity (Table) | Represents a real-world object (Customer, Order, Product) |
| Attribute (Column) | A field in a table (e.g., Name, Email, OrderDate) |
| Primary Key (PK) | Uniquely identifies each record in a table |
| Foreign Key (FK) | Links tables to establish relationships |
| Constraints | Rules to maintain data accuracy (NOT NULL, UNIQUE, CHECK, etc.) |
| Indexes | Improve query performance by allowing faster data access |

🧮 Example: E-Commerce Data Model

Entities:

 Customer: CustomerID (PK), Name, Email


 Product: ProductID (PK), ProductName, Price
 Order: OrderID (PK), OrderDate, CustomerID (FK)
 OrderItem: OrderItemID (PK), OrderID (FK), ProductID (FK), Quantity

SQL Code Sample:
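A hedged sketch of the four tables above (data types are illustrative; the Order entity is created as Orders because ORDER is a reserved word):

```sql
CREATE TABLE Customer (
    CustomerID INT PRIMARY KEY,
    Name       VARCHAR(100),
    Email      VARCHAR(100)
);

CREATE TABLE Product (
    ProductID   INT PRIMARY KEY,
    ProductName VARCHAR(100),
    Price       DECIMAL(10,2)
);

CREATE TABLE Orders (
    OrderID    INT PRIMARY KEY,
    OrderDate  DATE,
    CustomerID INT,
    FOREIGN KEY (CustomerID) REFERENCES Customer(CustomerID)
);

CREATE TABLE OrderItem (
    OrderItemID INT PRIMARY KEY,
    OrderID     INT,
    ProductID   INT,
    Quantity    INT,
    FOREIGN KEY (OrderID) REFERENCES Orders(OrderID),
    FOREIGN KEY (ProductID) REFERENCES Product(ProductID)
);
```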


📊 Data Modeling Techniques

✅ Normalization

Reduces redundancy & improves data integrity.

 1NF: Atomic values


 2NF: Full functional dependency
 3NF: No transitive dependency

✅ Indexing

Speeds up SELECT queries (especially on JOINs and WHERE conditions)
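For example, an index on the foreign key used in joins (table and column names from the model above):

```sql
CREATE INDEX idx_orders_customer ON Orders (CustomerID);
```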


✅ Relationship Design

Establishing one-to-many and many-to-many relationships using foreign keys.

🧬 Data Modeling in Data Science

| Use Case | How Data Modeling Helps |
|---|---|
| ETL Pipelines | Structured schema simplifies extraction, transformation & loading |
| Data Warehousing | Star/snowflake schemas enable fast querying & aggregation |
| Machine Learning | Provides clean, relational datasets for feature engineering & training |
| Business Intelligence | Makes dashboards (Tableau, Power BI) efficient and scalable |
| Query Optimization | Well-modeled data ensures faster, accurate analytics |

Example in Python (via pandas) after modeling in SQL:
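A minimal sketch, assuming the e-commerce tables above exist in a SQLite file (any DB-API connection or a sqlalchemy engine works the same way):

```python
import sqlite3
import pandas as pd

conn = sqlite3.connect("shop.db")  # illustrative database file
# Pull a modeled, join-ready dataset straight into a DataFrame
df = pd.read_sql("""
    SELECT o.OrderID, c.Name, o.OrderDate
    FROM Orders o
    JOIN Customer c ON c.CustomerID = o.CustomerID
""", conn)
conn.close()
```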

🧠 Best Practices

| Practice | Why it Matters |
|---|---|
| Use Descriptive Naming | Improves readability and collaboration |
| Document Relationships | Avoids ambiguity in joins and business logic |
| Avoid Over-Denormalization | Keep balance between performance and normalization |
| Design for Scalability | Anticipate growth in data volume and access patterns |
| Include Time Dimensions | Enables powerful analytics like cohort analysis, trends, and seasonality |

✅ Final Summary

| Aspect | Details |
|---|---|
| What | SQL-based structure to define how data is stored and related |
| Why | Ensures data integrity, scalability, and efficient querying |
| Key Tools | MySQL Workbench, dbdiagram.io, ERDPlus, PostgreSQL, SQL Server |
| Used in Data Science | ETL, Feature Engineering, BI Reporting, Data Cleaning, and Predictive Modeling |

🧱 1. Normalization in SQL

📌 What is Normalization?

Normalization is the process of organizing data in a relational database to:

 Minimize redundancy (duplicate data)


 Improve data integrity
 Ensure logical data grouping

Normalization helps make databases efficient, scalable, and maintainable.

🧪 Why Normalize?

| Benefit | Description |
|---|---|
| ✅ Reduces Redundancy | Avoids storing the same data in multiple places |
| ✅ Improves Consistency | Updating in one table updates all related info |
| ✅ Enhances Integrity | Data relationships and rules are strictly maintained |
| ✅ Scales Well | Easier to expand, change, or migrate |
🧬 Normal Forms (NFs)

Here’s a breakdown of the most common normal forms used in SQL databases:

🔹 1NF (First Normal Form)

 Data is stored in atomic (indivisible) units


 No repeating groups or arrays

✅ Example:

| OrderID | Product |
|---|---|
| 101 | Apple |
| 102 | Banana |

❌ Avoid:

| OrderID | Products |
|---|---|
| 101 | Apple, Banana |

🔹 2NF (Second Normal Form)

 Must be in 1NF
 No partial dependencies (non-key columns should depend on the whole
primary key)

🔍 Applies mainly to composite primary keys

🔹 3NF (Third Normal Form)

 Must be in 2NF
 No transitive dependencies (non-key attribute depends on another non-
key attribute)

Example: If City depends on ZipCode, and ZipCode depends on CustomerID, then move City and ZipCode to a new table.
✅ Normalized Database Example

❌ Bad (denormalized):
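One way to sketch it (illustrative): a single wide table that repeats customer details on every order row.

```sql
-- Everything in one table; customer info repeats per order (illustrative)
CREATE TABLE OrdersDenormalized (
    OrderID      INT,
    CustomerName VARCHAR(100),
    CustomerCity VARCHAR(100),
    Product      VARCHAR(100),
    Price        DECIMAL(10,2)
);
```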

✅ Good (normalized into 3NF):
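The same data split so each fact is stored exactly once (names are illustrative):

```sql
CREATE TABLE Customers (
    CustomerID INT PRIMARY KEY,
    Name       VARCHAR(100),
    City       VARCHAR(100)
);

CREATE TABLE Products (
    ProductID INT PRIMARY KEY,
    Name      VARCHAR(100),
    Price     DECIMAL(10,2)
);

CREATE TABLE Orders (
    OrderID    INT PRIMARY KEY,
    CustomerID INT,
    ProductID  INT,
    FOREIGN KEY (CustomerID) REFERENCES Customers(CustomerID),
    FOREIGN KEY (ProductID) REFERENCES Products(ProductID)
);
```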

🌟 2. Star Schema in SQL


📌 What is a Star Schema?

A Star Schema is a denormalized structure used in Data Warehousing and BI (Business Intelligence). It allows for fast and efficient querying, especially for aggregations.

It consists of:

 One central fact table (e.g., Sales)


 Multiple dimension tables (e.g., Date, Customer, Product)

🔎 Components

| Component | Description |
|---|---|
| Fact Table | Stores measurable metrics (sales, revenue, etc.) |
| Dimension Table | Stores descriptive attributes (product name, customer region, etc.) |
✅ Star Schema SQL Example

🔸 Dimension Tables
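Hedged sketches of the dimension tables (columns are illustrative):

```sql
CREATE TABLE DimDate (
    DateKey  INT PRIMARY KEY,
    FullDate DATE,
    Month    INT,
    Year     INT
);

CREATE TABLE DimCustomer (
    CustomerKey INT PRIMARY KEY,
    Name        VARCHAR(100),
    Region      VARCHAR(50)
);

CREATE TABLE DimProduct (
    ProductKey  INT PRIMARY KEY,
    ProductName VARCHAR(100),
    Category    VARCHAR(50)
);
```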

🔸 Fact Table
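The central fact table references each dimension (again illustrative):

```sql
CREATE TABLE FactSales (
    SalesID     INT PRIMARY KEY,
    DateKey     INT,
    CustomerKey INT,
    ProductKey  INT,
    Quantity    INT,
    Revenue     DECIMAL(12,2),
    FOREIGN KEY (DateKey)     REFERENCES DimDate(DateKey),
    FOREIGN KEY (CustomerKey) REFERENCES DimCustomer(CustomerKey),
    FOREIGN KEY (ProductKey)  REFERENCES DimProduct(ProductKey)
);
```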
🔁 Difference: Normalization vs Star Schema

| Feature | Normalized (OLTP) | Star Schema (OLAP) |
|---|---|---|
| Purpose | Transaction processing | Analytical processing |
| Structure | Highly normalized (3NF) | Denormalized |
| Joins | Many joins | Fewer joins |
| Redundancy | Minimal | Some redundancy (for performance) |
| Use Case | Banking, CRM | BI, dashboards, analytics |
| Query Speed | Slower for complex joins | Faster for aggregations |

🧠 How These Concepts Are Used in Data Science

| Task in Data Science | Relevance of Normalization & Star Schema |
|---|---|
| 🔄 ETL Pipelines | Normalize source data → Denormalize to star schema for analysis |
| 📊 Data Warehousing | Star schema is common for reporting and dimensional modeling |
| 📈 BI/Dashboarding | Star schemas enable fast slice-and-dice of metrics |
| 🧪 Feature Engineering | Normalized tables allow clean joins and efficient extraction |
| 🧮 Predictive Modeling | Data from fact/dimension tables is used for building training datasets |

✅ Summary

| Concept | Normalization | Star Schema |
|---|---|---|
| Structure | Multiple normalized tables | One fact table, several dimension tables |
| Use Case | Transaction systems (OLTP) | Analytics and BI (OLAP) |
| Joins | More complex joins | Simpler, faster joins |
| Data Redundancy | Low | Some redundancy for performance |
| Common In | Backend databases | Data marts, warehouses, analytics |
🔒 ACID Transactions in SQL: A Complete Guide

🔍 What is an ACID Transaction?

In SQL (Structured Query Language), a transaction is a unit of work that performs one or more operations (like INSERT, UPDATE, DELETE) on the database.

To ensure data integrity and reliability, every transaction must follow the ACID
properties:

🔠 ACID stands for:

 A – Atomicity
 C – Consistency
 I – Isolation
 D – Durability

Each ensures that your database remains accurate, reliable, and recoverable
even in case of failure, crash, or concurrent access.

🔷 A – Atomicity

📌 Definition:

A transaction must be all or nothing — if any part of the transaction fails, the
entire transaction is rolled back.

🔧 Example:
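A sketch of the classic transfer (table, account IDs, and amounts are illustrative):

```sql
START TRANSACTION;   -- BEGIN in some DBMSs
UPDATE accounts SET balance = balance - 100 WHERE account_id = 1;
UPDATE accounts SET balance = balance + 100 WHERE account_id = 2;
COMMIT;              -- on any failure, ROLLBACK undoes both updates
```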
If the second update fails, the first update is also undone.

🧠 Think of atomicity as an indivisible operation — like turning on a light switch: it's either on or off, never in between.

🧭 C – Consistency

📌 Definition:

A transaction must bring the database from one valid state to another. It must
follow defined rules, such as constraints, triggers, and referential integrity.

🔧 Example:
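For instance, a rule that every transaction must respect (the accounts table is illustrative; CHECK enforcement varies by DBMS version):

```sql
ALTER TABLE accounts
ADD CONSTRAINT chk_balance_non_negative CHECK (balance >= 0);

-- An UPDATE that would overdraw an account now fails,
-- and the surrounding transaction can be rolled back.
```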

Without consistency, you might insert invalid or corrupt data into the system.

🔗 I – Isolation

📌 Definition:

Transactions should operate independently. The operations of one transaction should not interfere with another, especially when running concurrently.

🔧 Example:

Two transactions updating the same account balance:
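A sketch of two concurrent sessions (T1 and T2) touching the same row:

```sql
-- Session T1
START TRANSACTION;
UPDATE accounts SET balance = balance - 50 WHERE account_id = 1;

-- Session T2 (running at the same time)
START TRANSACTION;
UPDATE accounts SET balance = balance - 30 WHERE account_id = 1;
-- T2 waits on T1's row lock; it proceeds only after T1 commits or rolls back.
```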


Depending on the isolation level (e.g., READ COMMITTED, SERIALIZABLE), SQL
manages locks to avoid race conditions, dirty reads, or phantom reads.

💾 D – Durability

📌 Definition:

Once a transaction is committed, its changes are permanent, even if the system
crashes.

 Changes are written to disk, not just held in memory.
 SQL databases like PostgreSQL, MySQL, and SQL Server use write-ahead logs (WALs) or transaction logs to ensure this.

🔧 Example:
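An illustrative commit (the payments table is invented):

```sql
START TRANSACTION;
INSERT INTO payments (payment_id, amount) VALUES (42, 99.99);
COMMIT;   -- once this returns, the row survives a crash and restart
```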

After COMMIT, even if the server crashes, the changes are still there after restart.

🔁 Real-World Analogy

Imagine transferring money between two bank accounts:

| Step | Operation | ACID Property |
|---|---|---|
| 1 | Deduct money from A | Atomicity |
| 2 | Add money to B | Atomicity |
| 3 | Check constraints (no overdraft) | Consistency |
| 4 | Avoid interference from other transfers | Isolation |
| 5 | Make it permanent if successful | Durability |

If anything fails, the transaction rolls back to the original state.

🧠 ACID vs BASE

| Feature | ACID (SQL) | BASE (NoSQL) |
|---|---|---|
| Philosophy | Strong consistency | Eventual consistency |
| Transactions | Rigid, all-or-nothing | Flexible, less strict |
| Use Case | Banking, ERP, CRM | Social media, IoT, big data |
| Examples | MySQL, PostgreSQL | Cassandra, MongoDB |

🧪 How SQL Supports Transactions

SQL provides commands to manage ACID-compliant transactions:
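The core statements (keyword spellings vary slightly by DBMS):

```sql
START TRANSACTION;            -- or BEGIN, depending on the DBMS
-- ... INSERT / UPDATE / DELETE statements ...
SAVEPOINT step1;              -- optional checkpoint
ROLLBACK TO SAVEPOINT step1;  -- undo back to the checkpoint
COMMIT;                       -- make all changes permanent
-- ROLLBACK;                  -- or undo the whole transaction instead
```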

📊 Relevance in Data Science

| Use Case | ACID Relevance |
|---|---|
| ETL Pipelines | Ensure no partial loading if a step fails |
| 📊 Data Warehousing | Guarantee integrity of bulk operations |
| 🔁 Model Training | Store and retrieve training/testing datasets reliably |
| 🧹 Data Cleaning Scripts | Rollback if cleaning step corrupts the data |
| 🔄 Batch Updates to Features | Maintain feature store consistency in ML workflows |

✅ In short, ACID ensures trust in the data you use to train, evaluate, and deploy
models.
✅ Summary Table

| Property | Purpose | What It Prevents |
|---|---|---|
| Atomicity | All-or-nothing execution | Partial updates |
| Consistency | Maintain database validity | Constraint violations |
| Isolation | Independent concurrent transactions | Dirty/phantom reads, race conditions |
| Durability | Ensure committed data survives failures | Data loss on crashes |

🧰 BONUS: Check ACID Compliance in SQL
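A hedged, MySQL-flavored check (the accounts table is illustrative; InnoDB is MySQL's transactional engine, and other DBMSs use different commands):

```sql
SELECT @@autocommit;                 -- 1 means each statement commits immediately
SHOW ENGINES;                        -- InnoDB reports "Transactions: YES"
SHOW TABLE STATUS LIKE 'accounts';   -- confirm the table's Engine is InnoDB
```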

🧱 SQL Data Types: Detailed Overview

In SQL, data types define the kind of data that can be stored in each column of a table. Choosing the correct data type:

 Ensures data accuracy


 Optimizes storage and performance
 Helps in data validation and integrity

🔢 1. Numeric Data Types


Used for storing numbers — both integers and decimals.

🔹 Exact Numeric Types (integers, fixed-point)

| Data Type | Description | Example |
|---|---|---|
| INT or INTEGER | Whole numbers | 1, 200, -12 |
| SMALLINT | Smaller integers (less storage) | 1, 100 |
| BIGINT | Very large integers | 9223372036854775807 |
| DECIMAL(p,s) or NUMERIC(p,s) | Exact precision decimals. p = total digits, s = digits after decimal | DECIMAL(5,2) → 123.45 |
| TINYINT (some DBMS) | 0 to 255 (1 byte) | 200 |
| BIT | 0 or 1 (binary) | 1 |

🔹 Approximate Numeric Types (floating-point)

| Data Type | Description | Example |
|---|---|---|
| FLOAT(p) | Approximate decimal with precision p | 3.141592 |
| REAL | Less precision than FLOAT | 3.14 |
| DOUBLE PRECISION | More precise floating-point | 3.1415926535 |

📝 2. Character/String Data Types


Used to store textual data.
| Data Type | Description | Example |
|---|---|---|
| CHAR(n) | Fixed-length string | CHAR(5) stores 'John ' (pads space) |
| VARCHAR(n) | Variable-length string, up to n | VARCHAR(50) stores 'DataScience' |
| TEXT | Long variable-length text (no length limit in many DBMS) | Paragraphs, comments |
| NCHAR, NVARCHAR | Unicode characters (multi-language support) | Unicode text like Hindi, Chinese, etc. |

📅 3. Date & Time Data Types


Used for handling dates and timestamps.

| Data Type | Description | Example |
|---|---|---|
| DATE | Stores only date | '2025-04-14' |
| TIME | Stores only time | '14:30:00' |
| DATETIME | Combines date and time | '2025-04-14 14:30:00' |
| TIMESTAMP | Similar to DATETIME but auto-updates (in some DBMS) | Used in logs |
| YEAR | Stores a year in 2 or 4 digits | 2025 |

🟢 4. Boolean Data Types


| Data Type | Description | Example |
|---|---|---|
| BOOLEAN | Stores TRUE or FALSE | TRUE, FALSE |
| BIT(1) | Used in place of BOOLEAN in some DBMS | 0, 1 |
🧱 5. Binary Data Types
Used to store binary data like images, files, and encrypted data.

| Data Type | Description | Example |
|---|---|---|
| BINARY(n) | Fixed-length binary | BINARY(8) |
| VARBINARY(n) | Variable-length binary | VARBINARY(255) |
| BLOB (Binary Large Object) | Stores large binary data | Images, PDFs |

💾 6. Special/Misc Data Types


| Data Type | Description | Use Case |
|---|---|---|
| ENUM | Restricts column to predefined values | 'Male', 'Female' |
| SET | Allows storing multiple predefined values | SET('Red','Blue','Green') |
| JSON | Stores JSON-formatted data (MySQL, PostgreSQL) | APIs, nested structures |
| XML | Stores XML data (SQL Server, Oracle) | Web data, configs |
| GEOMETRY, POINT, POLYGON | Spatial types (PostGIS) | Geo-data, maps |
| UUID | Universally Unique Identifier | 550e8400-e29b-41d4-a716-446655440000 |

📊 Choosing the Right Data Type: Best Practices

| Consideration | Best Practice |
|---|---|
| 💽 Storage | Use the smallest data type possible (e.g., SMALLINT instead of INT) |
| ⚡ Performance | Avoid overusing TEXT, BLOB for searchable columns |
| ✅ Validation | Use ENUM, BOOLEAN, or CHECK constraints |
| 📅 Timestamps | Use TIMESTAMP for auto-log events or audits |
| 🌐 International Support | Use NVARCHAR or NCHAR for multilingual data |

🧠 Relevance to Data Science


| Data Science Task | SQL Data Type Role |
|---|---|
| 📥 Data Ingestion | Mapping CSV/JSON/Excel fields to appropriate SQL types |
| 🔎 Data Cleaning | Detecting type mismatches (e.g., date stored as string) |
| 📊 Feature Engineering | Type affects aggregations (e.g., dates for time-series) |
| 📦 Storing Model Inputs | Store cleaned structured data with appropriate types |
| 🧠 ML Metadata Logging | Use JSON, TEXT, TIMESTAMP for tracking experiments |
| 📈 Analytics | Numeric and datetime types are crucial for metrics, trends |

✅ Summary Table
| Category | Common Types | Example Use |
|---|---|---|
| Numeric | INT, DECIMAL, FLOAT | Age, Salary |
| String | CHAR, VARCHAR, TEXT | Names, Emails |
| Date/Time | DATE, TIME, DATETIME, TIMESTAMP | Logs, Events |
| Boolean | BOOLEAN, BIT | Status Flags |
| Binary | BLOB, VARBINARY | Images, Files |
| Special | ENUM, JSON, UUID | Categorical, Structured Data |
📘 What is DDL (Data Definition Language)?
DDL stands for Data Definition Language, which is a subset of SQL
commands used to define and modify the structure of database
objects such as:

 Tables
 Schemas
 Views
 Indexes
 Constraints

These statements don’t manipulate actual data but change the schema of
a database.

🔧 Key DDL Commands

1. CREATE

Used to create a new database object, like a table or view.


🔹 Syntax:
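General form:

```sql
CREATE TABLE table_name (
    column1 datatype constraints,
    column2 datatype constraints
);
```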

🔹 Example:
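A sketch matching the description below (column names and types are illustrative):

```sql
CREATE TABLE Employees (
    emp_id    INT PRIMARY KEY,
    name      VARCHAR(100) NOT NULL,
    age       INT,
    hire_date DATE
);
```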

✅ This creates a new Employees table with columns for ID, name, age, and
hire date.

2. DROP

Used to completely remove an object from the database (table, view, etc.).
🔹 Syntax:
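```sql
DROP TABLE table_name;
```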

🔹 Example:
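```sql
DROP TABLE Employees;
```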

⚠️ This deletes the Employees table and all data inside it permanently.

3. TRUNCATE

Used to delete all records from a table quickly without logging individual
row deletions. The table structure remains.

🔹 Syntax:
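```sql
TRUNCATE TABLE table_name;
```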
🔹 Example:
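```sql
TRUNCATE TABLE Employees;
```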

⚠️ All rows are deleted, but the table itself still exists.

4. ALTER

Used to modify the structure of an existing table — add, modify, rename, or delete columns.

🔹 Add a Column:
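A sketch on the Employees table above:

```sql
ALTER TABLE Employees ADD email VARCHAR(100);
```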

🔹 Modify a Column:
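MySQL syntax shown; SQL Server uses ALTER COLUMN instead:

```sql
ALTER TABLE Employees MODIFY COLUMN age SMALLINT;
```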
🔹 Drop a Column:
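```sql
ALTER TABLE Employees DROP COLUMN email;
```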

🔹 Rename a Table (DBMS-specific):
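Two common variants:

```sql
ALTER TABLE Employees RENAME TO Staff;   -- MySQL / PostgreSQL / SQLite
-- EXEC sp_rename 'Employees', 'Staff'; -- SQL Server
```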

💡 Summary Table

| Command | Purpose | Can It Be Undone? |
|---|---|---|
| CREATE | Create a table/view/index/schema | ❌ No (manual drop needed) |
| DROP | Permanently delete table or object | ❌ No |
| TRUNCATE | Delete all rows, keep structure | ❌ No |
| ALTER | Modify table structure | ⚠️ Depends on DBMS |

🔍 DDL vs DML (Quick Difference)


| Feature | DDL | DML |
|---|---|---|
| Full Form | Data Definition Language | Data Manipulation Language |
| Focus | Table/schema structure | Data inside tables |
| Examples | CREATE, ALTER, DROP | SELECT, INSERT, UPDATE, DELETE |
| Affects Data? | ❌ Structure only | ✅ Data itself |

🧠 DDL in Data Science: Why It Matters
In data science and analytics projects, DDL plays a foundational role in
creating and managing the environments where data is stored and
processed.

| Use Case | Role of DDL |
|---|---|
| 📥 Data Ingestion Pipelines | Create tables to store raw, processed, and clean data |
| Schema Design | Define tables optimized for analytics (fact/dim models) |
| 🧪 Experiment Tracking | Create tables to log model runs, hyperparameters |
| 🧹 Cleaning Workflows | Use TRUNCATE to clear staging/temporary tables |
| 🔁 ETL/ELT Pipelines | Alter schemas dynamically during data transformation |
| 📊 Reporting & Dashboards | Use views and structured tables for BI tools |

✅ Recap

| Command | Description | Typical Use in Data Science Projects |
|---|---|---|
| CREATE | Create structured storage for data | Raw, processed, or analytic tables |
| DROP | Remove outdated or test structures | Clean up schemas and projects |
| TRUNCATE | Quickly wipe tables before refresh | Reset intermediate or staging tables |
| ALTER | Evolve table structure with pipeline | Add new features/columns dynamically |

📂 Data Manipulation Language (DML) in SQL
DML (Data Manipulation Language) refers to the subset of SQL used to
manipulate and manage data stored in database tables.

Unlike DDL, which deals with the structure (schemas, tables), DML focuses on the
data itself — inserting, updating, deleting, and querying records.

Core DML Commands

🔍 1. SELECT – Retrieve Data

Used to fetch data from a table.

🔹 Syntax:
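```sql
SELECT column1, column2
FROM table_name
WHERE condition;
```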
🔹 Example:
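A sketch matching the description below:

```sql
SELECT name, age
FROM Employees
WHERE age > 30;
```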

✅ Fetches names and ages of employees older than 30.

🔹 Advanced Usage:

 SELECT *: Select all columns


 ORDER BY, GROUP BY, LIMIT
 Joins across multiple tables

🧩 2. INSERT – Add New Data

Used to insert new rows into a table.

🔹 Syntax:
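```sql
INSERT INTO table_name (column1, column2)
VALUES (value1, value2);
```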

🔹 Example:
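An illustrative row:

```sql
INSERT INTO Employees (name, age) VALUES ('Alice', 29);
```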
✅ Adds a new employee record.

🔹 Bulk Insert:
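Several rows in one statement (names are illustrative):

```sql
INSERT INTO Employees (name, age) VALUES
    ('Bob', 35),
    ('Charlie', 42),
    ('Diana', 28);
```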

✏️3. UPDATE – Modify Existing Data

Used to change values of one or more rows.

🔹 Syntax:
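```sql
UPDATE table_name
SET column1 = value1
WHERE condition;
```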

🔹 Example:
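```sql
UPDATE Employees
SET age = 36
WHERE name = 'Bob';
```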

✅ Updates Bob’s age to 36.


4. DELETE – Remove Data

Used to delete rows that meet a condition.

🔹 Syntax:
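```sql
DELETE FROM table_name
WHERE condition;
```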

🔹 Example:
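```sql
DELETE FROM Employees
WHERE name = 'Charlie';
```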

✅ Deletes Charlie's record from the table.

⚠️ Without a WHERE clause, it will delete all records!

🔄 Summary of DML Commands

| Command | Purpose | Affects Structure? | Affects Data? |
|---|---|---|---|
| SELECT | Retrieve data | ❌ No | ✅ Yes |
| INSERT | Add new records | ❌ No | ✅ Yes |
| UPDATE | Modify existing records | ❌ No | ✅ Yes |
| DELETE | Remove existing records | ❌ No | ✅ Yes |

🧠 Why DML is Crucial in Data Science
In Data Science, we don’t just analyze data — we must also:

 Ingest it into databases


 Clean, transform, and filter it
 Store results of models or processing
 Query for specific patterns, trends, and summaries

| Data Science Task | DML Usage |
|---|---|
| 📥 Ingest cleaned data | INSERT |
| 🧹 Remove dirty records | DELETE with conditions |
| 🔄 Feature transformation | UPDATE columns with calculated values |
| 📊 Querying for analysis | SELECT with filters, joins, aggregates |
| 📈 Reporting/Dashboarding | Frequent SELECT queries |
| 🧠 Model prediction storage | INSERT or UPDATE model outputs |
| 🔁 Pipeline testing/resetting | DELETE, INSERT, UPDATE as needed |

🧪 Example Use in a Data Science Workflow

Let’s say you're working on a customer churn model (a combined SQL sketch follows these steps):

1. 🧹 Clean incoming raw data


2. 🧠 Add predicted churn score

3. 📈 Generate insights for dashboard

4. 💾 Insert model results into an audit table
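A hedged sketch of those four steps (table and column names are invented for illustration):

```sql
-- 1. Clean incoming raw data
DELETE FROM raw_customers WHERE email IS NULL;

-- 2. Add predicted churn score (column assumed to exist)
UPDATE customers
SET churn_score = 0.87
WHERE customer_id = 1001;

-- 3. Generate insights for dashboard
SELECT region, AVG(churn_score) AS avg_churn
FROM customers
GROUP BY region;

-- 4. Insert model results into an audit table
INSERT INTO churn_audit (customer_id, churn_score, scored_at)
VALUES (1001, 0.87, CURRENT_TIMESTAMP);
```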

✅ Recap

| Command | Core Function | Data Science Use Case |
|---|---|---|
| SELECT | Fetch data | Analysis, visualization, reporting |
| INSERT | Add new data | Store model outputs, clean/staged records |
| UPDATE | Modify data | Add engineered features, update scores |
| DELETE | Remove unwanted data | Filter bad data, reset data for pipelines |
🔐 Data Control Language (DCL) in SQL – Full Overview
📌 What is DCL?
Data Control Language (DCL) is a subset of SQL used to control access and
permissions to data and database objects. It plays a critical role in database
security, ensuring that only authorized users can read, modify, or manipulate
data.

DCL is mainly used by database administrators (DBAs) or system-level users.

Key DCL Commands


✅ 1. GRANT – Provide Privileges

The GRANT statement gives specific rights to users or roles for performing actions
on database objects such as tables, views, sequences, or procedures.

🔹 Syntax:
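General form (the bracketed clause is optional):

```sql
GRANT privilege_list
ON object_name
TO user_or_role
[WITH GRANT OPTION];
```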
 privilege_list: Permissions like SELECT, INSERT, UPDATE, DELETE, EXECUTE,
etc.
 object_name: Table, view, or other object to which the privileges apply.
 TO: The user(s) or role(s) receiving the permission.
 WITH GRANT OPTION: Allows the recipient to pass on the same privileges to
others.

🔹 Example:
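```sql
GRANT SELECT, INSERT
ON Employees
TO analyst_user
WITH GRANT OPTION;
```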

This allows analyst_user to read and insert into the Employees table, and also to
grant the same rights to other users.

🚫 2. REVOKE – Remove Privileges

The REVOKE command is used to take back privileges that were previously
granted. It's essential for removing access when users change roles, leave a team,
or when permissions need to be tightened.

🔹 Syntax:
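```sql
REVOKE privilege_list
ON object_name
FROM user_or_role;
```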
🔹 Example:
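```sql
REVOKE INSERT ON Employees FROM analyst_user;
```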

This removes the ability to insert data into the Employees table from analyst_user.

🧠 Importance of DCL in Real-World Use

| Use Case | How DCL Helps |
|---|---|
| ✅ Secure data access | Only authorized users can view/modify data |
| ✅ Regulatory compliance | Restrict access to sensitive information |
| ✅ Manage roles across teams | Assign different access to data scientists, analysts, interns, etc. |
| ✅ Audit and accountability | Control and track who can do what |
| ✅ Temporary permissions | Grant access temporarily for specific tasks |
| ✅ Minimize risk of data breaches | Least privilege access via GRANT & REVOKE |

💡 DCL in the Context of Data Science

While data scientists may not frequently write DCL statements themselves, understanding them is important when:

 Collaborating with DBAs to request access to datasets stored in production


databases.
 Designing secure pipelines where data needs to be accessed or written
under controlled permissions.
 Working in shared environments with role-based access control (RBAC).
 Deploying models or ETL scripts that write back predictions or
transformations to databases (you might need INSERT or UPDATE
permissions).

Examples:

 A data scientist needs access to customer data → DBA uses GRANT SELECT
ON customers TO data_scientist;
 Once the project ends, access is revoked → REVOKE SELECT ON customers
FROM data_scientist;

🔄 Summary of DCL Commands

| Command | Function | Purpose | Can Be Passed Down? |
|---|---|---|---|
| GRANT | Assign privileges | Allow user to query/modify database | ✅ Yes (with WITH GRANT OPTION) |
| REVOKE | Remove privileges | Take back access to database objects | ✅ Yes |

✅ Best Practices
 🔐 Always follow least privilege principle – only give the access users
actually need.
 📋 Document all GRANT and REVOKE activities for audit trails.
 👥 Prefer role-based access (TO role_name) over individual user access.
 🧪 Regularly review user privileges in production systems.
🔁 Transaction Control Language (TCL) in SQL
📌 What is TCL?
Transaction Control Language (TCL) consists of SQL commands used to manage
transactions in a database. A transaction is a logical unit of work that contains
one or more SQL statements. TCL commands help ensure data integrity,
consistency, and recovery in case of errors.

🔄 What is a Transaction?
A transaction in SQL is a sequence of operations performed as a single logical
unit of work. These operations must be:

 Atomic (all-or-nothing)
 Consistent (maintains data integrity)
 Isolated (no interference between transactions)
 Durable (results persist after commit)

These are the ACID properties of transactions (which you've already studied).

Key TCL Commands

✅ 1. COMMIT

Purpose:
Permanently saves all changes made in the current transaction to the database.
🔹 Syntax:
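```sql
COMMIT;
```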

🔹 Use Case:

When you're done with a successful set of operations and want to save them
permanently.

🔹 Example:
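An illustrative transfer (table and IDs are invented):

```sql
START TRANSACTION;
UPDATE accounts SET balance = balance - 500 WHERE account_id = 'A';
UPDATE accounts SET balance = balance + 500 WHERE account_id = 'B';
COMMIT;
```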

✅ Both operations are saved as a successful transaction.

🔁 2. ROLLBACK

Purpose:
Reverses all changes made in the current transaction. It undoes operations since
the last COMMIT or SAVEPOINT.
🔹 Syntax:
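```sql
ROLLBACK;
```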

🔹 Use Case:

When an error occurs or data conditions aren't met, and you want to undo the
changes.

🔹 Example:
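An illustrative inventory update that gets undone:

```sql
START TRANSACTION;
UPDATE inventory SET quantity = quantity - 10 WHERE product_id = 7;
-- a validation check fails (e.g., quantity would go negative)
ROLLBACK;
```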

❌ Changes are undone; inventory stays the same.

🧷 3. SAVEPOINT

Purpose:
Sets a checkpoint within a transaction. You can roll back to a specific savepoint
rather than the beginning of the transaction.
🔹 Syntax:
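```sql
SAVEPOINT savepoint_name;
-- ...
ROLLBACK TO SAVEPOINT savepoint_name;
```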

🔹 Example:
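A sketch matching the outcome below (products table is illustrative):

```sql
START TRANSACTION;
UPDATE products SET price = 99.99 WHERE product_id = 7;
SAVEPOINT after_price;
UPDATE products SET quantity = quantity - 5 WHERE product_id = 7;
ROLLBACK TO SAVEPOINT after_price;  -- undoes only the quantity update
COMMIT;                             -- the price change is saved
```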

✅ Only the quantity update is undone; price change is saved.

🔒 When Do You Use TCL?

| Situation | Use TCL Command |
|---|---|
| Successful task | COMMIT |
| Error in multi-step transaction | ROLLBACK |
| Want to partially undo steps | SAVEPOINT + ROLLBACK |
💡 How TCL Works in Real Databases
 Autocommit Mode (e.g., MySQL by default): Every statement is auto-
committed unless inside a transaction block.
 Manual Commit Mode (e.g., PostgreSQL, Oracle): You explicitly use
BEGIN, COMMIT, and ROLLBACK.

You can turn off autocommit in MySQL:
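```sql
SET autocommit = 0;
```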

🔍 TCL in Data Science Workflows


Though data scientists often focus on querying, analyzing, or visualizing data, TCL
commands are useful when:

| Scenario in Data Science Workflow | TCL Usage |
|---|---|
| Data preprocessing script modifies raw data | Use BEGIN, validate logic, then COMMIT |
| ETL pipeline loads intermediate tables | ROLLBACK on failure ensures consistency |
| Testing model predictions by inserting them | Roll back test inserts after evaluation |
| Batch inserts into a feature store | Wrap in transaction to avoid partial failure |

✅ Summary Table

| Command | Description | When to Use |
|---|---|---|
| COMMIT | Permanently saves transaction changes | After successful operation |
| ROLLBACK | Undoes changes made during the transaction | On error or validation failure |
| SAVEPOINT | Marks a point to roll back to | For partial undo within a transaction |

🚀 Bonus: Python Integration Example (with sqlite3)
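A minimal sketch; the database file and table names (features, audit_log) are invented for illustration:

```python
import sqlite3

conn = sqlite3.connect("pipeline.db")  # illustrative file name
try:
    cur = conn.cursor()
    # Both statements succeed or fail together (hypothetical tables)
    cur.execute("UPDATE features SET score = score * 1.1 WHERE segment = 'A'")
    cur.execute("INSERT INTO audit_log (event) VALUES ('rescored segment A')")
    conn.commit()      # COMMIT: both changes persist together
except Exception:
    conn.rollback()    # ROLLBACK: neither change takes effect
    raise
finally:
    conn.close()
```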

✅ This is how a transaction can be managed in a Data Science pipeline.


🔒 SQL Constraints – Full Detailed Analysis
📌 What Are SQL Constraints?
SQL Constraints are rules applied to columns in a database table to enforce
data integrity, consistency, and validity. They restrict the type of data that can
be stored in a column to prevent invalid or incorrect entries.

🧩 Types of SQL Constraints

1. PRIMARY KEY

 Definition: Uniquely identifies each record in a table.


 Properties:
o Must be unique
o Cannot be NULL
o One per table
 Often used as the main identifier of a row.

🔹 Syntax:
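```sql
CREATE TABLE table_name (
    column_name datatype PRIMARY KEY
);
```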
🔹 Example:
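A sketch matching the note below:

```sql
CREATE TABLE Employees (
    emp_id INT PRIMARY KEY,
    name   VARCHAR(100)
);

INSERT INTO Employees VALUES (101, 'Asha');
-- A second INSERT with emp_id = 101 (or a NULL emp_id) is rejected.
```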

emp_id = 101 appears only once in the table — no duplicates or NULLs allowed.

📊 Data Science Relevance:

 Used in data modeling to uniquely identify each row (e.g., customer ID,
transaction ID).
 Essential when joining tables using foreign keys.

2. FOREIGN KEY

 Definition: A key in one table that refers to the PRIMARY KEY in another
table.
 Purpose: Maintains referential integrity between two related tables.

🔹 Syntax:
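```sql
FOREIGN KEY (column_name) REFERENCES parent_table (parent_column)
```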

🔹 Example:
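A sketch referencing the Employees table above:

```sql
CREATE TABLE Orders (
    order_id INT PRIMARY KEY,
    emp_id   INT,
    FOREIGN KEY (emp_id) REFERENCES Employees(emp_id)
);
```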

You can only insert emp_id in Orders that already exists in Employees.
📊 Data Science Relevance:

 Maintains clean and related data across tables (e.g., linking customers to
orders).
 Helps normalize data and reduce redundancy.

3. UNIQUE

 Definition: Ensures all values in a column are distinct.


 Difference from PRIMARY KEY: Allows one NULL, and multiple UNIQUE
constraints per table.

🔹 Syntax:
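```sql
column_name datatype UNIQUE
```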

🔹 Example:
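```sql
CREATE TABLE Users (
    user_id INT PRIMARY KEY,
    email   VARCHAR(100) UNIQUE
);
```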

Each email must be different; prevents duplicate user records.

📊 Data Science Relevance:

 Helps ensure data uniqueness in identifiers like usernames, emails, SSNs,


etc.
 Useful in data cleaning steps to detect duplicates.
4. NOT NULL

 Definition: Prevents NULL values from being inserted into a column.


 Ensures that a column always has a value.

🔹 Syntax:
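```sql
column_name datatype NOT NULL
```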

🔹 Example:
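```sql
CREATE TABLE Products (
    product_id INT PRIMARY KEY,
    name       VARCHAR(100) NOT NULL
);
```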

You can’t insert a product without a name.

📊 Data Science Relevance:

 Ensures completeness of critical fields.


 Simplifies analysis and modeling since nulls often require special handling.

5. CHECK

 Definition: Ensures that the value in a column satisfies a condition.


 Works like a validation rule.
🔹 Syntax:
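```sql
column_name datatype CHECK (condition)
```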

🔹 Example:
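A sketch on the Employees table (CHECK enforcement requires MySQL 8.0.16+ or another enforcing DBMS):

```sql
ALTER TABLE Employees
ADD CONSTRAINT chk_age CHECK (age >= 18);
```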

Only employees 18 or older can be inserted.

📊 Data Science Relevance:

 Validates the quality of incoming data, reducing the need for extra checks
later.
 Prevents invalid or extreme values (e.g., age < 0, salary < 0).

6. DEFAULT

 Definition: Assigns a default value to a column when no value is provided.

🔹 Syntax:
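```sql
column_name datatype DEFAULT default_value
```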
🔹 Example:
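```sql
CREATE TABLE Customers (
    customer_id INT PRIMARY KEY,
    country     VARCHAR(50) DEFAULT 'USA'
);
```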

If no country is given, it defaults to 'USA'.

📊 Data Science Relevance:

 Avoids NULLs during ingestion.


 Helpful in data imputation and missing value handling during analysis.

🧠 Summary Table

| Constraint | Ensures | Allows NULL? | Allows Duplicates? | Notes |
|---|---|---|---|---|
| PRIMARY KEY | Uniquely identifies rows | ❌ | ❌ | Only one per table |
| FOREIGN KEY | Links tables (referential integrity) | ✅ | ✅ | Points to PRIMARY KEY in another table (usually) |
| UNIQUE | Distinct values | ✅ (one NULL) | ❌ | Can apply to multiple columns |
| NOT NULL | Value must be present | ❌ | ✅ | Used on important fields |
| CHECK | Value meets condition | ✅ | Depends | Use for validations |
| DEFAULT | Value when none is provided | ✅ | ✅ | Reduces NULLs |

📌 Constraints in Data Science and ETL Pipelines

| Data Science Workflow | SQL Constraint Used |
|---|---|
| Enforcing unique customer IDs | PRIMARY KEY or UNIQUE |
| Making sure no missing target variable | NOT NULL |
| Auto-setting default status on new rows | DEFAULT |
| Ensuring valid data ranges (e.g., age > 0) | CHECK |
| Maintaining table relationships | FOREIGN KEY |

✅ Best Practices
 Use PRIMARY KEY and FOREIGN KEY to build normalized and relational
schemas.
 Combine NOT NULL and CHECK to ensure valid and complete data.
 Use DEFAULT to avoid NULLs where practical.
 Apply UNIQUE to maintain data integrity in identifying fields.

Here’s a complete, detailed, and clear analysis of Operators in SQL, covering:

✅ Arithmetic Operators
✅ Comparison Operators
✅ Logical Operators
✅ Bitwise Operators
✅ Compound Operators

And their usage in SQL queries + relevance in Data Science workflows.

🧠 What Are Operators in SQL?
SQL operators are special symbols or keywords used to perform operations on
data within queries. They help manipulate values, filter data, and build complex
conditions.
➕ 1. Arithmetic Operators
Used to perform mathematical operations on numeric values.

| Operator | Description | Example | Result |
|---|---|---|---|
| + | Addition | 10 + 5 | 15 |
| - | Subtraction | 10 - 5 | 5 |
| * | Multiplication | 10 * 5 | 50 |
| / | Division | 10 / 5 | 2 |
| % | Modulus (remainder) | 10 % 3 | 1 |

📌 Usage:
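An illustrative query (the sales table and its columns are invented):

```sql
SELECT product_name,
       price * quantity AS revenue,
       price * 0.9      AS discounted_price
FROM sales;
```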

📊 Relevance to Data Science:

 Useful in feature engineering, data transformation, discount calculations, and unit conversions.

🔁 2. Comparison (Relational) Operators
Used to compare values. They return TRUE, FALSE, or NULL.

| Operator | Description | Example |
|---|---|---|
| = | Equal to | salary = 50000 |
| != or <> | Not equal to | age != 30 |
| > | Greater than | marks > 50 |
| < | Less than | price < 100 |
| >= | Greater than or equal | age >= 18 |
| <= | Less than or equal | score <= 40 |
| BETWEEN | Between a range | age BETWEEN 20 AND 30 |
| IN | Value in a list | country IN ('USA', 'UK') |
| LIKE | Pattern match (strings) | name LIKE 'A%' |
| IS NULL | Checks for NULL | email IS NULL |

📌 Usage:
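An illustrative filter combining several comparison operators:

```sql
SELECT name, salary
FROM employees
WHERE salary >= 50000
  AND country IN ('USA', 'UK')
  AND name LIKE 'A%';
```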

📊 Relevance to Data Science:

 Crucial in filtering datasets, condition-based queries, and exploratory analysis.

🔣 3. Logical Operators
Used to combine multiple conditions in a WHERE clause.

| Operator | Description | Example |
|---|---|---|
| AND | True if both conditions true | age > 18 AND status = 'active' |
| OR | True if either condition true | city = 'NY' OR city = 'LA' |
| NOT | Reverses condition | NOT status = 'inactive' |
📌 Usage:
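An illustrative multi-condition filter:

```sql
SELECT *
FROM customers
WHERE (city = 'NY' OR city = 'LA')
  AND NOT status = 'inactive';
```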

📊 Relevance to Data Science:

 Used in data subsetting, multi-condition filtering, and rule-based logic.

🧮 4. Bitwise Operators (rare in SQL but still exist in some databases)
Bitwise operators manipulate bits directly. Mostly used in low-level systems or
metadata flags.

| Operator | Description | Example | Result |
|---|---|---|---|
| & | Bitwise AND | 5 & 3 | 1 |
| \| | Bitwise OR | 5 \| 3 | 7 |
| ^ | Bitwise XOR | 5 ^ 3 | 6 |
| ~ | Bitwise NOT | ~5 | -6 |
| << | Left shift | 5 << 1 | 10 |
| >> | Right shift | 5 >> 1 | 2 |

Bitwise operators are supported in SQL Server and PostgreSQL, but not
commonly used in data analytics tasks.

📊 Relevance to Data Science:

 Rarely used but sometimes useful in flags, metadata manipulation, or advanced ID encoding.
🔗 5. Compound (Assignment) Operators

Used to perform an operation and assign the value in one step.

| Operator | Description | SQL Support |
|---|---|---|
| += | Add and assign | SQL Server |
| -= | Subtract and assign | SQL Server |
| *= | Multiply and assign | SQL Server |
| /= | Divide and assign | SQL Server |

Not standard in all SQL dialects (e.g., not in MySQL or PostgreSQL).

📌 Example (SQL Server):
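An illustrative T-SQL bulk update (table and column names are invented):

```sql
-- SQL Server: give every employee in department 10 a 500 raise
UPDATE employees
SET salary += 500
WHERE department_id = 10;
```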

📊 Relevance to Data Science:

 Speeds up bulk updates in feature tables or batch processing scenarios.

🧪 Practical Data Science Use Case Example

Imagine you’re analyzing sales data:
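A sketch combining the operator types listed below (table and column names are illustrative):

```sql
SELECT product_name,
       price * quantity AS revenue
FROM sales
WHERE price > 500
  AND order_date BETWEEN '2024-01-01' AND '2024-12-31';
```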
✔️Uses:

 Arithmetic (price * quantity)


 Comparison (price > 500)
 Logical (AND)
 Filtering (BETWEEN)

🧠 Final Thoughts: Why Are SQL Operators Crucial in Data Science?
| Role | Operator Usage Example |
|---|---|
| Feature Engineering | Compute age, revenue, etc. |
| Conditional Filtering | WHERE age > 25 AND income < 50000 |
| Segmentation and Bucketing | CASE WHEN... THEN... |
| Query Optimization | Using correct logical conditions |
| Exploratory Data Analysis | Filtering, grouping, ranking |
🧾 CLAUSES in SQL (Detailed Analysis)
Clauses in SQL are building blocks of queries that help us filter, group, sort,
and refine results from databases. The ones covered here are used primarily in
SELECT statements.

🔍 1. WHERE Clause
✅ Purpose:

Filters records before any grouping or aggregation happens.

✅ Syntax:
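```sql
SELECT columns
FROM table_name
WHERE condition;
```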

✅ Example:
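An illustrative filter (customers table is invented):

```sql
SELECT name, total_spend
FROM customers
WHERE total_spend > 1000;
```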
✅ Notes:

 You can use comparison, logical, and pattern-matching operators.


 Does not work with aggregate functions (like COUNT, SUM, etc.).

📊 Relevance in Data Science:

Used heavily for:

 Filtering large datasets


 Preparing subsets of data
 Querying specific criteria (e.g., customers with high spend)

📦 2. GROUP BY Clause
✅ Purpose:

Groups rows that have the same values in specified columns into summary rows,
often used with aggregation functions (COUNT(), SUM(), AVG()).

✅ Syntax:
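```sql
SELECT column, AGG_FUNC(column2)
FROM table_name
GROUP BY column;
```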
✅ Example:
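An illustrative aggregation:

```sql
SELECT department, AVG(salary) AS avg_salary
FROM employees
GROUP BY department;
```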

✅ Notes:

 Groups must match selected columns unless inside an aggregate.


 Comes after WHERE and before HAVING.

📊 Relevance in Data Science:

Used for:

 Generating summary statistics


 Performing feature aggregation
 Data transformation and pivoting

🧮 3. HAVING Clause
✅ Purpose:

Filters groups after aggregation has taken place.


✅ Syntax:
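```sql
SELECT column, AGG_FUNC(column2)
FROM table_name
GROUP BY column
HAVING AGG_FUNC(column2) condition;
```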

✅ Notes:

 WHERE filters before grouping; HAVING filters after.


 Only used with aggregates or grouped data.

✅ Example:
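An illustrative post-aggregation filter (order_items table is invented):

```sql
SELECT product_id, COUNT(*) AS times_sold
FROM order_items
GROUP BY product_id
HAVING COUNT(*) > 100;
```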

📊 Relevance in Data Science:

 Filter aggregated data (e.g., products sold more than X times)


 Detect outliers based on group-level stats
 Analyze behavior of grouped entities (e.g., user cohorts)
🔃 4. ORDER BY Clause
✅ Purpose:

Sorts the result set in ascending (default) or descending order.

✅ Syntax:
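```sql
SELECT columns
FROM table_name
ORDER BY column1 [ASC | DESC];
```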

✅ Example:
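```sql
SELECT name, revenue
FROM sales
ORDER BY revenue DESC;
```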

✅ Notes:

 Can sort by column position (e.g., ORDER BY 2 DESC)


 Useful for ranked lists, top-N queries

📊 Relevance in Data Science:

 Useful in ranking, sorting most/least frequent items


 Helpful for generating top 10, leaderboards, or distribution checks
🧩 Real-world Query Example (All Clauses Together)
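A sketch matching the walkthrough below (the employees table is illustrative):

```sql
SELECT department, AVG(salary) AS avg_salary
FROM employees
WHERE status = 'active'
GROUP BY department
HAVING AVG(salary) > 60000
ORDER BY avg_salary DESC;
```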

What this does:

1. Filters only active employees → WHERE


2. Groups them by department → GROUP BY
3. Aggregates average salary per department
4. Filters only departments where avg_salary > 60,000 → HAVING
5. Sorts the result in descending order of salary → ORDER BY

📊 Relevance of These Clauses in Data Science:
| Clause | Use in Data Science |
|---|---|
| WHERE | Data cleaning, subset selection |
| GROUP BY | Aggregation for summaries, transformations, feature creation |
| HAVING | Post-aggregation filtering, analysis on grouped data |
| ORDER BY | Ranking, top/bottom N analysis, sorting for reports & dashboards |
These clauses are foundational for any SQL-based data preprocessing, feature
engineering, exploratory data analysis, or report building in Data Science
pipelines.
