Module_5
Let’s focus on MySQL (one of the most widely used) and SQLite (the simplest to start with).
🔧 1. Download MySQL
Visit: https://dev.mysql.com/downloads/
🔧 2. Install MySQL
🔧 3. Verify Installation
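For the SQLite option mentioned above, a quick check is possible without installing anything, because Python's standard library bundles the engine. This is only a minimal sketch of that check; it does not verify a MySQL install, which is usually confirmed from a terminal with `mysql --version`.

```python
import sqlite3

# Open a throwaway in-memory database; no server process is involved.
conn = sqlite3.connect(":memory:")

# sqlite_version() is a built-in SQLite function that reports the engine version.
version = conn.execute("SELECT sqlite_version()").fetchone()[0]
print("SQLite engine version:", version)

conn.close()
```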
Data Modeling is the process of defining and organizing data elements, their
relationships, and structure to represent how data will be stored, accessed, and
updated in a database system.
🧠 Think of data modeling as creating a blueprint for your data, just like how an
architect creates a blueprint for a building.
Component | Description
Entities | Things or objects (e.g., Customer, Product, Order)
Attributes | Properties of entities (e.g., CustomerName, OrderDate)
Relationships | How entities are linked (e.g., one-to-many, many-to-many)
Primary Key | Unique identifier for an entity
Foreign Key | Connects rows across different tables
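The components in this table can be sketched as a tiny one-to-many model. The table names, column names, and rows below are hypothetical, and SQLite (through Python's sqlite3 module) stands in for whichever database you actually use.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
-- Entity: customer, with attributes and a primary key (illustrative names)
CREATE TABLE customer (
    customer_id   INTEGER PRIMARY KEY,
    customer_name TEXT NOT NULL
);
-- Entity: orders, linked to customer through a foreign key (one-to-many)
CREATE TABLE orders (
    order_id    INTEGER PRIMARY KEY,
    order_date  TEXT NOT NULL,
    customer_id INTEGER NOT NULL REFERENCES customer(customer_id)
);
INSERT INTO customer VALUES (1, 'Asha'), (2, 'Ben');
INSERT INTO orders VALUES (10, '2025-04-14', 1), (11, '2025-04-15', 1), (12, '2025-04-15', 2);
""")

# Follow the relationship: count the orders that belong to each customer.
orders_per_customer = conn.execute("""
    SELECT c.customer_name, COUNT(o.order_id)
    FROM customer AS c
    JOIN orders AS o ON o.customer_id = c.customer_id
    GROUP BY c.customer_id
    ORDER BY c.customer_id
""").fetchall()
print(orders_per_customer)
conn.close()
```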
🧭 Types of Data Models
Type | Description
Conceptual Model | High-level view of data, focuses on business entities and relationships
Logical Model | Adds structure: tables, columns, data types, keys, constraints
Physical Model | Maps to actual database implementation (SQL, NoSQL), with indexes, storage
🧾 Entities:
🧩 Relationships:
Even in pandas, you can mimic a data model by using DataFrames as tables and shared key
columns as the relationships between them.
Practice | Benefit
🔍 Define clear relationships | Better joins, queries, and aggregations
🧼 Normalize your data | Reduce redundancy and improve consistency
📅 Include time dimension | Enables time-based analysis (e.g., cohort, seasonality)
🔐 Use appropriate keys | Maintain referential integrity
📏 Design for scalability | Avoid over-nesting or overly complex joins in large datasets
✅ Summary
Tables (Entities)
Columns (Attributes)
Relationships (Foreign Keys)
Constraints (Rules to ensure data integrity)
Component | Description
Entity (Table) | Represents a real-world object (Customer, Order, Product)
Attribute (Column) | A field in a table (e.g., Name, Email, OrderDate)
Primary Key (PK) | Uniquely identifies each record in a table
Foreign Key (FK) | Links tables to establish relationships
Constraints | Rules to maintain data accuracy (NOT NULL, UNIQUE, CHECK, etc.)
Indexes | Improve query performance by allowing faster data access
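To illustrate the Indexes row, the sketch below creates an index and asks SQLite how it plans to run a lookup. The table, index name, and generated rows are assumptions for illustration; other engines expose query plans through their own EXPLAIN variants.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE orders (order_id INTEGER PRIMARY KEY, customer_id INTEGER, amount REAL)"
)
# Hypothetical data: 1000 orders spread over 100 customers.
conn.executemany(
    "INSERT INTO orders (customer_id, amount) VALUES (?, ?)",
    [(i % 100, i * 1.5) for i in range(1000)],
)

# Index the column that lookups filter on.
conn.execute("CREATE INDEX idx_orders_customer ON orders (customer_id)")

# EXPLAIN QUERY PLAN describes the access path; the last field holds the description.
plan = conn.execute(
    "EXPLAIN QUERY PLAN SELECT * FROM orders WHERE customer_id = 42"
).fetchall()
plan_text = " ".join(row[-1] for row in plan)
print(plan_text)
conn.close()
```

If the plan text names `idx_orders_customer`, the engine is searching the index instead of scanning every row.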
Entities:
✅ Normalization
✅ Indexing
🧠 Best Practices
✅ Final Summary
Aspect | Details
What | SQL-based structure to define how data is stored and related
Why | Ensures data integrity, scalability, and efficient querying
Key Tools | MySQL Workbench, dbdiagram.io, ERDPlus, PostgreSQL, SQL Server
Used in Data Science | ETL, Feature Engineering, BI Reporting, Data Cleaning, and Predictive Modeling
🧱 1. Normalization in SQL
📌 What is Normalization?
🧪 Why Normalize?
Benefit | Description
✅ Reduces Redundancy | Avoids storing the same data in multiple places
✅ Improves Consistency | Updating in one table updates all related info
✅ Enhances Integrity | Data relationships and rules are strictly maintained
✅ Scales Well | Easier to expand, change, or migrate
🧬 Normal Forms (NFs)
Here’s a breakdown of the most common normal forms used in SQL databases:
1NF (First Normal Form): every column holds a single, atomic value.
✅ Example:
OrderID Product
101 Apple
102 Banana
❌ Avoid:
OrderID Products
101 Apple, Banana
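The "Avoid" layout can be converted to one value per row with a few lines of code. This is a rough sketch in plain Python; the function name and the extra "Cherry" row are made up for illustration.

```python
# Rows as they appear in the "Avoid" layout: several products packed into one cell.
denormalized = [(101, "Apple, Banana"), (102, "Cherry")]

def to_first_normal_form(rows):
    """Emit one (order_id, product) row per product so every cell is atomic."""
    normalized = []
    for order_id, products in rows:
        for product in products.split(","):
            normalized.append((order_id, product.strip()))
    return normalized

rows_1nf = to_first_normal_form(denormalized)
print(rows_1nf)
```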
2NF (Second Normal Form):
Must be in 1NF
No partial dependencies (non-key columns should depend on the whole
primary key)
3NF (Third Normal Form):
Must be in 2NF
No transitive dependencies (where a non-key attribute depends on another
non-key attribute)
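As a small illustration of removing a transitive dependency, suppose an orders table also stored the customer's city, which depends on the customer and not on the order. The sketch below, with hypothetical tables and values, moves the city into its own table so that it is updated in exactly one place.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
-- city depends only on customer_id, so it lives with the customer
CREATE TABLE customers (customer_id INTEGER PRIMARY KEY, city TEXT NOT NULL);
CREATE TABLE orders (
    order_id    INTEGER PRIMARY KEY,
    customer_id INTEGER NOT NULL REFERENCES customers(customer_id)
);
INSERT INTO customers VALUES (1, 'Pune');
INSERT INTO orders VALUES (101, 1), (102, 1);
""")

# One UPDATE fixes the city for every order of that customer.
conn.execute("UPDATE customers SET city = 'Mumbai' WHERE customer_id = 1")

cities = conn.execute("""
    SELECT o.order_id, c.city
    FROM orders AS o
    JOIN customers AS c ON c.customer_id = o.customer_id
    ORDER BY o.order_id
""").fetchall()
print(cities)
conn.close()
```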
❌ Bad (denormalized):
It consists of:
🔎 Components
Component | Description
Fact Table | Stores measurable metrics (sales, revenue, etc.)
Dimension Table | Stores descriptive attributes (product name, customer region, etc.)
✅ Star Schema SQL Example
🔸 Dimension Tables
🔸 Fact Table
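A minimal star schema might look like the sketch below: two dimension tables around one fact table, followed by a typical aggregate query. All table names, columns, and figures are illustrative assumptions, run here in SQLite through Python.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
-- Dimension tables: descriptive attributes
CREATE TABLE dim_product  (product_id INTEGER PRIMARY KEY, product_name TEXT, category TEXT);
CREATE TABLE dim_customer (customer_id INTEGER PRIMARY KEY, region TEXT);

-- Fact table: measurable metrics plus a foreign key to each dimension
CREATE TABLE fact_sales (
    sale_id     INTEGER PRIMARY KEY,
    product_id  INTEGER REFERENCES dim_product(product_id),
    customer_id INTEGER REFERENCES dim_customer(customer_id),
    quantity    INTEGER,
    revenue     REAL
);

INSERT INTO dim_product  VALUES (1, 'Apple', 'Fruit'), (2, 'Banana', 'Fruit');
INSERT INTO dim_customer VALUES (1, 'North'), (2, 'South');
INSERT INTO fact_sales   VALUES (1, 1, 1, 10, 25.0), (2, 2, 1, 5, 10.0), (3, 1, 2, 4, 10.0);
""")

# Typical star-schema query: aggregate the facts, sliced by a dimension attribute.
revenue_by_region = conn.execute("""
    SELECT d.region, SUM(f.revenue)
    FROM fact_sales AS f
    JOIN dim_customer AS d ON d.customer_id = f.customer_id
    GROUP BY d.region
    ORDER BY d.region
""").fetchall()
print(revenue_by_region)
conn.close()
```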
🔁 Difference: Normalization vs Star Schema
✅ Summary
To ensure data integrity and reliability, every transaction must follow the ACID
properties:
A – Atomicity
C – Consistency
I – Isolation
D – Durability
Each ensures that your database remains accurate, reliable, and recoverable
even in case of failure, crash, or concurrent access.
🔷 A – Atomicity
📌 Definition:
A transaction must be all or nothing — if any part of the transaction fails, the
entire transaction is rolled back.
🔧 Example:
If the second update fails, the first update is also undone.
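The all-or-nothing behaviour can be sketched with a transfer whose second update breaks a rule. The accounts table, names, and amounts are hypothetical; the CHECK constraint is simply a convenient way to force a failure in this demo.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE accounts (name TEXT PRIMARY KEY, "
    "balance INTEGER NOT NULL CHECK (balance >= 0))"
)
conn.execute("INSERT INTO accounts VALUES ('alice', 100), ('bob', 50)")
conn.commit()

try:
    # "with conn" commits if the block succeeds and rolls back if it raises.
    with conn:
        conn.execute("UPDATE accounts SET balance = balance + 200 WHERE name = 'bob'")
        # This debit would leave alice at -100, which the CHECK constraint rejects.
        conn.execute("UPDATE accounts SET balance = balance - 200 WHERE name = 'alice'")
except sqlite3.IntegrityError as exc:
    print("Transfer rolled back:", exc)

# The credit to bob was undone together with the failed debit.
balances = dict(conn.execute("SELECT name, balance FROM accounts"))
print(balances)
conn.close()
```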
🧭 C – Consistency
📌 Definition:
A transaction must bring the database from one valid state to another. It must
follow defined rules, such as constraints, triggers, and referential integrity.
🔧 Example:
Without consistency, you might insert invalid or corrupt data into the system.
🔗 I – Isolation
📌 Definition:
🔧 Example:
💾 D – Durability
📌 Definition:
Once a transaction is committed, its changes are permanent, even if the system
crashes.
🔧 Example:
After COMMIT, even if the server crashes, the changes are still there after restart.
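A rough way to see this is to commit one change, leave another uncommitted, then close and reopen the database file. Closing a connection is only a stand-in for a restart (real crash safety comes from the engine's journaling), and the file name and table below are assumptions.

```python
import os
import sqlite3
import tempfile

# Use a real file so the data can outlive the connection.
db_path = os.path.join(tempfile.mkdtemp(), "durability_demo.db")

conn = sqlite3.connect(db_path)
conn.execute("CREATE TABLE events (id INTEGER PRIMARY KEY, note TEXT)")
conn.execute("INSERT INTO events (note) VALUES ('committed')")
conn.commit()  # from here on, this row must survive

conn.execute("INSERT INTO events (note) VALUES ('never committed')")
conn.close()   # closing without COMMIT discards the pending insert

# "Restart": open the same file with a fresh connection.
conn = sqlite3.connect(db_path)
notes = [row[0] for row in conn.execute("SELECT note FROM events")]
print(notes)
conn.close()
```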
🔁 Real-World Analogy
🧠 ACID vs BASE
✅ In short, ACID ensures trust in the data you use to train, evaluate, and deploy
models.
✅ Summary Table
Data Type | Description | Example
DATE | Stores only date | '2025-04-14'
TIME | Stores only time | '14:30:00'
DATETIME | Combines date and time | '2025-04-14 14:30:00'
TIMESTAMP | Similar to DATETIME but auto-updates (in some DBMS) | Used in logs
YEAR | Stores a year in 2 or 4 digits | 2025
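The date, time, and year parts in this table can be pulled out of a single timestamp. One caveat for this sketch: SQLite has no dedicated DATE or DATETIME storage types and keeps such values as text, whereas MySQL stores them natively, so the built-in functions below are SQLite's way of working with them.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
stamp = "2025-04-14 14:30:00"  # same sample value as the DATETIME row

# date(), time(), and strftime() are built-in SQLite date functions.
date_part, time_part, year_part = conn.execute(
    "SELECT date(?), time(?), strftime('%Y', ?)", (stamp, stamp, stamp)
).fetchone()
print(date_part, time_part, year_part)
conn.close()
```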
✅ Summary Table
Category | Common Types | Example Use
Numeric | INT, DECIMAL, FLOAT | Age, Salary
String | CHAR, VARCHAR, TEXT | Names, Emails
Date/Time | DATE, TIME, DATETIME, TIMESTAMP | Logs, Events
Boolean | BOOLEAN, BIT | Status Flags
Binary | BLOB, VARBINARY | Images, Files
Special | ENUM, JSON, UUID | Categorical, Structured Data
📘 What is DDL (Data Definition Language)?
DDL stands for Data Definition Language, which is a subset of SQL
commands used to define and modify the structure of database
objects such as:
Tables
Schemas
Views
Indexes
Constraints
These statements don’t manipulate actual data but change the schema of
a database.
1. CREATE
🔹 Example:
✅ This creates a new Employees table with columns for ID, name, age, and
hire date.
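A table like the one described could be created as follows. The exact column names and types are assumptions, since the notes only say it has an ID, name, age, and hire date; the statement is run in SQLite through Python for convenience.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE Employees (
        emp_id    INTEGER PRIMARY KEY,   -- ID
        name      VARCHAR(100) NOT NULL, -- name
        age       INTEGER,               -- age
        hire_date DATE                   -- hire date
    )
""")

# PRAGMA table_info lists one row per column; index 1 is the column name.
columns = [row[1] for row in conn.execute("PRAGMA table_info(Employees)")]
print(columns)
conn.close()
```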
2. DROP
Used to completely remove an object from the database (table, view, etc.).
🔹 Syntax:
🔹 Example:
⚠️ This deletes the Employees table and all data inside it permanently.
3. TRUNCATE
Used to delete all records from a table quickly without logging individual
row deletions. The table structure remains.
🔹 Syntax:
🔹 Example:
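The difference between emptying a table and dropping it can be sketched as below. SQLite has no TRUNCATE statement, so this demo uses `DELETE FROM` without a WHERE clause as the closest stand-in; in MySQL you would write `TRUNCATE TABLE staging;`. The staging table is hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE staging (id INTEGER PRIMARY KEY, value TEXT)")
conn.executemany("INSERT INTO staging (value) VALUES (?)", [("a",), ("b",), ("c",)])

def table_exists(name):
    """Check SQLite's catalog for a table with this name."""
    query = "SELECT COUNT(*) FROM sqlite_master WHERE type = 'table' AND name = ?"
    return conn.execute(query, (name,)).fetchone()[0] == 1

# TRUNCATE-like step: every row goes, the structure stays.
conn.execute("DELETE FROM staging")
rows_left = conn.execute("SELECT COUNT(*) FROM staging").fetchone()[0]
exists_after_delete = table_exists("staging")

# DROP removes the table itself.
conn.execute("DROP TABLE staging")
exists_after_drop = table_exists("staging")

print(rows_left, exists_after_delete, exists_after_drop)
conn.close()
```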
⚠️ All rows are deleted, but the table itself still exists.
4. ALTER
🔹 Add a Column:
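Adding a column might look like the sketch below, where existing rows pick up the new column's default. The department column and its default are made up for illustration. Syntax for modifying or dropping columns varies more between systems (MySQL, for example, uses `MODIFY COLUMN`), so only the widely supported ADD COLUMN form is shown.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE Employees (emp_id INTEGER PRIMARY KEY, name TEXT)")
conn.execute("INSERT INTO Employees VALUES (101, 'Asha')")

# Evolve the structure without touching the rows already stored.
conn.execute("ALTER TABLE Employees ADD COLUMN department TEXT DEFAULT 'Unassigned'")

row = conn.execute("SELECT emp_id, name, department FROM Employees").fetchone()
print(row)
conn.close()
```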
🔹 Modify a Column:
🔹 Drop a Column:
💡 Summary Table
Command | Purpose | Can It Be Undone?
CREATE | Create a table/view/index/schema | ❌ No (manual drop needed)
DROP | Permanently delete table or object | ❌ No
TRUNCATE | Delete all rows, keep structure | ❌ No
ALTER | Modify table structure | ⚠️ Depends on DBMS
✅ Recap
Command | Description | Typical Use in Data Science Projects
CREATE | Create structured storage for data | Raw, processed, or analytic tables
DROP | Remove outdated or test structures | Clean up schemas and projects
TRUNCATE | Quickly wipe tables before refresh | Reset intermediate or staging tables
ALTER | Evolve table structure with pipeline | Add new features/columns dynamically
📂 Data Manipulation Language (DML) in SQL
DML (Data Manipulation Language) refers to the subset of SQL used to
manipulate and manage data stored in database tables.
Unlike DDL, which deals with the structure (schemas, tables), DML focuses on the
data itself — inserting, updating, deleting, and querying records.
🔹 Syntax:
🔹 Example:
🔹 Advanced Usage:
🔹 Syntax:
🔹 Example:
✅ Adds a new employee record.
🔹 Bulk Insert:
🔹 Syntax:
🔹 Example:
🔹 Syntax:
🔹 Example:
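The four DML statements can be seen working together in one short session. This is a sketch with a hypothetical Employees table and made-up rows, run in SQLite through Python; the `?` placeholders are sqlite3's parameter style.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE Employees (emp_id INTEGER PRIMARY KEY, name TEXT, salary INTEGER)")

# INSERT: one row, then a bulk insert of several rows.
conn.execute("INSERT INTO Employees VALUES (?, ?, ?)", (101, "Asha", 50000))
conn.executemany(
    "INSERT INTO Employees VALUES (?, ?, ?)",
    [(102, "Ben", 42000), (103, "Chen", 61000)],
)

# UPDATE: modify only the rows matched by WHERE.
conn.execute("UPDATE Employees SET salary = salary + 2000 WHERE emp_id = ?", (102,))

# DELETE: remove only the rows matched by WHERE.
conn.execute("DELETE FROM Employees WHERE emp_id = ?", (103,))

# SELECT: read back what is left.
remaining = conn.execute(
    "SELECT emp_id, name, salary FROM Employees ORDER BY emp_id"
).fetchall()
print(remaining)
conn.close()
```

Leaving out the WHERE clause on UPDATE or DELETE would change or remove every row in the table.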
✅ Recap
Command | Core Function | Data Science Use Case
SELECT | Fetch data | Analysis, visualization, reporting
INSERT | Add new data | Store model outputs, clean/staged records
UPDATE | Modify data | Add engineered features, update scores
DELETE | Remove unwanted data | Filter bad data, reset data for pipelines
🔐 Data Control Language (DCL) in SQL – Full Overview
📌 What is DCL?
Data Control Language (DCL) is a subset of SQL used to control access and
permissions to data and database objects. It plays a critical role in database
security, ensuring that only authorized users can read, modify, or manipulate
data.
The GRANT statement gives specific rights to users or roles for performing actions
on database objects such as tables, views, sequences, or procedures.
🔹 Syntax:
privilege_list: Permissions like SELECT, INSERT, UPDATE, DELETE, EXECUTE,
etc.
object_name: Table, view, or other object to which the privileges apply.
TO: The user(s) or role(s) receiving the permission.
WITH GRANT OPTION: Allows the recipient to pass on the same privileges to
others.
🔹 Example:
This allows analyst_user to read and insert into the Employees table, and also to
grant the same rights to other users.
The REVOKE command is used to take back privileges that were previously
granted. It's essential for removing access when users change roles, leave a team,
or when permissions need to be tightened.
🔹 Syntax:
🔹 Example:
This removes the ability to insert data into the Employees table from analyst_user.
Examples:
A data scientist needs access to customer data → DBA uses GRANT SELECT
ON customers TO data_scientist;
Once the project ends, access is revoked → REVOKE SELECT ON customers
FROM data_scientist;
✅ Best Practices
🔐 Always follow the principle of least privilege: give users only the access
they actually need.
📋 Document all GRANT and REVOKE activities for audit trails.
👥 Prefer role-based access (TO role_name) over individual user access.
🧪 Regularly review user privileges in production systems.
🔁 Transaction Control Language (TCL) in SQL
📌 What is TCL?
Transaction Control Language (TCL) consists of SQL commands used to manage
transactions in a database. A transaction is a logical unit of work that contains
one or more SQL statements. TCL commands help ensure data integrity,
consistency, and recovery in case of errors.
🔄 What is a Transaction?
A transaction in SQL is a sequence of operations performed as a single logical
unit of work. These operations must be:
Atomic (all-or-nothing)
Consistent (maintains data integrity)
Isolated (no interference between transactions)
Durable (results persist after commit)
These are the ACID properties of transactions (which you've already studied).
✅ 1. COMMIT
Purpose:
Permanently saves all changes made in the current transaction to the database.
🔹 Syntax:
🔹 Use Case:
When you're done with a successful set of operations and want to save them
permanently.
🔹 Example:
🔁 2. ROLLBACK
Purpose:
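Reverses the changes made in the current transaction. A plain ROLLBACK undoes everything
since the transaction began (that is, since the last COMMIT); ROLLBACK TO a savepoint
undoes only the work done after that savepoint.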
🔹 Syntax:
🔹 Use Case:
When an error occurs or data conditions aren't met, and you want to undo the
changes.
🔹 Example:
🧷 3. SAVEPOINT
Purpose:
Sets a checkpoint within a transaction. You can roll back to a specific savepoint
rather than the beginning of the transaction.
🔹 Syntax:
🔹 Example:
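COMMIT, ROLLBACK, and SAVEPOINT can be combined in one transaction, as in the sketch below. The orders table and savepoint name are hypothetical. Passing `isolation_level=None` tells Python's sqlite3 module not to manage transactions itself, so the BEGIN and COMMIT statements are issued explicitly.

```python
import sqlite3

# isolation_level=None: the driver sends exactly the statements we write.
conn = sqlite3.connect(":memory:", isolation_level=None)
conn.execute("CREATE TABLE orders (order_id INTEGER PRIMARY KEY, status TEXT)")

conn.execute("BEGIN")
conn.execute("INSERT INTO orders VALUES (1, 'paid')")

# Mark a point we can return to without abandoning the whole transaction.
conn.execute("SAVEPOINT before_second_order")
conn.execute("INSERT INTO orders VALUES (2, 'pending')")

# Undo only the work done after the savepoint, then keep the rest.
conn.execute("ROLLBACK TO SAVEPOINT before_second_order")
conn.execute("COMMIT")

saved = conn.execute("SELECT order_id, status FROM orders").fetchall()
print(saved)
conn.close()
```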
✅ Summary Table
Command | Description | When to Use
COMMIT | Permanently saves transaction changes | After successful operation
ROLLBACK | Undoes changes made during the transaction | On error or validation failure
SAVEPOINT | Marks a point to roll back to | For partial undo within a transaction
1. PRIMARY KEY
🔹 Syntax:
🔹 Example:
emp_id = 101 appears only once in the table — no duplicates or NULLs allowed.
Used in data modeling to uniquely identify each row (e.g., customer ID,
transaction ID).
Essential when joining tables using foreign keys.
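The uniqueness rule can be checked by trying to insert the same emp_id twice. This is a sketch with a hypothetical Employees table; the error class shown is the one raised by Python's sqlite3 module, and other drivers report the violation differently.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE Employees (emp_id INTEGER PRIMARY KEY, name TEXT NOT NULL)")
conn.execute("INSERT INTO Employees VALUES (101, 'Asha')")

try:
    # A second row with emp_id = 101 violates the PRIMARY KEY constraint.
    conn.execute("INSERT INTO Employees VALUES (101, 'Ben')")
    duplicate_rejected = False
except sqlite3.IntegrityError as exc:
    duplicate_rejected = True
    print("Rejected:", exc)

row_count = conn.execute("SELECT COUNT(*) FROM Employees").fetchone()[0]
conn.close()
```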
2. FOREIGN KEY
Definition: A key in one table that refers to the PRIMARY KEY in another
table.
Purpose: Maintains referential integrity between two related tables.
🔹 Syntax:
🔹 Example:
You can only insert emp_id in Orders that already exists in Employees.
📊 Data Science Relevance:
Maintains clean and related data across tables (e.g., linking customers to
orders).
Helps normalize data and reduce redundancy.
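The Orders and Employees relationship described above can be sketched as follows, with hypothetical columns and rows. One SQLite-specific detail: foreign key enforcement is off by default there and has to be enabled per connection with a PRAGMA, whereas MySQL's InnoDB enforces it automatically.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")  # SQLite only enforces FKs when asked to

conn.execute("CREATE TABLE Employees (emp_id INTEGER PRIMARY KEY, name TEXT)")
conn.execute("""
    CREATE TABLE Orders (
        order_id INTEGER PRIMARY KEY,
        emp_id   INTEGER,
        FOREIGN KEY (emp_id) REFERENCES Employees(emp_id)
    )
""")
conn.execute("INSERT INTO Employees VALUES (101, 'Asha')")
conn.execute("INSERT INTO Orders VALUES (1, 101)")  # accepted: employee 101 exists

try:
    conn.execute("INSERT INTO Orders VALUES (2, 999)")  # no employee 999
    orphan_rejected = False
except sqlite3.IntegrityError as exc:
    orphan_rejected = True
    print("Rejected:", exc)

order_count = conn.execute("SELECT COUNT(*) FROM Orders").fetchone()[0]
conn.close()
```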
3. UNIQUE
🔹 Syntax:
🔹 Example:
4. NOT NULL
🔹 Syntax:
🔹 Example:
5. CHECK
🔹 Example:
Validates the quality of incoming data, reducing the need for extra checks
later.
Prevents invalid or extreme values (e.g., age < 0, salary < 0).
6. DEFAULT
🔹 Syntax:
🔹 Example:
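The column-level constraints from sections 3 to 6 (UNIQUE, NOT NULL, CHECK, DEFAULT) can be tried together on one table. Everything below, including the column names, the default value, and the sample rows, is an illustrative assumption.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE Employees (
        emp_id INTEGER PRIMARY KEY,
        email  TEXT UNIQUE,                -- no two rows may share an email
        name   TEXT NOT NULL,              -- a value is required
        age    INTEGER CHECK (age >= 0),   -- negative ages are invalid
        status TEXT DEFAULT 'active'       -- used when no value is supplied
    )
""")

insert_sql = "INSERT INTO Employees (emp_id, email, name, age) VALUES (?, ?, ?, ?)"
conn.execute(insert_sql, (1, "a@example.com", "Asha", 30))

# DEFAULT filled in the column that the INSERT left out.
default_status = conn.execute("SELECT status FROM Employees WHERE emp_id = 1").fetchone()[0]

# Each of these rows breaks exactly one constraint.
bad_rows = {
    "UNIQUE": (2, "a@example.com", "Ben", 25),
    "NOT NULL": (3, "c@example.com", None, 41),
    "CHECK": (4, "d@example.com", "Dee", -5),
}
rejected = []
for label, row in bad_rows.items():
    try:
        conn.execute(insert_sql, row)
    except sqlite3.IntegrityError:
        rejected.append(label)

print(default_status, rejected)
conn.close()
```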
🧠 Summary Table
Constraint | Ensures | Allows NULL? | Allows Duplicates? | Notes
PRIMARY KEY | Uniquely identifies rows | ❌ | ❌ | Only one per table
FOREIGN KEY | Links tables (referential integrity) | ✅ (usually) | ✅ | Points to PRIMARY KEY in another table
UNIQUE | Distinct values | ✅ (one NULL in SQL Server; multiple in MySQL and PostgreSQL) | ❌ | Can apply to multiple columns
NOT NULL | Value must be present | ❌ | ✅ | Used on important fields
CHECK | Value meets condition | Depends | ✅ | Use for validations
DEFAULT | Value when none is provided | ✅ | ✅ | Reduces NULLs
✅ Best Practices
Use PRIMARY KEY and FOREIGN KEY to build normalized and relational
schemas.
Combine NOT NULL and CHECK to ensure valid and complete data.
Use DEFAULT to avoid NULLs where practical.
Apply UNIQUE to maintain data integrity in identifying fields.
✅ Arithmetic Operators
✅ Comparison Operators
✅ Logical Operators
✅ Bitwise Operators
✅ Compound Operators
📌 Usage:
🔁 2. Comparison (Relational) Operators
Used to compare values. They return TRUE, FALSE, or NULL.
📌 Usage:
🔣 3. Logical Operators
Used to combine multiple conditions in a WHERE clause.
Operator | Description | Example
AND | True if both conditions are true | age > 18 AND status = 'active'
OR | True if either condition is true | city = 'NY' OR city = 'LA'
NOT | Reverses condition | NOT status = 'inactive'
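The three example conditions from the table can be run against a few rows to see which ones match. The users table, its rows, and the small helper function are made up for this sketch.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE users (name TEXT, age INTEGER, city TEXT, status TEXT)")
conn.executemany("INSERT INTO users VALUES (?, ?, ?, ?)", [
    ("Asha", 25, "NY", "active"),
    ("Ben", 17, "LA", "active"),
    ("Chen", 40, "SF", "inactive"),
])

def names(where_clause):
    """Return the names matching a fixed, hard-coded WHERE condition."""
    sql = "SELECT name FROM users WHERE " + where_clause + " ORDER BY name"
    return [row[0] for row in conn.execute(sql)]

adults_active = names("age > 18 AND status = 'active'")  # both must hold
ny_or_la = names("city = 'NY' OR city = 'LA'")           # either may hold
not_inactive = names("NOT status = 'inactive'")          # condition reversed

print(adults_active, ny_or_la, not_inactive)
conn.close()
```

The helper concatenates SQL only because the conditions are constants in this demo; values that come from users should go through `?` placeholders.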
📌 Usage:
Bitwise operators are supported in most major systems, including MySQL, SQL Server, and
PostgreSQL, but are not commonly used in data analytics tasks.
🔍 1. WHERE Clause
✅ Purpose:
✅ Syntax:
✅ Example:
✅ Notes:
📦 2. GROUP BY Clause
✅ Purpose:
Groups rows that have the same values in specified columns into summary rows,
often used with aggregation functions (COUNT(), SUM(), AVG()).
✅ Syntax:
✅ Example:
✅ Notes:
Used for:
🧮 3. HAVING Clause
✅ Purpose:
✅ Notes:
✅ Example:
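WHERE, GROUP BY, and HAVING are often used in the same query: WHERE filters individual rows first, GROUP BY forms the groups, and HAVING then filters the groups. The sketch below uses a hypothetical sales table with made-up amounts.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE sales (region TEXT, product TEXT, amount INTEGER)")
conn.executemany("INSERT INTO sales VALUES (?, ?, ?)", [
    ("North", "Apple", 120),
    ("North", "Banana", 80),
    ("South", "Apple", 50),
    ("South", "Banana", 30),
    ("East", "Apple", 0),
])

big_regions = conn.execute("""
    SELECT region, SUM(amount) AS total
    FROM sales
    WHERE amount > 0            -- row-level filter, applied before grouping
    GROUP BY region             -- one summary row per region
    HAVING SUM(amount) > 100    -- group-level filter, applied after aggregation
    ORDER BY region
""").fetchall()
print(big_regions)
conn.close()
```

South totals 80 and is removed by HAVING, while East never reaches the grouping step because WHERE drops its only row.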
✅ Syntax:
✅ Example:
✅ Notes: