Course: MSc DS
SQL Programming
Module: 1
Preface
In the dynamic realm of Data Science, proficiency in database
management and SQL programming forms the cornerstone of
insightful analysis and informed decision-making. As we stand
at the threshold of a data-driven era, the "SQL Programming -
Master of Science in Data Science" course has been meticulously
designed to equip learners with the essential knowledge and
skills required to navigate the complexities of database
architecture and manipulation.
In this course, we embark on a structured journey that will
transform novices into adept professionals, capable of leveraging
the power of databases to extract, manipulate, and analyse data
with finesse. We initiate this voyage with an exploration of the
foundational concepts of databases, progressively moving
towards the nuanced intricacies of SQL and PL/SQL. Our carefully
curated syllabus encapsulates a rich blend of theoretical
knowledge and hands-on practice, aimed at facilitating a deep,
holistic understanding of the subject matter.
Learning Objectives:
1. Differentiate Database Types
2. Master RDBMS Concepts
3. Develop SQL Proficiency
4. Implement Data Manipulation Techniques
5. Manage Database Structures and Permissions
Structure:
1.1 Understanding Databases and SQL
1.2 Types of Databases
1.3 Introduction to RDBMS
1.4 Basics of SQL
1.5 Summary
1.6 Keywords
1.7 Self-Assessment Questions
1.8 Case Study
1.9 References
1.1 Understanding Databases and SQL
A database is a structured collection of data that can be easily
accessed, managed, and updated. Databases can vary widely in
terms of complexity and purpose, from simple spreadsheets that
store personal data to massive systems that support internet-
scale applications and transactions.
1.1.1 Key characteristics of a database:
● Structured Storage: Data in a database is typically organised
in tables, with rows and columns, allowing for efficient data
retrieval.
● Data Integrity: Databases have mechanisms like constraints
and triggers to ensure data remains accurate and consistent.
● Concurrency Control: Allows multiple users or applications
to access and modify the data simultaneously, without
conflicts.
● Data Security: Includes features like access controls,
encryption, and backups to safeguard data.
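The characteristics above can be sketched in a few lines. The following is a minimal, illustrative example using Python's built-in sqlite3 module to run real SQL; the table and column names (employees, name, salary) are hypothetical, not from the course text.

```python
import sqlite3

# Structured storage: data organised as a table of rows and columns.
conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE employees (
        id     INTEGER PRIMARY KEY,
        name   TEXT NOT NULL,              -- integrity: a name must be present
        salary REAL CHECK (salary >= 0)    -- integrity: no negative salaries
    )
""")
conn.execute("INSERT INTO employees (name, salary) VALUES ('Asha', 50000)")

# Data integrity in action: a CHECK constraint rejects inconsistent
# data at the database level, before it is ever stored.
try:
    conn.execute("INSERT INTO employees (name, salary) VALUES ('Ravi', -1)")
    violated = False
except sqlite3.IntegrityError:
    violated = True
```

Because the constraint lives in the database rather than in application code, every user and application that touches the table is held to the same rules.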
1.1.2 Historical Perspective: The Evolution of Data Storage
The storage and retrieval of data have a storied history that
stretches back before the invention of computers. Over time, as
technology progressed and the needs of organisations changed,
data storage evolved in tandem.
● Pre-electronic Era: Physical ledgers, scrolls, and manuscripts.
● Punched Cards: Used in early computing to store and
retrieve data.
● Magnetic Tapes: Sequential data storage, with improved
capacity over punched cards.
● Relational Databases: Introduced by E.F. Codd in the 1970s,
they organised data into tables with relationships. SQL
(Structured Query Language) was developed to interact with
these systems.
● NoSQL Databases: Emerged to cater to the scalability and
flexibility needs of modern applications, supporting
unstructured data and various data models.
● In-memory Databases: For high-performance applications,
where data is stored in RAM for rapid access.
1.1.3 The Critical Role of Databases in Modern Data Science
In the era of Big Data, databases are no longer just passive
repositories but active contributors to the data science pipeline.
● Data Staging: Databases serve as initial staging areas, where
raw data is ingested, cleaned, and preprocessed for analysis.
● Integration Point: They facilitate the integration of disparate
data sources, providing a unified view of data.
● Scalability: Modern databases can handle petabytes of data,
ensuring that data scientists have the resources to manage
and analyse large datasets.
● Real-time Analysis: With real-time databases, data science
applications can now provide instant insights and analytics.
1.1.4 Facilitating Efficient Data Management
Efficient data management is pivotal to ensure the reliability,
availability, and performance of data-driven applications.
● Data Normalisation: Avoids data redundancy and maintains
data integrity.
● Transaction Management: Ensures that a series of
operations succeed together or fail together, preserving
data consistency.
● Indexing: Speeds up data retrieval operations.
● Backup and Recovery: Safeguards against data loss and
provides mechanisms to restore data in case of failures.
1.1.5 Supporting Advanced Analytics and Machine Learning
The synergistic relationship between databases and advanced
analytics is driving the next wave of technological innovations.
● Data Warehousing: Specialised databases designed for
analytical processing and business intelligence.
● Data Lakes: Store vast amounts of raw data, structured and
unstructured, for complex analytical processes.
● Integrated ML Modules: Some modern databases have
built-in support for machine learning, allowing for model
training directly within the database.
● Graph Databases: Support complex network-based analyses,
beneficial in fields like social network analysis and
bioinformatics.
1.2 Types of Databases
1.2.1 Exploring Relational Databases
Relational databases have been a cornerstone in the world of data
management for many decades. They model data as a set of
related tables, each comprising rows and columns.
● Key Concepts:
o Tables: A table is a structured set of data made up of
rows and columns. It represents a specific entity type,
such as 'Customers' or 'Orders'.
o Rows: A row (often referred to as a record or tuple)
represents a single, implicitly structured data item in a
table.
o Relationships: In relational databases, tables can be
related to one another, allowing for efficient data
retrieval and ensuring data integrity. These
relationships are based on primary and foreign keys.
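The primary-key/foreign-key relationship described above can be sketched concretely. This example uses SQLite via Python's sqlite3 module; the 'customers' and 'orders' tables are illustrative names, and note that SQLite enforces foreign keys only when the pragma is switched on.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")  # SQLite-specific: enable FK enforcement

# 'customers' has a primary key; 'orders' references it with a foreign key.
conn.execute("CREATE TABLE customers (id INTEGER PRIMARY KEY, name TEXT)")
conn.execute("""
    CREATE TABLE orders (
        id          INTEGER PRIMARY KEY,
        customer_id INTEGER REFERENCES customers(id),  -- foreign key
        amount      REAL
    )
""")
conn.execute("INSERT INTO customers VALUES (1, 'Meera')")
conn.execute("INSERT INTO orders VALUES (100, 1, 250.0)")

# The relationship lets us join the two tables back together on the keys.
row = conn.execute("""
    SELECT c.name, o.amount
    FROM customers c JOIN orders o ON o.customer_id = c.id
""").fetchone()
```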
Popular Relational Database Management Systems (RDBMS):
● Oracle: A widely used commercial RDBMS solution.
● MySQL: An open-source RDBMS that's widely adopted for
web applications.
● Microsoft SQL Server: A commercial solution from Microsoft,
popular in enterprise environments.
● PostgreSQL: An advanced open-source RDBMS that supports
both SQL and procedural languages.
1.2.2 Diving into Non-Relational Databases
As data needs have evolved, there's been a surge in non-relational
databases, often labelled NoSQL. These databases do not rely on
the traditional table-based relational model.
Categories of NoSQL:
o Document: They store data in documents, typically
JSON-like. Example: MongoDB.
o Graph: Focused on storing relationships. Nodes
represent entities, and edges represent the
relationships. Example: Neo4j.
o Key-Value: As the name suggests, they store data as
key-value pairs. Examples: Redis, DynamoDB.
o Column-store: Designed for storing data tables as
sections of columns rather than rows. Example:
Cassandra.
1.2.3 The Rise of Distributed Databases: An Overview
With the advent of big data and globalised applications,
distributed databases have gained immense traction. These
databases span across multiple machines or even across wide
geographical regions, ensuring high availability, fault tolerance,
and scalability.
Characteristics:
● Scalability: Can grow in size and workload by simply
adding more machines to the network.
● Fault Tolerance: Even if one node fails, the system
continues to operate.
● Consistency: Despite being distributed, these databases
strive to ensure that all nodes reflect the same data.
Comparing Relational and Non-Relational Databases
● Strengths and Weaknesses:
Relational Databases:
o Strengths: Data integrity, mature, standardised
query language (SQL), ACID transactions.
o Weaknesses: Can face scalability issues, rigid
schema, might be overkill for simple use cases.
Non-Relational Databases:
o Strengths: Highly scalable, flexible schemas,
often faster writes.
o Weaknesses: Might lack full ACID transactions,
less mature than RDBMS, diverse ecosystem can
make selection challenging.
1.2.4 Decision Framework: Choosing the Right Database for Your
Project
Selecting the appropriate database type hinges on the specific
requirements of a project. Some guiding principles include:
● Data Structure: If data is relational and requires strong
integrity, an RDBMS might be preferable. For hierarchical,
graph-based, or unstructured data, NoSQL could be more
apt.
● Scale: For applications expecting massive scale and growth,
NoSQL or distributed databases can offer more flexibility.
● Query Complexity: If the project requires complex queries,
an RDBMS with SQL can be beneficial.
● Consistency Requirements: For projects that require high
levels of data consistency, relational databases might be the
best fit.
1.3 Introduction to RDBMS
Relational Database Management Systems (RDBMS) underpin a
significant portion of the data infrastructure across industries
today. From transaction processing systems to data warehouses,
RDBMS plays a pivotal role in facilitating structured data storage,
retrieval, and management.
1.3.1 The Architecture and Principles of RDBMS
● Logical View and Physical View: At the core, an RDBMS can
be seen through two lenses: the logical view, which defines
the schema, tables, relationships, and more, and the
physical view, which pertains to data storage mechanisms,
data access paths, and physical database design.
● Tables and Relations: Central to RDBMS is the concept of
tables (or relations). Each table consists of rows and columns,
structured so that logical relationships between tables can
be maintained.
● Data Independence: This is the principle by which changes
in the schema at one level (say physical) do not necessitate a
change at another level (say logical), offering a level of
insulation.
1.3.2 The Concept of Normalisation and its Importance
● Defining Normalisation: In essence, normalisation is a
systematic approach to breaking down a table into two or
more related tables to eliminate data redundancy and
ensure data is stored logically.
● Forms of Normalisation: There are multiple normal forms
(from 1NF to 5NF, BCNF, and more), with each successive
form addressing certain types of anomalies or redundancies.
● Importance: Normalisation ensures efficient data usage,
enhances database performance, and streamlines the
enforcement of integrity constraints.
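The idea can be made concrete with a small sketch: a flat table that repeats department details on every employee row is split into two related tables. All table and column names here are hypothetical, and the example uses SQLite through Python's sqlite3 module.

```python
import sqlite3

conn = sqlite3.connect(":memory:")

# Unnormalised: department facts repeated on every employee row (redundancy).
conn.execute("CREATE TABLE flat (emp_name TEXT, dept_name TEXT, dept_location TEXT)")
conn.executemany("INSERT INTO flat VALUES (?, ?, ?)",
                 [("Asha", "Sales", "Mumbai"),
                  ("Ravi", "Sales", "Mumbai"),
                  ("Meera", "HR", "Pune")])

# Normalised: each department fact is stored once and referenced by key.
conn.execute("CREATE TABLE departments (id INTEGER PRIMARY KEY, name TEXT, location TEXT)")
conn.execute("CREATE TABLE employees (name TEXT, dept_id INTEGER REFERENCES departments(id))")
conn.execute("INSERT INTO departments VALUES (1, 'Sales', 'Mumbai'), (2, 'HR', 'Pune')")
conn.executemany("INSERT INTO employees VALUES (?, ?)",
                 [("Asha", 1), ("Ravi", 1), ("Meera", 2)])

# An update now touches exactly one row instead of every affected employee.
conn.execute("UPDATE departments SET location = 'Nagpur' WHERE name = 'Sales'")
locations = {loc for (loc,) in conn.execute(
    "SELECT d.location FROM employees e "
    "JOIN departments d ON e.dept_id = d.id WHERE d.name = 'Sales'")}
```

In the flat table the same update would have to find and change every Sales row, and missing one would leave the data inconsistent; the normalised design removes that class of anomaly.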
1.3.3 Key Features and Benefits of RDBMS
● Consistency and Integrity: RDBMS offers tools and
mechanisms to enforce data integrity through the use of
primary keys, foreign keys, and other constraints.
● SQL (Structured Query Language): A standardised language
for querying and manipulating data.
● Concurrency Control: Multiple users can access the
database concurrently without compromising the
consistency of data.
● Backup and Recovery: Robust systems in place to back up
data and restore it in case of failures.
1.3.4 ACID Properties: Ensuring Data Integrity and Reliability
● Atomicity: Ensures that transactions are treated as a single
unit. Either they are completed fully or not at all.
● Consistency: Guarantees that a transaction brings a
database from one valid state to another.
● Isolation: Makes sure that concurrent transactions appear
to be executed sequentially.
● Durability: Once a transaction has been committed, it
remains so, even in the event of power losses or system
crashes.
1.3.5 Flexibility and Scalability: Meeting Modern Data Needs
● Dynamic SQL: Enables constructing SQL statements
dynamically at runtime, offering a high level of flexibility in
querying.
● Horizontal Scalability: With advancements, many RDBMSs
can scale horizontally across clusters, accommodating
growth in data.
● Interfacing with Modern Technologies: RDBMSs have
evolved to interface seamlessly with big data technologies,
APIs, and other modern tech solutions.
Security and Access Control Mechanisms
● Authentication: Ensures that only authorised users can
access the database.
● Authorization: Determines what operations an
authenticated user can perform.
● Encryption: Data at rest or in transit is encrypted, ensuring
it's not easily readable if intercepted.
● Auditing: Tracking mechanisms to record who did what and
when in the database, useful for compliance and forensic
analysis.
1.4 Basics of SQL
1.4.1 SQL: Origins and Evolution
● SQL, which stands for Structured Query Language, emerged
during the early 1970s at IBM, where Donald Chamberlin and
Raymond Boyce developed it (initially under the name SEQUEL)
as a domain-specific language for managing and manipulating
relational databases.
● As relational database management systems (RDBMS)
became more popular, SQL became the standard language
for database operations. It was formally adopted as a
standard by ANSI in 1986 and by ISO in 1987.
● Over the years, different versions and dialects of SQL have
emerged as various database systems implemented and
extended the language. Yet, at its core, SQL remains
consistent and is universally recognised.
Why SQL is Essential for Data Scientists
● Data Retrieval and Cleaning: Most of the data that data
scientists encounter is stored in databases. SQL enables
them to retrieve, clean, and transform this data for analysis.
● Complex Analytics: Beyond basic retrieval, SQL allows for
complex computations, aggregations, and joins that are
essential for data-driven decision making.
● Interactivity with RDBMS: With SQL, data scientists can
interact directly with databases, ensuring data integrity and
enabling real-time data analytics.
1.4.2 Data Manipulation Language (DML) in Action
● DML refers to the subset of SQL commands used for data
manipulation, including retrieving, storing, modifying, and
deleting data.
o SELECT: Retrieves data from a table
o INSERT: Adds new data to a table
o UPDATE: Modifies existing data in a table
o DELETE: Removes data from a table
CRUD Operations: Select, Insert, Update, and Delete
● CRUD stands for Create, Read, Update, and Delete. These
operations form the basis of any database interaction.
o Create: Corresponds to the INSERT command in SQL.
o Read: Executed via the SELECT command.
o Update: Achieved through the UPDATE command.
o Delete: Carried out with the DELETE command.
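The four CRUD operations map directly onto the four DML commands, which a short walkthrough makes concrete. The sketch below uses SQLite via Python's sqlite3 module; the 'products' table and its values are made up for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE products (id INTEGER PRIMARY KEY, name TEXT, price REAL)")

# Create -> INSERT: add new rows to the table.
conn.execute("INSERT INTO products (name, price) VALUES ('pen', 10.0)")
conn.execute("INSERT INTO products (name, price) VALUES ('book', 150.0)")

# Read -> SELECT: retrieve rows.
names = [n for (n,) in conn.execute("SELECT name FROM products ORDER BY id")]

# Update -> UPDATE: modify existing rows.
conn.execute("UPDATE products SET price = 12.0 WHERE name = 'pen'")
pen_price = conn.execute("SELECT price FROM products WHERE name = 'pen'").fetchone()[0]

# Delete -> DELETE: remove rows.
conn.execute("DELETE FROM products WHERE name = 'book'")
remaining = conn.execute("SELECT COUNT(*) FROM products").fetchone()[0]
```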
Filtering and Sorting: The Power of the WHERE Clause
● The WHERE clause in SQL permits filtering of records based
on specified conditions, allowing users to extract meaningful
data.
● Additionally, the ORDER BY keyword can be used in
conjunction with WHERE to sort the results based on
particular columns.
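Combining the two clauses looks like this in practice. The 'sales' table and its figures below are invented for illustration; the query itself is standard SQL, executed here through Python's sqlite3 module.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE sales (region TEXT, amount REAL)")
conn.executemany("INSERT INTO sales VALUES (?, ?)",
                 [("North", 500), ("South", 1200), ("East", 800), ("West", 1500)])

# WHERE filters the rows; ORDER BY sorts whatever survives the filter.
big_sales = conn.execute("""
    SELECT region, amount
    FROM sales
    WHERE amount > 700
    ORDER BY amount DESC
""").fetchall()
```

Only the three regions above the 700 threshold are returned, largest first; 'North' is filtered out before sorting ever happens.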
1.4.3 Exploring Data Definition Language (DDL)
● DDL encompasses SQL commands that define or modify the
structure of database objects.
o CREATE TABLE: Defines a new table
o ALTER TABLE: Modifies an existing table (e.g., adding
or removing columns)
o DROP TABLE: Deletes an existing table
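The three DDL commands can be run in sequence to see their effect on the schema. The sketch below uses SQLite (which supports ALTER TABLE ... ADD COLUMN and exposes its catalogue via PRAGMA table_info and sqlite_master, both SQLite-specific); the 'staff' table is a hypothetical example.

```python
import sqlite3

conn = sqlite3.connect(":memory:")

# CREATE TABLE: define a new table.
conn.execute("CREATE TABLE staff (id INTEGER PRIMARY KEY, name TEXT)")

# ALTER TABLE: modify the existing table by adding a column.
conn.execute("ALTER TABLE staff ADD COLUMN email TEXT")

# Inspect the column list after the ALTER (SQLite-specific catalogue query).
columns = [row[1] for row in conn.execute("PRAGMA table_info(staff)")]

# DROP TABLE: delete the table entirely.
conn.execute("DROP TABLE staff")
tables = conn.execute(
    "SELECT name FROM sqlite_master WHERE type = 'table'").fetchall()
```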
Schema Management: Creating, Altering, and Dropping Tables
● A schema in RDBMS refers to the organised collection of
database objects like tables, views, indexes, etc.
● Proper schema management ensures database integrity,
optimises performance, and eases data retrieval.
Indexing: Boosting Query Performance
● Indexes are used to speed up the retrieval of records in a
database table.
● By creating indexes on columns that are frequently queried,
you can significantly enhance query performance, especially
in large datasets.
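Whether a query actually benefits from an index can be checked by asking the query planner. The sketch below is SQLite-specific (EXPLAIN QUERY PLAN; other RDBMSs have their own EXPLAIN variants), and the 'orders' table and index name are illustrative.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders (id INTEGER PRIMARY KEY, customer TEXT, amount REAL)")
conn.executemany("INSERT INTO orders (customer, amount) VALUES (?, ?)",
                 [(f"cust{i}", i * 1.0) for i in range(1000)])

# Create an index on the column we expect to query frequently.
conn.execute("CREATE INDEX idx_orders_customer ON orders (customer)")

# Ask the planner how it will execute a lookup on the indexed column.
plan = conn.execute(
    "EXPLAIN QUERY PLAN SELECT * FROM orders WHERE customer = 'cust500'"
).fetchall()
plan_text = " ".join(str(row) for row in plan)
uses_index = "idx_orders_customer" in plan_text  # index lookup, not a full scan
```

Without the index the plan would show a full scan of 'orders'; with it, the engine searches the index directly, which is the speed-up the bullet points describe.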
Mastering Data Control Language (DCL)
● DCL involves SQL commands that control access to data and
database objects.
o GRANT: Provides specific privileges to users or roles
o REVOKE: Removes specific privileges from users or
roles
1.4.4 Granting and Revoking Access: The Role of Permissions
● In a multi-user database environment, managing who can do
what is crucial. With SQL's GRANT and REVOKE commands,
administrators can specify permissions at granular levels,
ensuring data security and integrity.
Transactions and Locking Mechanisms: Ensuring Data
Consistency
● Transactions ensure that a series of SQL commands are
executed completely or not at all, maintaining database
integrity.
● Locking mechanisms prevent multiple transactions from
conflicting with each other, especially in multi-user
environments.
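The all-or-nothing behaviour of a transaction can be demonstrated with a classic transfer example. The sketch below uses SQLite via Python's sqlite3 module (which, by default, opens an implicit transaction before DML statements); the 'accounts' table and balances are invented for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE accounts "
             "(name TEXT PRIMARY KEY, balance REAL CHECK (balance >= 0))")
conn.execute("INSERT INTO accounts VALUES ('alice', 100.0), ('bob', 50.0)")
conn.commit()

try:
    # A transfer of 200 when alice only has 100: the second statement
    # violates the CHECK constraint, so the transfer must not happen at all.
    conn.execute("UPDATE accounts SET balance = balance + 200 WHERE name = 'bob'")
    conn.execute("UPDATE accounts SET balance = balance - 200 WHERE name = 'alice'")
    conn.commit()
except sqlite3.IntegrityError:
    conn.rollback()  # undo the partial transfer, including bob's credit

balances = dict(conn.execute("SELECT name, balance FROM accounts"))
```

After the rollback both balances are exactly what they were before the failed transfer: the credit to 'bob' did not survive on its own, which is the consistency guarantee transactions provide.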
1.5 Summary
❖ Databases are centralised systems that store, manage, and
retrieve information. They're pivotal in data science for data
management and analytics.
❖ Relational Databases use tables to store data and define
relationships between them. Non-Relational Databases offer
flexible data models, such as document stores and key-value
pairs, optimised for specific use-cases.
❖ Relational Database Management Systems allow efficient
organisation and retrieval of data using SQL, while ensuring
data integrity, security, and reliability.
❖ Structured Query Language, a standard language used to
interact with relational databases, covering data definition,
manipulation, and control.
❖ A subset of SQL, the Data Manipulation Language deals with
data operations, such as inserting, updating, retrieving, and
deleting records.
❖ Data Definition Language focuses on the structure and
schema of the database (like creating or altering tables),
while Data Control Language handles permissions and access
rights for data security.
1.6 Keywords
● Database: A database is an organised collection of data,
stored and accessed electronically. Databases help in
efficient data management and retrieval. They can be
classified into various types based on their structure and
use-case, such as relational and non-relational databases.
● RDBMS (Relational Database Management System):
RDBMS is a type of database management system where
data is stored in tables (relation) and the relationships
between these tables are established using keys. Examples
of RDBMS include Oracle, MySQL, and PostgreSQL. The core
principle behind an RDBMS is the concept of normalisation
which aims to minimise redundancy and dependency by
organising fields and tables of a database.
● NoSQL: NoSQL, or "not only SQL," represents a broad class
of database management systems that differ from
traditional RDBMS. They do not require fixed table schemas,
avoid join operations, and typically scale horizontally. Types
include document stores (e.g., MongoDB), key-value stores
(e.g., Redis), graph databases (e.g., Neo4j), and columnar
databases (e.g., Cassandra).
● SQL (Structured Query Language): SQL is a standard
programming language specifically designed for managing
and manipulating relational databases. It allows users to
query the database (using DML operations), define it (using
DDL operations), and set permissions (using DCL operations).
● ACID Properties: ACID stands for Atomicity, Consistency,
Isolation, and Durability. These are a set of properties that
guarantee reliable processing of database transactions. They
ensure that even in the event of a system failure, the
database remains in a consistent state.
● Normalisation: Normalisation is a process in relational
database design that reduces data redundancy and
dependency by organising data into separate tables based
on their dependencies on the primary key. It involves
dividing a database into two or more tables and defining
relationships between the tables. This process helps in
ensuring data integrity and optimising storage.
1.7 Self-Assessment Questions
1. How does normalisation within an RDBMS help in reducing
data redundancy?
2. What are the primary differences between a relational
database and a NoSQL database when it comes to data
storage structures and querying capabilities?
3. Which SQL command, falling under the category of Data
Manipulation Language (DML), would you use to add new
records into a database table?
4. What are the ACID properties in RDBMS, and why are they
crucial for ensuring data reliability and consistency?
5. Which component of the Data Control Language (DCL) is
responsible for granting specific permissions to a user for
accessing certain parts of a database?
1.8 Case Study
Title: Optimising Retail Operations with SQL
Background:
One of India's burgeoning e-commerce platforms, "ShopSutra,"
noticed a dip in its sales over a span of three months. Concerned
about the downturn, the company decided to analyse its vast
database to uncover the potential causes and develop data-driven
strategies for improvement.
Challenge:
The primary database, built on an RDBMS, stored information on
products, transactions, customer reviews, and inventory levels.
While the data was comprehensive, extracting meaningful insights
required advanced SQL skills.
Solution: An experienced data scientist from the "ShopSutra"
team started by analysing the sales data using SQL queries to
categorise products by sales volume and identify any patterns.
The queries revealed that while electronics and home appliances
were still top sellers, there was a sharp decline in the fashion
category.
A deeper dive into customer reviews and ratings for fashion
products using SQL queries pointed towards the recurring
complaint of product misrepresentation in images. Many
customers felt that the actual product differed significantly from
its online representation.
Furthermore, analysing the inventory data using SQL showed that
several top-rated fashion products were frequently out of stock,
leading to missed sales opportunities.
With the insights gained, "ShopSutra" implemented corrective
measures. They established stringent quality checks for product
images and descriptions in the fashion category. Inventory
management was overhauled using data-driven predictive models
to ensure high-demand items remained in stock.
Results: After three months, there was a noticeable uptick in sales,
especially in the fashion category. Positive customer reviews
increased, and the out-of-stock issues were significantly reduced.
"ShopSutra" acknowledged the power of SQL programming in
helping them turn their business around by providing actionable
insights.
Questions:
1. What primary issue did "ShopSutra" identify as a cause for
their declining sales in the fashion category?
2. How did SQL help in addressing the inventory management
problem faced by "ShopSutra"?
3. Based on the case study, how crucial do you think data
accuracy (in terms of product images and descriptions) is for
e-commerce businesses, and how can SQL play a role in
monitoring such accuracy?
1.9 References
1. "Database System Concepts" by Abraham Silberschatz,
Henry F. Korth, and S. Sudarshan.
2. "SQL Performance Explained" by Markus Winand.
3. "Designing Data-Intensive Applications" by Martin
Kleppmann.
4. "The Data Warehouse Toolkit: The Definitive Guide to
Dimensional Modelling" by Ralph Kimball and Margy Ross.
5. "NoSQL Distilled: A Brief Guide to the Emerging World of
Polyglot Persistence" by Pramod J. Sadalage and Martin
Fowler.