SQL Material
Language
The way of storing relational data
Data Engineering
Consider how the data will be used in the future so that the format you choose makes sense. Here are some of the questions you might want to consider:
• How do I store multimodal data, e.g., a sample that
might contain both images and texts?
• Where do I store my data so that it’s cheap and still fast
to access?
Row-Major and Column Major
• Row-major formats are better when you have to do a lot of writes, whereas column-major ones are
better when you have to do a lot of column-based reads.
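The trade-off above can be illustrated with a minimal sketch. The sample records and variable names are hypothetical, and plain Python lists only mimic how a storage engine might lay out data.

```python
# Hypothetical sample data: three records with two fields each.
rows = [
    {"id": 1, "price": 10.0},
    {"id": 2, "price": 12.5},
    {"id": 3, "price": 9.0},
]

# Row-major layout: the fields of one record sit together,
# so adding a record is a single append.
row_major = [(r["id"], r["price"]) for r in rows]
row_major.append((4, 11.0))

# Column-major layout: the values of one column sit together,
# so reading a single column does not touch the other columns.
column_major = {
    "id": [r[0] for r in row_major],
    "price": [r[1] for r in row_major],
}
average_price = sum(column_major["price"]) / len(column_major["price"])
```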
• Consider that you want to store the
number 1000000. If you store it in a text
file, it’ll require 7 characters, and if each
character is 1 byte, it’ll require 7 bytes. If
you store it in a binary file as int32, it’ll
take only 32 bits or 4 bytes.
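The size difference can be checked with a short sketch using Python's standard `struct` module. The little-endian byte order chosen here is an assumption for illustration; the size is the same either way.

```python
import struct

number = 1000000

# Text encoding: one byte per character in ASCII.
as_text = str(number).encode("ascii")

# Binary encoding: a signed 32-bit integer ("<i" = little-endian int32).
as_int32 = struct.pack("<i", number)

text_size = len(as_text)
binary_size = len(as_int32)
print(text_size, binary_size)
```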
Data Models
• Data models describe how data is represented. Consider cars in the real world. In a
database, a car can be described using its make, its model, its year, its color, and its
price.
• Alternatively, you can also describe a car using its owner, its license plate, and its history
of registered addresses. This is another data model for cars.
• Two types of Data Models: Relational models and NoSQL models.
Relational Data Model
NoSQL Data Model
• All documents in a document database are assumed to be encoded in the same format.
• Each document has a unique key that represents that document, which can be used to retrieve it.
• A document is often a single continuous string, encoded as JSON, XML, or a binary format like BSON
(Binary JSON)
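As a rough sketch of the idea, a Python dict can stand in for a document database: each unique key maps to one JSON-encoded string. The `put` and `get` helpers and the key format are hypothetical and do not correspond to any particular product's API.

```python
import json

# A dict stands in for the document database: unique key -> encoded document.
store = {}

def put(key, document):
    """Encode the document as a single JSON string and store it under its key."""
    store[key] = json.dumps(document)

def get(key):
    """Retrieve a document by its unique key and decode it."""
    return json.loads(store[key])

put("car:1", {"make": "Toyota", "model": "Corolla", "year": 2020})
car = get("car:1")
```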
Graph Data Model
• The graph model is built around the concept of a “graph.”
• A graph consists of nodes and edges, where the edges represent the relationships between the
nodes.
• A database that uses graph structures to store its data is called a graph database.
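A graph of nodes and edges can be sketched as a list of triples. The node names, relationship labels, and the `neighbors` helper below are hypothetical; real graph databases use their own storage formats and query languages.

```python
# Each edge is a (source node, relationship, destination node) triple.
edges = [
    ("alice", "follows", "bob"),
    ("bob", "follows", "carol"),
    ("alice", "owns", "car_1"),
]

def neighbors(node, relationship):
    """Return the nodes reachable from `node` over edges with this relationship."""
    return [dst for src, rel, dst in edges if src == node and rel == relationship]

alice_follows = neighbors("alice", "follows")
```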
Structured and Unstructured
Declarative & Imperative
• In the declarative paradigm, you specify the outputs you want, and the computer
figures out the steps needed to get you the queried outputs.
• In the imperative paradigm, you specify the steps needed for an action and the
computer executes these steps to return the outputs.
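The contrast can be sketched by answering the same question both ways. The `cars` data is hypothetical, and an in-memory SQLite database (via Python's standard `sqlite3` module) is used only as one convenient example of a declarative engine.

```python
import sqlite3

# Hypothetical sample data: (make, year, price).
cars = [("Toyota", 2020, 18000), ("Honda", 2015, 9000), ("Ford", 2021, 25000)]

# Imperative: spell out every step (loop, filter, collect, sort).
imperative_result = []
for make, year, price in cars:
    if year >= 2020:
        imperative_result.append(make)
imperative_result.sort()

# Declarative: describe the output you want; the engine decides how to get it.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE cars (make TEXT, year INTEGER, price INTEGER)")
conn.executemany("INSERT INTO cars VALUES (?, ?, ?)", cars)
declarative_result = [
    row[0]
    for row in conn.execute("SELECT make FROM cars WHERE year >= 2020 ORDER BY make")
]
```

Both approaches produce the same list of makes; only the declarative version leaves the choice of steps to the database.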
OLTP vs OLAP
• OLTP: Online Transaction Processing
• OLTP systems are designed to support everyday transaction-
oriented applications in industries such as banking, retail, logistics,
etc.
• Prioritizes fast query processing and maintaining data integrity in
multi-access environments.
• Data is often current, not historical.
• Examples: A bank's system where customers withdraw or deposit
money; a retailer's system where customers make purchases.
• OLAP: Online Analytical Processing
• OLAP systems are designed to support complex queries and offer
business insights. They facilitate multi-dimensional analytical
queries, providing a platform for business intelligence and data
mining.
• Simple relationships with fewer joins.
• Aggregated data.
• Commonly uses schemas like star and snowflake.
• Examples: An e-commerce company analysing sales trends over
the past year; a system providing business performance metrics.
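The two workloads can be contrasted in a small sketch. The `accounts` and `sales` tables are hypothetical, and a single in-memory SQLite database is used for both only to keep the example short; in practice OLTP and OLAP workloads often run on separate systems.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE accounts (id INTEGER PRIMARY KEY, balance INTEGER NOT NULL)")
conn.execute("CREATE TABLE sales (month TEXT, amount INTEGER)")
conn.executemany("INSERT INTO accounts VALUES (?, ?)", [(1, 100), (2, 50)])
conn.executemany(
    "INSERT INTO sales VALUES (?, ?)",
    [("2023-01", 200), ("2023-01", 300), ("2023-02", 150)],
)
conn.commit()

# OLTP-style: a short transaction that touches a few current rows.
# The "with" block commits both updates together, or rolls both back on error.
with conn:
    conn.execute("UPDATE accounts SET balance = balance - 30 WHERE id = 1")
    conn.execute("UPDATE accounts SET balance = balance + 30 WHERE id = 2")
balances = conn.execute("SELECT id, balance FROM accounts ORDER BY id").fetchall()

# OLAP-style: an aggregate query that summarizes many historical rows.
monthly_totals = conn.execute(
    "SELECT month, SUM(amount) FROM sales GROUP BY month ORDER BY month"
).fetchall()
```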
What is a Relational Database
• A relational database is a collection of information organized by predefined relationships, where data is stored in one or more tables (or "relations") of columns and rows.
Different SQL Tools
• MySQL: An open-source relational database management system, owned by Oracle Corporation. One of the most popular databases for web-based applications.
• Microsoft SQL Server: A relational database management system developed by Microsoft. Used for a variety of applications ranging from small applications to large-scale enterprise applications. SQL Server uses T-SQL as its primary query language.
• PostgreSQL: An open-source relational database management system (RDBMS), known for its extensibility and SQL compliance. It is not just a SQL processing tool but also offers "NoSQL" capabilities.
• PL/SQL (Procedural Language for SQL): Predominantly used in Oracle Databases for writing stored procedures, functions, and triggers.
• SQLite: A C-language library that offers a lightweight, disk-based
database, which doesn’t require a separate server process. It's
serverless, self-contained, and zero-configuration.
4 Stages of DBMS
• Database Management Systems have four important characteristics:
• Data Definition – Define the data being tracked
• Data Manipulation – Add, update, and remove data
• Data Retrieval – Extract and report the data available in the database
• Administration – Define users on the system, security, monitoring, and system administration
Database Tables
• A database table is a lot like a spreadsheet.
• Data is kept in columns and rows.
• Each column is assigned:
• A unique, human-readable name (e.g., FIRST_NAME, LAST_NAME)
• A data type (e.g., String, Date, Time, Number)
• Optionally, constraints (e.g., whether a value is required, the maximum length of a string)
• Each row is a distinct database record.
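A table definition along these lines might look as follows. This is a sketch using SQLite through Python's `sqlite3` module; the `person` table, its columns, and the 50-character limit are assumptions for illustration. SQLite has no dedicated date type, so the date is stored as text here.

```python
import sqlite3

conn = sqlite3.connect(":memory:")

# Each column gets a name, a data type, and optional constraints.
conn.execute("""
    CREATE TABLE person (
        first_name TEXT NOT NULL,
        last_name  TEXT NOT NULL CHECK (length(last_name) <= 50),
        birth_date TEXT
    )
""")

# Each row is one record.
conn.execute("INSERT INTO person VALUES ('Ada', 'Lovelace', '1815-12-10')")

# A row that violates a constraint (missing required first_name) is rejected.
try:
    conn.execute("INSERT INTO person VALUES (NULL, 'Nobody', NULL)")
    rejected = False
except sqlite3.IntegrityError:
    rejected = True

row_count = conn.execute("SELECT COUNT(*) FROM person").fetchone()[0]
```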
Primary Key & Surrogate Key
• A Primary Key is a special database column, or set of columns, used to uniquely identify a database record. Declaring one is optional.
• A Surrogate Key is a type of Primary Key that uses a unique generated value.
• A Surrogate Key should have no business meaning and should never change.
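A surrogate key can be sketched with SQLite's auto-generated integer keys. The `customer` table and the email values are hypothetical; the point is that the business data can change while the generated key stays the same.

```python
import sqlite3

conn = sqlite3.connect(":memory:")

# customer_id is a generated surrogate key; email is business data.
conn.execute("""
    CREATE TABLE customer (
        customer_id INTEGER PRIMARY KEY AUTOINCREMENT,
        email       TEXT NOT NULL UNIQUE
    )
""")
conn.execute("INSERT INTO customer (email) VALUES ('ada@example.com')")
conn.execute("INSERT INTO customer (email) VALUES ('grace@example.com')")

# The business value changes, but the surrogate key does not.
conn.execute("UPDATE customer SET email = 'ada@new.example.com' WHERE customer_id = 1")

customers = conn.execute(
    "SELECT customer_id, email FROM customer ORDER BY customer_id"
).fetchall()
```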
Data Relationships
• One to One - A record in Table A matches exactly one record in Table B.
• One to Many - A record in Table A matches many records in Table B, but each record in Table B matches only one record in Table A. (Think: an order with multiple items.)
• Many to Many - A record in Table A matches many records in Table B, and a record in Table B matches many records in Table A.
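These relationships are commonly modeled with foreign keys. The sketch below assumes hypothetical `orders`/`order_items` tables for one-to-many and `students`/`courses` with an `enrollments` linking table for many-to-many; it is one conventional layout, not the only one.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")  # SQLite enforces foreign keys only when enabled
conn.executescript("""
    -- One to many: each item points at exactly one order.
    CREATE TABLE orders (order_id INTEGER PRIMARY KEY);
    CREATE TABLE order_items (
        item_id  INTEGER PRIMARY KEY,
        order_id INTEGER NOT NULL REFERENCES orders(order_id),
        product  TEXT NOT NULL
    );

    -- Many to many: a linking table holds one row per (student, course) pair.
    CREATE TABLE students (student_id INTEGER PRIMARY KEY);
    CREATE TABLE courses (course_id INTEGER PRIMARY KEY);
    CREATE TABLE enrollments (
        student_id INTEGER NOT NULL REFERENCES students(student_id),
        course_id  INTEGER NOT NULL REFERENCES courses(course_id),
        PRIMARY KEY (student_id, course_id)
    );

    INSERT INTO orders VALUES (1);
    INSERT INTO order_items VALUES (1, 1, 'pen'), (2, 1, 'paper');
    INSERT INTO students VALUES (1), (2);
    INSERT INTO courses VALUES (10), (20);
    INSERT INTO enrollments VALUES (1, 10), (1, 20), (2, 10);
""")

def count(sql):
    """Run a COUNT(*) query and return the single number it produces."""
    return conn.execute(sql).fetchone()[0]

items_in_order_1 = count("SELECT COUNT(*) FROM order_items WHERE order_id = 1")
courses_for_student_1 = count("SELECT COUNT(*) FROM enrollments WHERE student_id = 1")
students_in_course_10 = count("SELECT COUNT(*) FROM enrollments WHERE course_id = 10")
```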
DDL
• DDL - Data Definition Language (e.g., CREATE TABLE ...) is used to define the relational model.
• Under the covers, the RDBMS stores data about your tables in catalog tables.
• The RDBMS enforces that the data being stored conforms to the rules you’ve defined for it.
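As one concrete illustration, SQLite keeps its catalog in a table named `sqlite_master`, which can be queried like any other table. The `car` table is hypothetical, and other RDBMSs expose their catalogs under different names.

```python
import sqlite3

conn = sqlite3.connect(":memory:")

# DDL: define a table.
conn.execute("CREATE TABLE car (make TEXT NOT NULL, year INTEGER)")

# The definition is now recorded in SQLite's catalog table.
catalog = conn.execute("SELECT type, name FROM sqlite_master").fetchall()

# PRAGMA table_info lists the columns; the second field of each row is the column name.
columns = [row[1] for row in conn.execute("PRAGMA table_info(car)")]
```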
DML
• DML - Data Manipulation Language
• Allows you to add (INSERT), change (UPDATE), or remove (DELETE) data.
• The RDBMS enforces that data manipulation adheres to the rules of the Data Definition.
• The RDBMS lets you set up ‘rules’ for multi-user systems.
• These rules manage what happens under competing conditions (for example, when two users want to update the same data at the same time).
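The three DML statements, and the enforcement of the data definition, can be sketched as follows. The `car` table and its values are hypothetical; SQLite is used only as a convenient example engine.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE car (plate TEXT PRIMARY KEY, color TEXT NOT NULL)")

# INSERT adds rows, UPDATE changes them, DELETE removes them.
conn.execute("INSERT INTO car VALUES ('ABC123', 'red')")
conn.execute("INSERT INTO car VALUES ('XYZ789', 'blue')")
conn.execute("UPDATE car SET color = 'green' WHERE plate = 'ABC123'")
conn.execute("DELETE FROM car WHERE plate = 'XYZ789'")

# A change that breaks the definition (color is NOT NULL) is refused.
try:
    conn.execute("UPDATE car SET color = NULL WHERE plate = 'ABC123'")
    update_rejected = False
except sqlite3.IntegrityError:
    update_rejected = True

remaining = conn.execute("SELECT plate, color FROM car").fetchall()
```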
Retrieval
• Data Retrieval is the act of pulling data out of the database.
• The RDBMS chooses a strategy for retrieving data from the database, aiming for the most efficient one.
• Multi-table joins can become very complex.
• Consider tables with billions and billions of rows.
• Reports can go from seconds to hours when the retrieval strategy is wrong.
• The RDBMS also considers what happens when updates occur
while your report is running.
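One way to see the retrieval strategy is to ask the database for its plan. The sketch below uses SQLite's `EXPLAIN QUERY PLAN`; the `orders` table and index name are hypothetical, and the exact wording of the plan text varies between SQLite versions, so it is only inspected loosely.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE orders (order_id INTEGER PRIMARY KEY, customer_id INTEGER, total INTEGER)"
)
conn.executemany(
    "INSERT INTO orders (customer_id, total) VALUES (?, ?)",
    [(i % 100, i) for i in range(1000)],
)

query = "SELECT * FROM orders WHERE customer_id = 42"

def plan(sql):
    """Return the plan description (the last field of each EXPLAIN QUERY PLAN row)."""
    return " ".join(row[-1] for row in conn.execute("EXPLAIN QUERY PLAN " + sql))

# Without an index the engine has to scan the whole table.
plan_without_index = plan(query)

# With an index on the filtered column, the engine can search the index instead.
conn.execute("CREATE INDEX idx_orders_customer ON orders (customer_id)")
plan_with_index = plan(query)

print(plan_without_index)
print(plan_with_index)
```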
Character Set
• Computers are driven by binary information, i.e., ones and zeros.
• A ‘bit’ is a binary one or zero.
• A byte is a collection of eight bits (e.g., 10000111, which is 135 in decimal).
• ASCII - American Standard Code for Information Interchange
• One of the first ‘character’ sets
• Limited to 128 characters (mostly letters, numbers, common punctuation)
• UTF-8 is widely used for email and the web. Each character is 1 to 4 bytes long.
• Up to 1,112,064 characters
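A few of these facts can be checked directly in Python. The sample characters are arbitrary choices for illustration.

```python
# Eight bits read as an unsigned integer: 01000001 is the ASCII code for "A".
byte_value = int("01000001", 2)
ascii_bytes = "A".encode("ascii")

# UTF-8 uses between 1 and 4 bytes per character.
utf8_lengths = {ch: len(ch.encode("utf-8")) for ch in ["A", "é", "€", "😀"]}
print(byte_value, ascii_bytes, utf8_lengths)
```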
Data Normalization
Database Normalization is one of the most important factors in database design and data modeling. It is the process of eliminating data redundancies and storing the data logically to make data management easier. The normal forms, from least to most strict, are:
• First Normal Form (1NF)
• Second Normal Form (2NF)
• Third Normal Form (3NF)
• Boyce-Codd Normal Form (BCNF)
• Fourth Normal Form (4NF)
• Fifth Normal Form (5NF)
First Normal Form
In the first normal form, each column must contain only one value.
No table should store repeating groups of related data. The easiest
way to follow the first normal form is to inspect the database table
horizontally.
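As a small sketch of the one-value-per-column rule, the hypothetical rows below hold several phone numbers in a single cell; splitting them gives one value per cell and one row per (person, phone) pair.

```python
# Hypothetical rows that violate 1NF: the third field holds several values.
unnormalized = [
    (1, "Ada", "555-0100,555-0101"),
    (2, "Grace", "555-0200"),
]

# 1NF: one value per cell, one row per (person_id, phone) pair.
person_phones = [
    (person_id, phone)
    for person_id, _name, phones in unnormalized
    for phone in phones.split(",")
]
```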
Second Normal Form
In the second normal form, the table must first be in the first normal form, and there must not be any partial dependency: no non-key column may depend on only part of a composite primary key. Columns that depend on only part of the key should be stored in their own separate tables and linked to the original table using foreign keys.
Third Normal Form
In the third normal form, the table must already be in the second normal form. In addition, every non-key column must be mutually independent: none may depend on another non-key column. Identify any columns in the table that are interdependent and break those columns into their own separate tables.
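The step of breaking interdependent columns into their own table can be sketched as follows. The employee and department data are hypothetical; here `dept_name` depends on `dept_id` rather than directly on the key `employee_id`.

```python
# Hypothetical table of (employee_id, dept_id, dept_name).
# dept_name is determined by dept_id, another non-key column.
employees_wide = [
    (1, 10, "Sales"),
    (2, 10, "Sales"),
    (3, 20, "Support"),
]

# Keep only the columns that depend directly on the key ...
employees = [(emp_id, dept_id) for emp_id, dept_id, _ in employees_wide]

# ... and move the interdependent columns into their own table, one row per department.
departments = sorted({(dept_id, dept_name) for _, dept_id, dept_name in employees_wide})
```

After the split, each department name is stored once instead of once per employee.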
Functional Dependency: When the value of one attribute (or set of attributes) determines the value of another attribute within a table, the relationship is called a functional dependency. A common case is a non-key attribute that depends on the primary key.
• X -> Y
• Here, X is known as the determinant, and Y is known as the dependent.
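Whether X -> Y holds in a given set of rows can be tested by checking that no value of X appears with two different values of Y. The `holds` helper and the sample rows are hypothetical, and the check only covers the rows it is given, not the rule the designer intends.

```python
def holds(rows, x, y):
    """Return True if every value of column x maps to a single value of column y."""
    seen = {}
    for row in rows:
        if row[x] in seen and seen[row[x]] != row[y]:
            return False  # same determinant, different dependent
        seen[row[x]] = row[y]
    return True

rows = [
    {"student_id": 1, "name": "Ada", "course": "SQL"},
    {"student_id": 1, "name": "Ada", "course": "Python"},
    {"student_id": 2, "name": "Grace", "course": "SQL"},
]

id_determines_name = holds(rows, "student_id", "name")
id_determines_course = holds(rows, "student_id", "course")
```

In these rows student_id -> name holds, while student_id -> course does not, because student 1 appears with two different courses.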