SQL Data Warehousing
• The server is the component that provides data access or other requested services to the clients.
• One of the main benefits of a client-server architecture is that it is more scalable than a centralized architecture. As the number of clients and/or the amount of data increases, the server can be upgraded, or additional servers can be added to handle the load. This allows the system to continue functioning smoothly as it grows in size.
2. Servers:
• A server generally acts as the backend and provides a standardized interface to the workstations, so that clients need not be aware of the specifics of the system (i.e., the software and hardware) that is providing the service.
• For example, when a client asks for a webpage, the server responds by sending that webpage back to the client.
3. Networking Devices:
• Networking devices are the medium that connects the workstations or clients to the server. Different networking devices in a client-server architecture serve different purposes:
• Bridges are used to segment and isolate parts of the network, while hubs are used to connect workstations to a server.
Steps involved in the client-server model:
• First, the client sends its request via a network-enabled device.
• Then, the server accepts and processes the request.
• Finally, the server delivers the response to the client.
Three-tier Architecture
• The application on the client end interacts with an application server, which in turn communicates with the database system.
• The end user is unaware of the existence of the database beyond the application server, and the database is unaware of any user beyond the application. The 3-tier architecture is used for large web applications.
• In DBMS, the 3-tier architecture is a client-server architecture that separates the user
interface, application processing, and data management into three distinct tiers or layers.
• Presentation Tier: The presentation tier is the user interface or client layer of the
application. It is responsible for presenting data to the user and receiving input from the
user. This tier can be a web browser, mobile app, or desktop application.
• Application Tier: The application tier is the middle layer of the 3-tier architecture. It is
responsible for processing and managing the business logic of the application. This tier
communicates with the presentation tier to receive user input and communicates with the
data management tier to retrieve or store data. This tier may include application servers,
web servers, or APIs.
• Data Management Tier: The data management tier is the bottom layer of the 3-tier
architecture. It is responsible for managing and storing data. This tier can include
databases, data warehouses, or data lakes. The data management tier communicates with
the application tier to receive or store data.
Data Warehousing
What is a Data Warehouse?
• A Data Warehouse consists of data from multiple heterogeneous data sources and is used for
analytical reporting and decision making.
• Data Warehouse is a central place where data is stored from different data sources and
applications.
• The goal is to produce statistical results that may help in decision-making.
• For example, a college might want to quickly see various results, such as how the placement of CS students has improved over the last 10 years in terms of salaries, counts, etc. (see the sample query at the end of this list).
• A Data Warehouse is used for reporting and analyzing
information and is used to store both historical and
current data.
• The data in a DW system is used for analytical reporting, which is later used by business analysts, sales managers, or knowledge workers for decision-making.
• Data flows from multiple heterogeneous data sources into the Data Warehouse.
• Common data sources for a data warehouse include:
• Operational databases
• Flat files (xls, csv, txt files)
• Data in the data warehouse is accessed by BI (Business Intelligence) users for analytical reporting, data mining, and analysis.
• This information is used for decision-making by business users, sales managers, and analysts to define future strategy.
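The query below is a minimal sketch of the kind of analytical query a BI user might run for the college placement example above. The table placement_fact and its columns (dept, placement_year, salary) are hypothetical names used only for illustration.

-- Hypothetical reporting query against a warehouse fact table.
SELECT placement_year,
       COUNT(*)    AS students_placed,
       AVG(salary) AS avg_salary,
       MAX(salary) AS max_salary
FROM   placement_fact
WHERE  dept = 'CS'
  AND  placement_year >= 2014
GROUP BY placement_year
ORDER BY placement_year;

A single query like this summarizes ten years of placement history, which is exactly the kind of statistical result a data warehouse is built to serve.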
Applications of Data Warehousing
• Data Warehousing can be applied anywhere where we have a huge amount of data and we want to
see statistical results that help in decision making.
• Social Media Websites: Social networking sites like Facebook, Twitter, LinkedIn, etc. are based on analyzing large data sets. These sites gather data related to members, groups, locations, etc., and store it in a single central repository. Because of the sheer volume of data, a Data Warehouse is needed to implement this.
• Banking: Most banks today use warehouses to analyze the spending patterns of account/card holders and use this insight to offer them special deals, offers, etc.
• Government: Governments use data warehouses to store and analyze tax payments, which helps detect tax evasion.
• There can be many more applications in different sectors like E-Commerce, telecommunications,
Transportation Services, Marketing and Distribution, Healthcare, and Retail.
ADVANTAGES:
Improved data quality: Data warehousing can help improve data quality by
consolidating data from various sources into a single, consistent view.
Better decision-making: With a data warehouse, businesses can analyze data and gain
insights into trends and patterns that can inform better decision-making.
Scalability: Data warehousing is highly scalable and can handle large amounts of data
from different sources.
DISADVANTAGES:
Cost: Building a data warehouse can be expensive, requiring significant
investments in hardware, software, and personnel.
Data security: Data warehousing can pose data security risks, and businesses
must take measures to protect sensitive data from unauthorized access or
breaches.
Data Warehouse Architecture
• The architecture can be divided into three major components - the
data source layer, the data warehouse layer, and the data access layer.
• Data Source Layer - This layer consists of the systems that provide data to the data
warehouse. These systems can include operational databases, external data sources,
and other systems that generate or capture data. In this layer, data is extracted from
various sources and transformed into a format that can be loaded into the data
warehouse.
• Data Warehouse Layer - This layer is the central repository of data that has been
collected from the data source layer. This is where the data is stored in a structured
format, making it easier to analyze and query. The data warehouse layer is divided
into two components - the staging area and the data warehouse database. The
staging area is used to store data before it is transformed and loaded into the data
warehouse database.
• Data Access/Analysis Layer - This layer consists of the tools and interfaces that
allow users to access and analyze the data stored in the data warehouse. This layer
includes tools such as Business Intelligence software, SQL clients, and spreadsheets.
The data access layer provides users with various options to access the data
warehouse, depending on their specific needs.
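As a small sketch of the staging-to-warehouse step described above, assuming hypothetical tables stg_sales (raw data landed from the source systems) and sales_fact (the warehouse table):

-- Load cleaned rows from the staging area into the warehouse table.
-- Table and column names are illustrative assumptions.
INSERT INTO sales_fact (sale_date, product_id, store_id, amount)
SELECT CAST(sale_date AS DATE),          -- normalize the date format
       product_id,
       store_id,
       CAST(amount AS DECIMAL(10,2))     -- enforce a consistent numeric type
FROM   stg_sales
WHERE  amount IS NOT NULL;               -- discard rows that failed extraction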
Functions of Data Warehouse Tools and Utilities
Data warehouse tools and utilities are designed to perform various functions that help
manage and analyze data stored in a data warehouse.
• Data Extraction - This involves extracting data from various sources, such as
transactional databases, operational systems, and external data sources. The data is
then cleaned, transformed, and loaded into the data warehouse.
• Data Cleaning - This involves identifying and correcting errors or inconsistencies in
the data. Data cleaning ensures that the data is accurate and reliable for analysis.
• Data Transformation - This involves converting the data into a format that is suitable
for analysis. Data transformation may involve merging data from multiple sources,
reformatting data, or creating new variables.
• Data Integration - This involves integrating data from multiple sources into a single
data warehouse. This allows for a more comprehensive view of an organization's
data, which can improve decision-making.
• Data Storage - Data warehouse tools and utilities provide various storage options,
such as relational databases, columnar databases, or cloud-based storage. The choice
of storage depends on the size and complexity of the data and the organization's
needs.
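The snippet below sketches how cleaning and transformation rules might be expressed in SQL during a load, assuming hypothetical staging and dimension tables stg_customer and dim_customer and standard-SQL string functions (|| for concatenation):

-- Clean and transform customer records while loading them.
INSERT INTO dim_customer (customer_id, full_name, country_code)
SELECT customer_id,
       TRIM(first_name) || ' ' || TRIM(last_name),   -- merge and tidy name fields
       COALESCE(country_code, 'UNKNOWN')             -- fill in missing values
FROM   stg_customer
WHERE  customer_id IS NOT NULL;                      -- reject rows with no key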
Query Processing
What is Query Processing ?
• Query processing includes translating high-level queries into low-level expressions that can be used at the physical level of the file system, optimizing the query, and actually executing it to obtain the result.
Parsing and Translation
The first step in query processing is Parsing and Translation.
• The fired queries undergo lexical, syntactic, and semantic analysis.
• Essentially, the query gets broken down into different tokens and white spaces are removed
along with the comments (Lexical Analysis).
• In the next step, the query is checked for correctness, both syntactic and semantic. The query processor first checks whether the rules of SQL have been followed correctly (Syntactic Analysis).
• Finally, the query processor checks whether the meaning of the query is valid: are the table(s) mentioned in the query present in the database, and are the column(s) referenced actually present in those tables? (Semantic Analysis)
Once the above mentioned checks pass, the flow moves to convert all the tokens into relational
expressions, graphs, and trees. This makes the processing of the query easier for the other parsers.
Let's consider the same query (mentioned below as well) as an example and see how the
flow works.
SELECT emp_name FROM employee WHERE salary>10000;
• The above query would be divided into the following tokens: SELECT, emp_name, FROM,
employee, WHERE, salary, >, 10000.
Optimization
• The DBMS then finds the most efficient way to execute the given query.
• The optimization process considers factors such as indexing, joins, CPU time, number of tuples to be scanned, disk access time, number of operations, and other optimization mechanisms for the query.
• In short, query optimization tells the DBMS what the best execution plan should be.
• The main goal of this step is to retrieve the required data with minimal cost in terms of resources and time.
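Most DBMSs expose the chosen execution plan through an EXPLAIN-style statement; the exact syntax and output vary by product. A PostgreSQL-flavoured sketch for the example query, assuming the employee table from earlier (the index name is illustrative):

-- Create an index the optimizer may choose to use.
CREATE INDEX idx_employee_salary ON employee (salary);

-- Ask the DBMS for its execution plan instead of the query result.
EXPLAIN
SELECT emp_name
FROM   employee
WHERE  salary > 10000;
-- Depending on table statistics and estimated cost, the plan may show a
-- sequential scan or an index scan on idx_employee_salary.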
Evaluation
• After finding the best execution plan, the DBMS starts executing the optimized query and returns the results from the database.
• In this step, the DBMS can perform operations on the data: selecting data, inserting data, updating data, and so on.
• Once everything is completed, the DBMS returns the result after the evaluation step.
The Dirty Read Problem
Consider two transactions A and B performing read/write operations on a data item DT in the database DB. The current value of DT is 1000. The following table shows the read/write operations in the A and B transactions.
Time   A                B
T1     READ(DT)         ------
T2     DT = DT + 500    ------
T3     WRITE(DT)        ------
T4     ------           READ(DT)
T5     ------           COMMIT
T6     ROLLBACK         ------
Transaction A reads the value of DT as 1000 and modifies it to 1500, which is written but not yet committed and so sits in the temporary buffer. Transaction B reads DT as 1500 and commits, acting on this uncommitted value. Then a server error occurs in transaction A and it rolls back to the initial value, i.e., 1000, so transaction B has read and used a value that was never permanently part of the database. This is the dirty read problem.
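The schedule above can be reproduced as two SQL sessions. This is only a sketch: the syntax shown is MySQL-flavoured, it assumes a DBMS that actually honours the READ UNCOMMITTED isolation level, and dt_table(dt) is a hypothetical single-row table initially holding 1000.

-- Session A
START TRANSACTION;
UPDATE dt_table SET dt = dt + 500;   -- DT becomes 1500, but is not yet committed

-- Session B
SET TRANSACTION ISOLATION LEVEL READ UNCOMMITTED;
START TRANSACTION;
SELECT dt FROM dt_table;             -- returns the uncommitted value 1500 (dirty read)
COMMIT;

-- Session A
ROLLBACK;                            -- DT is back to 1000; B acted on a value that never existed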
The Unrepeatable Read Problem
Consider two transactions A and B performing read/write operations on a data item DT in the database DB. The current value of DT is 1000. The following table shows the read/write operations in the A and B transactions.
Time   A                B
T1     READ(DT)         ------
T2     ------           READ(DT)
T3     DT = DT + 500    ------
T4     WRITE(DT)        ------
T5     ------           READ(DT)
Transactions A and B initially read the value of DT as 1000. Transaction A modifies the value of DT from 1000 to 1500, and then transaction B reads the value again and finds it to be 1500. Transaction B thus finds two different values of DT in its two read operations.
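The unrepeatable read can be prevented by running B at a stricter isolation level. The sketch below is MySQL-flavoured (InnoDB's REPEATABLE READ gives B a consistent snapshot) and reuses the hypothetical dt_table(dt) from the previous example:

-- Session B
SET TRANSACTION ISOLATION LEVEL REPEATABLE READ;
START TRANSACTION;
SELECT dt FROM dt_table;             -- returns 1000

-- Session A
START TRANSACTION;
UPDATE dt_table SET dt = dt + 500;
COMMIT;                              -- DT is now 1500 in the database

-- Session B
SELECT dt FROM dt_table;             -- still returns 1000: the read is repeatable
COMMIT;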
Phantom Read Problem
Consider two transactions A and B performing read/write operations on a data item DT in the database DB. The current value of DT is 1000. The following table shows the read/write operations in the A and B transactions.
Time   A                B
T1     READ(DT)         ------
T2     ------           READ(DT)
T3     DELETE(DT)       ------
T4     ------           READ(DT)
Transaction B initially reads the value of DT as 1000. Transaction A then deletes the data item DT from the database DB, so when transaction B reads the value again it gets an error saying that the data item DT does not exist in the database DB.
Lost Update Problem
Consider two transactions A and B performing read/write operations on a data item DT in the database DB. The current value of DT is 1000. The following table shows the read/write operations in the A and B transactions.
Time   A                B
T1     READ(DT)         ------
T2     DT = DT + 500    ------
T3     WRITE(DT)        ------
T4     ------           DT = DT + 300
T5     ------           WRITE(DT)
T6     READ(DT)         ------
Transaction A initially reads the value of DT as 1000 and modifies it to 1500; transaction B then modifies the value to 1800 before A completes. When transaction A reads DT again, it finds 1800 instead of the 1500 it wrote, so the update made by transaction A has effectively been lost (overwritten before its transaction finished).
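One common way to avoid the lost update is explicit row locking with SELECT ... FOR UPDATE, so that B cannot touch DT until A has finished. The sketch below uses the hypothetical dt_table(dt); the syntax is supported by PostgreSQL and MySQL, among others.

-- Session A
START TRANSACTION;
SELECT dt FROM dt_table FOR UPDATE;  -- A takes an exclusive row lock on DT
UPDATE dt_table SET dt = dt + 500;

-- Session B
START TRANSACTION;
SELECT dt FROM dt_table FOR UPDATE;  -- blocks until A commits or rolls back

-- Session A
COMMIT;                              -- releases the lock; B's SELECT now returns 1500

-- Session B
UPDATE dt_table SET dt = dt + 300;   -- applied on top of A's update, giving 1800
COMMIT;                              -- neither update is silently lost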
Lock-Based Protocols
• To attain consistency, isolation between transactions is the most important tool. Isolation is achieved by restricting which read/write operations other transactions may perform on a data item while it is in use; this is known as locking. Through lock-based protocols, the desired operations are allowed to proceed while conflicting (undesired) operations are blocked by locks.
• Shared Lock (S): A lock that disables write operations but allows read operations on a data item is known as a shared lock. Shared locks are also known as read-only locks and are represented by 'S'. Several transactions can hold a shared lock on the same data item at the same time.
• Exclusive Lock (X): A lock that allows both read and write operations on a data item is known as an exclusive lock. Only one transaction at a time can hold an exclusive lock on a given data item, so it cannot be granted on the same data item twice concurrently. Exclusive locks are represented by 'X'.
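The two lock modes map onto SQL roughly as sketched below. The LOCK TABLE syntax shown is PostgreSQL-style, and dt_table is the same hypothetical table; other DBMSs expose shared/exclusive locking through different statements.

BEGIN;
LOCK TABLE dt_table IN SHARE MODE;             -- shared (S): other transactions may read,
SELECT dt FROM dt_table;                       -- but concurrent writes are blocked
COMMIT;

BEGIN;
LOCK TABLE dt_table IN ACCESS EXCLUSIVE MODE;  -- exclusive (X): no other transaction may
UPDATE dt_table SET dt = dt + 500;             -- read or write the table until COMMIT
COMMIT;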
Time-based Protocols
• According to this protocol, every transaction has a timestamp attached to it. The timestamp is based on the time at which the transaction entered the system. Read and write timestamps are also associated with every data item, recording the time at which the latest read and write operations on it were performed, respectively.
• The timestamp ordering protocol uses the timestamp values of the transactions to resolve conflicting pairs of operations, thus ensuring serializability among transactions.
Let's assume there are two transactions T1 and T2. Suppose transaction T1 entered the system at 7:00 and transaction T2 entered at 7:09. T1 has the higher priority, so it executes first, as it entered the system first.
The timestamp ordering protocol also maintains the timestamp of the last 'read' and 'write' operation on each data item.
Basic Timestamp Ordering protocol works as follows:
1. Check the following conditions whenever a transaction Ti issues a Read(X) operation:
• If W_TS(X) > TS(Ti), then the operation is rejected and Ti is rolled back.
• If W_TS(X) <= TS(Ti), then the operation is executed and R_TS(X) is updated to max(R_TS(X), TS(Ti)).
2. Check the following conditions whenever a transaction Ti issues a Write(X) operation:
• If R_TS(X) > TS(Ti), then the operation is rejected and Ti is rolled back.
• If W_TS(X) > TS(Ti), then the operation is rejected and Ti is rolled back.
• Otherwise, the operation is executed and W_TS(X) is updated to TS(Ti).
Where,
TS(Ti) denotes the timestamp of the transaction Ti.
R_TS(X) denotes the read timestamp of data item X.
W_TS(X) denotes the write timestamp of data item X.