SQL Data Warehousing

UNIT-3

OPERATION AND MANAGEMENT


• Client / Server and Databases
• Distributed Databases
Client / Server and Databases
• A client-server architecture for a DBMS is one in which data is stored on a central server, and clients connect to
that server in order to access and manipulate the data.

• Clients are the components that request services.

• The server is the component that provides data access or the requested services to the clients.
• One of the main benefits of a client-server architecture is that it is more scalable than a centralized
architecture. As the number of clients and/or the amount of data increases, the server can be
upgraded, or additional servers can be added to handle the load. This allows the system to continue
functioning smoothly even as it grows in size.

• Another advantage of a client-server architecture is that it is more fault-tolerant than a centralized
architecture. If a single server goes down, other servers can take over its responsibilities, and clients
can still access the data. This makes the system less likely to experience downtime, which is a
crucial factor in many business environments.
The Essential Client-server Architecture Components
1. Workstations (clients):
• A workstation is the user's system, sometimes also called the client computer. The client computer
generally acts as the front end and provides an interface to the user.
• For example: when visiting any website, you request the webpage from its domain, so you are acting as a client.

2. Servers:
• A server generally acts as the backend and provides a standardized interface to different workstations so
that clients need not be aware of the specifics of the system (i.e., the software and hardware) that is providing
the service.
• For example: the client asks for the webpage, then the server responds with the webpage to the client.

3. Networking Devices:
• Networking devices act as the medium that connects the different workstations or clients to the
server. The networking devices used in a client-server architecture have different purposes and
properties:
• Bridges are used to isolate network segments. Hubs are used to connect various workstations to a server.
Steps involved in the client-server model are:
• First, the client sends their request via a network-enabled device.
• Then, the network server accepts and processes the user request.
• Finally, the server delivers the response to the client.
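A minimal sketch of this request/response cycle in Python, using a plain TCP socket; the address, port, and message format are assumptions made for the illustration, not anything prescribed by these notes.

import socket
import threading

HOST, PORT = "127.0.0.1", 5050          # hypothetical local server address
ready = threading.Event()

def run_server():
    # The server listens, accepts the client's request, and sends a response.
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as srv:
        srv.bind((HOST, PORT))
        srv.listen(1)
        ready.set()                      # signal that the server is listening
        conn, _ = srv.accept()
        with conn:
            request = conn.recv(1024).decode()
            conn.sendall(f"response to: {request}".encode())

threading.Thread(target=run_server, daemon=True).start()
ready.wait()

# The client sends its request over the network and waits for the reply.
with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as cli:
    cli.connect((HOST, PORT))
    cli.sendall(b"GET /webpage")
    print(cli.recv(1024).decode())       # prints: response to: GET /webpage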
Single-tier Architecture

• In this architecture, the database is directly available to
the user. It means the user can directly access the DBMS and
use it.
• Any changes made here are applied directly to the
database itself. It doesn't provide a handy tool for end
users.
• The 1-Tier architecture is used for the development of
local applications, where programmers can directly
communicate with the database for a quick response.
Two Tier Architecture

• The 2-Tier architecture is essentially the basic client-server model.
• In the two-tier architecture, applications on the client end can
directly communicate with the database on the server side. For
this interaction, APIs such as ODBC and JDBC are used (a sketch follows below).
• The user interfaces and application programs run on the
client side.
• The server side is responsible for providing functionality such as
query processing and transaction management. To
communicate with the DBMS, the client-side application
establishes a connection with the server side.
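A minimal two-tier sketch in Python. The notes name ODBC and JDBC as the client-side APIs; here the built-in sqlite3 module stands in for such a driver, and the employee table and its rows are made up for the example.

import sqlite3

# The client-side application opens a connection to the database "server".
conn = sqlite3.connect(":memory:")
cur = conn.cursor()
cur.execute("CREATE TABLE employee (emp_name TEXT, salary INTEGER)")
cur.executemany("INSERT INTO employee VALUES (?, ?)",
                [("Asha", 12000), ("Ravi", 9000)])

# The client sends SQL straight to the DBMS, which handles query processing
# and transaction management on the server side.
for (name,) in cur.execute("SELECT emp_name FROM employee WHERE salary > 10000"):
    print(name)
conn.close()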
Three Tier Architecture
• The 3-Tier architecture contains another layer between the client and server. In this architecture,
the client can't directly communicate with the server.

• The application on the client end interacts with an application server which further
communicates with the database system.

• The end-user has no idea about the existence of the database beyond the application server. The
database also has no idea about any other user beyond the application. The 3-Tier architecture is
used for large web applications.
• In DBMS, the 3-tier architecture is a client-server architecture that separates the user
interface, application processing, and data management into three distinct tiers or layers.

• Presentation Tier: The presentation tier is the user interface or client layer of the
application. It is responsible for presenting data to the user and receiving input from the
user. This tier can be a web browser, mobile app, or desktop application.

• Application Tier: The application tier is the middle layer of the 3-tier architecture. It is
responsible for processing and managing the business logic of the application. This tier
communicates with the presentation tier to receive user input and communicates with the
data management tier to retrieve or store data. This tier may include application servers,
web servers, or APIs.

• Data Management Tier: The data management tier is the bottom layer of the 3-tier
architecture. It is responsible for managing and storing data. This tier can include
databases, data warehouses, or data lakes. The data management tier communicates with
the application tier to receive or store data.
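A minimal sketch (an assumed structure, not taken from these notes) of how the three tiers can be separated in code: the presentation tier talks only to the application tier, which alone talks to the data management tier.

import sqlite3

# --- Data management tier: owns the database ------------------------------
_db = sqlite3.connect(":memory:")
_db.execute("CREATE TABLE employee (emp_name TEXT, salary INTEGER)")
_db.executemany("INSERT INTO employee VALUES (?, ?)",
                [("Asha", 12000), ("Ravi", 9000)])

def data_tier_fetch_high_earners(min_salary):
    return _db.execute(
        "SELECT emp_name FROM employee WHERE salary > ?", (min_salary,)
    ).fetchall()

# --- Application tier: business logic, no SQL exposed to the client -------
def application_tier_high_earner_report():
    rows = data_tier_fetch_high_earners(10000)     # business rule lives here
    return [name for (name,) in rows]

# --- Presentation tier: only knows about the application tier -------------
def presentation_tier():
    print("High earners:", application_tier_high_earner_report())

presentation_tier()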
Data Warehousing
What is a Data Warehouse?
• A Data Warehouse consists of data from multiple heterogeneous data sources and is used for
analytical reporting and decision making.

• Data Warehouse is a central place where data is stored from different data sources and
applications.

• The goal is to produce statistical results that may help in decision making.

• For example, a college might want to quickly see different results, such as how the placement of CS
students has improved over the last 10 years in terms of salaries, counts, etc.
• A Data Warehouse is used for reporting and analyzing
information and is used to store both historical and
current data.
• The data in a DW system is used for analytical reporting,
which is later used by business analysts, sales
managers, or knowledge workers for decision-making.
• Data flows from multiple heterogeneous data sources
into the Data Warehouse.
• Common data sources for a data warehouse include:
• Operational databases
• Flat Files (xls, csv, txt files)
• Data in the data warehouse is accessed by BI (Business
Intelligence) users for analytical reporting, data
mining, and analysis.
• This is used for decision-making by business users,
sales managers, and analysts to define future strategy.
Applications of Data Warehousing
• Data Warehousing can be applied anywhere where we have a huge amount of data and we want to
see statistical results that help in decision making.
• Social Media Websites: Social networking websites like Facebook, Twitter, LinkedIn, etc. are
based on analyzing large data sets. These sites gather data related to members, groups, locations,
etc., and store it in a single central repository. Because of the large amount of data involved, a data
warehouse is needed to implement this.
• Banking: Most banks these days use warehouses to analyze the spending patterns of
account/cardholders. They use this to provide them with special offers, deals, etc.
• Government: Governments use data warehouses to store and analyze tax payments, which are
used to detect tax evasion.

• There can be many more applications in different sectors like E-Commerce, telecommunications,
Transportation Services, Marketing and Distribution, Healthcare, and Retail.
ADVANTAGES:
Improved data quality: Data warehousing can help improve data quality by
consolidating data from various sources into a single, consistent view.

Faster access to information: Data warehousing enables quick access to information,
allowing businesses to make better, more informed decisions faster.

Better decision-making: With a data warehouse, businesses can analyze data and gain
insights into trends and patterns that can inform better decision-making.

Reduced data redundancy: By consolidating data from various sources, data
warehousing can reduce data redundancy and inconsistencies.

Scalability: Data warehousing is highly scalable and can handle large amounts of data
from different sources.
DISADVANTAGES:
Cost: Building a data warehouse can be expensive, requiring significant
investments in hardware, software, and personnel.

Complexity: Data warehousing can be complex, and businesses may need to
hire specialized personnel to manage the system.

Time-consuming: Building a data warehouse can take a significant amount of
time, requiring businesses to be patient and committed to the process.

Data integration challenges: Data from different sources can be challenging to
integrate, requiring significant effort to ensure consistency and accuracy.

Data security: Data warehousing can pose data security risks, and businesses
must take measures to protect sensitive data from unauthorized access or
breaches.
Data Warehouse Architecture
• The architecture can be divided into three major components - the
data source layer, the data warehouse layer, and the data access layer.
• Data Source Layer - This layer consists of the systems that provide data to the data
warehouse. These systems can include operational databases, external data sources,
and other systems that generate or capture data. In this layer, data is extracted from
various sources and transformed into a format that can be loaded into the data
warehouse.
• Data Warehouse Layer - This layer is the central repository of data that has been
collected from the data source layer. This is where the data is stored in a structured
format, making it easier to analyze and query. The data warehouse layer is divided
into two components - the staging area and the data warehouse database. The
staging area is used to store data before it is transformed and loaded into the data
warehouse database.
• Data Access/Analysis Layer - This layer consists of the tools and interfaces that
allow users to access and analyze the data stored in the data warehouse. This layer
includes tools such as Business Intelligence software, SQL clients, and spreadsheets.
The data access layer provides users with various options to access the data
warehouse, depending on their specific needs.
Functions of Data Warehouse Tools and Utilities
Data warehouse tools and utilities are designed to perform various functions that help
manage and analyze data stored in a data warehouse.
• Data Extraction - This involves extracting data from various sources, such as
transactional databases, operational systems, and external data sources. The data is
then cleaned, transformed, and loaded into the data warehouse.
• Data Cleaning - This involves identifying and correcting errors or inconsistencies in
the data. Data cleaning ensures that the data is accurate and reliable for analysis.
• Data Transformation - This involves converting the data into a format that is suitable
for analysis. Data transformation may involve merging data from multiple sources,
reformatting data, or creating new variables.
• Data Integration - This involves integrating data from multiple sources into a single
data warehouse. This allows for a more comprehensive view of an organization's
data, which can improve decision-making.
• Data Storage - Data warehouse tools and utilities provide various storage options,
such as relational databases, columnar databases, or cloud-based storage. The choice
of storage depends on the size and complexity of the data and the organization's
needs.
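A compact sketch of the first few functions above (extraction from a flat file, cleaning/transformation, and loading into storage). The CSV layout, table name, and use of sqlite3 as the warehouse store are assumptions made only for illustration.

import csv, io, sqlite3

# Extract: a stand-in flat file, the kind listed among common data sources.
source_csv = io.StringIO("dept,salary\nCS, 12000\ncs,9000\nCS,\n")
rows = list(csv.DictReader(source_csv))

# Clean + transform: drop rows with missing salary, normalise department names.
clean = [{"dept": r["dept"].strip().upper(), "salary": int(r["salary"])}
         for r in rows if r["salary"].strip()]

# Load into the warehouse (sqlite3 stands in for the warehouse database).
dw = sqlite3.connect(":memory:")
dw.execute("CREATE TABLE fact_salary (dept TEXT, salary INTEGER)")
dw.executemany("INSERT INTO fact_salary VALUES (:dept, :salary)", clean)

# Analytical query over the integrated data.
print(dw.execute("SELECT dept, AVG(salary) FROM fact_salary GROUP BY dept").fetchall())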
Query Processing
What is Query Processing ?
• Query Processing includes translating high-level queries into low-level expressions that can
be used at the physical level of the file system, optimizing the query, and actually executing it to get
the result.
Parsing and Translation
The first step in query processing is Parsing and Translation.
• The fired queries undergo lexical, syntactic, and semantic analysis.
• Essentially, the query gets broken down into different tokens and white spaces are removed
along with the comments (Lexical Analysis).
• In the next step, the query is checked for correctness, both syntactically and semantically. The
query processor first checks whether the rules of SQL have been correctly followed
(Syntactic Analysis).
• Finally, the query processor checks whether the meaning of the query is valid: are the
table(s) mentioned in the query present in the DB? Are the column(s) referred to from
those table(s) actually present in them? (Semantic Analysis)

Once the above-mentioned checks pass, the flow moves on to converting all the tokens into relational
expressions, graphs, and trees. This makes further processing of the query easier.
Let's consider the same query (mentioned below as well) as an example and see how the
flow works.
SELECT emp_name FROM employee WHERE salary>10000;

• The above query would be divided into the following tokens: SELECT, emp_name, FROM,
employee, WHERE, salary, >, 10000.

The tokens (and hence the query) are then validated:

• The name of the queried table is looked up in the data dictionary.
• The names of the columns mentioned in the tokens (emp_name and salary) are validated for
existence.
• The types of the column(s) being compared have to match (salary and the value
10000 should have the same data type).
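A rough sketch (not a real DBMS parser) of the lexical step and the data-dictionary lookups described above; the regular expression and the toy catalog are assumptions made for the example.

import re

query = "SELECT emp_name FROM employee WHERE salary>10000;"

# Lexical analysis: split the query into tokens, discarding whitespace.
tokens = re.findall(r"[A-Za-z_][A-Za-z_0-9]*|\d+|[><=;,*]", query)
print(tokens)   # ['SELECT', 'emp_name', 'FROM', 'employee', 'WHERE', 'salary', '>', '10000', ';']

# A toy "data dictionary" used for the semantic checks: does the table exist,
# and are the referenced columns present in it?
catalog = {"employee": {"emp_name": "TEXT", "salary": "INTEGER"}}
assert "employee" in catalog
assert {"emp_name", "salary"} <= catalog["employee"].keys()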
Optimization
After doing query parsing,

• DBMS starts finding the most efficient way to execute the given query.

• The optimization process considers factors such as indexing, joins, CPU time, the number of tuples to be scanned,
disk access time, the number of operations, and other optimization mechanisms for the query.

• These help in determining the most efficient query execution plan.

• So, query optimization tells the DBMS what the best execution plan should be.

• The main goal of this step is to retrieve the required data with minimal cost in terms of resources and
time.
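A small illustration of inspecting the execution plan the optimizer chooses. SQLite's EXPLAIN QUERY PLAN statement is used here as a concrete example; other engines expose the same idea through their own EXPLAIN variants.

import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE employee (emp_name TEXT, salary INTEGER)")

# Without an index the optimizer has to scan every tuple of the table...
print(conn.execute(
    "EXPLAIN QUERY PLAN SELECT emp_name FROM employee WHERE salary > 10000"
).fetchall())

# ...after adding an index on salary, the chosen plan switches to an index
# search, reducing the number of tuples scanned and the disk access cost.
conn.execute("CREATE INDEX idx_salary ON employee(salary)")
print(conn.execute(
    "EXPLAIN QUERY PLAN SELECT emp_name FROM employee WHERE salary > 10000"
).fetchall())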
Evaluation
• After finding the best execution plan, the DBMS starts executing the optimized query and produces
the results from the database.

• In this step, the DBMS can perform operations on the data: selecting data, inserting rows, updating
data, and so on.

• Once everything is completed, the DBMS returns the result after the evaluation step.

This result is shown in a suitable format.


Concurrency Management
Concurrency Management in DBMS is the procedure of managing simultaneous transactions while ensuring
their atomicity, isolation, consistency, and serializability.
Several problems that arise when numerous transactions execute simultaneously in a random manner
are referred to as concurrency control problems.
• The dirty read problem occurs when a transaction reads the data that has been updated by another
transaction that is still uncommitted.
• The unrepeatable read problem occurs when two or more different values of the same data are
read during the read operations in the same transaction.
• The phantom read problem occurs when data that was read earlier is deleted by another transaction,
so that a later read operation on it produces an error.
• The Lost Update problem arises when two different transactions update the same data and one
update overwrites the other.
To maintain consistency and serializability during the execution of concurrent transactions some rules
are made. These rules are known as concurrency control protocols.
The Dirty Read Problem
Consider two transactions A and B performing read/write operations on a data DT in the database
DB. The current value of DT is 1000: The following table shows the read/write operations in A and B
transactions.

Time A B
T1 READ(DT) ------
T2 DT=DT+500 ------
T3 WRITE(DT) ------
T4 ------ READ(DT)
T5 ------ COMMIT
T6 ROLLBACK ------

Transaction A reads the value of data DT as 1000 and modifies it to 1500, which gets stored in the temporary
buffer. Transaction B reads the data DT as 1500 and commits it, and the value of DT permanently gets
changed to 1500 in the database DB. Then a server error occurs in transaction A and it rolls back
to its initial value, i.e., 1000. Transaction B has read (and committed) a value that never became permanent, which is the dirty read problem.
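A small simulation (made up for illustration) of this schedule: A's modified value sits only in its temporary buffer, B reads and commits it, and A's rollback then leaves B holding a value that never became permanent.

database = {"DT": 1000}                 # committed state of the database DB
buffer_a = {}                           # transaction A's temporary buffer

# T1-T3: A reads DT, adds 500, and writes the new value to its buffer.
buffer_a["DT"] = database["DT"] + 500   # uncommitted value 1500

# T4-T5: B reads the uncommitted value and commits based on it (dirty read).
b_value = buffer_a.get("DT", database["DT"])   # B sees 1500
print("B committed using DT =", b_value)

# T6: a server error forces A to roll back; its buffered update is discarded.
buffer_a.clear()
print("DT after A's rollback =", database["DT"])   # still 1000: B used data that never existed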
The Unrepeatable read problem
Consider two transactions A and B performing read/write operations on a data DT in the database
DB. The current value of DT is 1000: The following table shows the read/write operations in A and B
transactions.

Time A B
T1 READ(DT) ------
T2 ------ READ(DT)
T3 DT=DT+500 ------
T4 WRITE(DT) ------
T5 ------ READ(DT)

Transactions A and B initially read the value of DT as 1000. Transaction A modifies the value of DT from 1000 to
1500, and then transaction B reads the value again and finds it to be 1500. Transaction B thus finds two different
values of DT in its two read operations.
Phantom Read Problem
Consider two transactions A and B performing read/write operations on a data DT in the database
DB. The current value of DT is 1000: The following table shows the read/write operations in A and B
transactions.
Time A B
T1 READ(DT) ------
T2 ------ READ(DT)
T3 DELETE(DT) ------
T4 ------ READ(DT)

Transaction B initially reads the value of DT as 1000. Transaction A deletes the data DT from the
database DB and then again transaction B reads the value and finds an error saying the data DT does
not exist in the database DB.
Lost Update Problem
Consider two transactions A and B performing read/write operations on a data DT in the database
DB. The current value of DT is 1000: The following table shows the read/write operations in A and B
transactions.
Time A B
T1 READ(DT) ------
T2 ------ READ(DT)
T3 DT=DT+500 ------
T4 WRITE(DT) ------
T5 ------ DT=DT+300
T6 ------ WRITE(DT)
Transactions A and B both initially read the value of DT as 1000. Transaction A modifies its copy from
1000 to 1500 and writes it back. Transaction B then modifies the value using its earlier (stale) read, computing
1000 + 300 = 1300, and writes it, overwriting A's write; the update done by transaction A has been lost.
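A small simulation (made up for illustration) of this schedule: both transactions work on private copies of DT, and B's later write silently overwrites the update A already made.

database = {"DT": 1000}

# T1-T2: both transactions read DT into their own local copies.
a_local = database["DT"]        # A reads 1000
b_local = database["DT"]        # B reads 1000

# T3-T4: A adds 500 and writes the result back.
a_local += 500
database["DT"] = a_local        # DT is now 1500

# T5-T6: B adds 300 to its stale copy and writes, overwriting A's update.
b_local += 300
database["DT"] = b_local        # DT is now 1300: the +500 from A is lost

print(database["DT"])           # 1300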
Lock-Based Protocols
• To attain consistency, isolation between the transactions is the most important tool.
Isolation is achieved by preventing a transaction from performing a read/write operation on a data
item while another transaction is using it. This is known as locking an operation in a transaction.
Through lock-based protocols, desired operations are allowed to proceed while conflicting
(undesired) operations are blocked.

• There are two kinds of locks used in Lock-based protocols:

• Shared Lock(S): The locks which disable the write operations but allow read operations
for any data in a transaction are known as shared locks. They are also known as read-only
locks and are represented by 'S'.

• Exclusive Lock(X): The locks which allow both the read and write operations for any data
in a transaction are known as exclusive locks. An exclusive lock can be held by only one
transaction at a time on a given data item. They are represented by 'X'.
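A toy lock manager (an illustrative sketch, not a real DBMS component) that applies the rule above: many transactions may hold a shared lock on the same item, while an exclusive lock is granted only when no other transaction holds any lock on it.

class LockManager:
    def __init__(self):
        self.locks = {}   # item -> {"mode": "S" or "X", "holders": set of transaction ids}

    def acquire(self, txn, item, mode):
        entry = self.locks.get(item)
        if entry is None:                               # no lock held: grant
            self.locks[item] = {"mode": mode, "holders": {txn}}
            return True
        if mode == "S" and entry["mode"] == "S":        # S is compatible with S
            entry["holders"].add(txn)
            return True
        if entry["holders"] == {txn}:                   # sole holder may upgrade to X
            entry["mode"] = mode
            return True
        return False                                    # conflicting request must wait

    def release(self, txn, item):
        entry = self.locks.get(item)
        if entry and txn in entry["holders"]:
            entry["holders"].discard(txn)
            if not entry["holders"]:
                del self.locks[item]

lm = LockManager()
print(lm.acquire("A", "DT", "S"))   # True: A reads DT
print(lm.acquire("B", "DT", "S"))   # True: shared locks coexist
print(lm.acquire("B", "DT", "X"))   # False: A still holds a shared lock on DT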
Timestamp-based Protocols
• According to this protocol, every transaction has a timestamp attached to it. The
timestamp is based on the time at which the transaction entered the system.
There are also read and write timestamps associated with every data item, which record
the times at which the latest read and write operations on it were performed, respectively.

• The timestamp ordering protocol uses timestamp values of the transactions to resolve
the conflicting pairs of operations. Thus, ensuring serializability among transactions.
Let's assume there are two transactions T1 and T2. Suppose transaction T1 entered
the system at 7:00 and transaction T2 entered the system at 7:09. T1 has the
higher priority, so it is given precedence because it entered the system first.
The timestamp ordering protocol also maintains the timestamp of last 'read' and 'write'
operation on a data.
Basic Timestamp ordering protocol works as follows:
1. Check the following condition whenever a transaction Ti issues a Read (X) operation:
• If W_TS(X) > TS(Ti) then the operation is rejected.
• If W_TS(X) <= TS(Ti) then the operation is executed.
• R_TS(X) is updated to max(R_TS(X), TS(Ti)).
2. Check the following conditions whenever a transaction Ti issues a Write(X) operation:
• If W_TS(X) > TS(Ti) or R_TS(X) > TS(Ti) then the operation is rejected and Ti is rolled back.
• Otherwise, the operation is executed and W_TS(X) is set to TS(Ti).
Where,
TS(Ti) denotes the timestamp of the transaction Ti.
R_TS(X) denotes the Read time-stamp of data-item X.
W_TS(X) denotes the Write time-stamp of data-item X.
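A small sketch of the Basic TO checks above; timestamps are plain integers, each data item keeps its latest read/write timestamps, and a rejected operation stands for rolling back and restarting the issuing transaction.

class DataItem:
    def __init__(self):
        self.r_ts = 0    # R_TS(X): largest timestamp of any transaction that read X
        self.w_ts = 0    # W_TS(X): largest timestamp of any transaction that wrote X

def read(item, ts):
    if item.w_ts > ts:                    # a younger transaction already wrote X
        return False                      # reject: Ti must roll back
    item.r_ts = max(item.r_ts, ts)        # record the read
    return True

def write(item, ts):
    if item.w_ts > ts or item.r_ts > ts:  # a younger transaction read or wrote X
        return False                      # reject: Ti must roll back
    item.w_ts = ts                        # record the write
    return True

X = DataItem()
print(read(X, ts=5))    # True
print(write(X, ts=3))   # False: an older write arrives after a younger read of X
print(write(X, ts=7))   # True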
