Data Warehousing
1. Centralized Architecture
In a centralized DBMS architecture, the entire database resides on a single central server, and all users access it through that server.
Characteristics:
• The database resides in a single central location.
• Users access the database via terminals connected to the central system.
• Centralized control over data ensures consistency and integrity.
• Simplified administration and maintenance.
Advantages:
• Data Consistency: Since the database is centralized, it reduces the risk of data
redundancy and inconsistency.
• Simplified Management: All operations like backup, recovery, and updates are easier
to manage.
• Security: Central control makes it easier to enforce strict access controls and security
measures.
Disadvantages:
• Single Point of Failure: If the central server fails, the entire system becomes
unavailable.
• Scalability Issues: Handling a large number of users can overwhelm the server.
• Network Dependence: Performance depends heavily on the quality of the network
connection.
Use Cases:
• Small organizations with limited users.
• Systems where high-level security and consistency are critical (e.g., banking systems).
2. Client-Server Architecture
In a client-server DBMS architecture, the system is divided into two parts: the client and the
server. The server hosts the database, and clients access it over a network.
Characteristics:
• The server is responsible for data storage, query processing, and transaction
management.
• The client provides the user interface and sends requests to the server for data
processing.
• Communication between client and server happens over a network.
Advantages:
• Scalability: New clients can be added without significant changes to the server.
• Distributed Workload: Processing is shared between client and server, improving
performance.
• Flexibility: Clients can run on different platforms, and multiple servers can be used.
Disadvantages:
• Complexity: Setting up and managing client-server systems is more complex than
centralized systems.
• Network Dependency: The system's performance depends on network bandwidth
and reliability.
• Data Synchronization: Maintaining consistency across distributed systems can be
challenging.
Use Cases:
• Large organizations with multiple users accessing the database concurrently.
• Systems requiring distributed computing capabilities (e.g., web applications,
enterprise systems).
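The request/response flow above can be sketched in a few lines. This is a toy illustration, not a real DBMS wire protocol: the server holds an ad-hoc in-memory key-value "database" and answers one lookup request from a client over a local socket.

```python
# A minimal sketch of client-server interaction: the server stores the data
# and processes requests; the client only sends a request over the network
# and displays the reply. The text protocol and data are illustrative.
import socket
import threading

DATA = {"emp:1": "Alice", "emp:2": "Bob"}  # server-side data store

def serve_once(server_sock):
    # Server: accept one client, answer one lookup request, then close.
    conn, _ = server_sock.accept()
    with conn:
        key = conn.recv(1024).decode()
        conn.sendall(DATA.get(key, "NOT FOUND").encode())

server = socket.socket()
server.bind(("127.0.0.1", 0))        # OS picks a free port
server.listen(1)
port = server.getsockname()[1]
threading.Thread(target=serve_once, args=(server,), daemon=True).start()

# Client: sends a request and receives the processed result.
client = socket.socket()
client.connect(("127.0.0.1", port))
client.sendall(b"emp:2")
reply = client.recv(1024).decode()
client.close()
print(reply)  # Bob
```

Note how the server does all data storage and query processing, while the client only formats the request and shows the answer, exactly the division of labour described above.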
Q.2. Explain the term Design of Parallel Systems.
Parallel systems are computing systems designed to perform multiple computations
simultaneously, enhancing performance, scalability, and reliability. In the context of
databases, parallel database systems aim to divide large tasks (like querying or data
processing) into smaller sub-tasks executed concurrently across multiple processors or nodes.
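The divide-and-execute idea can be sketched with intra-query parallelism: a single large aggregation is range-partitioned across workers, each worker computes a partial result, and the partials are merged. The data and worker count below are illustrative.

```python
# A sketch of partitioned parallelism: one big aggregation split into
# sub-tasks that run concurrently, then merged into the final answer.
from concurrent.futures import ThreadPoolExecutor

rows = list(range(1, 1001))          # pretend these are table rows
NUM_WORKERS = 4

def partial_sum(partition):
    # Each worker aggregates only its own partition of the data.
    return sum(partition)

# Partition the rows across the workers.
chunks = [rows[i::NUM_WORKERS] for i in range(NUM_WORKERS)]

with ThreadPoolExecutor(max_workers=NUM_WORKERS) as pool:
    partials = list(pool.map(partial_sum, chunks))

total = sum(partials)                # final merge step
print(total)  # 500500
```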
2. Concurrency Control
Concurrency control ensures that multiple transactions can execute simultaneously without
compromising the consistency and integrity of the database.
Goals of Concurrency Control:
• Data Consistency:
o Ensure that the database remains consistent after simultaneous transactions.
• Isolation:
o Prevent conflicts by isolating transactions.
• Avoidance of Anomalies:
o Prevent problems like dirty reads, lost updates, and uncommitted data
dependencies.
Concurrency Issues:
1. Dirty Reads:
o A transaction reads uncommitted changes made by another transaction.
2. Lost Updates:
o Two transactions overwrite each other's updates.
3. Non-Repeatable Reads:
o Data read in one transaction changes when read again in the same transaction.
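The lost-update anomaly above can be made concrete with a hand-interleaved schedule. This is a deterministic sketch, not real concurrent code: the two "transactions" are interleaved on purpose so that both read the balance before either writes.

```python
# A deterministic sketch of a lost update: T1 and T2 both read the same
# initial balance, so T2's write silently overwrites T1's deposit.
balance = 100

t1_read = balance          # T1: read 100
t2_read = balance          # T2: read 100 (T1 has not written yet)

balance = t1_read + 50     # T1: deposit 50 -> writes 150
balance = t2_read + 30     # T2: deposit 30 -> writes 130, clobbering T1

print(balance)             # 130, but the correct serial result is 180
lost_update = (balance != 180)
```

With concurrency control (e.g. locking the row so T2 cannot read until T1 commits), the transactions would execute as if serial and the final balance would be 180.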
Key Challenges
1. Data Localization:
o Determining where the required data resides in the distributed system.
2. Communication Costs:
o Reducing the cost of transferring data between nodes.
3. Join Processing:
o Efficiently joining tables stored across different nodes.
4. Fault Tolerance:
o Handling failures during query execution in distributed environments.
5. Consistency:
o Ensuring consistent query results across all nodes.
2. Data Localization:
o Identify the locations of the data required to execute the sub-queries.
o Use metadata or catalogs maintained at a central node or distributed among
nodes.
3. Query Optimization:
o Generate an optimal query execution plan considering:
▪ Data Distribution: Data replication or partitioning.
▪ Cost Estimation: Minimize communication costs and disk I/O.
o Techniques include:
▪ Heuristic Optimization: Rules to minimize communication (e.g., move
smaller tables to the location of larger ones for joins).
▪ Cost-Based Optimization: Evaluating alternative execution plans using
cost metrics.
4. Query Execution:
o Sub-queries are executed on relevant nodes, and intermediate results are
aggregated.
o Operations like filtering, sorting, or joining are carried out at appropriate nodes
to minimize data transfer.
5. Result Assembly:
o Combine intermediate results into the final result set and return it to the user.
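The steps above can be sketched end to end. In this toy model, each "node" is just a dictionary holding a horizontal partition of a sales table; the coordinator runs the sub-query locally at each node and then assembles the partial results, so only small intermediates cross the "network". Node names and data are illustrative.

```python
# A sketch of distributed query execution: sub-queries run where the data
# lives, and the coordinator assembles the intermediate results.
nodes = {
    "node_us":   [("Laptop", 1200), ("Phone", 800)],
    "node_eu":   [("Laptop", 1100), ("Tablet", 500)],
    "node_asia": [("Phone", 700)],
}

def local_subquery(rows, product):
    # Executed at each node: filter and aggregate locally so only a small
    # intermediate result is transferred (minimizing communication cost).
    return sum(amount for name, amount in rows if name == product)

# Coordinator: ship the sub-query to every node, then assemble the results.
partials = {site: local_subquery(rows, "Laptop") for site, rows in nodes.items()}
total_laptop_sales = sum(partials.values())
print(total_laptop_sales)  # 2300
```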
Use Cases
1. Global Enterprises:
o Companies with geographically distributed data centers.
2. Cloud Databases:
o Systems like Google BigQuery or Amazon Redshift.
3. Data Warehousing:
o Distributed query processing is crucial for handling large-scale analytical
workloads.
Q.5. Explain the term Server System Architectures.
Server system architectures refer to the design and organization of hardware and software
components in a server environment to provide efficient, scalable, and reliable services to
clients. These architectures are critical in database management systems (DBMS) to handle
multiple requests, manage resources, and optimize performance in various environments.
2. Client-Server Architecture
• Description:
o A server manages data and resources while clients handle user interfaces and
requests.
o Communication between client and server occurs via a network.
• Advantages:
o Scalability: More clients can be added without affecting server performance
significantly.
o Centralized control of data and applications.
• Disadvantages:
o Dependency on network reliability.
o Increased complexity compared to centralized systems.
• Use Cases:
o Web applications, enterprise systems, email servers.
1. Location Management
Location management involves tracking and updating the location of a mobile user or device
to ensure efficient delivery of data and services. It is particularly important in cellular
networks, where users frequently move between different cells or areas.
Applications
1. Telecommunications:
o Cellular networks rely heavily on location and handoff management for voice
and data services.
2. Transportation Systems:
o Tracking and communication with moving vehicles (e.g., in logistics and public
transport).
3. IoT Networks:
o Mobile IoT devices require efficient location and handoff management for
continuous operation.
4. Real-Time Applications:
o Video streaming, online gaming, and conferencing depend on seamless
handoff for uninterrupted service.
Q.3. Describe in detail Multidimensional Database.
A Multidimensional Database (MDB) is a type of database optimized for analytical and
business intelligence applications. Unlike traditional relational databases, which store data in
two-dimensional tables, MDBs organize data into a multidimensional structure. This design
allows for the efficient analysis of large datasets from multiple perspectives, making MDBs
particularly useful for Online Analytical Processing (OLAP).
2. ETL Process:
The ETL process is essential for preparing data for the data warehouse. It involves the
following steps:
1. Extract:
o Data is extracted from various heterogeneous data sources (e.g., relational
databases, flat files, external systems).
o This step ensures that the data is gathered without affecting the source
systems.
2. Transform:
o The extracted data is cleaned, formatted, and converted into a suitable format
for analysis.
o Tasks include removing duplicates, handling missing values, data validation,
and applying business rules.
o Data might be aggregated or summarized during this step.
3. Load:
o The transformed data is loaded into the data warehouse.
o Data can be loaded in a batch process or incrementally, depending on the
warehouse’s needs.
o The data is stored in optimized structures like star or snowflake schemas for
fast retrieval.
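The three steps can be sketched against an in-memory SQLite database standing in for the warehouse. The source rows, column names, and cleaning rules below are illustrative assumptions.

```python
# A sketch of Extract-Transform-Load into an in-memory SQLite "warehouse".
import sqlite3

# 1. Extract: raw rows pulled from a source system (a list standing in
#    for a relational source or flat file).
raw = [
    ("2024-01-01", "Laptop", "1200"),
    ("2024-01-01", "Laptop", "1200"),   # duplicate to be removed
    ("2024-01-02", "Phone",  None),     # missing value to be dropped
    ("2024-01-02", "Tablet", "500"),
]

# 2. Transform: drop rows with missing amounts, cast types, de-duplicate.
cleaned = {(d, p, int(a)) for d, p, a in raw if a is not None}

# 3. Load: insert the transformed rows into the warehouse fact table.
dw = sqlite3.connect(":memory:")
dw.execute("CREATE TABLE sales_fact (sale_date TEXT, product TEXT, amount INTEGER)")
dw.executemany("INSERT INTO sales_fact VALUES (?, ?, ?)", sorted(cleaned))
dw.commit()

row_count = dw.execute("SELECT COUNT(*) FROM sales_fact").fetchone()[0]
print(row_count)  # 2 clean rows loaded from 4 raw rows
```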
Q.2. Define data warehouse. Explain any 3 architectural types of DW.
A Data Warehouse (DW) is a centralized repository that stores integrated data from multiple
heterogeneous sources. It is designed for analytical querying and reporting, allowing
businesses to consolidate and analyze data for decision-making. Data in a warehouse is
typically structured for fast query performance and is optimized for reporting, trends analysis,
and business intelligence tasks. Unlike operational databases (OLTP), a data warehouse
supports OLAP (Online Analytical Processing) for querying large volumes of data.
2. Snowflake Schema:
o Structure: The snowflake schema is a variation of the star schema where
dimension tables are normalized. In this schema, each dimension is broken
down into multiple related tables (e.g., a time dimension could be split into
year, quarter, and month tables).
o Characteristics:
▪ Normalized dimension tables, meaning they are broken down into
smaller tables to reduce redundancy.
▪ More complex than the star schema but reduces storage space because
of normalization.
▪ The fact table remains at the center of the schema.
o Advantages:
▪ Less data redundancy due to normalization.
▪ Better for scenarios where dimensions have complex relationships.
o Disadvantages:
▪ More complex queries due to the need to join multiple tables.
▪ Slightly slower query performance compared to the star schema due to
the normalization.
o Example:
▪ Fact Table: Sales.
▪ Dimension Tables: Time (broken down into Year, Quarter, Month),
Product (broken down into Product Category, Product Subcategory),
Customer (split into City, State, Country).
2. Additivity of Facts
The additivity of facts refers to the ability of facts (measures) to be summed across one or
more dimensions in the fact table. It helps in understanding how different types of facts
behave when aggregated.
1. Fully Additive Facts:
o These facts can be summed across all dimensions.
o Example: Sales Amount or Quantity Sold can be summed across time, product,
and region dimensions.
o Usage: Useful for metrics that are inherently cumulative.
2. Semi-Additive Facts:
o These facts can be summed across some dimensions but not all.
o Example: Account Balances can be summed across regions but not across time
(since summing balances across time does not make sense).
o Usage: Used for metrics like inventory levels, account balances, or daily
averages.
3. Non-Additive Facts:
o These facts cannot be summed across any dimension.
o Example: Ratios or Percentages, such as profit margin or average discount.
o Usage: These metrics often require other aggregation techniques, like
averaging or calculating weighted sums.
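The three classes can be contrasted with toy daily data (figures are illustrative): sales are fully additive, account balances are semi-additive along time, and margins are non-additive and must be recomputed from additive components.

```python
# A sketch of fact additivity across the time dimension.
days = [
    {"sales": 100, "profit": 20, "balance": 500},
    {"sales": 300, "profit": 30, "balance": 650},
]

# Fully additive: summing sales across time is meaningful.
total_sales = sum(d["sales"] for d in days)            # 400

# Semi-additive: summing balances across time is wrong (500 + 650 = 1150
# is not a balance); take the closing (period-end) balance instead.
closing_balance = days[-1]["balance"]                  # 650

# Non-additive: daily margins (0.20 and 0.10) cannot be summed or averaged
# directly; recompute the ratio from additive components.
total_profit = sum(d["profit"] for d in days)          # 50
overall_margin = total_profit / total_sales            # 0.125, not 0.15 or 0.30
```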
UNIT – 4
Q.1. Explain OLAP architecture with a neat diagram.
OLAP (Online Analytical Processing) is a system designed to help users perform
multidimensional analysis on large volumes of data. OLAP systems allow users to interact with
data in a more intuitive, multidimensional format, enabling complex queries and data
exploration. The architecture of an OLAP system typically involves several layers that manage
the storage, processing, and presentation of data.
OLAP Architecture Components:
1. Data Source Layer:
o The data source layer includes operational databases, external data sources,
and the data warehouse where data is stored.
o The data is extracted, transformed, and loaded (ETL) into the OLAP system for
analysis.
2. OLAP Engine:
o This is the core processing component of the OLAP system.
o It supports different OLAP models (ROLAP, MOLAP, or HOLAP) and is
responsible for processing the data to provide multidimensional views.
o The OLAP engine performs operations like slice, dice, drill-down, roll-up, and
pivot on multidimensional data cubes or relational tables.
3. OLAP Cube or Data Warehouse:
o The OLAP cube (in MOLAP) or a relational table (in ROLAP) stores
multidimensional data, where dimensions define different aspects of data, and
facts represent quantitative metrics.
o Dimensions could include Time, Geography, Products, etc.
o Facts could include metrics like Sales, Profit, Quantity Sold, etc.
4. Client/Presentation Layer:
o This is the layer where users interact with the OLAP system. It provides tools
for querying, reporting, and visualizing the results.
o Users can use OLAP tools, reporting dashboards, or business intelligence
software to perform analysis and view the results in various formats (charts,
graphs, tables).
1. Roll-Up (Aggregation)
Roll-up refers to the process of summarizing data by climbing up the hierarchy in a dimension.
It aggregates data from a lower level to a higher level. This operation reduces the level of
detail.
• Example:
Consider a sales data cube with the following dimensions:
o Time: Day → Month → Year
o Product: Product A, Product B
o Location: City → State → Country
If we roll up the data along the Time dimension, we aggregate daily sales to monthly sales,
and then monthly sales to yearly sales.
o Before Roll-up:
Sales for January 1st, 2024, for Product A in New York.
o After Roll-up:
Total sales for January for Product A in New York.
2. Drill-Down (Disaggregation)
Drill-down is the opposite of roll-up. It allows the user to navigate from summary data to
more detailed data, essentially breaking down higher-level data into finer granularity.
• Example:
Using the same sales data cube, if you start with yearly data, drilling down would allow
you to view monthly or daily sales. For example:
o Before Drill-down:
Total sales for 2024.
o After Drill-down:
Sales data for January 2024, or even more specifically, for January 1st, 2024.
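Both operations can be sketched over the same daily data (figures illustrative): roll-up aggregates day-level cells to month level, while drill-down navigates back to the daily detail behind a monthly total.

```python
# A sketch of roll-up (day -> month) and drill-down (month -> day) along
# the Time hierarchy of a small sales cube.
daily_sales = {
    ("2024-01-01", "Product A"): 100,
    ("2024-01-15", "Product A"): 250,
    ("2024-02-03", "Product A"): 400,
}

# Roll-up: climb the Time hierarchy by aggregating days into months.
monthly_sales = {}
for (day, product), amount in daily_sales.items():
    month = day[:7]                        # "2024-01-01" -> "2024-01"
    monthly_sales[(month, product)] = monthly_sales.get((month, product), 0) + amount

print(monthly_sales[("2024-01", "Product A")])   # 350

# Drill-down: from the January total back to the daily rows behind it.
january_detail = {k: v for k, v in daily_sales.items() if k[0].startswith("2024-01")}
print(len(january_detail))                        # 2 daily records
```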
2. Roll-up: It is the opposite of the drill-down operation. It performs aggregation on the
OLAP cube. It can be done by:
• Climbing up in the concept hierarchy
• Reducing the dimensions
• In the cube given in the overview section, the roll-up operation is performed
by climbing up in the concept hierarchy of Location dimension (City ->
Country).
3. Dice: It selects a sub-cube from the OLAP cube by applying selection criteria on two or
more dimensions. In the cube given in the overview section, a sub-cube is selected with
the following criteria:
• Location = “Delhi” or “Kolkata”
• Time = “Q1” or “Q2”
• Item = “Car” or “Bus”
4. Slice: It selects a single value along one dimension of the OLAP cube, which results in
a new sub-cube with one fewer dimension. In the cube given in the overview section,
Slice is performed on the dimension Time = “Q1”.
5. Pivot: It is also known as the rotation operation, as it rotates the current view to give a
new view of the same data. In the sub-cube obtained after the slice operation,
performing a pivot operation gives a new view of it.
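Slice and dice can be sketched over a cube stored as a dictionary keyed by (Location, Time, Item); the cell values below are illustrative, but the selection criteria mirror the Delhi/Kolkata example above.

```python
# A sketch of slice and dice on a cube represented as a dict of cells.
cube = {
    ("Delhi",   "Q1", "Car"): 10,
    ("Delhi",   "Q2", "Bus"): 15,
    ("Kolkata", "Q1", "Bus"): 20,
    ("Mumbai",  "Q3", "Car"): 25,
}

# Slice: fix a single value on one dimension (Time = "Q1").
slice_q1 = {k: v for k, v in cube.items() if k[1] == "Q1"}

# Dice: apply criteria on two or more dimensions to select a sub-cube.
dice = {k: v for k, v in cube.items()
        if k[0] in ("Delhi", "Kolkata")
        and k[1] in ("Q1", "Q2")
        and k[2] in ("Car", "Bus")}

print(len(slice_q1))  # 2 cells (Delhi/Q1/Car and Kolkata/Q1/Bus)
print(len(dice))      # 3 cells (everything except the Mumbai cell)
```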
Q.4.Explain Data Sparsity and Data Explosion in detail.
1. Data Sparsity
Data sparsity refers to the situation where a large proportion of the data in a dataset is
missing, null, or irrelevant. In the context of OLAP or multidimensional databases, sparsity
occurs when only a small portion of the possible combinations of dimensions has actual
data, while the majority have no data (empty or zero). This leads to storage inefficiencies
and increased complexity when processing or querying the database.
• Example:
A sales database with the dimensions Time, Product, and Location may have many
combinations of product, time, and location that have no sales data, making the cube
sparse. For instance, sales may exist only for specific products in certain months and
locations, while other combinations may have no data at all.
Challenges:
• Wasted storage space for empty cells.
• Increased complexity in query processing due to the need to handle missing data.
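The storage problem can be quantified with a small sketch. Assuming illustrative dimension sizes, a dense cube would allocate every possible (product, day, location) cell, while a dictionary-of-keys representation stores only the cells that actually contain data.

```python
# A sketch of sparse-cube storage: count every possible cell, but keep
# only the non-empty ones. Dimension cardinalities are illustrative.
n_products, n_days, n_locations = 100, 365, 50

dense_cells = n_products * n_days * n_locations   # all possible cells

# Dictionary-of-keys storage: only combinations with actual sales exist.
sparse_cube = {
    (1, "2024-01-05", "NY"): 120,
    (7, "2024-03-02", "LA"): 80,
}

fill_ratio = len(sparse_cube) / dense_cells
print(dense_cells)                  # 1825000 possible cells
print(round(fill_ratio * 100, 5))   # 0.00011 percent of cells occupied
```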
2. Data Explosion
Data explosion refers to a rapid and large increase in the volume of data, especially in
multidimensional databases. It occurs when the number of dimensions and possible
combinations grows significantly, causing the data size to expand exponentially. This can lead
to challenges in storage, performance, and management.
• Example:
A sales database with dimensions like Time, Product, Location, and Customer can
generate an extremely large cube as the number of dimensions increases. Adding a
new dimension like Customer Type can cause the size of the data cube to explode,
making it difficult to store and analyze efficiently.
Challenges:
• Increased storage requirements.
• Performance degradation due to the exponential growth of data.
• Difficulty in managing and querying the large dataset effectively.
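The exponential growth is easy to see arithmetically: each new dimension multiplies the number of possible cells by its cardinality. The cardinalities below are illustrative.

```python
# A sketch of data-explosion arithmetic: adding one modest dimension
# multiplies the cube's potential cell count by that dimension's size.
time_vals, product_vals, location_vals = 365, 100, 50

cells_before = time_vals * product_vals * location_vals   # 1,825,000 cells

customer_type_vals = 20              # the newly added Customer Type dimension
cells_after = cells_before * customer_type_vals

print(cells_after)                   # 36500000
print(cells_after // cells_before)   # growth factor of 20
```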
Q.5. Differentiate between Roll-up and Drill-down.
Example:
• Roll-Up:
o Moving from daily data to monthly or yearly data.
o Before Roll-Up: Sales for January 1st, 2024.
o After Roll-Up: Total sales for January 2024, or for all of 2024.
• Drill-Down:
o Breaking down yearly data into months or daily data.
o Before Drill-Down: Sales for 2024.
o After Drill-Down: Sales for January 2024, then January 1st, 2024.