ADBMS IMP Questions
June 2024 :
Module :1
1. Differentiate between parallel and distributed databases. (2024-june)
2. Define a parallel database. Explain the different types of parallel database architectures in detail. (2024-june)
3. Define abstract data type. Discuss the operations on structured data. (2024-june)
7/2023
16 JAN 2024
7.ADT
9.OODBMS vs ORDBMS
1/03/2023
Module :2
1. Differentiate DM (dimensional modelling) vs OLAP
2. What is dimensional modelling? Discuss dimensional modelling techniques in a data warehouse. (2024-june)
7/2023
16 JAN 2024
6.OLAP operations.
1/03/2023
9.OLAP vs OLTP
10. Star, snowflake, and fact constellation schemas
Module :3
1. Why is data preprocessing important in data mining? (2024-june)
2. Explain data reduction techniques in detail. (2024-june)
7/2023
3.Data preprocessing?
16 JAN 2024
7.Define data mining. Explain data preprocessing techniques used in the data mining process.
1/03/2023
Module :4
1. What is association rule mining, and what is its primary goal? What is the Apriori algorithm? How does the Apriori algorithm handle the generation of frequent itemsets? (2024-june)
7/2023
16 JAN 2024
1/03/2023
3. Explain the decision tree used in classification. Compare the ID3, C4.5, and CART classification algorithms.
4. Explain associative classification
Module :5
1. Define classification. Explain the KNN algorithm with a suitable example. (2024-june)
7/2023
16 JAN 2024
4.Decision tree
5.Define and explain Bayesian and Naïve Bayesian classification.
1/03/2023
6. Regression Analysis.
7.Bayes theorem
Module :6
1. What is agglomerative clustering? Explain with an example. (2024-june)
7/2023
16 JAN 2024
1/03/2023
5.Hierarchical clustering
Module :7
1. Write a note on web usage mining. (2024-june)
2. Explain text mining and briefly discuss information retrieval methods. (2024-june)
7/2023
16 JAN 2024
4. Web mining
1/03/2023
Features of a parallel database:
1. CPUs work in parallel
2. It improves performance
3. Large tasks are divided into many smaller tasks
4. Work is completed very quickly
Features of a distributed database:
1. It is a group of logically related shared data
2. The data gets split into various fragments
3. There may be a replication of fragments
4. The sites are linked by a communication network
The main difference between parallel and distributed databases is that the former is tightly coupled while the latter is loosely coupled.
Shared memory architecture:
Advantages:
Simple to implement.
Effective communication among the processors.
Less communication overhead.
Disadvantages:
Limited degree of parallelism.
Adding more processors slows down the existing processors.
Cache coherency needs to be maintained.
Bandwidth issues.
Shared disk architecture:
In the shared disk architecture, each processor has its own private memory, while all processors share a single mass storage in common.
Advantages:
Fault tolerance is achieved
Interconnection to the memory is not a bottleneck
Supports large number of processors
Disadvantages:
Limited scalability
Inter-processor communication is slow
Applications:
Digital Equipment Corporation (DEC).
3. Define abstract data type. Discuss the operations on structured data. (2024-june)
Answer:
I. Large Object (LOB) Types
CLOB
The CLOB data type stores character large objects. CLOB can store up to 4 gigabytes of character
data.
NCLOB
The NCLOB data type stores character large objects in multibyte national character set. NCLOB can
store up to 4 gigabytes of character data.
BFILE
The BFILE data type enables access to binary file LOBs that are stored in file systems outside the
Oracle database. A BFILE column stores a locator, which serves as a pointer to a binary file on the
server's file system. The maximum file size supported is 4 gigabytes.
II. Variable-Sized Array (VARRAY)
Items of type VARRAY are called varrays.
They allow you to associate a single identifier with an entire collection.
This association lets you manipulate the collection as a whole and reference individual elements
easily.
To reference an element, you use standard subscripting syntax.
A varray has a maximum size, which you must specify in its type definition.
Its index has a fixed lower bound of 1 and an extensible upper bound.
Thus, a varray can contain a varying number of elements, from zero (when empty) to the maximum specified in its type definition.
The basic Oracle syntax for the CREATE TYPE statement for a VARRAY type definition would be:
CREATE OR REPLACE TYPE name_of_type IS VARRAY(nn) OF type;
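For example, a minimal sketch of a varray in use (the names phone_list and employee are illustrative, not from the question paper):
-- Define a varray type holding up to 3 phone numbers
CREATE OR REPLACE TYPE phone_list IS VARRAY(3) OF VARCHAR2(15);
-- Use the varray type as a column type
CREATE TABLE employee
(
emp_id NUMBER,
name CHAR(20),
phones phone_list
);
-- The type name acts as a constructor for the collection
INSERT INTO employee VALUES (1, 'Asha', phone_list('9820011001', '9820011002'));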
Homogeneous DDB: all sites run the same operating system and the same DBMS application on similar hardware devices.
Heterogeneous DDB: sites run different operating systems with different DBMS applications on different hardware devices.
Advantages of Distributed Database:
Modular development.
Server failure will not affect the entire data set.
Common architectural models are:
1. Client - Server Architecture for DDBMS
2. Peer - to - Peer Architecture for DDBMS
3. Multi - DBMS Architecture
Client - Server Architecture for DDBMS:
This is a two-level architecture in which the functionality is divided between servers and clients. Server functions primarily encompass data management, query processing, optimization, and transaction management, whereas client functions particularly include the user interface, along with common functionalities like consistency checking and transaction management.
Client - server architectures are classified as:
Single Server Multiple Client
Multiple Server Multiple Client
Example:
Create type Address
CREATE OR REPLACE TYPE address AS OBJECT
(
street char(20),
city char(20),
state char(2),
zip char(5)
);
Create a table called test_adt with the following columns, and describe the new test_adt table
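The column list is not given in the notes, so the following is a plausible sketch that uses the address type defined above (the columns id and name are hypothetical):
CREATE TABLE test_adt
(
id NUMBER,
name CHAR(20),
residence address
);
-- In SQL*Plus, DESCRIBE shows the new table's structure;
-- the residence column is reported with the ADT type ADDRESS
DESCRIBE test_adt;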
Distributed databases are classified into homogeneous and heterogeneous, each with further subdivisions. Examples: Apache Ignite, Apache Cassandra, Apache HBase, Couchbase Server, Amazon SimpleDB, Clusterpoint, and FoundationDB.
Dimensional Modelling (DM) vs OLAP:
Definition: DM is a design technique for organizing data in a data warehouse using fact and dimension tables; OLAP is a technology for analyzing data in multidimensional structures.
Focus: DM focuses on data structure and schema design (e.g., star schema, snowflake schema); OLAP focuses on data aggregation, slicing, dicing, and visualization.
Components: DM uses fact tables, dimension tables, and schemas (star, snowflake); OLAP uses cubes, measures, dimensions, and hierarchies.
ROLAP :
ROLAP works with data that exists in a relational database. Facts and dimension tables are stored as relational tables.
It also allows multidimensional analysis of data and is the fastest-growing type of OLAP.
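Because facts and dimensions are plain relational tables, a multidimensional ROLAP query is essentially a star join with aggregation; a minimal sketch with illustrative table and column names:
-- Join the fact table to two dimension tables and aggregate
SELECT d.year, s.region, SUM(f.amount) AS total_sales
FROM sales_fact f
JOIN time_dim d ON f.time_key = d.time_key
JOIN store_dim s ON f.store_key = s.store_key
GROUP BY d.year, s.region;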
Advantages of ROLAP model:
High data efficiency: query performance and access language are optimized particularly for multidimensional data analysis.
Scalability: this type of OLAP system offers scalability for managing large volumes of data, even when the data is steadily increasing.
Drawbacks of ROLAP model:
Demand for higher resources: ROLAP needs high utilization of manpower, software, and hardware resources.
Aggregate data limitations: ROLAP tools use SQL for all calculations of aggregate data, and SQL is not well suited to every kind of computation.
Slow query performance: query performance in this model is slow when compared with MOLAP.
MOLAP :
MOLAP uses array-based multidimensional storage engines to display multidimensional views of
data. Basically, they use an OLAP cube.
Multidimensional OLAP (MOLAP) is a classical OLAP that facilitates data analysis by using a
multidimensional data cube.
Data is pre-computed, pre-summarized, and stored in a MOLAP cube.
Using MOLAP, a user can view multidimensional data across different facts.
MOLAP has all possible combinations of data already stored in a multidimensional array. MOLAP can
access this data directly.
Hence, MOLAP is faster compared to Relational Online Analytical Processing (ROLAP).
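MOLAP engines store these pre-computed aggregates internally; purely as an illustration (not how a MOLAP engine is actually implemented), the closest relational analogue is a materialized view that holds the aggregates ahead of query time (sales is an illustrative table):
-- Pre-compute and store the aggregates once, at build time
CREATE MATERIALIZED VIEW sales_by_city_quarter
BUILD IMMEDIATE
REFRESH COMPLETE
AS
SELECT city, quarter, SUM(amount) AS total
FROM sales
GROUP BY city, quarter;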
MOLAP Advantages
MOLAP can manage, analyze and store considerable amounts of multidimensional data.
Fast Query Performance due to optimized storage, indexing, and caching.
Smaller sizes of data as compared to the relational database.
Automated computation of higher-level aggregate data.
Helps users analyze larger, less-defined data.
MOLAP is easier to use, which is why it is a suitable model for inexperienced users.
MOLAP cubes are built for fast data retrieval and are optimal for slicing and dicing operations.
All calculations are pre-generated when the cube is created.
MOLAP Disadvantages
One major weakness of MOLAP is that it is less scalable than ROLAP, as it handles only a limited amount of data; it is therefore impossible to include very large amounts of data in the cube itself.
MOLAP also introduces data redundancy and is resource intensive.
Building MOLAP cubes can be lengthy, particularly on large data volumes.
MOLAP products may face issues while updating and querying models with more than ten dimensions.
MOLAP is not capable of containing detailed data.
Storage utilization can be low if the data set is highly scattered.
Hybrid OLAP :
Hybrid OLAP is a mixture of both ROLAP and MOLAP.
It offers the fast computation of MOLAP and the higher scalability of ROLAP. HOLAP uses two databases:
Aggregated or computed data is stored in a multidimensional OLAP cube
Detailed information is stored in a relational database.
Other Types of OLAP
There are some other types of OLAP Systems that are used in analyzing databases. Some of them are
mentioned below.
Web OLAP (WOLAP): a web-browser-based technology. A traditional OLAP application is accessed through a client/server front end, whereas a WOLAP application is accessed through a web browser. It is a three-tier architecture consisting of a client, middleware, and a database server. The most appealing features of this style of OLAP were (past tense intended, since few products categorize themselves this way) the considerably lower investment involved on the client side (“all that’s needed is a browser”) and enhanced accessibility to the data. A web-based application requires no deployment on the client machine; all that is needed is a web browser and a network connection to the intranet or Internet.
Desktop OLAP (DOLAP): DOLAP stands for desktop online analytical processing. Users can download the data from the source and work with the dataset on their desktop. Functionality is limited compared to other OLAP applications, but it comes at a lower cost.
Mobile OLAP (MOLAP): Mobile OLAP provides OLAP functionality on wireless and mobile devices, so users can work with and access the data through mobile devices.
Spatial OLAP (SOLAP): SOLAP emerged to merge the capabilities of both Geographic Information Systems (GIS) and OLAP into a single user interface. SOLAP was created because data comes in alphanumeric, image, and vector forms. It provides easy and quick exploration of data that resides in a spatial database.
Three-Tier Architecture
Definition: The most common architecture, which separates the data warehouse into three distinct
layers for enhanced scalability, flexibility, and performance.
Components:
1. Bottom Tier:
Data sources: Operational databases, external data, and flat files.
ETL processes: Data extraction, transformation, and loading into the warehouse.
Data staging area: Temporary storage for ETL processes.
2. Middle Tier:
OLAP (Online Analytical Processing) server: Provides multidimensional data views.
Business logic layer: Executes complex transformations and calculations.
3. Top Tier:
Client tools: Dashboards, reporting tools, and analytical applications for end users.
Advantages:
o Scalable for large datasets.
Limitations:
o Higher implementation cost and complexity.
OLAP operations:
There are five basic analytical operations that can be performed on an OLAP cube:
1. Drill down: In the drill-down operation, less detailed data is converted into highly detailed data. It can be done by:
Moving down in the concept hierarchy
Adding a new dimension
In the example cube, the drill-down operation is performed by moving down in the concept hierarchy of the Time dimension (Quarter -> Month).
2. Roll up: It is just the opposite of the drill-down operation. It performs aggregation on the OLAP cube. It can be done by:
Climbing up in the concept hierarchy
Reducing the dimensions
In the example cube, the roll-up operation is performed by climbing up in the concept hierarchy of the Location dimension (City -> Country).
3. Dice: It selects a sub-cube from the OLAP cube by selecting two or more dimensions. In the example cube, a sub-cube is selected with the following criteria:
Location = “Delhi” or “Kolkata”
Time = “Q1” or “Q2”
Item = “Car” or “Bus”
4. Slice: It selects a single value along one dimension of the OLAP cube, which results in the creation of a new sub-cube. In the example cube, slice is performed on the dimension Time = “Q1”.
5. Pivot: It is also known as the rotation operation, as it rotates the current view to get a new view of the representation. Performing a pivot on the sub-cube obtained after the slice operation gives a new view of it.
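In a ROLAP setting these operations map directly onto SQL over the example dimensions above (sales is an illustrative fact table, not part of the question):
-- Roll up: aggregate city-level figures up to country with GROUP BY ROLLUP
SELECT country, city, SUM(amount) AS total
FROM sales
GROUP BY ROLLUP (country, city);
-- Slice: fix a single value on the Time dimension
SELECT * FROM sales WHERE quarter = 'Q1';
-- Dice: restrict two or more dimensions at once
SELECT * FROM sales
WHERE city IN ('Delhi', 'Kolkata')
AND quarter IN ('Q1', 'Q2')
AND item IN ('Car', 'Bus');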
7.Define data warehouse. Explain the ETL process in detail.
Data Warehouse :
A Data Warehouse (DW) is a relational database that is designed for query and analysis rather than
transaction processing.
It includes historical data derived from transaction data from single and multiple sources.
A Data Warehouse provides integrated, enterprise-wide, historical data and focuses on providing support for
decision-makers for data modeling and analysis.
A Data Warehouse is a group of data specific to the entire organization, not only to a particular group of
users.
It is not used for daily operations and transaction processing but used for making decisions.
Benefits of Data Warehouse
1. Understand business trends and make better forecasting decisions.
2. Data Warehouses are designed to perform well with enormous amounts of data.
3. The structure of data warehouses is more accessible for end-users to navigate, understand, and
query.
4. Data warehousing is an efficient method to manage demand for lots of information from lots of users.
5. Data warehousing provides the capability to analyze large amounts of historical data.
ETL is a process in Data Warehousing and it stands for Extract, Transform and Load. It is a process in
which an ETL tool extracts the data from various data source systems, transforms it in the staging area, and
then finally, loads it into the Data Warehouse system.
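At its core, the transform-and-load step can be pictured as an INSERT ... SELECT from the staging area into the warehouse; a minimal sketch with illustrative names (staging_sales, dw_sales):
-- Load cleaned, transformed rows from the staging area into the warehouse
INSERT INTO dw_sales (sale_date, region, amount)
SELECT TO_DATE(raw_date, 'YYYY-MM-DD'),  -- transform: parse the date string
       UPPER(TRIM(region)),              -- transform: standardize text
       NVL(amount, 0)                    -- transform: replace missing amounts
FROM staging_sales
WHERE raw_date IS NOT NULL;              -- cleaning: drop rows without a date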
ETL Tools: Most commonly used ETL tools are Hevo, Sybase, Oracle Warehouse builder, CloverETL, and
MarkLogic.
Data Warehouses: Most commonly used Data Warehouses are Snowflake, Redshift, BigQuery, and
Firebolt.
Differentiate OLAP vs OLTP:
Normalized: In an OLAP database, tables are not normalized; in an OLTP database, tables are normalized (3NF).
Usage of data: OLAP data is used in planning, problem-solving, and decision-making; OLTP data is used to perform day-to-day fundamental operations.
Volume of data: OLAP stores a large amount of data, typically in TB or PB; OLTP data is relatively small (MB, GB) since historical data is archived.
Backup and Recovery: OLAP only needs backup from time to time as compared to OLTP; in OLTP the backup and recovery process is maintained rigorously.
Operations: OLAP involves only read and rarely write operations; OLTP involves both read and write operations.
Database Design: OLAP design focuses on the subject; OLTP design focuses on the application.
Snowflake Schema:
A snowflake schema in a data warehouse is a logical arrangement of tables in a multidimensional database such that the ER diagram resembles a snowflake shape. A snowflake schema is an extension of a star schema in which additional dimensions are added and the dimension tables are normalized, splitting data into additional tables. For example, a Country attribute can be normalized out of a dimension into an individual table.
Characteristics of Snowflake Schema:
The main benefit of the snowflake schema is that it uses less disk space.
It is easier to add a new dimension to the schema.
Query performance is reduced because queries must join multiple tables (see the DDL sketch below).
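A minimal DDL sketch of the normalization described above, where Country is split out of a store dimension (all names illustrative):
CREATE TABLE country_dim
(
country_key NUMBER PRIMARY KEY,
country_name CHAR(40)
);
-- The store dimension references country instead of storing it inline
CREATE TABLE store_dim
(
store_key NUMBER PRIMARY KEY,
city CHAR(40),
country_key NUMBER REFERENCES country_dim
);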
Galaxy Schema:
A galaxy schema contains two or more fact tables that share dimension tables between them. It is also called a fact constellation schema. The schema is viewed as a collection of stars, hence the name galaxy schema. For example, there may be two fact tables:
1. Expense
2. Revenue
In a galaxy schema, shared dimensions are called conformed dimensions.
Characteristics of Galaxy Schema:
The dimensions in this schema are separated into separate dimension tables based on the various levels of hierarchy. For example, if geography has four levels of hierarchy (region, country, state, and city), then the galaxy schema should have four dimension tables. It is possible to build this type of schema by splitting one star schema into more star schemas. The dimension tables are large in this schema because they are built from the levels of hierarchy. This schema is helpful for aggregating fact tables for better understanding.
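A sketch of the Expense/Revenue example as DDL, with both fact tables sharing one conformed time dimension (names illustrative):
-- Shared (conformed) dimension
CREATE TABLE time_dim
(
time_key NUMBER PRIMARY KEY,
year NUMBER,
quarter CHAR(2)
);
-- Two fact tables reference the same dimension table
CREATE TABLE revenue_fact
(
time_key NUMBER REFERENCES time_dim,
revenue NUMBER
);
CREATE TABLE expense_fact
(
time_key NUMBER REFERENCES time_dim,
expense NUMBER
);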
Why data preprocessing is important in data mining
Data preprocessing is a technique to turn raw and crude information gathered from diverse sources into a clean and consistent dataset. It is one of the most vital steps in the data mining process and involves data cleaning, data integration, data transformation, data reduction, etc.
Importance of Data Preprocessing in Data Mining
1. Improves Data Quality
Real-world data may have missing values, errors, or outliers that can negatively affect the accuracy of
mining algorithms.
Preprocessing ensures data is accurate, complete, and reliable, leading to better analytical results.
2. Handles Missing Data
Missing values can arise due to human errors, hardware failures, or data collection issues.
Techniques like mean substitution, regression imputation, or deletion are applied to address these gaps, ensuring the dataset is usable (a mean-substitution sketch in SQL follows this list).
3. Removes Noise and Outliers
Noisy data can result from sensor errors, human input errors, or system malfunctions.
Techniques like smoothing, binning, and clustering are used to reduce noise and improve the quality
of patterns extracted.
4. Ensures Consistency
Data inconsistencies, such as different formats, naming conventions, or measurement units, can
hinder analysis.
Preprocessing resolves such inconsistencies by standardizing the data.
5. Reduces Complexity
High-dimensional data can lead to challenges in processing and visualization.
Dimensionality reduction techniques like Principal Component Analysis (PCA) or feature selection
simplify data without significant loss of information.
6. Enhances Algorithm Efficiency
Clean and formatted data reduces computational complexity and improves the performance of data
mining algorithms.
Ensures that algorithms focus on extracting meaningful patterns rather than dealing with noisy or
irrelevant data.
7. Facilitates Better Understanding
Data preprocessing organizes and summarizes the dataset, making it easier for analysts to interpret
the data and derive insights.
8. Ensures Reproducibility
Standardizing the preprocessing steps ensures that the results are reproducible and consistent across
different datasets or scenarios.
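As a small illustration of point 2 above, mean substitution can be done directly in SQL (customer and income are illustrative names):
-- Replace missing income values with the column average (mean substitution)
UPDATE customer
SET income = (SELECT ROUND(AVG(income)) FROM customer)
WHERE income IS NULL;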
Example of stepwise forward selection (an attribute subset selection method for data reduction), where the best remaining attribute is added at each step:
Step-1: {X1}
Step-2: {X1, X2}
Step-3: {X1, X2, X5}
4. Numerosity Reduction:
In this reduction technique, the actual data is replaced with a mathematical model or a smaller representation of the data. For parametric methods it is important to store only the model parameters; non-parametric methods such as clustering, histograms, and sampling store reduced representations of the data directly (see the sampling sketch below).
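Sampling, one of the non-parametric methods mentioned above, is available directly in Oracle SQL through the SAMPLE clause (sales is an illustrative table):
-- Keep roughly a 10% random sample of the rows as a reduced representation
SELECT * FROM sales SAMPLE (10);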
5. Discretization & Concept Hierarchy Generation:
Data discretization techniques are used to divide continuous attributes into data with intervals: many constant values of an attribute are replaced by labels of small intervals. This means that mining results are presented in a concise and easily understandable way.
Top-down discretization –
If you first consider one or a couple of points (so-called breakpoints or split points) to divide the whole range of the attribute, and repeat this method on the resulting intervals up to the end, the process is known as top-down discretization, also known as splitting.
Bottom-up discretization –
If you first consider all the constant values as potential split points, and some are then discarded by merging neighborhood values into intervals, the process is called bottom-up discretization, also known as merging.
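As a sketch, equal-width discretization can be expressed in Oracle SQL with WIDTH_BUCKET, which maps a continuous attribute such as age onto a fixed number of interval labels (table and column names are illustrative):
-- Map ages in [0, 100) onto 5 equal-width interval labels (1..5)
SELECT age, WIDTH_BUCKET(age, 0, 100, 5) AS age_interval
FROM customer;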