ADBMS Exam Question Answers
Ans:
Key Components:
Types of Distributed Databases:
1. Homogeneous: All databases in the system use the same DBMS software
and are similar in structure.
2. Heterogeneous: Databases use different DBMS software, and the data
structures can vary across sites.
Key Differences:
Benefits:
Ans:
Differences Between the Architectures of Parallel and Distributed Databases
(Short Answer)
Summary:
Ans:
Phases:
1. Phase 1 (Voting Phase): The coordinator sends a prepare message
to all participants. Each participant responds with either Vote
Commit (if it can commit the transaction) or Vote Abort (if it
cannot).
2. Phase 2 (Commit/Abort Phase): Based on the votes from
participants:
o If all participants vote Commit, the coordinator sends a
Commit message, and all participants commit the transaction.
o If any participant votes Abort, the coordinator sends an
Abort message, and all participants abort the transaction.
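As an illustration, here is a minimal Python sketch of the 2PC decision logic, assuming a single coordinator and hypothetical participant sites that each report whether they can commit:
```python
# Minimal simulation of the two-phase commit voting logic.
# Participant names and their vote decisions are hypothetical examples.

def two_phase_commit(participants):
    # Phase 1 (Voting): the coordinator asks every participant to prepare.
    votes = {name: can_commit() for name, can_commit in participants.items()}

    # Phase 2 (Commit/Abort): commit only if every participant voted Commit.
    decision = "COMMIT" if all(votes.values()) else "ABORT"

    # The coordinator broadcasts the global decision to all participants.
    for name in participants:
        print(f"{name}: {decision}")
    return decision

# Example run: one participant cannot prepare, so the transaction aborts.
participants = {
    "site_A": lambda: True,
    "site_B": lambda: True,
    "site_C": lambda: False,   # this site votes Abort
}
two_phase_commit(participants)   # prints ABORT for every site
```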
Drawbacks:
Phases:
1. Phase 1 (Can Commit Phase): The coordinator sends a Can
Commit message to all participants. Participants reply with Yes (if
they can commit) or No (if they cannot).
2. Phase 2 (Pre-Commit Phase): If all participants respond with Yes,
the coordinator sends a Pre-Commit message to all participants,
signaling they should prepare to commit.
3. Phase 3 (Commit Phase): After the Pre-Commit phase is acknowledged,
the coordinator sends a Do Commit message; each participant commits
the transaction and sends an acknowledgment back to the coordinator.
Advantages:
o Reduces the blocking problem of 2PC: thanks to the extra Pre-Commit
phase, participants can usually reach a decision even if the coordinator fails.
Drawbacks:
o Slightly more complex than 2PC, as it introduces an additional
phase.
o Still vulnerable to some failure scenarios, especially with network
partitioning.
Key Differences:
Summary:
Ans:
Structured, Object-Oriented, and Complex Data Types in
ORDBMS (Short Answer)
Examples:
o Arrays: A collection of elements of the same data type (e.g., an
array of integers).
o Records: A composite type where multiple attributes (fields) of
potentially different data types are grouped together, similar to a
row in a relational database but with a custom structure.
Usage: Structured data types are useful when you need to store complex
data that is related, such as a set of coordinates (latitude, longitude) or a
list of values associated with a particular entity.
Features:
o Encapsulation: Data and methods (functions) that operate on the
data are encapsulated together.
o Inheritance: Data types can inherit properties and behaviors from
other types, enabling reuse of data structures.
o Polymorphism: The same operation can work on different data
types, allowing flexibility.
Example:
o User-defined types (UDTs): These are types that users define,
such as a Person object with attributes like name, age, and
address, and methods like getFullName() or
calculateAge().
Usage: OODTs allow for modeling complex data in a way that aligns
more closely with real-world entities and relationships, providing a
powerful abstraction for data storage.
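Actual ORDBMS syntax for user-defined types is vendor-specific; as a language-neutral sketch, the Person type above might be modelled as follows in Python, showing encapsulation and inheritance (the attribute names and the Employee subtype are assumptions added for illustration):
```python
from datetime import date

# Illustrative analogue of a Person UDT: attributes and the methods that
# operate on them (get_full_name, calculate_age) live in one type.
class Person:
    def __init__(self, first_name, last_name, birth_date, address):
        self.first_name = first_name
        self.last_name = last_name
        self.birth_date = birth_date
        self.address = address

    def get_full_name(self):
        return f"{self.first_name} {self.last_name}"

    def calculate_age(self, today=None):
        today = today or date.today()
        return today.year - self.birth_date.year - (
            (today.month, today.day) < (self.birth_date.month, self.birth_date.day)
        )

# Inheritance: an Employee "is a" Person and reuses its structure and methods.
class Employee(Person):
    def __init__(self, first_name, last_name, birth_date, address, salary):
        super().__init__(first_name, last_name, birth_date, address)
        self.salary = salary

e = Employee("Asha", "Rao", date(1990, 5, 12), "Pune", 50000)
print(e.get_full_name(), e.calculate_age())
```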
Complex data types allow users to store intricate and hierarchical data structures
in ORDBMS. These types are designed to handle multi-valued, nested, or non-
tabular data that traditional relational databases cannot easily manage.
Examples:
o Multiset: A collection that can hold multiple values of the same
type, allowing for duplicates.
o Nested Tables: Tables within tables, enabling storage of
hierarchical data.
o XML or JSON: Support for storing XML or JSON formatted data
directly in the database.
Usage: Complex data types are often used to store data that has inherent
hierarchical structures (e.g., JSON for storing nested data, XML for
structured documents).
Summary:
Structured Data Types: Composite types like arrays and records used for
organizing data.
Object-Oriented Data Types: Types that model real-world objects,
supporting encapsulation, inheritance, and polymorphism.
Complex Data Types: Data types that allow storing hierarchical, multi-
valued, or non-tabular data, like nested tables or XML.
5. Timeout-Based Detection
1. Complexity in Design:
o ORDBMS integrates relational and object-oriented concepts,
which can make the system design more complex.
o Designing a data model that takes full advantage of both relational
and object-oriented features requires deep expertise and careful
planning.
2. Performance Overhead:
o The additional layers of abstraction introduced by object-oriented
features (such as inheritance, encapsulation) can lead to
performance degradation.
o Object-relational mapping (ORM) can be slow due to the need to
map complex object structures to relational tables, especially with
large datasets.
3. Integration with Existing Systems:
o ORDBMS may have compatibility issues when integrating with
legacy relational databases or third-party systems that don't
support object-oriented features.
o Data migration from relational databases to ORDBMS is often
complex and resource-intensive.
4. Learning Curve:
o Database administrators and developers may face a steep learning
curve when transitioning from traditional relational databases to an
ORDBMS.
o New tools, techniques, and paradigms must be learned, which may
slow down development.
5. Query Language Complexity:
o ORDBMS often uses extended query languages, such as SQL with
object-oriented extensions, which may require more advanced
knowledge.
o Query optimization and execution plans can be more complicated
compared to traditional relational databases, leading to potential
inefficiencies.
6. Lack of Standardization:
o There is no universal standard for ORDBMS, leading to vendor-
specific implementations and lack of portability across different
database systems.
o Different ORDBMS products may have varying implementations
of object-oriented features, making it difficult to switch between
them.
7. Increased Maintenance Efforts:
o The additional features provided by ORDBMS (e.g., object types,
inheritance, polymorphism) can increase maintenance efforts for
both data structures and application code.
o Changes in object structures may require frequent updates to the
database schema and application logic.
Summary:
Ans:
Summary:
1. Data Sources Layer (bottom tier) for data extraction and integration.
2. Data Storage Layer (middle tier) for storing cleaned and transformed
data.
3. Data Presentation Layer (top tier) for delivering insights and reports to
end users.
This architecture ensures a scalable and efficient way to manage, process, and
analyze large datasets.
Ans:
The ETL process is a crucial part of Data Warehousing used to move data
from various sources into a data warehouse or database for analysis. It consists
of three main steps:
1. Extract:
o Description: Data is extracted from different source systems (e.g.,
relational databases, flat files, APIs, or external sources).
o Goal: To collect raw data from heterogeneous sources and bring it
to a staging area for further processing.
2. Transform:
o Description: In this step, the extracted data is transformed into the
desired format or structure to be stored in the target database.
o Tasks:
Cleansing: Remove or correct invalid data.
Filtering: Eliminate unnecessary data.
Aggregating: Summarize data (e.g., calculating totals or
averages).
Data Enrichment: Enhance data with additional
information.
Data Mapping: Convert data to match the format of the
target system.
o Goal: To ensure data quality, consistency, and suitability for
analysis.
3. Load:
o Description: The transformed data is loaded into the data
warehouse or target storage system.
o Types of Load:
Full Load: All data is loaded into the data warehouse.
Incremental Load: Only new or changed data is loaded.
o Goal: To store the data in a manner that is optimized for querying
and analysis.
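A minimal sketch of the three ETL steps in Python, assuming a hypothetical CSV source file with region and amount columns and an SQLite table as the load target:
```python
import csv
import sqlite3

# --- Extract: read raw rows from a (hypothetical) CSV source file. ---
def extract(path):
    with open(path, newline="") as f:
        return list(csv.DictReader(f))

# --- Transform: cleanse invalid rows, standardise formats, convert types. ---
def transform(rows):
    cleaned = []
    for row in rows:
        if not row.get("amount"):                      # cleansing: drop rows with no amount
            continue
        cleaned.append({
            "region": row["region"].strip().title(),   # standardise the region name
            "amount": float(row["amount"]),            # type conversion
        })
    return cleaned

# --- Load: write the transformed rows into the target warehouse table. ---
def load(rows, db="warehouse.db"):
    con = sqlite3.connect(db)
    con.execute("CREATE TABLE IF NOT EXISTS sales (region TEXT, amount REAL)")
    con.executemany("INSERT INTO sales VALUES (:region, :amount)", rows)
    con.commit()
    con.close()

# Example pipeline run (the source file name is hypothetical):
# load(transform(extract("sales_source.csv")))
```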
Summary:
ETL is vital for data integration and ensuring that data is ready for decision-
making in a business environment.
Q9 What is a datamart? Characteristics and benefits of a datamart.
Characteristics of a Datamart:
Benefits of a Datamart:
1. Faster Query Performance: Smaller datasets lead to faster query
response times.
2. Cost-Effective: Less expensive to implement and maintain than a full
data warehouse.
3. User-Friendly: Tailored to the needs of specific users or departments.
4. Improved Decision-Making: Provides relevant and focused data for
better insights.
5. Simplified Data Management: Easier to manage and update than large
data warehouses.
Ans:
1. Star Schema:
o Structure: A central fact table surrounded by dimension tables.
o Fact Table: Contains quantitative data (e.g., sales, revenue) and
foreign keys to dimension tables.
o Dimension Tables: Contain descriptive, categorical data (e.g.,
time, location, product).
o Example: A sales data model where the fact table contains sales
transactions and dimension tables contain information about
products, time, and customers.
2. Snowflake Schema:
o Structure: Similar to the star schema but with normalized
dimension tables (further breaking down dimensions into sub-
dimensions).
o Fact Table: Same as in star schema.
o Dimension Tables: More normalized, reducing redundancy by
breaking down categories.
o Example: A sales data model where the product dimension is
broken into product category, subcategory, and product name.
3. Fact Constellation Schema:
o Structure: Multiple fact tables share common dimension tables.
o Fact Tables: Represent different business processes (e.g., sales and
inventory).
o Dimension Tables: Shared across multiple fact tables.
o Example: A sales and inventory data model where both tables use
common dimensions like time, store, and product.
Key Differences:
Ans:
Datamart
OLAP refers to systems that allow users to analyze large volumes of data
interactively from multiple perspectives.
OLAP provides fast query performance and multidimensional views of
data (e.g., sales by time, region, product).
It supports complex calculations, trend analysis, and decision-making.
Summary:
Ans:
ANS:
Key Differences:
1. ROLLUP:
Definition: Aggregates data by moving up a dimension hierarchy or by
removing a dimension, producing more summarized data.
Example: Summarizing monthly sales up to quarterly or yearly totals.
2. DRILL DOWN:
Definition: The reverse of rollup; moves from summarized data to more
detailed data down a dimension hierarchy.
Example: Breaking yearly sales down into quarterly or monthly figures.
3. SLICE:
Definition: Selects a single value for one dimension, producing a sub-cube
with that dimension fixed.
Example: Viewing sales data for the year 2019 only, across all regions and
products.
4. DICE:
Definition: Similar to slice but involves selecting specific values from
multiple dimensions, creating a subcube of data.
Example: Viewing sales data for specific regions and specific years,
creating a smaller data set by selecting particular values for multiple
dimensions (e.g., "North Region" and "2019").
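A small pure-Python sketch of how slice, dice, and roll-up select and aggregate data from an in-memory fact table (the sample records and dimension names are made up):
```python
from collections import defaultdict

# Hypothetical mini fact table: (region, year, quarter, sales).
facts = [
    ("North", 2019, "Q1", 100), ("North", 2019, "Q2", 120),
    ("South", 2019, "Q1",  80), ("North", 2020, "Q1", 150),
]

# SLICE: fix a single dimension (year = 2019).
slice_2019 = [f for f in facts if f[1] == 2019]

# DICE: pick specific values from several dimensions (North region AND 2019).
dice = [f for f in facts if f[0] == "North" and f[1] == 2019]

# ROLL-UP: aggregate from the quarter level up to the region level.
rollup = defaultdict(int)
for region, year, quarter, sales in facts:
    rollup[region] += sales            # summarise across year and quarter

# DRILL-DOWN is the reverse: break region totals back down by year/quarter.
print(slice_2019, dice, dict(rollup))
```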
Summary:
Ans:
1. Marketing:
o Customer Segmentation: Identifying distinct groups of customers
for targeted marketing.
o Market Basket Analysis: Analyzing purchasing patterns to
recommend related products.
2. Finance:
o Fraud Detection: Identifying unusual patterns in transactions to
detect fraudulent activities.
o Credit Scoring: Evaluating customer credit risk using historical
data.
3. Healthcare:
o Disease Prediction: Identifying patterns in patient data to predict
health conditions or diseases.
o Treatment Optimization: Finding the most effective treatments
based on patient profiles.
4. Retail:
o Inventory Management: Predicting demand for products to
optimize stock levels.
o Customer Churn Prediction: Identifying customers likely to
leave and taking preventive actions.
5. Manufacturing:
o Quality Control: Identifying defects and process improvements by
analyzing production data.
o Predictive Maintenance: Forecasting equipment failures based on
usage patterns.
Summary:
1. Data Cleaning:
Handling missing values, removing duplicates, and correcting noisy or
inconsistent entries.
2. Data Transformation:
Normalizing, scaling, or encoding data into a form suitable for mining
algorithms.
3. Data Integration:
Combining data from multiple sources into a single consistent dataset,
resolving naming and format conflicts.
4. Data Reduction:
Dimensionality Reduction: Reducing the number of features using
techniques like PCA (Principal Component Analysis) to simplify models
without losing significant information.
Feature Selection: Selecting the most relevant features for analysis,
reducing the dataset size and improving performance.
5. Discretization:
Converting continuous attributes into discrete intervals or categories (e.g.,
grouping ages into age bands).
6. Feature Engineering:
Creating new, more informative features from the existing attributes.
Summary:
Pre-processing techniques are crucial for improving data quality and preparing
it for analysis. These include cleaning, transforming, integrating, reducing,
discretizing, and engineering features to ensure the data is accurate, relevant,
and in a suitable format for modeling.
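As an illustration of the dimensionality-reduction step mentioned above, a short NumPy sketch of PCA computed from the covariance matrix (the sample matrix is invented):
```python
import numpy as np

# Hypothetical data: 5 samples with 3 correlated features.
X = np.array([[2.5, 2.4, 1.0],
              [0.5, 0.7, 0.2],
              [2.2, 2.9, 1.1],
              [1.9, 2.2, 0.9],
              [3.1, 3.0, 1.4]])

# 1. Centre the data (subtract each feature's mean).
Xc = X - X.mean(axis=0)

# 2. Eigen-decompose the covariance matrix of the centred data.
cov = np.cov(Xc, rowvar=False)
eigvals, eigvecs = np.linalg.eigh(cov)

# 3. Keep the two principal components with the largest eigenvalues.
top2 = eigvecs[:, np.argsort(eigvals)[::-1][:2]]

# 4. Project the data: 3 features reduced to 2 components.
X_reduced = Xc @ top2
print(X_reduced.shape)   # (5, 2)
```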
Q17 Issues during data integration.
Ans:
Issues during Data Integration arise when combining data from multiple
sources, leading to challenges in ensuring consistency, quality, and
compatibility.
1. Data Inconsistency:
Problem: The same data may be stored in different formats or units across
sources (e.g., dates as DD/MM/YYYY in one system and MM-DD-YYYY in
another).
Solution: Standardize formats and apply cleansing rules during integration.
2. Data Redundancy:
Problem: The same records or attributes may be duplicated across sources,
wasting storage and causing confusion.
Solution: Detect and eliminate duplicates during integration.
3. Data Conflicts:
Problem: Different sources may report conflicting values for the same
attribute (e.g., two different addresses for one customer).
Solution: Define conflict-resolution rules or prefer the most reliable source.
4. Missing Data:
Problem: Some sources may lack values for attributes that other sources
provide.
Solution: Use default values, imputation, or flag the gaps for later handling.
5. Semantic Mismatch:
Problem: Different data sources might use different names or structures
for the same concepts (e.g., "customer" in one source might be called
"client" in another).
Solution: Use mapping techniques or data dictionaries to align semantic
meaning.
Summary:
Key Concepts:
Example:
In a grocery store, market basket analysis may reveal that customers who
buy bread are also likely to buy butter. This insight can help businesses
with product placement, cross-selling, or targeted promotions.
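A minimal sketch of the support and confidence calculation behind the bread-and-butter rule, using hypothetical transactions:
```python
# Hypothetical transactions (baskets) from a grocery store.
transactions = [
    {"bread", "butter", "milk"},
    {"bread", "butter"},
    {"bread", "jam"},
    {"milk", "butter"},
]

n = len(transactions)
bread = sum("bread" in t for t in transactions)
both = sum({"bread", "butter"} <= t for t in transactions)

support = both / n          # fraction of baskets containing bread AND butter
confidence = both / bread   # of the baskets with bread, how many also have butter

print(f"support={support:.2f}, confidence={confidence:.2f}")   # 0.50, 0.67
```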
Q21 Explain the associative classification method in detail.
How It Works:
Key Concepts:
Advantages:
Example:
Rule: "If Age = 30-40, Gender = Female, then Class = 'High Income'".
o Given an instance (Age = 35, Gender = Female), the rule would
classify it as High Income based on the association rules.
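A tiny sketch of the classification step, assuming the association rules above have already been mined; one common strategy, shown here, is to apply the highest-confidence rule whose conditions match the instance (the rules and attribute names are hypothetical):
```python
# Class-association rules mined beforehand: (conditions, class label, confidence).
rules = [
    ({"age_band": "30-40", "gender": "Female"}, "High Income", 0.82),
    ({"age_band": "20-30"},                     "Low Income",  0.75),
]

def classify(instance):
    # Try rules in decreasing order of confidence; use the first full match.
    for conditions, label, conf in sorted(rules, key=lambda r: -r[2]):
        if all(instance.get(k) == v for k, v in conditions.items()):
            return label
    return "unknown"   # default class when no rule fires

print(classify({"age_band": "30-40", "gender": "Female"}))   # High Income
```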
Summary:
Ans: Classification:
Classification is a supervised learning technique in data mining and machine
learning where the goal is to predict the categorical class label of an object
based on its features. The model is trained on labeled data, where each instance
has an associated class label. The trained model is then used to classify new,
unseen instances.
Steps in ID3:
1. Select the Best Feature: At each node, choose the feature that provides
the highest information gain (the feature that best separates the data).
o Information Gain measures the effectiveness of a feature in
classifying data. It is calculated using Entropy (a measure of
impurity) and the difference in entropy before and after splitting
on a feature.
2. Split the Dataset: Divide the dataset into subsets based on the selected
feature.
3. Recursion: Apply the same process to each subset, recursively building
the tree.
4. Stopping Criteria: Stop when:
o All data in a subset belong to the same class.
o No more features are available to split the data.
o A predefined depth is reached.
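A short sketch of the entropy and information-gain calculation used in step 1, on a made-up one-feature dataset:
```python
from math import log2
from collections import Counter

def entropy(labels):
    # Entropy = -sum(p * log2(p)) over the class proportions.
    total = len(labels)
    return -sum((c / total) * log2(c / total) for c in Counter(labels).values())

def information_gain(rows, labels, feature_index):
    # Gain = entropy(parent) - weighted entropy of the subsets after splitting.
    parent = entropy(labels)
    subsets = {}
    for row, label in zip(rows, labels):
        subsets.setdefault(row[feature_index], []).append(label)
    weighted = sum(len(s) / len(labels) * entropy(s) for s in subsets.values())
    return parent - weighted

# Toy data: one feature (Outlook) against the Play label.
rows = [("Sunny",), ("Sunny",), ("Overcast",), ("Rain",), ("Rain",)]
labels = ["No", "No", "Yes", "Yes", "No"]
print(round(information_gain(rows, labels, 0), 3))   # gain of splitting on Outlook
```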
Example:
Given a small dataset for classifying whether someone will play tennis based on
weather conditions:
Steps:
Resulting Tree:
Outlook
  Sunny:
    High → No
    Low → No
  Overcast: Yes
  Rain:
    Mild → Yes
    Cool → Yes
Summary:
Ans: Classification:
Classification is a supervised learning technique in machine learning where the
goal is to assign a class label to an object based on its features. The model is
trained on labeled data and is used to predict the class of new, unseen instances.
CART is a decision tree algorithm used for both classification and regression
tasks. It builds a binary tree by recursively splitting the data into subsets based
on feature values, aiming to improve class purity (for classification) or
minimize variance (for regression).
1. Select Candidate Splits:
o For each feature, evaluate possible binary splits and measure the
impurity of the resulting subsets using the Gini Index (for
classification) or variance (for regression).
2. Choose the Best Split:
o Pick the split that produces the purest (lowest-impurity) child nodes
and divide the data into two subsets.
3. Recursively Apply:
o This process is repeated recursively for each subset until the
stopping criteria are met, such as a minimum node size or a
predefined tree depth.
4. Assign Leaf Labels:
o Each leaf node is assigned the majority class of the records it
contains (or the mean value, for regression).
5. Pruning:
o After constructing the tree, it may be pruned to avoid overfitting.
Pruning removes branches that have little predictive power or
contribute to model complexity.
Example:
Given a dataset of customers with features like Age, Income, and Purchase
(whether the customer made a purchase or not), the CART algorithm might
build a tree that splits based on Income first (e.g., "Income > 50,000"), then
further splits on Age (e.g., "Age ≤ 30").
Summary:
CART builds binary decision trees using Gini Impurity for classification.
It recursively splits the data to create pure or homogeneous subsets,
assigning class labels at leaf nodes.
Pruning is used to prevent overfitting and improve the model's
generalization ability.
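A minimal sketch of the Gini impurity measure CART uses to evaluate binary splits (the class labels are made up):
```python
from collections import Counter

def gini(labels):
    # Gini impurity = 1 - sum(p_i^2) over the class proportions.
    total = len(labels)
    return 1 - sum((c / total) ** 2 for c in Counter(labels).values())

def split_gini(left_labels, right_labels):
    # Weighted Gini of a binary split, which CART tries to minimise.
    n = len(left_labels) + len(right_labels)
    return (len(left_labels) / n) * gini(left_labels) + \
           (len(right_labels) / n) * gini(right_labels)

# Hypothetical split on "Income > 50,000".
print(gini(["Yes", "Yes", "No", "No"]))                 # 0.5  (impure parent node)
print(split_gini(["Yes", "Yes", "Yes"], ["No", "No"]))  # 0.0  (perfectly pure split)
```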
Advantages:
Example:
For a dataset with attributes like Age, Income, and Education, C4.5 might
create a decision tree where a node tests Income > 50,000, then further splits
based on Age < 30 or Education = Bachelor.
Summary:
Ans: Regression:
Equation:
For simple linear regression with one feature, the equation is:
y = β₀ + β₁x + ε
Where:
y is the predicted target value, x is the input feature, β₀ is the intercept,
β₁ is the slope (the coefficient of x), and ε is the error term.
1. Fit the Model: The model finds the best-fit line by minimizing the sum of
squared errors (differences between predicted and actual values).
2. Make Predictions: Once the model is trained, it can predict the target
value for new input features by applying the learned coefficients (β₀ and
β₁).
Example:
If you're predicting a person's salary (y) based on their years of experience (x),
linear regression finds the best-fit line relating the two. If the fitted line were, for
example, salary = 25,000 + 5,000 × experience, then each additional year of
experience would add 5,000 to the predicted salary.
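A short sketch fitting simple linear regression with the closed-form least-squares formulas (the experience/salary figures are invented):
```python
# Hypothetical training data: years of experience (x) vs salary (y).
x = [1, 2, 3, 4, 5]
y = [30000, 35000, 40000, 45000, 50000]

n = len(x)
mean_x, mean_y = sum(x) / n, sum(y) / n

# Closed-form least-squares estimates of slope (b1) and intercept (b0).
b1 = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y)) / \
     sum((xi - mean_x) ** 2 for xi in x)
b0 = mean_y - b1 * mean_x

print(b0, b1)         # 25000.0 and 5000.0: salary = 25,000 + 5,000 * experience
print(b0 + b1 * 6)    # predicted salary for 6 years of experience: 55000.0
```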
For two points P(x₁, y₁) and Q(x₂, y₂), the Euclidean distance is calculated as:
d(P, Q) = √((x₂ − x₁)² + (y₂ − y₁)²)
Numerical Example:
Consider a dataset with points labeled as Class 1 (C1) and Class 2 (C2):
Point X Y Class
P1 2 3 C1
P2 6 7 C2
P3 3 3 C1
P4 5 5 C2
Now, we want to predict the class of a new point P0 with coordinates (4, 4), using
k = 3 nearest neighbors.
Distances from P0: to P1 = √5 ≈ 2.24, to P2 = √13 ≈ 3.61, to P3 = √2 ≈ 1.41, and
to P4 = √2 ≈ 1.41, so the 3 nearest neighbors are P3 (C1), P4 (C2), and P1 (C1).
Since the majority of the nearest neighbors (2 out of 3) belong to Class C1, the
predicted class for P0 is C1.
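A short sketch of the same k-NN prediction in Python, using the four labelled points and k = 3 from the example:
```python
from math import dist               # Euclidean distance (Python 3.8+)
from collections import Counter

# Labelled points from the example above: (x, y) -> class.
points = {(2, 3): "C1", (6, 7): "C2", (3, 3): "C1", (5, 5): "C2"}
p0, k = (4, 4), 3

# Take the k points closest to p0 and vote by majority class.
nearest = sorted(points, key=lambda p: dist(p, p0))[:k]
votes = Counter(points[p] for p in nearest)
print(nearest, votes.most_common(1)[0][0])   # majority class -> C1
```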
Summary:
Steps:
Example:
Given points: A(1, 2), B(2, 3), C(6, 5), D(7, 8), the agglomerative process would
look like:
1. Start with four singleton clusters: {A}, {B}, {C}, {D}.
2. Merge the closest pair, A and B (distance √2 ≈ 1.41), giving {A, B}, {C}, {D}.
3. Merge the next closest pair, C and D (distance √10 ≈ 3.16), giving {A, B}, {C, D}.
4. Finally, merge {A, B} and {C, D} into one cluster containing all points.
Divisive clustering starts with all data points in one cluster and recursively splits
the cluster into smaller clusters until each data point is in its own cluster.
Steps:
Example:
Given the same points: A(1, 2), B(2, 3), C(6, 5), D(7, 8):
1. Start with one cluster containing all points: {A, B, C, D}.
2. Split it into the two most separated groups: {A, B} and {C, D}.
3. Continue splitting until each point forms its own cluster: {A}, {B}, {C}, {D}.
Summary:
Euclidean Distance:
To measure the similarity between a data point and a centroid, the Euclidean
distance is used:
d = √((x₂ − x₁)² + (y₂ − y₁)²)
Where (x₁, y₁) and (x₂, y₂) are the coordinates of the two points.
Numerical Example:
Given Data:
Points:
P1: (1, 2)
P2: (2, 3)
P3: (6, 5)
P4: (7, 8)
Step 1: Choose k = 2 and pick initial centroids (for example, P1 and P3).
Step 2: Assign each point to its nearest centroid:
Cluster 1: P1, P2
Cluster 2: P3, P4
Step 3: Update Centroids:
New centroid of Cluster 1 = mean of P1 and P2 = (1.5, 2.5)
New centroid of Cluster 2 = mean of P3 and P4 = (6.5, 6.5)
Step 4: Reassign Points: with the updated centroids, every point stays in its
current cluster.
Step 5: Convergence:
Since the centroids no longer change, the algorithm has converged, and the final
clusters are:
Cluster 1: P1, P2
Cluster 2: P3, P4
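A compact sketch of the k-means assignment/update loop on the four points above (the initial centroids P1 and P3 are an assumption):
```python
from math import dist

points = [(1, 2), (2, 3), (6, 5), (7, 8)]
centroids = [(1, 2), (6, 5)]           # assumed initial centroids (k = 2)

for _ in range(10):                    # iterate until the centroids stabilise
    # Assignment step: attach each point to its nearest centroid.
    clusters = [[], []]
    for p in points:
        clusters[min((0, 1), key=lambda i: dist(p, centroids[i]))].append(p)

    # Update step: move each centroid to the mean of its cluster
    # (assumes no cluster becomes empty in this small example).
    new_centroids = [
        (sum(x for x, _ in c) / len(c), sum(y for _, y in c) / len(c))
        for c in clusters
    ]
    if new_centroids == centroids:     # convergence: centroids unchanged
        break
    centroids = new_centroids

print(clusters, centroids)   # [[(1,2),(2,3)], [(6,5),(7,8)]] and [(1.5,2.5), (6.5,6.5)]
```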
Summary:
1. Boolean Model:
o Uses logical operators (AND, OR, NOT) to retrieve documents
based on keyword matches.
o Example: Searching for documents containing "data" AND
"mining".
2. Vector Space Model:
o Represents documents and queries as vectors in a multi-
dimensional space, where each dimension corresponds to a term.
o Similarity between a query and documents is measured using
cosine similarity.
o Example: A query is represented as a vector, and the cosine
similarity between the query and document vectors determines
relevance (see the sketch after this list).
3. Probabilistic Model:
o Estimates the probability of a document being relevant to a given
query, often using models like BM25.
o It ranks documents based on the likelihood of relevance.
4. Latent Semantic Indexing (LSI):
o Reduces the dimensionality of the term-document matrix to capture
underlying patterns between terms and documents, improving
search accuracy for synonymy and polysemy issues.
5. Machine Learning-Based IR:
o Uses machine learning algorithms to rank documents based on user
interaction and feedback. It improves with more data and user
feedback.
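A minimal sketch of the vector space model's cosine similarity over simple term-count vectors (the two documents and the query are made up):
```python
from math import sqrt
from collections import Counter

def cosine(a, b):
    # cos(theta) = (a . b) / (|a| * |b|) over the shared vocabulary.
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm = sqrt(sum(v * v for v in a.values())) * sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0

docs = ["data mining finds patterns in data",
        "web mining analyses web pages"]
query = "data mining"

# Represent query and documents as term-count vectors, then rank by similarity.
q_vec = Counter(query.split())
scores = [(cosine(Counter(d.split()), q_vec), d) for d in docs]
print(max(scores))   # the first document ranks highest for this query
```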
Summary:
Text mining extracts valuable insights from unstructured text data.
Information Retrieval methods like Boolean, Vector Space,
Probabilistic, LSI, and machine learning-based techniques are used to
find relevant documents from a large collection based on a query.
Summary:
Web Content Mining: Extracts data from the content of web pages.
Web Structure Mining: Analyzes web page link structures.
Web Usage Mining: Studies user behavior and interaction with websites.