ADBMS IMP Questions

Questions:

June 2024:

Module 1:
1. Differentiate between parallel and distributed databases. (2024-June)

2. Define a parallel database. Explain the different types of parallel database architectures in detail. (2024-June)

3. Define abstract data type. Discuss the operations on structured data. (2024-June)

7/2023

4. Differentiate between ORDBMS and OODBMS.

5. Explain distributed database architecture.

6. What is an abstract data type? Explain with a suitable example.

16 JAN 2024

7. ADT

8. What is a distributed database? Explain the types of distributed database architecture.

9. OODBMS vs ORDBMS

1/03/2023

10. Abstract Data Types

11. Explain parallel database architecture.

Module 2:
1. Differentiate DM vs OLAP.

1. Define OLAP. Discuss OLAP models with a neat diagram. (2024-June)

2. What is dimensional modelling? Discuss dimensional modelling techniques in a data warehouse. (2024-June)

7/2023

3. Star and snowflake schema.

4. OLAP vs Data Mining (Modules 2-3)

5. Explain data warehouse architecture in detail.

16 JAN 2024

6. OLAP operations.

7. Define data warehouse. Explain the ETL process in detail.

8. Differentiate OLAP vs OLTP.

1/03/2023

9. OLAP vs OLTP
10. Star, snowflake, and fact constellation schemas

Module 3:
1. Why is data preprocessing important in data mining? (2024-June)
2. Explain data reduction techniques in detail. (2024-June)

7/2023

3. Data preprocessing

4. Explain the KDD process in detail.

5. Explain data reduction techniques.

16 JAN 2024

6. Explain all phases involved in the KDD process.

7. Define data mining. Explain data preprocessing techniques used in the data mining process.

1/03/2023

8. Explain data preprocessing in detail.

9. Explain the KDD process in detail.

Module 4:
1. What is association rule mining, and what is its primary goal? What is the Apriori algorithm? How does the Apriori algorithm handle the generation of frequent itemsets? (2024-June)

7/2023

16 JAN 2024

2. Explain associative classification.

1/03/2023

3. Explain the decision tree used in classification. Compare the ID3, C4.5, and CART classification algorithms.
4. Explain associative classification.

Module 5:
1. Define classification. Explain the KNN algorithm with a suitable example. (2024-June)

7/2023

2. Explain Bayesian classification


3. Explain the Decision tree used in classification. Explain ID3 algorithm with a suitable example.

16 JAN 2024

4. Decision tree
5. Define and explain Bayesian and Naïve Bayesian classification.

1/03/2023

6. Regression analysis

7. Bayes' theorem

8. K-nearest neighbor classification

Module 6:
1. What is agglomerative clustering? Explain with an example. (2024-June)

7/2023

2. Differentiate agglomerative and divisive clustering.

3. Why is clustering an unsupervised technique? Explain the K-means algorithm.

16 JAN 2024

4. Define clustering. Describe hierarchical clustering in detail.

1/03/2023

5. Hierarchical clustering

Module 7:
1. Write a note on web usage mining. (2024-June)
2. Explain text mining and discuss in brief the information retrieval methods. (2024-June)

7/2023

3. Text retrieval methods

4. Explain web mining.

16 JAN 2024

4. Web mining

5. Text retrieval methods

1/03/2023

6. Text retrieval methods


7. Web mining

ANSWERS

What is Parallel Database?


A parallel DBMS is a DBMS that runs across multiple processors and is designed to execute operations in
parallel, whenever possible. The parallel DBMS links several smaller machines to achieve the same throughput
as expected from a single large machine.

Features
• CPUs work in parallel
• Performance is improved
• Large tasks are divided into many smaller tasks
• Work is completed very quickly

Advantages of Parallel Databases


• Increased Speed and Efficiency: A parallel database can execute several queries at the same time, which reduces the total processing time.
• Improved Resource Utilization: Parallel databases take advantage of multi-processor architectures, using as many CPU cores as possible.
• High Throughput: They complete a large number of computations in less time than sequential methods, making them suitable for large-scale workloads.

Disadvantages of Parallel Databases


• Complexity in Maintenance: Parallel databases can be difficult to maintain because synchronization and other parts of parallel processing need special attention.
• Higher Costs: Parallel databases require substantial hardware, so they can be more costly to set up and operate.

What is Distributed Database?


A distributed database is a logically related collection of shared data that is physically distributed over a computer network across different sites. A distributed DBMS is the software that manages the distributed database and makes the distributed data available to users.

Features
1. It is a group of logically related shared data
2. The data gets split into various fragments
3. There may be a replication of fragments
4. The sites are linked by a communication network
The main difference between parallel and distributed databases is that the former is tightly coupled and the latter loosely coupled.

Advantages of Distributed Databases


• Fault Tolerance: Since the data is distributed across several sites, failure of one site does not bring down the entire system.
• Scalability: A distributed database can easily be scaled up by adding nodes to the network, which makes it ideal for growing businesses.
• Local Autonomy: Every site can administer its own database while still participating in the collective database.

Disadvantages of Distributed Databases


• Complex Data Management: Keeping data consistent across multiple sites introduces problems such as data latency.
• Security Concerns: The more places data is stored in, the higher the risk of breaches, so strict security measures must be enforced.
2. Define a parallel database. Explain the different types of parallel database architectures in detail.
(2024-June)
Answer :
Parallel Databases:
• Organizations need to handle and maintain substantial amounts of data with a higher transfer rate and greater system efficiency.
• A parallel database system improves performance through parallelization of varied operations such as loading, manipulating, storing, building and evaluating.
• Processing speed and efficiency are increased by using multiple disks and CPUs in parallel.
• Figures 4, 5 and 6 show the different architectures proposed and successfully implemented in the area of parallel database systems.
• In the figures, P represents Processors, M represents Memory, and D represents Disks/Disk setups.
• Parallel database systems are classified into two groups:
i. Multiprocessor architecture and
ii. Hierarchical System or Non-Uniform Memory Architecture
Multiprocessor architecture:
It has the following alternatives:
 Shared memory architecture
 Shared disk architecture
 Shared nothing architecture
Shared memory architecture:
In shared memory architecture, multiple processors share a single primary/main memory along with the disk setup. As shown in Figure 4, several processors are connected through an interconnection network to the main memory and disks. The interconnection network is usually a high-speed network, making data sharing among the various components easy.

Advantages:
• Simple to implement
• Effective communication among the processors
• Less communication overhead

Disadvantages:
• Limited degree of parallelism
• Addition of a processor slows down the existing processors
• Cache coherency needs to be maintained
• Bandwidth issues
Shared disk architecture:
As shown in figure 5, in shared disk architecture each processor has its own private memory sharing the
single mass storage in common.

Advantages:
• Fault tolerance is achieved
• Interconnection to the memory is not a bottleneck
• Supports a large number of processors

Disadvantages:
• Limited scalability
• Inter-processor communication is slow

Applications:
• Digital Equipment Corporation (DEC)

Shared nothing architecture:


As shown in Figure 6, in shared nothing architecture each processor has its own main memory and mass storage device setup. The entire setup is a collection of individual computers connected via a high-speed communication network.
Advantages:
• Flexible to add any number of processors
• Data requests can be forwarded via the interconnection network
Disadvantages:
• Data partitioning is required
• Cost of communication is higher
Applications:
• Teradata
• Oracle nCUBE
• The Grace and Gamma research prototypes
• Tandem, etc.

Hierarchical System or Non-Uniform Memory Architecture:


• Non-Uniform Memory Architecture (NUMA) has non-uniform memory access.
• A cluster is formed by a group of connected computers, including shared nothing, shared disk, and other setups.
• NUMA takes longer to communicate among nodes as it uses both local and remote memory.
Advantages:
• Improved performance
• High availability
• Proper resource utilization
• Highly reliable
Disadvantages:
• High cost
• Numerous resources
• Complexity in managing the systems

3. Define abstract data type. Discuss the operations on structured data. (2024-June)
Answer:

IV. Abstract Data Types


• ADT (Abstract Data Type) is a user-defined data type (also referred to as a UDT).
• Abstract data types are data types that consist of one or more subtypes.
• Rather than being constrained to the standard Oracle data types of NUMBER, DATE, and VARCHAR2, abstract data types can more accurately describe your data.
______________________________________________________________________________________

Explain Abstract Data types?


Answer:
Various Abstract Data types as follows:
I. CLOB, BLOB
II. Varray
III. Nested Tables
IV. Abstract Data Type
V. Methods
VI. Inheritance
Description: -
 Relational database management systems (RDBMSs) are the standard tool for managing business
data.
 They provide reliable access to huge amounts of data for millions of businesses around the world
every day.
 Oracle is an object-relational database management system (ORDBMS), which means that users can
define additional kinds of data--specifying both the structure of the data and the ways of operating on
it--and use these types within the relational model.
 This approach adds value to the data stored in a database.
 User-defined datatypes make it easier for application developers to work with complex data such as
images, audio, and video.
 Object types store structured business data in its natural form and allow applications to retrieve it that
way.
 For that reason, they work efficiently with applications developed using object-oriented programming
techniques.

I. CLOB, BLOB, BFILE
 Large Objects (LOBs) are a set of data types that are designed to hold large amounts of data.
 A LOB can hold up to a maximum size ranging from 8 terabytes to 128 terabytes depending on how
your database is configured.
 Storing data in LOBs enables you to access and manipulate the data efficiently in your application.
 The built-in LOB data types BLOB, CLOB and NCLOB (stored internally), and BFILE (stored
externally), can store large and unstructured data such as text, images and spatial data up to 4
gigabytes in size.
 BLOB
 The BLOB data type stores binary large objects. BLOB can store up to 4 gigabytes of binary data.

 CLOB
 The CLOB data type stores character large objects. CLOB can store up to 4 gigabytes of character
data.

 NCLOB
 The NCLOB data type stores character large objects in multibyte national character set. NCLOB can
store up to 4 gigabytes of character data.

 BFILE
 The BFILE data type enables access to binary file LOBs that are stored in file systems outside the
Oracle database. A BFILE column stores a locator, which serves as a pointer to a binary file on the
server's file system. The maximum file size supported is 4 gigabytes
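As a small, hypothetical illustration of these LOB types (the table, directory, and file names below are made up for the example and are not part of the syllabus answer):

CREATE TABLE product_media
(
  product_id  NUMBER PRIMARY KEY,
  description CLOB,   -- large character data stored inside the database
  photo       BLOB,   -- binary data such as an image
  manual_doc  BFILE   -- locator pointing to a file stored outside the database
);

-- A BFILE column needs a directory object and is populated with BFILENAME, e.g.:
-- CREATE OR REPLACE DIRECTORY media_dir AS '/u01/app/media';
-- INSERT INTO product_media VALUES (1, 'Long description...', EMPTY_BLOB(), BFILENAME('MEDIA_DIR', 'manual1.pdf'));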
II. Variable-Sized Array (VARRAY)
• Items of type VARRAY are called varrays.
• They allow you to associate a single identifier with an entire collection.
• This association lets you manipulate the collection as a whole and reference individual elements easily.
• To reference an element, you use standard subscripting syntax.
• A varray has a maximum size, which you must specify in its type definition.
• Its index has a fixed lower bound of 1 and an extensible upper bound.
• Thus, a varray can contain a varying number of elements, from zero (when empty) to the maximum specified in its type definition.
• The basic Oracle syntax of the CREATE TYPE statement for a VARRAY type definition is:
CREATE OR REPLACE TYPE name-of-type IS VARRAY(nn) OF type;
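As a brief sketch of how the syntax above might be used (the type, table, and column names are illustrative only):

CREATE OR REPLACE TYPE phone_list AS VARRAY(3) OF VARCHAR2(15);
/
CREATE TABLE customer_contacts
(
  cust_id NUMBER,
  phones  phone_list
);

-- The whole collection is supplied through the type's constructor:
INSERT INTO customer_contacts VALUES (101, phone_list('9812345678', '022-2345678'));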

III. Nested Tables


• Within the database, nested tables can be considered one-column database tables.
• Oracle stores the rows of a nested table in no particular order.
• But when you retrieve the nested table into a PL/SQL variable, the rows are given consecutive subscripts starting at 1.
• That gives you array-like access to individual rows.
• PL/SQL nested tables are like one-dimensional arrays.
• You can model multi-dimensional arrays by creating nested tables whose elements are also nested tables.
• Syntax: CREATE OR REPLACE TYPE type_name AS TABLE OF type;
IV. Abstract Data Types
• ADT (Abstract Data Type) is a user-defined data type (also referred to as a UDT).
• Abstract data types are data types that consist of one or more subtypes.
• Rather than being constrained to the standard Oracle data types of NUMBER, DATE, and VARCHAR2, abstract data types can more accurately describe your data.

V. Methods/Member functions in Abstract Data Types


 A function or procedure subprogram associated with the ADT that is referenced as an attribute.
 Typically, you invoke MEMBER methods in a selfish style, such as object_expression.method().
 This class of method has an implicit first argument referenced as SELF in the method body, which
represents the object on which the method was invoked.
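A minimal, hypothetical sketch of an ADT with a MEMBER function, showing the implicit SELF argument described above (the type, method, and object names are invented for illustration):

CREATE OR REPLACE TYPE rectangle_typ AS OBJECT
(
  len NUMBER,
  wid NUMBER,
  MEMBER FUNCTION area RETURN NUMBER
);
/
CREATE OR REPLACE TYPE BODY rectangle_typ AS
  MEMBER FUNCTION area RETURN NUMBER IS
  BEGIN
    -- SELF refers to the object on which the method is invoked
    RETURN SELF.len * SELF.wid;
  END;
END;
/
-- Invoked in the "selfish" style, e.g. on a hypothetical column room_shape of this type:
-- SELECT r.room_shape.area() FROM rooms r;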
4. Differentiate between ORDBMS and OODBMS.

Explain distributed database architecture.


Distributed Database:
In a distributed database, data is distributed among the different database systems of an organization, connected via communication links, which helps end users to access the data easily. Examples: Oracle, Apache Cassandra, HBase, Ignite, etc. Distributed database systems can be further classified into:

• Homogeneous DDB: Executes on the same operating system, using the same application process and the same hardware devices.
• Heterogeneous DDB: Executes on different operating systems, with different application procedures and different hardware devices.
Advantages of Distributed Database:
 Modular development.
 Server failure will not affect the entire data set.
Common architectural models are:
1. Client - Server Architecture for DDBMS
2. Peer - to - Peer Architecture for DDBMS
3. Multi - DBMS Architecture
Client - Server Architecture for DDBMS:
This is a two-level architecture in which the functionality is divided into servers and clients. Server functions primarily encompass data management, query processing, optimization and transaction management, whereas client functions mainly cover the user interface along with common functionalities like consistency checking and transaction management.
Client - server architectures are classified as:
 Single Server Multiple Client:

 Multiple Server Multiple Client :

 Peer- to-Peer Architecture for DDBMS


In Peer-to-Peer architecture, each peer acts both as a client and a server to impart database services and
share their resource with other peers to coordinate their activities.
This architecture commonly has four levels of schemas:
• Global Conceptual Schema: Illustrates the global logical view of data.
• Local Conceptual Schema: Illustrates logical data organization at each site.
• Local Internal Schema: Illustrates physical data organization at each site.
• External Schema: Illustrates the user view of data.
Multi - DBMS Architectures:
Is an integrated database system formed by a collection of two or more autonomous database systems.
Multi-DBMS can be expressed through six levels of schemas –
 Multi-database View Level − Illustrates multiple user views comprising of subsets of the integrated
distributed database.
 Multi-database Conceptual Level − Illustrates integrated multidatabase that comprises of global logical
multi-database structure definitions.
 Multi-database Internal Level − Illustrates the data distribution across different sites and multi-
database to local data mapping.
 Local database View Level − Illustrates public view of local data.
 Local database Conceptual Level − Illustrates local data organization at each site.
 Local database Internal Level − Illustrates physical data organization at each site.

Two design alternatives for Multi - DBMS Architectures are:


 Model with multi-database conceptual level.
 Model without multi-database conceptual level.
5. What is an abstract data type? Explain with a suitable example.
Answer:
• ADT (Abstract Data Type) is a user-defined data type (also referred to as a UDT).
• Abstract data types are data types that consist of one or more subtypes.
• Rather than being constrained to the standard Oracle data types of NUMBER, DATE, and VARCHAR2, abstract data types can more accurately describe your data.

Example:
Create type Address
CREATE OR REPLACE TYPE address AS OBJECT
(
street char(20),
city char(20),
state char(2),
zip char(5)
);

Create a table called test_adt with the following columns, and describe the new test_adt table

CREATE TABLE test_adt


(
first_name char(20),
last_name char(20),
full_address address
);
Insert five (5) rows into your test_adt table.
INSERT INTO test_adt VALUES ('Joe','Palooka',address('41 Cherise Ave.', 'Minot','ND','66654'));

Show only the last_name, zip, and city columns


SELECT last_name,t.full_address.zip,t.full_address.city FROM test_adt t;
What is a distributed database? Explain the types of distributed database architecture.
Distributed Database:
In a distributed database, data is distributed among the different database systems of an organization at different sites, connected via communication links, which helps end users to access the data easily.

Distributed databases are classified into homogeneous and heterogeneous, each with further sub-divisions. Examples: Oracle, Apache Ignite, Apache Cassandra, Apache HBase, Couchbase Server, Amazon SimpleDB, Clusterpoint, and FoundationDB.

Homogeneous Distributed Databases:


As illustrated in figure , all the sites use identical DBMS & operating systems and have the following
properties:
 Similar software.
 Identical DBMS from the same vendor.
 Aware of all other neighboring sites cooperating with each other to process user requests.
 In case of a single database it is accessed through a single interface.
Types of Homogeneous Distributed Database
1. Autonomous
2. Non-autonomous
• Autonomous: Each database is independent and functions on its own; the databases are integrated by controlling software and use message passing to share data updates.
• Non-autonomous: Data is distributed across the homogeneous nodes, and a central or master DBMS coordinates data updates across the sites.

Heterogeneous Distributed Databases:


As illustrated in the figure, different sites have different operating systems, DBMS products and data models, and have the following properties:
• Different sites use varied schemas and software.
• The system is composed of varied DBMSs.
• Query processing is complex due to dissimilar schemas.
• Transaction processing is complex due to dissimilar software.
• A site is not aware of the other sites, leading to limited co-operation in processing user requests.

Types of Heterogeneous Distributed Databases


1. Federated
2. Un-federated
• Federated: These systems are independent in nature and are integrated together so that they function as a single database system.
• Un-federated: These systems employ a central coordinating module through which the databases are accessed.
Advantages:
• Organizational structure
• Shareability and local autonomy
• Improved availability
• Improved reliability
• Improved performance
• Economics
• Modular growth
Disadvantages:
 Complexity
 Cost
 Security
 Integrity Control More Difficult
 Lack of Standards
 Lack of Experience
 Database Design More Complex
MODULE 2
1. Differentiate DM vs OLAP.

Difference Between Dimensional Modeling and OLAP


Aspect | Dimensional Modeling | OLAP (Online Analytical Processing)
Definition | A design technique for organizing data in a data warehouse using fact and dimension tables. | A technology for analyzing data in multidimensional structures.
Purpose | Optimizes database design for query performance. | Analyzes and queries data to generate insights.
Focus | Data structure and schema design (e.g., star schema, snowflake schema). | Data aggregation, slicing, dicing, and visualization.
Components | Fact tables, dimension tables, schemas (star, snowflake). | Cubes, measures, dimensions, and hierarchies.
Output | A well-designed database structure for analytics. | Reports, charts, dashboards, and drill-down insights.
Conclusion
Dimensional modeling structures the database for efficient querying, while OLAP leverages that structure to
perform analytical operations and extract business insights.

Define OLAP. Discuss OLAP models with a neat diagram. (2024-June)


• Online Analytical Processing (OLAP) is a category of software that allows users to analyse information from multiple database systems at the same time.
• It is a technology that enables analysts to extract and view business data from different points of view.
• Analysts frequently need to group, aggregate, and join data.
• These operations in relational databases are resource intensive. With OLAP, data can be pre-calculated and pre-aggregated, making analysis faster.
• OLAP (Online Analytical Processing) is the technology behind many Business Intelligence (BI) applications.
• OLAP is a powerful technology for data discovery, including capabilities for limitless report viewing, complex analytical calculations, and predictive "what if" scenario (budget, forecast) planning.
• OLAP performs multidimensional analysis of business data and provides the capability for complex calculations, trend analysis, and sophisticated data modelling.
• It is the foundation for many kinds of business applications for Business Performance Management, Planning, Budgeting, Forecasting, Financial Reporting, Analysis, Simulation Models, Knowledge Discovery, and Data Warehouse Reporting.
• OLAP enables end-users to perform ad hoc analysis of data in multiple dimensions, thereby providing the insight and understanding they need for better decision making.
• OLAP databases are divided into one or more cubes.
• The cubes are designed in such a way that creating and viewing reports becomes easy.
• OLAP stands for Online Analytical Processing.

ROLAP :
 ROLAP works with data that exist in a relational database. Facts and dimension tables are stored as
relational tables.
 It also allows multidimensional analysis of data and is the fastest growing OLAP.
Advantages of ROLAP model:
 High data efficiency. It offers high data efficiency because query performance and access language
are optimized particularly for the multidimensional data analysis.
 Scalability. This type of OLAP system offers scalability for managing large volumes of data, and even
when the data is steadily increasing.
Drawbacks of ROLAP model:
• Demand for higher resources: ROLAP needs high utilization of manpower, software, and hardware resources.
• Aggregate data limitations: ROLAP tools rely on SQL for all calculations on aggregate data, and SQL is not well suited to every kind of complex computation.
• Slow query performance: Query performance in this model is slow compared with MOLAP.

MOLAP :
 MOLAP uses array-based multidimensional storage engines to display multidimensional views of
data. Basically, they use an OLAP cube.
 Multidimensional OLAP (MOLAP) is a classical OLAP that facilitates data analysis by using a
multidimensional data cube.
 Data is pre-computed, pre-summarized, and stored in a MOLAP
 Using a MOLAP, a user can use multidimensional view data with different facts.
 MOLAP has all possible combinations of data already stored in a multidimensional array. MOLAP can
access this data directly.
 Hence, MOLAP is faster compared to Relational Online Analytical Processing (ROLAP).

MOLAP Advantages
• MOLAP can manage, analyze and store considerable amounts of multidimensional data.
• Fast query performance due to optimized storage, indexing, and caching.
• Smaller size of data as compared to the relational database.
• Automated computation of higher-level aggregates of the data.
• Helps users to analyze larger, less-defined data.
• MOLAP is easier for the user, which makes it a suitable model for inexperienced users.
• MOLAP cubes are built for fast data retrieval and are optimal for slicing and dicing operations.
• All calculations are pre-generated when the cube is created.

MOLAP Disadvantages
• One major weakness of MOLAP is that it is less scalable than ROLAP, as it handles only a limited amount of data.
• MOLAP also introduces data redundancy, as it is resource intensive.
• MOLAP solutions may be lengthy, particularly on large data volumes.
• MOLAP products may face issues while updating and querying models when there are more than ten dimensions.
• MOLAP is not capable of containing detailed data.
• Storage utilization can be low if the data set is highly sparse.
• Because it can handle only a limited amount of data, it is impossible to include a large amount of data in the cube itself.
Hybrid OLAP :
 Hybrid OLAP is a mixture of both ROLAP and MOLAP.
 It offers fast computation of MOLAP and higher scalability of ROLAP. HOLAP uses two databases.
 Aggregated or computed data is stored in a multidimensional OLAP cube
 Detailed information is stored in a relational database.
Other Types of OLAP
There are some other types of OLAP systems that are used in analyzing databases. Some of them are mentioned below.
• Web OLAP (WOLAP): It is a Web browser-based technology. A traditional OLAP application is accessed via a client/server setup, but this OLAP application is accessible via a web browser. It is a three-tier architecture that consists of a client, middleware, and database server. The most appealing features of this style of OLAP were (past tense intended, since few products categorize themselves this way) the considerably lower investment involved on the client side ("all that's needed is a browser") and enhanced accessibility to connect to the data. A Web-based application requires no deployment on the client machine. All that is needed is a Web browser and a network connection to the intranet or Internet.
• Desktop OLAP (DOLAP): DOLAP stands for desktop analytical processing. Users can download the data from the source and work with the dataset on their desktop. Functionality is limited compared to other OLAP applications. It has a cheaper cost.
• Mobile OLAP (MOLAP): Mobile OLAP provides wireless functionality for mobile devices. Users work with and access the data through mobile devices.
• Spatial OLAP (SOLAP): SOLAP emerges from merging the capabilities of both Geographic Information Systems (GIS) and OLAP into a single user interface. SOLAP was created because data comes in alphanumeric, image, and vector form. It provides easy and quick exploration of data that resides in a spatial database.

What is dimensional modelling? Discuss dimensional modelling techniques in a data warehouse.


Dimensional modelling
• A dimensional model is a data structure technique optimized for data warehousing tools.
• The concept of dimensional modelling was developed by Ralph Kimball and is comprised of "fact" and "dimension" tables.
• A dimensional model is designed to read, summarize, and analyze numeric information like values, balances, counts, weights, etc. in a data warehouse.
• In contrast, relational models are optimized for addition, updating and deletion of data in a real-time online transaction system.
• These dimensional and relational models have their own unique ways of data storage that have specific advantages.
• For instance, in the relational model, normalization and ER models reduce redundancy in data. On the contrary, the dimensional model arranges data in such a way that it is easier to retrieve information and generate reports.
• Hence, dimensional models are used in data warehouse systems and are not a good fit for relational systems.
Elements of Dimensional Data Model
• Fact
  o Facts are the measurements/metrics or facts from your business process.
  o They provide quantitative information about business processes.
  o E.g., for a Sales business process, a measurement would be the quarterly sales number.
  o Other examples of facts are quantity, sales_amount, total_earnings, profit, margin, total_turnover, cost, etc.
• Dimension
  o A dimension provides the context surrounding a business process event.
  o It describes facts.
  o In simple terms, dimensions give the who, what, and where of a fact.
  o In the Sales business process, for the fact quarterly sales number, dimensions would be
    - Who – Customer Names
    - Where – Location
    - What – Product Name
  o In other words, a dimension is a window to view information in the facts.
  o Through dimensions we analyze and categorize facts and measures, which enables end users to answer business questions.
  o Without dimensions we cannot measure facts.
  o Using dimensions, users can perform operations like sales by customer, or sales by customer in year 2017, or sales by customer in 2017 by a particular group.
  o To identify dimensions, ask questions on context like "What", "When", "Where", "Who".
• Attributes
  o The attributes are the various characteristics of the dimension.
  o In the Location dimension, the attributes can be
    - State
    - Country
    - Zipcode, etc.
  o Attributes are used to search, filter, or classify facts.
  o Dimension tables contain attributes.
• Fact Table
  o A fact table is a primary table in a dimensional model.
  o The fact table should have a primary (composite) key that is a combination of the foreign keys.
  o A fact table works with dimension tables.
  o A fact table holds the data to be analyzed, and a dimension table stores data about the ways in which the data in the fact table can be analyzed.
  o Thus, the fact table consists of two types of columns.
  o The foreign key columns allow joins with dimension tables, and the measure columns contain the data that is being analyzed.
• Dimension Table
  o A dimension table contains the dimensions of a fact.
  o Dimension tables are joined to the fact table via a foreign key.
  o Dimension tables are de-normalized tables.
  o The dimension attributes are the various columns in a dimension table.
  o Dimensions offer descriptive characteristics of the facts with the help of their attributes.
  o There is no set limit on the number of dimensions.
  o A dimension can also contain one or more hierarchical relationships.

Explain data warehouse architecture in detail.

Data Warehouse Architecture


A data warehouse (DW) is a centralized repository for storing integrated data from multiple sources,
designed to support analytical queries and decision-making. The architecture of a data warehouse defines its
framework and components.
Basic Single-Tier Architecture
 Definition: This architecture aims to reduce data redundancy by integrating all data processing
functions into a single layer.
 Components:
o Data sources
o A unified database for operational and analytical purposes
 Use Case: Rarely used in practice due to scalability and performance limitations.
 Limitations:
o Poor separation between transactional and analytical processes.
o Not suitable for large-scale systems.
Two-Tier Architecture
 Definition: Separates the data warehouse and client applications but lacks a middle tier for advanced
processing.
 Components:
o Tier 1: Data sources and ETL processes (Extract, Transform, Load).
o Tier 2: Client applications directly access the data warehouse for querying and reporting.
 Advantages:
o Simplified design compared to three-tier architecture.
o Faster query execution for small-scale systems.
 Limitations:
o Scalability issues when data volume grows.
o Limited support for complex business logic.

Three-Tier Architecture
 Definition: The most common architecture, which separates the data warehouse into three distinct
layers for enhanced scalability, flexibility, and performance.
 Components:
1. Bottom Tier:
 Data sources: Operational databases, external data, and flat files.
 ETL processes: Data extraction, transformation, and loading into the warehouse.
 Data staging area: Temporary storage for ETL processes.
2. Middle Tier:
 OLAP (Online Analytical Processing) server: Provides multidimensional data views.
 Business logic layer: Executes complex transformations and calculations.
3. Top Tier:
 Client tools: Dashboards, reporting tools, and analytical applications for end users.
 Advantages:
o Scalable for large datasets.

o Supports complex analytical queries.

o Enhanced data security and performance.

 Limitations:
o Higher implementation cost and complexity.

Properties of Data Warehouse Architectures


1. Subject-Oriented: Data is organized around major business subjects such as customers or sales.
2. Integrated: Combines data from various sources into a consistent format.
3. Time-Variant: Historical data is stored and used for trend analysis.
4. Non-Volatile: Once data is entered into the warehouse, it is not modified.
5. Scalable: Designed to handle increasing data volumes efficiently.
6. Accessible: End users can query the data easily through various tools.

ETL Process in Data Warehouse


ETL (Extract, Transform, Load) is a critical process in data warehousing for preparing data.
1. Extract:
o Retrieves data from heterogeneous sources (databases, files, APIs).
o Ensures data consistency during extraction.
2. Transform:
o Cleans and standardizes data (removing duplicates, handling missing values).
o Applies business rules and transformations (e.g., aggregations, lookups).
3. Load:
o Stores the transformed data in the target data warehouse.
o Can be implemented using batch loading or real-time streaming.
ETL Challenges:
 Managing large data volumes.
 Maintaining data quality and consistency.
 Scheduling and monitoring ETL jobs.

Explain OLAP Operations:


OLAP stands for Online Analytical Processing Server. It is a software technology that allows users to
analyze information from multiple database systems at the same time. It is based on multidimensional data
model and allows the user to query on multi-dimensional data (eg. Delhi -> 2018 -> Sales data). OLAP
databases are divided into one or more cubes and these cubes are known as Hyper-cubes.

OLAP operations:
There are five basic analytical operations that can be performed on an OLAP cube:
1. Drill down: In drill-down operation, the less detailed data is converted into highly detailed data. It can
be done by:
 Moving down in the concept hierarchy
 Adding a new dimension
In the cube given in overview section, the drill down operation is performed by moving down in the concept
hierarchy of Time dimension (Quarter -> Month).

2. Roll up: It is just opposite of the drill-down operation. It performs aggregation on the OLAP cube. It
can be done by:
 Climbing up in the concept hierarchy
 Reducing the dimensions
In the cube given in the overview section, the roll-up operation is performed by climbing up in the concept
hierarchy of Location dimension (City -> Country).

3. Dice: It selects a sub-cube from the OLAP cube by selecting two or more dimensions. In the cube
given in the overview section, a sub-cube is selected by selecting following dimensions with criteria:
 Location = “Delhi” or “Kolkata”
 Time = “Q1” or “Q2”
 Item = “Car” or “Bus”

4. Slice: It selects a single dimension from the OLAP cube which results in a new sub-cube creation. In
the cube given in the overview section, Slice is performed on the dimension Time = “Q1”.

5. Pivot: It is also known as rotation operation as it rotates the current view to get a new view of the
representation. In the sub-cube obtained after the slice operation, performing pivot operation gives a
new view of it.
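On a relational (ROLAP) representation of such a cube, the roll-up and slice operations above can be approximated with ordinary SQL; the sales_cube table and its columns here are assumed purely for illustration:

-- Roll up along the Location hierarchy (city -> country -> grand total)
SELECT country,
       city,
       SUM(sales_amount) AS total_sales
FROM   sales_cube
GROUP BY ROLLUP (country, city);

-- Slice: fix a single dimension value (Time = 'Q1')
SELECT *
FROM   sales_cube
WHERE  quarter = 'Q1';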
7. Define data warehouse. Explain the ETL process in detail.
Data Warehouse :
A Data Warehouse (DW) is a relational database that is designed for query and analysis rather than
transaction processing.
It includes historical data derived from transaction data from single and multiple sources.
A Data Warehouse provides integrated, enterprise-wide, historical data and focuses on providing support for
decision-makers for data modeling and analysis.
A Data Warehouse is a group of data specific to the entire organization, not only to a particular group of
users.
It is not used for daily operations and transaction processing but used for making decisions.
Benefits of Data Warehouse
1. Understand business trends and make better forecasting decisions.
2. Data warehouses are designed to perform well with enormous amounts of data.
3. The structure of data warehouses is more accessible for end-users to navigate, understand, and
query.
4. Data warehousing is an efficient method to manage demand for lots of information from lots of users.
5. Data warehousing provides the capability to analyze a large amount of historical data.

ETL is a process in Data Warehousing and it stands for Extract, Transform and Load. It is a process in
which an ETL tool extracts the data from various data source systems, transforms it in the staging area, and
then finally, loads it into the Data Warehouse system.

Let us understand each step of the ETL process in-depth:


1. Extraction:
The first step of the ETL process is extraction. In this step, data from various source systems is
extracted which can be in various formats like relational databases, No SQL, XML, and flat files into
the staging area. It is important to extract the data from various source systems and store it into the
staging area first and not directly into the data warehouse because the extracted data is in various
formats and can be corrupted also. Hence loading it directly into the data warehouse may damage it
and rollback will be much more difficult. Therefore, this is one of the most important steps of ETL
process.
2. Transformation:
The second step of the ETL process is transformation. In this step, a set of rules or functions are
applied on the extracted data to convert it into a single standard format. It may involve following
processes/tasks:
 Filtering – loading only certain attributes into the data warehouse.
 Cleaning – filling up the NULL values with some default values, mapping U.S.A, United States,
and America into USA, etc.
 Joining – joining multiple attributes into one.
 Splitting – splitting a single attribute into multiple attributes.
 Sorting – sorting tuples on the basis of some attribute (generally key-attribute).
3. Loading:
The third and final step of the ETL process is loading. In this step, the transformed data is finally
loaded into the data warehouse. Sometimes the data is updated by loading into the data warehouse
very frequently and sometimes it is done after longer but regular intervals. The rate and period of
loading solely depends on the requirements and varies from system to system.
The ETL process can also use the pipelining concept, i.e. as soon as some data is extracted, it can be transformed, and during that period new data can be extracted. Likewise, while the transformed data is being loaded into the data warehouse, the already extracted data can be transformed.

ETL Tools: Most commonly used ETL tools are Hevo, Sybase, Oracle Warehouse builder, CloverETL, and
MarkLogic.
Data Warehouses: Most commonly used Data Warehouses are Snowflake, Redshift, BigQuery, and
Firebolt.
Differentiate OLAP vs OLTP

Category | OLAP (Online Analytical Processing) | OLTP (Online Transaction Processing)
Definition | It is well-known as an online database query management system. | It is well-known as an online database modifying system.
Data source | Consists of historical data from various databases. | Consists of only operational current data.
Method used | It makes use of a data warehouse. | It makes use of a standard database management system (DBMS).
Application | It is subject-oriented. Used for data mining, analytics, decision making, etc. | It is application-oriented. Used for business tasks.
Normalized | In an OLAP database, tables are not normalized. | In an OLTP database, tables are normalized (3NF).
Usage of data | The data is used in planning, problem-solving, and decision-making. | The data is used to perform day-to-day fundamental operations.
Task | It provides a multi-dimensional view of different business tasks. | It reveals a snapshot of present business tasks.
Purpose | It serves the purpose to extract information for analysis and decision-making. | It serves the purpose to insert, update, and delete information from the database.
Volume of data | A large amount of data is stored, typically in TB, PB. | The size of the data is relatively small as the historical data is archived in MB, GB.
Queries involved | Relatively slow as the amount of data involved is large; queries may take hours. | Very fast as the queries operate on 5% of the data.
Update | The OLAP database is not often updated. As a result, data integrity is unaffected. | The data integrity constraint must be maintained in an OLTP database.
Backup and Recovery | It only needs backup from time to time as compared to OLTP. | The backup and recovery process is maintained rigorously.
Processing time | The processing of complex queries can take a lengthy time. | It is comparatively fast in processing because of simple and straightforward queries.
Types of users | This data is generally managed by CEO, MD, and GM. | This data is managed by clerks and managers.
Operations | Only read and rarely write operations. | Both read and write operations.
Updates | With lengthy, scheduled batch operations, data is refreshed on a regular basis. | The user initiates data updates, which are brief and quick.
Nature of audience | The process is focused on the customer. | The process is focused on the market.
Database Design | Design with a focus on the subject. | Design that is focused on the application.
Productivity | Improves the efficiency of business analysts. | Enhances the user's productivity.
Star, Snowflake and Fact Constellation Schemas
Following are the 3 chief types of multidimensional schemas, each having its unique advantages:
• Star Schema
• Snowflake Schema
• Galaxy Schema
Star Schema: Star Schema in data warehouse, in which the center of the star can have one fact table and a
number of associated dimension tables. It is known as star schema as its structure resembles a star. The
Star Schema data model is the simplest type of Data Warehouse schema. It is also known as Star Join
Schema and is optimized for querying large data sets. In the following Star Schema example, the fact table
is at the center which contains keys to every dimension table like Dealer_ID, Model ID, Date_ID, Product_ID,
Branch_ID & other attributes like Units sold and revenue.
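A compact, hypothetical sketch of such a star schema in SQL, with a central fact table whose composite key is made up of foreign keys to the dimension tables (the exact tables and columns are assumed for illustration):

CREATE TABLE dealer_dim  (dealer_id  NUMBER PRIMARY KEY, dealer_name  VARCHAR2(50), city VARCHAR2(30));
CREATE TABLE product_dim (product_id NUMBER PRIMARY KEY, product_name VARCHAR2(50), category VARCHAR2(30));
CREATE TABLE date_dim    (date_id    NUMBER PRIMARY KEY, full_date DATE, quarter VARCHAR2(2), cal_year NUMBER);

CREATE TABLE sales_fact
(
  dealer_id  NUMBER REFERENCES dealer_dim,
  product_id NUMBER REFERENCES product_dim,
  date_id    NUMBER REFERENCES date_dim,
  units_sold NUMBER,
  revenue    NUMBER,
  PRIMARY KEY (dealer_id, product_id, date_id)
);

-- A typical analytical query joins the fact table to its dimensions,
-- e.g. revenue by dealer for the year 2017:
SELECT dl.dealer_name, SUM(f.revenue) AS total_revenue
FROM   sales_fact f
JOIN   dealer_dim dl ON f.dealer_id = dl.dealer_id
JOIN   date_dim   d  ON f.date_id   = d.date_id
WHERE  d.cal_year = 2017
GROUP BY dl.dealer_name;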

Characteristics of Star Schema:


• Every dimension in a star schema is represented with only one dimension table.
• The dimension table should contain the set of attributes.
• The dimension table is joined to the fact table using a foreign key.
• The dimension tables are not joined to each other.
• The fact table contains keys and measures.
• The star schema is easy to understand and provides optimal disk usage.
• The schema is widely supported by BI tools.

Snowflake Schema:
Snowflake Schema in data warehouse is a logical arrangement of tables in a multidimensional database
such that the ER diagram resembles a snowflake shape. A Snowflake Schema is an extension of a Star
Schema, and it adds additional dimensions. The dimension tables are normalized which splits data into
additional tables. In the following Snowflake Schema example, Country is further normalized into an
individual table.
Characteristics of Snowflake Schema:
• The main benefit of the snowflake schema is that it uses smaller disk space.
• It is easier to add a dimension to the schema.
• Query performance is reduced due to the multiple tables.

Galaxy Schema:
A Galaxy Schema contains two fact tables that share dimension tables between them. It is also called a Fact Constellation Schema. The schema is viewed as a collection of stars, hence the name Galaxy Schema.

As you can see in the above example, there are two fact tables:
1. Expense
2. Revenue
In a Galaxy schema, shared dimensions are called conformed dimensions.
Characteristics of Galaxy Schema:
• The dimensions in this schema are separated into separate dimensions based on the various levels of hierarchy. For example, if geography has four levels of hierarchy like region, country, state, and city, then the Galaxy schema should have four dimensions.
• Moreover, it is possible to build this type of schema by splitting the one-star schema into more star schemas.
• The dimensions are large in this schema, which is needed to build based on the levels of hierarchy.
• This schema is helpful for aggregating fact tables for better understanding.
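Continuing the illustrative star-schema sketch given earlier, a galaxy (fact constellation) schema simply adds a second fact table that reuses the same conformed dimension tables:

CREATE TABLE expense_fact
(
  dealer_id   NUMBER REFERENCES dealer_dim,
  date_id     NUMBER REFERENCES date_dim,
  expense_amt NUMBER,
  PRIMARY KEY (dealer_id, date_id)
);
-- sales_fact and expense_fact now share dealer_dim and date_dim (conformed dimensions).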
Why is data preprocessing important in data mining?
Data preprocessing is a technique to turn raw and crude information gathered from diverse sources into a clean and consistent dataset. Data preprocessing is one of the most vital steps in the data mining process. It involves data cleaning, data integration, data transformation, data reduction, etc.
Importance of Data Preprocessing in Data Mining
1. Improves Data Quality
 Real-world data may have missing values, errors, or outliers that can negatively affect the accuracy of
mining algorithms.
 Preprocessing ensures data is accurate, complete, and reliable, leading to better analytical results.
2. Handles Missing Data
 Missing values can arise due to human errors, hardware failures, or data collection issues.
 Techniques like mean substitution, regression imputation, or deletion are applied to address these
gaps, ensuring the dataset is usable.
3. Removes Noise and Outliers
 Noisy data can result from sensor errors, human input errors, or system malfunctions.
 Techniques like smoothing, binning, and clustering are used to reduce noise and improve the quality
of patterns extracted.
4. Ensures Consistency
 Data inconsistencies, such as different formats, naming conventions, or measurement units, can
hinder analysis.
 Preprocessing resolves such inconsistencies by standardizing the data.
5. Reduces Complexity
 High-dimensional data can lead to challenges in processing and visualization.
 Dimensionality reduction techniques like Principal Component Analysis (PCA) or feature selection
simplify data without significant loss of information.
6. Enhances Algorithm Efficiency
 Clean and formatted data reduces computational complexity and improves the performance of data
mining algorithms.
 Ensures that algorithms focus on extracting meaningful patterns rather than dealing with noisy or
irrelevant data.
7. Facilitates Better Understanding
 Data preprocessing organizes and summarizes the dataset, making it easier for analysts to interpret
the data and derive insights.
8. Ensures Reproducibility
 Standardizing the preprocessing steps ensures that the results are reproducible and consistent across
different datasets or scenarios.

Explain data reduction techniques in detail.


Data reduction is a technique used in data mining to reduce the size of a dataset while still preserving the
most important information. This can be beneficial in situations where the dataset is too large to be
processed efficiently, or where the dataset contains a large amount of irrelevant or redundant information.
There are several different data reduction techniques that can be used in data mining, including:
• Data Sampling
• Dimensionality Reduction
• Data Compression
• Data Discretization
• Feature Selection
Data reduction is an important step in data mining, as it can help to improve the efficiency and performance of machine learning algorithms by reducing the size of the dataset. However, it is important to be aware of the trade-off between the size and accuracy of the data, and to carefully assess the risks and benefits before implementing it.
Methods of data reduction:
These are explained below.
1. Data Cube Aggregation:
This technique is used to aggregate data in a simpler form. For example, imagine that the information you gathered for your analysis for the years 2012 to 2014 includes the revenue of your company every three months. If the analysis is concerned with annual sales rather than quarterly figures, the data can be summarized so that the resulting data reports total sales per year instead of per quarter. In short, it summarizes the data.
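For example, the quarterly revenue figures mentioned above can be rolled up to yearly totals with a simple aggregation (the quarterly_sales table and its columns are assumed for illustration):

-- Quarterly revenue rows are reduced to one row per year
SELECT sales_year,
       SUM(quarterly_revenue) AS annual_revenue
FROM   quarterly_sales
WHERE  sales_year BETWEEN 2012 AND 2014
GROUP BY sales_year;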
2. Dimension reduction:
Whenever we come across data that is only weakly relevant, we keep only the attributes required for our analysis. This reduces data size, as it eliminates outdated or redundant features.
• Step-wise Forward Selection –
The selection begins with an empty set of attributes; at each step we add the best of the remaining original attributes to the set, based on their relevance (assessed, for example, with a p-value in statistics).
Suppose there are the following attributes in the data set, in which a few attributes are redundant.

Initial attribute Set: {X1, X2, X3, X4, X5, X6}


Initial reduced attribute set: { }

Step-1: {X1}
Step-2: {X1, X2}
Step-3: {X1, X2, X5}

Final reduced attribute set: {X1, X2, X5}

 Step-wise Backward Selection –


This selection starts with a set of complete attributes in the original data and at each point, it
eliminates the worst remaining attribute in the set.
Suppose there are the following attributes in the data set in which few attributes are redundant.

Initial attribute Set: {X1, X2, X3, X4, X5, X6}


Initial reduced attribute set: {X1, X2, X3, X4, X5, X6}
Step-1: {X1, X2, X3, X4, X5}
Step-2: {X1, X2, X3, X5}
Step-3: {X1, X2, X5}

Final reduced attribute set: {X1, X2, X5}


• Combination of Forward and Backward Selection –
It allows us to remove the worst and select the best attributes, saving time and making the process faster.

4. Numerosity Reduction:
In this reduction technique, the actual data is replaced with a mathematical model or a smaller representation of the data. For parametric methods it is important to store only the model parameters, while non-parametric methods such as clustering, histograms, and sampling store reduced representations of the data.
5. Discretization & Concept Hierarchy Operation:
Techniques of data discretization are used to divide the attributes of the continuous nature into data with
intervals. We replace many constant values of the attributes by labels of small intervals. This means that
mining results are shown in a concise, and easily understandable way.
 Top-down discretization –
If you first consider one or a couple of points (so-called breakpoints or split points) to divide the whole
set of attributes and repeat this method up to the end, then the process is known as top-down
discretization also known as splitting.
 Bottom-up discretization –
If you first consider all the constant values as split points, some are discarded through a combination
of the neighborhood values in the interval, that process is called bottom-up discretization.
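As a small illustration of discretization, continuous values such as age can be replaced by interval labels; the sketch below assumes a customers table with a numeric age column:

SELECT customer_id,
       age,
       CASE
         WHEN age < 20 THEN 'youth'
         WHEN age < 41 THEN 'young adult'
         WHEN age < 61 THEN 'middle aged'
         ELSE 'senior'
       END AS age_group
FROM   customers;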

Explain Data Preprocessing :


Data Preprocessing
Data preprocessing is a crucial step in the data mining and machine learning process. It involves
transforming raw, inconsistent, and incomplete data into a clean, organized, and usable format to ensure
accurate and reliable analysis.
Why Data Preprocessing is Important
Real-world data often comes from diverse sources, and it may have issues such as:
1. Incomplete Data: Missing values or attributes due to errors in collection or storage.
2. Noisy Data: Errors, outliers, or irrelevant information that disrupts analysis.
3. Inconsistent Data: Discrepancies in formats, names, or coding conventions.
4. Redundant Data: Duplicate or irrelevant entries that increase complexity.
Addressing these issues is essential for generating accurate insights and avoiding misleading conclusions.
Steps in Data Preprocessing
1. Data Cleaning
 Objective: Remove noise, handle missing values, and correct errors.
 Techniques:
o Fill missing values (e.g., mean, median, or mode substitution).
o Remove or replace noisy data (e.g., using smoothing techniques).
o Correct inconsistent entries (e.g., standardizing formats).
2. Data Integration
 Objective: Combine data from multiple sources into a cohesive dataset.
 Techniques:
o Use schema matching or data mapping to align attributes.
o Resolve conflicts caused by different measurement units or naming conventions.
3. Data Transformation
 Objective: Convert data into a suitable format for analysis.
 Techniques:
o Normalization: Scale data to a specific range (e.g., 0–1); a small SQL sketch follows this list.
o Aggregation: Summarize data (e.g., converting daily sales into monthly totals).
o Encoding: Convert categorical data into numerical form (e.g., one-hot encoding).
4. Data Reduction
 Objective: Reduce data volume while retaining essential patterns.
 Techniques:
o Dimensionality reduction (e.g., Principal Component Analysis).
o Sampling: Selecting a representative subset of data.
o Data cube aggregation: Summarize data at various levels of granularity.
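As a small sketch of the normalization step mentioned under Data Transformation above, min-max scaling to the 0–1 range can be expressed with analytic functions (the stg_sales table and amount column are assumed for illustration):

SELECT sale_id,
       (amount - MIN(amount) OVER ()) /
       NULLIF(MAX(amount) OVER () - MIN(amount) OVER (), 0) AS amount_norm
FROM   stg_sales;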
Benefits of Data Preprocessing
1. Improves Data Quality: Ensures accuracy, consistency, and completeness.
2. Enhances Algorithm Performance: Clean and structured data leads to better results in machine
learning models and data mining techniques.
3. Reduces Computational Complexity: Preprocessing simplifies data, speeding up analysis and
reducing resource consumption.
4. Ensures Reliable Insights: Clean data prevents biases and errors, resulting in trustworthy
conclusions.
Conclusion
Data preprocessing is an indispensable step in data analysis, transforming raw data into a clean and usable
form. By addressing issues like missing values, noise, and inconsistencies, it ensures that the dataset is
ready for effective and accurate analysis. Proper preprocessing is key to uncovering meaningful patterns and
making informed decisions.

KNOWLEDGE DISCOVERY IN DATA (KDD) PROCESS


It is an interactive and iterative sequence comprising nine phases. Teams commonly learn new things in a phase that cause them to go back and refine the work done in prior phases based on the new insights and information that have been uncovered, moving iteratively between phases until the team members have sufficient information to move to the next phase. The process begins with finding the KDD goals and ends with the successful implementation of the discovered knowledge.
1. Domain Understanding – In this preliminary step the team needs to understand and define the goals of the end-user and the environment in which the KDD process will take place.
2. Selection & Addition – In this phase it is important to determine the dataset which will be utilized for the KDD process. The team needs to first find the relevant data which is accessible. Data from multiple sources can be integrated in this phase. Note that this is the data which is going to lead us to knowledge, so if some attributes are missing from the data it will lead to half-cooked knowledge. Therefore, the objective of this phase is determining the suitable and complete dataset on which the discovery will be performed.
3. Pre-processing & Cleansing – The data received from the earlier phase is like a rough diamond; in this phase you need to polish the diamond so that everyone can see its beauty. The main task in this phase is to sanitize and prepare the data for use. Data cleansing is a subprocess that focuses on removing errors in your data so your data becomes true and consistent. Sanity checks are performed to check that the data does not contain physically or theoretically impossible values, such as people taller than 3 meters or someone with an age of 299 years.
4. Data Transformation – Once your team has cleansed and integrated the data, you may have to transform it so that it becomes suitable for the next phase of data mining. In this phase, the data is transformed or consolidated into forms appropriate for mining by performing summary or aggregation operations. The aggregation operators perform mathematical operations like Average, Count, Max, Min, Sum, etc. on the numeric properties of the elements in the collection. This phase is very project-specific.
5. Data Mining – In this phase methods like association, classification, clustering and/or regression are applied in order to extract patterns. We may need to run the data mining algorithm several times until the desired output is obtained.
6. Evaluation – In this phase we evaluate and understand the mined patterns and rules, and their reliability with respect to the goal set in the first phase. Here we assess the pre-processing steps for their impact on the data mining algorithm outcomes; for example, we can assess the outcome of the algorithm by adding an extra feature in phase 4 and repeating from there. This phase focuses on the comprehensibility and efficacy of the newly developed model.
7. Discovered Knowledge Presentation – The last phase is all about the use of, and overall feedback on, the discovery results acquired by data mining. The interesting discovered patterns are presented to the end-user and may be stored as new knowledge in the knowledge base. The success of this phase decides the effectiveness of the entire KDD process.
