SPPU 2022 Solved Question Paper DWDM
Q.2. What is a Data warehouse? Explain the need and characteristics of Data warehouse. [5]
A star schema is a relational schema whose design represents a multidimensional data model. The
star schema is the simplest data warehouse schema. It is known as a star schema because the
entity-relationship diagram of this schema resembles a star, with points diverging from a central
table. The center of the schema consists of a large fact table, and the points of the star are the
dimension tables.
Snowflake Schema
• We can create even more complex star schemas by normalizing a dimension table into several
tables. The normalized dimension table is called a snowflake.
• A snowflake schema is a variant of the star schema. "A schema is known as a snowflake if
one or more dimension tables do not connect directly to the fact table but must join through
other dimension tables."
• The snowflake schema is an expansion of the star schema in which each point of the star
explodes into more points. It is called a snowflake schema because the diagram of the schema
resembles a snowflake. Snowflaking is a method of normalizing the dimension tables
of a star schema. When we normalize all the dimension tables entirely, the resulting
structure resembles a snowflake with the fact table in the middle.
A fact constellation means two or more fact tables sharing one or more dimension tables. It is also
called a galaxy schema.
The fact constellation schema describes a logical structure of a data warehouse or data mart. A fact
constellation schema can be designed with a collection of de-normalized fact tables and shared,
conformed dimension tables.
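Below is a minimal sketch, in Python with pandas, of how a fact table joins to its dimension tables in a star schema and is then aggregated. The table and column names (sales_fact, date_dim, product_dim, etc.) are hypothetical illustrations, not taken from the paper.

```python
import pandas as pd

# Hypothetical dimension tables (one row per member).
date_dim = pd.DataFrame({"date_key": [1, 2], "month": ["Jan", "Feb"], "year": [2022, 2022]})
product_dim = pd.DataFrame({"product_key": [10, 11], "product_name": ["Mobile", "Modem"]})

# Hypothetical fact table: foreign keys to the dimensions plus numeric measures.
sales_fact = pd.DataFrame({
    "date_key": [1, 1, 2],
    "product_key": [10, 11, 10],
    "units_sold": [5, 2, 7],
    "revenue": [500.0, 120.0, 700.0],
})

# A star-schema style query: join the fact table to its dimensions, then aggregate.
report = (sales_fact
          .merge(date_dim, on="date_key")        # fact -> date dimension
          .merge(product_dim, on="product_key")  # fact -> product dimension
          .groupby(["month", "product_name"])["revenue"].sum())
print(report)
```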
The Kimball methodology is intended for designing, developing, and deploying data
warehouse/business intelligence systems.
o Starts with one data mart (e.g., sales); later on, additional data marts are added (e.g., collections,
marketing, etc.)
o Data flows from the sources into the data marts, and the integrated data marts together constitute the data warehouse
o The Kimball approach is faster to implement because it is implemented in stages
Before we go ahead with the details of the methodology, let us take a quick look at some essential
definitions of the terms used.
The Kimball lifecycle diagram illustrates the flow of a data warehouse implementation. It identifies task
sequencing and highlights activities that should happen concurrently. Activities may need to be
customized to address the unique needs of the organization. Also, not every detail of every lifecycle
task will be required on every project; this has to be decided as per need.
As per the Kimball Lifecycle, we start building a data warehouse by understanding the business
requirements and determining how best to add value to the organization. The organization must agree
on what the value of this data is before deciding to build a data warehouse to hold it. Once the
requirements are gathered, the implementation phase begins with design steps across three different
tracks: technology, data, and BI applications. Once this implementation is done, the
Lifecycle comes back together to deploy the query tools, reports, and applications to the user
community.
The incremental approach of the Lifecycle helps to deliver business value in a short span of time and,
at the same time, helps to build an enterprise-wide information resource in the long term.
Q.2. What is a Data warehouse? Explain the properties of Data warehouse architecture
Data warehouse: A data warehouse is a subject-oriented, integrated, time-variant, and non-volatile collection of
data. This data helps analysts make informed decisions in an organization.
Three-tier architecture: This is the most widely used architecture. It consists of the
1. Top tier,
2. Middle tier, and
3. Bottom tier.
Bottom Tier: The database of the data warehouse serves as the bottom tier. It is usually a relational
database system. Data is cleansed, transformed, and loaded into this layer using back-end tools.
Middle Tier: The middle tier of the data warehouse is an OLAP server, which is implemented using either
the ROLAP or the MOLAP model. For a user, this application tier presents an abstracted view of the database.
This layer also acts as a mediator between the end user and the database.
Top Tier: The top tier is a front-end client layer. It holds the tools and APIs that you use to connect to the
data warehouse and get data out of it. These could be query tools, reporting tools, managed query tools,
analysis tools, and data mining tools.
Q.3) What is ETL? Explain data pre-processing techniques in detail.
ETL : Extract, transform, and load (ETL) is the process of combining data from multiple sources into
a large, central repository called a data warehouse. ETL uses a set of business rules to clean and
organize raw data and prepare it for storage, data analytics, and machine learning (ML).
• Data pre-processing is an important task. It is a data mining technique that transforms raw data
into a more understandable, useful and efficient format.
Why is data pre-processing required? Raw data is often incomplete, noisy, and inconsistent, so it must be prepared before mining. The major pre-processing techniques are:
1. Data Cleaning:
The data can have many irrelevant and missing parts. To handle this, data cleaning is done. It
involves handling of missing data, noisy data, etc.
2. Data Transformation:
This step is taken in order to transform the data into forms appropriate for the mining process.
It involves the following ways:
1. Normalization:
It is done in order to scale the data values into a specified range (-1.0 to 1.0 or 0.0 to 1.0).
2. Attribute Selection:
In this strategy, new attributes are constructed from the given set of attributes to help the
mining process.
3. Discretization:
This is done to replace the raw values of a numeric attribute by interval levels or conceptual
levels.
3. Numerosity Reduction:
This enables storing a model of the data instead of the whole data, for example, regression models.
4. Dimensionality Reduction:
This reduces the size of the data using encoding mechanisms. It can be lossy or lossless. If the
original data can be retrieved after reconstruction from the compressed data, the reduction is called
lossless reduction; otherwise, it is called lossy reduction. Two effective methods of dimensionality
reduction are wavelet transforms and PCA (Principal Component Analysis).
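The following is a minimal sketch of two of the pre-processing steps described above, data cleaning (filling a missing value) and min-max normalization to the range 0.0 to 1.0, using pandas. The column names and values are hypothetical.

```python
import pandas as pd

# Hypothetical raw data with a missing value and unscaled attributes.
df = pd.DataFrame({"age": [25, 32, None, 41],
                   "income": [30000, 52000, 47000, 80000]})

# Data cleaning: fill the missing age with the column mean.
df["age"] = df["age"].fillna(df["age"].mean())

# Data transformation: min-max normalization to the range [0.0, 1.0].
for col in ["age", "income"]:
    df[col] = (df[col] - df[col].min()) / (df[col].max() - df[col].min())

print(df)
```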
Characteristics of OLAP:
1. Multidimensional conceptual view: OLAP systems let business users have a dimensional and
logical view of the data in the data warehouse. It helps in carrying out slice and dice operations.
2. Multi-user support: Since OLAP systems are shared, an OLAP operation should
provide normal database operations, including retrieval, update, concurrency control, integrity,
and security.
3. Accessibility: OLAP acts as a mediator between data warehouses and front ends. The OLAP
operations should sit between data sources (e.g., data warehouses) and an OLAP front
end.
4. Storing OLAP results: OLAP results are kept separate from data sources.
5. Uniform reporting performance: Increasing the number of dimensions or the database size
should not significantly degrade the reporting performance of the OLAP system.
6. OLAP provides for distinguishing between zero values and missing values so that aggregates
are computed correctly.
7. An OLAP system should ignore all missing values and compute correct aggregate values.
8. OLAP facilitates interactive querying and complex analysis for the users.
9. OLAP allows users to drill down for greater detail or roll up for aggregations of metrics along
a single business dimension or across multiple dimensions.
10. OLAP provides the ability to perform intricate calculations and comparisons.
11. OLAP presents results in a number of meaningful ways, including charts and graphs.
Q.3) Describe ETL. What are the tasks to be performed during data transformation? [5]
ETL : Extract, transform, and load (ETL) is the process of combining data from multiple sources into
a large, central repository called a data warehouse. ETL uses a set of business rules to clean and
organize raw data and prepare it for storage, data analytics, and machine learning (ML).
Tasks to be performed during data transformation:
Data transformation in data mining refers to the process of converting raw data into a format that is
suitable for analysis and modeling. The goal of data transformation is to prepare the data for data
mining so that it can be used to extract useful insights and knowledge. Data transformation
typically involves several steps, including:
1. Data cleaning: Removing or correcting errors, inconsistencies, and missing values in the data.
2. Data integration: Combining data from multiple sources, such as databases and spreadsheets,
into a single format.
3. Data normalization: Scaling the data to a common range of values, such as between 0 and 1,
to facilitate comparison and analysis.
4. Data reduction: Reducing the dimensionality of the data by selecting a subset of relevant
features or attributes.
5. Data discretization: Converting continuous data into discrete categories or bins.
6. Data aggregation: Combining data at different levels of granularity, such as by summing or
averaging, to create new features or attributes.
Data transformation is an important step in the data mining process as it helps to ensure that the
data is in a format that is suitable for analysis and modeling, and that it is free of errors and
inconsistencies. Data transformation can also help to improve the performance of data mining
algorithms by reducing the dimensionality of the data and by scaling the data to a common
range of values. Two of these tasks are illustrated in the sketch below.
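Below is a minimal sketch of two of the transformation tasks listed above, discretization (binning a continuous attribute) and aggregation at a coarser level of granularity, using pandas. The data is hypothetical.

```python
import pandas as pd

# Hypothetical transactional data.
sales = pd.DataFrame({"customer_age": [18, 25, 37, 52, 61, 44],
                      "amount": [120, 80, 200, 150, 90, 300]})

# Discretization: convert the continuous age attribute into categorical bins.
sales["age_group"] = pd.cut(sales["customer_age"],
                            bins=[0, 30, 50, 100],
                            labels=["young", "middle", "senior"])

# Aggregation: summarize the amount measure at the coarser age_group level.
summary = sales.groupby("age_group", observed=True)["amount"].agg(["sum", "mean"])
print(summary)
```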
OLAP servers are based on the multidimensional view of data. The main OLAP operations on
multidimensional data are discussed below.
1. Roll-up
• Roll-up is performed by climbing up the concept hierarchy for the dimension location.
• Initially, the concept hierarchy was "street < city < province < country".
• On rolling up, the data is aggregated by ascending the location hierarchy from the level
of the city to the level of the country.
• The data is grouped into countries rather than cities.
• When roll-up is performed, one or more dimensions from the data cube are removed.
2. Drill-down
• Drill-down is performed by stepping down the concept hierarchy for the dimension
time.
• Initially, the concept hierarchy was "day < month < quarter < year."
• On drilling down, the time dimension is descended from the level of the quarter to the
level of the month.
• When drill-down is performed, one or more dimensions are added to the data cube.
• Drill-down navigates the data from less detailed data to highly detailed data.
3. Slice:
The slice operation selects one particular dimension from the given cube and provides a new
sub-cube.
• Here the slice is performed on the dimension "time" using the criterion time = "Q1".
• It forms a new sub-cube by selecting a single value along one dimension.
4. Dice
Dice selects two or more dimensions from a given cube and provides a new sub-cube.
The dice operation on the cube uses the following selection criteria, which involve three
dimensions:
• (location = "Toronto" or "Vancouver")
• (time = "Q1" or "Q2")
• (item = "Mobile" or "Modem")
5. Pivot
The pivot operation is also known as rotation. It rotates the data axes in order to view and provide
an alternative presentation of the data.
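A minimal pandas sketch of the roll-up, slice, and dice operations described above, on a hypothetical sales cube with time, location, and item dimensions:

```python
import pandas as pd

# Hypothetical cube as a flat table: dimensions (time, location, item) + measure (sales).
cube = pd.DataFrame({
    "time":     ["Q1", "Q1", "Q2", "Q2", "Q1", "Q3"],
    "location": ["Toronto", "Vancouver", "Toronto", "Vancouver", "Chicago", "Toronto"],
    "item":     ["Mobile", "Modem", "Mobile", "Mobile", "Modem", "Modem"],
    "sales":    [605, 825, 14, 400, 300, 250],
})

# Roll-up: aggregate the item dimension away (climb the hierarchy).
rollup = cube.groupby(["time", "location"])["sales"].sum()

# Slice: fix a single value on one dimension (time = "Q1").
slice_q1 = cube[cube["time"] == "Q1"]

# Dice: restrict two or more dimensions to sets of values.
dice = cube[cube["location"].isin(["Toronto", "Vancouver"])
            & cube["time"].isin(["Q1", "Q2"])
            & cube["item"].isin(["Mobile", "Modem"])]

print(rollup, slice_q1, dice, sep="\n\n")
```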
Q.4) What is Data mining? Explain the architecture of Data mining. [3]
Data Mining is a process of discovering interesting patterns and knowledge from large amounts of
data. The data sources can include databases, data warehouses, the web, and other information
repositories or data that are streamed into the system dynamically.
Data Sources:
The actual sources of data are databases, data warehouses, the World Wide Web (WWW), text files, and
other documents. You need a huge amount of historical data for data mining to be successful.
Organizations typically store data in databases or data warehouses. Data warehouses may comprise
one or more databases, text files, spreadsheets, or other repositories of data. Sometimes, even plain text
files or spreadsheets may contain information. Another primary source of data is the World Wide Web
or the internet.
Different processes:
Before passing the data to the database or data warehouse server, the data must be cleaned, integrated,
and selected. As the information comes from various sources and in different formats, it cannot be used
directly for the data mining procedure because the data may not be complete and accurate. So, the
data first needs to be cleaned and unified. More information than needed will be collected from various
data sources, and only the data of interest has to be selected and passed to the server. These
procedures are not as easy as they seem. Several methods may be performed on the data as part of
selection, integration, and cleaning.
Database or Data Warehouse Server:
The database or data warehouse server contains the actual data that is ready to be processed. Hence,
the server is responsible for retrieving the relevant data based on the user's data mining request.
Data Mining Engine:
The data mining engine is a major component of any data mining system. It contains several modules
for performing data mining tasks, including association, characterization, classification, clustering,
prediction, time-series analysis, etc.
In other words, we can say the data mining engine is the core of the data mining architecture. It comprises
the instruments and software used to obtain insights and knowledge from data collected from various data
sources and stored within the data warehouse.
Pattern Evaluation Module:
The pattern evaluation module is primarily responsible for measuring how interesting a discovered pattern
is, using a threshold value. It collaborates with the data mining engine to focus the search on interesting
patterns.
This segment commonly employs interestingness measures that cooperate with the data mining modules to focus
the search towards interesting patterns. It might utilize an interestingness threshold to filter out discovered
patterns. On the other hand, the pattern evaluation module might be integrated with the mining
module, depending on the implementation of the data mining techniques used. For efficient data
mining, it is highly recommended to push the evaluation of pattern interestingness as deep as possible into the
mining procedure so as to confine the search to only interesting patterns.
Graphical User Interface:
The graphical user interface (GUI) module communicates between the data mining system and the
user. This module helps the user to easily and efficiently use the system without knowing the
complexity of the process. This module cooperates with the data mining system when the user specifies
a query or a task and displays the results.
Knowledge Base:
The knowledge base is helpful in the entire process of data mining. It might be helpful for guiding the
search or evaluating the interestingness of the resulting patterns. The knowledge base may even contain user views
and data from user experiences that might be helpful in the data mining process. The data mining
engine may receive inputs from the knowledge base to make the results more accurate and reliable. The
pattern evaluation module regularly interacts with the knowledge base to get inputs and also to update
it.
Q.4) Apply FP Tree Algorithm to construct FP Tree and find frequent itemset for the
following dataset given below (minimum support = 30%)
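The transaction table for this question is not reproduced above, so the following is only a minimal sketch of the FP-tree construction step on a small hypothetical transaction set with minimum support = 30%: count item frequencies, drop infrequent items, sort each transaction's items by descending frequency, and insert them into the tree.

```python
from collections import Counter

# Hypothetical transactions (the original table from the question is not reproduced here).
transactions = [
    ["I1", "I2", "I5"],
    ["I2", "I4"],
    ["I2", "I3"],
    ["I1", "I2", "I4"],
    ["I1", "I3"],
    ["I2", "I3"],
    ["I1", "I3"],
    ["I1", "I2", "I3", "I5"],
    ["I1", "I2", "I3"],
]
min_count = 0.30 * len(transactions)   # minimum support = 30% of the transactions

# Step 1: count item frequencies and keep only the frequent items.
counts = Counter(item for t in transactions for item in t)
frequent = {item: c for item, c in counts.items() if c >= min_count}

class Node:
    """One node of the FP-tree: an item, its count, and its children."""
    def __init__(self, item, parent):
        self.item, self.parent, self.count, self.children = item, parent, 0, {}

root = Node(None, None)

# Step 2: insert each transaction with its frequent items sorted by descending frequency.
for t in transactions:
    items = sorted((i for i in t if i in frequent), key=lambda i: (-frequent[i], i))
    node = root
    for item in items:
        if item not in node.children:
            node.children[item] = Node(item, node)
        node = node.children[item]
        node.count += 1

def show(node, depth=0):
    """Print the FP-tree as an indented list of item:count pairs."""
    if node.item is not None:
        print("  " * depth + f"{node.item}:{node.count}")
    for child in node.children.values():
        show(child, depth + 1)

show(root)
print("Frequent 1-itemsets (support >= 30%):", frequent)
```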
In recent data mining projects, various major data mining techniques have been developed and used,
including association, classification, clustering, prediction, sequential patterns, and regression.
1. Classification:
This technique is used to obtain important and relevant information about data and metadata. This data
mining technique helps to classify data in different classes.
2. Clustering:
Clustering is the division of information into groups of connected objects. Describing the data by a few
clusters loses certain fine details but achieves simplification. It models data by its
clusters. Historically, data modeling with clusters is rooted in statistics, mathematics,
and numerical analysis. From a machine learning point of view, clusters correspond to hidden patterns, the
search for clusters is unsupervised learning, and the resulting framework represents a data concept.
From a practical point of view, clustering plays an important role in data mining applications, for
example, scientific data exploration, text mining, information retrieval, spatial database applications,
CRM, Web analysis, computational biology, medical diagnostics, and much more.
3. Regression:
Regression analysis is a data mining process used to identify and analyze the relationship between
variables in the presence of other factors. It is used to predict the value of a specific
variable. Regression is primarily a form of planning and modeling. For example, we might use it to
project certain costs, depending on other factors such as availability, consumer demand, and
competition. Primarily, it gives the exact relationship between two or more variables in the given data
set.
4. Association Rules:
This data mining technique helps to discover a link between two or more items. It finds hidden patterns
in the data set.
Association rules are if-then statements that help to show the probability of relationships between
data items within large data sets in different types of databases. Association rule mining has several
applications and is commonly used to discover sales correlations in transactional data or patterns in
medical data sets (a small support/confidence sketch is given after this list).
5. Outlier Detection:
This type of data mining technique relates to the observation of data items in the data set that do not
match an expected pattern or expected behavior. This technique may be used in various domains like
intrusion detection, fraud detection, etc. It is also known as outlier analysis or outlier mining. An
outlier is a data point that diverges too much from the rest of the dataset. The majority of real-world
datasets contain outliers. Outlier detection plays a significant role in the data mining field. Outlier
detection is valuable in numerous fields like network intrusion identification, credit or debit card
fraud detection, detecting outliers in wireless sensor network data, etc.
6. Sequential Patterns:
Sequential pattern mining is a data mining technique specialized for evaluating sequential data to
discover sequential patterns. It comprises finding interesting subsequences in a set of sequences,
where the interestingness of a sequence can be measured in terms of different criteria like length,
occurrence frequency, etc.
In other words, this data mining technique helps to discover or recognize similar patterns in
transaction data over some time.
7. Prediction:
Prediction uses a combination of other data mining techniques such as trend analysis, clustering, classification,
etc. It analyzes past events or instances in the right sequence to predict a future event.
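As referenced in the association-rules item above, here is a minimal sketch of how the support and confidence of an if-then rule are computed, over a hypothetical set of market-basket transactions.

```python
# Hypothetical market-basket transactions.
transactions = [
    {"bread", "milk"},
    {"bread", "butter"},
    {"milk", "butter", "bread"},
    {"milk"},
    {"bread", "milk", "eggs"},
]

def support(itemset):
    """Fraction of transactions containing every item in the itemset."""
    return sum(itemset <= t for t in transactions) / len(transactions)

# Rule: {bread} -> {milk}
antecedent, consequent = {"bread"}, {"milk"}
rule_support = support(antecedent | consequent)                        # P(bread and milk)
confidence = support(antecedent | consequent) / support(antecedent)    # P(milk | bread)

print(f"support = {rule_support:.2f}, confidence = {confidence:.2f}")
```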
Q.4) How does the KNN algorithm work? [7]
Apply KNN classification algorithm for the given dataset and predict the class for
X(P1 = 3, P2 = 7) (K = 3)
P1 P2 class
7 7 False
7 4 False
3 4 True
1 4 True
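A minimal sketch of the KNN computation for the data above: Euclidean distances from X(P1 = 3, P2 = 7) to every training point are computed, and the majority class among the K = 3 nearest neighbours is returned.

```python
from collections import Counter
from math import dist

# Training data from the question: (P1, P2) -> class
train = [((7, 7), "False"), ((7, 4), "False"), ((3, 4), "True"), ((1, 4), "True")]
x, k = (3, 7), 3

# Sort training points by Euclidean distance to the query point.
neighbours = sorted(train, key=lambda row: dist(row[0], x))[:k]

# Majority vote among the k nearest neighbours.
predicted = Counter(label for _, label in neighbours).most_common(1)[0][0]
print(neighbours)                       # the three nearest points and their classes
print("Predicted class:", predicted)    # -> True
```

With K = 3, the three nearest points are (3, 4), (1, 4) and (7, 7), so the majority class, and hence the prediction for X, is True.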
Q5) a) What is text mining? Explain the process of text mining [4]
Text mining is also known as text analysis. It is the process of transforming unstructured text into
structured data for easy analysis. Text mining uses natural language processing (NLP), which enables
machines to understand human language and process it automatically.
The text mining process contains the following steps to extract the data from the files which are as
follows –
• Text transformation
Text transformation is the step in which a document is represented in a form suitable for
mining; it also controls aspects such as the capitalization of the text.
The two major ways of document representation are:
Bag of words
Vector space
• Text Pre-processing
Pre-processing is a significant and critical step in text mining, natural language
processing (NLP), and information retrieval (IR). In the field of text mining, data pre-
processing is used for extracting useful information and knowledge from unstructured text
data. Information retrieval (IR) is a matter of choosing which documents in a collection
should be retrieved to fulfill the user's need.
• Feature selection:
Feature selection is a significant part of data mining. Feature selection can be defined as the
process of reducing the input of processing or finding the essential information sources. The
feature selection is also called variable selection.
• Data Mining:
Now, in this step, the text mining procedure merges with the conventional data mining process. Classic
data mining techniques are applied to the structured database that results from the previous steps.
• Evaluate:
Finally, the result is evaluated. Once the result has been evaluated, it is either put to use or discarded.
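A minimal sketch of the pre-processing and bag-of-words representation steps described above; the documents and stop-word list are hypothetical.

```python
import re
from collections import Counter

# Hypothetical document collection.
docs = ["Text mining turns unstructured text into structured data.",
        "Data mining finds patterns in structured data."]
stop_words = {"in", "into", "the", "a", "an"}

def preprocess(text):
    """Lower-case, tokenize, and remove stop words."""
    tokens = re.findall(r"[a-z]+", text.lower())
    return [t for t in tokens if t not in stop_words]

# Bag-of-words representation: one term-frequency vector per document.
bags = [Counter(preprocess(d)) for d in docs]
for bag in bags:
    print(bag)
```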
Q.5 b) Explain K-means algorithm. Apply K-means algorithm to group visitors to a website
into two groups using their age as follows: 15, 16, 19, 20, 21, 28, 35, 40, 42, 44, 60, 65 (Consider
initial centroids 16 and 28 of the two groups) [6]
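A minimal sketch of the K-means iterations for this question, starting from the initial centroids 16 and 28 given above; the same code can be applied to the data set of the next question by changing data and the initial centroids.

```python
# Visitor ages and initial centroids from the question.
data = [15, 16, 19, 20, 21, 28, 35, 40, 42, 44, 60, 65]
centroids = [16.0, 28.0]

while True:
    # Assignment step: each point goes to the cluster with the nearest centroid.
    clusters = [[], []]
    for x in data:
        nearest = min(range(len(centroids)), key=lambda i: abs(x - centroids[i]))
        clusters[nearest].append(x)

    # Update step: recompute each centroid as the mean of its cluster.
    new_centroids = [sum(c) / len(c) for c in clusters]
    if new_centroids == centroids:      # stop when the centroids no longer change
        break
    centroids = new_centroids

print("Clusters:", clusters)
print("Centroids:", centroids)
```

For this data the sketch converges to the two groups {15, 16, 19, 20, 21, 28} and {35, 40, 42, 44, 60, 65}, with final centroids of roughly 19.8 and 47.7.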
Q.5. a) Apply K-means algorithm for the given data set where K is the cluster number D = {2,
3, 4, 10, 11, 12, 20, 25, 30}, K = 2. [6]
Q.5. b) What are the different types of web mining? [4]
Web mining is the application of data mining techniques to automatically discover and extract
information from Web documents and services. The main purpose of web mining is to discover
useful information from the World Wide Web and its usage patterns.
Web mining can be broadly divided into three types of techniques: Web Content Mining, Web
Structure Mining, and Web Usage Mining. These are explained below.
1. Web Content Mining: Web content mining is the process of extracting useful information
from the content of web documents. Web content consists of several types of data: text,
images, audio, video, etc. Content data is the collection of facts that a web page was designed
to convey to its users. It can provide effective and interesting patterns about user needs. Mining
of text documents is related to text mining, machine learning, and natural language processing,
so this mining is also known as text mining. This type of mining scans and mines the text, images,
and groups of web pages according to the content of the input.
2. Web Structure Mining: Web structure mining is the process of discovering structure
information from the web. The structure of the web graph consists of web pages as nodes and
hyperlinks as edges connecting related pages. Structure mining basically shows the structured
summary of a particular website. It identifies the relationships between web pages linked by
information or direct link connections. Web structure mining can be very useful for determining
the connection between two commercial websites.
3. Web Usage Mining: Web usage mining is the process of identifying or discovering
interesting usage patterns from large data sets, and these patterns help in understanding user
behavior. In web usage mining, users' access data on the web is collected in the form of logs.
So, web usage mining is also called log mining.