
SPPU 2022 Question paper

IT 32: DATA WAREHOUSING AND DATA MINING

Q.1) Answer the following multiple choice questions. [20×½=10]

i) Identify the incorrect statement about K-means clustering.

a) K-means clustering is a method of vector quantization.
b) K-means is the same as the K-nearest neighbours algorithm.
c) K-means clustering aims to partition ‘n’ observations into K clusters.
d) K-means clustering produces the final estimate of cluster centroids.
ii) How many tiers are there in a data warehouse architecture?
a) 2 b) 1
c) 3 d) 4
iii) ______ is an intermediate storage area used for data processing in the ETL process
of data warehousing.
a) Buffer b) Virtual memory
c) Staging area d) Inter storage area
iv) ______ is a good alternative to the star schema.
a) Star schema b) Snowflake schema
c) Fact constellation d) Star-snowflake schema
v) In a snowflake schema which of the following types of tables are
considered?
a) Fact b) Dimension
c) Both fact and dimension d) None of the mentioned
vi) The role of ETL is to ______.
a) Find erroneous data
b) Fix erroneous data
c) Both finding and fixing erroneous data
d) Filtering of the data source
vii) ______ is a data transformation process.
a) Comparison b) Projection
c) Selection d) Filtering
viii) OLTP stands for ______.
a) Online Transaction Protocol
b) Online Transaction Processing
c) Online Terminal Protocol
d) Online Terminal Processing
ix) An approach in which the aggregated totals are stored in a multidimensional
database while the detailed data is stored in the relational database is a
______.
a) MOLAP b) ROLAP
c) HOLAP d) OLAP
x) Summary of data from an OLAP system can be presented in a ______.
a) Normalization b) Primary keys
c) Pivot Table d) Foreign keys
xi) Efficiency and scalability of data mining algorithms is related to ______.
a) Mining methodology b) User interaction
c) Diverse data types d) None of the mentioned
xii) Strategic value of data mining is ______.
a) Cost sensitive b) Work sensitive
c) Time-sensitive d) Technical-sensitive
xiii) If the ETL process fetches the data separately from the host server during
the automatic load to the data warehouse, one of the challenges involved
is ______.
a) the associated network may be down
b) it may end up pulling the incomplete /incorrect file.
c) it may end up connecting to an incorrect host server.
d) None
xiv) Web mining helps to improve the power of web search engines by identifying
______.
a) Web pages and classifying the web documents
b) XML documents
c) Text documents
d) Database
xv) Assigning data points to clusters and recomputing the centroids are
steps of which algorithm?
a) Apriori algorithm b) Bayesian classification
c) FP tree algorithm d) K-means
xvi) A disadvantage of the KNN algorithm is that it takes ______.
a) More time for training
b) More time for testing
c) Equal time for training
d) Equal time for testing
xvii) Which is not a characteristic of a data warehouse?
a) Volatile
b) Subject oriented
c) Non-volatile
d) Time variant
xviii) What is the first stage of the Kimball Lifecycle diagram?
a) Requirement Definition b) Dimensional Modelling
c) ETL Design Development d) Maintenance

xix) A connected region of a multidimensional space with a comparatively high density of objects is called ______.
a) Clustering b) Association
c) Classification d) Subset

Q.2) What is a data warehouse? Explain the need for and characteristics of a data warehouse. [5]

Data Warehouse: A data warehouse is a subject-oriented, integrated, time-variant, and non-volatile collection of data. This data helps analysts make informed decisions in an organization.

• Need for a data warehouse:

A data warehouse is needed for the following reasons:
1. Business users: Business users require a data warehouse to view summarized data from the past. Since these users are often non-technical, the data may need to be presented to them in an elementary form.
2. Store historical data: A data warehouse is required to store time-variant data from the past. This data can then be used for various purposes.
3. Make strategic decisions: Some strategies may depend upon the data in the data warehouse, so the data warehouse contributes to making strategic decisions.
4. For data consistency and quality: By bringing data from different sources to a common place, an organization can effectively enforce uniformity and consistency in the data.
5. High response time: A data warehouse has to be ready for somewhat unexpected loads and types of queries, which demands a significant degree of flexibility and a quick response time.

A data warehouse also helps:
• Users who need access to large amounts of data
• Users who need customized, complex processes to obtain information from various data sources
• Users who want simple technology to access the data
• Users who require a systematic approach to making decisions
• Users who require fast performance on huge amounts of data, which is necessary for charts, grids, and reports
• Users who want to identify hidden patterns of grouping and data flows
Characteristics of a data warehouse:

• Subject Oriented - A data warehouse is subject oriented because it provides information around a subject rather than the organization's ongoing operations. These subjects can be product, customers, suppliers, sales, revenue, etc. A data warehouse does not focus on ongoing operations; rather, it focuses on modelling and analysis of data for decision making.
• Integrated – A data warehouse is constructed by integrating data from heterogeneous sources
such as relational databases, flat files, etc. This integration enhances the effective analysis of
data.
• Time Variant - The data collected in a data warehouse is identified with a particular time
period. The data in a data warehouse provides information from the historical point of view.
• Non-volatile - Non-volatile means the previous data is not erased when new data is added. A data warehouse is kept separate from the operational database, and therefore frequent changes in the operational database are not reflected in the data warehouse.

Q.2) Explain the schemas of a data warehouse. [5]

Schemas of a data warehouse:
• Star Schema
• Snowflake Schema
• Fact Constellation Schema
Star Schema
A star schema is the elementary form of a dimensional model, in which data are organized into facts and dimensions. A fact is an event that is counted or measured, such as a sale or a login. A dimension contains reference data about the fact, such as date, item, or customer.

A star schema is a relational schema whose design represents a multidimensional data model. The star schema is the simplest data warehouse schema. It is known as a star schema because the entity-relationship diagram of this schema resembles a star, with points diverging from a central table. The center of the schema consists of a large fact table, and the points of the star are the dimension tables. A small illustrative sketch is given below.
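
The idea can be illustrated with a tiny sketch in Python using pandas; the table and column names (sales_fact, date_dim, item_dim, units_sold) are hypothetical and chosen only for illustration:

import pandas as pd

date_dim = pd.DataFrame({"date_key": [1, 2], "month": ["Jan", "Feb"], "year": [2022, 2022]})
item_dim = pd.DataFrame({"item_key": [10, 20], "item_name": ["Mobile", "Modem"]})
sales_fact = pd.DataFrame({          # central fact table
    "date_key": [1, 1, 2],           # foreign keys into the dimension tables
    "item_key": [10, 20, 10],
    "units_sold": [5, 2, 7],         # the measure being analyzed
})

# A typical star-schema query: join the fact table to its dimensions,
# then aggregate the measure by dimension attributes.
report = (sales_fact
          .merge(date_dim, on="date_key")
          .merge(item_dim, on="item_key")
          .groupby(["year", "month", "item_name"])["units_sold"].sum())
print(report)
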
Snowflake Schema

• We can create even more complex star schemas by normalizing a dimension table into several tables. The normalized dimension table is called a snowflake.
• A snowflake schema is a variant of the star schema: a schema is known as a snowflake if one or more dimension tables do not connect directly to the fact table but must join through other dimension tables.
• The snowflake schema is an expansion of the star schema where each point of the star explodes into more points. It is called a snowflake schema because its diagram resembles a snowflake. Snowflaking is a method of normalizing the dimension tables in a star schema. When we normalize all the dimension tables entirely, the resultant structure resembles a snowflake with the fact table in the middle.

Fact Constellation Schema

A fact constellation means two or more fact tables sharing one or more dimensions. It is also called a galaxy schema.

A fact constellation schema describes the logical structure of a data warehouse or data mart. It can be designed with a collection of de-normalized fact tables and shared, conformed dimension tables.

A fact constellation schema is a sophisticated design in which it is difficult to summarize information. It can be implemented between aggregate fact tables or by decomposing a complex fact table into independent, simpler fact tables.
Q.2) Explain the Kimball Lifecycle diagram in detail.

The Kimball methodology is intended for designing, developing, and deploying data warehouse/business intelligence systems.

o It starts with one data mart (e.g., sales); later on, additional data marts are added (e.g., collection, marketing, etc.)
o Data flows from the sources into the data marts, then into the data warehouse
o The Kimball approach is faster to implement as it is delivered in stages
Before going into the details of the methodology, let us take a quick look at the lifecycle itself.

The Kimball Lifecycle diagram illustrates the flow of a data warehouse implementation. It identifies task sequencing and highlights activities that should happen concurrently. Activities may need to be customized to address the unique needs of the organization. Also, not every detail of every lifecycle task will be required on every project; this has to be decided as per need.

As per the Kimball Lifecycle, we start building a data warehouse by understanding the business requirements and determining how best to add value to the organization. The organization must agree on what the value of this data is before deciding to build a data warehouse to hold it. Once the requirements are gathered, the implementation phase begins with design steps across three different tracks: technology, data, and BI applications. Once the implementation is done, the Lifecycle comes back together to deploy the query tools, reports, and applications to the user community.

The incremental approach of the Lifecycle helps to deliver business value in a short span of time and, at the same time, helps to build an enterprise-wide information resource in the long term.
Q.2) What is a data warehouse? Explain the properties of the data warehouse architecture.
Data warehouse: A data warehouse is a subject-oriented, integrated, time-variant, and non-volatile collection of data. This data helps analysts make informed decisions in an organization.
Three-tier architecture: This is the most widely used architecture. It consists of the:

1. Top,
2. Middle, and
3. Bottom tier.
Bottom Tier: The database of the data warehouse serves as the bottom tier. It is usually a relational database system. Data is cleansed, transformed, and loaded into this layer using back-end tools.
Middle Tier: The middle tier of a data warehouse is an OLAP server, implemented using either the ROLAP or the MOLAP model. For a user, this application tier presents an abstracted view of the database. This layer also acts as a mediator between the end user and the database.
Top Tier: The top tier is a front-end client layer. It holds the tools and APIs used to connect to and get data out of the data warehouse, such as query tools, reporting tools, managed query tools, analysis tools, and data mining tools.
Q.3) What is ETL? Explain data pre-processing techniques in detail.

ETL : Extract, transform, and load (ETL) is the process of combining data from multiple sources into
a large, central repository called a data warehouse. ETL uses a set of business rules to clean and
organize raw data and prepare it for storage, data analytics, and machine learning (ML).

• Data pre-processing is an important task. It is a data mining technique that transforms raw data into a more understandable, useful, and efficient format.
Why is data pre-processing required?

Real-world data is generally:
• Incomplete: Certain attributes or values, or both, are missing, or only aggregate data is available.
• Noisy: The data contains errors or outliers.
• Inconsistent: The data contains discrepancies in codes or names, etc.

Tasks in data preprocessing

1. Data Cleaning:
The data can have many irrelevant and missing parts. To handle this, data cleaning is done. It involves handling missing data, noisy data, etc.

• (a) Missing Data:

This situation arises when some values are missing in the data. It can be handled in various ways. Some of them are:
1. Ignore the tuples:
This approach is suitable only when the dataset is quite large and multiple values are missing within a tuple.

2. Fill in the missing values:

There are various ways to do this. You can choose to fill in the missing values manually, by the attribute mean, or by the most probable value.

• (b) Noisy Data:

Noisy data is meaningless data that cannot be interpreted by machines. It can be generated due to faulty data collection, data entry errors, etc. It can be handled in the following ways (a short sketch follows):
1. Binning Method:
This method works on sorted data in order to smooth it. The whole data is divided into segments of equal size, and various methods are then applied to each segment. Each segment is handled separately: one can replace all the data in a segment by its mean, or boundary values can be used to complete the task.
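
A minimal sketch of these cleaning steps, assuming pandas is available; the column name "age" and the values are made up for illustration:

import pandas as pd

df = pd.DataFrame({"age": [15, 16, None, 20, 21, 28, 35, 40, 120]})  # None is missing, 120 is noisy

# (a) Missing data: fill the missing value with the attribute mean.
df["age"] = df["age"].fillna(df["age"].mean())

# (b) Noisy data, binning method: sort, split into equal-frequency bins,
#     then smooth each bin by replacing its values with the bin mean.
df = df.sort_values("age").reset_index(drop=True)
df["bin"] = pd.qcut(df["age"], q=3, labels=False)            # 3 equal-frequency bins
df["age_smoothed"] = df.groupby("bin")["age"].transform("mean")
print(df)
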

2. Data Transformation:
This step transforms the data into forms appropriate for the mining process. It involves the following ways (a short sketch follows this list):
1. Normalization:
It is done in order to scale the data values to a specified range, such as -1.0 to 1.0 or 0.0 to 1.0.

2. Attribute Selection:
In this strategy, new attributes are constructed from the given set of attributes to help the
mining process.

3. Discretization:
This is done to replace the raw values of numeric attribute by interval levels or conceptual
levels.

4. Concept Hierarchy Generation:

Here attributes are converted from a lower level to a higher level in a hierarchy. For example, the attribute “city” can be generalized to “country”.
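
A short sketch of normalization, discretization, and concept hierarchy generation, assuming plain pandas; the columns ("salary", "city") and the lookup table are hypothetical:

import pandas as pd

df = pd.DataFrame({"salary": [20000, 35000, 50000, 80000],
                   "city": ["Pune", "Mumbai", "Pune", "Nagpur"]})

# Normalization: rescale salary to the range 0.0 - 1.0 (min-max scaling).
s = df["salary"]
df["salary_norm"] = (s - s.min()) / (s.max() - s.min())

# Discretization: replace raw salary values by interval labels.
df["salary_band"] = pd.cut(df["salary"], bins=3, labels=["low", "medium", "high"])

# Concept hierarchy generation: map the lower-level attribute "city"
# to the higher-level attribute "state" using a lookup table.
city_to_state = {"Pune": "Maharashtra", "Mumbai": "Maharashtra", "Nagpur": "Maharashtra"}
df["state"] = df["city"].map(city_to_state)
print(df)
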
3. Data Reduction:
Data mining is used to handle huge amounts of data, and analysis becomes harder as the volume of data grows. To address this, data reduction techniques are used. They aim to increase storage efficiency and reduce data storage and analysis costs.
The various steps to data reduction are:
1. Data Cube Aggregation:
Aggregation operation is applied to data for the construction of the data cube.

2. Attribute Subset Selection:

Only the highly relevant attributes should be used; the rest can be discarded. For attribute selection, one can use the level of significance and the p-value of the attribute: attributes having a p-value greater than the significance level can be discarded.

3. Numerosity Reduction:
This enables storing a model of the data instead of the whole data, for example, regression models.

4. Dimensionality Reduction:
This reduces the size of the data using encoding mechanisms. It can be lossy or lossless. If the original data can be retrieved after reconstruction from the compressed data, the reduction is called lossless; otherwise it is called lossy. Two effective methods of dimensionality reduction are wavelet transforms and PCA (Principal Component Analysis); a short PCA sketch follows.
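
A minimal PCA sketch, assuming scikit-learn is installed; the data is synthetic and purely illustrative:

import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 5))            # 100 records with 5 attributes

pca = PCA(n_components=2)                # keep only 2 principal components
X_reduced = pca.fit_transform(X)

print(X_reduced.shape)                   # (100, 2)
print(pca.explained_variance_ratio_)     # variance retained by each component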

Q.3) What is OLAP? Describe the characteristics of OLAP.

OLAP:
Online Analytical Processing (OLAP) is based on the multidimensional data model. It allows managers and analysts to gain insight into information through fast, consistent, and interactive access to data.
characteristics of OLAP

The main characteristics of OLAP are as follows:

1. Multidimensional conceptual view: OLAP systems let business users have a dimensional and
logical view of the data in the data warehouse. It helps in carrying slice and dice operations.
2. Multi-User Support: Since OLAP systems are shared, the OLAP operation should provide normal database operations, including retrieval, update, concurrency control, integrity, and security.
3. Accessibility: OLAP acts as a mediator between data sources (e.g., data warehouses) and the OLAP front end.
4. Storing OLAP results: OLAP results are kept separate from data sources.
5. Uniform reporting performance: Increasing the number of dimensions or the size of the database should not significantly degrade the reporting performance of the OLAP system.
6. OLAP provides for distinguishing between zero values and missing values so that aggregates
are computed correctly.
7. OLAP system should ignore all missing values and compute correct aggregate values.
8. OLAP facilitates interactive querying and complex analysis for users.
9. OLAP allows users to drill down for greater detail or roll up to aggregate metrics along a single business dimension or across multiple dimensions.
10. OLAP provides the ability to perform intricate calculations and comparisons.
11. OLAP presents results in a number of meaningful ways, including charts and graphs.
Q.3) Describe ETL. What are the tasks to be performed during data transformation? [5]
ETL : Extract, transform, and load (ETL) is the process of combining data from multiple sources into
a large, central repository called a data warehouse. ETL uses a set of business rules to clean and
organize raw data and prepare it for storage, data analytics, and machine learning (ML).
tasks to be performed during data transformation
Data transformation in data mining refers to the process of converting raw data into a format that is
suitable for analysis and modeling. The goal of data transformation is to prepare the data for data
mining so that it can be used to extract useful insights and knowledge. Data transformation
typically involves several steps, including:
1. Data cleaning: Removing or correcting errors, inconsistencies, and missing values in the data.
2. Data integration: Combining data from multiple sources, such as databases and spreadsheets,
into a single format.
3. Data normalization: Scaling the data to a common range of values, such as between 0 and 1,
to facilitate comparison and analysis.
4. Data reduction: Reducing the dimensionality of the data by selecting a subset of relevant
features or attributes.
5. Data discretization: Converting continuous data into discrete categories or bins.
6. Data aggregation: Combining data at different levels of granularity, such as by summing or
averaging, to create new features or attributes.
Data transformation is an important step in the data mining process, as it helps to ensure that the data is in a format suitable for analysis and modeling and that it is free of errors and inconsistencies. Data transformation can also help to improve the performance of data mining algorithms by reducing the dimensionality of the data and by scaling the data to a common range of values. A small end-to-end sketch is given below.
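
A hedged, end-to-end sketch that strings several of these tasks into a tiny ETL flow (extract, transform, load), assuming pandas and the standard-library sqlite3 module; the table and column names are made up:

import sqlite3
import pandas as pd

# Extract: in practice this would be pd.read_csv(...) or a database query;
# here a small inline frame stands in for the extracted source data.
raw = pd.DataFrame({"region": ["East", "East", "West", "West"],
                    "amount": [100.0, None, 250.0, 150.0]})

# Transform: cleaning, normalization, and aggregation (tasks 1, 3, and 6 above).
clean = raw.dropna(subset=["amount"])                      # data cleaning
clean = clean.assign(
    amount_norm=(clean["amount"] - clean["amount"].min())
                / (clean["amount"].max() - clean["amount"].min()))   # 0-1 scaling
summary = clean.groupby("region", as_index=False)["amount"].sum()    # aggregation

# Load: write the transformed data into the warehouse (an in-memory SQLite DB here).
con = sqlite3.connect(":memory:")
summary.to_sql("sales_summary", con, if_exists="replace", index=False)
print(pd.read_sql("SELECT * FROM sales_summary", con))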

Q.3) What are the basic operations of OLAP? [5]

OLAP servers are based on a multidimensional view of data, so the OLAP operations below are described for multidimensional data.

Here is the list of OLAP operations −


• Roll-up
• Drill-down
• Slice and dice
• Pivot
1. Roll-up
Roll-up performs aggregation on a data cube in any of the following ways −
• By climbing up a concept hierarchy for a dimension
• By dimension reduction
For example, consider roll-up performed by climbing up the concept hierarchy for the dimension location:
• Initially, the concept hierarchy was "street < city < province < country".
• On rolling up, the data is aggregated by ascending the location hierarchy from the level of city to the level of country.
• The data is grouped into countries rather than cities.
• When roll-up is performed by dimension reduction, one or more dimensions are removed from the data cube.

2. Drill-down

Drill-down is the reverse operation of roll-up. It is performed in the following ways −

• By stepping down a concept hierarchy for a dimension
• By introducing a new dimension

For example, consider drill-down performed by stepping down the concept hierarchy for the dimension time:
• Initially, the concept hierarchy was "day < month < quarter < year".
• On drilling down, the time dimension is descended from the level of quarter to the level of month.
• When drill-down is performed by introducing a new dimension, one or more dimensions are added to the data cube.
• Drill-down navigates from less detailed data to highly detailed data.

3. Slice
The slice operation selects one particular dimension from the given cube and provides a new sub-cube.

For example:
• Here the slice is performed on the dimension "time" using the criterion time = "Q1".
• It forms a new sub-cube by fixing that single dimension at the selected value.

4. Dice

Dice selects two or more dimensions from a given cube and provides a new sub-cube.

For example, a dice operation on the cube based on the following selection criteria involves three dimensions:
• (location = "Toronto" or "Vancouver")
• (time = "Q1" or "Q2")
• (item = "Mobile" or "Modem")
5. Pivot
The pivot operation is also known as rotation. It rotates the data axes in order to provide an alternative presentation of the data, for example by turning rows into columns. A combined sketch of these operations is given below.
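
A rough sketch of these operations expressed with pandas on a tiny, hypothetical sales cube; the values and column names are invented for illustration:

import pandas as pd

cube = pd.DataFrame({
    "city":    ["Toronto", "Toronto", "Vancouver", "Vancouver"],
    "country": ["Canada", "Canada", "Canada", "Canada"],
    "quarter": ["Q1", "Q2", "Q1", "Q2"],
    "item":    ["Mobile", "Modem", "Mobile", "Modem"],
    "sales":   [100, 150, 200, 250],
})

# Roll-up: climb the location hierarchy from city to country.
rollup = cube.groupby(["country", "quarter"])["sales"].sum()

# Slice: fix one dimension (time = "Q1") to get a sub-cube.
slice_q1 = cube[cube["quarter"] == "Q1"]

# Dice: select on two or more dimensions at once.
dice = cube[cube["city"].isin(["Toronto", "Vancouver"]) & cube["quarter"].isin(["Q1", "Q2"])]

# Pivot: rotate the axes for an alternative presentation.
pivot = cube.pivot_table(values="sales", index="city", columns="quarter", aggfunc="sum")
print(rollup, slice_q1, dice, pivot, sep="\n\n")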

Q.4) What is Data mining? Explain the architecture of Data mining. [3]

Data Mining is a process of discovering interesting patterns and knowledge from large amounts of
data. The data sources can include databases, data warehouses, the web, and other information
repositories or data that are streamed into the system dynamically.

Architecture of Data mining


Data Source:

The actual source of data is the Database, data warehouse, World Wide Web (WWW), text files, and
other documents. You need a huge amount of historical data for data mining to be successful.
Organizations typically store data in databases or data warehouses. Data warehouses may comprise
one or more databases, text files, spreadsheets, or other repositories of data. Sometimes, even plain text
files or spreadsheets may contain information. Another primary source of data is the World Wide Web
or the internet.

Different processes:

Before passing the data to the database or data warehouse server, the data must be cleaned, integrated, and selected. As the information comes from various sources and in different formats, it cannot be used directly for the data mining procedure, because the data may not be complete and accurate. So, the data first needs to be cleaned and unified. More information than needed will be collected from various data sources, and only the data of interest has to be selected and passed to the server. These procedures are not as easy as they sound: several methods may be performed on the data as part of selection, integration, and cleaning.

Database or Data Warehouse Server:

The database or data warehouse server contains the actual data that is ready to be processed. Hence, the server is responsible for retrieving the relevant data based on the user's data mining request.

Data Mining Engine:

The data mining engine is a major component of any data mining system. It contains several modules
for operating data mining tasks, including association, characterization, classification, clustering,
prediction, time-series analysis, etc.

In other words, the data mining engine is the core of the data mining architecture. It comprises instruments and software used to obtain insights and knowledge from data collected from various data sources and stored within the data warehouse.

Pattern Evaluation Module:

The pattern evaluation module is primarily responsible for measuring how interesting a discovered pattern is, using a threshold value. It collaborates with the data mining engine to focus the search on interesting patterns.

This module commonly employs interestingness measures that cooperate with the data mining modules to focus the search towards interesting patterns. It might utilize an interestingness threshold to filter out discovered patterns. Alternatively, the pattern evaluation module might be integrated with the mining module, depending on the implementation of the data mining techniques used. For efficient data mining, it is highly recommended to push the evaluation of pattern interestingness as deep as possible into the mining procedure so as to confine the search to only interesting patterns.
Graphical User Interface:

The graphical user interface (GUI) module communicates between the data mining system and the
user. This module helps the user to easily and efficiently use the system without knowing the
complexity of the process. This module cooperates with the data mining system when the user specifies
a query or a task and displays the results.

Knowledge Base:

The knowledge base is helpful in the entire data mining process. It might be used to guide the search or to evaluate the interestingness of the resulting patterns. The knowledge base may even contain user views and data from user experiences that might be helpful in the data mining process. The data mining engine may receive inputs from the knowledge base to make the results more accurate and reliable. The pattern evaluation module regularly interacts with the knowledge base to get inputs and also to update it.
Q.4) Apply the FP-tree algorithm to construct the FP-tree and find the frequent itemsets for the
dataset given below (minimum support = 30%).

Transaction ID List of Products


1 Apple, Berries, Coconut
2 Berries, Coconut, Dates
3 Coconut, Dates
4 Berries, Dates
5 Apple, Coconut
6 Apple, Coconut, Dates
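
The FP-tree construction itself is not reproduced here, but the frequent itemsets can be cross-checked programmatically. This is a sketch assuming the third-party mlxtend library (which implements FP-growth) is installed:

import pandas as pd
from mlxtend.preprocessing import TransactionEncoder
from mlxtend.frequent_patterns import fpgrowth

transactions = [
    ["Apple", "Berries", "Coconut"],
    ["Berries", "Coconut", "Dates"],
    ["Coconut", "Dates"],
    ["Berries", "Dates"],
    ["Apple", "Coconut"],
    ["Apple", "Coconut", "Dates"],
]

# One-hot encode the transactions, then run FP-growth.
te = TransactionEncoder()
onehot = pd.DataFrame(te.fit(transactions).transform(transactions), columns=te.columns_)
# min_support = 0.3 means an itemset must appear in at least 2 of the 6 transactions.
frequent = fpgrowth(onehot, min_support=0.3, use_colnames=True)
print(frequent.sort_values("support", ascending=False))
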
Q.4) Explain data mining techniques in brief [3]
Data Mining is a process of discovering interesting patterns and knowledge from large amounts of
data. The data sources can include databases, data warehouses, the web, and other information
repositories or data that are streamed into the system dynamically.

In recent data mining projects, various major data mining techniques have been developed and used,
including association, classification, clustering, prediction, sequential patterns, and regression.

1. Classification:

This technique is used to obtain important and relevant information about data and metadata. This data
mining technique helps to classify data in different classes.

2. Clustering:

Clustering is the division of information into groups of related objects. Describing the data by a few clusters loses certain fine details but achieves simplification. Clustering models data by its clusters. Historically, clustering is rooted in statistics, mathematics, and numerical analysis. From a machine learning point of view, clusters relate to hidden patterns, the search for clusters is unsupervised learning, and the resulting framework represents a data concept. From a practical point of view, clustering plays an important role in data mining applications such as scientific data exploration, text mining, information retrieval, spatial database applications, CRM, web analysis, computational biology, medical diagnostics, and much more.
3. Regression:

Regression analysis is a data mining process used to identify and analyze the relationship between variables in the presence of other factors. It is used to estimate the value or probability of a specific variable. Regression is primarily a form of planning and modeling. For example, we might use it to project certain costs, depending on other factors such as availability, consumer demand, and competition. Primarily, it gives the relationship between two or more variables in the given data set.

4. Association Rules:

This data mining technique helps to discover a link between two or more items. It finds a hidden pattern
in the data set.

Association rules are if-then statements that help to show the probability of interactions between data items within large data sets in different types of databases. Association rule mining has several applications and is commonly used to find sales correlations in transactional data or patterns in medical data sets.

5. Outlier Detection:

This type of data mining technique relates to the observation of data items in the data set which do not match an expected pattern or expected behavior. It may be used in various domains like intrusion detection, fraud detection, etc. It is also known as outlier analysis or outlier mining. An outlier is a data point that diverges too much from the rest of the dataset, and the majority of real-world datasets have outliers. Outlier detection plays a significant role in the data mining field and is valuable in numerous areas like network intrusion identification, credit or debit card fraud detection, detecting outliers in wireless sensor network data, etc.

6. Sequential Patterns:

Sequential pattern mining is a data mining technique specialized for evaluating sequential data to discover sequential patterns. It comprises finding interesting subsequences in a set of sequences, where the interestingness of a sequence can be measured in terms of different criteria like length, occurrence frequency, etc.

In other words, this technique of data mining helps to discover or recognize similar patterns in
transaction data over some time.

7. Prediction:

Prediction uses a combination of other data mining techniques such as trends, clustering, classification, etc. It analyzes past events or instances in the right sequence to predict a future event.
Q.4) How does the KNN algorithm work? [7]
Apply KNN classification algorithm for the given dataset and predict the class for
X(P1 = 3, P2 = 7) (K = 3)
P1 P2 class
7 7 False
7 4 False
3 4 True
1 4 True
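
A short sketch of the prediction step in plain Python: compute Euclidean distances from X(3, 7) to every training point, take the K = 3 nearest neighbours, and predict by majority vote:

from collections import Counter
from math import dist

train = [((7, 7), "False"), ((7, 4), "False"), ((3, 4), "True"), ((1, 4), "True")]
query, k = (3, 7), 3

# Sort the training points by distance to the query and keep the K nearest.
neighbours = sorted(train, key=lambda row: dist(query, row[0]))[:k]
print([(p, c, round(dist(query, p), 2)) for p, c in neighbours])
# Nearest three: (3,4) True (d=3.0), (1,4) True (d≈3.61), (7,7) False (d=4.0)

# Majority vote among the K nearest neighbours.
predicted = Counter(c for _, c in neighbours).most_common(1)[0][0]
print("Predicted class for X(3, 7):", predicted)    # -> True
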
Q5) a) What is text mining? Explain the process of text mining [4]
Text mining is also known as text analysis. It is the process of transforming unstructured text into structured data for easy analysis. Text mining relies on natural language processing (NLP), which enables machines to understand human language and process it automatically.
The text mining process involves the following steps to extract data from documents:

• Text transformation
Text transformation is the step in which the document is represented in a structured form. The two major ways of document representation are:
Bag of Words
Vector Space
• Text Pre-processing
Pre-processing is a significant task and a critical step in Text Mining, Natural Language
Processing (NLP), and information retrieval(IR). In the field of text mining, data pre-
processing is used for extracting useful information and knowledge from unstructured text
data. Information Retrieval (IR) is a matter of choosing which documents in a collection
should be retrieved to fulfill the user's need.
• Feature selection:
Feature selection is a significant part of data mining. Feature selection can be defined as the
process of reducing the input of processing or finding the essential information sources. The
feature selection is also called variable selection.
• Data Mining:
In this step, the text mining procedure merges with the conventional data mining process. Classic data mining techniques are applied to the structured data derived from the text.
• Evaluate:
Finally, the results are evaluated. Once a result has been evaluated, it can be stored for later use or discarded. A small bag-of-words sketch is given below.
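
A minimal bag-of-words sketch, assuming a recent scikit-learn is available; the two sample sentences are made up for illustration:

from sklearn.feature_extraction.text import CountVectorizer

docs = ["data mining finds patterns in data",
        "text mining turns unstructured text into structured data"]

vectorizer = CountVectorizer()              # tokenizes, lowercases, builds the vocabulary
X = vectorizer.fit_transform(docs)          # document-term count matrix (sparse)

print(vectorizer.get_feature_names_out())   # the learned vocabulary
print(X.toarray())                          # each row is one document's term counts
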
Q.5 b) Explain the K-means algorithm. Apply the K-means algorithm to group visitors to a website
into two groups using their ages: 15, 16, 19, 20, 21, 28, 35, 40, 42, 44, 60, 65 (consider
initial centroids 16 and 28 for the two groups). [6]
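
A small 1-D K-means sketch for this question in plain Python, seeded with the given initial centroids 16 and 28; the loop structure is illustrative, not a library implementation:

ages = [15, 16, 19, 20, 21, 28, 35, 40, 42, 44, 60, 65]
centroids = [16.0, 28.0]

for _ in range(10):                                         # a few iterations are enough here
    clusters = [[], []]
    for x in ages:                                          # assignment step: nearest centroid
        nearest = min((abs(x - c), i) for i, c in enumerate(centroids))[1]
        clusters[nearest].append(x)
    new_centroids = [sum(c) / len(c) for c in clusters]     # update step: recompute the means
    if new_centroids == centroids:                          # stop once the centroids are stable
        break
    centroids = new_centroids

print(clusters)    # -> [[15, 16, 19, 20, 21, 28], [35, 40, 42, 44, 60, 65]]
print(centroids)   # -> roughly [19.83, 47.67]
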
Q.5 a) Apply the K-means algorithm to the given data set, where K is the number of clusters:
D = {2, 3, 4, 10, 11, 12, 20, 25, 30}, K = 2. [6]
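
The sketch above applies directly to this dataset as well: taking D = {2, 3, 4, 10, 11, 12, 20, 25, 30} with K = 2 and seeding with, say, 2 and 25 (the question does not fix the initial centroids, so this choice is an assumption), the assignment and update steps converge to the clusters {2, 3, 4, 10, 11, 12} and {20, 25, 30} with means 7 and 25.
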
Q.5. b) What are the different types of web mining? [4]

Web mining is the application of data mining techniques to automatically discover and extract information from Web documents and services. The main purpose of web mining is to discover useful information from the World Wide Web and its usage patterns.
Web mining can be broadly divided into three different types of techniques of mining: Web
Content Mining, Web Structure Mining, and Web Usage Mining. These are explained as following
below.

1. Web Content Mining: Web content mining is the application of extracting useful information from the content of web documents. Web content consists of several types of data: text, image, audio, video, etc. Content data is the set of facts that a web page was designed to convey. It can provide effective and interesting patterns about user needs. Mining text documents relates to text mining, machine learning, and natural language processing, so this mining is also known as text mining. It performs scanning and mining of text, images, and groups of web pages according to the content of the input.
2. Web Structure Mining: Web structure mining is the application of discovering structure
information from the web. The structure of the web graph consists of web pages as nodes, and
hyperlinks as edges connecting related pages. Structure mining basically gives a structured summary of a particular website. It identifies relationships between web pages linked by information or direct link connections. Web structure mining can be very useful, for example, to determine the connection between two commercial websites.
3. Web Usage Mining: Web usage mining is the application of identifying or discovering interesting usage patterns from large sets of web access data. These patterns help in understanding user behaviour. In web usage mining, the data about users' accesses to the web is collected in the form of logs, so web usage mining is also called log mining.
