Question With Answer
Multifeature cubes compute complex queries involving multiple dependent aggregates at
multiple granularities. These cubes are very useful in practice. Many complex data mining
queries can be answered by multifeature cubes without any significant increase in
computational cost, in comparison to cube computation for simple queries with standard data
cubes.
If an on-line operational database system is used for efficient retrieval, efficient storage, and
management of large amounts of data, then the system is said to perform on-line transaction
processing (OLTP). Data warehouse systems serve users or knowledge workers in the role of data
analysis and decision making. Such systems can organize and present data in various formats.
These systems are known as on-line analytical processing (OLAP) systems.
Metadata are data about data. When used in a data warehouse, metadata are the data that define
warehouse objects. Metadata are created for the data names and definitions of the given
warehouse. Additional metadata are created and captured for time stamping any extracted
data, the source of the extracted data, and missing fields that have been added by data cleaning
or integration processes.
4. Explain the differences between star and snowflake schema. (Nov/Dec 2008)
The dimension table of the snowflake schema model may be kept in normalized form to
reduce redundancies. Such a table is easy to maintain and saves storage space.
In data transformation, the data are transformed or consolidated into forms appropriate for
mining. Data transformation can involve the following: smoothing, aggregation,
generalization, normalization, and attribute construction.
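To make the normalization step concrete, here is a minimal Python sketch of min-max normalization; the attribute values and the target range [0.0, 1.0] are invented for illustration and are not from the text above.

# Min-max normalization: rescale a numeric attribute to a new range.
def min_max_normalize(values, new_min=0.0, new_max=1.0):
    old_min, old_max = min(values), max(values)
    span = old_max - old_min
    return [(v - old_min) / span * (new_max - new_min) + new_min
            for v in values]

incomes = [12000, 45000, 73000, 98000]   # hypothetical 'income' attribute
print(min_max_normalize(incomes))        # [0.0, 0.3837..., 0.7093..., 1.0]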
The slice operation performs a selection on one dimension of the cube, resulting in a subcube.
The dice operation defines a subcube by performing a selection on two or more dimensions.
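As a rough illustration of these two OLAP operations, the following Python/NumPy sketch slices and dices a toy 3-D cube; the dimension names (time, item, location) and sizes are assumptions made for the example.

import numpy as np

# A toy data cube with dimensions (time, item, location).
cube = np.arange(2 * 3 * 4).reshape(2, 3, 4)

# Slice: a selection on ONE dimension (time == 0) yields a subcube
# with one fewer dimension.
slice_t0 = cube[0, :, :]      # shape (3, 4)

# Dice: a selection on TWO OR MORE dimensions yields a smaller
# subcube of the same rank.
diced = cube[:, 0:2, 1:3]     # shape (2, 2, 2)

print(slice_t0.shape, diced.shape)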
Handling of relational and complex types of data: Because relational databases and data
warehouses are widely used, the development of efficient and effective data mining systems
for such data is important. Mining information from heterogeneous databases and global
information systems: Local- and wide-area computer networks (such as the Internet) connect
many sources of data, forming huge, distributed, and heterogeneous databases.
The bitmap indexing method is popular in OLAP products because it allows quick searching
in data cubes. The bitmap index is an alternative representation of the record ID (RID) list.
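A minimal sketch of the idea, with an invented 'city' column: each distinct attribute value gets a bit vector whose i-th bit is set when record i (RID i) holds that value.

# Build one bit vector per distinct value of a column.
def build_bitmap_index(column):
    index = {}
    for rid, value in enumerate(column):
        index.setdefault(value, [0] * len(column))[rid] = 1
    return index

city = ["Delhi", "Mumbai", "Delhi", "Chennai", "Mumbai"]
bitmaps = build_bitmap_index(city)
print(bitmaps["Delhi"])   # [1, 0, 1, 0, 0]
# Bitwise AND/OR of such vectors answers multi-value predicates quickly,
# which is why bitmap indexing suits OLAP-style queries.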
A fact table contains the names of the facts (or measures) as well as keys to each of the related
dimension tables. A dimension table is used to describe a dimension; e.g., a dimension table
for item may contain the attributes item_name, brand, and type.
12. Briefly discuss the schemas for multidimensional databases. (May/June 2010)
Star schema: The most common modeling paradigm is the star schema, in which the data
warehouse contains (1) a large central table (fact table) containing the bulk of the data, with
no redundancy, and (2) a set of smaller attendant tables (dimension tables), one for each
dimension.
Snowflake schema: The snowflake schema is a variant of the star schema model, where some
dimension tables are normalized, thereby further splitting the data into additional tables. The
resulting schema graph forms a shape similar to a snowflake.
Fact constellation: Sophisticated applications may require multiple fact tables to share
dimension tables. This kind of schema can be viewed as a collection of stars, and hence is
called a galaxy schema or a fact constellation.
13. How is a data warehouse different from a database? How are they similar? (Nov/Dec
2007, Nov/Dec 2010)
14. Distinguish between descriptive and predictive data mining.
Descriptive data mining describes data in a concise and summative manner and presents
interesting general properties of the data. Predictive data mining analyzes data in order to
construct one or a set of models and attempts to predict the behavior of new data sets; it
includes tasks such as classification, regression analysis, and trend analysis.
15. List out the functions of OLAP servers in the data warehouse architecture. (Nov/Dec
2010)
The OLAP server performs multidimensional queries on the data and stores the results in its
multidimensional storage. It speeds analysis by aggregating fact tables into cubes, stores the
cubes until needed, and then quickly returns the data to clients.
Data mining refers to extracting or “mining” knowledge from large amounts of data. The term
is actually a misnomer. Remember that the mining of gold from rocks or sand is referred to as
gold mining rather than rock or sand mining. Thus, data mining should have been more
appropriately named “knowledge mining from data.” A data warehouse is usually modeled by a
multidimensional database structure, where each dimension corresponds to an attribute or a set
of attributes in the schema, and each cell stores the value of some aggregate measure, such as
count or sales amount.
The term data mining is a synonym for another popularly used term, Knowledge Discovery
from Data, or KDD. Alternatively, others view data mining as simply an essential step in the
process of knowledge discovery. Knowledge discovery is a process consisting of an iterative
sequence of the following steps:
1. Data cleaning (to remove noise and inconsistent data)
2. Data integration (where multiple data sources may be combined)
3. Data selection (where data relevant to the analysis task are retrieved from the database)
4. Data transformation (where data are transformed or consolidated into forms appropriate
for mining by performing summary or aggregation operations, for instance)
5. Data mining (an essential process where intelligent methods are applied in order to extract
data patterns)
6. Pattern evaluation (to identify the truly interesting patterns representing knowledge based on
some interestingness measures)
7. Knowledge presentation (where visualization and knowledge representation techniques are
used to present the mined knowledge to the user)
PART – B
2. List and discuss the characteristics and main functions performed by the components of a
data warehouse. Give diagrammatic illustration. (May/June 2014, May/June 2012)
3. List and discuss the steps involved in building a data warehouse. (Nov/Dec 2012)
4. Give detailed information about metadata in data warehousing. (May/June 2014)
5. List and discuss the steps involved in mapping the data warehouse to a multiprocessor
architecture. (May/June 2014, Nov/Dec 2011)
6. i) Explain the role played by sourcing, acquisition, clean up and transformation tools in
data warehousing. (May/June 2013)
ii) Explain about STAR Join and STAR Index. (Nov/Dec 2012)
7. Describe in detail about DBMS schemas for decision support.
8. Explain about data extraction, clean up and transformation tools.
9. Explain the following:
i) Implementation considerations in building a data warehouse.
ii) Database architectures for parallel processing.
UNIT-2
BUSINESS ANALYSIS
Data can be associated with classes or concepts. It can be useful to describe individual classes and
concepts in summarized, concise, and yet precise terms. Such descriptions of a class or a concept are
called class/concept descriptions. These descriptions can be derived via (1) data characterization, by
summarizing the data of the class under study (often called the target class) in general terms, or (2)
data discrimination, by comparison of the target class with one or a set of comparative classes (often
called the contrasting classes), or (3) both data characterization and discrimination.
4. Mention the various tasks to be accomplished as part of data pre-processing. (Nov/ Dec 2008)
1. Data cleaning
2. Data Integration
3. Data Transformation
4. Data reduction
Data cleaning means removing inconsistent data or noise and collecting the necessary
information from a collection of interrelated data.
6. Define Data mining. (Nov/Dec 2008)
Data mining refers to extracting or “mining” knowledge from large amounts of data. The
term is actually a misnomer. Remember that the mining of gold from rocks or sand is referred
to as gold mining rather than rock or sand mining. Thus, data mining should have been more
appropriately named “knowledge mining from data.”
8. List the three important issues that have to be addressed during data integration. (May/June
2009) (OR) List the issues to be considered during data integration. (May/June 2010)
There are a number of issues to consider during data integration. Schema integration and
object matching can be tricky. How can equivalent real-world entities from multiple data
sources be matched up? This is referred to as the entity identification problem. Redundancy is
another important issue. An attribute (such as annual revenue, for instance) may be redundant
if it can be “derived” from another attribute or set of attributes. Inconsistencies in attribute or
dimension naming can also cause redundancies in the resulting data set. A third important
issue in data integration is the detection and resolution of data value conflicts. For example,
for the same real-world entity, attribute values from different sources may differ. This may be
due to differences in representation, scaling, or encoding. For instance, a weight attribute may
be stored in metric units in one system and British imperial units in another.
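Redundancy of numeric attributes can be detected by correlation analysis, one standard approach. The sketch below computes the Pearson correlation coefficient for two invented attributes; a coefficient near 1 suggests one attribute is derivable from the other.

import math

def pearson(xs, ys):
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)

annual_revenue = [100, 200, 300, 400]          # hypothetical data
summed_monthly_revenue = [98, 205, 310, 395]   # hypothetical data
print(pearson(annual_revenue, summed_monthly_revenue))  # close to 1.0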
9. Write the strategies for data reduction. (May/June 2010)
1. Data cube aggregation
2. Attribute subset selection
3. Dimensionality reduction
4. Numerosity reduction
5. Discretization and concept hierarchy generation
10. Why is it important to have a data mining query language? (May/June 2010)
The design of an effective data mining query language requires a deep understanding of the
power, limitations, and underlying mechanisms of the various kinds of data mining tasks. A
data mining query language can be used to specify data mining tasks. In particular, we
examine how to define data warehouses and data marts in our SQL-based data mining query
language, DMQL.
11. List the five primitives for specifying a data mining task. (Nov/Dec 2010)
The five primitives are: (1) the set of task-relevant data to be mined, (2) the kind of
knowledge to be mined, (3) the background knowledge to be used in the discovery process,
(4) the interestingness measures and thresholds for pattern evaluation, and (5) the expected
representation for visualizing the discovered patterns.
12. What is data generalization?
Data generalization is a process that abstracts a large set of task-relevant data in a database
from relatively low conceptual levels to higher conceptual levels. There are two approaches
for generalization: the data cube approach and attribute-oriented induction.
13. How are concept hierarchies useful in data mining? (Nov/Dec 2010)
A concept hierarchy for a given numerical attribute defines a discretization of the attribute.
Concept hierarchies can be used to reduce the data by collecting and replacing low- level
concepts (such as numerical values for the attribute age) with higher-level concepts (such as
youth, middle-aged, or senior). Although detail is lost by such data generalization, the
generalized data may be more meaningful and easier to interpret.
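A minimal sketch of such generalization for the age example; the cut-points (below 30 = youth, below 60 = middle-aged, otherwise senior) are assumptions made for illustration.

# Replace low-level numeric ages with higher-level concepts.
def generalize_age(age):
    if age < 30:
        return "youth"
    if age < 60:
        return "middle-aged"
    return "senior"

ages = [18, 25, 34, 47, 62, 71]
print([generalize_age(a) for a in ages])
# ['youth', 'youth', 'middle-aged', 'middle-aged', 'senior', 'senior']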
Data cleaning (or data cleansing) routines attempt to fill in missing values, smooth out noise
while identifying outliers, and correct inconsistencies in the data.
Methods for filling in missing values include:
1. Ignore the tuple
2. Fill in the missing value manually
3. Use a global constant to fill in the missing value
4. Use the attribute mean to fill in the missing value
5. Use the attribute mean for all samples belonging to the same class as the given tuple
6. Use the most probable value to fill in the missing value
Noisy data can be smoothed by the following techniques:
1. Binning: Binning methods smooth a sorted data value by consulting its “neighborhood,”
that is, the values around it.
2. Regression: Data can be smoothed by fitting the data to a function, such as with regression.
3. Clustering: Outliers may be detected by clustering, where similar values are organized into
groups, or “clusters.”
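To make the binning technique above concrete, the sketch below illustrates smoothing by bin means with equal-depth bins; the sorted price data and the bin depth of 3 are illustrative assumptions.

# Equal-depth binning: partition sorted values into bins of equal size,
# then replace each value by its bin mean (smoothing by bin means).
def smooth_by_bin_means(sorted_values, depth):
    smoothed = []
    for i in range(0, len(sorted_values), depth):
        bin_values = sorted_values[i:i + depth]
        mean = sum(bin_values) / len(bin_values)
        smoothed.extend([mean] * len(bin_values))
    return smoothed

prices = [4, 8, 15, 21, 21, 24, 25, 28, 34]
print(smooth_by_bin_means(prices, 3))
# [9.0, 9.0, 9.0, 22.0, 22.0, 22.0, 29.0, 29.0, 29.0]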
Commercial tools can assist in the data transformation step. Data migration tools allow simple
transformations to be specified, such as to replace the string “gender” by “sex”. ETL
(extraction/transformation/loading) tools allow users to specify transforms through a graphical
user interface (GUI). These tools typically support only a restricted set of transforms so that,
often, we may also choose to write custom scripts for this step of the data cleaning process.
PART – B
1. Explain in detail about the reporting and query tools. (May/June 2014)
2. Describe in detail about COGNOS IMPROMPTU. (May/June 2014)
3. Explain the categorization of OLAP tools with necessary diagrams. (May/June 2014)
4. i) List and explain the OLAP operations in the multidimensional data model.
ii) Differentiate between OLTP and OLAP. (Nov/Dec 2014)
5. i) List and discuss the features of Cognos Impromptu. (Nov/Dec 2012)
ii) List and discuss the basic features provided by reporting and query tools used for
business analysis. (Apr/May 2011)
6. i) What is a Multidimensional data model? Explain star schema with an example.
(May/June 2014)
7. Write the difference between multi-dimensional OLAP (MOLAP) and multi-relational
OLAP (ROLAP). (May/June 2014, Nov/Dec 2012)
8. Explain the following: (May/June 2012)
i) Different schemas for multidimensional databases.
ii) OLAP guidelines.
9. i) Write in detail about Managed Query Environment (MQE).
ii) Explain about how to use OLAP tools on the Internet.
UNIT-3
DATA MINING
A set of items is referred to as an itemset. An itemset that contains k items is a k-itemset. The
set {computer, antivirus software} is a 2-itemset. The occurrence frequency of an itemset is
the number of transactions that contain the itemset. This is also known, simply, as the
frequency, support count, or count of the itemset. Several variations of frequent itemset
mining exist, where each variation involves “playing” with the support threshold in a slightly
different way.
Suppose, however, that rather than using a transactional database, sales and related
information are stored in a relational database or data warehouse. Such data stores are
multidimensional, by definition. For instance, in addition to keeping track of the items
purchased in sales transactions, a relational database may record other attributes associated
with the items, such as the quantity purchased or the price, or the branch location of the sale.
Additional relational information regarding the customers who purchased the items, such as
customer age, occupation, credit rating, income, and address, may also be stored.
3. List two interesting measures for association rules. (April/May 2008)
Rule support and confidence are two measures of rule interestingness. They respectively
reflect the usefulness and certainty of discovered rules. A support of 2% for an association
rule means that 2% of all the transactions under analysis show that computer and antivirus
software are purchased together. A confidence of 60% means that 60% of the customers who
software are purchased together. A confidence of 60% means that 60% of the customers who
purchased a computer also bought the software. Typically, association rules are considered
interesting if they satisfy both a minimum support threshold and a minimum confidence
threshold. Such thresholds can be set by users or domain experts. Additional analysis can be
performed to uncover interesting statistical correlations between associated items.
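A minimal sketch of how support and confidence can be computed from transactions; the five toy transactions are invented for illustration.

transactions = [
    {"computer", "antivirus"},
    {"computer", "antivirus", "printer"},
    {"computer"},
    {"printer"},
    {"computer", "antivirus"},
]

def support(itemset):
    # fraction of transactions containing the itemset
    return sum(itemset <= t for t in transactions) / len(transactions)

def confidence(antecedent, consequent):
    return support(antecedent | consequent) / support(antecedent)

# For the rule: computer => antivirus
print(support({"computer", "antivirus"}))       # 0.6
print(confidence({"computer"}, {"antivirus"}))  # 0.75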
Select R.a1, R.a2, ..., R.an, agg_f(R.b)
From relation R
Group by R.a1, R.a2, ..., R.an
Having agg_f(R.b) >= threshold
5. What is overfitting and what can you do to prevent it? (Nov/Dec 2008)
Overfitting occurs when a learned model reflects anomalies and noise in the training data
rather than the true underlying distribution. Tree pruning methods address this problem of
overfitting the data. Such methods typically use statistical measures to remove the least
reliable branches. Pruned trees tend to be smaller and less complex and, thus, easier to
comprehend. They are usually faster and better at correctly classifying independent test data
(i.e., previously unseen tuples) than unpruned trees.
6. In classification trees, what are surrogate splits, and how are they used? (Nov/Dec 2008)
A surrogate split is a backup split that closely imitates the action of the primary split; when
the value of the primary splitting attribute is missing for a tuple, the best surrogate split is
used in its place. Separately, decision trees can suffer from repetition and replication, making
them overwhelming to interpret. Repetition occurs when an attribute is repeatedly tested
along a given branch of the tree (such as “age < 60?” followed by “age < 45?” and so on). In
replication, duplicate subtrees exist within the tree. These situations can impede the accuracy
and comprehensibility of a decision tree. The use of multivariate splits (splits based on a
combination of attributes) can prevent these problems.
Market basket analysis, which studies the buying habits of customers by searching for sets of
items that are frequently purchased together (or in sequence). This process analyzes customer
buying habits by finding associations between the different items that customers place in their
“shopping baskets”. The discovery of such associations can help retailers develop marketing
strategies by gaining insight into which items are frequently purchased together by customers.
For instance, if customers are buying milk, how likely are they to also buy bread (and what
kind of bread) on the same trip to the supermarket? Such information can lead to increased
sales by helping retailers do selective marketing and plan their shelf space.
8. Give the difference between Boolean association rule and quantitative association rule.
(Nov/Dec 2009)
Based on the types of values handled in the rule: If a rule involves associations between the
presence or absence of items, it is a Boolean association rule. For example, the following
three rules are Boolean association rules obtained from market basket analysis.
Quantitative association rules involve numeric attributes that have an implicit ordering
among values (e.g., age). If a rule describes associations between quantitative items or
attributes, then it is a quantitative association rule. In these rules, quantitative values for items
or attributes are partitioned into intervals. The following rule is an example of a quantitative
association rule; note that the quantitative attributes, age and income, have been discretized:
age(X, “30...39”) ^ income(X, “42K...48K”) => buys(X, “high resolution TV”)
9. Give the difference between operational database and informational database. (Nov/Dec
2009)
10. List the techniques to improve the efficiency of the Apriori algorithm. (May/June 2010)
11. Define support and confidence in Association rule mining. (May/June 2010) (Nov/Dec
2010)
13. How are Meta rules useful in constraint-based association mining? (May/June 2010)
Meta rules allow users to specify the syntactic form of rules that they are interested in mining.
The rule forms can be used as constraints to help improve the efficiency of the mining
process. Meta rules may be based on the analyst’s experience, expectations, or intuition
regarding the data or may be automatically generated based on the database schema.
14. Mention few approaches to mining Multilevel Association Rules. (Nov/Dec 2010)
Multilevel association rules can be mined using several strategies, based on how minimum
support thresholds are defined at each level of abstraction, such as uniform support, reduced
support, and group-based support. Redundant multilevel (descendant) association rules can be
eliminated if their support and confidence are close to their expected values, based on their
corresponding ancestor rules.
Based on the kinds of rules to be mined, categories include mining association rules and
correlation rules. Many efficient and scalable algorithms have been developed for frequent
itemset mining, from which association and correlation rules can be derived. These algorithms
can be classified into three categories: (1) Apriori-like algorithms, (2) frequent pattern
growth-based algorithms such as FP-growth, and (3) algorithms that use the vertical data
format.
PART-B
ii) What are the major issues in data mining? Explain. (May/June 2012)
Tree pruning attempts to identify and remove such branches, with the goal of improving
classification accuracy on unseen data.
Mining data streams involves the efficient discovery of general patterns and dynamic
changes within stream data. For example, we may like to detect intrusions of a computer
network based on the anomaly of message flow, which may be discovered by clustering data
streams, dynamic construction of stream models, or comparing the current frequent patterns
with that at a certain previous time.
Classification is the process of finding a model (or function) that describes and distinguishes
data classes or concepts, for the purpose of being able to use the model to predict the class of
objects whose class label is unknown. The derived model is based on the analysis of a set of
training data (i.e., data objects whose class label is known).
The k-means algorithm takes the input parameter, k, and partitions a set of n objects into k
clusters so that the resulting intracluster similarity is high but the intercluster similarity is
low. Cluster similarity is measured in regard to the mean value of the objects in a cluster,
which can be viewed as the cluster’s centroid or center of gravity. First, it randomly selects k
of the objects, each of which initially represents a cluster mean or center. For each of the
remaining objects, an object is assigned to the cluster to which it is the most similar, based on
the distance between the object and the cluster mean. It then computes the new mean for each
cluster. This process iterates until the criterion function converges. Typically, the square-error
criterion is used, defined as E = Σ_{i=1}^{k} Σ_{p ∈ C_i} |p − m_i|², where E is the sum of
the square error for all objects in the data set, p is the point in space representing a given
object, and m_i is the mean of cluster C_i (both p and m_i are multidimensional).
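A minimal 1-D k-means sketch following the steps above (random seeds, nearest-mean assignment, mean recomputation until convergence); the data points are invented.

import random

def kmeans(points, k, seed=0):
    random.seed(seed)
    means = random.sample(points, k)       # initial cluster centers
    while True:
        clusters = [[] for _ in range(k)]
        for p in points:                   # assign to nearest mean
            i = min(range(k), key=lambda j: abs(p - means[j]))
            clusters[i].append(p)
        new_means = [sum(c) / len(c) if c else means[i]
                     for i, c in enumerate(clusters)]
        if new_means == means:             # criterion function converged
            return means, clusters
        means = new_means

points = [1.0, 1.5, 2.0, 8.0, 8.5, 9.0]
print(kmeans(points, k=2))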
5. The naïve Bayes classifier makes what assumption that motivates its name?
Studies comparing classification algorithms have found a simple Bayesian classifier known as
the naive Bayesian classifier to be comparable in performance with decision tree and
selected neural network classifiers. Bayesian classifiers have also exhibited high accuracy and
speed when applied to large databases.
Naïve Bayesian classifiers assume that the effect of an attribute value on a given class is
independent of the values of the other attributes. This assumption is called class conditional
independence. It is made to simplify the computations involved and, in this sense, is
considered “naive.”
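The toy sketch below shows the independence assumption in action: the class-conditional likelihood of a tuple is taken as the product of per-attribute likelihoods. The training tuples and attribute names are invented, and no Laplace smoothing is applied.

from collections import Counter, defaultdict

train = [  # (age, income) -> buys_computer
    (("youth", "high"), "no"),
    (("youth", "low"), "no"),
    (("middle", "high"), "yes"),
    (("senior", "low"), "yes"),
    (("senior", "high"), "yes"),
]

prior = Counter(label for _, label in train)
cond = defaultdict(Counter)                # per-class attribute-value counts
for attrs, label in train:
    for i, v in enumerate(attrs):
        cond[label][(i, v)] += 1

def score(attrs, label):
    p = prior[label] / len(train)          # prior probability P(C)
    for i, v in enumerate(attrs):
        # class-conditional independence: multiply P(attr | C) factors
        p *= cond[label][(i, v)] / prior[label]
    return p

x = ("youth", "high")
print(max(prior, key=lambda c: score(x, c)))   # -> 'no'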
6. What is an outlier? (May/June 2009) (OR)
A database may contain data objects that do not comply with the general behavior or model of
the data. These data objects are outliers. Most data mining methods discard outliers as noise
or exceptions. These can be categorized into four approaches: the statistical approach, the
distance-based approach, the density-based local outlier approach, and the deviation-based
approach.
7. Compare clustering and classification. (Nov/Dec 2009)
Clustering techniques consider data tuples as objects. They partition the objects into groups
or clusters, so that objects within a cluster are “similar” to one another and “dissimilar” to
objects in other clusters. Similarity is commonly defined in terms of how “close” the objects
are in space, based on a distance function. The “quality” of a cluster may be represented by its
diameter, the maximum distance between any two objects in the cluster. Outliers may be
detected by clustering, where similar values are organized into groups, or “clusters.”
Intuitively, values that fall outside of the set of clusters may be considered outliers.
A hierarchical method creates a hierarchical decomposition of the given set of data objects. A
hierarchical method can be classified as being either agglomerative or divisive, based on
how the hierarchical decomposition is formed. The agglomerative approach, also called the
bottom-up approach, starts with each object forming a separate group. It successively merges
the objects or groups that are close to one another, until all of the groups are merged into one
(the topmost level of the hierarchy), or until a termination condition holds. The divisive
approach, also called the top-down approach, starts with all of the objects in the same cluster.
In each successive iteration, a cluster is split up into smaller clusters, until eventually each
object is in one cluster, or until a termination condition holds.
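A minimal sketch of the agglomerative (bottom-up) approach with single-linkage distance on invented 1-D data: start from singleton clusters and repeatedly merge the closest pair until the desired number of clusters remains.

def single_link(c1, c2):
    # distance between clusters = closest pair of members
    return min(abs(a - b) for a in c1 for b in c2)

def agglomerate(points, target_clusters):
    clusters = [[p] for p in points]        # each object forms a group
    while len(clusters) > target_clusters:
        i, j = min(
            ((i, j) for i in range(len(clusters))
                    for j in range(i + 1, len(clusters))),
            key=lambda ij: single_link(clusters[ij[0]], clusters[ij[1]]))
        clusters[i].extend(clusters.pop(j))  # merge the closest pair
    return clusters

print(agglomerate([1.0, 1.2, 5.0, 5.1, 9.0], target_clusters=2))
# [[1.0, 1.2, 5.0, 5.1], [9.0]]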
12. What is Association-based classification? (Nov/Dec 2010)
Association-based classification, which classifies documents based on a set of associated,
frequently occurring text patterns. Notice that very frequent terms are likely poor
discriminators. Thus, only those terms that are not very frequent and that have good
discriminative power will be used in document classification. Such an association-based
classification method proceeds as follows: First, keywords and terms can be extracted by
information retrieval and simple association analysis techniques. Second, concept hierarchies
of keywords and terms can be obtained using available term classes, such as WordNet, or
relying on expert knowledge, or some keyword classification systems.
13. Why is tree pruning useful in decision tree induction? (May/June 2010) (Nov/Dec 2010)
When a decision tree is built, many of the branches will reflect anomalies in the training data
due to noise or outliers. Tree pruning methods address this problem of over fitting the data.
Such methods typically use statistical measures to remove the least reliable branches.
14. Compare the advantages and disadvantages of eager classification (e.g., decision trees)
versus lazy classification (e.g., k-nearest neighbor). (Nov/Dec 2010)
Eager learners, when given a set of training tuples, will construct a generalization (i.e.,
classification) model before receiving new (e.g., test) tuples to classify. We can think of the
learned model as being ready and eager to classify previously unseen tuples. Imagine a
contrasting lazy approach, in which the learner instead waits until the last minute before doing
any model construction in order to classify a given test tuple. That is, when given a training
tuple, a lazy learner simply stores it (or does only a little minor processing) and waits until it
is given a test tuple.
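A minimal k-nearest-neighbor sketch of lazy learning: the training tuples are merely stored, and all distance computation is deferred to query time. The training points and labels are invented.

from collections import Counter

train = [((1.0, 1.0), "A"), ((1.2, 0.8), "A"),
         ((4.0, 4.2), "B"), ((4.5, 4.0), "B")]

def knn_predict(x, k=3):
    # no model is built in advance; work happens per test tuple
    dist = lambda p: sum((a - b) ** 2 for a, b in zip(p, x))
    nearest = sorted(train, key=lambda t: dist(t[0]))[:k]
    return Counter(label for _, label in nearest).most_common(1)[0][0]

print(knn_predict((1.1, 0.9)))   # -> 'A'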
Bayesian classifiers are statistical classifiers. They can predict class membership
probabilities, such as the probability that a given tuple belongs to a particular class. Bayesian
classification is based on Bayes’ theorem. Studies comparing classification
algorithms have found a simple Bayesian classifier known as the naïve Bayesian classifier to
be comparable in performance with decision tree and selected neural network classifiers.
Bayesian classifiers have also exhibited high accuracy and speed when applied to large
databases.
PART-B
1. Write and explain the algorithm for mining frequent item sets with
candidate generation. Give relevant example.
2. Write and explain the algorithm for mining frequent item sets without
candidate generation. Give relevant example.
3. Discuss the approaches for mining multi-level and multi-dimensional
association rules from the transactional databases. Give relevant example.
4. i) Explain the algorithm for constructing a decision tree from training
samples.
5. i) Apply the Apriori algorithm for discovering frequent item sets of the
following table. Use 0.3 for minimum support value. (12)
TID Items purchased
101 milk,bread,eggs
102 milk,juice
103 juice,butter
104 milk,bread,eggs
105 coffee,eggs
106 coffee
107 coffee,juice
108 milk,bread,cookies,eggs
109 cookies,butter
110 milk,bread
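A hedged sketch of one way to check this exercise programmatically: a level-wise frequent itemset search over the ten transactions above (join step only, without the Apriori prune refinement). With minimum support 0.3 of 10 transactions, an itemset must appear in at least 3 transactions.

from itertools import combinations

db = [
    {"milk", "bread", "eggs"}, {"milk", "juice"}, {"juice", "butter"},
    {"milk", "bread", "eggs"}, {"coffee", "eggs"}, {"coffee"},
    {"coffee", "juice"}, {"milk", "bread", "cookies", "eggs"},
    {"cookies", "butter"}, {"milk", "bread"},
]
min_count = 3   # 0.3 * 10 transactions

frequent = {}
k_sets = [frozenset([i]) for i in set().union(*db)]   # candidate 1-itemsets
while k_sets:
    counts = {s: sum(s <= t for t in db) for s in k_sets}
    level = {s: c for s, c in counts.items() if c >= min_count}
    frequent.update(level)
    # join frequent k-itemsets into candidate (k+1)-itemsets
    k_sets = {a | b for a, b in combinations(level, 2)
              if len(a | b) == len(a) + 1}

for s, c in sorted(frequent.items(), key=lambda kv: (len(kv[0]), -kv[1])):
    print(sorted(s), c)
# Frequent: milk(5), bread(4), eggs(4), coffee(3), juice(3),
# {milk,bread}(4), {milk,eggs}(3), {bread,eggs}(3), {milk,bread,eggs}(3)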
v) Classification by backpropagation
7. i) Explain the 2 steps for data classification.
ii) Explain about Bayesian classification.
8. Describe in detail about classification by decision tree induction.
ii) Lazy learners
iv) Associative classification
UNIT-5
CLUSTERING AND APPLICATIONS AND
TRENDS IN DATA MINING
Clustering can be used to generate a concept hierarchy for A by following either a top-down
splitting strategy or a bottom-up merging strategy, where each cluster forms a node of the
concept hierarchy. In the former, each initial cluster or partition may be further decomposed
into several sub clusters, forming a lower level of the hierarchy. In the latter, clusters are
formed by repeatedly grouping neighboring clusters in order to form higher-level concepts.
Requirements of clustering in data mining include:
1. Scalability
2. Ability to deal with different types of attributes
Cluster analysis is the process of analyzing the various clusters so as to organize the different
objects into meaningful and descriptive groups.
4. Define CLARANS.
CLARANS (Clustering Large Applications based upon RANdomized Search) is a k-medoids
clustering method that draws a sample of neighbors dynamically, with some randomness, in
each step of the search; to improve upon the quality of CLARA, we go for CLARANS.
ROCK (RObust Clustering using links): Merges clusters based on their interconnectivity. Great
for categorical data. Ignores information about the looseness of two clusters while
emphasizing interconnectivity.
Define web usage mining. (May/June 2010)
Web usage mining is the process of extracting useful information from server logs i.e. users
history. Web usage mining is the process of finding out what users are looking for on the
Internet. Some users might be looking at only textual data, whereas some others might be
interested in multimedia data.
Audio data mining uses audio signals to indicate the patterns of data or the features of data
mining results. Although visual data mining may disclose interesting patterns using graphical
displays, it requires users to concentrate on watching patterns and identifying interesting or
novel features within them. This can sometimes be quite tiresome. If patterns can be
transformed into sound and music, then instead of watching pictures, we can listen to pitches,
rhythms, tune, and melody in order to identify anything interesting or unusual. This may
relieve some of the burden of visual concentration and be more relaxing than visual mining.
Therefore, audio data mining is an interesting complement to visual mining.
Visual data mining discovers implicit and useful knowledge from large data sets using data
and/or knowledge visualization techniques. The human visual system is controlled by the eyes
and brain, the latter of which can be thought of as a powerful, highly parallel processing and
reasoning engine containing a large knowledge base. Visual data mining essentially combines
the power of these components, making it a highly attractive and effective tool for the
comprehension of data distributions, patterns, clusters, and outliers in data.
A set of items is referred to as an item set. An item set that contains k items is a k-item set. The
set {computer, antivirus software} is a 2-itemset. The occurrence frequency of an item set is
the number of transactions that contain the item set. This is also known, simply, as the
frequency, support count, or count of the item set.
Hierarchical clustering (or hierarchic clustering) outputs a hierarchy, a structure that is more
informative than the unstructured set of clusters returned by flat clustering. Hierarchical
clustering does not require us to prespecify the number of clusters and most hierarchical
algorithms that have been used in IR are deterministic. These advantages of hierarchical
clustering come at the cost of lower efficiency.
11. Define time series analysis. (May/June 2009)
Time series analysis comprises methods for analyzing time series data in order to extract
meaningful statistics and other characteristics of the data. Time series forecasting is the use of
a model to predict future values based on previously observed values. Time series are very
frequently plotted via line charts.
Web content mining, also known as text mining, is generally the second step in Web data
mining. Content mining is the scanning and mining of text, pictures and graphs of a Web page
to determine the relevance of the content to the search query. This scanning is completed after
the clustering of web pages through structure mining and provides the results based upon the
level of relevance to the suggested query. With the massive amount of information that is
available on the World Wide Web, content mining provides the results lists to search engines
in order of highest relevance to the keywords in the query.
A multimedia database system stores and manages a large collection of multimedia data, such
as audio, video, image, graphics, speech, text, document, and hypertext data, which contain
text, text markups, and linkages. Multimedia database systems are increasingly common
owing to the popular use of audio and video equipment, digital cameras, CD-ROMs, and the
Internet.
15. List out the methods for information retrieval. (May/June 2010)
They generally either view the retrieval problem as a document selection problem or as a
document ranking problem. In document selection methods, the query is regarded as
specifying constraints for selecting relevant documents. A typical method of this category is
the Boolean retrieval model, in which a document is represented by a set of keywords and a
user provides a Boolean expression of keywords, such as “car and repair shops,” “tea or
coffee”.
Document ranking methods use the query to rank all documents in the order of relevance. For
ordinary users and exploratory queries, these methods are more appropriate than document
selection methods.
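A minimal sketch of the Boolean retrieval model described above: each document is reduced to a set of keywords and a query is a Boolean expression over keywords. The documents and the query encoding are invented for illustration.

docs = {
    1: {"car", "repair", "shops"},
    2: {"tea", "shops"},
    3: {"coffee", "car"},
}

def matches(doc, query):
    # query is a nested tuple: ('and' | 'or', term-or-subquery, ...)
    op, *args = query
    results = [a in doc if isinstance(a, str) else matches(doc, a)
               for a in args]
    return all(results) if op == "and" else any(results)

q = ("and", "car", ("or", "repair", "coffee"))   # car AND (repair OR coffee)
print([d for d, words in docs.items() if matches(words, q)])   # [1, 3]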
A categorical variable is a generalization of the binary variable in that it can take on more
than two states. For example, map color is a categorical variable that may have, say, five
states: red, yellow, green, pink, and blue. Let the number of states of a categorical variable be
M. The states can be denoted by letters, symbols, or a set of integers, such as 1, 2, ..., M. Notice
that such integers are used just for data handling and do not represent any specific ordering.
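Dissimilarity between objects described by categorical variables is commonly computed by simple matching, d(i, j) = (p − m) / p, where p is the total number of variables and m the number of matches. This measure is the standard textbook one, not stated in the passage above, and the example objects below are invented.

def simple_matching_dissimilarity(obj_i, obj_j):
    p = len(obj_i)                                 # number of variables
    m = sum(a == b for a, b in zip(obj_i, obj_j))  # number of matches
    return (p - m) / p

obj1 = ("red", "circle", "small")
obj2 = ("red", "square", "small")
print(simple_matching_dissimilarity(obj1, obj2))   # 0.333... (1 mismatch of 3)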
17. What is the difference between row scalability and column scalability? (Nov/Dec 2010)
Data mining has two kinds of scalability issues: row (or database size) scalability and column
(or dimension) scalability. A data mining system is considered row scalable if, when the
number of rows is enlarged 10 times, it takes no more than 10 times as long to execute the same
data mining queries. A data mining system is considered column scalable if the mining query
execution time increases linearly with the number of columns (or attributes or dimensions).
Due to the curse of dimensionality, it is much more challenging to make a system column
scalable than row scalable.
18. What are the major challenges faced in bringing data mining research to market?
(Nov/Dec2010)
The diversity of data, data mining tasks, and data mining approaches poses many challenging
research issues in data mining. The development of efficient and effective data mining
methods and systems, the construction of interactive and integrated data mining
environments, the design of data mining languages, and the application of data mining
techniques to solve large application problems are important tasks for data mining researchers
and data mining system and application developers.
PART – B
1. i) What is cluster analysis? Explain about requirements of clustering in data
mining.
ii) Explain about data mining applications.
2. Describe in detail about types of data in cluster analysis.
3. i) Write note on: categorization of major clustering methods.
ii) Explain the following clustering methods in detail:
i) BIRCH
ii) CURE
4. Explain in detail about partitioning methods.
5. Describe in detail about hierarchical methods.
6. Explain the following:
i) Density-based methods
ii) Constraint-based cluster analysis
7. Explain the following:
i) Grid based methods
ii) Model based clustering methods
8. Explain about clustering high-dimensional data.
9. Explain about outlier analysis.