
Assignment

The document poses 11 questions related to data mining concepts and techniques. It covers topics such as the differences and similarities between data warehouses and databases, challenges of mining large datasets, definitions of various data mining functionalities, examples of descriptive statistics, data normalization methods, receiver operating characteristic (ROC) curves, decision tree pruning, clustering algorithms and constraints, considerations for implementing real-world data mining applications, and frequent itemset mining.

1) How is a data warehouse different from a database? How are they similar?

2) What are the major challenges of mining a huge amount of data (e.g., billions of tuples) in
comparison with mining a small amount of data (e.g., a data set of a few hundred tuples)?
3) Define each of the following data mining functionalities: characterization, discrimination,
association and correlation analysis, classification, regression, clustering, and outlier analysis.
Give examples of each data mining functionality, using a real-life database that you are familiar
with.
4) Suppose that the data for analysis includes the attribute age. The age values for the data tuples
are (in increasing order) 13, 15, 16, 16, 19, 20, 20, 21, 22, 22, 25, 25, 25, 25, 30, 33, 33, 35, 35,
35, 35, 36, 40, 45, 46, 52, 70.
(a) What is the mean of the data? What is the median?
(b) What is the mode of the data? Comment on the data’s modality (i.e., bimodal, trimodal, etc.)
(c) What is the midrange of the data?
(d) Can you find (roughly) the first quartile (Q1) and the third quartile (Q3) of the data?
(e) Give the five-number summary of the data.
(f) Show a boxplot of the data.
(g) How is a quantile–quantile plot different from a quantile plot?
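
The statistics asked for in parts (a) through (e) can be sketched with Python's standard library. This is an illustration rather than a model answer: the helper name `describe` is hypothetical, and the quartiles use the "inclusive" interpolation method of `statistics.quantiles`, which is only one of several common conventions, so a hand-computed "rough" Q1 or Q3 may differ slightly.

```python
import statistics

# Age values copied from question 4.
ages = [13, 15, 16, 16, 19, 20, 20, 21, 22, 22, 25, 25, 25, 25, 30,
        33, 33, 35, 35, 35, 35, 36, 40, 45, 46, 52, 70]


def describe(values):
    """Return basic descriptive statistics for a list of numbers."""
    data = sorted(values)
    # Quartile convention is an assumption: linear interpolation that
    # treats the minimum and maximum as the 0th and 100th percentiles.
    q1, median, q3 = statistics.quantiles(data, n=4, method="inclusive")
    return {
        "mean": statistics.mean(data),
        "median": statistics.median(data),
        "modes": statistics.multimode(data),      # every most-frequent value
        "midrange": (data[0] + data[-1]) / 2,     # average of min and max
        "five_number": (data[0], q1, median, q3, data[-1]),
    }


summary = describe(ages)
for name, value in summary.items():
    print(f"{name}: {value}")
```

The length of the `modes` list indicates the modality asked about in part (b), and the `five_number` tuple holds the values a boxplot for part (f) would be drawn from.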

5) Use the following methods to normalize this group of data: 200, 300, 400, 600, 1000.
(a) min-max normalization by setting min = 0 and max = 1
(b) z-score normalization
(c) z-score normalization using the mean absolute deviation instead of standard deviation
(d) normalization by decimal scaling
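
The four methods can be sketched as small Python functions. This is a minimal sketch under stated assumptions: the function names are hypothetical, the z-score uses the population standard deviation (dividing by n), and decimal scaling picks the smallest j for which the largest scaled absolute value is strictly below 1. Other texts may use the sample standard deviation or a non-strict bound, which changes the numbers.

```python
import statistics

data = [200, 300, 400, 600, 1000]


def min_max(values, new_min=0.0, new_max=1.0):
    """Linearly map values onto the range [new_min, new_max]."""
    lo, hi = min(values), max(values)
    return [(v - lo) / (hi - lo) * (new_max - new_min) + new_min for v in values]


def z_score(values):
    """Center on the mean and scale by the population standard deviation."""
    mean = statistics.mean(values)
    std = statistics.pstdev(values)  # assumption: divide by n, not n - 1
    return [(v - mean) / std for v in values]


def z_score_mad(values):
    """Center on the mean and scale by the mean absolute deviation."""
    mean = statistics.mean(values)
    mad = sum(abs(v - mean) for v in values) / len(values)
    return [(v - mean) / mad for v in values]


def decimal_scaling(values):
    """Divide by 10**j, with j the smallest integer giving max |v'| < 1."""
    peak = max(abs(v) for v in values)
    j = 0
    while peak / 10 ** j >= 1:  # assumption: strict "< 1" bound
        j += 1
    return [v / 10 ** j for v in values]


for method in (min_max, z_score, z_score_mad, decimal_scaling):
    print(method.__name__, [round(x, 4) for x in method(data)])
```

The mean absolute deviation in part (c) is less sensitive to outliers than the standard deviation because the deviations are not squared.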

6) The data tuples from the table below are sorted by decreasing probability value, as returned by a
classifier. For each tuple, compute the values for the number of true positives (TP), false
positives (FP), true negatives (TN) and false negatives (FN). Compute the true positive rate
(TPR) and false positive rate (FPR). Plot the ROC curve for the data.
Tuple #   Class   Probability
1         P       0.95
2         N       0.85
3         P       0.78
4         P       0.66
5         N       0.60
6         P       0.55
7         N       0.53
8         N       0.52
9         N       0.51
10        P       0.40
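
One way to tabulate the counts is sketched below in Python. It assumes the usual convention that each tuple's probability is used in turn as the decision threshold, so that tuple and every tuple ranked above it are classified as positive. The function name `roc_points` is hypothetical, and plotting is left out.

```python
def roc_points(labels):
    """Compute (TP, FP, TN, FN, TPR, FPR) after each ranked tuple.

    labels: actual classes ('P' or 'N'), sorted by decreasing
    classifier probability.
    """
    total_p = labels.count("P")
    total_n = labels.count("N")
    tp = fp = 0
    rows = []
    for label in labels:
        # Lowering the threshold to this tuple classifies it as positive.
        if label == "P":
            tp += 1
        else:
            fp += 1
        fn = total_p - tp  # positives still below the threshold
        tn = total_n - fp  # negatives still below the threshold
        rows.append((tp, fp, tn, fn, tp / total_p, fp / total_n))
    return rows


# Actual classes of tuples 1 to 10 from the table in question 6.
labels = ["P", "N", "P", "P", "N", "P", "N", "N", "N", "P"]

rows = roc_points(labels)
for number, row in enumerate(rows, start=1):
    print(number, row)
```

The ROC curve is obtained by plotting TPR (vertical axis) against FPR (horizontal axis) for each row, starting from the point (0, 0).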
7) Given a decision tree, you have the option of (a) converting the decision tree to rules and
then pruning the resulting rules, or (b) pruning the decision tree and then converting the
pruned tree to rules. What advantage does (a) have over (b)?

8) Briefly describe and give examples of each of the following approaches to clustering:
partitioning methods, hierarchical methods, density-based methods, and grid-based
methods.

9) Suppose that you are to allocate a number of automatic teller machines (ATMs) in a given
region so as to satisfy a number of constraints. Households or workplaces may be clustered
so that typically one ATM is assigned per cluster. The clustering, however, may be
constrained by two factors: (1) obstacle objects (i.e., there are bridges, rivers, and highways
that can affect ATM accessibility), and (2) additional user-specified constraints such as that
each ATM should serve at least 10,000 households. How can a clustering algorithm such as
k-means be modified for quality clustering under both constraints?

10) Choose any real-world data mining application. What major considerations will you
follow when implementing your model?

11) Using the database below, generate all possible candidate itemsets and frequent
itemsets, where the minimum support count is 2.

TID   Items
100   1, 3, 4
200   2, 3, 5
300   1, 2, 3, 5
400   2, 5
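
A level-wise (Apriori-style) search is one common way to generate the candidates and frequent itemsets. The Python sketch below is illustrative only: it assumes each digit in the Items column is a separate item, the function name `apriori` is hypothetical, and the join step simply unions pairs of frequent itemsets rather than using the prefix-based join found in textbook descriptions.

```python
from itertools import combinations


def apriori(transactions, min_count):
    """Return a list of (candidate_counts, frequent_counts), one per level k."""
    transactions = [frozenset(t) for t in transactions]
    items = sorted({item for t in transactions for item in t})
    candidates = [frozenset([item]) for item in items]
    levels = []
    k = 1
    while candidates:
        # Count how many transactions contain each candidate itemset.
        counts = {c: sum(1 for t in transactions if c <= t) for c in candidates}
        frequent = {c: n for c, n in counts.items() if n >= min_count}
        levels.append((counts, frequent))
        k += 1
        # Join: union pairs of frequent (k-1)-itemsets that form a k-itemset.
        joined = {a | b for a, b in combinations(frequent, 2) if len(a | b) == k}
        # Prune: keep a candidate only if all its (k-1)-subsets are frequent.
        candidates = sorted(
            (c for c in joined
             if all(frozenset(s) in frequent for s in combinations(c, k - 1))),
            key=sorted,
        )
    return levels


# Transactions for TIDs 100, 200, 300, 400 (digits read as separate items).
database = [{1, 3, 4}, {2, 3, 5}, {1, 2, 3, 5}, {2, 5}]

levels = apriori(database, min_count=2)
for k, (counts, frequent) in enumerate(levels, start=1):
    print(f"C{k}:", {tuple(sorted(c)): n for c, n in counts.items()})
    print(f"L{k}:", {tuple(sorted(c)): n for c, n in frequent.items()})
```

The prune step relies on the Apriori property: every subset of a frequent itemset must itself be frequent, so a candidate with an infrequent subset can be discarded without counting it.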
