DWDM-CSE-Question Bank

The document contains questions about data mining techniques and processes. It covers topics like data warehousing, association rule mining, clustering algorithms like k-means and k-medoids, and decision trees. Several questions involve applying these techniques to datasets and analyzing the results.

Module 1

1. What is Data Mining? (3)


2. Define data warehouse. What is the purpose of it? (2+3)
3. What are the key elements of a data warehouse? Explain each of them. (6)
4. Describe the key steps in the data mining process. Why is it important to follow these
processes? (5)
5. What are the major mistakes to be avoided when doing data mining? (3)
6. Why is data cleaning so important? (3)
7. Define support, confidence and lift in Association rule mining. What are the demerits
of Apriori Algorithm? (3+2)
8. What is an Association Rule? What is the importance of Association Rules in Data
Mining? (5)
9. Find the cosine similarity and the dissimilarity between the two vectors X and Y,
where X = (3, 2, 0, 5) and Y = (1, 0, 0, 0). (5)
10. For the following given Transaction Data set, generate rules using Apriori Algorithm.
Consider the values of support = 22% & Confidence = 70%. (10)

11. Explain the KDD process in detail. (5)


12. Differentiate among Enterprise Warehouse, Data mart and Virtual warehouse. (3)

13. State the differences between Data Mart & Data Warehouse. (5)
14. Distinguish between OLTP and OLAP systems. (5)
15. How is data warehouse different from a database? (3)
16. Explain Metadata in brief. Explain different types of Metadata. (5)
17. Define Data Lake. What is a Data Mart? Define the types of Data Marts. (2.5+2.5)
18. What is the significance of a multi-dimensional data model in data-warehousing?
Briefly compare the snowflake schema and fact constellation concepts with a suitable
example. (3+3)
19. Discuss the steps of the Apriori Algorithm for mining frequent itemsets. (5)
20. Generate FP-Tree for the following Transaction dataset. [Min. Support Count= 3]
Show the Conditional Pattern Base, Conditional FP-Tree and Frequent Item set. (10)

Transaction ID Items
T1 {E, K, M, N, O, Y}
T2 {D, E, K, N, O, Y}
T3 {A, E, K, M}
T4 {C, K, M, U, Y}
T5 {C, E, I, K, O}

21. Define each of the following data mining functionalities, with a suitable example
of each: data characterization, data association and data discrimination. (3)
22. Explain the architecture of a typical data mining system. (5)
23. What is meant by slice-and-dice? Give an example. (5)
24. Define Roll-up and Drill-down process with a suitable example. (5)
25. Explain the three-tier data warehousing architecture. (5)
26. What is ETL? Explain each of the terms clearly. (5)
27. Differentiate among ROLAP, MOLAP and HOLAP. (5)
28. Discuss the different phases of FP-tree growth algorithm. (5)
29. What do you mean by OLAP? What are the various OLAP operations in
multidimensional data models? Describe them briefly. (10)
30. Write a short note on Snowflake Schema, Galaxy Schema. (3+3)
31. Discuss Star schema with suitable example. (5)
32. Explain Jaccard similarity index. Find the Jaccard similarity index and Jaccard
distance for the following data: (5)
A = {0, 1, 2, 5, 6}
B = {0, 2, 3, 4, 5, 7, 9}
33. The rating data for 4 colleges is given. (5)
Sl. No.  Engg. College  Teaching  Fees  Placements  Internship  Infrastructure
1.       A              5         2     5           5           3
2.       B              4         5     5           4           5
3.       C              3         4     4           3           4
4.       D              1         3     1           1           2

a) Find the Euclidean distance between


i) College A-B
ii) College B-C
iii) College C-D
iv) College A-D
b) Among the pairs of colleges listed above, which pair has the shortest Euclidean
distance between them?
34. Generate all Frequent Itemsets from the following transaction data given minimum
support = 0.3.
TID  Items
1    A, B, C, E
2    B, D, E
3    B, C
4    A, B, D
5    A, C
6    B, C
7    A, C, E
8    A, B, C, E
9    A, B, C
10   C, D, E

Find the Association Rules from the above frequent sets at minimum 50% confidence.
(10)
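
Worked sketch for Questions 9, 32 and 33 (similarity and distance measures). The Python below is only an illustrative sketch, not a model answer. The function names are assumptions introduced here, and both the cosine dissimilarity and the Jaccard distance are assumed to be defined as 1 minus the corresponding similarity.

```python
import math

def cosine_similarity(x, y):
    """Dot product divided by the product of the two vector lengths."""
    dot = sum(a * b for a, b in zip(x, y))
    norm_x = math.sqrt(sum(a * a for a in x))
    norm_y = math.sqrt(sum(b * b for b in y))
    return dot / (norm_x * norm_y)

def jaccard_index(a, b):
    """Size of the intersection divided by the size of the union."""
    a, b = set(a), set(b)
    return len(a & b) / len(a | b)

def euclidean(p, q):
    """Square root of the summed squared coordinate differences."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))

# Question 9: cosine similarity and dissimilarity of X and Y.
X = [3, 2, 0, 5]
Y = [1, 0, 0, 0]
cos_sim = cosine_similarity(X, Y)
cos_dissim = 1 - cos_sim  # assumed definition of dissimilarity

# Question 32: Jaccard index and Jaccard distance of sets A and B.
A = {0, 1, 2, 5, 6}
B = {0, 2, 3, 4, 5, 7, 9}
jac = jaccard_index(A, B)
jac_dist = 1 - jac

# Question 33: Euclidean distance between the college rating vectors
# (Teaching, Fees, Placements, Internship, Infrastructure).
colleges = {
    "A": (5, 2, 5, 5, 3),
    "B": (4, 5, 5, 4, 5),
    "C": (3, 4, 4, 3, 4),
    "D": (1, 3, 1, 1, 2),
}
pairs = [("A", "B"), ("B", "C"), ("C", "D"), ("A", "D")]
distances = {
    pair: euclidean(colleges[pair[0]], colleges[pair[1]]) for pair in pairs
}
closest = min(distances, key=distances.get)

print(round(cos_sim, 4), round(cos_dissim, 4))
print(round(jac, 4), round(jac_dist, 4))
print({pair: round(d, 3) for pair, d in distances.items()}, closest)
```

Under these assumptions the cosine similarity is 3/sqrt(38), about 0.487; the Jaccard index is 3/9; and the closest pair of colleges is B-C at sqrt(5), about 2.24.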

Module 2
1. Define decision tree. (3)
2. What are the advantages and disadvantages of the decision tree approach over other
approaches for data mining? (3)
3. What is clustering? What are the different clustering techniques? Write some
applications of cluster analysis. (6)
4. Define Entropy and Information Gain with suitable examples. (5)
5. Describe the working of the PAM K-medoids clustering algorithm. (5)
6. Define Classification and Prediction. (5)
7. Describe K-medoids algorithm in brief. (5)
8. Using K-means clustering algorithm, determine 3 clusters for the following eight data
points: A1(2,10), A2(2,5), A3(8,4), B1(5,8), B2(7,5), B3(6,4), C1(1,2), C2(4,9). Distance
function is Euclidean distance. Do it for 3 iterations. (10)
9. Define Jaccard coefficient. (2)
10. Apply the K-means clustering for the following dataset for two clusters. Consider data
point S1 and S2 are the initial centroid of the respective clusters. Continue the
procedure for three iterations. (12)
Sample No. X Y
S1 185 72
S2 170 56
S3 168 60
S4 179 68
S5 182 72
S6 188 77
S7 180 71
S8 180 70
S9 183 84
S10 180 88
S11 180 67
S12 177 76

11. What do you mean by attribute selection measure with respect to decision tree
induction? (3)
12. Suppose that the data mining task is to cluster the following ten points representing
location into two clusters:
X1 2 6
X2 3 4
X3 3 8
X4 4 7
X5 6 2
X6 6 4
X7 7 3
X8 7 4
X9 8 5
X10 7 6

The distance function is defined as |Xi – Xj| + |Yi – Yj|. Use K-medoids algorithm to
determine the two clusters. (10)
13. Write down the algorithm for K-means clustering. (5)
14. What is hierarchical clustering technique? (5)
15. Distinguish between partitional clustering and hierarchical clustering. (5)
16. What is Classification and Clustering? Explain the key differences between them. (5)
17. What is a classification problem? What is the difference between Supervised and
Unsupervised Learning? (5)
18. Differentiate agglomerative hierarchical clustering and divisive hierarchical
clustering. (5)
19. Explain the ID3 algorithm for Decision Trees. (5)
20. What is a dendrogram? Explain it with the help of an example. (5)
21. Define Euclidean and Manhattan distance metric. (5)
22. What is a centroid point in K-means clustering? (2)
23. Apply Hierarchical Agglomerative clustering technique on the following dataset (Use
Complete Linkage method). Draw the corresponding dendrogram. (8)
Sample No. X Y
S1 40 53
S2 22 38
S3 35 32
S4 26 19
S5 8 41
S6 45 30
S7 40 50

24. Use single and complete linkage agglomerative clustering to group the data described
by the following distance matrix. Show the dendrograms. (5+5)
P1 P2 P3 P4 P5
P1 0 9 3 6 11
P2 9 0 7 5 10
P3 3 7 0 9 2
P4 6 5 9 0 8
P5 11 10 2 8 0

25. How does agglomerative hierarchical clustering work? (5)


26. How does divisive hierarchical clustering work? (5)
27. Explain Bayesian probability theory. (5)
28. What is a regression model? (3)
29. What are the different types of regression? (5)
30. Explain simple linear regression. (5)
31. Explain multiple linear regression. (5)
32. How can the accuracy of a linear regression model be improved? (5)
33. Using the dataset shown below, create a regression model to predict the Test2 score
from the Test1 score. Then predict the Test2 score for a student who scored 46 in Test1. (10)
Test1 Test2
59 56
52 63
44 55
51 50
42 66
42 48
41 58
45 36
27 13
63 50
54 81
44 56
50 64
47 50

34. The table below shows the height (to the nearest inch) and weight (to the nearest kg)
of a sample of 10 male students drawn at random from the 1st-year students of an
Engineering College. Construct the regression line that approximates the data set: (10)
X (height in inches)  63 59 62 65 61 64 65 62 60 58
Y (weight in kg)      55 52 54 58 63 60 59 53 60 51

35. A random sample of 15 students in a class was selected and the data is given
below:
Internal Exam  15 23 18 23 24 22 22 19 19 16 24 11 24 16 23
External Exam  49 63 58 60 58 61 60 63 60 52 62 30 59 49 68
Construct the regression line that approximates the data set. (10)

36. Explain logistic regression with example. (5)


37. Explain Ordinary Least Squares (OLS) algorithm in the context of regression analysis. (5)
38. Explain the key differences between classification and regression. (5)
39. What are the advantages of Logistic Regression? (3)
40. What are the disadvantages of Logistic Regression? (3)
41. What is sigmoid function? (3)
42. List down the advantages of the Decision Trees. (3)
43. List down the disadvantages of the Decision Trees. (3)

44. Create a decision tree for the following data given below. The objective is to predict
the class category (Play Tennis or not?). (10)

45. Write down the kNN Algorithm. (5)


46. Why is kNN known as a lazy learning and non-parametric algorithm? (2.5+2.5)
47. List down the advantages and disadvantages of kNN algorithm. (5)
48. Apply kNN classification algorithm on the following dataset and predict the class for
P (X =3 and Y =7), where k = 3. (5)

X  Y  Class
7  7  False
7  4  False
3  4  True
1  4  True

49. Apply kNN classification algorithm on the following dataset and predict the class for
P (X = 5 and Y = 7), where k = 3. (5)
X  Y  Class
7  7  False
7  5  False
3  4  True
4  4  True
4  3  False

50. Apply Naïve Bayes classification to the data set of Question 44. The objective is
to predict the class category (Play Tennis or not?). (10)
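
Worked sketch for Questions 48 and 49 (kNN classification). The Python below is one possible illustration, assuming Euclidean distance and a simple majority vote among the k nearest neighbours; the question itself does not fix the distance function, and the name knn_predict is an assumption introduced here.

```python
import math
from collections import Counter

def knn_predict(train, query, k):
    """Return the majority class among the k training points nearest to query.

    train is a list of ((x, y), label) pairs; distance is Euclidean (assumed).
    """
    by_distance = sorted(train, key=lambda row: math.dist(row[0], query))
    votes = Counter(label for _, label in by_distance[:k])
    return votes.most_common(1)[0][0]

# Question 48: predict the class of P(3, 7) with k = 3.
data_48 = [((7, 7), False), ((7, 4), False), ((3, 4), True), ((1, 4), True)]
pred_48 = knn_predict(data_48, (3, 7), k=3)

# Question 49: predict the class of P(5, 7) with k = 3.
data_49 = [
    ((7, 7), False),
    ((7, 5), False),
    ((3, 4), True),
    ((4, 4), True),
    ((4, 3), False),
]
pred_49 = knn_predict(data_49, (5, 7), k=3)

print(pred_48, pred_49)
```

With these assumptions, the three nearest neighbours of P(3, 7) are (3, 4), (1, 4) and (7, 7), so the vote is True; for P(5, 7) they are (7, 7), (7, 5) and (4, 4), so the vote is False.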

Module 3
1. What do you understand by ‘Secular Trend’ in the analysis of a time series? Explain
with examples. (5)
2. Explain the process of Exponential Smoothing with an example. (5)
3. Mention the merits and demerits of Moving Average Method. (5)
4. Distinguish between ‘seasonal’ and ‘cyclical’ fluctuations in time series data. (5)
5. Find the trend for the following series using a three-year weighted moving average
with weights 1, 2, 1. (5)
Year 1 2 3 4 5 6 7
Values 2 4 5 7 8 10 13

6. Discuss the method of fitting mathematical curves for determining the trend in time
series data. (5)

7. Fit a straight-line trend equation by the method of least squares from the following
data and then estimate the trend value for the year 2025. (5)
Year 2011 2012 2013 2014 2015 2016 2017 2018 2019 2020
Value 65 80 84 75 77 71 76 74 70 68

8. With which component of the time series would you associate each of the following?
Why? (2 X 5=10)
(i) The rainfall that occurred in Calcutta for four days in February, 1981.
(ii) A decline in ice cream sales during November to March.
(iii) An era of prosperity.
(iv) Increase in garment sales in October.
(v) General increase in sale of T.V. sets.
9. Explain full periodic pattern and partial periodic pattern for time-related sequence
data with examples. (5)
10. Assuming a four-yearly cycle, calculate the trend by the method of moving averages
from the following data relating to the production of tea in India: (5)
Year               2001 2002 2003 2004 2005 2006 2007 2008 2009 2010
Production (lbs.)  464  515  518  467  502  540  557  571  586  612

11. The trend equation fitted to annual average sales is given by y = 230 + 20x (unit of
x: one year; origin: 30th June 2008). Adjust the trend equation to find the monthly
trend values, and find the trend value for the month of January 2020. (5)

12. Using 1964 as the origin, obtain a straight-line trend equation by the method of least
squares. Find the trend value of the missing year 1961. (5)
Year 1960 1962 1963 1964 1965 1966 1969
Value 140 144 160 152 168 176 180

13. Fit a second-degree polynomial to the following data: (5)


Year         1882 1883 1884 1885 1886 1887 1888 1889 1890
Price index  84   82   76   72   69   68   70   72   73

14. Discuss briefly how we can obtain the monthly trend from annual data when an odd
or an even number of years is given. (5)
15. a) Show that the sum of the weights in exponential smoothing is one. (3)
b) The last period’s forecast was 70 and demand was 60. What is the simple
exponential smoothing forecast with smoothing coefficient of 0.4 for the next period?
(2)
16. Fit a straight-line trend by the least squares method to the following figures of
production of a sugar factory: (5)
Year                    1969 1970 1971 1972 1973 1974 1975
Production ('000 tons)  76   87   95   81   91   96   90
Estimate the production for 1976.
17. Explain in brief Similarity Search in Time-Series Analysis. (5)
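
Worked sketch for Questions 5 and 15(b) (moving averages and exponential smoothing). The Python below is an illustrative sketch only. It assumes that each weighted average is centred on the middle year of its three-year window, and that simple exponential smoothing uses the usual update: new forecast = previous forecast + alpha * (demand - previous forecast). The function names are assumptions introduced here.

```python
def weighted_moving_average(values, weights):
    """Weighted average over each consecutive window of len(weights) values."""
    k = len(weights)
    total = sum(weights)
    return [
        sum(w * v for w, v in zip(weights, values[i:i + k])) / total
        for i in range(len(values) - k + 1)
    ]

def ses_forecast(prev_forecast, demand, alpha):
    """One step of simple exponential smoothing."""
    return prev_forecast + alpha * (demand - prev_forecast)

# Question 5: three-year weighted moving average with weights 1, 2, 1.
values = [2, 4, 5, 7, 8, 10, 13]            # years 1 to 7
trend = weighted_moving_average(values, [1, 2, 1])
trend_by_year = dict(zip(range(2, 7), trend))  # centred on years 2 to 6

# Question 15(b): last forecast 70, demand 60, smoothing coefficient 0.4.
next_forecast = ses_forecast(70, 60, 0.4)

print(trend_by_year)
print(next_forecast)
```

Under these assumptions the trend values for years 2 to 6 are 3.75, 5.25, 6.75, 8.25 and 10.25, and the next-period forecast is 70 + 0.4 * (60 - 70) = 66. No trend value is obtained for years 1 and 7 because their windows are incomplete.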

Module 4
1. Define Precision, Recall and F1 score in the context of evaluation of performance of a
machine learning model. (5)
2. A model makes predictions and predicts 120 examples as belonging to the minority
class, 90 of which are correct, and 30 of which are incorrect. Find the Precision of the
model. (3)
3. Precision of model is 0.75 and Recall is 0.43. Find the F-score. (2)
4. What is a Class Imbalance problem in the context of data analysis? (5)
5. Explain confusion matrix. (5)
6. Describe in brief the methodologies for Stream Data Processing and Stream Data
Systems. (10)
7. Explain Random Sampling, Sliding Window and Histogram concept with respect to
mining data streams. (6)
8. Explain Graph Mining. (5)
9. What is Social Network Analysis? (5)
10. What are the characteristics of Social Networks? Explain each of them briefly. (5)
11. What is frequent pattern mining in data stream? (5)
12. What is sequential pattern mining in data stream? (5)
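
Worked sketch for Questions 2 and 3 (precision and F-score). The Python below is a minimal illustration using the standard definitions: precision = TP / (TP + FP), and the F-score taken as the harmonic mean of precision and recall (the F1 score). The function names are assumptions introduced here.

```python
def precision(true_positives, false_positives):
    """Fraction of positive predictions that are correct."""
    return true_positives / (true_positives + false_positives)

def f1_score(p, r):
    """Harmonic mean of precision p and recall r."""
    return 2 * p * r / (p + r)

# Question 2: 120 minority-class predictions, 90 correct and 30 incorrect.
q2_precision = precision(true_positives=90, false_positives=30)

# Question 3: precision 0.75 and recall 0.43.
q3_f1 = f1_score(0.75, 0.43)

print(q2_precision, round(q3_f1, 4))
```

With these definitions the precision in Question 2 is 90/120 = 0.75, and the F-score in Question 3 is 0.645/1.18, about 0.547.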

Module 5
1. What do you understand by Web Mining? What are the three types of web mining?
(5)
2. Compare Web Mining with Data Mining. (5)
3. Explain the challenges for mining the World Wide Web. (5)
4. Explain the HITS Algorithm with an example. (5)
5. Explain in brief Web Structure Mining. (5)
6. Explain in brief Web Content Mining. (5)
7. Explain in brief Web Usage Mining. (5)
8. What is Vision-based Page Segmentation (VIPS)? (5)
9. What is a hub in the context of web pages? (3)
10. What is meant by authoritative Web pages? (3)
11. Write a short note on Automatic Classification of Web documents. (5)
12. Discuss about mining multimedia data on the web. (5)
13. There are 3 pages in a web graph: A, B and C. A points to B and C but has no
incoming links itself. B and C have no outgoing links. For a damping factor d = 0.6,
find the Page Ranks of A, B and C. (5)
14. There are 3 pages in a web graph: A, B and C. A and B both point to C but have no
incoming links themselves. C has no outgoing links. For a damping factor d = 0.6,
find the Page Ranks of A, B and C. (5)
15. Explain the Page Rank algorithm. (5)
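
Worked sketch for Questions 13 and 14 (Page Rank). The questions do not say which form of the formula is expected, so the Python below assumes the non-normalised form PR(p) = (1 - d) + d * sum(PR(q) / out(q)) over the pages q that link to p, with no special handling of pages that have no outgoing links. A course that uses the (1 - d)/N variant would get different numbers. The function name and the dictionary representation of the graph are assumptions introduced here.

```python
def pagerank(links, d, iterations=50):
    """Iterate PR(p) = (1 - d) + d * sum(PR(q) / out(q)) for q linking to p.

    links maps each page to the list of pages it points to.
    """
    pages = list(links)
    rank = {p: 1.0 for p in pages}  # arbitrary starting values
    for _ in range(iterations):
        new_rank = {}
        for p in pages:
            incoming = sum(
                rank[q] / len(links[q]) for q in pages if p in links[q]
            )
            new_rank[p] = (1 - d) + d * incoming
        rank = new_rank
    return rank

# Question 13: A points to B and C; B and C have no outgoing links.
ranks_13 = pagerank({"A": ["B", "C"], "B": [], "C": []}, d=0.6)

# Question 14: A and B both point to C; C has no outgoing links.
ranks_14 = pagerank({"A": ["C"], "B": ["C"], "C": []}, d=0.6)

print({p: round(r, 4) for p, r in ranks_13.items()})
print({p: round(r, 4) for p, r in ranks_14.items()})
```

Under the assumed formula, Question 13 gives PR(A) = 0.4 and PR(B) = PR(C) = 0.4 + 0.6 * (0.4 / 2) = 0.52, while Question 14 gives PR(A) = PR(B) = 0.4 and PR(C) = 0.4 + 0.6 * (0.4 + 0.4) = 0.88. Because neither graph contains a cycle, the values stop changing after two iterations.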

Module 6
1. What is the requirement of dimensionality reduction and explain how PCA helps in
that scenario? (5)
2. Explain the steps of PCA Algorithm. (5)
3. Explain Curse of Dimensionality. (5)
4. Explain the differences between Social Network Analysis and Traditional Data
Mining. (5)
5. What does Social Network Analysis (SNA) mean? (5)
6. Write a short note on issues and challenges in data mining. (5)
7. What are the recent trends in data mining? (10)
8. What do you understand by the term “Graph Mining”? (5)
9. Why is Class Imbalance a problem? Explain with an example. (10)
10. What are the recent developments in distributed data warehouse environments? (10)
11. Explain the concept of distributed data mining. (5)
12. What are the issues relating to the diversity of data types? (5)
13. Find the covariance matrix of the following data:
X 2.5 0.5 2.2 1.9 3.1 2.3 2 1 1.5 1.1
Y 2.4 0.7 2.9 2.2 3.0 2.7 1.6 1.1 1.6 0.9
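
Worked sketch for Question 13 (covariance matrix). The Python below is an illustrative sketch that assumes the sample covariance, dividing by n - 1; if the population covariance is intended, divide by n instead. The function name is an assumption introduced here.

```python
def covariance_matrix(xs, ys):
    """2 x 2 sample covariance matrix of two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n

    def cov(us, mean_u, vs, mean_v):
        # Sum of products of deviations, divided by n - 1 (sample covariance).
        return sum((u - mean_u) * (v - mean_v) for u, v in zip(us, vs)) / (n - 1)

    return [
        [cov(xs, mean_x, xs, mean_x), cov(xs, mean_x, ys, mean_y)],
        [cov(ys, mean_y, xs, mean_x), cov(ys, mean_y, ys, mean_y)],
    ]

X = [2.5, 0.5, 2.2, 1.9, 3.1, 2.3, 2.0, 1.0, 1.5, 1.1]
Y = [2.4, 0.7, 2.9, 2.2, 3.0, 2.7, 1.6, 1.1, 1.6, 0.9]
C = covariance_matrix(X, Y)

for row in C:
    print([round(value, 4) for value in row])
```

With the n - 1 divisor, var(X) = 5.549/9, about 0.6166; var(Y) = 6.449/9, about 0.7166; and cov(X, Y) = 5.539/9, about 0.6154. The matrix is symmetric, so both off-diagonal entries are equal.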
