Ban Quiz Answer

The document contains a test bank with questions about data mining and data warehousing topics. It includes 50 multiple choice questions across 5 weeks covering topics like data quality, data preprocessing, data visualization, data cubes, and more. The questions are from a course on data mining and data warehousing.

Uploaded by

Hazel Natuel
Copyright
© All Rights Reserved

Test Bank Questions ([# of questions])

Data Mining and Data Warehousing - IT 446


Developed by Bradley C. Watson

 Reference: Tan, Steinbach and Kumar (2006)


 Reference: IBM, “Descriptive, predictive, prescriptive: Transforming asset and facilities
management with analytics” (2013)

Week 2
1) Which of the following is a motivating challenge for developing data mining:
A. No one cares about data mining
B. Scalability
C. Algorithms are very complex
D. Boredom

2) Data mining, as a research discipline, does not draw ideas from this research discipline:
A. Statistics
B. Linguistics
C. Artificial Intelligence
D. Information Theory

3) Database technology does not support data mining in terms of


A. Theoretical basis of data mining research
B. Efficient storage
C. Query processing
D. Indexing

4) Which of the following is an important data-related success factor in data mining efforts:
A. Data type(s) being mined.
B. Processing power available
C. Algorithm design
D. Descriptive statistics-based tools

5) Graph-Based data (choose the false answer)


A. Is a rare form of data that is only used in data mining if absolutely necessary given
the difficulty of obtaining it and processing it
B. Can consist of data objects whose relationship to each other is representable by
placement as nodes in a graph
C. Can be data objects that are themselves graphs
D. Can, in some instances, be mined in terms of substructures
6) Ordered data (choose the true answer)
A. Is never ordered based on one or more time attributes (stamps)
B. Can be sequentially ordered in terms of spatial (positional)- or time-based attributes
C. Is of no use in data mining
D. Is not a method for modeling genetic information, such as genes in human DNA

7) The terms “Analytics” and “Big Data” have become essentially synonymous with the
term “data mining” in recent years.

Answer true

8) Search algorithms are the only AI techniques of interest to data mining researchers

Answer false

9) The term “noise” has a technical meaning in data mining referring to the distortion of
data from their true value and/or the addition of spurious objects.

Answer true

10) Effectively handling “noisy” data is the object of much data mining research, as data used
in the real world often contains significant amounts of noise, thus potentially
contaminating data mining results to the point where those results are useless.

Answer true

11) Cluster analysis is a means for discovering patterns in the data based on highly associated
features of the data

Answer false

12) Cluster analysis is a means for discovering and grouping together (clustering) sets of
observations that are closely related

Answer true

13) Association analysis is a means for discovering and grouping together (clustering) sets of
observations that are closely related

Answer false
Week 3
1) Which of the following is not a factor in data quality:
A. Accuracy
B. Completeness
C. Relevance
D. Timeliness

14) Which of the following data attributes is not among those most commonly found in
real-world databases and data warehouses:
A. Inaccuracy
B. Interpretability
C. Incompleteness
D. Timeliness

15) Data preprocessing does not include which of these tasks:


A. Data classification
B. Data integration
C. Data reduction
D. Data cleaning

16) Dirty data can cause which of the following problems regarding data mining results
A. Distrust of the results by those who must rely on them to make important decisions
B. Inaccurate results
C. Incomplete results
D. All of the above

17) Data integration often involves:


A. Reducing the dimensionality of the data set to be mined
B. Removing outliers
C. Determining which data objects in the various data sources match to each other
D. Transforming the data from one form to another

18) Range and variance are measures of location.

Answer false

19) Median is a summary statistic, where the function median(x) returns the middle value in a
data set with an odd number of values and the average of the two middle values if the
number of values is even.

Answer true
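
The definition in question 19 can be illustrated with a minimal Python sketch. The function name and the sample values are assumptions chosen for illustration; they are not part of the original test bank.

```python
def median(values):
    """Return the middle value, or the mean of the two middle values."""
    ordered = sorted(values)
    n = len(ordered)
    if n == 0:
        raise ValueError("median of an empty data set is undefined")
    mid = n // 2
    if n % 2 == 1:
        # Odd number of values: the single middle value.
        return ordered[mid]
    # Even number of values: average of the two middle values.
    return (ordered[mid - 1] + ordered[mid]) / 2


odd_result = median([7, 1, 5])        # middle of 1, 5, 7
even_result = median([4, 1, 3, 2])    # average of 2 and 3
```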

20) Visualization techniques are often specialized to the type of data being analyzed.

Answer true
21) A multidimensional representation of data with all possible totals (aggregates) is known
as a _data cube_.

22) If we aggregate over all the dimensions of a data set except for two, we are creating a two
dimensional table using a data reduction approach through aggregation known as
_pivoting_.
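
A small sketch of the aggregation described in question 22, assuming a hypothetical fact table with three dimensions (product, location, month) and one measure (units sold). The records and the helper name `pivot` are invented for illustration.

```python
from collections import defaultdict

# Hypothetical fact records: (product, location, month, units_sold)
records = [
    ("milk", "north", "jan", 10),
    ("milk", "north", "feb", 5),
    ("milk", "south", "jan", 7),
    ("honey", "north", "jan", 2),
    ("honey", "south", "feb", 4),
]


def pivot(rows, row_dim, col_dim, measure):
    """Sum the measure over every dimension except row_dim and col_dim."""
    table = defaultdict(int)
    for row in rows:
        table[(row[row_dim], row[col_dim])] += row[measure]
    return dict(table)


# Keep product (index 0) and location (index 1); aggregate over month.
by_product_location = pivot(records, 0, 1, 3)
```

The result is a two-dimensional table keyed by (product, location), with the month dimension summed away.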

23) A _MOLAP_ is a database system that might use as its base data model a data cube
representation.

24) If we analyze monthly data in terms of the days of each month we are _drilling down_.

25) Two steps are necessary to define data in a multidimensional array representation:
_identification of dimensions_ and an _attribute_ that is the analysis focus.

Week 4
1) Which of the following is/are unordered categorical variable(s):
A. Gender(male, female)
B. Economic status (low, middle, high)
C. Set of even integers from n=2 to n=30
D. Hair color (brown, brown, red)
E. All of the above

26) The set of odd integers from n=5 to n=41 is which of these types of variables:
A. Categorical
B. Interval
C. Independent
D. Ordinal
E. Dependent

27) Graph-Based data (choose the false answer)


A. Is a rare form of data that is only used in data mining if absolutely necessary given
the difficulty of obtaining it and processing it
B. Can consist of data objects whose relationship to each other is representable by
placement as nodes in a graph
C. Can be data objects that are themselves graphs
D. Can, in some instances, be mined in terms of substructures

28) Spatially related data


A. Is always best understood through visualization techniques based on those used by
map makers
B. Often is best analyzed through visualization techniques
C. Cannot be adequately mined using current data mining approaches
D. Includes genetic information, such as genes in human DNA.
29) Measures of location for multivariate data can often be obtained by computing
A. The mean and/or the median for each attribute separately
B. The range and variance of the most important attribute and dividing the range by the
variance
C. A roll-up value for each attribute and choosing the attribute with the lowest value
D. The probability that each variable is an independent variable and reporting the
attribute with the highest probability

30) A person who scores at the 50th percentile on a standardized test


A. Did as well as or better than half of those who took the test
B. Missed exactly 50% of the questions on the test
C. Clearly is not knowledgeable about the field in which they took the test
D. Is an outlier from the norm amongst people taking the test

31) According to Han, Kamber and Pei, data warehouses are a multidimensional space for
storing data

Answer true

32) The four key words in William H. Inmon’s definition of a data warehouse that separate
it from other types of data storage structures are: subject-oriented, integrated,
time-variant and nonvolatile.

Answer true

33) A library maintained by a business does not fit William H. Inmon’s definition of a data
warehouse.

Answer false

34) OLTP is an acronym for _online transaction processing_.

35) Data warehouses are not a form of online analytical processing (OLAP) systems

Answer false

36) The term “data cube” refers to a data structure that is defined by dimensions and facts.

Answer true

37) Data cubes consist of “n” dimensions, where n ≥ 2.

Answer true
38) If one dimension of a data cube is “location”, described by the attributes number, street,
city, province_or_state, zip-code and country, then there is implicit in the location
dimension a concept hierarchy defined in the schema for that database.

Answer true

39) Concept hierarchies in a data warehouse represent easily materialized views of the data
using several non-interactive data cube operations.

Answer false

40) The slice and dice operations both form subcubes, but the slice is done on one dimension,
while the dice is done on multiple dimensions.

Answer true
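
The distinction in question 40 can be sketched as follows, assuming a hypothetical cube stored as a dictionary keyed by (product, location, month). In this sketch a slice keeps the selected dimension in each key rather than dropping it, which is a simplification.

```python
# Hypothetical cube cells keyed by (product, location, month).
cube = {
    ("milk", "north", "jan"): 10,
    ("milk", "south", "jan"): 7,
    ("milk", "north", "feb"): 5,
    ("honey", "north", "jan"): 2,
    ("honey", "south", "feb"): 4,
}
DIMS = ("product", "location", "month")


def dice(cells, **allowed):
    """Keep cells whose value on each named dimension is in the allowed set."""
    def keep(key):
        return all(key[DIMS.index(d)] in vals for d, vals in allowed.items())
    return {k: v for k, v in cells.items() if keep(k)}


def slice_(cells, dim, value):
    """A slice is a selection on a single dimension."""
    return dice(cells, **{dim: {value}})


jan_slice = slice_(cube, "month", "jan")
north_milk_dice = dice(cube, product={"milk"}, location={"north"})
```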

Week 5
1 A key motivation for using a multidimensional approach to the data to be analyzed is that
aggregating data in multiple ways is:
A. Data cubes are an easy data structure for high-level managers to grasp
B. The computational efficiency of algorithms that work with more than two dimensions
is very high
C. Important
D. Non-existent, there is no key motivation to do so

41) Data cubes:


A. Do not need to consist of dimensions (D) of equal size (number of attributes per
$D_i$)
B. Are not necessarily three dimensional, despite the name, data cube.
C. Are a generalization of what is meant by the term “cross-tabulation”.
D. Are defined as being multidimensional representations of data and all possible
aggregations (totals) of that data.
E. None of the above.

42) Regression analysis


A. Is useless with data sets that can be represented multidimensionally.
B. Is based on resolving this equation: $D = \{(x_i, y_i) \mid i = 1, 2, 3, \ldots, N\}$
C. Involves the concept of explanatory attributes that are either discrete or continuous.
D. All of the above
E. None of the above
43) Computing aggregate totals in data cubes:
A. Is always computationally efficient.
B. Involves fixing specific values for some set of attributes that define the dimensions of
the cube and then summing over all the possible values for the remaining attributes
that make up the cube’s dimensions.
C. Is not part of the analysis effort in data mining that uses data cubes
D. None of the above

44) Drill down operations are needed


A. When the summary data at a given level of abstraction is insufficient to reveal
important patterns in the data, such as sales of milk and honey on specific days of the
week when the data you have is only at the monthly level of abstraction.
B. Involve summarizing to a higher level of abstraction, such as summarizing daily sales
information into weekly or monthly.
C. Are very expensive computationally, and so are to be avoided at all costs.
D. All of the above

45) A two-dimensional table


A. Is too simple a data structure to reveal significant patterns in data sets.
B. Is a form of the data structure known as a “data cube”.
C. Is a complex data structure that is computationally difficult to analyze
D. While basic to the definition of relational databases is never found in serious
enterprise level data warehouses.

46) Data cubes can store precomputed measures, such as aggregated values of dimensions of
an attribute such as daily sales totals.

Answer true

47) Explanatory multidimensional data mining is never an interactive process, given the
intense computations that are necessarily involved.

Answer false

48) Computation costs are not a factor in performing knowledge discovery in a
multidimensional online analysis environment.

Answer false

49) Preprocessing, or computing, the values to be stored in a given data cube allows for more
efficient real-time querying in a multidimensional online analysis processing
environment.

Answer true

50) A data cube of n dimensions contains $2^n$ _cuboids_.


51) A data cube can be usefully viewed as a lattice of cuboids.

Answer true
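
To make the lattice view concrete, the sketch below enumerates every group-by over three hypothetical dimensions. It assumes no concept hierarchies on any dimension, in which case a cube over n dimensions has 2^n cuboids. The dimension names are invented for illustration.

```python
from itertools import combinations


def cuboids(dimensions):
    """Enumerate every group-by (cuboid) over the given dimensions."""
    result = []
    for k in range(len(dimensions) + 1):
        # All ways to keep exactly k of the dimensions.
        result.extend(combinations(dimensions, k))
    return result


lattice = cuboids(("product", "location", "month"))
apex = lattice[0]     # (): the fully aggregated (apex) cuboid
base = lattice[-1]    # all three dimensions: the base cuboid
```

Each level k of the lattice holds the cuboids that keep exactly k dimensions, from the apex cuboid (k = 0) down to the base cuboid (k = n).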

52) If we aggregate only those cells where the number of items bought by a particular
customer on a given day is greater than x, then the resulting partially materialized data
cube is an _iceberg_ data cube.

53) Drill down operations on a prediction cube are a computational challenge, given the need
to materialize cell values at many different granularity levels.

Answer true

Week 6
1) Which of the following is/are true of a rule extracted via association analysis (AA):
A. Always describes an association between/among two or more items that is
non-random, not based in chance
B. Can be, but is not always, reliably useful to predict future behaviors of the population
described by the data from which the rule is derived
C. Is often computationally expensive to discover via AA
D. Must always be assumed to be true in 100% of all future behaviors of the population
that is described by the data from which the rule was extracted

54) If a binary representation is chosen for market basket data, then


A. It is true that the presence of an item is often less important than its absence
B. The item variable is a binary type referred to as asymmetric, because the value one is
often more important to the final results than a value zero for any given item
C. The only two possible values for an item variable are ‘-1’ and ‘1’
D. Calculations on item variables become very computationally inefficient because
computers cannot easily handle binary valued variables

55) If the equation $\sigma(X) = |\{t_i \mid X \subseteq t_i,\ t_i \in T\}|$ represents the value known as the
support count in associative analysis, then
A. It must be true that $t_i$ represents a member of the itemset $X$
B. The higher the value of $\sigma(X)$ is determined to be, the less likely it is that the
itemset is meaningful in the final analysis results
C. $\sigma(X)$ cannot be the support value, because the set $T$ is clearly always going to
be equal to the null set
D. $X$ is the itemset for which we are trying to determine the number of transactions,
$t_i$, in the data set of transactions, $T$, that contain $X$
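
The support count in question 55 is the number of transactions in T that contain the itemset X. A minimal sketch, assuming a small hypothetical set of market basket transactions:

```python
# Hypothetical market basket transactions (the set T).
transactions = [
    {"bread", "milk"},
    {"bread", "diapers", "beer", "eggs"},
    {"milk", "diapers", "beer", "cola"},
    {"bread", "milk", "diapers", "beer"},
    {"bread", "milk", "diapers", "cola"},
]


def support_count(itemset, transactions):
    """Count the transactions t in T for which itemset is a subset of t."""
    return sum(1 for t in transactions if itemset <= t)


count = support_count({"milk", "diapers", "beer"}, transactions)
```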
56) The theorem that states “If an itemset is frequent, then all of its subsets must also be
frequent” is often referred to as the
A. Theorem that can never be proved
B. Apriori Principle
C. The key to understanding market basket analysis
D. All of the above

57) In associative analysis, “frequent” or, put differently, the minimum support value, that an
itemset must have to be considered in the final results
A. Is specific to the data being analyzed
B. Is a constant for a given organization’s data, but can be different across organizations
C. Is often found to be non-determinable
D. Is often just a random value chosen by the analyst and so cannot be tested as to its
validity

58) Maximal and closed itemsets are


A. Rare forms of itemsets seldom used in associative analysis (market basket analysis)
that are only used in data mining if absolutely necessary given the difficulty of
obtaining and processing them
B. Useful as a compaction of large collections of itemsets where the originating
collection is too large to compute associative rules cost effectively
C. Non-existent
D. None of the above

59) A closed itemset (choose the true answer)


A. Has one or more immediate supersets where the support count is identical to its
support count
B. Has no immediate supersets where the confidence value is identical to its own
C. Is of no use in data mining
D. Has no immediate supersets where the support count is identical to its own

60) Which of these possible attributes of a maximal itemset actually is the defining one:
A. None of the immediate subsets of the given itemset are frequent, but it is frequent
B. All of the immediate supersets of the given itemset are frequent, but it is frequent
C. None of the immediate supersets of the given itemset are frequent, but it is frequent
D. One or more of the immediate supersets of the given itemset are frequent, but it is not
frequent
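
The definitions tested in questions 59 and 60 can be checked mechanically. The sketch below assumes four hypothetical transactions and a minimum support count of 2; the data and helper names are invented for illustration.

```python
# Hypothetical data: four transactions over the items a, b, c, d.
transactions = [{"a", "b", "c"}, {"a", "b", "c"}, {"a", "b"}, {"a", "d"}]
items = {"a", "b", "c", "d"}
MINSUP_COUNT = 2  # assumed minimum support count


def support_count(itemset):
    return sum(1 for t in transactions if itemset <= t)


def immediate_supersets(itemset):
    """Every itemset obtained by adding exactly one more item."""
    return [itemset | {item} for item in items - itemset]


def is_closed(itemset):
    """Closed: no immediate superset has the same support count."""
    own = support_count(itemset)
    return all(support_count(s) < own for s in immediate_supersets(itemset))


def is_maximal(itemset):
    """Maximal: frequent itself, with no frequent immediate superset."""
    if support_count(itemset) < MINSUP_COUNT:
        return False
    return all(support_count(s) < MINSUP_COUNT
               for s in immediate_supersets(itemset))
```

With this data, {a, b} is closed but not maximal, because its superset {a, b, c} is still frequent, while {a, b, c} is both closed and maximal.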

61) Which of these algorithms is often useful in finding maximal itemsets in a collection:
A. Shallow first
B. Binary sort
C. Backtrack
D. Depth-first
62) Because the FP-Growth algorithm abandons the generate and test approach of the Apriori
algorithm in favor of a significantly more direct paradigm of storing into a compact data
structure and directly selecting the frequent itemsets from the structure, one can safely
say the FP-Growth algorithm is a radical departure from Apriori.

Answer true

63) Overlapping paths in a FP-tree are indicative of corrupted input data.

Answer false

64) Correlation analysis can be used to supplement support-confidence frameworks to
discover interesting patterns, especially when low support thresholds are being used.

65) The end user of an analysis is the only one who can ultimately judge if a given rule is
interesting in terms of the results it produces.

Answer true

66) A good interestingness measure will not be impacted by transactions that do not contain
itemsets of interest, because measures that are so impacted generate unstable results

67) If the function Lift(X, Y) returns a value less than 1, then the presence of event X in a
given set of events most likely means event Y is absent, and the reverse is also true.

Answer true
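
A worked sketch of the lift measure in question 67, using the standard definition Lift(X, Y) = P(X and Y) / (P(X) P(Y)). The counts below are hypothetical and chosen only to produce a value below 1.

```python
def lift(n_total, n_x, n_y, n_xy):
    """Lift(X, Y) = P(X and Y) / (P(X) * P(Y)), computed from raw counts."""
    p_x = n_x / n_total
    p_y = n_y / n_total
    p_xy = n_xy / n_total
    return p_xy / (p_x * p_y)


# Hypothetical counts: 10,000 transactions, 6,000 contain X,
# 7,500 contain Y, and 4,000 contain both.
value = lift(10_000, 6_000, 7_500, 4_000)
```

Here the value is below 1, so X and Y occur together less often than they would if they were independent, which indicates a negative correlation.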

68) If the $\chi^2$ value is greater than 1 and the observed value of a slot (X, Y) is less than the
expected value, then there is a negative correlation between members X and Y of the slot

69) To say that the interestingness measures lift and $\chi^2$ are not null-invariant is to assert
that their values are not independent of the number of null transactions in the data set
being analyzed.

Answer true

70) Han, Kamber and Pei recommend the use of the Kulc null-invariant measure in
conjunction with the imbalance ratio in determining interestingness.

Answer true
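
A minimal sketch of the two measures named in question 70, using the definitions Kulc(A, B) = (P(A|B) + P(B|A)) / 2 and IR(A, B) = |sup(A) - sup(B)| / (sup(A) + sup(B) - sup(A ∪ B)). The support counts are hypothetical. Neither function takes the total number of transactions as an argument, which is what makes the measures null-invariant.

```python
def kulczynski(sup_a, sup_b, sup_ab):
    """Average of the two conditional probabilities P(B|A) and P(A|B)."""
    return 0.5 * (sup_ab / sup_a + sup_ab / sup_b)


def imbalance_ratio(sup_a, sup_b, sup_ab):
    """|sup(A) - sup(B)| / (sup(A) + sup(B) - sup(A and B))."""
    return abs(sup_a - sup_b) / (sup_a + sup_b - sup_ab)


# Hypothetical support counts: A in 200 transactions, B in 50, both in 50.
kulc = kulczynski(200, 50, 50)
ir = imbalance_ratio(200, 50, 50)
```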

71) All strong association rules are interesting

Answer false
72) __________ is a methodology useful for discovering interesting relationships within
large sets of data.
A. Big Data
B. Association analysis
C. Data Mining
D. Algorithm

73) Market basket transactions show:


A. eggs, milk, and bread
B. monthly customer purchases
C. data relationships
D. daily customer purchase data.

74) Sets of frequent items hidden in large data sets are called _________.
A. association rules.
B. big data.
C. binary representation.
D. analysis.

75) Two key issues that must be addressed when applying association analysis are:
A. Overpopulation
B. Computational Expense
C. Discovering spurious patterns
D. Data density

76) The strength of an association rule can be measured in terms of its _____ and _____.
A. support
B. finances
C. confidence
D. size
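
The two measures in question 76 can be computed directly from support counts. The sketch below assumes a small hypothetical transaction set and evaluates the rule {milk, diapers} -> {beer}. It does not guard against an antecedent that never occurs, which would cause a division by zero.

```python
# Hypothetical market basket transactions.
transactions = [
    {"bread", "milk"},
    {"bread", "diapers", "beer", "eggs"},
    {"milk", "diapers", "beer", "cola"},
    {"bread", "milk", "diapers", "beer"},
    {"bread", "milk", "diapers", "cola"},
]


def support_count(itemset, transactions):
    return sum(1 for t in transactions if itemset <= t)


def rule_metrics(antecedent, consequent, transactions):
    """Return (support, confidence) for the rule antecedent -> consequent."""
    both = support_count(antecedent | consequent, transactions)
    support = both / len(transactions)
    confidence = both / support_count(antecedent, transactions)
    return support, confidence


support, confidence = rule_metrics({"milk", "diapers"}, {"beer"}, transactions)
```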

77) The objective of the ________________ strategy is to find all the items that satisfy the
minsup threshold.
A. Association Rule Discovery
B. Correct Answer
C. Frequent Itemset Generation
D. Incorrect Answer

78) The objective of the ________________ strategy is to extract all the high-confidence
rules from the frequent itemsets found in the Frequent Itemset Generation.
A. Association Rule
B. Rule Generation
C. Association Analysis
D. Association Generation
79) Trimming the exponential search space based on the support measure is known as:
A. Exponential Pruning
B. Apriori Algorithm
C. Trimming
D. Support-Based Pruning
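
Support-based pruning can be sketched as a level-wise search in the style of Apriori. This is a simplified illustration under assumed data and an assumed minimum support count, not a tuned implementation: candidates are formed by a plain pairwise join, and any candidate with an infrequent subset is discarded before its support is counted.

```python
from itertools import combinations


def frequent_itemsets(transactions, minsup_count):
    """Level-wise search that prunes candidates with an infrequent subset."""
    items = sorted({i for t in transactions for i in t})
    current = [frozenset([i]) for i in items]
    frequent = {}
    k = 1
    while current:
        counts = {c: sum(1 for t in transactions if c <= t) for c in current}
        level = {c: n for c, n in counts.items() if n >= minsup_count}
        frequent.update(level)
        k += 1
        # Join frequent (k-1)-itemsets into k-item candidates.
        candidates = {a | b for a in level for b in level if len(a | b) == k}
        # Support-based pruning: drop candidates with an infrequent (k-1)-subset.
        current = [
            c for c in candidates
            if all(frozenset(s) in level for s in combinations(c, k - 1))
        ]
    return frequent


# Hypothetical transactions and threshold.
transactions = [
    {"bread", "milk"},
    {"bread", "diapers", "beer", "eggs"},
    {"milk", "diapers", "beer", "cola"},
    {"bread", "milk", "diapers", "beer"},
    {"bread", "milk", "diapers", "cola"},
]
result = frequent_itemsets(transactions, minsup_count=3)
```

In this run, {bread, diapers, beer} is never counted because its subset {bread, beer} is infrequent.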

80) __________ are visual structures that use branches and leaf nodes to search an item or
itemset.
A. Functions
B. Hash Trees
C. Root Node
D. Algorithms

81) Market basket analysis studies customers’ buying habits by searching for itemsets that are
frequently purchased together (or in sequence).
TRUE

82) None of the association rule mining algorithms use support measure to prune rules and
itemsets.
FALSE

83) The Apriori Principle says that if an itemset is frequent then all of its subsets must also be
frequent.
TRUE
