CHAPTER 1 Data Mining
Answer:
Watch: https://youtu.be/gq_T7EgQXkI
Study: https://matthewrhoads.com/2017/10/14/blog-post-title-2/
(c) Discuss how the evolution of database
technology led to data mining.
Characterization differs from clustering in that the former refers to a summarization of the
general characteristics or features of a target class of data while the latter deals with the
analysis of data objects without consulting a known class label. This pair of tasks is similar
in that they both deal with grouping together objects or data that are related or have high
similarity in comparison to one another.
Classification differs from regression in that the former predicts categorical (discrete,
unordered) labels while the latter predicts missing or unavailable, and often numerical,
data values. This pair of tasks is similar in that they both are tools for prediction.
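The contrast between the two prediction tasks can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the data sets, the function names knn_classify and fit_line, and the choice of a nearest-neighbor vote and a least-squares line are all hypothetical and are not taken from the chapter.

```python
from collections import Counter

def knn_classify(train, query, k=3):
    """Classification: return the majority label among the k nearest points."""
    nearest = sorted(train, key=lambda row: abs(row[0] - query))[:k]
    votes = Counter(label for _, label in nearest)
    return votes.most_common(1)[0][0]

def fit_line(points):
    """Regression: ordinary least-squares fit of y = a*x + b."""
    n = len(points)
    mean_x = sum(x for x, _ in points) / n
    mean_y = sum(y for _, y in points) / n
    sxx = sum((x - mean_x) ** 2 for x, _ in points)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in points)
    a = sxy / sxx
    return a, mean_y - a * mean_x

# Hypothetical data: (income in thousands, credit-risk label)
labelled = [(20, "high"), (25, "high"), (30, "high"),
            (60, "low"), (70, "low"), (80, "low")]
# Hypothetical data: (income in thousands, monthly spend in hundreds)
numeric = [(20, 5.0), (40, 9.0), (60, 13.0), (80, 17.0)]

predicted_label = knn_classify(labelled, 28)       # a discrete category
slope, intercept = fit_line(numeric)
predicted_value = slope * 50 + intercept           # a continuous number
```

The classifier can only answer with one of the labels it has seen, while the fitted line returns a numeric estimate for any income value.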
6. Based on your observation, describe another possible kind of knowledge
that needs to be discovered by data mining methods but has not been listed
in this chapter. Does it require a mining methodology that is quite different
from those outlined in this chapter?
Answer:
There is no standard answer to this question; an answer can be judged on the
novelty and quality of its proposal. For example, one may propose partial
periodicity as a new kind of knowledge, where a pattern is partially periodic
if only some offsets within a given period of a time series show repeating
behavior.
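As a rough sketch of what mining partial periodicity might look like, the Python below checks each offset of a candidate period and keeps only the offsets whose most common symbol recurs often enough. The function name partial_periodic_offsets, the min_conf threshold, and the sample series are all assumptions made for illustration; the chapter does not prescribe an algorithm.

```python
from collections import Counter

def partial_periodic_offsets(series, period, min_conf=0.8):
    """Return {offset: symbol} for offsets whose dominant symbol appears
    in at least min_conf of the periods; other offsets are ignored."""
    stable = {}
    for offset in range(period):
        values = series[offset::period]   # every value at this offset
        if not values:
            continue
        symbol, count = Counter(values).most_common(1)[0]
        if count / len(values) >= min_conf:
            stable[offset] = symbol
    return stable

# Hypothetical symbol sequence with period 4: only offset 0 repeats ("a").
series = list("abcd" "axyz" "aqrs" "amno")
patterns = partial_periodic_offsets(series, period=4)
```

Here only offset 0 shows repeating behavior, so the series is partially rather than fully periodic.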
7. Outliers are often discarded as noise. However, one person’s
garbage could be another’s treasure. For example, exceptions in
credit card transactions can help us detect the fraudulent use of
credit cards.
Using fraudulence detection as an example, propose two methods
that can be used to detect outliers and discuss which one is more
reliable.
Answer:
There are many outlier detection methods. More details can be found in
Chapter 12. Here we propose two methods for fraudulence detection:
a) Statistical methods (also known as model-based methods): Assume that
the normal transaction data follow some statistical (stochastic) model, then
data not following the model are outliers.
b) Clustering-based methods: Assume that the normal data objects belong
to large and dense clusters, whereas outliers belong to small or sparse
clusters, or do not belong to any clusters.
It is hard to say which one is more reliable. The effectiveness of statistical
methods depends heavily on whether the assumptions made for the
statistical model hold for the given data, and the effectiveness of
clustering-based methods depends heavily on which clustering method is chosen.
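The two methods above can be sketched as follows. This is a minimal illustration, not a production fraud detector: the transaction amounts are made up, the z-score test assumes roughly normal data, and the gap-based grouping is a deliberately simple stand-in for a real clustering algorithm. The names zscore_outliers and gap_cluster_outliers and the parameters threshold, max_gap, and min_size are hypothetical.

```python
from statistics import mean, pstdev

def zscore_outliers(values, threshold=2.5):
    """Statistical method: flag values far from the mean, measured in
    standard deviations (assumes roughly normal data)."""
    mu, sigma = mean(values), pstdev(values)
    if sigma == 0:
        return []
    return [v for v in values if abs(v - mu) / sigma > threshold]

def gap_cluster_outliers(values, max_gap=10.0, min_size=3):
    """Clustering-based method: group sorted values separated by at most
    max_gap, then flag members of clusters smaller than min_size."""
    ordered = sorted(values)
    if not ordered:
        return []
    clusters = [[ordered[0]]]
    for v in ordered[1:]:
        if v - clusters[-1][-1] <= max_gap:
            clusters[-1].append(v)    # close enough: same cluster
        else:
            clusters.append([v])      # large gap: start a new cluster
    return [v for c in clusters if len(c) < min_size for v in c]

# Hypothetical transaction amounts; 500 is the suspicious one.
amounts = [20, 22, 19, 21, 23, 20, 18, 22, 21, 500]
statistical_flags = zscore_outliers(amounts)
clustering_flags = gap_cluster_outliers(amounts)
```

Both functions flag the same transaction on this data, but each result depends on its own assumption: the threshold and the normality assumption in the first case, the gap and minimum cluster size in the second.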
8. Describe three challenges to data mining regarding
data mining methodology and user interaction issues.
Answer:
Challenges to data mining regarding data mining methodology and user interaction issues include the following:
mining different kinds of knowledge in databases, interactive mining of knowledge at multiple levels of abstraction,
incorporation of background knowledge, data mining query languages and ad hoc data mining, presentation and
visualization of data mining results, handling noisy or incomplete data, and pattern evaluation. Below are the
descriptions of the first three challenges mentioned.
Mining different kinds of knowledge in databases: Different users
are interested in different kinds of knowledge and will require a wide range of data analysis and knowledge discovery
tasks such as data characterization, discrimination, association, classification, clustering, trend and deviation analysis,
and similarity analysis. Each of these tasks will use the same database in different ways and will require different data
mining techniques.
Interactive mining of knowledge at multiple levels of abstraction: Interactive mining, with
the use of OLAP operations on a data cube, allows users to focus the search for patterns, providing
and refining data mining requests based on returned results. The user can then interactively view the
data and discover patterns at multiple granularities and from different angles.
Incorporation of background knowledge: Background knowledge, or information regarding the
domain under study such as integrity constraints and deduction rules, may be used to guide the
discovery process and allow discovered patterns to be expressed in concise terms and at different levels
of abstraction. This helps to focus and speed up a data mining process or judge the interestingness of
discovered patterns.
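Viewing data at multiple levels of abstraction can be sketched as a roll-up over a small table. In this illustration the city-to-country mapping plays the role of background knowledge; treating a concept hierarchy as that background knowledge is an assumption made here, and the mapping, the sales tuples, and the aggregate function are all hypothetical.

```python
from collections import defaultdict

# Hypothetical concept hierarchy (background knowledge): city -> country.
CITY_TO_COUNTRY = {
    "Vancouver": "Canada",
    "Toronto": "Canada",
    "Chicago": "USA",
    "New York": "USA",
}

# Hypothetical sales tuples: (city, amount).
sales = [("Vancouver", 100), ("Toronto", 150), ("Chicago", 200),
         ("New York", 250), ("Vancouver", 50)]

def aggregate(records, key_fn):
    """Sum amounts after mapping each city to the chosen abstraction level."""
    totals = defaultdict(int)
    for city, amount in records:
        totals[key_fn(city)] += amount
    return dict(totals)

by_city = aggregate(sales, lambda city: city)                      # low level
by_country = aggregate(sales, lambda city: CITY_TO_COUNTRY[city])  # rolled up
```

The same records yield a fine-grained view per city and a coarser view per country, which is the kind of switch between granularities that interactive mining relies on.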
9. What are the major challenges of mining a huge amount of
data (such as billions of tuples) in comparison with mining a
small amount of data (such as a few hundred tuple data set)?
Answer:
One challenge to data mining regarding performance issues is the
efficiency and scalability of data mining algorithms. Data mining
algorithms must be efficient and scalable in order to effectively extract
information from large amounts of data in databases within predictable
and acceptable running times.
Another challenge is the parallel, distributed, and incremental processing
of data mining algorithms.
The need for parallel and distributed data mining algorithms has been
brought about by the huge size of many databases, the wide distribution of
data, and the computational complexity of some data mining methods.
Due to the high cost of some data mining processes, incremental data
mining algorithms incorporate database updates without the need to mine
the entire data again from scratch.
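The idea of incremental mining can be sketched with a running item counter that absorbs each new batch of transactions instead of rescanning the whole database. This is only an illustration under stated assumptions: the class IncrementalItemCounter, its methods, the sample transactions, and the 0.6 support threshold are hypothetical, and counting single items is far simpler than full frequent-pattern mining.

```python
from collections import Counter

class IncrementalItemCounter:
    """Keeps per-item support counts up to date as batches arrive."""

    def __init__(self):
        self.counts = Counter()
        self.n_transactions = 0

    def update(self, batch):
        """Fold a new batch into the counts; earlier data is not rescanned."""
        for transaction in batch:
            self.counts.update(set(transaction))  # count each item once
            self.n_transactions += 1

    def frequent(self, min_support):
        """Items whose relative support is at least min_support."""
        if self.n_transactions == 0:
            return set()
        return {item for item, c in self.counts.items()
                if c / self.n_transactions >= min_support}

# Hypothetical transaction batches arriving on two days.
day1 = [["milk", "bread"], ["milk"], ["bread", "eggs"]]
day2 = [["milk", "eggs"], ["milk", "bread"]]

miner = IncrementalItemCounter()
miner.update(day1)
miner.update(day2)          # only the new batch is processed
frequent_items = miner.frequent(0.6)
```

After the second update the counts match what a full rescan of both days would produce, but only the new transactions were read.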
10. Outline the major research challenges of data mining in one
specific application domain, such as stream/sensor data
analysis, spatiotemporal data analysis, or bioinformatics.
Answer:
Take spatiotemporal data analysis as an example. With the ever-increasing amount of data available
from sensor networks, web-based map services, location-sensing devices, etc., the rate at which
such data are generated far exceeds our ability to extract useful knowledge from them
to facilitate decision making and to better understand the changing environment. It is a great
challenge to use existing data mining techniques, and to create novel ones, that
effectively exploit the rich spatiotemporal relationships and patterns embedded in the datasets, because
both the temporal and spatial dimensions can add substantial complexity to data mining tasks.
First, the spatial and temporal relationships are information bearing and therefore need to be
considered in data mining.
Some spatial and temporal relationships are implicitly defined, and must be extracted from the data.
Such extraction introduces some degree of fuzziness and/or uncertainty that may have an impact on
the results of the data mining process. Second, working at the level of stored data is often
undesirable, and thus complex transformations are required to describe the units of analysis at
higher conceptual levels.
Third, interesting patterns are more likely to be discovered at the lowest resolution/granularity level,
but large support is more likely to exist at higher levels. Finally, how to express domain-independent
knowledge and how to integrate spatiotemporal reasoning mechanisms in data mining systems are
still open problems.
(c) We have presented a view that data mining is the result of the evolution of
database technology.
Do you think that data mining is also the result of the evolution of machine
learning research?
Can you present such views based on the historical progress of this discipline?
Do the same for the fields of statistics and pattern recognition.
(d) Describe the steps involved in data mining when viewed as a process of
knowledge discovery.