Chapter 1: Data Mining

Data mining is the process of extracting valuable patterns and knowledge from large datasets, integrating techniques from various disciplines such as databases, statistics, and machine learning. It involves steps like data cleaning, integration, selection, transformation, mining, evaluation, and presentation. The document also discusses the differences between data warehouses and databases, various data mining functionalities, and challenges faced in data mining methodologies.

Uploaded by

lindsay.yareth

DATA MINING

Dr. Afsaneh Javadi


1. What is data mining?
Answer:
Data mining refers to the process or method that extracts or “mines” interesting knowledge or patterns from large amounts of data.
(a) Is data mining another hype?
Answer:
Data mining is not another hype. Instead, the need for data mining has arisen due to the wide availability of huge amounts of data and the imminent need for turning such data into useful information and knowledge. Thus, data mining can be viewed as the result of the natural evolution of information technology.
(b) Is it a simple transformation or application of technology developed from databases, statistics, machine learning, and pattern recognition?
Answer:
No. Data mining is more than a simple transformation of technology developed from databases, statistics, and machine learning. Instead, data mining involves an integration, rather than a simple transformation, of techniques from multiple disciplines such as database technology, statistics, machine learning, high-performance computing, pattern recognition, neural networks, data visualization, information retrieval, image and signal processing, and spatial data analysis.


(c) Discuss how the evolution of database technology led to data mining.
History of Data Mining
Watch: https://youtu.be/gq_T7EgQXkI
Study: https://matthewrhoads.com/2017/10/14/blog-post-title-2/
Database technology began with the development of data collection and database creation mechanisms, which led to the development of effective mechanisms for data management, including data storage and retrieval, and query and transaction processing. The large number of database systems offering query and transaction processing eventually and naturally led to the need for data analysis and understanding. Hence, data mining began its development out of this necessity.
(d) Describe the steps involved in data mining when viewed as a process of knowledge discovery.
The steps involved in data mining when viewed as a process of knowledge discovery are as follows:
• Data cleaning, a process that removes or transforms noise and inconsistent data
• Data integration, where multiple data sources may be combined
• Data selection, where data relevant to the analysis task are retrieved from the database
• Data transformation, where data are transformed or consolidated into forms appropriate for mining
• Data mining, an essential process where intelligent and efficient methods are applied in order to extract patterns
• Pattern evaluation, a process that identifies the truly interesting patterns representing knowledge based on some interestingness measures
• Knowledge presentation, where visualization and knowledge representation techniques are used to present the mined knowledge to the user
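As a rough illustration of how the knowledge discovery steps chain together, the sketch below runs a toy version of each stage on two small in-memory sources. Everything in it is an assumption made for illustration: the record layout, the field names (`id`, `major`, `gpa`), the GPA threshold, and the function names are hypothetical, and a real system would read from databases or a warehouse rather than Python lists.

```python
from collections import Counter

# Two hypothetical sources describing the same students (invented data).
registrar = [
    {"id": 1, "major": "CS"},
    {"id": 2, "major": "cs "},   # inconsistent spelling
    {"id": 3, "major": "Math"},
    {"id": 4, "major": None},    # missing value
]
grades = {1: 3.8, 2: 3.6, 3: 2.9, 4: 3.1}


def clean(records):
    """Data cleaning: drop records with a missing major, normalize spelling."""
    return [{**r, "major": r["major"].strip().upper()}
            for r in records if r["major"] is not None]


def integrate(records, gpa_by_id):
    """Data integration: combine the two sources on the shared id."""
    return [{**r, "gpa": gpa_by_id[r["id"]]}
            for r in records if r["id"] in gpa_by_id]


def select(records):
    """Data selection: keep only the attributes relevant to the task."""
    return [(r["major"], r["gpa"]) for r in records]


def transform(rows, threshold=3.5):
    """Data transformation: discretize GPA into 'high' or 'low'."""
    return [(major, "high" if gpa >= threshold else "low")
            for major, gpa in rows]


def mine(rows):
    """Data mining: count how often each (major, GPA level) pair occurs."""
    return Counter(rows)


def evaluate(patterns, min_count=2):
    """Pattern evaluation: keep patterns that meet a minimum count."""
    return {p: c for p, c in patterns.items() if c >= min_count}


def present(patterns):
    """Knowledge presentation: render patterns as readable lines."""
    return [f"{major} students with {level} GPA: {count}"
            for (major, level), count in sorted(patterns.items())]


report = present(evaluate(mine(transform(select(integrate(clean(registrar), grades))))))
print("\n".join(report))
```

Each function stands in for one bullet in the list; in practice the steps are often iterated rather than run once in a straight line.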
2. How is a data warehouse different from a database? How are they similar?
Answer:
Differences between a data warehouse and a database:
A data warehouse is a repository of information collected from multiple sources, over a history of time, stored under a unified schema, and used for data analysis and decision support, whereas a database is a collection of interrelated data that represents the current status of the stored data. There could be multiple heterogeneous databases where the schema of one database may not agree with the schema of another. A database system supports ad hoc queries and online transaction processing. For more details, please refer to the section “Differences between operational database systems and data warehouses.”

Similarities between a data warehouse and a database:
Both are repositories of information, storing huge amounts of persistent data.
3. Define each of the following data mining functionalities: characterization, discrimination, association and correlation analysis, classification, regression, clustering, and outlier analysis. Give examples of each data mining functionality, using a real-life database that you are familiar with.
Answer:
Characterization is a summarization of the general characteristics or features of a target class of data. For example, the characteristics of students can be summarized to produce a profile of all first-year university computing science students, which may include such information as a high GPA and a large number of courses taken.
Answer:
Discrimination is a comparison of the general features of target class data objects with the general features of objects from one or a set of contrasting classes. For example, the general features of students with high GPAs may be compared with the general features of students with low GPAs. The resulting description could be a general comparative profile of the students, such as: 75% of the students with high GPAs are fourth-year computing science students, while 65% of the students with low GPAs are not.
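A comparative profile of this kind can be computed by measuring how common a feature is in the target class and in the contrasting class. The sketch below is only an illustration: the student tuples, the GPA cut-off of 3.5, and the helper names are invented, not taken from any real database.

```python
# Hypothetical student records: (gpa, year, major). All values are invented.
students = [
    (3.9, 4, "CS"), (3.7, 4, "CS"), (3.8, 3, "CS"), (3.6, 4, "Math"),
    (2.1, 1, "CS"), (2.4, 2, "Math"), (2.0, 4, "CS"), (2.3, 1, "Math"),
]

HIGH_GPA = 3.5  # assumed cut-off separating the two classes


def share(records, predicate):
    """Fraction of records that satisfy predicate (0.0 for an empty list)."""
    if not records:
        return 0.0
    return sum(1 for r in records if predicate(r)) / len(records)


def is_fourth_year_cs(record):
    _, year, major = record
    return year == 4 and major == "CS"


# Target class: high-GPA students. Contrasting class: low-GPA students.
target = [s for s in students if s[0] >= HIGH_GPA]
contrast = [s for s in students if s[0] < HIGH_GPA]

target_share = share(target, is_fourth_year_cs)
contrast_share = share(contrast, is_fourth_year_cs)

print(f"fourth-year CS among high-GPA students: {target_share:.0%}")
print(f"fourth-year CS among low-GPA students:  {contrast_share:.0%}")
```

Computing `share` over the target class alone corresponds to characterization; comparing it against the contrasting class is what makes it discrimination.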
Answer:
Association is the discovery of association rules showing attribute-value conditions that occur
frequently together in a given set of data. For example, a data mining system may find association
rules like
major(X, “computing science”) ⇒ owns(X, “personal computer”)
[support = 12%, confidence = 98%]
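In the rule above, support is the fraction of all students who both major in computing science and own a personal computer, and confidence is the fraction of computing science majors who own one. The following sketch computes both measures for an arbitrary rule; the student records and the function name `support_and_confidence` are hypothetical and serve only to show the arithmetic.

```python
def support_and_confidence(records, antecedent, consequent):
    """Return (support, confidence) for the rule antecedent => consequent.

    support    = fraction of all records where both sides hold
    confidence = fraction of records satisfying the antecedent
                 that also satisfy the consequent
    """
    n = len(records)
    both = sum(1 for r in records if antecedent(r) and consequent(r))
    ante = sum(1 for r in records if antecedent(r))
    support = both / n if n else 0.0
    confidence = both / ante if ante else 0.0
    return support, confidence


# Invented records for illustration only.
students = [
    {"major": "computing science", "owns_pc": True},
    {"major": "computing science", "owns_pc": True},
    {"major": "computing science", "owns_pc": False},
    {"major": "biology", "owns_pc": True},
    {"major": "history", "owns_pc": False},
]

support, confidence = support_and_confidence(
    students,
    antecedent=lambda s: s["major"] == "computing science",
    consequent=lambda s: s["owns_pc"],
)
print(f"support = {support:.0%}, confidence = {confidence:.0%}")
```

On this toy data, two of five students satisfy both sides and two of three computing science majors own a computer, so the rule has lower confidence than the 98% quoted in the example.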
4. Present an example where data mining is crucial to the success of a
business. What data mining functionalities does this business need (e.g.,
think of the kinds of patterns that could be mined)? Can such patterns be
generated alternatively by data query processing or simple statistical
analysis?
Answer:
A department store, for example, can use data mining to assist with its target marketing
mail campaign.
Using data mining functions such as association, the store can use the mined strong
association rules to determine which products bought by one group of customers are
likely to lead to the buying of certain other products. With this information, the store can
then mail marketing materials only to those kinds of customers who exhibit a high
likelihood of purchasing additional products. Data query processing is used for data or
information retrieval and does not have the means for finding association rules.
Similarly, simple statistical analysis cannot handle large amounts of data such as those of
customer records in a department store.
5. What is the difference between discrimination and classification?
Between characterization and clustering?
Between classification and regression? For each of these pairs of
tasks, how are they similar?
Answer:

Discrimination differs from classification in that the former refers to a comparison of the general features of target class data objects with the general features of objects from one or a set of contrasting classes, while the latter is the process of finding a set of models (or functions) that describe and distinguish data classes or concepts for the purpose of being able to use the model to predict the class of objects whose class label is unknown. Discrimination and classification are similar in that they both deal with the analysis of class data objects.
Answer:

Characterization differs from clustering in that the former refers to a summarization of the
general characteristics or features of a target class of data while the latter deals with the
analysis of data objects without consulting a known class label. This pair of tasks is similar
in that they both deal with grouping together objects or data that are related or have high
similarity in comparison to one another.

Classification differs from regression in that the former predicts categorical (discrete,
unordered) labels while the latter predicts missing or unavailable, and often numerical,
data values. This pair of tasks is similar in that they both are tools for prediction.
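To make the contrast between classification and regression concrete, the sketch below predicts a categorical label with a nearest-neighbor rule and a numeric value with a least-squares line. Both techniques are chosen here purely for illustration, and the data (hours studied against pass/fail and against exam score) are invented.

```python
def nearest_neighbor_label(train, x):
    """Classification: return the categorical label of the closest training point."""
    return min(train, key=lambda pair: abs(pair[0] - x))[1]


def fit_line(points):
    """Regression: least-squares slope and intercept for (x, y) pairs."""
    n = len(points)
    mean_x = sum(x for x, _ in points) / n
    mean_y = sum(y for _, y in points) / n
    sxx = sum((x - mean_x) ** 2 for x, _ in points)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in points)
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


# Invented training data: hours studied -> outcome.
labelled = [(1, "fail"), (2, "fail"), (6, "pass"), (8, "pass")]
scored = [(1, 50.0), (2, 55.0), (6, 75.0), (8, 85.0)]

predicted_label = nearest_neighbor_label(labelled, 5)  # a discrete class
slope, intercept = fit_line(scored)
predicted_score = slope * 5 + intercept                # a numeric value

print(predicted_label, round(predicted_score, 1))
```

Both functions answer a prediction question about the same input; only the type of the predicted quantity differs.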
6. Based on your observation, describe another possible kind of knowledge
that needs to be discovered by data mining methods but has not been listed
in this chapter. Does it require a mining methodology that is quite different
from those outlined in this chapter?
Answer:
There is no standard answer for this question, and one can judge the quality of an answer based on the freshness and quality of the proposal. For example, one may propose partial periodicity as a new kind of knowledge, where a pattern is partially periodic if only some offsets of a certain time period in a time series demonstrate some repeating behavior.
7. Outliers are often discarded as noise. However, one person’s
garbage could be another’s treasure. For example, exceptions in
credit card transactions can help us detect the fraudulent use of
credit cards.
Using fraudulence detection as an example, propose two methods
that can be used to detect outliers and discuss which one is more
reliable.
Answer:
There are many outlier detection methods. More details can be found in
Chapter 12. Here we propose two methods for fraudulence detection:
a) Statistical methods (also known as model-based methods): Assume that
the normal transaction data follow some statistical (stochastic) model, then
data not following the model are outliers.
b) Clustering-based methods: Assume that the normal data objects belong
to large and dense clusters, whereas outliers belong to small or sparse
clusters, or do not belong to any clusters.
It is hard to say which one is more reliable. The effectiveness of statistical
methods highly depends on whether the assumptions made for the
statistical model hold true for the given data. And the effectiveness of
clustering methods highly depends on which clustering method we choose.
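The two approaches can be compared on a small example. The sketch below assumes a normal-distribution model for the statistical method (a z-score test) and uses a deliberately simple one-dimensional gap-based grouping as the clustering method. The transaction amounts, both thresholds, and the function names are assumptions made for illustration; neither function is a production fraud detector.

```python
from statistics import mean, stdev

# Invented card transaction amounts; the last one is the planted exception.
amounts = [20, 22, 25, 23, 21, 24, 26, 22, 23, 950]


def statistical_outliers(values, z_threshold=2.5):
    """Statistical method: flag values far from the mean in standard deviations.

    Assumes the normal data roughly follow a normal distribution.
    """
    mu, sigma = mean(values), stdev(values)
    if sigma == 0:
        return []
    return [v for v in values if abs(v - mu) / sigma > z_threshold]


def cluster_outliers(values, max_gap=5, min_cluster_size=3):
    """Clustering-based method (toy 1-D version).

    Sorted neighbors closer than max_gap share a cluster; members of
    clusters smaller than min_cluster_size are reported as outliers.
    """
    if not values:
        return []
    ordered = sorted(values)
    clusters = [[ordered[0]]]
    for v in ordered[1:]:
        if v - clusters[-1][-1] <= max_gap:
            clusters[-1].append(v)
        else:
            clusters.append([v])
    return [v for c in clusters if len(c) < min_cluster_size for v in c]


print(statistical_outliers(amounts))
print(cluster_outliers(amounts))
```

Both methods flag the 950 transaction here, but each depends on its own assumption: the z-score test on the choice of model and threshold (an extreme value also inflates the standard deviation it is measured against), and the clustering test on `max_gap` and `min_cluster_size`.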
8. Describe three challenges to data mining regarding
data mining methodology and user interaction issues
Answer:
Challenges to data mining regarding data mining methodology and user interaction issues include the following: mining different kinds of knowledge in databases, interactive mining of knowledge at multiple levels of abstraction, incorporation of background knowledge, data mining query languages and ad hoc data mining, presentation and visualization of data mining results, handling noisy or incomplete data, and pattern evaluation. Below are the descriptions of the first three challenges mentioned:
Mining different kinds of knowledge in databases: Different users are interested in different kinds of knowledge and will require a wide range of data analysis and knowledge discovery tasks such as data characterization, discrimination, association, classification, clustering, trend and deviation analysis, and similarity analysis. Each of these tasks will use the same database in different ways and will require different data mining techniques.
Interactive mining of knowledge at multiple levels of abstraction: Interactive mining, with
the use of OLAP operations on a data cube, allows users to focus the search for patterns, providing
and refining data mining requests based on returned results. The user can then interactively view the
data and discover patterns at multiple granularities and from different angles.
Incorporation of background knowledge: Background knowledge, or information regarding the
domain under study such as integrity constraints and deduction rules, may be used to guide the
discovery process and allow discovered patterns to be expressed in concise terms and at different levels
of abstraction. This helps to focus and speed up a data mining process or judge the interestingness of
discovered patterns.
9. What are the major challenges of mining a huge amount of data (such as billions of tuples) in comparison with mining a small amount of data (such as a data set of a few hundred tuples)?
Answer:
One challenge to data mining regarding performance issues is the
efficiency and scalability of data mining algorithms. Data mining
algorithms must be efficient and scalable in order to effectively extract
information from large amounts of data in databases within predictable
and acceptable running times.
Another challenge is the parallel, distributed, and incremental processing
of data mining algorithms.
The need for parallel and distributed data mining algorithms has been
brought about by the huge size of many databases, the wide distribution of
data, and the computational complexity of some data mining methods.
Due to the high cost of some data mining processes, incremental data
mining algorithms incorporate database updates without the need to mine
the entire data again from scratch.
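The idea of incremental mining can be sketched with running item counts that are updated as new transactions arrive, so earlier data never has to be rescanned. The class name `IncrementalItemCounter`, its methods, and the grocery transactions below are hypothetical; real incremental algorithms for frequent patterns are considerably more involved.

```python
from collections import Counter


class IncrementalItemCounter:
    """Maintains per-item counts so new transactions update them in place."""

    def __init__(self):
        self.counts = Counter()
        self.n_transactions = 0

    def update(self, transactions):
        """Fold a batch of new transactions into the running counts."""
        for items in transactions:
            self.counts.update(set(items))  # count each item once per transaction
            self.n_transactions += 1

    def frequent_items(self, min_support):
        """Items whose relative support is at least min_support."""
        if self.n_transactions == 0:
            return set()
        return {item for item, count in self.counts.items()
                if count / self.n_transactions >= min_support}


counter = IncrementalItemCounter()

# Initial load of the database.
counter.update([["milk", "bread"], ["milk"], ["bread", "eggs"]])
first = counter.frequent_items(min_support=0.5)

# A later database update: only the new transactions are processed.
counter.update([["eggs"], ["eggs", "milk"]])
second = counter.frequent_items(min_support=0.5)

print(sorted(first), sorted(second))
```

The second call reflects all five transactions even though only two were read during the update, which is the saving that incremental algorithms aim for.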
10. Outline the major research challenges of data mining in one
specific application domain, such as stream/sensor data
analysis, spatiotemporal data analysis, or bioinformatics.
Answer:
Take spatiotemporal data analysis as an example. With the ever-increasing amount of data available from sensor networks, web-based map services, location-sensing devices, etc., the rate at which such data are generated far exceeds our ability to extract useful knowledge from them to facilitate decision making and to better understand the changing environment. It is a great challenge to use existing data mining techniques, and to create novel ones, to effectively exploit the rich spatiotemporal relationships and patterns embedded in the datasets, because both the temporal and spatial dimensions can add substantial complexity to data mining tasks.
First, the spatial and temporal relationships are information bearing and therefore need to be considered in data mining. Some spatial and temporal relationships are implicitly defined and must be extracted from the data. Such extraction introduces some degree of fuzziness and/or uncertainty that may have an impact on the results of the data mining process.
Second, working at the level of stored data is often undesirable, and thus complex transformations are required to describe the units of analysis at higher conceptual levels.
Third, interesting patterns are more likely to be discovered at the lowest resolution/granularity level, but large support is more likely to exist at higher levels.
Finally, how to express domain-independent knowledge and how to integrate spatiotemporal reasoning mechanisms in data mining systems are still open problems.
(c) We have presented a view that data mining is the result of the evolution of database technology. Do you think that data mining is also the result of the evolution of machine learning research? Can you present such views based on the historical progress of this discipline? Do the same for the fields of statistics and pattern recognition.
