An Introduction To Data Mining

Data mining involves collecting, cleaning, analyzing, and gaining insights from data. There has been an explosion in the amount of data generated from sources like the web, financial transactions, user interactions, and sensors. This deluge of data presents both opportunities and challenges for extracting useful knowledge. Data mining addresses this by applying a multi-step process including data collection, cleaning, transformation, and then applying analytical methods to discover patterns. The type of data, from quantitative to text to graphs, also impacts the mining approaches used.

Uploaded by

borisdblejd
Copyright
© All Rights Reserved
An Introduction to Data Mining

"Education is not the piling on of learning, information, data, facts, skills,
or abilities (that's training or instruction) but is rather making visible
what is hidden as a seed."
Thomas More
1.1 Introduction
Data mining is the study of collecting, cleaning, processing, analyzing, and gaining useful
insights from data. A wide variation exists in terms of the problem domains, applications,
formulations, and data representations that are encountered in real applications. Therefore,
data mining is a broad umbrella term that is used to describe these different aspects of
data processing.
In the modern age, virtually all automated systems generate some form of data either
for diagnostic or analysis purposes. This has resulted in a deluge of data, which has been
reaching the order of petabytes or exabytes. Some examples of different kinds of data are
as follows:
World Wide Web: The number of documents on the indexed Web is now on the order
of billions, and the invisible Web is much larger. User accesses to such documents
create Web access logs at servers and customer behavior profiles at commercial sites.
Furthermore, the linked structure of the Web is referred to as the Web graph, which
is itself a kind of data. These different types of data are useful in various applications.
For example, the Web documents and link structure can be mined to determine associations
between different topics on the Web. On the other hand, user access logs can
be mined to determine frequent patterns of accesses or unusual patterns of possibly
unwarranted behavior.
Financial interactions: Most common transactions of everyday life, such as using an
automated teller machine (ATM) card or a credit card, can create data in an automated
way. Such transactions can be mined for many useful insights, such as fraud.
User interactions: Many forms of user interactions create large volumes of data. For
example, the use of a telephone typically creates a record at the telecommunication
company with details about the duration and destination of the call. Many phone
companies routinely analyze such data to determine relevant patterns of behavior
that can be used to make decisions about network capacity, promotions, pricing, or
customer targeting.
Sensor technologies and the Internet of Things: A recent trend is the development
of low-cost wearable sensors, smartphones, and other smart devices that can communicate
with one another. By one estimate, the number of such devices exceeded the
number of people on the planet in 2008 [30]. The implications of such massive data
collection are significant for mining algorithms.
The deluge of data is a direct result of advances in technology and the computerization of
every aspect of modern life. It is, therefore, natural to examine whether one can extract
concise and possibly actionable insights from the available data for application-specific goals.
This is where the task of data mining comes in. The raw data may be arbitrary, unstructured,
or even in a format that is not immediately suitable for automated processing. For example,
manually collected data may be drawn from heterogeneous sources in different formats and
yet somehow needs to be processed by an automated computer program to gain insights.
To address this issue, data mining analysts use a pipeline of processing, where the raw
data are collected, cleaned, and transformed into a standardized format. The data may be
stored in a commercial database system and finally processed for insights with the use of
analytical methods. In fact, while data mining often conjures up the notion of analytical
algorithms, the reality is that the vast majority of work is related to the data preparation
portion of the process. This pipeline of processing is conceptually similar to that of an actual
mining process from a mineral ore to the refined end product. The term "mining" derives
its roots from this analogy.
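The collect, clean, and transform stages described above can be sketched in a few lines of Python. This is only a minimal illustration under stated assumptions: the sample records, the field names (name, age, city), and the function names are hypothetical and do not come from the text, and a real pipeline would typically read from files or a database rather than an in-memory string.

```python
import csv
import io

# Hypothetical raw input: hand-collected records with inconsistent formatting.
RAW = """name,age,city
 Alice ,34,new york
BOB,,Boston
carol,29, chicago 
,41,Denver
"""

def collect(text):
    """Collection: read raw records from a CSV-formatted string."""
    return list(csv.DictReader(io.StringIO(text)))

def clean(records):
    """Cleaning: trim whitespace and drop records with any missing field."""
    cleaned = []
    for rec in records:
        rec = {key: (value or "").strip() for key, value in rec.items()}
        if all(rec.values()):
            cleaned.append(rec)
    return cleaned

def transform(records):
    """Transformation: standardize types and capitalization."""
    return [
        {"name": r["name"].title(), "age": int(r["age"]), "city": r["city"].title()}
        for r in records
    ]

def run_pipeline(text):
    """Chain the three stages; analytical methods would consume the result."""
    return transform(clean(collect(text)))

if __name__ == "__main__":
    for row in run_pipeline(RAW):
        print(row)
```

Note that the analytical step is absent from the sketch: all of the code is data preparation, which is where the passage says most of the work lies.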
From an analytical perspective, data mining is challenging because of the wide disparity
in the problems and data types that are encountered. For example, a commercial product
recommendation problem is very different from an intrusion-detection application, even at
the level of the input data format or the problem definition. Even within related classes
of problems, the differences are quite significant. For example, a product recommendation
problem in a multidimensional database is very different from a social recommendation
problem due to the differences in the underlying data type. Nevertheless, in spite of these
differences, data mining applications are often closely connected to one of four superproblems
in data mining: association pattern mining, clustering, classification, and outlier
detection. These problems are so important because they are used as building blocks in a
majority of the applications, in one indirect form or another. This is a useful abstraction
because it helps us conceptualize and structure the field of data mining more effectively.
The data may have different formats or types. The type may be quantitative (e.g., age),
categorical (e.g., ethnicity), text, spatial, temporal, or graph-oriented. Although the most
common form of data is multidimensional, an increasing proportion belongs to more complex
data types. While there is a conceptual portability of algorithms between many data types
at a very high level, this is not the case from a practical perspective. The reality is that
the precise data type may affect the behavior of a particular algorithm significantly. As a
result, one may need to design refined variations of the basic approach for multidimensional
data, so that it can be used effectively for a different data type. Therefore, this book will
dedicate different chapters to the various data types to provide a better understanding of
how the processing methods are affected by the underlying data type.
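As a small illustration of how the data type can change a basic operation, consider measuring how different two records are. A numeric difference is meaningful for a quantitative attribute such as age, but not for a categorical one, where only equality can be tested. The sketch below is one simple way to combine the two; the function name, the placeholder category labels, and the assumed range of the age attribute are all hypothetical choices made for illustration, not a method taken from the text.

```python
def mixed_distance(a, b, numeric_ranges):
    """Average per-attribute dissimilarity for records with mixed types.

    Attributes listed in numeric_ranges contribute |x - y| / range.
    All other attributes are treated as categorical and contribute
    0 if the values are equal and 1 otherwise.
    """
    total = 0.0
    for key in a:
        if key in numeric_ranges:
            total += abs(a[key] - b[key]) / numeric_ranges[key]
        else:
            total += 0.0 if a[key] == b[key] else 1.0
    return total / len(a)

# Hypothetical records with one quantitative and one categorical attribute.
p = {"age": 30, "ethnicity": "A"}
q = {"age": 40, "ethnicity": "B"}
r = {"age": 40, "ethnicity": "A"}
ranges = {"age": 50}  # assumed spread of the age attribute in the data

print(mixed_distance(p, q, ranges))  # differs in both attributes
print(mixed_distance(p, r, ranges))  # differs in age only
```

Text, spatial, temporal, and graph data would each need a different notion of dissimilarity again, which is why an algorithm built on a distance function cannot simply be carried over unchanged.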
A major challenge has been created in recent years due to increasing data volumes. The
prevalence of continuously collected data has led to an increasing interest in the field of data
streams. For example, Internet traffic generates large streams that cannot even be stored
effectively unless significant resources are spent on storage. This leads to unique challenges
from the perspective of processing and analysis. In cases where it is not possible to explicitly
store the data, all the processing needs to be performed in real time.
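To make the streaming constraint concrete, the following sketch maintains the mean and variance of a stream in a single pass with constant memory, so no element needs to be stored after it has been processed. It uses Welford's online update, which is a standard technique chosen here for illustration rather than one named in the text; the class name and the sample values are hypothetical.

```python
class RunningStats:
    """One-pass mean and population variance using Welford's update."""

    def __init__(self):
        self.count = 0
        self.mean = 0.0
        self._m2 = 0.0  # running sum of squared deviations from the mean

    def update(self, x):
        """Fold one new stream element into the summary, then discard it."""
        self.count += 1
        delta = x - self.mean
        self.mean += delta / self.count
        self._m2 += delta * (x - self.mean)

    @property
    def variance(self):
        """Population variance of the elements seen so far."""
        return self._m2 / self.count if self.count else 0.0


def stream():
    """Hypothetical stand-in for a source that yields values one at a time."""
    for value in [2, 4, 4, 4, 5, 5, 7, 9]:
        yield value


stats = RunningStats()
for value in stream():
    stats.update(value)

print(stats.count, stats.mean, stats.variance)
```

The summary occupies three numbers regardless of how long the stream runs, which is the kind of trade-off that stream processing forces: only statistics that can be updated incrementally are available.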
This chapter will provide a broad overview of the different technologies involved in preprocessing
and analyzing different types of data. The goal is to study data mining from the
perspective of different problem abstractions and data types that are frequently encountered.
Many important applications can be converted into these abstractions.
This chapter is organized as follows. Section 1.2 discusses the data mining process, with
particular attention paid to the data preprocessing phase. Different data
types and their formal definition are discussed in Sect. 1.3. The major problems in data
mining are discussed in Sect. 1.4 at a very high level. The impact of data type on problem
definitions is also addressed in this section. Scalability issues are addressed in Sect. 1.5. In
Sect. 1.6, a few examples of applications are provided. Section 1.7 gives a summary.
