Data Mining Notes

Uploaded by

shimi
Copyright
© All Rights Reserved

DATA MINING

Data mining refers to extracting or mining knowledge from large amounts of data. It would
more appropriately have been named knowledge mining, which emphasizes that the goal is
knowledge mined from large amounts of data.
It is the computational process of discovering patterns in large data sets involving methods at the
intersection of artificial intelligence, machine learning, statistics, and database systems.
The overall goal of the data mining process is to extract information from a data set and
transform it into an understandable structure for further use.
The key properties of data mining are
• Automatic discovery of patterns
• Prediction of likely outcomes
• Creation of actionable information
• Focus on large datasets and databases

The knowledge discovery process is shown in the figure as an iterative sequence of the
following steps:

• Data cleaning: To remove noise and inconsistent data
• Data integration: Where multiple data sources may be combined
• Data selection: Where data relevant to the analysis task are retrieved from the database
• Data transformation: Where data are transformed and consolidated into forms
appropriate for mining by performing summary or aggregation operations.
• Data mining: An essential process where intelligent methods are applied to extract data
patterns.
• Pattern evaluation: To identify the interesting patterns representing knowledge based on
interestingness measures.
• Knowledge presentation: Where visualization and knowledge representation techniques
are used to present mined knowledge to users.
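
The steps above can be sketched as a small pipeline. This is only an illustration in plain
Python under stated assumptions: the two data sources, the record fields (customer, item,
amount), and every function name are hypothetical, and the mining step is a toy ranking
standing in for a real mining method.

```python
from collections import defaultdict

# Two hypothetical data sources with noisy and inconsistent records.
store_a = [
    {"customer": "c1", "item": "bread", "amount": 2.5},
    {"customer": "c2", "item": "milk", "amount": None},   # missing value
    {"customer": "c1", "item": "milk", "amount": 1.2},
]
store_b = [
    {"customer": "c3", "item": "Bread ", "amount": 2.4},  # inconsistent spelling
    {"customer": "c3", "item": "butter", "amount": 3.0},
]


def clean(records):
    """Data cleaning: drop records with missing amounts, normalize item names."""
    return [
        {**r, "item": r["item"].strip().lower()}
        for r in records
        if r["amount"] is not None
    ]


def integrate(*sources):
    """Data integration: combine several sources into one list."""
    return [r for source in sources for r in source]


def select(records, fields):
    """Data selection: keep only the fields relevant to the task."""
    return [{f: r[f] for f in fields} for r in records]


def transform(records):
    """Data transformation: aggregate total spend per item."""
    totals = defaultdict(float)
    for r in records:
        totals[r["item"]] += r["amount"]
    return dict(totals)


def mine(totals):
    """Data mining (toy stand-in): rank items by total spend."""
    return sorted(totals.items(), key=lambda kv: kv[1], reverse=True)


def evaluate(patterns, min_total):
    """Pattern evaluation: keep patterns that reach an interestingness threshold."""
    return [(item, total) for item, total in patterns if total >= min_total]


def run_pipeline(min_total=3.0):
    data = integrate(clean(store_a), clean(store_b))
    data = select(data, ["item", "amount"])
    return evaluate(mine(transform(data)), min_total)


if __name__ == "__main__":
    # Knowledge presentation: print the surviving patterns.
    for item, total in run_pipeline():
        print(f"{item}: {total:.2f}")
```

In practice the process is iterative, so results from the evaluation step would typically
feed back into earlier steps such as selection or transformation.
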
What kinds of data can be mined?
The most basic forms of data for mining applications are database data, data warehouse data and
transactional data. Data mining can also be applied to other forms of data such as data streams,
ordered/sequence data, graph or networked data, spatial data, text data, multimedia data and the
WWW.

Database data
A database system, also called a database management system (DBMS), consists of a collection
of interrelated data known as a database and a set of software programs to manage and access the
data. The software programs provide mechanisms for defining database structures and data
storage, for specifying and managing concurrent, shared, or distributed data access and for
ensuring consistency and security of the information stored despite system crashes or attempts at
unauthorized access.
Relational databases are one of the most commonly available and richest information repositories
and thus they are a major data form in the study of data mining.

THE SCOPE OF DATA MINING


Data mining derives its name from the similarities between searching for valuable
business information in a large database (for example, finding linked products in gigabytes of
store scanner data) and mining a mountain for a vein of valuable ore.
Automated prediction of trends and behaviors. Data mining automates the process of finding
predictive information in large databases. Questions that traditionally required extensive hands-on
analysis can now be answered directly and quickly from the data. A typical example of a
predictive problem is targeted marketing. Data mining uses data on past promotional mailings to
identify the targets most likely to maximize return on investment in future mailings. Other
predictive problems include forecasting bankruptcy and other forms of default, and identifying
segments of a population likely to respond similarly to given events.
Automated discovery of previously unknown patterns. Data mining tools sweep through
databases and identify previously hidden patterns in one step. An example of pattern discovery is
the analysis of retail sales data to identify seemingly unrelated products that are often purchased
together. Other pattern discovery problems include detecting fraudulent credit card transactions
and identifying anomalous data that could represent data entry keying errors.

Tasks of Data Mining


Data mining involves six common classes of tasks:
• Anomaly detection (outlier/change/deviation detection) – The identification of
unusual data records that might be interesting, or of data errors that require further
investigation.
• Association rule learning (dependency modeling) – Searches for relationships between
variables. For example, a supermarket might gather data on customer purchasing habits.
Using association rule learning, the supermarket can determine which products are
frequently bought together and use this information for marketing purposes. This is
sometimes referred to as market basket analysis.
• Clustering – The task of discovering groups and structures in the data that are in some
way or another "similar", without using known structures in the data.
• Classification – The task of generalizing known structure to apply to new data. For
example, an e-mail program might attempt to classify an e-mail as "legitimate" or as
"spam".
• Regression – Attempts to find a function that models the data with the least error.
• Summarization – Providing a more compact representation of the data set, including
visualization and report generation.
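
As an illustration of the association rule task, the following sketch computes the two usual
rule measures, support and confidence, over a handful of market baskets. The transactions,
the threshold values, and the function names are assumptions made for this example; real
systems use more efficient algorithms than this exhaustive pair enumeration.

```python
from itertools import combinations

# Hypothetical market baskets; each set is one customer transaction.
transactions = [
    {"bread", "milk"},
    {"bread", "butter", "milk"},
    {"bread", "butter"},
    {"milk", "eggs"},
    {"bread", "butter", "eggs"},
]


def support(itemset, baskets):
    """Fraction of baskets that contain every item in `itemset`."""
    itemset = set(itemset)
    return sum(itemset <= basket for basket in baskets) / len(baskets)


def confidence(antecedent, consequent, baskets):
    """Estimate of P(consequent | antecedent): support(A and C) / support(A)."""
    both = set(antecedent) | set(consequent)
    return support(both, baskets) / support(antecedent, baskets)


def pair_rules(baskets, min_support=0.4, min_confidence=0.7):
    """Enumerate rules lhs -> rhs over item pairs that pass both thresholds."""
    items = sorted(set().union(*baskets))
    rules = []
    for a, b in combinations(items, 2):
        for lhs, rhs in ((a, b), (b, a)):
            s = support({lhs, rhs}, baskets)
            if s < min_support:
                continue  # the pair is too rare to be worth reporting
            c = confidence({lhs}, {rhs}, baskets)
            if c >= min_confidence:
                rules.append((lhs, rhs, s, c))
    return rules


if __name__ == "__main__":
    for lhs, rhs, s, c in pair_rules(transactions):
        print(f"{lhs} -> {rhs}  support={s:.2f}  confidence={c:.2f}")
```

With these baskets, only the rules between bread and butter pass both thresholds; the pair
bread and milk is frequent enough but its confidence is too low in either direction.
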
Architecture of Data Mining
A typical data mining system may have the following major components.

1. Knowledge Base:
This is the domain knowledge that is used to guide the search or evaluate the
interestingness of resulting patterns. Such knowledge can include concept hierarchies,
used to organize attributes or attribute values into different levels of abstraction.
Knowledge such as user beliefs, which can be used to assess a pattern’s interestingness
based on its unexpectedness, may also be included. Other examples of domain knowledge
are additional interestingness constraints or thresholds, and metadata (e.g., describing
data from multiple heterogeneous sources).
2. Data Mining Engine:

This is essential to the data mining system and ideally consists of a set of functional
modules for tasks such as characterization, association and correlation analysis,
classification, prediction, cluster analysis, outlier analysis, and evolution analysis.

3. Pattern Evaluation Module:

This component typically employs interestingness measures and interacts with the data
mining modules so as to focus the search toward interesting patterns. It may use
interestingness thresholds to filter out discovered patterns. Alternatively, the pattern
evaluation module may be integrated with the mining module, depending on the
implementation of the data mining method used. For efficient data mining, it is highly
recommended to push the evaluation of pattern interestingness as deep as possible into
the mining process so as to confine the search to only the interesting patterns.

4. User interface:

This module communicates between users and the data mining system, allowing the user to
interact with the system by specifying a data mining query or task, providing information to
help focus the search, and performing exploratory data mining based on the intermediate data
mining results. In addition, this component allows the user to browse database and data
warehouse schemas or data structures, evaluate mined patterns, and visualize the patterns in
different forms.
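
The recommendation made for the pattern evaluation module, to push the interestingness
test as deep as possible into the mining process, can be sketched as follows. The example is
a level-wise itemset search in the spirit of Apriori-style methods, not a full implementation
of any particular algorithm; the transactions, the minimum-support threshold, and the
function names are assumptions made for illustration.

```python
from itertools import combinations

# Hypothetical transactions, used only for illustration.
baskets = [
    frozenset({"bread", "milk"}),
    frozenset({"bread", "butter", "milk"}),
    frozenset({"bread", "butter"}),
    frozenset({"milk", "eggs"}),
    frozenset({"bread", "butter", "eggs"}),
]


def support(itemset, data):
    """Fraction of transactions containing every item of `itemset`."""
    return sum(itemset <= t for t in data) / len(data)


def frequent_itemsets(data, min_support):
    """Level-wise search with the threshold applied inside the loop.

    An itemset is only extended if it already meets `min_support`, so
    larger itemsets are built solely from itemsets judged interesting.
    """
    items = sorted(set().union(*data))
    level = [frozenset({i}) for i in items]
    found = {}
    while level:
        # Evaluate the current level and keep only the interesting itemsets.
        kept = {}
        for s in level:
            sup = support(s, data)
            if sup >= min_support:
                kept[s] = sup
        found.update(kept)
        # Build the next level only from itemsets that survived.
        size = len(level[0]) + 1
        level = sorted(
            {a | b for a, b in combinations(kept, 2) if len(a | b) == size},
            key=sorted,
        )
    return found


if __name__ == "__main__":
    for itemset, sup in frequent_itemsets(baskets, 0.4).items():
        print(sorted(itemset), f"{sup:.2f}")
```

The alternative, generating every possible itemset first and filtering afterwards, gives the
same answer on this data but examines many candidates that could never pass the threshold.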

Data Mining Process:


Data Mining is a process of discovering various models, summaries, and derived values from
a given collection of data.
The general experimental procedure adapted to data-mining problems involves the following
steps:

1. State the problem and formulate the hypothesis

Most data-based modeling studies are performed in a particular application domain.
Hence, domain-specific knowledge and experience are usually necessary in order to
come up with a meaningful problem statement. Unfortunately, many application studies
tend to focus on the data-mining technique at the expense of a clear problem statement.
In this step, a modeler usually specifies a set of variables for the unknown dependency
and, if possible, a general form of this dependency as an initial hypothesis. There may be
several hypotheses formulated for a single problem at this stage. The first step requires
the combined expertise of an application domain and a data-mining model. In practice, it
usually means a close interaction between the data-mining expert and the application
expert. In successful data-mining applications, this cooperation does not stop in the
initial phase; it continues during the entire data-mining process.

2. Collect the data

This step is concerned with how the data are generated and collected. In general, there
are two distinct possibilities. The first is when the data-generation process is under the
control of an expert (modeler): this approach is known as a designed experiment. The
second possibility is when the expert cannot influence the data-generation process: this is
known as the observational approach. An observational setting, namely, random data
generation, is assumed in most data-mining applications. Typically, the sampling
distribution is completely unknown after data are collected, or it is partially and
implicitly given in the data-collection procedure. It is very important, however, to
understand how data collection affects its theoretical distribution, since such a priori
knowledge can be very useful for modeling and, later, for the final interpretation of
results. Also, it is important to make sure that the data used for estimating a model and
the data used later for testing and applying a model come from the same, unknown,
sampling distribution. If this is not the case, the estimated model cannot be successfully
used in a final application of the results.

3. Preprocessing the data

In the observational setting, data are usually "collected" from the existing databases, data
warehouses, and data marts. Data preprocessing usually includes at least two common
tasks:

1. Outlier detection (and removal) – Outliers are unusual data values that are not
consistent with most observations. Commonly, outliers result from measurement
errors, coding and recording errors, and, sometimes, are natural, abnormal values.
Such nonrepresentative samples can seriously affect the model produced later. There
are two strategies for dealing with outliers:

a. Detect and eventually remove outliers as a part of the preprocessing phase, or
b. Develop robust modeling methods that are insensitive to outliers.

2. Scaling, encoding, and selecting features – Data preprocessing includes several
steps such as variable scaling and different types of encoding. For example, one
feature with the range [0, 1] and the other with the range [−100, 1000] will not have
the same weights in the applied technique; they will also influence the final data-
mining results differently. Therefore, it is recommended to scale them and bring both
features to the same weight for further analysis. Also, application-specific encoding
methods usually achieve dimensionality reduction by providing a smaller number of
informative features for subsequent data modeling. These two classes of
preprocessing tasks are only illustrative examples of a large spectrum of
preprocessing activities in a data-mining process. Data-preprocessing steps should
not be considered completely independent from other data-mining phases. In every
iteration of the data-mining process, all activities, together, could define new and
improved data sets for subsequent iterations. Generally, a good preprocessing
method provides an optimal representation for a data-mining technique by
incorporating a priori knowledge in the form of application-specific scaling and
encoding.
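
The two preprocessing tasks described in this step can be sketched in a few lines of Python.
The choices below are assumptions made for illustration, not methods prescribed by these
notes: outliers are flagged with a modified z-score based on the median absolute deviation
(with the commonly used constant 0.6745 and cutoff 3.5), and features are brought to a
common range with min-max scaling. The function names and sample values are hypothetical.

```python
from statistics import median


def mad_outliers(values, threshold=3.5):
    """Flag values whose modified z-score exceeds `threshold`.

    The score uses the median and the median absolute deviation (MAD),
    which are themselves little affected by the outliers being sought.
    """
    med = median(values)
    mad = median([abs(v - med) for v in values])
    if mad == 0:
        return []  # no spread to measure against, so flag nothing
    return [v for v in values if 0.6745 * abs(v - med) / mad > threshold]


def min_max_scale(values):
    """Rescale values linearly so the smallest maps to 0 and the largest to 1."""
    lo, hi = min(values), max(values)
    if hi == lo:
        return [0.0 for _ in values]  # constant feature carries no information
    return [(v - lo) / (hi - lo) for v in values]


if __name__ == "__main__":
    readings = [10.0, 12.0, 11.0, 13.0, 12.0, 95.0]
    print("outliers:", mad_outliers(readings))

    # A feature spanning [-100, 1000] is brought to the same [0, 1] range
    # as a feature that already lies in [0, 1].
    wide_feature = [-100, 450, 1000]
    print("scaled:", min_max_scale(wide_feature))
```

Detecting and removing outliers corresponds to strategy (a) above; strategy (b), a robust
modeling method, would instead leave the data unchanged and limit the influence of such values
during model estimation.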

4. Estimate the model


The selection and implementation of the appropriate data-mining technique is the main task in
this phase. This process is not straightforward; usually, in practice, the implementation is based
on several models, and selecting the best one is an additional task. The basic principles of
learning and discovery from data are given in Chapter 4 of this book. Later, Chapters 5 through
13 explain and analyze specific techniques that are applied to perform a successful learning
process from data and to develop an appropriate model.
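
The idea of estimating several models and then selecting the best one can be sketched as
follows. This is a minimal example under stated assumptions: the two candidate models (a
constant mean and a least-squares line), the use of mean squared error on held-out data as
the selection criterion, and the sample data are all chosen for illustration only.

```python
def fit_mean(xs, ys):
    """Candidate 1: always predict the training mean."""
    mean_y = sum(ys) / len(ys)
    return lambda x: mean_y


def fit_line(xs, ys):
    """Candidate 2: ordinary least-squares line y = intercept + slope * x."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return lambda x: intercept + slope * x


def mse(model, xs, ys):
    """Mean squared error of `model` on the given data."""
    return sum((model(x) - y) ** 2 for x, y in zip(xs, ys)) / len(xs)


def select_model(candidates, train, valid):
    """Fit every candidate on `train`, score it on `valid`, return the best."""
    scores = {name: mse(fit(*train), *valid) for name, fit in candidates.items()}
    best = min(scores, key=scores.get)
    return best, scores


# Hypothetical data, split into an estimation set and a held-out set.
train = ([0, 1, 2, 3, 4], [1.0, 3.1, 4.9, 7.2, 9.0])
valid = ([5, 6], [11.1, 12.9])
candidates = {"mean": fit_mean, "line": fit_line}

if __name__ == "__main__":
    best, scores = select_model(candidates, train, valid)
    print("selected:", best)
    for name, score in scores.items():
        print(f"{name}: validation MSE = {score:.4f}")
```

Scoring on held-out data rather than on the estimation data reflects the requirement from
step 2 that the data used for estimating and for testing a model come from the same sampling
distribution.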

5. Interpret the model and draw conclusions


In most cases, data-mining models should help in decision making. Hence, such models need to
be interpretable in order to be useful because humans are not likely to base their decisions on
complex "black-box" models. Note that the goals of accuracy of the model and accuracy of its
interpretation are somewhat contradictory. Usually, simple models are more interpretable, but
they are also less accurate. Modern data-mining methods are expected to yield highly accurate
results using high-dimensional models. The problem of interpreting these models, also very
important, is considered a separate task, with specific techniques to validate the results. A
user does not want hundreds of pages of numeric results. Such a user cannot understand,
summarize, interpret, or use them for successful decision making.

DEPT OF CSE & IT VSSUT, Burla

Classification of Data mining Systems:

Data mining draws on several disciplines, and a data mining system can be classified
according to the disciplines involved:
• Database Technology
• Statistics
• Machine Learning
• Information Science
• Visualization and Other Disciplines
Some Other Classification Criteria:
1. Classification according to kind of databases mined
2. Classification according to kind of knowledge mined
3. Classification according to kinds of techniques utilized
4. Classification according to applications adapted
1. Classification according to kind of databases mined

We can classify a data mining system according to the kind of databases mined. Database
systems can be classified according to different criteria, such as data models and types of
data, and the data mining system can be classified accordingly. For example, if we classify
databases according to data model, we may have a relational, transactional,
object-relational, or data warehouse mining system.

2. Classification according to kind of knowledge mined

We can classify a data mining system according to the kind of knowledge mined. That is,
data mining systems are classified on the basis of functionalities such as:
• Characterization
• Discrimination
• Association and Correlation Analysis
• Classification
• Prediction
• Clustering
• Outlier Analysis
• Evolution Analysis
3. Classification according to kinds of techniques utilized
We can classify a data mining system according to the kinds of techniques used. These
techniques can be described according to the degree of user interaction involved or the
methods of analysis employed.

4. Classification according to applications adapted


We can classify a data mining system according to the applications it is adapted to. These
applications are as follows:
• Finance
• Telecommunications
• DNA
• Stock Markets
• E-mail
