Data Mining Notes
Data mining refers to extracting or "mining" knowledge from large amounts of data. A more
appropriate name would have been knowledge mining, which emphasizes that knowledge is
what is mined from large amounts of data.
It is the computational process of discovering patterns in large data sets involving methods at the
intersection of artificial intelligence, machine learning, statistics, and database systems.
The overall goal of the data mining process is to extract information from a data set and
transform it into an understandable structure for further use.
The key properties of data mining are
• Automatic discovery of patterns
• Prediction of likely outcomes
• Creation of actionable information
• Focus on large datasets and databases
The knowledge discovery process is shown in Figure as an iterative sequence of the following
steps:
Database data
A database system, also called a database management system (DBMS), consists of a collection
of interrelated data known as a database and a set of software programs to manage and access the
data. The software programs provide mechanisms for defining database structures and data
storage, for specifying and managing concurrent, shared, or distributed data access and for
ensuring consistency and security of the information stored despite system crashes or attempts at
unauthorized access.
Relational databases are one of the most commonly available and richest information repositories
and thus they are a major data form in the study of data mining.
1. Knowledge Base:
This is the domain knowledge that is used to guide the search or evaluate the
interestingness of resulting patterns. Such knowledge can include concept hierarchies,
used to organize attributes or attribute values into different levels of abstraction.
Knowledge such as user beliefs, which can be used to assess a pattern’s interestingness
based on its unexpectedness, may also be included. Other examples of domain knowledge
are additional interestingness constraints or thresholds, and metadata (e.g., describing
data from multiple heterogeneous sources).
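As an illustration of how a concept hierarchy organizes attribute values into levels of abstraction, here is a minimal Python sketch. The "location" attribute, its values, and the function names generalize and roll_up are hypothetical and chosen only for this example; they do not come from any particular data mining system.

```python
# Hypothetical concept hierarchy for a "location" attribute:
# city -> country -> continent. All values are illustrative.
PARENT = {
    "Chicago": "USA",
    "New York": "USA",
    "Toronto": "Canada",
    "USA": "North America",
    "Canada": "North America",
}


def generalize(value, levels=1):
    """Climb `levels` steps up the hierarchy, stopping early at the top."""
    for _ in range(levels):
        if value not in PARENT:
            break
        value = PARENT[value]
    return value


def roll_up(values, levels=1):
    """Count attribute values after generalizing each one."""
    counts = {}
    for v in values:
        g = generalize(v, levels)
        counts[g] = counts.get(g, 0) + 1
    return counts


if __name__ == "__main__":
    cities = ["Chicago", "New York", "Toronto"]
    print(roll_up(cities, levels=1))  # counts per country
    print(roll_up(cities, levels=2))  # counts per continent
```

Storing only child-to-parent links keeps the sketch short; a real system might also keep the level names (city, country, continent) as metadata.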
2. Data Mining Engine:
This is essential to the data mining system and ideally consists of a set of functional
modules for tasks such as characterization, association and correlation analysis,
classification, prediction, cluster analysis, outlier analysis, and evolution analysis.
3. Pattern Evaluation Module:
This component typically employs interestingness measures and interacts with the data
mining modules so as to focus the search toward interesting patterns. It may use
interestingness thresholds to filter out discovered patterns. Alternatively, the pattern
evaluation module may be integrated with the mining module, depending on the
implementation of the data mining method used. For efficient data mining, it is highly
recommended to push the evaluation of pattern interestingness as deep as possible into
the mining process so as to confine the search to only the interesting patterns.
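The idea of pushing interestingness evaluation into the mining process can be sketched as follows. This is a minimal example, assuming that the interestingness measure is the support of an itemset (the fraction of transactions containing it) and that the threshold is a minimum support; neither choice is prescribed by the notes. Because the support of an itemset can never exceed the support of its subsets, itemsets that fall below the threshold are discarded before they are extended, in the style of the Apriori algorithm. The function name frequent_itemsets and the sample transactions are hypothetical.

```python
from itertools import combinations


def frequent_itemsets(transactions, min_support):
    """Level-wise search that applies the support threshold inside the
    search loop, so uninteresting itemsets are never extended."""
    n = len(transactions)
    sets = [frozenset(t) for t in transactions]

    def support(itemset):
        # Fraction of transactions that contain every item in `itemset`.
        return sum(1 for t in sets if itemset <= t) / n

    items = sorted({i for t in sets for i in t})
    current = [frozenset([i]) for i in items]
    result = {}
    while current:
        # Evaluate interestingness here, during mining, not afterwards.
        kept = {}
        for candidate in current:
            s = support(candidate)
            if s >= min_support:
                kept[candidate] = s
        result.update(kept)
        # Build the next level only from itemsets that survived.
        size = len(next(iter(kept))) + 1 if kept else 0
        current = list(
            {a | b for a, b in combinations(kept, 2) if len(a | b) == size}
        )
    return result


if __name__ == "__main__":
    data = [
        ["bread", "milk"],
        ["bread", "butter"],
        ["bread", "milk", "butter"],
        ["milk"],
    ]
    for itemset, s in frequent_itemsets(data, min_support=0.5).items():
        print(sorted(itemset), s)
```

The alternative design, mining all itemsets first and filtering them afterwards, gives the same result but examines many more candidates.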
4. User interface:
This module communicates between users and the data mining system, allowing the user to
interact with the system by specifying a data mining query or task, providing information to
help focus the search, and performing exploratory data mining based on the intermediate data
mining results. In addition, this component allows the user to browse database and data
warehouse schemas or data structures, evaluate mined patterns, and visualize the patterns in
different forms.
This step is concerned with how the data are generated and collected. In general, there
are two distinct possibilities. The first is when the data-generation process is under the
control of an expert (modeler): this approach is known as a designed experiment. The
second possibility is when the expert cannot influence the data-generation process: this is
known as the observational approach. An observational setting, namely, random data
generation, is assumed in most data-mining applications. Typically, the sampling
distribution is completely unknown after data are collected, or it is partially and
implicitly given in the data-collection procedure. It is very important, however, to
understand how data collection affects its theoretical distribution, since such a priori
knowledge can be very useful for modeling and, later, for the final interpretation of
results. Also, it is important to make sure that the data used for estimating a model and
the data used later for testing and applying a model come from the same, unknown,
sampling distribution. If this is not the case, the estimated model cannot be successfully
used in a final application of the results.
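One common way to keep the estimation data and the testing data on the same sampling distribution is to draw both at random from a single collected sample. The sketch below assumes that the records are independent of one another (no time ordering or grouping); the function name split_sample and its parameters are hypothetical.

```python
import random


def split_sample(records, test_fraction=0.3, seed=0):
    """Randomly split one collected sample into an estimation (training)
    part and a testing part, so both come from the same distribution."""
    rng = random.Random(seed)  # fixed seed makes the split repeatable
    shuffled = list(records)
    rng.shuffle(shuffled)
    n_test = round(len(shuffled) * test_fraction)
    cut = len(shuffled) - n_test
    return shuffled[:cut], shuffled[cut:]


if __name__ == "__main__":
    train, test = split_sample(range(10))
    print(len(train), len(test))
```

If the model is later applied to data collected in a different way, this split says nothing about how well it will perform there.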
In the observational setting, data are usually "collected" from existing databases, data
warehouses, and data marts. Data preprocessing usually includes at least two common
tasks:
1. Outlier detection (and removal) – Outliers are unusual data values that are not
consistent with most observations. Commonly, outliers result from measurement
errors, coding and recording errors, and, sometimes, are natural, abnormal values.
Such nonrepresentative samples can seriously affect the model produced later. There
are two strategies for dealing with outliers:
The data mining system can be classified according to the following criteria:
• Database Technology
• Statistics
• Machine Learning
• Information Science
• Visualization and Other Disciplines
Some Other Classification Criteria:
1. Classification according to kind of databases mined
2. Classification according to kind of knowledge mined
3. Classification according to kinds of techniques utilized
4. Classification according to applications adapted
1. Classification according to kind of databases mined
We can classify a data mining system according to the kind of databases mined. Database
systems can be classified according to different criteria such as data models and types of data,
and the data mining system can be classified accordingly. For example, if we classify the
database according to data model, then we may have a relational, transactional,
object-relational, or data warehouse mining system.
2. Classification according to kind of knowledge mined
We can classify a data mining system according to the kind of knowledge mined. This means
that data mining systems are classified on the basis of functionalities such as:
• Characterization
• Discrimination
• Association and Correlation Analysis
• Classification
• Prediction
• Clustering
• Outlier Analysis
• Evolution Analysis
3. Classification according to kinds of techniques utilized
We can classify a data mining system according to the kind of techniques used. We can describe
these techniques according to the degree of user interaction involved or the methods of analysis
employed.