DM - MOD - 1 Part I
Module - 1
KDD (Knowledge Discovery in Databases) refers to the overall process of discovering useful knowledge from data, while data mining is only one of its steps. KDD is an interactive and iterative process consisting of the following phases:
● Data cleaning: also known as data cleansing, it is a phase in which noisy and irrelevant data are removed from the collection.
● Data integration: at this stage, multiple data sources, often heterogeneous, may be
combined in a common source.
● Data selection: at this step, the data relevant to the analysis is decided on and retrieved
from the data collection.
● Data transformation: also known as data consolidation, it is a phase in which the selected
data is transformed into forms appropriate for the mining procedure.
● Data mining: it is the crucial step in which clever techniques are applied to extract potentially useful patterns.
● Pattern evaluation: in this step, truly interesting patterns representing knowledge are identified based on given interestingness measures.
● Knowledge representation: the final phase, in which the discovered knowledge is visually presented to the user. This essential step uses visualization techniques to help users understand and interpret the data mining results.
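To make these phases concrete, here is a minimal sketch of a KDD pipeline in Python. The file names (sales.csv, customers.csv), column names, and choice of mining technique (k-means clustering) are assumptions made only for illustration, not part of the process definition.

import pandas as pd
from sklearn.preprocessing import StandardScaler
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

# Data cleaning: remove noisy rows (missing values) and duplicates.
sales = pd.read_csv("sales.csv").dropna().drop_duplicates()   # hypothetical file

# Data integration: combine two heterogeneous sources on a shared key.
customers = pd.read_csv("customers.csv")                      # hypothetical file
data = sales.merge(customers, on="customer_id")

# Data selection: retrieve only the attributes relevant to the analysis.
selected = data[["age", "annual_spend"]]                      # hypothetical columns

# Data transformation: consolidate the data into a form fit for mining.
X = StandardScaler().fit_transform(selected)

# Data mining: apply a technique (here, k-means) to extract patterns.
model = KMeans(n_clusters=3, n_init=10, random_state=0).fit(X)

# Pattern evaluation: judge the patterns with an interestingness measure.
score = silhouette_score(X, model.labels_)

# Knowledge representation: present the result to the user.
print(f"found {model.n_clusters} segments, silhouette = {score:.2f}")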
Data Warehouse:
A data warehouse is a subject-oriented, integrated, time-variant and non-volatile collection of
data in support of management's decision making process.
● Subject-Oriented: A data warehouse can be used to analyze a particular subject area. For
example, "sales" can be a particular subject.
● Integrated: A data warehouse integrates data from multiple data sources. For example,
source A and source B may have different ways of identifying a product, but in a data
warehouse, there will be only a single way of identifying a product.
● Time-Variant: Historical data is kept in a data warehouse. For example, one can retrieve data from 3 months, 6 months, 12 months, or even older from a data warehouse. This contrasts with a transaction system, where often only the most recent data is kept. For example, a transaction system may hold only the most recent address of a customer, whereas a data warehouse can hold all addresses associated with a customer.
● Non-volatile: Once data is in the data warehouse, it will not change. So, historical data in
a data warehouse should never be altered.
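As a small illustration of the time-variant and non-volatile properties, the sketch below uses pandas DataFrames as stand-ins for the two systems; the table layout and column names are invented for the example.

import pandas as pd

# Transaction system: only the most recent address per customer is kept.
oltp = pd.DataFrame({"customer_id": [1], "address": ["12 Oak St"]})

# Data warehouse: every address is kept with its validity date; old rows
# are never updated in place (non-volatile), only new rows are appended.
warehouse = pd.DataFrame({
    "customer_id": [1, 1],
    "address": ["3 Elm Ave", "12 Oak St"],
    "valid_from": pd.to_datetime(["2020-01-01", "2023-06-01"]),
})

# Historical query: where did customer 1 live before mid-2023?
print(warehouse[(warehouse.customer_id == 1) &
                (warehouse.valid_from < "2023-06-01")])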
Knowledge representation is the presentation of knowledge to the user for visualization in terms
of trees, tables, rules, graphs, charts, matrices, etc.
There are mainly six forms of knowledge from data mining:
● Rule
Rule knowledge consists of a precondition and a conclusion: the precondition is composed of “and”/“or” operations on the values of field terms (attributes), and the conclusion gives the value or class of the decision-making field term (attribute).
Taking a corporation’s notebook-computer sales data (Table 1) as an example, rule knowledge of this kind can be obtained by data mining methods, as sketched below.
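Since Table 1 is not reproduced in these notes, the following sketch uses hypothetical notebook-sales attributes (brand, price, ram_gb) to show how such rule knowledge might be represented in Python.

# A rule as precondition + conclusion. The precondition combines "and"/"or"
# tests on field terms (attributes); the conclusion fixes the decision field.
# All attributes and values are hypothetical, standing in for Table 1.
rule = {
    "precondition": lambda r: (r["price"] < 1000 and r["ram_gb"] >= 8)
                              or r["brand"] == "BrandA",
    "conclusion": ("purchase_decision", "buy"),
}

record = {"brand": "BrandB", "price": 899, "ram_gb": 16}
if rule["precondition"](record):
    field, value = rule["conclusion"]
    print(f"rule fires: {field} = {value}")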
● Decision tree
Decision tree is a kind of tree-like graph that represents a series of judgments made in order to reach a decision. A decision tree consists of decision nodes, branches, and leaves. The top node is the root, which is the starting point of the tree. Searching along the tree from top to bottom, each node poses a question; different answers lead down different branches, until a leaf is reached. Each leaf belongs to a different classification. The figure below is a decision tree corresponding to Table 1.
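Table 1 and its figure are not reproduced here, so the sketch below trains a decision tree on a small invented data set with scikit-learn and prints the resulting nodes, branches, and leaves.

from sklearn.tree import DecisionTreeClassifier, export_text

# Invented training data: [price, ram_gb] -> buy (1) / not buy (0).
X = [[899, 8], [1299, 16], [1599, 8], [749, 4], [999, 16]]
y = [1, 1, 0, 0, 1]

tree = DecisionTreeClassifier(max_depth=2, random_state=0).fit(X, y)

# Each internal node asks a question; branches are the answers; each
# leaf assigns a class, exactly as described above.
print(export_text(tree, feature_names=["price", "ram_gb"]))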
● Association rules
Association rules are “if-then” statements that help to show the probability of relationships between data items within large data sets in various types of databases.
● Classification rules
Rule-based classification in data mining is a technique in which class decisions are made on the basis of “if…then…else” rules. Thus, we define it as a classification type governed by a set of IF-THEN rules. We write an IF-THEN rule as:
“IF condition THEN conclusion.”
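A minimal sketch of rule-based classification, with the rules, attributes, and default class invented for illustration:

# Each rule is (condition, class). Rules are tried in order; the first
# rule whose IF-condition holds supplies the THEN-conclusion.
rules = [
    (lambda r: r["income"] == "high" and r["credit"] == "excellent", "approve"),
    (lambda r: r["income"] == "low", "reject"),
]

def classify(record, default="review"):
    for condition, conclusion in rules:
        if condition(record):        # IF condition ...
            return conclusion        # ... THEN conclusion
    return default                   # no rule fired: fall back to a default class

print(classify({"income": "high", "credit": "excellent"}))  # -> approve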
● Prediction schemes:
Nearest neighbor
Bayesian classification
Neural networks
Regression
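As one instance of these schemes, here is a nearest-neighbor prediction sketch using scikit-learn; the toy data and labels are invented.

from sklearn.neighbors import KNeighborsClassifier

# Toy data: [age, annual_spend] -> customer segment label.
X = [[25, 400], [32, 1500], [48, 5200], [51, 6100], [23, 300]]
y = ["low", "mid", "high", "high", "low"]

knn = KNeighborsClassifier(n_neighbors=3).fit(X, y)

# Predict by majority vote among the 3 most similar known customers.
print(knn.predict([[45, 4800]]))  # -> ['high']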
● Clusters:
Type of grouping: partitions/hierarchical
Grouping or describing: agglomerative/conceptual
Type of descriptions: statistical/structural
1. Class/Concept Descriptions:
Data entries can be associated with classes or concepts. It is often helpful to describe individual classes and concepts in summarized, concise, and yet precise terms. Such descriptions of a class or concept are called class/concept descriptions.
Data Characterization: This refers to the summarization of the general characteristics or features of the class under study. The output of data characterization can be presented in various forms, including pie charts, bar charts, curves, and multidimensional data cubes.
Example: to study the characteristics of software products whose sales increased by 10% in the previous year, or to summarize the characteristics of customers who spend more than $5000 a year at AllElectronics. The result of the latter is a general profile of those customers, such as: they are 40-50 years old, employed, and have excellent credit ratings.
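A sketch of data characterization in the spirit of the AllElectronics example, summarizing the general profile of customers who spend more than $5000 a year; the data set is invented.

import pandas as pd

customers = pd.DataFrame({
    "age": [42, 47, 29, 51, 44],
    "employed": [True, True, False, True, True],
    "credit": ["excellent", "excellent", "fair", "excellent", "good"],
    "annual_spend": [6200, 5400, 900, 7100, 5800],
})

# Characterize the target class: customers spending more than $5000/year.
big_spenders = customers[customers.annual_spend > 5000]

# Summarize the class's general features; the output is the class profile.
print(big_spenders["age"].describe()[["min", "max", "mean"]])
print(big_spenders["credit"].mode())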
2. Association Analysis:
The process involves uncovering relationships between data items and deriving the rules of association. It is a way of discovering the relationships between various items.
Example: Suppose we want to know which items are frequently purchased together. An example of such a rule mined from a transactional database is
buys (X, “computer”) ⇒ buys (X, “software”) [support = 1%, confidence = 50%]
○ Support is the percentage of transactions in which both items occur together: here, 1% of all transactions analyzed contain both computer and software.
○ Confidence is the conditional probability that an item occurs in a transaction given that another item occurs: here, 50% of the customers who buy a computer also buy software.
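The following sketch computes support and confidence for a computer ⇒ software rule directly from a toy transaction list (the transactions are invented, so the numbers differ from the 1%/50% example above).

transactions = [
    {"computer", "software"},
    {"computer"},
    {"printer"},
    {"computer", "software", "printer"},
]

both = sum(1 for t in transactions if {"computer", "software"} <= t)
antecedent = sum(1 for t in transactions if "computer" in t)

# Support: fraction of all transactions containing both items.
support = both / len(transactions)   # 2/4 = 0.50
# Confidence: P(software | computer) among transactions with computer.
confidence = both / antecedent       # 2/3, about 0.67

print(f"support = {support:.0%}, confidence = {confidence:.0%}")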
3. Correlation Analysis: Correlation is a mathematical technique that can show whether and how strongly pairs of attributes are related to each other.
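A sketch of correlation analysis using the Pearson coefficient from NumPy; the two attributes and their values are invented.

import numpy as np

hours_studied = [2, 4, 6, 8, 10]
exam_score    = [51, 60, 68, 81, 90]

# Pearson correlation coefficient in [-1, 1]; values near +/-1 mean a
# strong linear relationship, values near 0 mean little or none.
r = np.corrcoef(hours_studied, exam_score)[0, 1]
print(f"r = {r:.2f}")   # close to 1: strongly positively correlated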
4. Cluster Analysis
In image processing, pattern recognition, and bioinformatics, clustering is a popular data mining functionality. It is similar to classification, except that the classes are not predefined: they emerge from the data itself rather than being given in advance. Similar data objects are grouped together, the difference from classification being that the class labels are not known beforehand. Clustering algorithms group data based on the similarities and dissimilarities of their features.
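A minimal clustering sketch with scikit-learn's k-means: no class labels are supplied, and the two groups emerge from feature similarity alone (points invented).

from sklearn.cluster import KMeans

# Unlabeled points; no predefined classes.
X = [[1, 2], [1, 4], [1, 0], [10, 2], [10, 4], [10, 0]]

# k-means groups the points purely by similarity of their features.
labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(X)
print(labels)  # e.g. [1 1 1 0 0 0]: two discovered groups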
5. Outlier Analysis
Outlier analysis is important for judging the quality of data. If there are too many outliers, you cannot trust the data or the patterns drawn from it.
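One simple outlier check flags values whose z-score exceeds a chosen threshold; the data and the threshold of 2 are assumptions for illustration (3 is also common).

import numpy as np

values = np.array([10.1, 9.8, 10.3, 9.9, 10.0, 42.0])  # 42.0 is suspicious

z = (values - values.mean()) / values.std()
outliers = values[np.abs(z) > 2]   # |z| > 2 here; 3 is also widely used
print(outliers)                     # -> [42.]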
6. Frequent Structured Pattern Mining
A structured pattern refers to various types of data structures, such as trees and graphs, that can be combined with an itemset or subsequence. Mining such patterns uncovers frequently occurring substructures in the data.
Data mining tools perform data analysis and may uncover important data patterns, contributing
greatly to business strategies, knowledge bases, and scientific and medical research. The
widening gap between data and information calls for the systematic development of data tools
that will turn data tombs into “golden nuggets” of knowledge.
Data can be stored in several types of databases and data repositories. One data repository architecture that has emerged is the data warehouse: a repository of multiple heterogeneous data sources, organized under a unified schema at a single site in order to support management decision making. Data warehouse technology involves data cleaning, data integration, and online analytical processing (OLAP), that is, analysis techniques with functionalities such as summarization, consolidation, and aggregation, as well as the ability to view information from multiple angles.
While data can help organizations achieve their goals, it needs to be mined first; raw data by itself is of little use. Data mining ensures that useful information can be derived from raw data and used to benefit both the organization and its customers. Some of the areas where data mining helps are fraud detection, spam filtering, risk management, and cybersecurity. In the marketing sector, it helps in forecasting customer behavior; in the banking sector, it can help in identifying fraudulent transactions. Data mining is not only beneficial for businesses: from government to healthcare, it is used everywhere.
Classification of data mining systems helps users to understand the range of systems and match their requirements with them.
1. Classification according to the type of databases mined
A data mining system can be classified according to the type of databases it mines. Database systems can be distinguished by various criteria (such as data models, or the types of data or applications involved), and each may require its own data mining technique. Data mining systems can therefore be categorized accordingly.
2. Classification according to the kinds of knowledge mined
Data mining systems can also be classified by the kinds of knowledge they mine, that is, based on data mining functionalities. A comprehensive data mining system usually provides several or integrated functionalities, including:
● Characterization
● Discrimination
● Association and Correlation Analysis
● Classification
● Prediction
● Clustering
● Outlier Analysis
● Evolution Analysis
3. Classification according to the applications adapted
Data mining systems can also be categorized according to the applications they are adapted to, for example:
● Finance
● Telecommunications
● DNA
● Stock Markets
● E-mail
Data Integration:
Data integration is a strategy that combines data from several sources to present users with a single uniform view of it. The sources may include multiple databases, data cubes, or flat files. The goal of data integration is to make it easier to access and analyze data that is spread across multiple systems or platforms, in order to gain a more complete and accurate understanding of it.
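A sketch of data integration with pandas, merging two sources that identify the same product under different key names; the schemas are invented.

import pandas as pd

# Source A identifies products by code; source B uses a different key name.
source_a = pd.DataFrame({"prod_code": ["P1", "P2"], "price": [999, 1299]})
source_b = pd.DataFrame({"item_id":  ["P1", "P2"], "stock": [12, 5]})

# Reconcile the identifiers into one schema, then merge into a single view.
source_b = source_b.rename(columns={"item_id": "prod_code"})
unified = source_a.merge(source_b, on="prod_code")
print(unified)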