Data Mining
The main components of a data mining architecture are described below:
1. Data Sources:
Databases, the World Wide Web (WWW), and data warehouses are the main data sources. The data in these sources may be in the form of plain text, spreadsheets, or other media such as photos and videos. The WWW is one of the biggest sources of data.
2. Database Server:
The database server contains the actual data ready to be processed. It
performs the task of handling data retrieval as per the request of the user.
3. Data Mining Engine:
It is one of the core components of the data mining architecture that
performs all kinds of data mining techniques like association, classification,
characterization, clustering, prediction, etc.
4. Pattern Evaluation Modules:
They are responsible for finding interesting patterns in the data and
sometimes they also interact with the database servers for producing the
result of the user requests.
5. Graphical User Interface:
Since the user cannot fully understand the complexity of the data mining process, a graphical user interface helps the user communicate effectively with the data mining system.
6. Knowledge Base:
Knowledge Base is an important part of the data mining engine that is quite
beneficial in guiding the search for the result patterns. Data mining engine
may also sometimes get inputs from the knowledge base. This knowledge
base may contain data from user experiences. The objective of the
knowledge base is to make the result more accurate and reliable.
Data Mining Techniques
1. Association:
Association analysis discovers rules that describe attribute values or items that frequently occur together in the data, such as products that are often purchased in the same transaction.
2. Classification:
Classification builds a model from class-labeled training data and uses it to assign new data objects to one of a set of predefined, categorical classes.
3. Clustering:
Unlike classification and prediction, which analyze class-labeled data objects or
attributes, clustering analyzes data objects without consulting an identified class
label. In general, the class labels do not exist in the training data simply
because they are not known to begin with. Clustering can be used to generate
these labels. The objects are clustered based on the principle of maximizing the intraclass similarity and minimizing the interclass similarity.
4. Prediction:
Data prediction is a two-step process, similar to that of data classification. However, for prediction we do not use the term “class label attribute,” because the attribute whose values are being predicted is continuous-valued (ordered) rather than categorical (discrete-valued and unordered). The
attribute can be referred to simply as the predicted attribute. Prediction can be
viewed as the construction and use of a model to assess the class of an
unlabeled object, or to assess the value or value ranges of an attribute that a
given object is likely to have.
5. Regression:
Regression models a continuous-valued function and is the technique most often used for numeric prediction, estimating the value of a numeric attribute from the values of other attributes.
Outlier Detection:
A database may contain data objects that do not comply with the general behavior or model of the data. These data objects are outliers. The investigation of outlier data is known as outlier mining. An outlier may
be detected using statistical tests which assume a distribution or probability
model for the data, or using distance measures where objects having a small
fraction of “close” neighbors in space are considered outliers.
8. Genetic Algorithm
Genetic algorithms are adaptive heuristic search algorithms that belong to the larger class of evolutionary algorithms. They are based on the ideas of natural selection and genetics: an intelligent exploitation of random search, guided by historical data, that directs the search into regions of better performance in the solution space. They are commonly used to generate high-quality solutions to optimization and search problems. In simple words, they simulate “survival of the fittest” among individuals of consecutive generations to solve a problem.
Cluster Analysis
Cluster analysis is the process of finding groups of similar objects in order to form clusters. It is an unsupervised, machine learning-based technique that acts on unlabelled data. A group of similar data points comes together to form a cluster, in which all the objects belong to the same group.
Cluster:
The given data is divided into different groups by combining similar objects into a group. This group is nothing but a cluster: a collection of similar data grouped together.
For example, consider a dataset of vehicles that contains information about different vehicles such as cars, buses, bicycles, etc. Since this is unsupervised learning, there are no class labels such as Cars or Bikes for the vehicles; all the data is combined and not in a structured form.
Our task is to convert this unlabelled data into labelled data, and this can be done using clusters.
The main idea of cluster analysis is to arrange all the data points into clusters, such as a cars cluster containing all the cars, a bikes cluster containing all the bikes, and so on.
Simply put, it is a partitioning of similar objects applied to unlabelled data.
Properties of Clustering:
1. Clustering Scalability: Nowadays there is a vast amount of data, so clustering algorithms must deal with huge databases. In order to handle extensive databases, the clustering algorithm should be scalable; if it is not, the results may be inappropriate or simply wrong.
2. High Dimensionality: The algorithm should be able to handle high-dimensional data as well as data of small size.
3. Algorithm Usability with Multiple Data Kinds: Different kinds of data can be used with clustering algorithms. They should be capable of dealing with different types of data, such as discrete, categorical, interval-based, and binary data.
4. Dealing with Unstructured Data: Some databases contain missing values and noisy or erroneous data. If an algorithm is sensitive to such data, it may produce poor-quality clusters. The algorithm should therefore be able to handle unstructured data and give it some structure by organizing it into groups of similar data objects. This makes the job of the data expert easier when processing the data and discovering new patterns.
5. Interpretability: The outcomes of clustering should be interpretable, comprehensible, and usable. Interpretability reflects how easily the results are understood.
Clustering Methods:
The clustering methods can be classified into the following categories:
Partitioning Method
Hierarchical Method
Density-based Method
Grid-Based Method
Model-Based Method
Constraint-based Method
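As a brief illustration of the partitioning approach, the following is a minimal sketch using scikit-learn's KMeans on a toy, hypothetical dataset (the feature values and the choice of k = 2 are assumptions for illustration only):

```python
# Minimal sketch of a partitioning method (k-means), assuming scikit-learn and NumPy are installed.
import numpy as np
from sklearn.cluster import KMeans

# Hypothetical, unlabelled 2-D data points (e.g., vehicle weight in kg vs. number of wheels).
X = np.array([[1200, 4], [1500, 4], [1100, 4],   # car-like objects
              [15, 2], [12, 2], [18, 2]])        # bicycle-like objects

# Partition the data into k = 2 clusters; each point receives a cluster label.
kmeans = KMeans(n_clusters=2, n_init=10, random_state=0)
labels = kmeans.fit_predict(X)

print(labels)                   # e.g., [0 0 0 1 1 1]
print(kmeans.cluster_centers_)  # one centroid per cluster
```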
Binary variables: has only two states: 0 or 1, where 0 means that the variable is absent,
and 1 means that it is present. Given the variable smoker describing a patient, for instance, 1
indicates that the patient smokes, while 0 indicates that the patient does not. Treating binary
variables as if they are interval-scaled can lead to misleading clustering results. Therefore,
methods specific to binary data are necessary for computing dissimilarities. A binary variable
is symmetric if both of its states are equally valuable and carry the same weight; that is,
there is no preference on which outcome should be coded as 0 or 1.
Dissimilarity that is based on symmetric binary variables is called symmetric binary dissimilarity. Its
dissimilarity (or distance) measure, defined in Equation (7.9), can be used to assess the dissimilarity
between objects i and j.
d(i, j) = (r + s) / (q + r + s + t),   (7.9)
where q is the number of variables that equal 1 for both objects i and j, r is the number of variables that equal 1 for object i but 0 for object j, s is the number of variables that equal 0 for object i but 1 for object j, and t is the number of variables that equal 0 for both objects.
A binary variable is asymmetric if the outcomes of the states are not equally important, such as the positive and negative outcomes of a disease test. For such variables the number of negative matches, t, is considered unimportant, and the asymmetric binary dissimilarity is d(i, j) = (r + s) / (q + r + s).
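A minimal sketch of both dissimilarity computations, assuming two objects are supplied as equal-length 0/1 vectors (the patient vectors below are hypothetical):

```python
# Sketch: symmetric and asymmetric binary dissimilarity between two 0/1 vectors.

def binary_dissimilarity(i, j, symmetric=True):
    """Compute the binary dissimilarity between objects i and j.

    q = variables that are 1 for both,  r = 1 for i and 0 for j,
    s = 0 for i and 1 for j,            t = 0 for both.
    """
    q = sum(1 for a, b in zip(i, j) if a == 1 and b == 1)
    r = sum(1 for a, b in zip(i, j) if a == 1 and b == 0)
    s = sum(1 for a, b in zip(i, j) if a == 0 and b == 1)
    t = sum(1 for a, b in zip(i, j) if a == 0 and b == 0)
    if symmetric:                        # Equation (7.9)
        return (r + s) / (q + r + s + t)
    return (r + s) / (q + r + s)         # asymmetric: negative matches t are ignored

# Hypothetical patients described by binary variables (e.g., fever, cough, test results).
jack = [1, 0, 1, 0, 0, 0]
mary = [1, 0, 1, 0, 1, 0]
print(binary_dissimilarity(jack, mary, symmetric=False))   # (0 + 1) / (2 + 0 + 1) ≈ 0.33
```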
Categorical variables: A categorical variable is a generalization of the binary variable in that it can
take on more than two states. For example, map color is a categorical variable that may have, say, five
states: red, yellow, green, pink, and blue.
Let the number of states of a categorical variable be M. The states can be denoted by letters, symbols,
or a set of integers, such as 1, 2,..., M. Notice that such integers are used just for data handling and do
not represent any specific ordering.
The dissimilarity between two objects i and j can be computed based on the ratio of mismatches:
d(i, j) = (p − m) / p,   (7.12)
where m is the number of matches (i.e., the number of variables for which i and j are in the same state),
and p is the total number of variables.
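A minimal sketch of Equation (7.12), assuming two objects are supplied as lists of categorical states (the example states are hypothetical):

```python
# Sketch: dissimilarity between two objects described by categorical variables,
# computed as the ratio of mismatches (p - m) / p from Equation (7.12).

def categorical_dissimilarity(i, j):
    p = len(i)                                    # total number of variables
    m = sum(1 for a, b in zip(i, j) if a == b)    # number of matching states
    return (p - m) / p

obj_i = ["red", "circle", "small"]
obj_j = ["red", "square", "large"]
print(categorical_dissimilarity(obj_i, obj_j))    # (3 - 1) / 3 ≈ 0.67
```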
Ordinal Variables
A discrete ordinal variable resembles a categorical variable, except that the M states of the ordinal value
are ordered in a meaningful sequence. Ordinal variables are very useful for registering subjective
assessments of qualities that cannot be measured objectively. For example, professional ranks are often
enumerated in a sequential order, such as assistant, associate, and full for professors. A continuous
ordinal variable looks like a set of continuous data of an unknown scale; that is, the relative ordering of
the values is essential but their actual magnitude is not.
Ratio-Scaled Variables
A ratio-scaled variable makes a positive measurement on a nonlinear scale, such as an exponential scale,
approximately following the formula
A·e^(Bt) or A·e^(−Bt),   (7.14)
where A and B are positive constants, and t typically represents time. Common examples include the
growth of a bacteria population or the decay of a radioactive element. There are three methods to
handle ratio-scaled variables for computing the dissimilarity between objects.
Treat ratio-scaled variables like interval-scaled variables. This, however, is not usually a good choice
since it is likely that the scale may be distorted.
Apply logarithmic transformation to a ratio-scaled variable f having value xif for object i by using the
formula yif = log(xif ).
Treat xif as continuous ordinal data and treat their ranks as interval-valued.
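The sketch below illustrates two of the preprocessing steps just described: mapping an ordinal value's rank onto [0, 1] using the standard normalization z_if = (r_if − 1)/(M_f − 1), and applying the logarithmic transformation y_if = log(x_if) to a ratio-scaled variable (the example ranks and values are hypothetical):

```python
# Sketch: preparing ordinal and ratio-scaled variables before computing dissimilarities.
import math

def normalize_ordinal(value, ordered_states):
    """Map an ordinal value onto [0, 1] using z = (rank - 1) / (M - 1)."""
    rank = ordered_states.index(value) + 1     # rank r_if in 1..M
    M = len(ordered_states)
    return (rank - 1) / (M - 1)

def log_transform(x):
    """Logarithmic transformation y_if = log(x_if) for a ratio-scaled value."""
    return math.log(x)

# Hypothetical professional ranks (ordinal) and a bacteria count (ratio-scaled).
ranks = ["assistant", "associate", "full"]
print(normalize_ordinal("associate", ranks))   # 0.5
print(log_transform(100000.0))                 # ~11.51
```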
Outlier
An outlier is an object that deviates significantly from the rest of the objects.
They can be caused by measurement or execution errors. The analysis of
outlier data is referred to as outlier analysis or outlier mining. An outlier should not
be confused with noise or an error.
Difference between outliers and noise
Noise is any unwanted error or random variance in a previously measured variable. Before finding the outliers present in a data set, it is recommended to first remove the noise.
Analysis of Outliers
Outliers are usually discarded when data mining methods are applied, but outlier analysis is still used in certain applications such as fraud detection. This is mainly because rarely occurring events can carry much more interesting information than events that occur regularly.
1. Statistical Distribution-Based Outlier Detection
“How does discordancy testing work?” A statistical discordancy test examines two hypotheses: a working hypothesis and an alternative hypothesis.
There are two basic types of procedures for detecting outliers:
Block procedures: In this case, either all of the suspect objects are treated as outliers or all of them
are accepted as consistent.
Consecutive (or sequential) procedures: An example of such a procedure is the inside-out procedure.
Its main idea is that the object that is least “likely” to be an outlier is tested first. If it is found to
be an outlier, then all of the more extreme values are also considered outliers; otherwise, the next
most extreme object is tested, and so on. This procedure tends to be more effective than block
procedures.
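As a simple stand-in for the distribution-based idea, the sketch below flags values that lie more than two standard deviations from the mean under an assumed normal model; the threshold and data are hypothetical, and this is only an illustration of the idea behind discordancy testing, not one of the specific procedures above:

```python
# Sketch: a simple statistical outlier check assuming a (roughly) normal distribution.
import statistics

def zscore_outliers(values, threshold=2.0):
    mean = statistics.mean(values)
    stdev = statistics.pstdev(values)
    # Flag values whose distance from the mean exceeds threshold * standard deviation.
    return [x for x in values if abs(x - mean) > threshold * stdev]

data = [10, 12, 11, 13, 12, 11, 10, 95]   # hypothetical measurements
print(zscore_outliers(data))              # [95]
```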
2. Distance-Based Outlier Detection
The notion of distance-based outliers was introduced to counter the main limitations imposed by statistical methods. An object, o, in a data set, D, is a distance-based (DB) outlier with parameters pct and dmin, that is, a DB(pct, dmin)-outlier, if at least a fraction, pct, of the objects in D lie at a distance greater than dmin from o. In other words, rather than relying on statistical tests, we can think of distance-based outliers as those objects that do not have “enough” neighbors, where neighbors are defined based on distance from the given object. In comparison with statistical-based methods, distance-based outlier detection generalizes the ideas behind discordancy testing for various standard distributions. Distance-based outlier detection avoids the excessive computation that can be associated with fitting the observed distribution into some standard distribution and in selecting discordancy tests.
Several efficient algorithms for mining distance-based outliers have been developed. These are outlined
as follows.
Index-based algorithm: Given a data set, the index-based algorithm uses multidimensional indexing
structures, such as R-trees or k-d trees, to search for neighbors of each object o within radius dmin
around that object.
Nested-loop algorithm: The nested-loop algorithm has the same computational complexity as
the index-based algorithm but avoids index structure construction and tries to minimize the
number of I/Os. It divides the memory buffer space into two halves and the data set into
several logical blocks. By carefully choosing the order in which blocks are loaded into each
half, I/O efficiency can be achieved.
Cell-based algorithm: To avoid the O(n²) computational cost of the nested-loop approach, a cell-based algorithm was developed for memory-resident data sets. Its complexity is O(c^k + n), where c is a constant depending on the number of cells and k is the dimensionality. In this method, the data space is partitioned into cells with a side length equal to dmin/(2√k). Each cell has two layers surrounding it.
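A minimal sketch of the DB(pct, dmin) definition using a naive scan over all pairs of objects (the data, pct, and dmin values are hypothetical); the index-based, nested-loop, and cell-based algorithms above exist precisely to avoid this quadratic amount of distance computation:

```python
# Sketch: naive DB(pct, dmin)-outlier detection by comparing every pair of objects.
import math

def db_outliers(points, pct, dmin):
    """Return objects for which at least a fraction pct of the objects in the
    data set lie at a distance greater than dmin."""
    outliers = []
    for i, p in enumerate(points):
        far = sum(1 for j, q in enumerate(points)
                  if i != j and math.dist(p, q) > dmin)
        if far / len(points) >= pct:
            outliers.append(p)
    return outliers

# Hypothetical 2-D data: a tight cluster plus one distant point.
data = [(1, 1), (1, 2), (2, 1), (2, 2), (1.5, 1.5), (10, 10)]
print(db_outliers(data, pct=0.8, dmin=5.0))   # [(10, 10)]
```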
Missing Values
Imagine that you need to analyze AllElectronics sales and customer data. You note that many tuples have no recorded value for several attributes, such as customer income. How can you go about filling in the missing values for this attribute? Let us look at the following methods:
Ignore the tuple: This is usually done when the class label is missing (assuming the mining task involves classification). This method is not very effective, unless the tuple contains several attributes with missing values. It is especially poor when the percentage of missing values per attribute varies considerably.
Fill in the missing value manually: In general, this approach is time-consuming and may not be feasible given a large data set with many missing values.
Use a global constant to fill in the missing value: Replace all missing attribute values by the same
constant, such as a label like “Unknown” or −∞. If missing values are replaced by, say, “Unknown,” then
the mining program may mistakenly think that they form an interesting concept, since they all have a
value in common—that of “Unknown.” Hence, although this method is simple, it is not foolproof.
Use the attribute mean to fill in the missing value: For example, suppose that the average income of
AllElectronics customers is $56,000. Use this value to replace the missing value for income.
Use the attribute mean for all samples belonging to the same class as the given tuple: For example, if classifying customers according to credit risk, replace the missing value with the average income value for customers in the same credit risk category as that of the given tuple.
Use the most probable value to fill in the missing value: This may be determined with regression,
inference-based tools using a Bayesian formalism, or decision tree induction. For example, using the
other customer attributes in your data set, you may construct a decision tree to predict the missing
values for income.
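A minimal pandas sketch of three of the strategies above: a global constant, the overall attribute mean, and the mean of samples in the same class (the column names and income figures are hypothetical, not the AllElectronics data):

```python
# Sketch: filling in missing 'income' values with pandas, on a hypothetical dataset.
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "credit_risk": ["low", "low", "high", "high", "low"],
    "income": [52000, np.nan, 30000, np.nan, 60000],
})

# 1. Use a global constant to fill in the missing value.
filled_constant = df["income"].fillna(-1)

# 2. Use the attribute mean to fill in the missing value.
filled_mean = df["income"].fillna(df["income"].mean())

# 3. Use the attribute mean of samples in the same class (credit_risk) as the tuple.
filled_class_mean = df["income"].fillna(
    df.groupby("credit_risk")["income"].transform("mean")
)
print(filled_class_mean.tolist())   # [52000.0, 56000.0, 30000.0, 30000.0, 60000.0]
```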
What is classification?
The following are examples of cases where the data analysis task is classification −
A bank loan officer wants to analyze the data in order to know which customers (loan applicants) are risky and which are safe.
A marketing manager at a company needs to analyze a customer with a given profile to determine whether he or she will buy a new computer.
In both of the above examples, a model or classifier is constructed to predict the
categorical labels. These labels are risky or safe for loan application data and yes or no
for marketing data.
What is prediction?
The following are examples of cases where the data analysis task is prediction −
Suppose the marketing manager needs to predict how much a given customer will spend during a sale at his company. In this example we are interested in predicting a numeric value, so the data analysis task is an example of numeric prediction. In this case, a model or a predictor is constructed that predicts a continuous-valued function or ordered value.
Note − Regression analysis is a statistical methodology that is most often used for
numeric prediction.
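As a small sketch of numeric prediction, a regression model can be fit to past spending data and then used to predict a continuous value for a new customer; the features and figures below are hypothetical, and scikit-learn is assumed to be installed:

```python
# Sketch: numeric prediction with linear regression.
import numpy as np
from sklearn.linear_model import LinearRegression

# Hypothetical training data: [annual income (k$), past purchases] -> amount spent ($).
X_train = np.array([[40, 2], [55, 3], [70, 5], [90, 8]])
y_train = np.array([120, 180, 300, 450])

predictor = LinearRegression().fit(X_train, y_train)

# Predict how much a new customer with a given profile is likely to spend.
print(predictor.predict(np.array([[60, 4]])))   # a continuous-valued prediction
```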
Decision Trees
A decision tree is a powerful and popular tool for classification and prediction. A decision tree is a flowchart-like tree structure, where each internal node denotes a test on an attribute, each branch represents an outcome of the test, and each leaf node (terminal node) holds a class label.
Decision trees help to analyze which parts of the database are really useful or which part contains a solution to your problem.
A decision tree is a decision-support tool that uses a tree-like model of decisions and their possible consequences, including the outcomes of chance events, resource costs, and utility.
From a decision perspective, a decision tree is the minimum number of questions that must be answered to assess the likelihood of making a correct decision.
By looking at the predictors or values for each split in the tree, you can draw some ideas or find answers to the questions you have asked.
Decision trees enable you to approach the problem in a structured and systematic manner.
To estimate a class label for a record, we start at the root of the tree, compare the value of the root attribute with the corresponding attribute of the record, follow the branch that agrees with that value, and jump to the next node; this is repeated until a leaf is reached.
Classification by Decision Tree Induction
Decision tree induction is the learning of decision trees from class-labeled training tuples. A decision tree is a flowchart-like tree structure, where each internal node (non-leaf node) denotes a test on an attribute, each branch represents an outcome of the test, and each leaf node (or terminal node) holds a class label. The topmost node in a tree is the root node.
The following decision tree is for the concept buy_computer that indicates whether a customer
at a company is likely to buy a computer or not. Each internal node represents a test on an
attribute. Each leaf node represents a class.
Input:
Data partition, D, which is a set of training tuples
and their associated class labels;
attribute_list, the set of candidate attributes;
Attribute_selection_method, a procedure to determine the
splitting criterion that best partitions the data
tuples into individual classes. This criterion consists of a
splitting_attribute and, possibly, either a splitting point or a splitting subset.
Output:
A Decision Tree
Method:
create a node N;
if tuples in D are all of the same class, C, then
    return N as a leaf node labeled with the class C;
if attribute_list is empty then
    return N as a leaf node labeled with the majority class in D;
apply Attribute_selection_method(D, attribute_list) to find the best splitting_criterion;
label node N with splitting_criterion;
for each outcome j of splitting_criterion
    let Dj be the set of data tuples in D satisfying outcome j;
    if Dj is empty then
        attach a leaf labeled with the majority class in D to node N;
    else
        attach the node returned by Generate_decision_tree(Dj, attribute_list) to node N;
end for
return N;
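For illustration, the sketch below trains a decision tree classifier on a tiny, hypothetical buys_computer-style dataset with scikit-learn; the encoded attribute values are assumptions, and scikit-learn's implementation differs in detail from the induction algorithm above:

```python
# Sketch: training a decision tree on a tiny, hypothetical buys_computer-style dataset.
from sklearn.tree import DecisionTreeClassifier

# Features: [age (0=youth, 1=middle_aged, 2=senior), student (0/1), credit (0=fair, 1=excellent)]
X = [[0, 0, 0], [0, 1, 0], [1, 0, 0], [2, 0, 1], [2, 1, 0], [1, 1, 1]]
# Class label: buys_computer (0 = no, 1 = yes)
y = [0, 1, 1, 0, 1, 1]

clf = DecisionTreeClassifier(criterion="entropy", random_state=0).fit(X, y)
print(clf.predict([[0, 1, 1]]))   # predicted class label for a new tuple
```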
Tree Pruning
Tree pruning is performed in order to remove anomalies in the training data due to
noise or outliers. The pruned trees are smaller and less complex.
Tree Pruning Approaches
There are two approaches to prune a tree −
Pre-pruning − The tree is pruned by halting its construction early.
Post-pruning - This approach removes a sub-tree from a fully grown tree.
Cost Complexity
The cost complexity is measured by the following two parameters −
Number of leaves in the tree, and
Error rate of the tree.
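As one way to sketch post-pruning, scikit-learn exposes cost-complexity pruning through the ccp_alpha parameter: larger values remove more sub-trees, trading training accuracy for a tree with fewer leaves (the dataset and alpha value below are arbitrary choices for illustration):

```python
# Sketch: post-pruning via cost-complexity pruning in scikit-learn.
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)

unpruned = DecisionTreeClassifier(random_state=0).fit(X, y)
pruned = DecisionTreeClassifier(random_state=0, ccp_alpha=0.02).fit(X, y)

# The pruned tree is smaller (fewer leaves) but usually has a higher error rate on the training data.
print(unpruned.get_n_leaves(), pruned.get_n_leaves())
print(unpruned.score(X, y), pruned.score(X, y))
```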
Rule-Based Classification
A rule-based classifier uses a set of IF-THEN rules for classification. A rule is written in the form IF condition THEN conclusion, for example:
R1: IF age = youth AND student = yes THEN buys_computer = yes
Points to remember −
The IF part of the rule is called rule antecedent or precondition.
The THEN part of the rule is called rule consequent.
The antecedent part, the condition, consists of one or more attribute tests that are logically ANDed.
The consequent part consists of the class prediction.
Note − We can also write rule R1 as follows −
R1: (age = youth) ∧ (student = yes) ⇒ (buys_computer = yes)
If the condition holds true for a given tuple, then the antecedent is satisfied.
Rule Extraction
Here we will learn how to build a rule-based classifier by extracting IF-THEN rules from
a decision tree.
Points to remember −
To extract a rule from a decision tree −
One rule is created for each path from the root to the leaf node.
To form a rule antecedent, each splitting criterion is logically ANDed.
The leaf node holds the class prediction, forming the rule consequent.
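A minimal sketch of this extraction, assuming the tree is stored as a nested dictionary (a hypothetical representation, not a specific library's format): one IF-THEN rule is emitted per root-to-leaf path, and the splitting tests along the path are ANDed into the antecedent:

```python
# Sketch: extract IF-THEN rules from a decision tree represented as nested dicts.
# Internal nodes: {"attribute": ..., "branches": {outcome: subtree}}; leaves: class labels.

def extract_rules(node, conditions=()):
    if not isinstance(node, dict):                      # leaf node: emit one rule
        antecedent = " AND ".join(conditions) or "TRUE"
        return [f"IF {antecedent} THEN class = {node}"]
    rules = []
    for outcome, subtree in node["branches"].items():   # one branch per test outcome
        test = f"{node['attribute']} = {outcome}"
        rules.extend(extract_rules(subtree, conditions + (test,)))
    return rules

# Hypothetical buys_computer tree (cf. the decision tree described earlier).
tree = {
    "attribute": "age",
    "branches": {
        "youth": {"attribute": "student", "branches": {"no": "no", "yes": "yes"}},
        "middle_aged": "yes",
        "senior": {"attribute": "credit_rating", "branches": {"fair": "yes", "excellent": "no"}},
    },
}
for rule in extract_rules(tree):
    print(rule)   # e.g., IF age = youth AND student = yes THEN class = yes
```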
Input:
D, a data set of class-labeled tuples;
Att_vals, the set of all attributes and their possible values.
Output: A set of IF-THEN rules.
Method:
Rule_set = { };   // set of rules learned is initially empty
for each class c do
    repeat
        Rule = Learn_One_Rule(D, Att_vals, c);
        remove tuples covered by Rule from D;
        Rule_set = Rule_set + Rule;   // add the new rule to the rule set
    until terminating condition;
end for
return Rule_set;
FOIL_Prune(R) = (pos − neg) / (pos + neg),
where pos and neg are, respectively, the number of positive and negative tuples covered by R.
Note − This value will increase with the accuracy of R on the pruning set. Hence, if the
FOIL_Prune value is higher for the pruned version of R, then we prune R.
Genetic Algorithms
The idea of genetic algorithms is derived from natural evolution. In a genetic algorithm, an initial population is created first. This initial population consists of randomly generated rules. Each rule can be represented by a string of bits.
For example, suppose that in a given training set the samples are described by two Boolean attributes, A1 and A2, and that the training set contains two classes, C1 and C2.
We can encode the rule IF A1 AND NOT A2 THEN C2 into a bit string 100. In this bit
representation, the two leftmost bits represent the attribute A1 and A2, respectively.
Likewise, the rule IF NOT A1 AND NOT A2 THEN C1 can be encoded as 001.
Note − If an attribute has K values, where K > 2, then K bits can be used to encode the attribute's values. The classes are encoded in the same manner.
Points to remember −
Based on the notion of survival of the fittest, a new population is formed that consists of the fittest rules in the current population, as well as offspring of these rules.
The fitness of a rule is assessed by its classification accuracy on a set of training
samples.
The genetic operators such as crossover and mutation are applied to create
offspring.
In crossover, substrings from a pair of rules are swapped to form a new pair of rules.
In mutation, randomly selected bits in a rule's string are inverted.
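A compact sketch of the bit-string encoding and genetic operators described above; the training samples, fitness function, population size, and mutation rate are simplified, hypothetical choices rather than a prescribed algorithm:

```python
# Sketch: a tiny genetic algorithm over 3-bit rule strings. Following the encoding above,
# the first two bits are the required values of A1 and A2, and the last bit is the class.
import random

random.seed(0)

# Hypothetical training samples: (A1, A2, class)
samples = [(1, 0, 0), (1, 0, 0), (0, 0, 1), (0, 1, 1)]

def fitness(rule):
    """Fraction of training samples that the rule covers and classifies correctly."""
    hits = sum(1 for a1, a2, c in samples
               if (a1, a2) == (rule[0], rule[1]) and c == rule[2])
    return hits / len(samples)

def crossover(r1, r2):
    point = random.randint(1, 2)           # swap substrings after a crossover point
    return r1[:point] + r2[point:], r2[:point] + r1[point:]

def mutate(rule, rate=0.1):
    return tuple(bit ^ 1 if random.random() < rate else bit for bit in rule)

# Initial population of randomly generated rules.
population = [tuple(random.randint(0, 1) for _ in range(3)) for _ in range(6)]

for _ in range(20):                         # evolve for a fixed number of generations
    population.sort(key=fitness, reverse=True)
    parents = population[:2]                # keep the fittest rules ("survival of the fittest")
    child1, child2 = crossover(*parents)
    population = parents + [mutate(child1), mutate(child2)] + population[2:4]

best = max(population, key=fitness)
print(best, fitness(best))                  # e.g., (1, 0, 0) with fitness 0.5
```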