Data Mining
Data mining deals with the kinds of patterns that can be mined. On the basis of the kind of
data to be mined, there are two categories of functions involved in data mining −
Descriptive
Classification and Prediction
Descriptive Function
The descriptive function deals with the general properties of data in the database. Here is the
list of descriptive functions −
Class/Concept Description
Mining of Frequent Patterns
Mining of Associations
Mining of Correlations
Mining of Clusters
Class/Concept Description
Class/Concept refers to the data associated with classes or concepts. For example, in
a company, the classes of items for sale include computers and printers, and concepts of
customers include big spenders and budget spenders. Such descriptions of a class or a
concept are called class/concept descriptions. These descriptions can be derived in the
following two ways −
Data Characterization − This refers to summarizing the data of the class under study. The
class under study is called the Target Class.
Data Discrimination − This refers to comparing the target class with one or more
predefined contrasting classes.
Mining of Frequent Patterns
Frequent patterns are those patterns that occur frequently in transactional data. Here is the list
of kinds of frequent patterns −
Frequent Item Set − It refers to a set of items that frequently appear together, for
example, milk and bread.
Frequent Subsequence − A sequence of patterns that occurs frequently, such as
purchasing a camera followed by a memory card.
Frequent Substructure − Substructure refers to different structural forms, such as
graphs, trees, or lattices, which may be combined with itemsets or subsequences.
Mining of Associations
Associations are used in retail sales to identify items that are frequently purchased
together. This refers to the process of uncovering the relationships among data and
determining association rules.
For example, a retailer might generate an association rule showing that 70% of the time milk is
sold with bread, and only 30% of the time biscuits are sold with bread.
Mining of Correlations
It is a kind of additional analysis performed to uncover interesting statistical correlations
between associated-attribute-value pairs or between two itemsets, to analyze whether they
have a positive, negative, or no effect on each other.
Mining of Clusters
Cluster refers to a group of similar objects. Cluster analysis refers to forming groups of
objects that are very similar to each other but highly different from the objects in other
clusters.
Classification and Prediction
Classification is the process of finding a model that describes the data classes or concepts.
The purpose is to be able to use this model to predict the class of objects whose class label is
unknown. This derived model is based on the analysis of sets of training data. The derived
model can be presented in the following forms −
Classification − It predicts the class of objects whose class label is unknown. Its
objective is to find a derived model that describes and distinguishes data classes or
concepts. The derived model is based on the analysis of a set of training data, i.e., data
objects whose class labels are known.
Prediction − It is used to predict missing or unavailable numerical data values rather
than class labels. Regression analysis is generally used for prediction. Prediction can
also be used to identify distribution trends based on available data.
Outlier Analysis − Outliers may be defined as data objects that do not comply
with the general behavior or model of the available data.
Evolution Analysis − Evolution analysis refers to describing and modeling
regularities or trends for objects whose behavior changes over time.
Data Mining Task Primitives
We can specify a data mining task in the form of a data mining query.
This query is input to the system.
A data mining query is defined in terms of data mining task primitives.
Note − These primitives allow us to communicate in an interactive manner with the data
mining system. Here is the list of Data Mining Task Primitives −
Set of task-relevant data to be mined.
Kind of knowledge to be mined.
Background knowledge to be used in the discovery process.
Interestingness measures and thresholds for pattern evaluation.
Representation for visualizing the discovered patterns.
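To make these primitives concrete, here is a purely illustrative sketch of how the five primitives might be bundled into one query specification in Python; the field names and values are hypothetical, not part of any standard query language or API.

```python
# Illustrative only: a data mining task expressed as the five primitives.
# All keys and values below are hypothetical, not a standard API.
mining_query = {
    # 1. Task-relevant data: the portion of the database of interest
    "data": {"table": "sales", "attributes": ["item", "price", "customer_age"]},
    # 2. Kind of knowledge to be mined
    "knowledge_type": "association",
    # 3. Background knowledge, e.g. a concept hierarchy over locations
    "concept_hierarchy": {"city": "province", "province": "country"},
    # 4. Interestingness measures and thresholds for pattern evaluation
    "thresholds": {"min_support": 0.05, "min_confidence": 0.7},
    # 5. Representation for visualizing the discovered patterns
    "output_form": "rules",
}
```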
Set of task-relevant data to be mined
This is the portion of the database in which the user is interested. This portion includes the
following −
Database Attributes
Data Warehouse dimensions of interest
Kind of knowledge to be mined
Characterization
Discrimination
Association and Correlation Analysis
Classification
Prediction
Clustering
Outlier Analysis
Evolution Analysis
Background knowledge
Background knowledge allows data to be mined at multiple levels of abstraction. For
example, concept hierarchies are one form of background knowledge that allows data to be
mined at multiple levels of abstraction.
Interestingness measures and thresholds for pattern evaluation
These are used to evaluate the patterns discovered by the process of knowledge
discovery. There are different interestingness measures for different kinds of knowledge.
Representation for visualizing the discovered patterns
This refers to the form in which discovered patterns are to be displayed. These
representations may include the following −
Rules
Tables
Charts
Graphs
Decision Trees
Cubes
What is KDD?
KDD (Knowledge Discovery in Databases) is a computer science field specializing in
extracting previously unknown and interesting information from raw data.
KDD is the whole process of trying to make sense of data by
developing appropriate methods or techniques. This process deals
with mapping low-level data into other forms that are more
compact, abstract, and useful. This is achieved by creating short
reports, modeling the process that generated the data, and developing
predictive models that can predict future cases.
Data mining is only a step within the overall KDD process. There are two
major data mining goals defined by the application's goal: verification or
discovery. Verification verifies the user's hypothesis about the data, while
discovery automatically finds interesting patterns.
However, the sheer volume of data and the speed with which it is
collected make sifting through it challenging. Thus, it has
become economically and scientifically necessary to scale up our
analysis capabilities to handle the vast amounts of data that we now
obtain.
Data mining is not an easy task: the algorithms used can get very complex, and data is
not always available in one place; it needs to be integrated from various heterogeneous data
sources. These factors also create some issues, several of which are discussed below.
Data mining is one of the forms of artificial intelligence that uses perception models,
analytical models, and multiple algorithms to simulate the techniques of the human
brain. It helps machines make human-like decisions and choices.
The user of data mining tools has to supply the machine with rules, preferences, and
even experiences in order to obtain decision support. Metrics for judging whether a model
provides such support are as follows −
Usefulness − Usefulness involves several metrics that tell us whether the model
provides useful information. For instance, a data mining model that correlates store location
with sales can be both accurate and reliable but still not useful, because one cannot
generalize that result by opening more stores at the same location.
Furthermore, it does not answer the fundamental business question of why specific
locations have more sales. A model that appears successful may also be
meaningless, because it depends on cross-correlations in the data.
Return on Investment (ROI) − Data mining tools find interesting patterns buried
inside the data and develop predictive models. These models come with several measures
denoting how well they fit the records. However, it is not always clear how to make a decision
based on the measures reported as part of a data mining analysis.
Accessing Financial Information during Data Mining − The simplest way to frame
decisions in financial terms is to augment the raw information that is typically mined so that it
also contains financial data. Some organizations are investing in and developing data
warehouses and data marts.
The design of a warehouse or mart involves considerations about the types of analyses
and data needed for expected queries. Designing warehouses in a way that allows
access to financial information, along with access to more typical data on product
attributes, user profiles, etc., can be useful.
Converting Data Mining Metrics into Financial Terms − A common data mining
metric is the measure of "Lift". Lift measures what is achieved by using a specific
model or pattern relative to a base rate in which the model is not used. High values
mean much is gained. It might seem, then, that one could simply make a decision based on
lift.
Accuracy − Accuracy is a measure of how well the model correlates an outcome with the
attributes in the data that has been provided. There are several measures of accuracy,
but all of them depend on the data that is used. In reality,
values can be missing or approximate, or the data may have been changed by several
processes.
Data mining is the process of finding useful new correlations, patterns, and trends by
sifting through large amounts of data stored in repositories, using pattern
recognition technologies including statistical and mathematical techniques. It is the
analysis of observational datasets to discover unsuspected relationships and to summarize the
records in novel ways that are both understandable and useful to the data owner.
Data mining systems are designed to promote the identification and classification of
individuals into distinct groups or segments. From the perspective of a commercial firm,
and possibly for the industry as a whole, the use of data mining can be interpreted as a
discriminatory technology in the rational pursuit of profit.
There are various social implications of data mining which are as follows −
Privacy − This is a loaded issue. In recent years, privacy concerns have taken on a more
important role in American society as merchants, insurance companies, and government
agencies amass warehouses of personal records.
The concerns that people have over the collection of this data naturally extend to the
analytic capabilities applied to that data. Users of data mining should start thinking about
how their use of this technology will be affected by legal issues associated with
privacy.
Profiling − Data mining and profiling is a developing field that attempts to organize,
understand, analyze, reason about, and use the explosion of data in this information age. The
process involves using algorithms and experience to extract patterns or anomalies that are
very complex, difficult, or time-consuming to identify.
The founder of Microsoft's data mining exploration team used complex data mining algorithms to
solve a problem that had haunted astronomers for years: reviewing,
describing, and categorizing two billion sky objects recorded over three decades. The
algorithms were able to extract the features that characterized sky objects as stars or
galaxies. This developing field of data mining and profiling has many frontiers where it
can be applied.
Unauthorized Use − Trends obtained through data mining, though intended for
marketing or other ethical purposes, can be misused. Unethical businesses or
people can use information obtained through data mining to take advantage of vulnerable
people or discriminate against a specific group. Furthermore, data mining
techniques are not 100 percent accurate; thus, mistakes do happen, and they can have serious
consequences.
In recent data mining projects, various major data mining techniques have been
developed and used, including association, classification, clustering, prediction,
sequential patterns, and regression.
1. Classification:
This technique is used to obtain important and relevant information about data
and metadata. It helps classify data into different classes, as sketched
below.
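For example, the following minimal sketch (assuming scikit-learn is available) trains a classifier on the built-in iris dataset and predicts the classes of unseen samples:

```python
# A minimal classification sketch using scikit-learn (assumed installed).
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.metrics import accuracy_score

# Load a labeled dataset: 150 iris flowers, 4 attributes, 3 classes.
X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# Fit a model on the training data, then predict unknown class labels.
model = KNeighborsClassifier(n_neighbors=5)
model.fit(X_train, y_train)
print("Accuracy:", accuracy_score(y_test, model.predict(X_test)))
```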
2. Clustering:
Clustering is the division of information into groups of connected objects.
Describing the data by a few clusters inevitably loses certain fine details, but
achieves simplification: it models data by its clusters. Historically, data modeling
has put clustering in a framework rooted in statistics, mathematics, and
numerical analysis. From a machine learning point of view, clusters correspond to
hidden patterns, the search for clusters is unsupervised learning, and the
resulting framework represents a data concept. From a practical point of view,
clustering plays an outstanding role in data mining applications such as
scientific data exploration, text mining, information retrieval, spatial database
applications, CRM, web analysis, computational biology, medical diagnostics,
and much more.
In other words, we can say that cluster analysis is a data mining technique for
identifying similar data. This technique helps to recognize the differences and
similarities between data. Clustering is similar to classification, but it
involves grouping chunks of data together based on their similarities.
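As a minimal sketch (again assuming scikit-learn), k-means groups unlabeled points into a chosen number of clusters:

```python
# A minimal clustering sketch using k-means from scikit-learn.
import numpy as np
from sklearn.cluster import KMeans

# Unlabeled 2-D points forming two loose groups.
points = np.array([[1, 2], [1, 4], [2, 3],
                   [8, 8], [9, 10], [8, 9]])

# Partition the points into 2 clusters of similar objects.
kmeans = KMeans(n_clusters=2, n_init=10, random_state=0).fit(points)
print(kmeans.labels_)           # cluster assignment per point
print(kmeans.cluster_centers_)  # the two cluster centroids
```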
3. Regression:
Regression analysis is the data mining process used to identify and analyze
the relationship between variables in the presence of other factors. It
is used to predict the value of a continuous variable. Regression is primarily a
form of planning and modeling. For example, we might use it to project certain
costs, depending on other factors such as availability, consumer demand, and
competition. Primarily, it gives the exact relationship between two or more
variables in the given dataset.
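For instance, a minimal sketch (assuming scikit-learn) fitting a straight line through hypothetical spend-versus-sales data:

```python
# A minimal regression sketch: fit a line y ≈ a*x + b.
import numpy as np
from sklearn.linear_model import LinearRegression

# Hypothetical data: advertising spend (x) vs. sales (y).
x = np.array([[1.0], [2.0], [3.0], [4.0], [5.0]])
y = np.array([2.1, 4.2, 5.9, 8.1, 9.8])

model = LinearRegression().fit(x, y)
print("slope:", model.coef_[0], "intercept:", model.intercept_)
print("predicted sales at x=6:", model.predict([[6.0]])[0])
```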
4. Association Rules:
This data mining technique helps to discover a link between two or more items. It
finds hidden patterns in the dataset.
Association rules are if-then statements that help show the probability of
interactions between data items within large datasets in different types of
databases. Association rule mining has several applications and is commonly
used to discover sales correlations in transactional data or in medical datasets.
The way the algorithm works is that you have various data, for example, a list of
grocery items that you have been buying for the last six months. It calculates the
percentage of items being purchased together, using three standard measures
(a small computation sketch follows this list):
o Support:
This measure tells how often items A and B are purchased together, relative
to the overall dataset.
Support(A → B) = (transactions containing both A and B) / (all transactions)
o Confidence:
This measure tells how often item B is purchased when item A is purchased
as well.
Confidence(A → B) = (transactions containing both A and B) / (transactions containing A)
o Lift:
This measure compares the confidence of the rule with how often item B is
purchased on its own; a lift greater than 1 means A and B appear together
more often than expected by chance.
Lift(A → B) = Confidence(A → B) / Support(B)
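The following self-contained sketch computes all three measures for a hypothetical rule {milk} → {bread} over a toy list of transactions:

```python
# Computing support, confidence, and lift for the rule {milk} -> {bread}
# over a hypothetical list of grocery transactions.
transactions = [
    {"milk", "bread", "butter"},
    {"milk", "bread"},
    {"bread", "biscuits"},
    {"milk", "bread", "eggs"},
    {"milk", "eggs"},
]

n = len(transactions)
count_a = sum(1 for t in transactions if "milk" in t)             # A = milk
count_b = sum(1 for t in transactions if "bread" in t)            # B = bread
count_ab = sum(1 for t in transactions if {"milk", "bread"} <= t)

support = count_ab / n              # P(A and B)
confidence = count_ab / count_a     # P(B | A)
lift = confidence / (count_b / n)   # P(B | A) / P(B)
print(f"support={support:.2f} confidence={confidence:.2f} lift={lift:.2f}")
```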
5. Outlier Detection:
This type of data mining technique relates to observing data items in the
dataset that do not match an expected pattern or expected behavior. This
technique can be used in various domains such as intrusion detection, fraud
detection, etc. It is also known as Outlier Analysis or Outlier Mining. An outlier
is a data point that diverges too much from the rest of the dataset. The majority
of real-world datasets contain outliers. Outlier detection plays a significant
role in the data mining field and is valuable in numerous areas such as
network intrusion identification, credit or debit card fraud detection, detecting
outlying values in wireless sensor network data, etc.
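As a minimal sketch of one simple statistical approach, the snippet below flags points whose z-score (distance from the mean in standard deviations) exceeds a threshold:

```python
# Flag outliers whose z-score exceeds a chosen threshold.
import statistics

values = [10.1, 9.8, 10.3, 10.0, 9.9, 25.0, 10.2]  # 25.0 is suspicious
mean = statistics.mean(values)
stdev = statistics.stdev(values)

threshold = 2.0  # points more than 2 standard deviations away
outliers = [v for v in values if abs(v - mean) / stdev > threshold]
print(outliers)  # expected: [25.0]
```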
6. Sequential Patterns:
Sequential pattern mining is a data mining technique specialized for evaluating
sequential data to discover sequential patterns. It comprises finding
interesting subsequences in a set of sequences, where the value of a sequence
can be measured in terms of different criteria such as length, occurrence
frequency, etc.
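The core question is how many sequences contain a given subsequence (items in order, gaps allowed). The helper below is a toy simplification of that counting step, not an implementation of a full sequential pattern algorithm such as GSP or PrefixSpan:

```python
# Count how many customer sequences contain a candidate subsequence
# (items in order, gaps allowed). A toy simplification of the counting
# step inside real algorithms such as GSP or PrefixSpan.
def contains_subsequence(sequence, pattern):
    it = iter(sequence)
    # Each membership test advances the iterator, enforcing order.
    return all(item in it for item in pattern)

sequences = [
    ["camera", "memory_card", "tripod"],
    ["phone", "camera", "case", "memory_card"],
    ["camera", "phone"],
]
pattern = ["camera", "memory_card"]
support = sum(contains_subsequence(s, pattern) for s in sequences)
print(f"{support} of {len(sequences)} sequences contain {pattern}")
```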
Key factors:
Entropy:
Entropy is a common way to measure the impurity of a dataset: the higher the
entropy, the more random the data.
Information Gain:
Information Gain refers to the decline in entropy after the dataset is split. It is
also called Entropy Reduction. Building a decision tree is all about discovering
attributes that return the highest information gain.
In short, a decision tree is like a flowchart diagram, with the terminal nodes representing
decisions. Starting with the whole dataset, we measure the entropy to find a way to segment the
set, and we keep splitting until the data in each segment belongs to the same class.
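A short self-contained sketch of these two quantities, using the standard formulas H = -Σ p·log2(p) and Gain = H(parent) - Σ (|child|/|parent|)·H(child) on hypothetical class labels:

```python
# Entropy and information gain for a hypothetical binary split.
from math import log2
from collections import Counter

def entropy(labels):
    n = len(labels)
    return -sum((c / n) * log2(c / n) for c in Counter(labels).values())

parent = ["yes"] * 5 + ["no"] * 5  # maximally impure: entropy = 1.0
left, right = ["yes"] * 4 + ["no"], ["yes"] + ["no"] * 4

# Gain = parent entropy minus the weighted entropy of the children.
gain = entropy(parent) - sum(
    len(child) / len(parent) * entropy(child) for child in (left, right)
)
print(f"parent entropy={entropy(parent):.3f}, information gain={gain:.3f}")
```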
Why are decision trees useful?
It enables us to analyze the possible consequences of a decision thoroughly.
It helps us to make the best decisions based on existing data and reasonable
speculation.
In other words, we can say that a decision tree is a hierarchical tree structure
that can be used to split an extensive collection of records into smaller sets of
classes by applying a sequence of simple decision rules. A decision tree
model comprises a set of rules for partitioning a large heterogeneous population
into smaller, more homogeneous, mutually exclusive classes. The attributes of
the classes can be any type of variable: nominal, ordinal, binary, or quantitative;
the classes, in contrast, must be of a qualitative type, such as categorical,
ordinal, or binary. In brief, given data of attributes together with their classes, a
decision tree produces a set of rules that can be used to identify the class. One
rule is applied after another, resulting in a hierarchy of segments within
segments. The hierarchy is known as the tree, and each segment is called
a node. With each progressive division, the members of the resulting sets
become more and more similar to each other. Hence, the algorithm used to build
a decision tree is referred to as recursive partitioning. A well-known such
algorithm is CART (Classification and Regression Trees).
In the tree-building algorithm, initially D is the entire set of training tuples and
their associated class labels (the input training data).
Compared to other algorithms, decision trees need less effort for data
preparation during pre-processing.
A decision tree does not require standardization of the data.
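As a minimal sketch (assuming scikit-learn, whose DecisionTreeClassifier implements a CART-style algorithm), the recursive partitioning and the resulting rules can be inspected directly:

```python
# A CART-style decision tree via scikit-learn, with its rules printed.
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier, export_text

X, y = load_iris(return_X_y=True)

# criterion="entropy" makes each split maximize information gain.
tree = DecisionTreeClassifier(criterion="entropy", max_depth=2, random_state=0)
tree.fit(X, y)

# Each if/else below is one decision rule; leaves are the class segments.
print(export_text(tree, feature_names=load_iris().feature_names))
```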
Neural Network:
While researchers have created numerous different neural network architectures, the
most successful applications of neural networks in data mining have been multilayer
feedforward networks. These are networks in which there is an input layer consisting of
nodes that simply accept the input values, followed by successive layers of nodes that
are neurons. The outputs of neurons in a layer are inputs to neurons in the next layer. The
last layer is called the output layer. Layers between the input and output layers are
known as hidden layers.
As you know, there are two types of supervised learning: regression and classification.
In a regression-type problem, the neural network is used to predict a numerical quantity:
there is one neuron in the output layer, and its output is the prediction. In a
classification-type problem, on the other hand, the output layer has as many nodes as
the number of classes, and the output-layer node with the largest output value gives the
network's estimate of the class for a given input. In the special case of two classes, it is
common to have just one node in the output layer, the classification between the two
classes being made by applying a cut-off to the output value at the node.
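A minimal sketch (assuming scikit-learn) of a multilayer feedforward network used for classification, with one hidden layer:

```python
# A multilayer feedforward network for classification (scikit-learn).
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.neural_network import MLPClassifier

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# Input layer: 4 attribute nodes; one hidden layer of 8 neurons;
# output layer: one node per class (3 classes for iris).
net = MLPClassifier(hidden_layer_sizes=(8,), max_iter=2000, random_state=0)
net.fit(X_train, y_train)
print("test accuracy:", net.score(X_test, y_test))
```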
Neural networks help in mining large amounts of data in various sectors such as retail,
banking (fraud detection), bioinformatics (genome sequencing), etc. Finding useful
information hidden in large data is both very challenging and very necessary.
Data mining uses neural networks to harvest information from the large datasets of data
warehousing organizations, which helps users in decision making.
Some of the Applications of Neural Network In Data Mining are given below:
Fraud Detection: Fraudsters have been exploiting businesses and banks for their
own financial gain for many years, and the problem is growing in today's modern
world because advancing technology makes fraud relatively easy to commit. On the
other hand, technology also helps in fraud detection, and here neural networks
help us a great deal.
Healthcare: In healthcare, neural networks help us diagnose diseases. There are
many diseases, and large datasets hold records of them. With neural networks
and these records, we can diagnose these diseases at an early stage, as soon
as possible.
Different Neural Network Method in Data Mining