Data Mining - Digital Notes (Unit I To V)
(MR20-1CS0241)
UNIT 1
INTRODUCTION TO DATA MINING:
Introduction to Data Mining – Data Mining Tasks – Major Issues in Data Mining – Data Preprocessing –
Data sets, Association Rule Mining: Efficient and Scalable Frequent Item set Mining Methods – Mining
Various Kinds of Association Rules.
UNIT 2
CLASSIFICATION AND PREDICTION: Decision Tree Introduction, Bayesian Classification,
Rule Based Classification, Classification by Back propagation, Support Vector Machines,
Associative Classification, classification using frequent patterns, Lazy Learners, Other
Classification Methods: Genetic Algorithm.
UNIT 3
CLUSTERING ANALYSIS: Types of Data in Cluster Analysis, Partitioning Methods,
Hierarchical methods, Density-Based Methods, Grid-Based Methods, Probabilistic Model-Based
Clustering, Clustering High-Dimensional Data, Clustering with Constraint, Outliers and Outlier
Analysis.
UNIT 4
WEB AND TEXT MINING: Introduction, web mining, web content mining, web structure
mining, web usage mining, Text mining, unstructured text, episode rule discovery for texts,
hierarchy of categories, text clustering.
UNIT 5
TEMPORAL AND SPATIAL DATA MINING: Introduction; Temporal Data Mining ,
Temporal Association Rules, Sequence Mining, GSP algorithm, SPADE,SPIRIT Episode
Discovery, Time Series Analysis, Spatial Mining, Spatial Mining Tasks, Spatial Clustering, Data
Mining Applications: Data Mining for Retail and Telecommunication industries.
UNIT 1
INTRODUCTION TO DATA MINING
Data mining refers to extracting or mining knowledge from large amounts of data. The term is
actually a misnomer. The overall goal of the data mining process is to extract information from a data
set and transform it into an understandable structure for further use. The key properties of data
mining are:
Automatic discovery of patterns
Prediction of likely outcomes
Creation of actionable information
Focus on large datasets and databases
Data mining derives its name from the similarities between searching for valuable business
information in a large database (for example, finding linked products in gigabytes of store scanner
data) and mining a mountain for a vein of valuable ore. Both processes require either sifting through
an immense amount of material, or intelligently probing it to find exactly where the value resides.
Given databases of sufficient size and quality, data mining technology can generate new business
opportunities by providing these capabilities:
Automated prediction of trends and behaviors. Data mining automates the process of finding
predictive information in large databases. Questions that traditionally required extensive hands-on
analysis can now be answered directly from the data — quickly. A typical example of a predictive
problem is targeted marketing. Data mining uses data on past promotional mailings to identify the
targets most likely to maximize return on investment in future mailings. Other predictive problems
include forecasting bankruptcy and other forms of default, and identifying segments of a population
likely to respond similarly to given events.
Automated discovery of previously unknown patterns. Data mining tools sweep through
databases and identify previously hidden patterns in one step. An example of pattern discovery is the
analysis of retail sales data to identify seemingly unrelated products that are often purchased
together. Other pattern discovery problems include detecting fraudulent credit card transactions and
identifying anomalous data that could represent data entry keying errors.
Data Mining Tasks:
Data mining is the procedure of selecting, exploring, and modeling large quantities of
data to find regularities or relations that are at first unknown, in order to obtain clear and beneficial
results for the owner of the database.
Data mining is an interdisciplinary field, the assemblage of a set of disciplines such as database
systems, statistics, machine learning, visualization, and data science. Depending on the data mining
methods used, approaches from other disciplines can be applied, including neural networks, fuzzy and
rough set theory, knowledge representation, inductive logic programming, or high-performance
computing.
Depending on the types of data to be mined or on the given data mining application, the data
mining system may also integrate methods from spatial data analysis, information retrieval, pattern
recognition, image analysis, signal processing, computer graphics, network technology,
economics, business, bioinformatics, or psychology.
A data mining query language can be designed to incorporate these primitives, enabling users to
interact flexibly with data mining systems. A data mining query language provides a foundation on
which user-friendly graphical interfaces can be constructed. It also promotes a data mining system's
communication with other data systems and its integration with the overall data processing
environment.
Designing an inclusive data mining language is challenging because data mining covers a wide
spectrum of tasks, from data characterization to evolution analysis. Each task has different
requirements. The design of an effective data mining query language requires a broad understanding of the
power, limitations, and underlying structure of the different types of data mining tasks.
Data mining functionalities are used to define the kinds of patterns that are to be discovered in data
mining tasks. In general, data mining tasks can be classified into two types: descriptive and
predictive. Descriptive mining tasks characterize the general properties of the data in the database, while
predictive mining tasks perform inference on the current data in order to make predictions.
The major components of data mining are as follows −
Databases − This is one or a set of databases, data warehouses, spreadsheets, and another
type of data repository where data cleaning and integration techniques can be implemented.
Data warehouse server − This component fetches the relevant records, based on the user's request,
from a data warehouse.
Knowledge base − This is the domain knowledge that is used to guide the search for interesting
patterns.
Data mining engine − This consists of functional modules used to perform tasks such as
classification, association, cluster analysis, etc.
Pattern evaluation module − This component uses interestingness measures that
interact with the data mining modules to focus the search towards interesting patterns.
User interface − This interface enables users to interact with the system by describing a data
mining function or a query through the graphical user interface.
Organizations have access to more data now than they have ever had before. However, making sense
of the huge volumes of structured and unstructured data to implement organization-wide
improvements can be extremely challenging because of the sheer amount of information. If not
properly addressed, this challenge can minimize the benefits of all the data.
Data mining is the process by which organizations detect patterns in data for insights relevant to their
business needs. It‘s essential for both business intelligence and data science. There are many data
mining techniques organizations can use to turn raw data into actionable insights. These involve
everything from cutting-edge artificial Intelligence to the basics of data preparation, which are both
key for maximizing the value of data investments.
1. Data cleaning and preparation
The business value of data cleaning and preparation is self-evident. Without this first step, data is
either meaningless to an organization or unreliable due to its quality. Companies must be able to trust
their data, the results of its analytics, and the actions taken from those results.
These steps are also necessary for data quality and proper data governance.
2. Tracking patterns
Tracking patterns is a fundamental data mining technique. It involves identifying and monitoring
trends or patterns in data to make intelligent inferences about business outcomes. Once an
organization identifies a trend in sales data, for example, there‘s a basis for taking action to capitalize
on that insight. If it‘s determined that a certain product is selling more than others for a particular
demographic, an organization can use this knowledge to create similar products or services, or
simply better stock the original product for this demographic.
3. Classification
Classification data mining techniques involve analyzing the various attributes associated with
different types of data. Once organizations identify the main characteristics of these data types,
organizations can categorize or classify related data. Doing so is critical for identifying, for example,
personally identifiable information organizations may want to protect or redact from documents.
4. Association
Association is a data mining technique related to statistics. It indicates that certain data (or events
found in data) are linked to other data or data-driven events. It is similar to the notion of co-
occurrence in machine learning, in which the likelihood of one data-driven event is indicated by the
presence of another.
The statistical concept of correlation is also similar to the notion of association. This means that the
analysis of data shows that there is a relationship between two data events, such as the fact that the
purchase of hamburgers is frequently accompanied by that of French fries.
5. Outlier detection
Outlier detection determines any anomalies in datasets. Once organizations find aberrations in their
data, it becomes easier to understand why these anomalies happen and prepare for any future
occurrences to best achieve business objectives. For instance, if there‘s a spike in the usage of
transactional systems for credit cards at a certain time of day, organizations can capitalize on this
information by figuring out why it‘s happening to optimize their sales during the rest of the day.
6. Clustering
Clustering is an analytics technique that relies on visual approaches to understanding data. Clustering
mechanisms use graphics to show where the distribution of data is in relation to different types of
metrics. Clustering techniques also use different colors to show the distribution of data. Graph
approaches are ideal for using cluster analytics. With graphs and clustering in particular, users can
visually see how data is distributed to identify trends that are relevant to their business objectives.
7. Regression
Regression techniques are useful for identifying the nature of the relationship between variables in a
dataset. Those relationships could be causal in some instances, or just simply correlate in others.
Regression is a straightforward white box technique that clearly reveals how variables are related.
Regression techniques are used in aspects of forecasting and data modeling.
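As a small, hedged illustration (not part of the original notes), the sketch below fits a simple linear regression by ordinary least squares; the x and y values are invented purely for demonstration.

```python
# Minimal sketch: fitting a simple linear regression with NumPy.
# The data below is made up purely for illustration.
import numpy as np

# advertising spend (x) vs. sales (y) -- hypothetical values
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.1, 4.3, 6.2, 8.1, 9.9])

# Fit y = slope * x + intercept by ordinary least squares.
slope, intercept = np.polyfit(x, y, deg=1)
print(f"y is approximately {slope:.2f} * x + {intercept:.2f}")
```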
8. Prediction
Prediction is a very powerful aspect of data mining that represents one of four branches of analytics.
Predictive analytics use patterns found in current or historical data to extend them into the future.
Thus, it gives organizations insight into what trends will happen next in their data. There are several
different approaches to using predictive analytics. Some of the more advanced involve aspects of
machine learning and artificial intelligence. However, predictive analytics doesn't necessarily depend
on these techniques; it can also be facilitated with more straightforward algorithms.
9. Sequential patterns
This data mining technique focuses on uncovering a series of events that takes place in sequence. It‘s
particularly useful for data mining transactional data. For instance, this technique can reveal what
items of clothing customers are more likely to buy after an initial purchase of say, a pair of shoes.
Understanding sequential patterns can help organizations recommend additional items to customers
to spur sales.
10. Decision trees
Decision trees are a specific type of predictive model that lets organizations effectively mine data.
Technically, a decision tree is part of machine learning, but it is more popularly known as a white
box machine learning technique because of its extremely straightforward nature.
A decision tree enables users to clearly understand how the data inputs affect the outputs. When
various decision tree models are combined they create predictive analytics models known as a
random forest. Complicated random forest models are considered black box machine learning
techniques, because it‘s not always easy to understand their outputs based on their inputs. In most
cases, however, this basic form of ensemble modeling is more accurate than using decision trees on
their own.
11. Statistical techniques
Statistical techniques are at the core of most analytics involved in the data mining process. The
different analytics models are based on statistical concepts, which output numerical values that are
applicable to specific business objectives. For instance, neural networks use complex statistics based
on different weights and measures to determine if a picture is a dog or a cat in image recognition
systems.
Statistical models represent one of two main branches of artificial intelligence. The models for some
statistical techniques are static, while others involving machine learning get better with time.
12. Visualization
Data visualizations are another important element of data mining. They grant users insight into data
based on sensory perceptions that people can see. Today‘s data visualizations are dynamic, useful for
streaming data in real-time, and characterized by different colors that reveal different trends and
patterns in data.
Dashboards are a powerful way to use data visualizations to uncover data mining insights.
Organizations can base dashboards on different metrics and use visualizations to visually highlight
patterns in data, instead of simply using numerical outputs of statistical models.
13. Neural networks
A neural network is a specific type of machine learning model that is often used with AI and deep
learning. Named after the fact that they have different layers which resemble the way neurons work
in the human brain, neural networks are one of the more accurate machine learning models used
today.
Although a neural network can be a powerful tool in data mining, organizations should take caution
when using it: some of these neural network models are incredibly complex, which makes it difficult
to understand how a neural network determined an output.
14. Data warehousing
Data warehousing is an important part of the data mining process. Traditionally, data warehousing
involved storing structured data in relational database management systems so it could be analyzed
for business intelligence, reporting, and basic dashboarding capabilities. Today, there are cloud data
warehouses and data warehouses in semi-structured and unstructured data stores like Hadoop. While
data warehouses were traditionally used for historic data, many modern approaches can provide an
in-depth, real-time analysis of data.
15. Long-term memory processing
Long term memory processing refers to the ability to analyze data over extended periods of time. The
historic data stored in data warehouses is useful for this purpose. When an organization can perform
analytics on an extended period of time, it‘s able to identify patterns that otherwise might be too
subtle to detect. For example, by analyzing attrition over a period of several years, an organization
may find subtle clues that could lead to reducing churn in finance.
Major Issues in Data Mining:
Mining different kinds of knowledge in databases. - The needs of different users are not the same,
and different users may be interested in different kinds of knowledge. Therefore it is necessary for
data mining to cover a broad range of knowledge discovery tasks.
Interactive mining of knowledge at multiple levels of abstraction. - The data mining process needs
to be interactive because it allows users to focus the search for patterns, providing and refining data
mining requests based on returned results.
Data mining query languages and ad hoc data mining. - Data Mining Query language that
allows the user to describe ad hoc mining tasks, should be integrated with a data warehouse
query language and optimized for efficient and flexible data mining.
Presentation and visualization of data mining results. - Once the patterns are discovered they need
to be expressed in high-level languages and visual representations. These representations should be easily
understandable by the users.
Handling noisy or incomplete data. - The data cleaning methods are required that can handle the
noise, incomplete objects while mining the data regularities. If data cleaning methods are not there
then the accuracy of the discovered patterns will be poor.
Pattern evaluation. - This refers to the interestingness of the discovered patterns. Patterns may be
uninteresting because they represent common knowledge or lack novelty, so effective measures of
pattern interestingness are needed.
Efficiency and scalability of data mining algorithms. - In order to effectively extract the
information from huge amount of data in databases, data mining algorithm must be efficient and
scalable.
Parallel, distributed, and incremental mining algorithms. - Factors such as the huge size of
databases, the wide distribution of data, and the complexity of data mining methods motivate the
development of parallel and distributed data mining algorithms. These algorithms divide the data
into partitions, which are processed in parallel. The results from the partitions are then merged.
Incremental algorithms update databases without having to mine the data again from scratch.
Data Preprocessing:
Data preprocessing techniques include data cleaning, data integration, data transformation, and data
reduction. One such technique is discretization and concept hierarchy generation, where raw data values for
attributes are replaced by ranges or higher conceptual levels. Data discretization is a form of numerosity reduction that is
very useful for the automatic generation of concept hierarchies. Discretization and concept hierarchy
generation are powerful tools for data mining, in that they allow the mining of data at multiple levels
of abstraction.
Association rule mining is a popular and well-researched method for discovering
interesting relations between variables in large databases.
It is intended to identify strong rules discovered in databases using different
measures of interestingness.
Based on the concept of strong rules, Rakesh Agrawal et al. introduced association
rules.
Problem Definition:
The problem of association rule mining is defined as follows. Let I be a set of items and D a set of
transactions, where each transaction is a set of items. An association rule is an implication of the form
X => Y, where X and Y are disjoint itemsets. The support of the rule is the fraction of transactions in D
that contain X ∪ Y, and its confidence is the fraction of transactions containing X that also contain Y.
Two further measures of interestingness are commonly used. The lift of a rule is the ratio of the
observed support to that expected if X and Y were independent. The conviction of a rule
can be interpreted as the ratio of the expected frequency that X occurs without Y
(that is to say, the frequency that the rule makes an incorrect prediction) if X and Y were
independent, divided by the observed frequency of incorrect predictions.
Efficient and Scalable Frequent Item set Mining Methods:
1. In the first iteration of the algorithm, each item is a member of the set of candidate 1-
itemsets, C1. The algorithm simply scans all of the transactions in order to count the number of
occurrences of each item.
2. Suppose that the minimum support count required is 2, that is, min_sup = 2. The set of
frequent 1-itemsets, L1, can then be determined. It consists of the candidate 1-itemsets
satisfying minimum support. In our example, all of the candidates in C1 satisfy minimum
support.
3. To discover the set of frequent 2-itemsets, L2, the algorithm uses the join L1 on L1
to generate a candidate set of 2-itemsets, C2. No candidates are removed from C2 during the
prune step because each subset of the candidates is also frequent.
4. Next, the transactions in D are scanned and the support count of each candidate itemset in C2
is accumulated.
5. The set of frequent 2-itemsets, L2, is then determined, consisting of those candidate 2-
itemsets in C2 having minimum support.
6. The generation of the set of candidate 3-itemsets, C3, begins with the join step: we first get C3 = L2 x
L2 = {{I1, I2, I3}, {I1, I2, I5}, {I1, I3, I5}, {I2, I3, I4}, {I2, I3, I5}, {I2, I4, I5}}. Based on the
Apriori property that all subsets of a frequent itemset must also be frequent, we can determine
that the four latter candidates cannot possibly be frequent.
7. The transactions in D are scanned in order to determine L3, consisting of those candidate
3-itemsets in C3 having minimum support.
8. The algorithm uses L3 x L3 to generate a candidate set of 4-itemsets, C4.
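The following is a minimal, hedged sketch of the candidate-generation-and-prune loop described in the steps above. The nine transactions are assumed toy data chosen to be consistent with the itemsets named in step 6; this is an illustration, not the worked table from the original notes.

```python
# Minimal Apriori sketch: generate candidates level by level, prune with the
# Apriori property, and keep only itemsets meeting the minimum support count.
from itertools import combinations

transactions = [  # assumed toy transactions
    {"I1", "I2", "I5"}, {"I2", "I4"}, {"I2", "I3"},
    {"I1", "I2", "I4"}, {"I1", "I3"}, {"I2", "I3"},
    {"I1", "I3"}, {"I1", "I2", "I3", "I5"}, {"I1", "I2", "I3"},
]
min_sup = 2

def support_counts(candidates):
    # Scan D once and count how many transactions contain each candidate.
    return {c: sum(1 for t in transactions if c <= t) for c in candidates}

# L1: frequent 1-itemsets
counts = support_counts({frozenset([i]) for t in transactions for i in t})
L = {c for c, n in counts.items() if n >= min_sup}
k = 2
while L:
    print(sorted(tuple(sorted(s)) for s in L))
    # Join step: Lk-1 x Lk-1, keeping only size-k unions.
    candidates = {a | b for a in L for b in L if len(a | b) == k}
    # Prune step: every (k-1)-subset of a candidate must itself be frequent.
    candidates = {c for c in candidates
                  if all(frozenset(s) in L for s in combinations(c, k - 1))}
    counts = support_counts(candidates)
    L = {c for c, n in counts.items() if n >= min_sup}
    k += 1
```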
Generating Association Rules from Frequent Item sets:
Once the frequent itemsets from transactions in a database D have been found, it is
straightforward to generate strong association rules from them.
Example: For each frequent itemset l, generate all nonempty subsets of l. For every nonempty subset s of l,
output the rule s => (l − s) if support_count(l) / support_count(s) >= min_conf, where min_conf is the
minimum confidence threshold.
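As a small, hedged illustration of this procedure, the sketch below enumerates the candidate rules from one frequent itemset; the support counts are assumed values, consistent with the toy transactions sketched earlier.

```python
# Generate rules s => (l - s) from one frequent itemset and keep the strong ones.
from itertools import combinations

support_count = {            # assumed counts for illustration
    frozenset({"I1"}): 6, frozenset({"I2"}): 7, frozenset({"I5"}): 2,
    frozenset({"I1", "I2"}): 4, frozenset({"I1", "I5"}): 2,
    frozenset({"I2", "I5"}): 2, frozenset({"I1", "I2", "I5"}): 2,
}
l = frozenset({"I1", "I2", "I5"})
min_conf = 0.7

for r in range(1, len(l)):
    for antecedent in map(frozenset, combinations(l, r)):
        conf = support_count[l] / support_count[antecedent]
        status = "strong" if conf >= min_conf else "rejected"
        print(f"{set(antecedent)} => {set(l - antecedent)}  conf={conf:.2f}  ({status})")
```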
Mining Various Kinds of Association Rules:
Mining Multilevel Association Rules:
For many applications, it is difficult to find strong associations among data items at low or
primitive levels of abstraction due to the sparsity of data at those levels.
Strong associations discovered at high levels of abstraction may represent
commonsense knowledge.
Therefore, data mining systems should provide capabilities for mining association
rules at multiple levels of abstraction, with sufficient flexibility for easy traversal
among different abstraction spaces.
Association rules generated from mining data at multiple levels of abstraction are called
multiple-level or multilevel association rules.
Multilevel association rules can be mined efficiently using concept hierarchies under a
support-confidence framework.
In general, a top-down strategy is employed, where counts are accumulated for the
calculation of frequent itemsets at each concept level, starting at concept level 1 and
working downward in the hierarchy toward the more specific concept levels, until no
more frequent itemsets can be found.
The concept hierarchy has five levels, respectively referred to as levels 0to 4, starting with level
0 at the root node for all.
Here, Level 1 includes computer, software, printer & camera, and computer
accessory.
Level 2 includes laptop computer, desktop computer, office software, antivirus software, and so on.
Level 3 includes IBM desktop computer, . . . , Microsoft office software, and so on.
Level 4 is the most specific abstraction level and consists of the raw data values.
1. Uniform Minimum Support:
The same minimum support threshold is used when mining at each level of abstraction.
The method is simple in that users are required to specify only one minimum
support threshold.
The uniform support approach, however, has some difficulties. It is unlikely that
items at lower levels of abstraction will occur as frequently as those at higher
levels of abstraction.
If the minimum support threshold is set too high, it could miss some meaningful
associationsoccurring at low abstraction levels. If the threshold is set too low, it may
generate many uninteresting associations occurring at high abstraction levels.
2. Reduced Minimum Support:
Each level of abstraction has its own minimum support threshold.
The deeper the level of abstraction, the smaller the corresponding threshold is.
For example, the minimum support thresholds for levels 1 and 2 are 5% and 3%,
respectively. In this way, "computer," "laptop computer," and "desktop computer" are all
considered frequent.
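As a hedged sketch of the reduced-minimum-support idea, the snippet below applies a separate threshold per concept level; the 5% and 3% thresholds come from the example above, while the support values themselves are invented for illustration.

```python
# Each concept level gets its own minimum support threshold.
level_min_sup = {1: 0.05, 2: 0.03}   # level -> minimum support (from the text)

# assumed support values for illustration only
observed_support = {
    ("computer", 1): 0.10,
    ("laptop computer", 2): 0.04,
    ("desktop computer", 2): 0.035,
}

for (item, level), sup in observed_support.items():
    frequent = sup >= level_min_sup[level]
    print(f"level {level}: {item:18s} support={sup:.3f} frequent={frequent}")
```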
UNIT 2
CLASSIFICATION AND PREDICTION
Classification and prediction are two forms of data analysis that can be used to extract
models describing important data classes or to predict future data trends.
Classification predicts categorical (discrete, unordered) labels, whereas prediction models
continuous-valued functions.
For example, we can build a classification model to categorize bank loan applications
as either safe or risky, or a prediction model to predict the expenditures of potential
customers on computer equipment given their income and occupation.
A predictor is constructed that predicts a continuous-valued function, or ordered
value, as opposed to a categorical label.
Regression analysis is a statistical methodology that is most often used for numeric
prediction.
Many classification and prediction methods have been proposed by researchers in
machine learning, pattern recognition, and statistics.
Most algorithms are memory resident, typically assuming a small data size. Recent
data mining research has built on such work, developing scalable classification and
prediction techniques capable of handling large disk-resident data.
Decision Tree Introduction:
Decision tree induction is the learning of decision trees from class-labeled training
tuples.
A decision tree is a flowchart-like tree structure, where
Each internal node denotes a test on an attribute.
Each branch represents an outcome of the test.
Each leaf node holds a class label.
The topmost node in a tree is the root node.
The construction of decision tree classifiers does not require any domain knowledge
or parameter setting, and therefore is appropriate for exploratory knowledge
discovery.
Decision trees can handle high dimensional data.
Their representation of acquired knowledge in tree form is intuitive and
generally easy to assimilate by humans.
The learning and classification steps of decision tree induction are simple and
fast. In general, decision tree classifiers have good accuracy.
Decision tree induction algorithms have been used for classification in many
application areas, such as medicine, manufacturing and production, financial analysis,
astronomy, and molecular biology.
Algorithm for Decision Tree Induction:
There are three possible scenarios. Let A be the splitting attribute. A has v distinct values,
{a1, a2, ..., av}, based on the training data.
1 A is discrete-valued:
• In this case, the outcomes of the test at node N correspond directly to the
known values of A.
• A branch is created for each known value, aj, of A and labeled with that
value. A need not be considered in any future partitioning of the tuples.
2 A is continuous-valued:
In this case, the test at node N has two possible outcomes, corresponding to the conditions
A <= split_point and A > split_point, respectively,
where split_point is the split point returned by the attribute selection method as part of
the splitting criterion.
3 A is discrete-valued and a binary tree must be produced:
The test at node N is of the form "A ∈ SA?", where SA is the splitting subset for A,
returned by the attribute selection method as part of the splitting
criterion. It is a subset of the known values of A.
(a) If A is discrete-valued, (b) if A is continuous-valued, and (c) if A is discrete-valued and a binary
tree must be produced.
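To make the idea of selecting a splitting attribute concrete, here is a small, hedged sketch that computes information gain (the ID3-style attribute selection measure) for one discrete-valued attribute A; the tuples below are invented purely for illustration.

```python
# Information gain Gain(A) = Info(D) - Info_A(D) for a candidate splitting attribute A.
import math
from collections import Counter

def entropy(labels):
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())

# (attribute value of A, class label) pairs -- invented for illustration
tuples = [("youth", "no"), ("youth", "no"), ("middle", "yes"),
          ("senior", "yes"), ("senior", "yes"), ("senior", "no"),
          ("middle", "yes"), ("youth", "no"), ("youth", "yes"),
          ("senior", "yes")]

labels = [c for _, c in tuples]
info_D = entropy(labels)                       # expected information before the split

# Expected information after partitioning D on the discrete-valued attribute A.
info_A = 0.0
for value in {v for v, _ in tuples}:
    subset = [c for v, c in tuples if v == value]
    info_A += len(subset) / len(tuples) * entropy(subset)

print(f"Gain(A) = {info_D - info_A:.3f}")
```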
Bayesian Classification:
• Bayesian classifiers are statistical classifiers.
• They can predictclass membership probabilities, such as the probability that a given tuple
belongs toa particular class.
• Bayesian classification is based on Bayes‘ theorem.
Bayes’ Theorem:
• Let X be a data tuple. In Bayesian terms, X is considered "evidence," and it is described by
measurements made on a set of n attributes.
• Let H be some hypothesis, such as that the data tuple X belongs to a specified class C.
• For classification problems, we want to determine P(H|X), the probability that the
hypothesis H holds given the "evidence" or observed data tuple X.
• P(H|X) is the posterior probability, or a posteriori probability, of H conditioned on X.
• Bayes' theorem is useful in that it provides a way of calculating the posterior
probability,
• P(H|X), from P(H), P(X|H), and P(X).
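A tiny numeric illustration of the theorem follows; the probability values are assumptions made up for demonstration, not figures from the notes.

```python
# Bayes' theorem: P(H|X) = P(X|H) * P(H) / P(X), with assumed probabilities.
p_h = 0.3          # prior probability of the hypothesis H (assumed)
p_x_given_h = 0.6  # likelihood of observing X when H holds (assumed)
p_x = 0.4          # overall probability of observing X (assumed)

p_h_given_x = p_x_given_h * p_h / p_x
print(f"P(H|X) = {p_h_given_x:.3f}")   # 0.450
```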
IF-THEN Rules
A rule-based classifier uses a set of IF-THEN rules for classification. An IF-THEN rule is an expression
of the form IF condition THEN conclusion, where the IF part (the rule antecedent) is a test on attribute
values and the THEN part (the rule consequent) is the class prediction.
Note − Decision tree induction can be considered as learning a set of rules
simultaneously.
The following is the sequential covering approach, where rules are learned for one class at a
time. When learning a rule for a class Ci, we want the rule to cover all the tuples from class Ci
only and no tuple from any other class.
Rule Pruning
A rule is pruned for the following reason −
The assessment of quality is made on the original set of training data. The rule may perform
well on training data but less well on subsequent data. That is why rule pruning is
required.
A rule is pruned by removing a conjunct. Rule R is pruned if the pruned version of R has
greater quality than R, as assessed on an independent set of tuples.
FOIL is one of the simple and effective methods for rule pruning. For a given rule R,
FOIL_Prune(R) = (pos − neg) / (pos + neg)
where pos and neg are the numbers of positive and negative tuples covered by R, respectively.
If the FOIL_Prune value is higher for the pruned version of R, then R is pruned.
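A one-line helper, as a hedged sketch of this measure; the pos/neg counts in the usage example are made up.

```python
# FOIL_Prune as given above: (pos - neg) / (pos + neg), where pos and neg are
# the numbers of positive and negative tuples covered by rule R.
def foil_prune(pos: int, neg: int) -> float:
    return (pos - neg) / (pos + neg)

# The pruned rule is kept only if it scores higher on an independent prune set.
print(foil_prune(pos=45, neg=5))    # original rule   -> 0.80
print(foil_prune(pos=48, neg=12))   # pruned version  -> 0.60 (keep the original)
```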
Classification by Back propagation:
Backpropagation is an algorithm that propagates the errors from the output nodes back to the input
nodes. Therefore, it is simply referred to as the backward propagation of errors. It is used in many
applications of neural networks in data mining, such as character recognition, signature
verification, etc.
Neural Network:
Neural networks are an information processing paradigm inspired by the human nervous system.
Just as the human nervous system has biological neurons, neural networks have artificial
neurons; artificial neurons are mathematical functions derived from
biological neurons. The human brain is estimated to have about 10 billion neurons, each
connected to an average of 10,000 other neurons. Each neuron receives a signal through a
synapse, which controls the effect of the signal on the neuron.
Backpropagation:
Backpropagation learns by iteratively processing the training tuples, comparing the network's
prediction for each tuple with the actual known class label or value, and propagating the prediction
error backward through the network to adjust the weights.
Backpropagation Algorithm (outline):
1. Initialize the weights and biases to small random numbers.
2. Propagate the inputs forward: compute the net input and output of each unit in the hidden and
output layers.
3. Backpropagate the error: update the weights and biases, working from the output layer back
toward the first hidden layer, so as to reduce the prediction error.
4. Repeat until the weights converge or a prespecified number of epochs is reached.
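The following is a minimal numerical sketch of these steps for a network with one hidden layer of sigmoid units, trained on the XOR function; the layer sizes, learning rate, and number of epochs are illustrative assumptions rather than values from the notes.

```python
# Tiny backpropagation example: one hidden layer, sigmoid units, squared error.
import numpy as np

rng = np.random.default_rng(0)
X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]], dtype=float)
y = np.array([[0], [1], [1], [0]], dtype=float)

W1 = rng.normal(scale=0.5, size=(2, 4)); b1 = np.zeros((1, 4))
W2 = rng.normal(scale=0.5, size=(4, 1)); b2 = np.zeros((1, 1))
lr = 0.5

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

for epoch in range(10000):
    # Forward pass: propagate the inputs through the network.
    h = sigmoid(X @ W1 + b1)
    out = sigmoid(h @ W2 + b2)

    # Backward pass: propagate the error from the output layer toward the inputs.
    err_out = (out - y) * out * (1 - out)      # delta at the output units
    err_h = (err_out @ W2.T) * h * (1 - h)     # delta at the hidden units

    # Update weights and biases in the direction that reduces the squared error.
    W2 -= lr * h.T @ err_out;  b2 -= lr * err_out.sum(axis=0, keepdims=True)
    W1 -= lr * X.T @ err_h;    b1 -= lr * err_h.sum(axis=0, keepdims=True)

print(np.round(out, 2).ravel())   # should approach [0, 1, 1, 0]
```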
Types of Backpropagation
There are two types of backpropagation networks.
Static backpropagation: Static backpropagation is a network designed to map static inputs
for static outputs. These types of networks are capable of solving static classification
problems such as OCR (Optical Character Recognition).
Recurrent backpropagation: Recurrent backpropagation is another network used for fixed-
point learning. Activation in recurrent backpropagation is fed forward until a fixed value is
reached. Static backpropagation provides an instant mapping, while recurrent
backpropagation does not provide an instant mapping.
Advantages:
It is simple, fast, and easy to program.
It has no parameters to tune other than the number of inputs.
It is Flexible and efficient.
No need for users to learn any special functions.
Disadvantages:
It is sensitive to noisy data and irregularities. Noisy data can lead to inaccurate
results.
Performance is highly dependent on the input data.
Training can take a long time.
A matrix-based approach is preferred over a mini-batch approach.
Support Vector Machines:
What are Support Vector Machines? Support Vector Machine (SVM) is a relatively
simple supervised machine learning algorithm used for classification and/or regression. It is
preferred mostly for classification but is sometimes very useful for regression as well. Basically,
SVM finds a hyper-plane that creates a boundary between the types of data. In 2-dimensional
space, this hyper-plane is nothing but a line. In SVM, we plot each data item in the dataset in an N-
dimensional space, where N is the number of features/attributes in the data. Next, find the optimal
hyperplane to separate the data. So by this, you must have understood that inherently, SVM can
only perform binary classification (i.e., choose between two classes). However, there are various
techniques to use for multi-class problems.
Support Vector Machine for Multi-Class Problems: To
perform SVM on multi-class problems, we can create a binary classifier for each class of the data.
The two results of each classifier will be:
The data point belongs to that class OR
The data point does not belong to that class.
For example, in a class of fruits, to perform multi-class classification, we can create a binary
classifier for each fruit. For, say, the 'mango' class, there will be a binary classifier to predict if it
IS a mango OR it is NOT a mango. The classifier with the highest score is chosen as the output
of the SVM.
SVM for Complex (Non-Linearly Separable) Data: SVM works very well without any
modifications for linearly separable data. Linearly separable data is any data that can be
plotted in a graph and can be separated into classes using a straight line.
We use Kernelized SVM for non-linearly separable data. Say, we have some non-linearly
separable data in one dimension. We can transform this data into two dimensions and the data
will become linearly separable in two dimensions. This is done by mapping each 1-D data point
to a corresponding 2-D ordered pair. So for any non-linearly separable data in any dimension, we
can just map the data to a higher dimension and then make it linearly separable. This is a very
powerful and general transformation. A kernel is nothing but a measure of similarity between
data points. The kernel function in a kernelized SVM tells you, that given two data points in the
original feature space, what the similarity is between the points in the newly transformed feature
space. There are various kernel functions available, but two are very popular :
Radial Basis Function Kernel (RBF): The similarity between two points in the transformed
feature space is an exponentially decaying function of the distance between the vectors
in the original input space. RBF is the default kernel used in SVM.
Polynomial Kernel: The polynomial kernel takes an additional parameter, 'degree', that
controls the model's complexity and the computational cost of the transformation.
A very interesting fact is that SVM does not actually have to perform this actual transformation
on the data points to the new high dimensional feature space. This is called the kernel trick. The
Kernel Trick: Internally, the kernelized SVM can compute these complex transformations just
in terms of similarity calculations between pairs of points in the higher dimensional feature space
where the transformed feature representation is implicit. This similarity function, which is
mathematically a kind of complex dot product is actually the kernel of a kernelized SVM. This
makes it practical to apply SVM when the underlying feature space is complex or even infinite-
dimensional. The kernel trick itself is quite complex and is beyond the scope of this
article.
Important Parameters in Kernelized SVC (Support Vector Classifier):
1. The Kernel: The kernel is selected based on the type of data and also the type of
transformation. By default, the kernel is the Radial Basis Function (RBF) kernel.
2. Gamma : This parameter decides how far the influence of a single training example reaches
during transformation, which in turn affects how tightly the decision boundaries end up
surrounding points in the input space. If there is a small value of gamma, points farther apart
are considered similar. So more points are grouped together and have smoother decision
boundaries (maybe less accurate). Larger values of gamma cause points to be closer together
(may cause overfitting).
3. The ‘C’ parameter: This parameter controls the amount of regularization applied to the data.
Large values of C mean low regularization which in turn causes the training data to fit very
well (may cause overfitting). Lower values of C mean higher regularization which causes the
model to be more tolerant of errors (may lead to lower accuracy).
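As a hedged illustration of these parameters, the sketch below uses scikit-learn's SVC on an invented non-linearly separable data set; the particular gamma and C values are arbitrary choices for demonstration, not recommended settings.

```python
# Kernelized SVM with the kernel, gamma and C parameters discussed above.
from sklearn.svm import SVC
from sklearn.datasets import make_moons

X, y = make_moons(n_samples=200, noise=0.2, random_state=42)  # non-linearly separable

clf = SVC(kernel="rbf",   # Radial Basis Function kernel (the default)
          gamma=0.5,      # reach of a single training example
          C=1.0)          # regularization: larger C means less regularization
clf.fit(X, y)
print("training accuracy:", clf.score(X, y))
```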
Associative Classification:
Associative Classification in Data Mining
Data mining is the process of discovering and extracting hidden patterns from different types of
data to help decision-makers make decisions. Associative classification is a common classification
learning method in data mining, which applies association rule detection methods and classification
to create classification models.
Association rule learning is a machine learning method for discovering interesting relationships
between variables in large databases. It is designed to detect strong rules in the database based on
some interesting metrics. For any given multi-item transaction, association rules aim to obtain
rules that determine how or why certain items are linked.
Association rules are created by searching for information on common if-then patterns and using
specific criteria with support and confidence to define what the key relationships are. They help to show
the frequency of an item in a given data set, since confidence is defined by the number of times an if-
then statement is found to be true. However, a third criterion called lift is often used to compare
expected and actual confidence. Lift shows how many times more often the if-then statement is
found to be true than would be expected by chance. Association rules are computed from itemsets made up of two or more
items. Association rules usually consist of rules that are well represented by the data.
There are different types of data mining techniques that can be used to find out the specific
analysis and result like Classification analysis, Clustering analysis, and multivariate analysis.
Association rules are mainly used to analyze and predict customer behavior.
In Classification analysis, it is mostly used to question, make decisions, and predict behavior.
In Clustering analysis, it is mainly used when no assumptions are made about possible
relationships in the data.
In Regression analysis, it is used when we want to predict a continuous dependent value from a
set of independent variables.
Bing Liu et al. were the first to propose associative classification, defining a model
whose rules are constrained so that "the right-hand side is the class attribute". An
associative classifier is a supervised learning model that uses association rules to assign a target
value.
The model generated by the association classifier and used to label new records consists of
association rules that produce class labels. Therefore, they can also be thought of as a list of "if-
then" clauses: if a record meets certain criteria (specified on the left side of the rule, also known
as the antecedent), it is marked (or scored) according to the rule's category on the right. Most
associative classifiers read the list of rules sequentially and apply the first matching rule to mark
new records. Association classifier rules inherit some metrics from association rules, such as
Support or Confidence, which can be used to rank or filter the rules in the model and evaluate
their quality.
There are different types of Associative Classification Methods, Some of them are given below.
1. CBA (Classification Based on Associations): It uses association rule techniques to classify
data, which proves to be more accurate than traditional classification techniques. It has to face
the sensitivity of the minimum support threshold. When a lower minimum support threshold is
specified, a large number of rules are generated.
2. CMAR (Classification based on Multiple Association Rules): It uses an efficient FP-tree,
which consumes less memory and space compared to Classification Based on Associations. The
FP-tree will not always fit in the main memory, especially when the number of attributes is large.
3. CPAR (Classification based on Predictive Association Rules): Classification based on
predictive association rules combines the advantages of association classification and traditional
rule-based classification. Classification based on predictive association rules uses a greedy
algorithm to generate rules directly from training data. Furthermore, classification based on
predictive association rules generates and tests more rules than traditional rule-based classifiers
to avoid missing important rules.
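The following is a small, hedged sketch of the general associative-classification idea described above: rules whose consequent is a class label are ordered by confidence, and the first matching rule labels a new record. The rules and the record are invented for illustration and do not correspond to CBA, CMAR, or CPAR specifically.

```python
# Keep only rules whose right-hand side is a class label, then classify a record
# by applying the first matching rule (rules ordered by confidence).
rules = [  # (antecedent, class label, confidence) -- assumed for illustration
    (frozenset({"income=high", "student=no"}), "buys_computer=no", 0.80),
    (frozenset({"age=youth", "student=yes"}),  "buys_computer=yes", 0.75),
    (frozenset(),                              "buys_computer=yes", 0.55),  # default rule
]

def classify(record: set) -> str:
    # Rules are read sequentially; the first rule whose antecedent is satisfied wins.
    for antecedent, label, _conf in sorted(rules, key=lambda r: -r[2]):
        if antecedent <= record:
            return label
    return "unknown"

print(classify({"age=youth", "student=yes", "income=medium"}))  # buys_computer=yes
```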
Classification using frequent patterns:
In data mining, frequent pattern mining is a major concern because it plays a major role in
finding associations and correlations. First of all, what is a frequent pattern?
A frequent pattern is a pattern which appears frequently in a data set. By identifying frequent
patterns we can observe strongly correlated items and easily identify similar
characteristics and associations among them. Frequent pattern mining leads to further
analysis like clustering, classification and other data mining tasks.
Before moving on to mining frequent patterns, we should focus on two terms, "support" and
"confidence", because they provide a measure of whether an association rule is qualified or not for a
particular data set.
Support: how often a given rule appears in the database being mined
Confidence: the number of times a given rule turns out to be true in practice
After getting a clear idea about the two terms Support and Confidence, we can move on to frequent
pattern mining. Here we focus on mining frequent patterns with candidate generation using the Apriori
algorithm, which is popularly used for association mining. Let's walk through the Apriori algorithm with
an example; it will help you to understand the concept in a clear manner. Consider a sample
transaction data set and assume that the minimum support = 2.
1. Generate Candidate set 1, do the first scan and generate One item set
In this stage, we get the sample data set and take each individual‘s count and make frequent item set
1(K = 1).
Candidate set 1
The "Candidate set 1" figure shows the support count of each individual item. Since the minimum support
is 2, item E will be removed from Candidate set 1 as an infrequent item
(disqualified).
Frequent item set from the first scan
The frequent item set based on the minimum support value is shown in the figure "Frequent
item set from the first scan" as the "One item set".
2. Generate Candidate set 2, do the second scan and generate Second item set
Through this step, you create frequent set 2 (K = 2) and take each of their support counts.
Candidate set 2
The "Candidate set 2" figure is generated by joining Candidate set 1 with itself and taking the frequency of
related occurrences. Since the minimum support is 2, itemset {B, D} will be removed from Candidate set 2 as infrequent.
3. Generate Candidate set 3, do the third scan and generate Third item set
In this iteration, create frequent set 3 (K = 3) and take the support count of each candidate. Then compare with the
minimum support value.
Candidate set 3
As you can see in the "Candidate set 3" figure, we compare Candidate set 3 with the minimum support value and generate the Third
item set. The frequent set from the third scan is the same as above.
4. Generate Candidate set 4, do the fourth scan and generate Fourth item set
By considering the frequent set, we can generate Candidate set 4 by joining Candidate set 3. The only
possible candidate is [A, B, C, D], which has support less than the minimum support of 2. Therefore we have
to stop the calculation here because no further iterations are possible. Therefore, for the
above data set, the frequent patterns are [A, B, C] and [A, C, D].
Considering one of the frequent sets, {A, B, C}, the possible association rules are as follows:
1. A => B, C
2. A, B => C
3. A, C => B
4. B => A, C
5. B, C => A
6. C => A, B
Then we assume that the minimum confidence = 50% and calculate the confidence of each possible association
rule. Rules whose confidence is less than 50% are disqualified with respect to the minimum confidence.
The rest of the association rules, which have a confidence greater than or equal to 50%, are taken as strong association rules.
Lazy Learners:
The classification methods discussed so far in this chapter—decision tree induction, Bayesian
classification, rule-based classification, classification by back propagation, support vector
machines, and classification based on association rule mining—are all examples of eager
learners. Eager learners, when given a set of training tuples, will construct a generalization (i.e.,
classification) model before receiving new (e.g., test) tuples to classify. We can think of the
learned model as being ready and eager to classify previously unseen tuples. Lazy learners, in
contrast, simply store the training tuples and wait until a test tuple arrives before performing
generalization; k-nearest-neighbor classifiers and case-based reasoners are examples of lazy learners.
k-Nearest-Neighbor Classifiers
The k-nearest-neighbor method was first described in the early 1950s. The method is labor
intensive when given large training sets, and did not gain popularity until the 1960s when
increased computing power became available. It has since been widely used in the area of pattern
recognition.
Nearest-neighbor classifiers are based on learning by analogy, that is, by comparing a given test
tuple with training tuples that are similar to it. The training tuples are described by n attributes.
Each tuple represents a point in an n-dimensional space. In this way, all of the training tuples are
stored in an n-dimensional pattern space. When given an unknown tuple, a k-nearest-neighbor
classifier searches the pattern space for the k training tuples that are closest to the unknown tuple.
These k training tuples are the k "nearest neighbors" of the unknown tuple.
"Closeness" is defined in terms of a distance metric, such as Euclidean distance. The
Euclidean distance between two points or tuples, say, X1 = (x11, x12, ..., x1n) and X2 =
(x21, x22, ..., x2n), is
dist(X1, X2) = sqrt( (x11 − x21)² + (x12 − x22)² + ... + (x1n − x2n)² )
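A minimal sketch of this distance and of a k-nearest-neighbor majority vote follows; the training tuples and the query point are assumed toy values.

```python
# Euclidean distance plus a simple k-nearest-neighbor classification by majority vote.
import math
from collections import Counter

def euclidean(x1, x2):
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x1, x2)))

train = [((1.0, 1.0), "A"), ((1.2, 0.8), "A"), ((4.0, 4.2), "B"), ((3.8, 4.0), "B")]
query, k = (1.1, 1.0), 3

neighbors = sorted(train, key=lambda t: euclidean(query, t[0]))[:k]
print(Counter(label for _, label in neighbors).most_common(1)[0][0])   # "A"
```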
Case-Based Reasoning
Case-based reasoning (CBR) classifiers use a database of problem solutions to solve new
problems. Unlike nearest-neighbor classifiers, which store training tuples as points in Euclidean
space, CBR stores the tuples or "cases" for problem solving as complex symbolic descriptions.
Business applications of CBR include problem resolution for customer service help desks, where
cases describe product-related diagnostic problems. CBR has also been applied to areas such as
engineering and law, where cases are either technical designs or legal rulings, respectively.
Medical education is another area for CBR, where patient case histories and treatments are used to
help diagnose and treat new patients.
When given a new case to classify, a case-based reasoner will first check if an identical training
case exists. If one is found, then the accompanying solution to that case is returned. If no identical
case is found, then the case-based reasoner will search for training cases having
components that are similar to those of the new case. Conceptually,
these training cases may be considered as neighbors of the new case. If cases are represented as
graphs, this involves searching for sub graphs that are similar to sub graphs within the new case.
The case-based reasoner tries to combine the solutions of the neighboring training cases in order
to propose a solution for the new case. If incompatibilities arise with the individual solutions, then
backtracking to search for other solutions may be necessary. The case-based reasoner may employ
background knowledge and problem-solving strategies in order to propose a feasible combined
solution.
UNIT 3
CLUSTERING ANALYSIS
Cluster analysis is the process of grouping a set of data objects into classes of similar objects, so that
objects within a cluster are similar to one another but dissimilar to objects in other clusters.
Applications:
Cluster analysis has been widely used in numerous applications, including market research,
pattern recognition, data analysis, and image processing.
In business, clustering can help marketers discover distinct groups in their customer bases
and characterize customer groups based on purchasing patterns.
In biology, it can be used to derive plant and animal taxonomies, categorize genes with
similar functionality, and gain insight into structures inherent in populations.
Clustering may also help in the identification of areas of similar land use in an earth
observation database and in the identification of groups of houses in a city according to
house type, value, and geographic location, as well as the identification of groups of
automobile insurance policy holders with a high average claim cost.
Clustering is also called data segmentation in some applications because clustering
partitions large data sets into groups according to their similarity.
Clustering can also be used for outlier detection. Applications of outlier detection include
the detection of credit card fraud and the monitoring of criminal activities in electronic
commerce.
The major clustering methods can be classified into the following categories:
Partitioning Methods
Hierarchical Methods
Density-Based Methods
Grid-Based Methods
Model-Based Methods
Partitioning Methods:
A partitioning method constructs k partitions of the data, where each partition represents a
cluster and k <= n. That is, it classifies the data into k groups, which together satisfy the
following requirements:
Each group must contain at least one object, and
Each object must belong to exactly one group.
The general criterion of a good partitioning is that objects in the same cluster are close or
related to each other, whereas objects of different clusters are far apart or very different.
Hierarchical Methods:
A hierarchical method creates a hierarchical decomposition of the given set of data objects.
A hierarchical method can be classified as being either agglomerative or divisive, based on
how the hierarchical decomposition is formed.
The agglomerative approach, also called the bottom-up approach, starts with each object
forming a separate group. It successively merges the objects or groups that are close to
one another, until all of the groups are merged into one or until a termination condition
holds.
The divisive approach, also called the top-down approach, starts with all of the objects in
the same cluster. In each successive iteration, a cluster is split up into smaller clusters,
until eventually each object is in one cluster, or until a termination condition holds.
Hierarchical methods suffer from the fact that once a step (merge or split) is done, it can
never be undone. This rigidity is useful in that it leads to smaller computation costs by not
having to worry about a combinatorial number of different choices.
Model-Based Methods:
Model-based methods hypothesize a model for each of the clusters and find the best fit
of the data to the given model.
A model-based algorithm may locate clusters by constructing a density function that
reflects the spatial distribution of the data points.
It also leads to a way of automatically determining the number of clusters based on
standard statistics, taking "noise" or outliers into account and thus yielding robust
clustering methods.
Tasks in Data Mining:
Clustering High-Dimensional Data
Constraint-Based Clustering
Clustering High-Dimensional Data:
It is a particularly important task in cluster analysis because many applications
require the analysis of objects containing a large number of features or dimensions.
For example, text documents may contain thousands of terms or keywords as
features, and DNA micro array data may provide information on the expression
levels of thousands of genes under hundreds of conditions.
Clustering high-dimensional data is challenging due to the curse of dimensionality.
Many dimensions may not be relevant. As the number of dimensions increases,
the data become increasingly sparse, so that the distance measurement between
pairs of points becomes meaningless and the average density of points anywhere in
the data is likely to be low. Therefore, a different clustering methodology needs to
be developed for high-dimensional data.
CLIQUE and PROCLUS are two influential subspace clustering methods, which
search for clusters in subspaces of the data, rather than over the entire data space.
Frequent pattern–based clustering, another clustering methodology, extracts distinct
frequent patterns among subsets of dimensions that occur frequently. It uses such
patterns to group objects and generate meaningful clusters.
Constraint-Based Clustering:
It is a clustering approach that performs clustering by incorporation of user-specified
or application-oriented constraints.
A constraint expresses a user‘s expectation or describes properties of the desired
clustering results, and provides an effective means for communicating with the
clustering process.
Various kinds of constraints can be specified, either by a user or as per application
requirements.
Examples include spatial clustering in the presence of obstacles and clustering under user-
specified constraints. In addition, semi-supervised clustering uses pairwise
constraints in order to improve the quality of the resulting clustering.
The k-means algorithm partitions the objects into k clusters so as to minimize the square-error
criterion, defined as
E = Σ(i=1 to k) Σ(p ∈ Ci) |p − mi|²
where E is the sum of the square error for all objects in the data set,
p is the point in space representing a given object, and
mi is the mean of cluster Ci.
The k-means algorithm is sensitive to outliers because an object with an extremely large
value may substantially distort the distribution of data. This effect is particularly
exacerbated due to the use of the square-error function.
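A hedged k-means sketch that minimizes the square-error criterion E defined above is given below; the points, the value of k, and the number of iterations are arbitrary illustrative choices.

```python
# Toy k-means: alternate assignment and mean-update steps, then report E.
import random

points = [(1.0, 1.0), (1.5, 2.0), (3.0, 4.0), (5.0, 7.0), (3.5, 5.0), (4.5, 5.0)]
k = 2
random.seed(1)
means = random.sample(points, k)

for _ in range(10):
    # Assignment step: each object goes to the cluster with the nearest mean.
    clusters = [[] for _ in range(k)]
    for p in points:
        i = min(range(k), key=lambda j: sum((a - b) ** 2 for a, b in zip(p, means[j])))
        clusters[i].append(p)
    # Update step: recompute each cluster mean m_i (keep the old mean if a cluster is empty).
    means = [tuple(sum(c) / len(c) for c in zip(*cl)) if cl else means[i]
             for i, cl in enumerate(clusters)]

E = sum(sum((a - b) ** 2 for a, b in zip(p, means[i]))
        for i, cl in enumerate(clusters) for p in cl)
print("cluster means:", means, " square error E =", round(E, 2))
```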
Instead of taking the mean value of the objects in a cluster as a reference point, we can pick
actual objects to represent the clusters, using one representative object per cluster. Each
remaining object is clustered with the representative object to which it is the most similar.
The partitioning method is then performed based on the principle of minimizing the sum of
the dissimilarities between each object and its corresponding reference point. That is, an
absolute-error criterion is used, defined as
E = Σ(i=1 to k) Σ(p ∈ Ci) |p − oi|
where E is the sum of the absolute error for all objects in the data set and oi is the
representative object of cluster Ci.
Case 1:
p currently belongs to representative object, oj. If oj is replaced by orandom as a representative object
and p is closest to one of the other representative objects, oi, i ≠ j, then p is reassigned to oi.
Case 2:
p currently belongs to representative object, oj. If oj is replaced by orandom as a representative object
and p is closest to orandom, then p is reassigned to orandom.
Case 3:
p currently belongs to representative object, oi, i ≠ j. If oj is replaced by orandom as a representative
object and p is still closest to oi, then the assignment does not change.
Case 4:
p currently belongs to representative object, oi, i ≠ j. If oj is replaced by orandom as a representative
object and p is closest to orandom, then p is reassigned
to orandom.
Four cases of the cost function for k-medoids clustering
The constraints used in constraint-based clustering can be categorized as follows.
Constraints on individual objects:
We can specify constraints on the objects to be clustered. In a real estate application, for
example, one may like to spatially cluster only those luxury mansions worth over a million
dollars. This constraint confines the set of objects to be clustered. It can easily be
handled by preprocessing, after which the problem reduces to an instance of unconstrained
clustering.
Constraints on the selection of clustering parameters:
A user may like to set a desired range for each clustering parameter. Clustering parameters
are usually quite specific to the given clustering algorithm. Examples of parameters include
k, the desired number of clusters in a k-means algorithm; or ε, the radius, and the minimum
number of points in the DBSCAN algorithm. Although such user-specified parameters may
strongly influence the clustering results, they are usually confined to the algorithm itself.
Thus, their fine tuning and processing are usually not considered a form of constraint-based
clustering.
Constraints on distance or similarity functions:
We can specify different distance or similarity functions for specific attributes of the
objects to be clustered, or different distance measures for specific pairs of objects. When
clustering sportsmen, for example, we may use different weighting schemes for height,
body weight, age, and skill level. Although this will likely change the mining results, it
may not alter the clustering process per se. However, in some cases, such changes may
make the evaluation of the distance function nontrivial, especially when it is tightly
intertwined with the clustering process.
User-specified constraints on the properties of individual clusters:
A user may like to specify desired characteristics of the resulting clusters, which may
strongly influence the clustering process.
Semi-supervised clustering based on partial supervision:
The quality of unsupervised clustering can be significantly improved using some weak
form of supervision. This may be in the form of pairwise constraints (i.e., pairs of objects
labeled as belonging to the same or different cluster). Such a constrained clustering process
is called semi-supervised clustering.
Outlier Analysis:
There exist data objects that do not comply with the general behavior or model of the data.
Such data objects, which are grossly different from or inconsistent with the remaining set
of data, are called outliers.
Many data mining algorithms try to minimize the influence of outliers or eliminate them
altogether. This, however, could result in the loss of important hidden information because
one person's noise could be another person's signal. In other words, the outliers may be of
particular interest, such as in the case of fraud detection, where outliers may indicate
fraudulent activity. Thus, outlier detection and analysis is an interesting data mining task,
referred to as outlier mining.
It can be used in fraud detection, for example, by detecting unusual usage of credit cards or
telecommunication services. In addition, it is useful in customized marketing for
identifying the spending behavior of customers with extremely low or extremely high
incomes, or in medical analysis for finding unusual responses to various medical
treatments.
Outlier mining can be described as follows: Given a set of n data points or objects and k, the
expected number of outliers, find the top k objects that are considerably dissimilar,
exceptional, or inconsistent with respect to the remaining data. The outlier mining problem
can be viewed as two sub problems:
Define what data can be considered as inconsistent in a given data set, and
Find an efficient method to mine the outliers so defined.
Types of outlier detection:
Consecutive procedures:
An example of such a procedure is the inside out procedure. Its main idea is that the object
that is least likely to be an outlier is tested first. If it is found to be an outlier, then all of the
more extreme values are also considered outliers; otherwise, the next most extreme object is
tested, and so on. This procedure tends to be more effective than block procedures.
Cell-based algorithm:
In this method, the data space is partitioned into cells with a side length equal to dmin/(2·√k), where k is the dimensionality of the data. Each cell has two layers surrounding it. The first layer is one cell thick, while the second is 2·√k − 1 cells thick, rounded up to the closest integer. The algorithm counts outliers on a cell-by-cell rather than an object-by-object basis. For a given cell, it accumulates three counts: the number of objects in the cell, in the cell and the first layer together, and in the cell and both layers together. Let's refer to these counts as cell count, cell + 1 layer count, and cell + 2 layers count, respectively.
Let M be the maximum number of objects that can exist in the dmin-neighborhood of an outlier.
An object, o, in the current cell is considered an outlier only if cell + 1 layer count is less than or equal to M. If this condition does not hold, then all of the objects in the cell can be removed from further investigation, as they cannot be outliers.
If cell + 2 layers count is less than or equal to M, then all of the objects in the cell are considered outliers. Otherwise, if this number is more than M, then it is possible that some of the objects in the cell may be outliers. To detect these outliers, object-by-object processing is used where, for each object, o, in the cell, objects in the second layer of o are examined. For objects in the cell, only those objects having no more than M points in their dmin-neighborhoods are outliers. The dmin-neighborhood of an object consists of the object's cell, all of its first layer, and some of its second layer.
A variation to the algorithm is linear with respect to n and guarantees that no more than three passes over the data set are required. It can be used for large disk-resident data sets, yet does not scale well for high dimensions.
UNIT 4
WEB AND TEXT MINING: Introduction, web mining, web content mining, web structure
mining, web usage mining, Text mining, unstructured text, episode rule discovery for texts,
hierarchy of categories, text clustering.
Web mining is the application of data mining techniques to discover useful patterns and knowledge from web data. It is commonly divided into web content mining, web structure mining, and web usage mining.
Web content mining can be used for mining useful data, information, and knowledge from web page content.
Web structure mining helps to find useful knowledge or information pattern from the
structure of hyperlinks.
Due to the heterogeneity and absence of structure in web data, automated discovery of new knowledge patterns can be challenging.
Web content mining performs scanning and mining of the text, images and groups of web
pages according to the content of the input (query), by displaying the list in search engines.
For example, if a user searches for a particular book, the search engine provides a list of suggestions.
Web usage mining is used for mining the web log records (access information of web
pages) and helps to discover the user access patterns of web pages.
The web server registers a web log entry for every access of a web page.
Analysis of similarities in web log records can be useful to identify the potential customers
for e-commerce companies.
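As a minimal web usage mining sketch (the simplified log format of "user page time" and the entries are hypothetical), per-user access patterns can be derived from web log records like this:

from collections import defaultdict

# Hypothetical simplified web log entries: user, page, time.
log_lines = [
    "u1 /home 10:01",
    "u1 /books 10:02",
    "u2 /home 10:05",
    "u1 /books 10:07",
    "u2 /cart 10:08",
]

access_counts = defaultdict(lambda: defaultdict(int))
for line in log_lines:
    user, page, _time = line.split()
    access_counts[user][page] += 1

# Per-user access pattern: the pages each user visits most often.
for user, pages in access_counts.items():
    print(user, sorted(pages.items(), key=lambda kv: -kv[1]))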
Text mining:
Text mining is the process of exploring and analyzing large amounts of unstructured text data aided
by software that can identify concepts, patterns, topics, keywords and other attributes in the data.
It's also known as text analytics, although some people draw a distinction between the two terms;
in that view, text analytics refers to the application that uses text mining techniques to sort through
data sets.
Text mining has become more practical for data scientists and other users due to the development
of big data platforms and deep learning algorithms that can analyze massive sets of unstructured
data.
Mining and analyzing text helps organizations find potentially valuable business insights in
corporate documents, customer emails, call center logs, verbatim survey comments, social network
posts, medical records and other sources of text-based data. Increasingly, text mining capabilities
are also being incorporated into AI chatbots and virtual agents that companies deploy to provide
automated responses to customers as part of their marketing, sales and customer service operations.
Doing so typically involves the use of natural language processing (NLP) technology, which
applies computational linguistics principles to parse and interpret data sets.
The upfront work includes categorizing, clustering and tagging text; summarizing data sets;
creating taxonomies; and extracting information about things like word frequencies and
relationships between data entities. Analytical models are then run to generate findings that can
help drive business strategies and operational actions.
Applications of text mining
Sentiment analysis is a widely used text mining application that can track customer sentiment about
a company. Also known as opinion mining, sentiment analysis mines text from online reviews,
social networks, emails, call center interactions and other data sources to identify common threads
that point to positive or negative feelings on the part of customers. Such information can be used to
fix product issues, improve customer service and plan new marketing campaigns, among other
things.
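As a minimal, hedged sketch of the idea (the word lists and reviews are illustrative; real systems use NLP libraries and far larger lexicons), lexicon-based sentiment scoring can be as simple as counting positive and negative words:

POSITIVE = {"good", "great", "excellent", "love", "happy"}
NEGATIVE = {"bad", "poor", "terrible", "hate", "slow"}

def sentiment_score(text):
    # Crude score: positive word count minus negative word count.
    words = text.lower().split()
    return sum(w in POSITIVE for w in words) - sum(w in NEGATIVE for w in words)

reviews = ["great product, I love it", "terrible service and slow delivery"]
for review in reviews:
    print(review, "->", sentiment_score(review))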
Hierarchy of categories:
Concept Hierarchy
1. Binning
o In binning, first sort the data and partition it into (equi-depth) bins; then one can smooth by bin means, smooth by bin medians, smooth by bin boundaries, etc. (see the binning sketch after this list).
2. Histogram analysis
o Histogram is a popular data reduction technique
o Divide data into buckets and store average (sum) for each bucket
o Can be constructed optimally in one dimension using dynamic programming
o Related to quantization problems.
3. Clustering analysis
o Partition data set into clusters, and one can store cluster representation only
o Can be very effective if data is clustered but not if data is "smeared"
o Can have hierarchical clustering and be stored in multi-dimensional index tree
structures
4. Entropy-based discretization
o Recursively selects the split point that minimizes the entropy of the resulting partitions (i.e., maximizes information gain), until a stopping criterion is met
5. Segmentation by natural partitioning
o The 3-4-5 rule can be used to segment numeric data into relatively uniform, "natural" intervals.
o If an interval covers 3, 6, 7 or 9 distinct values at the most significant digit, partition
the range into 3 equi-width intervals
o If it covers 2, 4, or 8 distinct values at the most significant digit, partition the range
into 4 intervals
o If it covers 1, 5, or 10 distinct values at the most significant digit, partition the range
into 5 intervals
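The binning sketch referred to in item 1 above (the price list is illustrative): sort the data, partition it into equi-depth bins, and smooth by bin means.

def equi_depth_bins(values, num_bins):
    # Sort the values and split them into num_bins bins of (roughly) equal size.
    data = sorted(values)
    size = len(data) // num_bins
    return [data[i * size:(i + 1) * size] if i < num_bins - 1 else data[i * size:]
            for i in range(num_bins)]

def smooth_by_bin_means(bins):
    # Replace every value in a bin by the bin mean.
    return [[round(sum(b) / len(b), 2)] * len(b) for b in bins]

prices = [4, 8, 15, 21, 21, 24, 25, 28, 34]
bins = equi_depth_bins(prices, 3)   # [[4, 8, 15], [21, 21, 24], [25, 28, 34]]
print(smooth_by_bin_means(bins))    # [[9.0, 9.0, 9.0], [22.0, 22.0, 22.0], [29.0, 29.0, 29.0]]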
Text clustering:
Text clustering groups similar documents together without using predefined labels. Documents are typically represented as term vectors (for example, TF-IDF weights), and clustering algorithms such as k-means or hierarchical clustering are then applied to organize search results, discover topics, or structure large document collections.
UNIT 5
Temporal Data Mining is the process of extracting useful information from the pool of temporal
data. It is concerned with analyzing temporal data to extract and find the temporal patterns and
regularities in the datasets.
The main objective of Temporal Data Mining is to find the temporal patterns, trends, and relations
within the data and extract meaningful information from the data to visualize how the data trend
has changed over a course of time.
Temporal Data Mining includes the processing of time-series data and sequences of data to determine and compute the values of the same attributes over multiple time points.
Temporal Association Rules:
Association rules are mined in two phases; the temporal variant adds a time interval to the rule.
Phase 1. Find every set of items (itemset) X ⊆ R that is frequent, i.e., whose frequency exceeds the established minimum support s.
Phase 2. Use the frequent itemsets X to find the rules: test, for every Y ⊂ X with Y ≠ ∅, whether the rule X \ Y ⇒ Y is satisfied with enough confidence, i.e., whether it exceeds the established minimum confidence q.
Phase 2T. Use the frequent itemsets X to find the temporal rules: verify, for every Y ⊂ X with Y ≠ ∅, whether the rule X \ Y ⇒ Y [t1, t2] is satisfied with enough confidence, in other words, whether it exceeds the minimum confidence q established in the interval [t1, t2].
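A minimal sketch of Phase 2, assuming the frequent itemsets and their support counts are already available from Phase 1 (the itemsets shown are illustrative):

from itertools import combinations

# Hypothetical output of Phase 1: frequent itemsets with support counts.
support = {
    frozenset({"bread"}): 6,
    frozenset({"milk"}): 5,
    frozenset({"bread", "milk"}): 4,
}
min_conf = 0.6

for X, sup_x in support.items():
    if len(X) < 2:
        continue
    for r in range(1, len(X)):
        for Y in map(frozenset, combinations(X, r)):
            antecedent = X - Y               # X \ Y
            conf = sup_x / support[antecedent]
            if conf >= min_conf:             # rule X \ Y => Y holds
                print(set(antecedent), "=>", set(Y), "confidence =", round(conf, 2))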
Sequence Mining:
Sequential pattern mining is the mining of frequently occurring ordered events or subsequences as patterns. An example of a sequential pattern is that customers who purchase a Canon digital camera are likely to purchase an HP color printer within a month.
For retail data, sequential patterns are useful for shelf placement and promotions. This industry, along with telecommunications and other businesses, can also use sequential patterns for targeted marketing, customer retention, and other tasks.
There are several areas in which sequential patterns can be used such as Web access pattern
analysis, weather prediction, production processes, and web intrusion detection.
Given a set of sequences, where each sequence consists of a list of events (or elements) and each event consists of a set of items, and given a user-specified minimum support threshold min_sup, sequential pattern mining discovers all frequent subsequences, i.e., the subsequences whose occurrence frequency in the set of sequences is no less than min_sup.
Let I = {I1, I2, ..., Ip} be the set of all items. An itemset is a nonempty set of items. A sequence is an ordered list of events. A sequence s is denoted <e1, e2, e3, ..., el>, where event e1 occurs before e2, which occurs before e3, and so on. Event ej is also called an element of s.
In the case of customer purchase data, an event corresponds to a shopping trip in which a customer purchases items at a particular store. The event is an itemset, i.e., an unordered list of items that the customer purchased during the trip. The itemset (or event) is denoted (x1x2···xq), where xk is an item.
An item can occur at most once in an event of a sequence, but can occur multiple times in different events of a sequence. The number of instances of items in a sequence is called the length of the sequence. A sequence with length l is called an l-sequence.
A sequence database, S, is a set of tuples (SID, s), where SID is a sequence_ID and s is a sequence. For instance, S contains the sequences for all customers of the store. A tuple (SID, s) is said to contain a sequence α if α is a subsequence of s.
This model of sequential pattern mining is an abstraction of customer-shopping sequence analysis; scalable techniques such as GSP and SPADE (discussed below) mine such databases. However, there are several sequential pattern mining applications that cannot be covered by this model. For instance, when analyzing Web clickstream sequences, gaps between clicks become important if one wants to predict what the next click might be.
In DNA sequence analysis, approximate patterns become useful because DNA sequences can contain (symbol) insertions, deletions, and mutations. Such diverse requirements can be viewed as constraint relaxation or constraint enforcement.
GSP algorithm:
GSP (Generalized Sequential Patterns) is an important algorithm in data mining. It is used for sequence mining from large databases. Almost all sequence mining algorithms are based on the Apriori algorithm. GSP uses a level-wise paradigm for finding all the sequential patterns in the data. It starts by finding the frequent items of size one and then passes them as input to the next iteration of the algorithm. The database is scanned multiple times by this algorithm. In each iteration, GSP removes all the non-frequent itemsets, based on a threshold frequency called support: only those itemsets whose frequency is no less than the minimum support are kept.
After the first pass, GSP finds all the frequent sequences of length 1, called 1-sequences. These form the input to the next pass and are used to generate the candidate 2-sequences. At the end of this pass, GSP has found all frequent 2-sequences, which are used to generate the candidate 3-sequences, and so on. The process repeats until no more frequent sequences are found.
Basics of Sequential Pattern (GSP) Mining:
Sequence: A sequence is formally defined as the ordered set of items {s1, s2, s3, …, sn}. As
the name suggests, it is the sequence of items occurring together. It can be considered as a
transaction or purchased items together in a basket.
Subsequence: The subset of the sequence is called a subsequence. Suppose {a, b, g, q, y, e,
c} is a sequence. A subsequence of this can be {a, b, c} or {y, e}. Observe that a subsequence does not necessarily consist of consecutive items of the sequence. From the sequences in the database, subsequences are found, from which the generalized sequential patterns are obtained at the end.
Sequence pattern: A subsequence is called a sequential pattern when it is found in multiple sequences. The goal of the GSP algorithm is to mine the sequential patterns from a large database. The database consists of sequences. A subsequence is a pattern when its frequency is equal to or more than the "support" (minimum support) value. For example, the pattern <a, b> is a sequential pattern mined from the sequences {b, x, c, a}, {a, b, q}, and {a, u, b}, since a is followed by b in two of the three sequences.
An example sequence database (fragment; each entry gives a sequence ID and its sequence):
200   <(ad)c(bcd)(abe)>
300   <(ef)(ab)(def)cb>
400   <eg(adf)cbc>
Transaction: A sequence consists of elements, which are also called transactions.
<a(ab)(ac)d(cef)> is a sequence, whereas (a), (ab), (ac), (d) and (cef) are the elements of the sequence. These elements are sometimes referred to as transactions.
An element may contain a set of items. Items within an element are unordered, and we list them alphabetically. For example, (cef) is an element and it consists of 3 items: c, e and f. Since all three items belong to the same element, their order does not matter, but we prefer to put them in alphabetical order for convenience.
The order of the elements of the sequence matters, unlike the order of items within the same transaction.
k-length Sequence:
The number of items involved in a sequence is denoted by k. A sequence of 2 items is called a 2-length sequence. This term is used when finding the 2-length candidate sequences.
Examples of 2-length sequences are {ab}, {(ab)}, {bc} and {(bc)}.
{bc} denotes a 2-length sequence where b and c belong to two different transactions. This can also be written as {(b)(c)}.
{(bc)} denotes a 2-length sequence where b and c are items belonging to the same transaction, and are therefore enclosed in the same parentheses. This can also be written as {(cb)}, because the order of items within the same transaction does not matter.
Support in k-length Sequence:
Support means the frequency: the number of sequences in the sequence database that contain a given k-length sequence is known as its support. While finding the support, the order of elements is taken into account.
Illustration:
Suppose we have 2 sequences in the database.
s1: <a(bc)b(cd)>
s2: <b(ab)abc(de)>
We need to find the support of {ab} and {(bc)}
Finding support of {ab}:
Here a and b belong to different elements, so their order matters: we need an a in some element followed by a b in a later element.
s1: <a(bc)b(cd)>. The a in the first element is followed by the b in the third element, so {ab} is contained in s1.
s2: <b(ab)abc(de)>. The a in (ab) (or the later singleton a) is also followed by the later singleton b, so {ab} is contained in s2 as well.
Hence, support of {ab} is 2.
Finding support of {(bc)}:
Since b and c are present in the same element, their order does not matter, but they must occur together in one element.
s1: <a(bc)b(cd)> contains the element (bc), so this is one occurrence.
s2: <b(ab)abc(de)> may look like a match, but it is not: b and c are present only in different elements here, so we do not count it.
Hence, support of {(bc)} is 1.
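A minimal sketch of this support computation, representing each sequence as a list of elements and each element as a set of items (matching the illustration above; containment follows the standard subsequence definition):

def contains(sequence, pattern):
    # True if pattern (a list of item sets) is a subsequence of sequence.
    i = 0
    for element in sequence:
        if i < len(pattern) and pattern[i] <= element:
            i += 1
    return i == len(pattern)

def support(database, pattern):
    return sum(contains(s, pattern) for s in database)

s1 = [{"a"}, {"b", "c"}, {"b"}, {"c", "d"}]
s2 = [{"b"}, {"a", "b"}, {"a"}, {"b"}, {"c"}, {"d", "e"}]
database = [s1, s2]

print(support(database, [{"a"}, {"b"}]))   # {ab}: a followed by b in a later element -> 2
print(support(database, [{"b", "c"}]))     # {(bc)}: b and c in the same element -> 1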
How do we join L(k-1) with L(k-1) to give Ck (for example, L1 with L1 to give C2)?
L(k-1) is the final set of frequent (k-1)-length sequences after pruning; all the entries left in it have support no less than the threshold. Two sequences s1 and s2 can be joined if the sequence obtained by removing the first item of s1 is the same as the sequence obtained by removing the last item of s2, as the following cases illustrate.
Case 1: Join {ab} and {ac}
s1: {ab}, s2: {ac}
After removing a from s1 and c from s2.
s1’={b}, s2’={a}
s1′ and s2′ are not the same, so s1 and s2 cannot be joined.
Case 2: Join {ab} and {be}
s1: {ab}, s2: {be}
After removing a from s1 and e from s2.
s1’={b}, s2’={b}
s1′ and s2′ are exactly the same, so s1 and s2 can be joined.
s1 + s2 = {abe}
Case 3: Join {(ab)} and {be}
s1: {(ab)}, s2: {be}
After removing a from s1 and e from s2.
s1’={(b)}, s2’={(b)}
s1′ and s2′ are exactly the same, so s1 and s2 can be joined.
s1 + s2 = {(ab)e}
s1 and s2 are joined in such a way that items belong to correct elements or transactions.
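A minimal sketch of this join step (a simplification of the full GSP join, shown only for patterns like those in the cases above): a sequence is a list of elements, each element a tuple of items; s1 and s2 are joined when dropping the first item of s1 and the last item of s2 leaves the same remainder.

def drop_first_item(seq):
    head = seq[0]
    rest = [head[1:]] if len(head) > 1 else []
    return rest + list(seq[1:])

def drop_last_item(seq):
    tail = seq[-1]
    rest = [tail[:-1]] if len(tail) > 1 else []
    return list(seq[:-1]) + rest

def join(s1, s2):
    # Join two (k-1)-sequences into a k-length candidate, or return None.
    if drop_first_item(s1) != drop_last_item(s2):
        return None
    last = s2[-1][-1]
    if len(s2[-1]) > 1:                     # last item shares s2's final element
        return s1[:-1] + [s1[-1] + (last,)]
    return s1 + [(last,)]                   # last item forms its own element

print(join([("a",), ("b",)], [("b",), ("e",)]))   # Case 2: {ab} + {be} -> [('a',), ('b',), ('e',)]
print(join([("a", "b")],     [("b",), ("e",)]))   # Case 3: {(ab)} + {be} -> [('a', 'b'), ('e',)]
print(join([("a",), ("b",)], [("a",), ("c",)]))   # Case 1: {ab} + {ac} -> None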
Pruning Phase: While building Ck (the candidate set of k-length sequences), we delete a candidate sequence that has a contiguous (k-1)-subsequence whose support count is less than the minimum support (threshold). Also, we delete a candidate sequence that has any subsequence without minimum support.
{abg} is a candidate sequence of C3.
To check whether {abg} is a proper candidate or not, without computing its support, we check the support of its subsets, because the subsets of a 3-length sequence are 1-length and 2-length sequences, and we build the candidate sets incrementally: 1-length, 2-length, and so on.
Subsets of {abg} are: {ab}, {bg} and {ag}.
Check support of all three subsets. If any of them have support less than minimum support then
delete the sequence {abg} from the set C3 otherwise keep it.
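A minimal sketch of this pruning check for 1-item elements, as in the {abg} example (frequent_2, the set of frequent 2-sequences, is hypothetical):

from itertools import combinations

def survives_pruning(candidate, frequent_smaller):
    # Keep a k-length candidate only if all of its (k-1)-subsequences are frequent.
    k = len(candidate)
    return all(sub in frequent_smaller for sub in combinations(candidate, k - 1))

frequent_2 = {("a", "b"), ("a", "g")}                 # note: ("b", "g") is missing
print(survives_pruning(("a", "b", "g"), frequent_2))  # False, so {abg} is pruned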
SPADE:
SPADE (Sequential PAttern Discovery using Equivalence classes) mines sequential patterns using a vertical id-list database format: for every item, it keeps the list of (sequence-id, event-id) pairs in which the item occurs. Frequent sequences are grown by joining these id-lists, and the search space is decomposed into prefix-based equivalence classes that can be processed independently in memory, so only a small number of database scans are required.
SPIRIT Episode Discovery:
SPIRIT (Sequential Pattern mIning with Regular expressIon consTraints) is a family of algorithms that mine only those sequential patterns satisfying a user-specified regular-expression constraint; the constraint is pushed into the mining process to prune the search space. Episode discovery, in contrast, works on a single long sequence of events and finds frequent episodes, i.e., collections of events that occur close together (within a given time window) in a given partial order.
Time Series Analysis:
Time series analysis is a specific way of analyzing a sequence of data points collected over an
interval of time. In time series analysis, analysts record data points at consistent intervals over a set
period of time rather than just recording the data points intermittently or randomly. However, this
type of analysis is not merely the act of collecting data over time.
What sets time series data apart from other data is that the analysis can show how variables change
over time. In other words, time is a crucial variable because it shows how the data adjusts over the
course of the data points as well as the final results. It provides an additional source of information
and a set order of dependencies between the data.
Time series analysis typically requires a large number of data points to ensure consistency and
reliability. An extensive data set ensures you have a representative sample size and that analysis
can cut through noisy data. It also ensures that any trends or patterns discovered are not outliers
and can account for seasonal variance. Additionally, time series data can be used for forecasting, i.e., predicting future data based on historical data.
Time series analysis is used for non-stationary data—things that are constantly fluctuating over
time or are affected by time. Industries like finance, retail, and economics frequently use time
series analysis because currency and sales are always changing. Stock market analysis is an
excellent example of time series analysis in action, especially with automated trading algorithms.
Likewise, time series analysis is ideal for forecasting weather changes, helping meteorologists
predict everything from tomorrow's weather report to future years of climate change. Examples of
time series analysis in action include:
Weather data
Rainfall measurements
Temperature readings
Heart rate monitoring (EKG)
Brain monitoring (EEG)
Quarterly sales
Stock prices
Automated stock trading
Industry forecasts
Interest rates
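As a minimal sketch (the quarterly sales figures are illustrative), one of the simplest time series techniques is a moving-average forecast, where the next value is predicted as the mean of the last few observations:

def moving_average_forecast(series, window=3):
    # Forecast the next value as the mean of the last `window` observations.
    return sum(series[-window:]) / window

quarterly_sales = [120, 135, 150, 160, 172, 181]
print(moving_average_forecast(quarterly_sales, window=3))   # (160 + 172 + 181) / 3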
Spatial Mining:
Spatial data mining is the process of discovering interesting and useful patterns and spatial relationships that were not explicitly stored in spatial databases. In spatial data mining, analysts use geographical or spatial information to produce business intelligence or other results. Challenges involved in spatial data mining include identifying patterns or finding objects relevant to the research project.
The general tools used for spatial data mining are Clementine, See5/C5.0, and Enterprise Miner.
These tools are preferable for analyzing scientific and engineering data, astronomical data,
multimedia data, genomic data, and web data.
Apart from coordinates, spatial data may contain different types of attributes that help identify a geographical location and its characteristics.
Spatial Mining Tasks:
Typical spatial data mining tasks include spatial classification, spatial association rule mining, spatial clustering, spatial trend analysis, and spatial outlier detection.
Spatial Clustering:
Spatial clustering aims to partition spatial data into a series of meaningful subclasses, called spatial
clusters, such that spatial objects in the same cluster are similar to each other, and are dissimilar to
those in different clusters.
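A minimal spatial clustering sketch (assuming scikit-learn is available; the coordinates are illustrative projected x/y positions, e.g., store locations):

import numpy as np
from sklearn.cluster import DBSCAN

# Illustrative 2-D spatial objects (projected x/y coordinates).
locations = np.array([
    [1.0, 1.0], [1.2, 0.8], [0.9, 1.1],   # one dense group
    [5.0, 5.1], [5.2, 4.9],               # another dense group
    [9.0, 0.5],                           # an isolated object
])

# Objects close to each other (within eps, with enough neighbours) form a
# spatial cluster; the label -1 marks objects that belong to no cluster.
labels = DBSCAN(eps=0.6, min_samples=2).fit_predict(locations)
print(labels)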
Data mining for the retail industry:
What is the purpose of sales if the retailer does not know who their customers are? There is a definite need to understand the customers, and this starts by analyzing them with respect to various factors.
Finding the source through which customers got to know about the retailing platform helps in targeting the retailer's advertising to attract a completely new set of people. Finding the days on which customers frequently purchase can help in planning discount sales or special offers on festival days. The time they spend buying per order provides useful statistics to enhance growth. The amount of money spent per order helps the retailer separate the customer base into groups of high-value, medium-value, and low-value orders; this makes targeting more precise and helps in introducing customized packages depending on price. By knowing language and payment-method preferences, retailers can provide the required services to satisfy customers. Managing a good business relationship with the customer builds trust and loyalty that can bring rapid profit for the retailer, and retaining customers helps the company withstand competition from similar companies.
RFM Value:
RFM stands for Recency, Frequency, Monetary value. Recency is the most recent time at which the customer made a purchase, Frequency is how often purchases take place, and Monetary value is the amount the customer spends on purchases. RFM can increase monetization by holding on to regular and potential customers and keeping them happy with satisfying results. It can also help in winning back trailing customers who tend to reduce their purchases. The higher the RFM score, the greater the growth in sales. RFM also prevents sending too many requests to already engaged customers and helps in applying new marketing techniques to low-ordering customers. RFM helps in identifying innovative solutions.
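A minimal sketch of computing R, F and M per customer from a list of transactions (the transactions and reference date are illustrative; real deployments usually convert the raw values into 1-5 scores):

from datetime import date

# Illustrative transactions: (customer_id, purchase_date, amount).
transactions = [
    ("c1", date(2023, 1, 5), 40.0),
    ("c1", date(2023, 3, 20), 55.0),
    ("c2", date(2022, 11, 2), 20.0),
    ("c2", date(2023, 3, 25), 35.0),
    ("c2", date(2023, 3, 28), 15.0),
]
today = date(2023, 4, 1)

rfm = {}
for cust, day, amount in transactions:
    last, freq, monetary = rfm.get(cust, (None, 0, 0.0))
    last = day if last is None or day > last else last
    rfm[cust] = (last, freq + 1, monetary + amount)

for cust, (last_purchase, frequency, monetary) in rfm.items():
    recency_days = (today - last_purchase).days
    print(cust, "recency:", recency_days, "frequency:", frequency, "monetary:", monetary)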
Market basket analysis:
Market basket analysis is a technique used to study and analyze the shopping behavior of customers in order to increase revenue and sales. This is done by analyzing the data of a particular customer: their shopping history, frequently bought items, and items that are typically bought together as a combination.
A very good example is the loyalty card issued by a retailer to customers. From the customer's point of view, the card is a way to keep track of future discounts, incentive criteria, and the history of transactions. From the retailer's point of view, however, the loyalty card is a rich source of transaction details for market basket analysis.
This analysis can be achieved with data science techniques or various algorithms, and even without advanced technical skills: a Microsoft Excel spreadsheet can be used to analyze customer purchases and frequently bought or frequently grouped items, with the spreadsheets organized by the IDs assigned to different transactions. This analysis helps in suggesting products that may pair well with the customer's current purchase, which leads to cross-selling and improved profits. It also helps to track the purchase rate per month or year, and it reveals the right time for the retailer to make the desired offers to attract the right customers for the targeted products.
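A minimal market basket sketch (the baskets are illustrative): counting how often pairs of items are bought together, which is the raw input behind "frequently bought together" suggestions.

from itertools import combinations
from collections import Counter

baskets = [
    {"bread", "milk", "butter"},
    {"bread", "butter"},
    {"milk", "coffee"},
    {"bread", "milk"},
]

pair_counts = Counter()
for basket in baskets:
    for pair in combinations(sorted(basket), 2):
        pair_counts[pair] += 1

# Pairs bought together most often, i.e., candidates for cross-selling.
print(pair_counts.most_common(3))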
Everything nowadays needs advertising, because advertising helps people know about a product's existence, use, and features; it takes the product from the warehouse to the real world. If advertising has to attract the right customers, the data must be analyzed. This is where the sales or marketing campaigns performed by retailers come in. Marketing campaigns must be initiated with the right plans, or they may lead to losses for the company through over-investment in untargeted advertisements. A sales campaign depends on the time, the location, and the preferences of the customer. The platform on which the campaign takes place also plays a major role in pulling in the right customers. This requires regular analysis of the sales, and of the associated data generated on a particular platform at a certain time. Traffic on social or other network platforms indicates whether the campaigned product is being favored or not. The retailer can adjust the campaign based on previous statistics, which rapidly increases sales profit and prevents overspending. Analyzing customer profit and company profit can improve the use of campaigns, and the number of sales per campaign can guide the retailer on whether to invest in it or not. A trial-and-error method can thus be turned into a well-informed method through efficient handling of data. A multi-channel sales campaign also helps to analyze purchases and increases revenue, profit, and the number of customers.
Role of data mining in telecommunication industries:
In a highly evolving and competitive environment, the telecommunication industry plays a major role in handling huge data sets of customer, network, and call data. To thrive in such an environment, the telecommunication industry must find a way to handle this data easily. Data mining is preferred to enhance the business and to solve problems in this industry; major functions include fraud call identification and spotting defects in a network to isolate faults. Data mining can also enhance effective marketing techniques. However, this industry confronts challenges in dealing with the logical and temporal aspects of data mining, which calls for the ability to detect rare events in telecommunication data, such as network faults or customer fraud, in real time.
Whenever a call starts in the telecommunication network, the details of the call are recorded: the date and time at which it happens, the duration of the call, and the time when it ends. Since all the data of a call is collected in real time, it is ready to be processed with data mining techniques. However, the data should be aggregated at the customer level rather than at the level of isolated individual phone calls. Thus, by efficient extraction of data, one can find the customer calling pattern.
Some of the data that help to find the pattern are:
Average time duration of calls
Time at which the call took place (daytime/night-time)
Average number of calls on weekdays
Calls generated with varied area codes
Calls generated per day, etc.
By studying the customer call details properly, one can promote business growth. If a customer makes more calls during daytime working hours, they can be identified as part of a business firm. If the night-time call rate is high, the connection is probably used only for residential or domestic purposes. Frequent variation in area codes helps to separate out business calls, because people calling for residential purposes tend to call a limited set of area codes over a period. However, data collected in the evening time cannot tell exactly whether the customer belongs to a business or a residential firm.
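A minimal sketch of deriving such per-customer calling features from call records (the record format here is illustrative: caller, start hour of day, duration in minutes, destination area code):

from collections import defaultdict

# Illustrative call detail records.
calls = [
    ("cust1", 10, 12.0, "040"),
    ("cust1", 14, 3.5, "044"),
    ("cust1", 11, 8.0, "011"),
    ("cust2", 22, 25.0, "040"),
    ("cust2", 23, 30.0, "040"),
]

features = defaultdict(lambda: {"n_calls": 0, "total_min": 0.0,
                                "day_calls": 0, "area_codes": set()})
for cust, hour, minutes, area in calls:
    f = features[cust]
    f["n_calls"] += 1
    f["total_min"] += minutes
    f["day_calls"] += 1 if 8 <= hour < 20 else 0   # daytime versus night-time calls
    f["area_codes"].add(area)

for cust, f in features.items():
    print(cust,
          "avg duration:", round(f["total_min"] / f["n_calls"], 1),
          "daytime share:", round(f["day_calls"] / f["n_calls"], 2),
          "distinct area codes:", len(f["area_codes"]))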
Network data:
Due to the complex, well-developed equipment used in telecommunication networks, every part of the system may generate error and status messages, which leads to a large amount of network data. This data must be separated, grouped, and stored in order to support network fault isolation when the system develops a fault. This ensures that the error or status message of any part of the network system reaches a technical specialist who can rectify it. Since the database is enormous, when a large number of status or error messages are generated it becomes difficult to solve the problems manually, so the handling of some sets of errors and messages can be automated to reduce the strain. A methodical data mining approach can manage the network system efficiently and enhance its functions.
Even though data mining works on raw data, the data must be in a well-understood and properly arranged format to be processed, and in a telecommunication industry dealing with a giant database this is an important need. First, clashing and contradictory data must be identified to avoid inconsistency, and undesired data fields that merely occupy space must be removed. The data must be organized and mapped by finding the relationships between data sets to avoid redundancy. Clustering, or grouping similar data, can be done by data mining algorithms; it can help in analyzing patterns such as calling patterns or customer behavior patterns. Groups are formed by analyzing the similarities between records. By doing this, the data can be understood easily, which leads to easy manipulation and use.
Customer profiling:
The telecommunication industry deals with customer details on a large scale. It observes patterns in the call data to profile customers and predict future trends. By knowing the customer pattern, the company can decide which promotion methods to offer the customer. If a customer's calls stay within one area code, a promotion targeted at that area would gain a group of customers. This can efficiently monetize the promotion techniques and stop the company from investing in a single subscriber when it can attract a whole group of people with the right plan. Privacy issues arise when the customer's call history or details are monitored.
One of the significant problems that the telecommunication industry faces is customer churn. This can also be stated as customer turnover, in which the company loses its client: the client leaves and switches to another telecommunication company. If the customer churn rate is high, the company experiences a severe loss of revenue and profit, leading to a decline in growth. This issue can be addressed with data mining techniques that collect customer patterns and profile the customers. Incentive offers provided by companies attract the regular users of other companies. By profiling the data, customer churn can be effectively forecasted from behaviors such as subscription history, the plan chosen, and so on. While collecting data from paying customers, it is also possible to collect data about the receiver or non-customer, but with a set of restrictions.
Fraud detection:
Fraud is a critical problem for telecommunication industries; it causes loss of revenue and deterioration in customer relations. Two major fraud activities are subscription fraud and superimposed fraud. Subscription fraud involves collecting the details of customers, mostly from KYC (Know Your Customer) documents such as name, address, and ID proof details. These details are used to sign up for telecom services with authenticated approval, but without any intention of paying for the use of the service through the account. Some offenders do not stop at the illegitimate use of services but also perform bypass fraud by diverting voice traffic from local to international protocols, which causes destructive losses to the telecommunication company. Superimposed fraud starts with a legitimate account and legal activity, but later involves overlapping or imposed activity in which some other person illegally uses the services instead of the account holder. By collecting the behavioral pattern of the account holder, suspected superimposed fraudulent activity can trigger immediate actions such as blocking or deactivating the account, which prevents further damage to the company.