
MALLA REDDY UNIVERSITY

LECTURE NOTES ON DATA MINING

(MR20-1CS0241)

B. Tech (CSE) - ARTIFICIAL INTELLIGENCE AND MACHINE LEARNING


(MR20-1CS0241) DATA MINING

UNIT 1
INTRODUCTION TO DATA MINING:
Introduction to Data Mining – Data Mining Tasks – Major Issues in Data Mining – Data Preprocessing –
Data sets, Association Rule Mining: Efficient and Scalable Frequent Item set Mining Methods – Mining
Various Kinds of Association Rules.

UNIT 2
CLASSIFICATION AND PREDICTION: Decision Tree Introduction, Bayesian Classification,
Rule Based Classification, Classification by Back propagation, Support Vector Machines,
Associative Classification, classification using frequent patterns, Lazy Learners, Other
Classification Methods: Genetic Algorithm.

UNIT 3
CLUSTERING ANALYSIS: Types of Data in Cluster Analysis, Partitioning Methods,
Hierarchical methods, Density-Based Methods, Grid-Based Methods, Probabilistic Model-Based
Clustering, Clustering High-Dimensional Data, Clustering with Constraint, Outliers and Outlier
Analysis.

UNIT 4
WEB AND TEXT MINING: Introduction, web mining, web content mining, web structure
mining, web usage mining, Text mining, unstructured text, episode rule discovery for texts,
hierarchy of categories, text clustering.

UNIT 5
TEMPORAL AND SPATIAL DATA MINING: Introduction, Temporal Data Mining, Temporal
Association Rules, Sequence Mining, GSP Algorithm, SPADE, SPIRIT, Episode Discovery,
Time Series Analysis, Spatial Mining, Spatial Mining Tasks, Spatial Clustering, Data
Mining Applications: Data Mining for Retail and Telecommunication Industries.

UNIT 1
INTRODUCTION TO DATA MINING

INTRODUCTION TO DATA MINING:


Introduction to Data Mining-Data Mining Tasks-Components of Data Mining Algorithms-Data
Mining supporting Techniques-Major Issues in Data Mining-Measurement and Data-Data
Preprocessing-Data sets; Association Rule Mining-Efficient and Scalable Frequent Item set Mining
Methods-Mining Various Kinds of Association Rules.

Introduction to Data Mining:

1.1 What Is Data Mining?

Data mining refers to extracting or mining knowledge from large amounts of data. The term is
actually a misnomer, since the goal is to extract knowledge (patterns) from data rather than to extract the data itself. The overall goal of the data mining process is to extract information from a data
set and transform it into an understandable structure for further use. The key properties of data
mining are:
 Automatic discovery of patterns
 Prediction of likely outcomes
 Creation of actionable information
 Focus on large data sets and databases

1.2 The Scope of Data Mining

Data mining derives its name from the similarities between searching for valuable business
information in a large database (for example, finding linked products in gigabytes of store scanner
data) and mining a mountain for a vein of valuable ore. Both processes require either sifting through
an immense amount of material, or intelligently probing it to find exactly where the value resides.
Given databases of sufficient size and quality, data mining technology can generate new business
opportunities by providing these capabilities:

Automated prediction of trends and behaviors. Data mining automates the process of finding
predictive information in large databases. Questions that traditionally required extensive hands-on
analysis can now be answered directly from the data — quickly. A typical example of a predictive
problem is targeted marketing. Data mining uses data on past promotional mailings to identify the
targets most likely to maximize return on investment in future mailings. Other predictive problems
include forecasting bankruptcy and other forms of default, and identifying segments of a population
likely to respond similarly to given events.

Automated discovery of previously unknown patterns. Data mining tools sweep through
databases and identify previously hidden patterns in one step. An example of pattern discovery is the
analysis of retail sales data to identify seemingly unrelated products that are often purchased
together. Other pattern discovery problems include detecting fraudulent credit card transactions and
identifying anomalous data that could represent data entry keying errors.
Data Mining Tasks:

Tasks of Data Mining

Data mining involves six common classes of tasks:

 Anomaly detection (outlier/change/deviation detection) – the identification of unusual data records that might be interesting, or data errors that require further investigation.
 Association rule learning (dependency modelling) – searches for relationships between variables. For example, a supermarket might gather data on customer purchasing habits. Using association rule learning, the supermarket can determine which products are frequently bought together and use this information for marketing purposes. This is sometimes referred to as market basket analysis.
 Clustering – the task of discovering groups and structures in the data that are in some way or another "similar", without using known structures in the data.
 Classification – the task of generalizing known structure to apply to new data. For example, an e-mail program might attempt to classify an e-mail as "legitimate" or as "spam".
 Regression – attempts to find a function which models the data with the least error.
 Summarization – provides a more compact representation of the data set, including visualization and report generation.

Components of Data Mining Algorithms:

Data mining is the procedure of selecting, exploring, and modeling large quantities of data to find regularities or relations that are at first unknown, in order to obtain clear and beneficial results for the owner of the database.
Data mining is an interdisciplinary field, an assemblage of a set of disciplines such as database systems, statistics, machine learning, visualization, and data science. Depending on the data mining methods used, approaches from other disciplines may also be applied, including neural networks, fuzzy and rough set theory, knowledge representation, inductive logic programming, and high-performance computing.
Depending on the types of data to be mined or on the given data mining application, the data mining system can also integrate methods from spatial data analysis, information retrieval, pattern recognition, image analysis, signal processing, computer graphics, network technology, economics, business, bioinformatics, or psychology.
A data mining query language can be designed to incorporate these primitives, enabling users to interact flexibly with data mining systems. A data mining query language provides a foundation on which user-friendly graphical interfaces can be constructed. This promotes a data mining system's communication with other information systems and its integration with the overall data processing environment.
Designing an inclusive data mining language is challenging because data mining covers a wide spectrum of functions, from data characterization to evolution analysis, and each task has different requirements. The design of an effective data mining query language therefore requires a broad understanding of the power, limitations, and underlying structure of the different types of data mining tasks.
Data mining functionalities are used to define the kinds of patterns to be discovered in data mining tasks. In general, data mining tasks can be classified into two types: descriptive and predictive. Descriptive mining tasks characterize the general properties of the data in the database, while predictive mining tasks perform inference on the current data in order to make predictions.
The major components of data mining are as follows −

 Databases − one or a set of databases, data warehouses, spreadsheets, or other types of data repositories on which data cleaning and integration techniques can be applied.
 Data warehouse server − this component fetches the relevant records from a data warehouse based on the user's request.
 Knowledge base − the domain knowledge that is used to guide the search and to evaluate the interestingness of discovered patterns.
 Data mining engine − a set of functional modules used to perform tasks such as classification, association, and cluster analysis.
 Pattern evaluation module − this component uses interestingness measures and interacts with the data mining modules to focus the search towards interesting patterns.
 User interface − this interface enables users to interact with the system by describing a data mining function or a query through a graphical user interface.

Data Mining supporting Techniques:

Organizations have access to more data now than they have ever had before. However, making sense
of the huge volumes of structured and unstructured data to implement organization-wide
improvements can be extremely challenging because of the sheer amount of information. If not
properly addressed, this challenge can minimize the benefits of all the data.

Data mining is the process by which organizations detect patterns in data for insights relevant to their
business needs. It is essential for both business intelligence and data science. There are many data
mining techniques organizations can use to turn raw data into actionable insights. These involve
everything from cutting-edge artificial intelligence to the basics of data preparation, which are both
key for maximizing the value of data investments.

1. Data cleaning and preparation


2. Tracking patterns
3. Classification
4. Association
5. Outlier detection
6. Clustering
7. Regression
8. Prediction
9. Sequential patterns
10. Decision trees
11. Statistical techniques
12. Visualization
13. Neural networks
14. Data warehousing
15. Long-term memory processing
16. Machine learning and artificial intelligence

1. Data cleaning and preparation


Data cleaning and preparation is a vital part of the data mining process. Raw data must be cleansed
and formatted to be useful in different analytic methods. Data cleaning and preparation includes
different elements of data modeling, transformation, data migration, ETL, ELT, data integration, and
aggregation. It is a necessary step for understanding the basic features and attributes of data to
determine its best use.

The business value of data cleaning and preparation is self-evident. Without this first step, data is
either meaningless to an organization or unreliable due to its quality. Companies must be able to trust
their data, the results of their analytics, and the actions derived from those results.
These steps are also necessary for data quality and proper data governance.

2. Tracking patterns
Tracking patterns is a fundamental data mining technique. It involves identifying and monitoring
trends or patterns in data to make intelligent inferences about business outcomes. Once an
organization identifies a trend in sales data, for example, there is a basis for taking action to capitalize
on that insight. If it is determined that a certain product is selling more than others for a particular
demographic, an organization can use this knowledge to create similar products or services, or
simply better stock the original product for this demographic.

3. Classification
Classification data mining techniques involve analyzing the various attributes associated with
different types of data. Once organizations identify the main characteristics of these data types,
organizations can categorize or classify related data. Doing so is critical for identifying, for example,
personally identifiable information organizations may want to protect or redact from documents.

4. Association
Association is a data mining technique related to statistics. It indicates that certain data (or events
found in data) are linked to other data or data-driven events. It is similar to the notion of co-
occurrence in machine learning, in which the likelihood of one data-driven event is indicated by the
presence of another.

The statistical concept of correlation is also similar to the notion of association: the analysis of data
shows that there is a relationship between two data events, such as the fact that the purchase of
hamburgers is frequently accompanied by the purchase of French fries.

5. Outlier detection
Outlier detection determines any anomalies in datasets. Once organizations find aberrations in their
data, it becomes easier to understand why these anomalies happen and prepare for any future
occurrences to best achieve business objectives. For instance, if there is a spike in the usage of
transactional systems for credit cards at a certain time of day, organizations can capitalize on this
information by figuring out why it is happening in order to optimize their sales during the rest of the day.
6. Clustering
Clustering is an analytics technique that relies on visual approaches to understanding data. Clustering
mechanisms use graphics to show where the distribution of data is in relation to different types of
metrics. Clustering techniques also use different colors to show the distribution of data. Graph
approaches are ideal for using cluster analytics. With graphs and clustering in particular, users can
visually see how data is distributed to identify trends that are relevant to their business objectives.
7. Regression
Regression techniques are useful for identifying the nature of the relationship between variables in a
dataset. Those relationships could be causal in some instances, or simply correlational in others.
Regression is a straightforward white box technique that clearly reveals how variables are related.
Regression techniques are used in aspects of forecasting and data modeling.
8. Prediction
Prediction is a very powerful aspect of data mining that represents one of four branches of analytics.
Predictive analytics use patterns found in current or historical data to extend them into the future.
Thus, it gives organizations insight into what trends will happen next in their data. There are several
different approaches to using predictive analytics. Some of the more advanced involve aspects of
machine learning and artificial intelligence. However, predictive analytics does not necessarily depend
on these techniques; it can also be facilitated with more straightforward algorithms.
9. Sequential patterns
This data mining technique focuses on uncovering a series of events that take place in sequence. It is
particularly useful for mining transactional data. For instance, this technique can reveal what
items of clothing customers are more likely to buy after an initial purchase of, say, a pair of shoes.
Understanding sequential patterns can help organizations recommend additional items to customers
to spur sales.
10. Decision trees
Decision trees are a specific type of predictive model that lets organizations effectively mine data.
Technically, a decision tree is part of machine learning, but it is more popularly known as a white
box machine learning technique because of its extremely straightforward nature.
A decision tree enables users to clearly understand how the data inputs affect the outputs. When
various decision tree models are combined, they create predictive analytics models known as
random forests. Complicated random forest models are considered black box machine learning
techniques, because it is not always easy to understand their outputs based on their inputs. In most
cases, however, this basic form of ensemble modeling is more accurate than using decision trees on
their own.
11. Statistical techniques
Statistical techniques are at the core of most analytics involved in the data mining process. The
different analytics models are based on statistical concepts, which output numerical values that are
applicable to specific business objectives. For instance, neural networks use complex statistics based
on different weights and measures to determine if a picture is a dog or a cat in image recognition
systems.
Statistical models represent one of two main branches of artificial intelligence. The models for some
statistical techniques are static, while others involving machine learning get better with time.
12. Visualization
Data visualizations are another important element of data mining. They grant users insight into data
based on sensory perceptions that people can see. Today's data visualizations are dynamic, useful for
streaming data in real-time, and characterized by different colors that reveal different trends and
patterns in data.
Dashboards are a powerful way to use data visualizations to uncover data mining insights.
Organizations can base dashboards on different metrics and use visualizations to visually highlight
patterns in data, instead of simply using numerical outputs of statistical models.
13. Neural networks
A neural network is a specific type of machine learning model that is often used with AI and deep
learning. Named after the fact that they have different layers which resemble the way neurons work
in the human brain, neural networks are one of the more accurate machine learning models used
today.
Although a neural network can be a powerful tool in data mining, organizations should take caution
when using it: some of these neural network models are incredibly complex, which makes it difficult
to understand how a neural network determined an output.
14. Data warehousing
Data warehousing is an important part of the data mining process. Traditionally, data warehousing
involved storing structured data in relational database management systems so it could be analyzed
for business intelligence, reporting, and basic dashboarding capabilities. Today, there are cloud data
warehouses and data warehouses in semi-structured and unstructured data stores like Hadoop. While
data warehouses were traditionally used for historic data, many modern approaches can provide an
in-depth, real-time analysis of data.
15. Long-term memory processing
Long-term memory processing refers to the ability to analyze data over extended periods of time. The
historic data stored in data warehouses is useful for this purpose. When an organization can perform
analytics over an extended period of time, it is able to identify patterns that otherwise might be too
subtle to detect. For example, by analyzing attrition over a period of several years, an organization
may find subtle clues that could lead to reducing churn in finance.

16. Machine learning and artificial intelligence


Machine learning and artificial intelligence (AI) represent some of the most advanced developments
in data mining. Advanced forms of machine learning like deep learning offer highly accurate
predictions when working with data at scale. Consequently, they are useful for processing data in AI
deployments like computer vision, speech recognition, or sophisticated text analytics using natural
language processing. These data mining techniques are good for determining value from semi-
structured and unstructured data.
Major Issues in Data Mining:

Mining different kinds of knowledge in databases. - The needs of different users are not the same,
and different users may be interested in different kinds of knowledge. Therefore, data mining should
cover a broad range of knowledge discovery tasks.

Interactive mining of knowledge at multiple levels of abstraction. - The data mining process needs
to be interactive because this allows users to focus the search for patterns, providing and refining data
mining requests based on the returned results.

Incorporation of background knowledge. - Background knowledge can be used to guide the discovery
process and to express the discovered patterns. It may be used to express discovered patterns not only
in concise terms but at multiple levels of abstraction.

Data mining query languages and ad hoc data mining. - A data mining query language that
allows the user to describe ad hoc mining tasks should be integrated with a data warehouse
query language and optimized for efficient and flexible data mining.

Presentation and visualization of data mining results. - Once the patterns are discovered, they need
to be expressed in high-level languages and visual representations. These representations should be
easily understandable by the users.

Handling noisy or incomplete data. - Data cleaning methods are required that can handle noise and
incomplete objects while mining the data regularities. Without such methods, the accuracy of the
discovered patterns will be poor.

Pattern evaluation. - This refers to the interestingness of discovered patterns. Patterns that merely
represent common knowledge or lack novelty are not interesting, so interestingness measures are
needed to guide and evaluate the discovery process.

Efficiency and scalability of data mining algorithms. - In order to effectively extract information
from the huge amount of data in databases, data mining algorithms must be efficient and
scalable.
Parallel, distributed, and incremental mining algorithms. - Factors such as the huge size of
databases, the wide distribution of data, and the complexity of data mining methods motivate the
development of parallel and distributed data mining algorithms. These algorithms divide the data
into partitions, which are processed in parallel; the results from the partitions are then merged.
Incremental algorithms update the mined knowledge as the database is updated, without having to
mine the data again from scratch.

Measurement and Data:


In a data mining sense, a similarity measure is a distance whose dimensions describe object
features. If the distance between two data points is small, then there is a high degree of
similarity between the objects, and vice versa. Similarity is subjective and depends heavily on the
context and the application.
Data Preprocessing:
Data Integration:
It combines data from multiple sources into a coherent data store, as in data warehousing. These
sources may include multiple databases, data cubes, or flat files.

Data integration systems are formally defined as a triple <G, S, M>

Where G: The global schema

S: Heterogeneous source of schemas

M: Mapping between the queries of source and global schema


Issues in Data integration:
1. Schema integration and object matching:
How can the data analyst or the computer be sure that customer id in one database and customer
number in another refer to the same attribute?
2. Redundancy:
An attribute (such as annual revenue, for instance) may be redundant if it can be derived from
another attribute or set of attributes. Inconsistencies in attribute or dimension naming can also cause
redundancies in the resulting data set.
3. Detection and resolution of data value conflicts:
For the same real-world entity, attribute values from different sources may differ.
Data Transformation:
In data transformation, the data are transformed or consolidated into forms appropriate for mining.
Data transformation can involve the following:
Smoothing, this works to remove noise from the data. Such techniques include binning, regression,
and clustering.
Aggregation, where summary or aggregation operations are applied to the data. For example, the
daily sales data may be aggregated so as to compute monthly and annual total amounts. This step is
typically used in constructing a data cube for analysis of the data at multiple granularities.
Generalization of the data, where low-level or "primitive" (raw) data are replaced by higher-level
concepts through the use of concept hierarchies. For example, categorical attributes, like street, can
be generalized to higher-level concepts, like city or country.
Normalization, where the attribute data are scaled so as to fall within a small specified range, such
as -1.0 to 1.0, or 0.0 to 1.0 (a minimal scaling sketch is shown after this list).
Attribute construction (or feature construction), where new attributes are constructed and added
from the given set of attributes to help the mining process.
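As an illustration of normalization, here is a minimal min-max scaling sketch; the code and the sample income values are assumptions added for illustration and are not part of the original notes.

# Minimal min-max normalization sketch (illustrative attribute values only).
def min_max_normalize(values, new_min=0.0, new_max=1.0):
    old_min, old_max = min(values), max(values)
    if old_max == old_min:                  # avoid division by zero for constant attributes
        return [new_min for _ in values]
    return [(v - old_min) / (old_max - old_min) * (new_max - new_min) + new_min
            for v in values]

incomes = [12000, 35000, 47000, 98000]      # hypothetical attribute values
print(min_max_normalize(incomes))           # every value now falls in [0.0, 1.0]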
Data Reduction:
Data reduction techniques can be applied to obtain a reduced representation of the data set that is
much smaller in volume, yet closely maintains the integrity of the original data. That is, mining on
the reduced data set should be more efficient yet produce the same (or almost the same) analytical
results.
Strategies for data reduction include the following:
Data cube aggregation, where aggregation operations are applied to the data in the construction of a
data cube.
Attribute subset selection, where irrelevant, weakly relevant, or redundant attributes or dimensions
may be detected and removed.
Dimensionality reduction, where encoding mechanisms are used to reduce the dataset size.
Numerosity reduction, where the data are replaced or estimated by alternative, smaller data
representations such as parametric models (which need store only the model parameters instead of
the actual data) or nonparametric methods such as clustering, sampling, and the use of histograms.

Discretization and concept hierarchy generation, where raw data values for attributes are replaced
by ranges or higher conceptual levels. Data discretization is a form of numerosity reduction that is
very useful for the automatic generation of concept hierarchies. Discretization and concept hierarchy
generation are powerful tools for data mining, in that they allow the mining of data at multiple levels
of abstraction.
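As a simple illustration of discretization, the following assumed sketch (not taken from the original notes) replaces raw numeric age values with equal-width interval labels, a first step toward building a concept hierarchy.

# Minimal equal-width discretization sketch (hypothetical ages, three bins).
def equal_width_bins(values, num_bins=3):
    lo, hi = min(values), max(values)
    width = (hi - lo) / num_bins or 1       # guard against a zero-width range
    labels = []
    for v in values:
        idx = min(int((v - lo) / width), num_bins - 1)
        lower = lo + idx * width
        labels.append(f"[{lower:.0f}-{lower + width:.0f})")
    return labels

ages = [23, 25, 31, 37, 44, 58, 62]
print(equal_width_bins(ages))               # e.g. ['[23-36)', '[23-36)', '[23-36)', ...]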

Data sets; Association Rule Mining:


Association Rule Mining:

 Association rule mining is a popular and well-researched method for discovering
interesting relations between variables in large databases.
 It is intended to identify strong rules discovered in databases using different
measures of interestingness.
 Based on the concept of strong rules, Rakesh Agrawal et al. introduced association
rules.
Problem Definition:
The problem of association rule mining is defined as follows:

Let I = {i1, i2, ..., in} be a set of n binary attributes called items.

Let D = {t1, t2, ..., tm} be a set of transactions called the database.

Each transaction t in D has a unique transaction ID and contains a subset of the items in I.
A rule is defined as an implication of the form X => Y,
where X ⊆ I, Y ⊆ I and X ∩ Y = ∅.
The sets of items (for short, itemsets) X and Y are called the antecedent (left-hand side or LHS) and
the consequent (right-hand side or RHS) of the rule, respectively.
Example:
To illustrate the concepts, we use a small example from the supermarket domain. The set of
items is I = {milk, bread, butter, beer} and a small database containing the items (1
codes presence and 0 absence of an item in a transaction) is shown in the table below.

An example rule for the supermarket could be {butter, bread} => {milk}, meaning that if
butter and bread are bought, customers also buy milk.
Example database with 4 items and 5 transactions

Transaction ID   milk   bread   butter   beer
1                1      1       0        0
2                0      0       1        0
3                0      0       0        1
4                1      1       1        0
5                0      1       0        0

Important concepts of Association Rule Mining:

 The support supp(X) of an itemset X is defined as the proportion of transactions in
the data set which contain the itemset:

supp(X) = (number of transactions containing X) / (total number of transactions)

In the example database, the itemset {milk, bread, butter} has a support of 1/5 = 0.2,
since it occurs in 20% of all transactions (1 out of 5 transactions).

 The confidence of a rule X => Y is defined as

conf(X => Y) = supp(X ∪ Y) / supp(X)

For example, the rule {butter, bread} => {milk} has a confidence of 0.2/0.2 = 1.0
in the database, which means that for 100% of the transactions
containing butter and bread the rule is correct (100% of the times a customer buys butter
and bread, milk is bought as well). Confidence can be interpreted as an estimate of the
conditional probability P(Y|X), the probability of finding the RHS of the rule in transactions
under the condition that these transactions also contain the LHS.

 The lift of a rule is defined as

lift(X => Y) = supp(X ∪ Y) / (supp(X) × supp(Y))

or the ratio of the observed support to that expected if X and Y were independent. For example, the
rule {milk, bread} => {butter} has a lift of 0.2 / (0.4 × 0.4) = 1.25.

 The conviction of a rule is defined as

conv(X => Y) = (1 − supp(Y)) / (1 − conf(X => Y))

For example, the rule {milk, bread} => {butter} has a conviction of (1 − 0.4) / (1 − 0.5) = 1.2,
and can be interpreted as the ratio of the expected frequency that X occurs without Y
(that is to say, the frequency that the rule makes an incorrect prediction) if X and Y were
independent divided by the observed frequency of incorrect predictions.
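A minimal Python sketch (illustrative code, not part of the original notes) that computes these measures over the example database above:

# Support, confidence, lift, and conviction over the 5-transaction example database.
transactions = [
    {"milk", "bread"},            # transaction 1
    {"butter"},                   # transaction 2
    {"beer"},                     # transaction 3
    {"milk", "bread", "butter"},  # transaction 4
    {"bread"},                    # transaction 5
]

def supp(itemset):
    """Proportion of transactions that contain every item of the itemset."""
    return sum(itemset <= t for t in transactions) / len(transactions)

def conf(lhs, rhs):
    return supp(lhs | rhs) / supp(lhs)

def lift(lhs, rhs):
    return supp(lhs | rhs) / (supp(lhs) * supp(rhs))

def conviction(lhs, rhs):
    c = conf(lhs, rhs)
    return float("inf") if c == 1 else (1 - supp(rhs)) / (1 - c)

print(supp({"milk", "bread", "butter"}))          # 0.2
print(conf({"butter", "bread"}, {"milk"}))        # 1.0
print(lift({"milk", "bread"}, {"butter"}))        # 1.25 (approximately)
print(conviction({"milk", "bread"}, {"butter"}))  # 1.2 (approximately)

The printed values match the worked figures above: support 0.2, confidence 1.0, lift 1.25, and conviction 1.2.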
Efficient and Scalable Frequent Item set Mining Methods:

 Apriori is a seminal algorithm proposed by R. Agrawal and R. Srikant in 1994 for
mining frequent itemsets for Boolean association rules.
 The name of the algorithm is based on the fact that the algorithm uses prior
knowledge of frequent itemset properties.
 Apriori employs an iterative approach known as a level-wise search, where k-
itemsets are used to explore (k+1)-itemsets.
 First, the set of frequent 1-itemsets is found by scanning the database to accumulate
the count for each item, and collecting those items that satisfy minimum support.
The resulting set is denoted L1. Next, L1 is used to find L2, the set of frequent 2-
itemsets, which is used to find L3, and so on, until no more frequent k-itemsets can
be found.
 The finding of each Lk requires one full scan of the database.
 A two-step process is followed in Apriori, consisting of join and prune actions.
Example:

TID     List of item IDs

T100    I1, I2, I5
T200    I2, I4
T300    I2, I3
T400    I1, I2, I4
T500    I1, I3
T600    I2, I3
T700    I1, I3
T800    I1, I2, I3, I5
T900    I1, I2, I3

There are nine transactions in this database, that is, |D| = 9.


Steps:

1. In the first iteration of the algorithm, each item is a member of the set of candidate 1-itemsets, C1. The algorithm simply scans all of the transactions in order to count the number of occurrences of each item.
2. Suppose that the minimum support count required is 2, that is, min_sup = 2. The set of frequent 1-itemsets, L1, can then be determined. It consists of the candidate 1-itemsets satisfying minimum support. In our example, all of the candidates in C1 satisfy minimum support.
3. To discover the set of frequent 2-itemsets, L2, the algorithm uses the join L1 x L1 to generate a candidate set of 2-itemsets, C2. No candidates are removed from C2 during the prune step because each subset of the candidates is also frequent.
4. Next, the transactions in D are scanned and the support count of each candidate itemset in C2 is accumulated.
5. The set of frequent 2-itemsets, L2, is then determined, consisting of those candidate 2-itemsets in C2 having minimum support.
6. The generation of the set of candidate 3-itemsets, C3: from the join step, we first get C3 = L2 x L2 = {{I1, I2, I3}, {I1, I2, I5}, {I1, I3, I5}, {I2, I3, I4}, {I2, I3, I5}, {I2, I4, I5}}. Based on the Apriori property that all subsets of a frequent itemset must also be frequent, we can determine that the four latter candidates cannot possibly be frequent.
7. The transactions in D are scanned in order to determine L3, consisting of those candidate 3-itemsets in C3 having minimum support.
8. The algorithm uses L3 x L3 to generate a candidate set of 4-itemsets, C4.
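To make the level-wise search concrete, the following is a simplified, assumed Apriori sketch (not part of the original notes) over the nine-transaction database above; it generates candidates by joining the previous level and keeps those meeting min_sup = 2, omitting the subset-based pruning optimization of the full algorithm.

# Simplified Apriori sketch over the nine-transaction example database (min_sup = 2).
database = [
    {"I1", "I2", "I5"}, {"I2", "I4"}, {"I2", "I3"},
    {"I1", "I2", "I4"}, {"I1", "I3"}, {"I2", "I3"},
    {"I1", "I3"}, {"I1", "I2", "I3", "I5"}, {"I1", "I2", "I3"},
]
MIN_SUP = 2

def support_count(itemset):
    """Number of transactions in the database that contain every item of the itemset."""
    return sum(itemset <= t for t in database)

def apriori(min_sup):
    items = sorted({i for t in database for i in t})
    level = [frozenset([i]) for i in items if support_count({i}) >= min_sup]  # L1
    frequent = list(level)
    k = 2
    while level:
        # Join step: build candidate k-itemsets from the frequent (k-1)-itemsets.
        candidates = {a | b for a in level for b in level if len(a | b) == k}
        # Prune by support count (one full scan of the database per level).
        level = [c for c in candidates if support_count(c) >= min_sup]
        frequent.extend(level)
        k += 1
    return frequent

for itemset in apriori(MIN_SUP):
    print(sorted(itemset), support_count(itemset))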
Generating Association Rules from Frequent Item sets:
Once the frequent itemsets from transactions in a database D have been found, it is
straightforward to generate strong association rules from them (rules that satisfy both minimum
support and minimum confidence). For every frequent itemset l, generate all nonempty proper
subsets s of l, and output the rule s => (l - s) if its confidence, support_count(l) / support_count(s),
is at least the minimum confidence threshold.

Example:
Consider the frequent itemset l = {I1, I2, I5} from the transactional database above. The nonempty
proper subsets of l are {I1, I2}, {I1, I5}, {I2, I5}, {I1}, {I2}, and {I5}. The resulting association
rules and their confidences are:
I1 ^ I2 => I5 (confidence = 2/4 = 50%)
I1 ^ I5 => I2 (confidence = 2/2 = 100%)
I2 ^ I5 => I1 (confidence = 2/2 = 100%)
I1 => I2 ^ I5 (confidence = 2/6 = 33%)
I2 => I1 ^ I5 (confidence = 2/7 = 29%)
I5 => I1 ^ I2 (confidence = 2/2 = 100%)
If the minimum confidence threshold is, say, 70%, then only the second, third, and last rules above
are output, because these are the only ones that are strong.

Mining Various Kinds of Association Rules:

For many applications, it is difficult to find strong associations among data items at low or
primitive levels of abstraction due to the sparsity of data at those levels.
Strong associations discovered at high levels of abstraction may represent
commonsense knowledge.
Therefore, data mining systems should provide capabilities for mining association
rules at multiple levels of abstraction, with sufficient flexibility for easy traversal
among different abstraction spaces.
Association rules generated from mining data at multiple levels of abstraction are called
multiple-level or multilevel association rules.
Multilevel association rules can be mined efficiently using concept hierarchies under a
support-confidence framework.
In general, a top-down strategy is employed, where counts are accumulated for the
calculation of frequent itemsets at each concept level, starting at the concept level 1 and
working downward in the hierarchy toward the more specific concept levels, until no
more frequent itemsets can be found.

A concept hierarchy defines a sequence of mappings from a set of low-level concepts to
higher-level, more general concepts. Data can be generalized by replacing low-level
concepts within the data by their higher-level concepts, or ancestors, from a concept hierarchy.

The concept hierarchy has five levels, respectively referred to as levels 0 to 4, starting with level
0 at the root node for all.
 Here, Level 1 includes computer, software, printer & camera, and computer
accessory.

 Level 2 includes laptop computer, desktop computer, office software, antivirus


software

 Level 3 includes IBM desktop computer, . . . , Microsoft office software, and so on.

 Level 4 is the most specific abstraction level of this hierarchy.

Approaches for Mining Multilevel Association Rules:

1. Uniform Minimum Support:


 The same minimum support threshold is used when mining at each level of
abstraction.

 When a uniform minimum support threshold is used, the search procedure is


simplified.

 The method is also simple in that users are required to specify only one minimum
support threshold.
 The uniform support approach, however, has some difficulties. It is unlikely that
items at lower levels of abstraction will occur as frequently as those at higher
levels of abstraction.
 If the minimum support threshold is set too high, it could miss some meaningful
associations occurring at low abstraction levels. If the threshold is set too low, it may
generate many uninteresting associations occurring at high abstraction levels.
2. Reduced Minimum Support:
 Each level of abstraction has its own minimum support threshold.
 The deeper the level of abstraction, the smaller the corresponding threshold is.
 For example, the minimum support thresholds for levels 1 and 2 are 5% and 3%,
respectively. In this way, "computer," "laptop computer," and "desktop computer" are all
considered frequent.

3. Group-Based Minimum Support:


 Because users or experts often have insight as to which groups are more important
than others, it is sometimes more desirable to set up user-specific, item, or group
based minimal support thresholds when mining multilevel rules.
 For example, a user could set up the minimum support thresholds based on product
price, or on items of interest, such as by setting particularly low support thresholds
for laptop computers and flash drives in order to pay particular attention to the
association patterns containing items in these categories.

Mining Multidimensional Association Rules from Relational Databases and Data


Warehouses:

o A single-dimensional or intra-dimensional association rule contains a single distinct
predicate (e.g., buys) with multiple occurrences, i.e., the predicate occurs more than
once within the rule.
buys(X, "digital camera") => buys(X, "HP printer")
o Association rules that involve two or more dimensions or predicates can be
referred toas multidimensional association rules.
age(X, "20…29") ^ occupation(X, "student") => buys(X, "laptop")
o The above rule contains three predicates (age, occupation, and buys), each of which
occurs only once in the rule. Hence, we say that it has no repeated predicates.
o Multidimensional association rules with no repeated predicates are called
interdimensional association rules.
o We can also mine multidimensional association rules with repeated predicates,
which contain multiple occurrences of some predicates. These rules are called
hybrid-dimensional association rules. An example of such a rule is the following,
where the predicate buys is repeated:
age(X, "20…29") ^ buys(X, "laptop") => buys(X, "HP printer")
Mining Quantitative Association Rules:

 Quantitative association rules are multidimensional association rules in which the


numeric attributes are dynamically discretized during the mining process so as to
satisfy some mining criteria, such as maximizing the confidence or compactness of
the rules mined.
 In this section, we focus specifically on how to mine quantitative association rules
having two quantitative attributes on the left-hand side of the rule and one
categorical attribute on the right-hand side of the rule. That is,
Aquan1 ^ Aquan2 => Acat
where Aquan1 and Aquan2 are tests on quantitative attribute intervals, and
Acat tests a categorical attribute from the task-relevant data.
 Such rules have been referred to as two-dimensional quantitative
association rules, because they contain two quantitative dimensions.
 For instance, suppose you are curious about the association relationship between
pairs of quantitative attributes, like customer age and income, and the type of
television (such as high-definition TV, i.e., HDTV) that customers like to buy.
An example of such a 2-D quantitative association rule is
age(X, "30…39") ^ income(X, "42K…48K") => buys(X, "HDTV")

From Association Mining to Correlation Analysis:


A correlation measure can be used to augment the support-confidence framework
for association rules. This leads to correlation rules of the form
A => B [support, confidence, correlation]
That is, a correlation rule is measured not only by its support and confidence but also by
the correlation between itemsets A and B. There are many different correlation measures
from which to choose. In this section, we study various correlation measures to determine
which would be good for mining large data sets.

Lift is a simple correlation measure that is given as follows. The
occurrence of itemset A is independent of the occurrence of itemset B if
P(A ∪ B) = P(A)P(B); otherwise, itemsets A and B are dependent and correlated as events. This
definition can easily be extended to more than two itemsets.

The lift between the occurrence of A and B can be measured by computing
lift(A, B) = P(A ∪ B) / (P(A) P(B))

 If lift(A, B) is less than 1, then the occurrence of A is negatively correlated
with the occurrence of B.
 If the resulting value is greater than 1, then A and B are positively correlated,
meaning that the occurrence of one implies the occurrence of the other.
 If the resulting value is equal to 1, then A and B are independent and there is no
correlation between them.
UNIT II

CLASSIFICATION AND PREDICTION: Decision Tree Introduction, Bayesian Classification,


Rule Based Classification, Classification by Back propagation, Support Vector Machines,
Associative Classification, classification using frequent patterns, Lazy Learners, Other
Classification Methods: Genetic Algorithm.
Classification and Prediction:

 Classification and prediction are two forms of data analysis that can be used to extract
models describing important data classes or to predict future data trends.
 Classification predicts categorical (discrete, unordered) labels, while prediction models
continuous-valued functions.
 For example, we can build a classification model to categorize bank loan applications
as either safe or risky, or a prediction model to predict the expenditures of potential
customers on computer equipment given their income and occupation.
 A predictor is constructed that predicts a continuous-valued function, or ordered
value, as opposed to a categorical label.
 Regression analysis is a statistical methodology that is most often used for numeric
prediction (a minimal fitting sketch is shown after this list).
 Many classification and prediction methods have been proposed by researchers in
machine learning, pattern recognition, and statistics.
 Most algorithms are memory resident, typically assuming a small data size. Recent
data mining research has built on such work, developing scalable classification and
prediction techniques capable of handling large disk-resident data.
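As noted above, regression analysis is the statistical methodology most often used for numeric prediction. The following minimal sketch (with invented income/expenditure values, assuming NumPy is available) fits a simple linear predictor of expenditure from income:

# Minimal numeric-prediction sketch: fit expenditure ≈ a * income + b by least squares.
import numpy as np

income = np.array([30.0, 45.0, 60.0, 75.0, 90.0])   # hypothetical incomes (in $1000s)
expenditure = np.array([1.2, 1.8, 2.1, 2.9, 3.4])   # hypothetical spending (in $1000s)

a, b = np.polyfit(income, expenditure, deg=1)        # slope and intercept
print(f"predicted expenditure at income 80K: {a * 80 + b:.2f}K")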
Decision Tree Introduction:
 Decision tree induction is the learning of decision trees from class-labeled training
tuples.
 A decision tree is a flowchart-like tree structure, where
 Each internal nodedenotes a test on an attribute.
 Each branch represents an outcome of the test.
 Each leaf node holds a class label.
 The topmost node in a tree is the root node.

 The construction of decision tree classifiers does not require any domain knowledge
or parameter setting, and is therefore appropriate for exploratory knowledge
discovery.
 Decision trees can handle high-dimensional data.
 Their representation of acquired knowledge in tree form is intuitive and generally
easy to assimilate by humans.
 The learning and classification steps of decision tree induction are simple and
fast. In general, decision tree classifiers have good accuracy.
 Decision tree induction algorithms have been used for classification in many
application areas, such as medicine, manufacturing and production, financial analysis,
astronomy, and molecular biology.
Algorithm for Decision Tree Induction:

The algorithm is called with three parameters:


 Data partition
 Attribute list
 Attribute selection method
o The parameter attribute list is a list of attributes describing the tuples.
o Attribute selection method specifies a heuristic procedure for selecting the attribute that
"best" discriminates the given tuples according to class.
o The tree starts as a single node, N, representing the training tuples in D.
o If the tuples in D are all of the same class, then node N becomes a leaf and is
labeled with that class.
o All of the terminating conditions are explained at the end of the algorithm.
o Otherwise, the algorithm calls Attribute selection method to determine the
splitting criterion.
o The splitting criterion tells us which attribute to test at node N by determining the
"best" way to separate or partition the tuples in D into individual classes.

There are three possible scenarios. Let A be the splitting attribute. A has v distinct values,
{a1, a2, …, av}, based on the training data.

1. A is discrete-valued:
• In this case, the outcomes of the test at node N correspond directly to the known values of A.
• A branch is created for each known value, aj, of A and labeled with that value. A need not be considered in any future partitioning of the tuples.
2. A is continuous-valued:
• In this case, the test at node N has two possible outcomes, corresponding to the conditions A <= split_point and A > split_point, respectively, where split_point is the split-point returned by Attribute selection method as part of the splitting criterion.
3. A is discrete-valued and a binary tree must be produced:
• The test at node N is of the form "A ∈ SA?", where SA is the splitting subset for A, returned by Attribute selection method as part of the splitting criterion. It is a subset of the known values of A.

[Figure: partitioning scenarios – (a) A is discrete-valued, (b) A is continuous-valued, (c) A is discrete-valued and a binary tree must be produced.]
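To illustrate how an Attribute selection method might score candidate splitting attributes, here is a minimal, assumed information-gain sketch (the tiny training set and attribute names are invented and are not taken from the notes):

# Minimal information-gain sketch for attribute selection (illustrative data).
from collections import Counter
from math import log2

def entropy(labels):
    total = len(labels)
    return -sum((n / total) * log2(n / total) for n in Counter(labels).values())

def information_gain(rows, attribute, class_attr="buys_computer"):
    labels = [r[class_attr] for r in rows]
    base = entropy(labels)
    remainder = 0.0
    for value in {r[attribute] for r in rows}:
        subset = [r[class_attr] for r in rows if r[attribute] == value]
        remainder += len(subset) / len(rows) * entropy(subset)
    return base - remainder

training = [  # hypothetical class-labeled training tuples
    {"age": "youth", "student": "yes", "buys_computer": "yes"},
    {"age": "youth", "student": "no", "buys_computer": "no"},
    {"age": "senior", "student": "no", "buys_computer": "no"},
    {"age": "middle", "student": "yes", "buys_computer": "yes"},
    {"age": "middle", "student": "no", "buys_computer": "yes"},
]

# The attribute with the highest gain would be chosen as the splitting criterion at node N.
for attr in ("age", "student"):
    print(attr, round(information_gain(training, attr), 3))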

Bayesian Classification:
• Bayesian classifiers are statistical classifiers.
• They can predict class membership probabilities, such as the probability that a given tuple
belongs to a particular class.
• Bayesian classification is based on Bayes' theorem.

Bayes' Theorem:
• Let X be a data tuple. In Bayesian terms, X is considered "evidence", and it is described by
measurements made on a set of n attributes.
• Let H be some hypothesis, such as that the data tuple X belongs to a specified class C.
• For classification problems, we want to determine P(H|X), the probability that the
hypothesis H holds given the "evidence" or observed data tuple X.
• P(H|X) is the posterior probability, or a posteriori probability, of H conditioned on X.
• Bayes' theorem is useful in that it provides a way of calculating the posterior probability
P(H|X) from P(H), P(X|H), and P(X):
P(H|X) = P(X|H) P(H) / P(X)
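A minimal naive Bayes sketch follows (an assumed illustration, not part of the original notes); it scores each class by multiplying the class prior P(C) with per-attribute conditional probabilities, which is proportional to the posterior P(C|X):

# Minimal naive Bayes sketch: score(C) = P(C) * product of P(x_i | C), invented data.
training = [  # hypothetical class-labeled tuples: (age, student) -> buys_computer
    ("youth", "yes", "yes"), ("youth", "no", "no"), ("senior", "no", "no"),
    ("middle", "yes", "yes"), ("middle", "no", "yes"), ("senior", "yes", "yes"),
]

def posterior_scores(x):
    classes = {t[-1] for t in training}
    scores = {}
    for c in classes:
        rows = [t for t in training if t[-1] == c]
        prior = len(rows) / len(training)                 # P(C)
        likelihood = 1.0
        for i, value in enumerate(x):                     # product of P(x_i | C)
            likelihood *= sum(1 for t in rows if t[i] == value) / len(rows)
        scores[c] = prior * likelihood                    # proportional to P(C | X)
    return scores

# The class with the highest score is predicted (no Laplace smoothing, for brevity).
print(posterior_scores(("youth", "yes")))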

Rule Based Classification:

IF-THEN Rules

Rule-based classifier makes use of a set of IF-THEN rules for classification.


Let us consider a rule R1,

R1: IF age = youth AND student = yes


THEN buy_computer = yes
Points to remember:
 The IF part of the rule is called rule antecedent or precondition.
 The THEN part of the rule is called rule consequent.
 The antecedent part, the condition, consists of one or more attribute tests, and these tests are
logically ANDed.
 The consequent part consists of class prediction.
Rule R1 can also be written as follows:
R1: (age = youth) ^ (student = yes) => (buys_computer = yes)
If the condition holds true for a given tuple, then the antecedent is satisfied.
Rule Extraction:
Points to remember −
To extract a rule from a decision tree
 One rule is created for each path from the root to the leaf node.
 To form a rule antecedent, each splitting criterion is logically ANDed.
 The leaf node holds the class prediction, forming the rule consequent.

Rule Induction Using Sequential Covering Algorithm


Sequential Covering Algorithm can be used to extract IF-THEN rules form the training data. We
do not require generating a decision tree first. In this algorithm, each rule for a given class covers
many of the tuples of that class.
Some of the sequential covering algorithms are AQ, CN2, and RIPPER. As per the general
strategy, the rules are learned one at a time. Each time a rule is learned, the tuples covered by the
rule are removed, and the process continues on the remaining tuples.

Note − Decision tree induction, by contrast, can be considered as learning a set of rules
simultaneously, because the path to each leaf in a decision tree corresponds to a rule.
The following is the sequential learning algorithm, where rules are learned for one class at a
time. When learning a rule for a class Ci, we want the rule to cover all the tuples of class Ci
and no tuples from any other class.

Rule Pruning
A rule is pruned for the following reasons −
 The assessment of quality is made on the original set of training data. The rule may perform
well on the training data but less well on subsequent data. That is why rule pruning is
required.
 A rule is pruned by removing a conjunct. Rule R is pruned if the pruned version of R has
greater quality, as assessed on an independent set of tuples.
FOIL is one of the simple and effective methods for rule pruning. For a given rule R,
FOIL_Prune(R) = (pos - neg) / (pos + neg)

where pos and neg are the number of positive and negative tuples covered by R, respectively.
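A minimal sketch of the FOIL_Prune computation (with invented coverage counts) is shown below; a conjunct would be removed only if the pruned rule's score does not decrease:

# FOIL_Prune(R) = (pos - neg) / (pos + neg), evaluated on an independent pruning set.
def foil_prune(pos, neg):
    return (pos - neg) / (pos + neg)

original_score = foil_prune(pos=20, neg=10)   # hypothetical coverage counts for rule R
pruned_score = foil_prune(pos=22, neg=8)      # counts after removing one conjunct

# Keep the pruned version only if its quality does not decrease.
print(pruned_score >= original_score)          # True for these made-up counts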
Classification by Back propagation:
Backpropagation is an algorithm that propagates the errors from the output nodes back to the input
nodes. Therefore, it is simply referred to as the backward propagation of errors. It is used in many
applications of neural networks in data mining, such as character recognition, signature
verification, etc.

Neural Network:
Neural networks are an information processing paradigm inspired by the human nervous system.
Just like in the human nervous system, we have biological neurons in the same way in neural
networks we have artificial neurons, artificial neurons are mathematical functions derived from
biological neurons. The human brain is estimated to have about 10 billion neurons, each
connected to an average of 10,000 other neurons. Each neuron receives a signal through a
synapse, which controls the effect of the signal on the neuron.

Backpropagation:

Backpropagation is a widely used algorithm for training feedforward neural networks. It


computes the gradient of the loss function with respect to the network weights and is very
efficient, rather than naively directly computing the gradient with respect to each individual
weight. This efficiency makes it possible to use gradient methods to train multi-layer networks
and update weights to minimize loss; variants such as gradient descent or stochastic gradient
descent are often used.
The backpropagation algorithm works by computing the gradient of the loss function with
respect to each weight via the chain rule, computing the gradient layer by layer, and iterating
backward from the last layer to avoid redundant computation of intermediate terms in the chain
rule.
Working of Backpropagation:
Neural networks use supervised learning to generate output vectors from the input vectors that the
network operates on. The network compares the generated output to the desired output and generates
an error report if the result does not match the desired output vector. It then adjusts the weights
according to the error report to get the desired output.

Backpropagation Algorithm:

Step 1: Inputs X arrive through the preconnected path.

Step 2: The input is modeled using real weights W. Weights are usually chosen randomly.
Step 3: Calculate the output of each neuron from the input layer, through the hidden layer, to the output
layer.
Step 4: Calculate the error in the outputs:
Backpropagation Error = Actual Output - Desired Output
Step 5: From the output layer, go back to the hidden layer to adjust the weights to reduce the
error.
Step 6: Repeat the process until the desired output is achieved.
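To make the steps concrete, here is a minimal, assumed sketch that trains a single sigmoid neuron with gradient descent on an invented OR-style data set; full backpropagation applies the same loop with the chain rule across multiple layers:

# Minimal single-neuron training loop illustrating Steps 1-6 (invented data).
import math, random

data = [([0.0, 0.0], 0.0), ([0.0, 1.0], 1.0), ([1.0, 0.0], 1.0), ([1.0, 1.0], 1.0)]
w = [random.uniform(-0.5, 0.5) for _ in range(2)]   # Step 2: weights chosen randomly
b = 0.0
lr = 0.5

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

for epoch in range(1000):                           # Step 6: repeat until trained
    for x, target in data:
        out = sigmoid(sum(wi * xi for wi, xi in zip(w, x)) + b)   # Step 3: forward pass
        error = out - target                        # Step 4: error in the output
        grad = error * out * (1 - out)              # gradient w.r.t. the pre-activation
        w = [wi - lr * grad * xi for wi, xi in zip(w, x)]         # Step 5: adjust weights
        b -= lr * grad

print([round(sigmoid(sum(wi * xi for wi, xi in zip(w, x)) + b), 2) for x, _ in data])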

Need for Backpropagation:


Backpropagation is "backpropagation of errors" and is very useful for training neural networks.
It is fast, easy to implement, and simple. Backpropagation does not require any parameters to be
set, except the number of inputs. Backpropagation is a flexible method because no prior
knowledge of the network is required.

Types of Backpropagation
There are two types of backpropagation networks.
 Static backpropagation: Static backpropagation is a network designed to map static inputs
for static outputs. These types of networks are capable of solving static classification
problems such as OCR (Optical Character Recognition).
 Recurrent backpropagation: Recursive backpropagation is another network used for fixed-
point learning. Activation in recurrent backpropagation is feed-forward until a fixed value is
reached. Static backpropagation provides an instant mapping, while recurrent
backpropagation does not provide an instant mapping.

Advantages:
 It is simple, fast, and easy to program.
 Only the number of inputs needs to be tuned, not any other parameter.
 It is Flexible and efficient.
 No need for users to learn any special functions.

Disadvantages:
 It is sensitive to noisy data and irregularities. Noisy data can lead to inaccurate
results.
 Performance is highly dependent on input data.
 Spending too much time training.
 The matrix-based approach is preferred over a mini-batch.
Support Vector Machines:
What are Support Vector Machines? Support Vector Machine (SVM) is a relatively
simple supervised machine learning algorithm used for classification and/or regression. It is
more preferred for classification but is sometimes very useful for regression as well. Basically,
SVM finds a hyper-plane that creates a boundary between the types of data. In 2-dimensional
space, this hyper-plane is nothing but a line. In SVM, we plot each data item in the dataset in an
N-dimensional space, where N is the number of features/attributes in the data. Next, we find the
optimal hyperplane to separate the data. So by this, you must have understood that inherently, SVM
can only perform binary classification (i.e., choose between two classes). However, there are various
techniques to use for multi-class problems.

Support Vector Machine for Multi-Class Problems: To perform SVM on multi-class problems, we
can create a binary classifier for each class of the data. The two results of each classifier will be:
 The data point belongs to that class, OR
 The data point does not belong to that class.
For example, in a class of fruits, to perform multi-class classification, we can create a binary
classifier for each fruit. For, say, the 'mango' class, there will be a binary classifier to predict if it
IS a mango OR it is NOT a mango. The classifier with the highest score is chosen as the output
of the SVM.

SVM for Complex (Non-Linearly Separable) Data: SVM works very well without any
modifications for linearly separable data. Linearly separable data is any data that can be
plotted in a graph and can be separated into classes using a straight line.

We use Kernelized SVM for non-linearly separable data. Say, we have some non-linearly
separable data in one dimension. We can transform this data into two dimensions and the data
will become linearly separable in two dimensions. This is done by mapping each 1-D data point
to a corresponding 2-D ordered pair. So for any non-linearly separable data in any dimension, we
can just map the data to a higher dimension and then make it linearly separable. This is a very
powerful and general transformation. A kernel is nothing but a measure of similarity between
data points. The kernel function in a kernelized SVM tells you, that given two data points in the
original feature space, what the similarity is between the points in the newly transformed feature
space. There are various kernel functions available, but two are very popular :
 Radial Basis Function (RBF) Kernel: The similarity between two points in the transformed
feature space is an exponentially decaying function of the distance between the vectors in
the original input space. RBF is the default kernel used in SVM.

 Polynomial Kernel: The polynomial kernel takes an additional parameter, 'degree', that
controls the model's complexity and the computational cost of the transformation.

A very interesting fact is that SVM does not actually have to perform this actual transformation
on the data points to the new high dimensional feature space. This is called the kernel trick. The
Kernel Trick: Internally, the kernelized SVM can compute these complex transformations just
in terms of similarity calculations between pairs of points in the higher dimensional feature space
where the transformed feature representation is implicit. This similarity function, which is
mathematically a kind of complex dot product, is actually the kernel of a kernelized SVM. This
makes it practical to apply SVM when the underlying feature space is complex or even infinite-
dimensional. The kernel trick itself is quite complex and is beyond the scope of this article.

Important Parameters in Kernelized SVC (Support Vector Classifier):

1. The Kernel: The kernel, is selected based on the type of data and also the type of
transformation. By default, the kernel is Radial Basis Function Kernel (RBF).
2. Gamma: This parameter decides how far the influence of a single training example reaches,
which in turn affects how tightly the decision boundaries end up surrounding points in the input
space. With a small value of gamma, points farther apart are still considered similar, so more
points are grouped together and the decision boundaries are smoother (but maybe less accurate).
Larger values of gamma require points to be very close to each other to be considered similar,
which may cause overfitting.
3. The 'C' parameter: This parameter controls the amount of regularization applied to the data.
Large values of C mean low regularization which in turn causes the training data to fit very
well (may cause overfitting). Lower values of C mean higher regularization which causes the
model to be more tolerant of errors (may lead to lower accuracy).
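The following minimal sketch (assuming scikit-learn is installed; the toy data points are invented) shows how the kernel, gamma, and C parameters discussed above are passed to a kernelized support vector classifier:

# Kernelized SVC with the kernel, gamma, and C parameters (toy, invented data).
from sklearn.svm import SVC

X = [[0.0, 0.0], [0.2, 0.1], [1.0, 1.0], [0.9, 1.1]]   # training points
y = [0, 0, 1, 1]                                        # class labels

clf = SVC(kernel="rbf", gamma=0.5, C=1.0)               # RBF kernel, gamma, regularization C
clf.fit(X, y)
print(clf.predict([[0.1, 0.0], [1.0, 0.9]]))            # expected output: [0 1]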

Pros of Kernelized SVM:


1. They perform very well on a range of datasets.
2. They are versatile: different kernel functions can be specified, or custom kernels can also be
defined for specific datatypes.
3. They work well for both high and low dimensional data.

Cons of Kernelized SVM:


1. Efficiency (running time and memory usage) decreases as the size of the training set
increases.
2. Needs careful normalization of input data and parameter tuning.
3. Does not provide a direct probability estimator.
4. Difficult to interpret why a prediction was made.

Associative Classification:
Associative Classification in Data Mining
Data mining is the process of discovering and extracting hidden patterns from different types of
data to help decision-makers make decisions. Associative classification is a common classification
learning method in data mining, which applies association rule detection methods and classification
to create classification models.

Association Rule learning in Data Mining:

Association rule learning is a machine learning method for discovering interesting relationships
between variables in large databases. It is designed to detect strong rules in the database based on
some interesting metrics. For any given multi-item transaction, association rules aim to obtain
rules that determine how or why certain items are linked.
Association rules are created by searching for common if-then patterns and using specific
criteria, namely support and confidence, to identify the key relationships. Support indicates how
frequently an item appears in the data, while confidence indicates the number of times an if-then
statement is found to be true. A third criterion, called lift, is often used to compare expected and
actual confidence, that is, how many more times the if-then statement is found to be true than
would be expected by chance. Association rules are computed from itemsets, which are composed
of two or more items, and usually consist of rules that are well represented by the data.
Different data mining techniques can be used depending on the kind of analysis and result
required, such as classification analysis, clustering analysis, and regression analysis.
Association rules are mainly used to analyze and predict customer behavior.
 In Classification analysis, it is mostly used to question, make decisions, and predict behavior.
 In Clustering analysis, it is mainly used when no assumptions are made about possible
relationships in the data.
 In Regression analysis, it is used when we want to predict a continuous dependent value from a
set of independent variables.

Associative Classification in Data Mining:

Bing Liu et al. were the first to propose associative classification, defining a model whose rules
are constrained so that "the right-hand side is the attribute of the classification class". An
associative classifier is a supervised learning model that uses association rules to assign a target
value.
The model generated by the association classifier and used to label new records consists of
association rules that produce class labels. Therefore, they can also be thought of as a list of "if-
then" clauses: if a record meets certain criteria (specified on the left side of the rule, also known
as the antecedent), it is marked (or scored) according to the rule's category on the right. Most
associative classifiers read the list of rules sequentially and apply the first matching rule to mark
new records. Association classifier rules inherit some metrics from association rules, such as
Support or Confidence, which can be used to rank or filter the rules in the model and evaluate
their quality.

Types of Associative Classification:

There are different types of Associative Classification Methods, Some of them are given below.
1. CBA (Classification Based on Associations): It uses association rule techniques to classify
data, which can be more accurate than traditional classification techniques. However, it is
sensitive to the minimum support threshold: when a lower minimum support threshold is
specified, a large number of rules are generated.
2. CMAR (Classification based on Multiple Association Rules): It uses an efficient FP-tree,
which consumes less memory and space compared to Classification Based on Associations. The
FP-tree will not always fit in the main memory, especially when the number of attributes is large.
3. CPAR (Classification based on Predictive Association Rules): Classification based on
predictive association rules combines the advantages of association classification and traditional
rule-based classification. Classification based on predictive association rules uses a greedy
algorithm to generate rules directly from training data. Furthermore, classification based on
predictive association rules generates and tests more rules than traditional rule-based classifiers
to avoid missing important rules.
Classification using frequent patterns:
In data mining, frequent pattern mining is a major concern because it plays a major role in
associations and correlations. First of all, what is a frequent pattern?
A frequent pattern is a pattern that appears frequently in a data set. By identifying frequent
patterns we can observe strongly correlated items and easily identify the similar characteristics
and associations among them. Frequent pattern mining also leads to further analysis such as
clustering, classification and other data mining tasks.
Before mining frequent patterns, we should focus on two terms, "support" and "confidence",
because they provide a measure of whether an association rule qualifies for a particular data set.
Support: the fraction of transactions in the database being mined that contain all the items of the rule.
Confidence: among the transactions that contain the rule's antecedent (the if part), the fraction that
also contain its consequent (the then part).

Let's practice this on a sample data set.

Example: one possible association rule is A => D


Total number of transactions (N) = 5
Frequency(A, D) => number of transactions containing both A and D = 3
Frequency(A) => number of transactions containing A = 4
Support(A => D) = Frequency(A, D) / N = 3 / 5
Confidence(A => D) = Frequency(A, D) / Frequency(A) = 3 / 4
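A small Python sketch of these two measures (the transaction list below is hypothetical, since the original sample table is given only as a figure; it is chosen so that the counts match the example, with N = 5, Frequency(A, D) = 3 and Frequency(A) = 4):

def support(itemset, transactions):
    """Fraction of transactions that contain every item of the itemset."""
    return sum(1 for t in transactions if itemset <= t) / len(transactions)

def confidence(antecedent, consequent, transactions):
    """support(antecedent union consequent) / support(antecedent)."""
    return (support(antecedent | consequent, transactions) /
            support(antecedent, transactions))

# hypothetical transactions (the original sample table is shown only as a figure);
# chosen so the counts match the example: N = 5, freq(A, D) = 3, freq(A) = 4
transactions = [{'A', 'B', 'D'}, {'A', 'C', 'D'}, {'A', 'D'}, {'A', 'B'}, {'B', 'C'}]

print(support({'A', 'D'}, transactions))        # 3 / 5 = 0.6
print(confidence({'A'}, {'D'}, transactions))   # (3/5) / (4/5) = 3 / 4 = 0.75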

After getting a clear idea about the two terms support and confidence, we can move on to frequent
pattern mining. In frequent pattern mining, there are two categories to be considered:

1. Mining frequent pattern with candidate generation

2. Mining frequent pattern without candidate generation

In this article, we focus on mining frequent patterns with candidate generation, using the Apriori
algorithm, which is popularly used for association mining. Let's understand the Apriori algorithm
with an example; it will help you understand the concept behind it clearly. Consider the sample
data set mentioned above and assume that the minimum support = 2.

1. Generate Candidate set 1, do the first scan and generate One item set

In this stage, we take the sample data set, count each individual item, and form frequent item set 1 (K = 1).

Candidate set 1

The "Candidate set 1" figure shows the support count of each individual item. Since the minimum
support is 2, item E is removed from Candidate set 1 as an infrequent (disqualified) item.

Frequent item set from the first scan

The frequent item set based on the minimum support value, shown in the figure "Frequent item set
from the first scan", forms the "One item set".

2. Generate Candidate set 2, do the second scan and generate Second item set

In this step, we create candidate set 2 (K = 2) and take each candidate's support count.

Candidate set 2

The "Candidate set 2" figure is generated by joining Candidate set 1 with itself and taking the
frequency of the related occurrences. Since the minimum support is 2, itemset {B, D} is removed
from Candidate set 2 as an infrequent item set.

Frequent item set from the second scan

The frequent item set based on the minimum support value is shown in the figure "Frequent item set
from the second scan"; it forms the "Second item set".

3. Generate Candidate set 3, do the third scan and generate Third item set

In this iteration, we create candidate set 3 (K = 3), take the support counts, and compare them with
the minimum support value.

Candidate set 3

Comparing Candidate set 3 with the minimum support value generates the Third item set. The
frequent item set from the third scan is the same as above.

4. Generate Candidate set 4, do the fourth scan and generate Fourth item set

Considering the frequent set, we can generate Candidate set 4 by joining Candidate set 3. The only
possible candidate is [A, B, C, D], whose support is less than the minimum support of 2. Therefore
we stop here, because no further iterations are possible. For the above data set, the frequent
patterns are [A, B, C] and [A, C, D].

Taking one of the frequent sets, {A, B, C}, the possible association rules are as follows:

1. A => B, C
2. A, B => C
3. A, C => B
4. B => A, C
5. B, C => A
6. C => A, B
We then assume that the minimum confidence = 50% and calculate the confidence of each possible
association rule. Rules whose confidence is below 50% are disqualified; the remaining rules, whose
confidence is greater than or equal to 50%, are the qualified rules.
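A compact Python sketch of this candidate-generation style of mining (an illustration of the Apriori idea, not an optimized implementation; min_sup is an absolute count, as in the example above):

from itertools import combinations

def apriori(transactions, min_sup):
    """Return all frequent itemsets (as frozensets) with their support counts."""
    items = {frozenset([i]) for t in transactions for i in t}
    frequent, k_sets = {}, items
    while k_sets:
        # scan: count each candidate and keep those meeting the minimum support
        counts = {c: sum(1 for t in transactions if c <= t) for c in k_sets}
        survivors = {c: n for c, n in counts.items() if n >= min_sup}
        frequent.update(survivors)
        # join step: build (k+1)-item candidates from the surviving k-itemsets
        keys = list(survivors)
        k_sets = {a | b for i, a in enumerate(keys) for b in keys[i + 1:]
                  if len(a | b) == len(a) + 1}
    return frequent

def rules(frequent, min_conf):
    """Generate association rules X => Y with confidence >= min_conf."""
    out = []
    for itemset, sup in frequent.items():
        for r in range(1, len(itemset)):
            for lhs in map(frozenset, combinations(itemset, r)):
                conf = sup / frequent[lhs]          # every subset of a frequent set is frequent
                if conf >= min_conf:
                    out.append((set(lhs), set(itemset - lhs), conf))
    return out

# e.g., with a transaction list such as the hypothetical one sketched earlier:
# frequent = apriori(transactions, min_sup=2); qualified = rules(frequent, min_conf=0.5)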

Lazy Learners:
The classification methods discussed so far in this chapter—decision tree induction, Bayesian
classification, rule-based classification, classification by back propagation, support vector
machines, and classification based on association rule mining—are all examples of eager
learners. Eager learners, when given a set of training tuples, will construct a generalization (i.e.,
classification) model before receiving new (e.g., test) tuples to classify. We can think of the
learned model as being ready and eager to classify previously unseen tuples. Lazy learners, in
contrast, do less work when the training tuples are presented: they simply store the tuples (or do
only a little preprocessing) and wait until a test tuple arrives before performing any generalization.
The k-nearest-neighbor classifier and the case-based reasoner described below are the two main
examples of lazy learners.
k-Nearest-Neighbor Classifiers
The k-nearest-neighbor method was first described in the early 1950s. The method is labor
intensive when given large training sets, and did not gain popularity until the 1960s when
increased computing power became available. It has since been widely used in the area of pattern
recognition.
Nearest-neighbor classifiers are based on learning by analogy, that is, by comparing a given test
tuple with training tuples that are similar to it. The training tuples are described by n attributes.
Each tuple represents a point in an n-dimensional space. In this way, all of the training tuples are
stored in an n-dimensional pattern space. When given an unknown tuple, a k-nearest-neighbor
classifier searches the pattern space for the k training tuples that are closest to the unknown tuple.
These k training tuples are the k "nearest neighbors" of the unknown tuple.
"Closeness" is defined in terms of a distance metric, such as Euclidean distance. The
Euclidean distance between two points or tuples, say, X1 = (x11, x12, ..., x1n) and X2 =
(x21, x22, ..., x2n), is

dist(X1, X2) = sqrt( (x11 − x21)² + (x12 − x22)² + ... + (x1n − x2n)² )
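A minimal sketch of a k-nearest-neighbor classifier built directly on this distance (pure Python; the training tuples, labels and the value of k are illustrative only):

from collections import Counter
from math import dist          # Euclidean distance between two points (Python 3.8+)

def knn_classify(train_X, train_y, x, k=3):
    """Label x by majority vote among its k nearest training tuples."""
    neighbors = sorted(zip(train_X, train_y), key=lambda pair: dist(pair[0], x))[:k]
    votes = Counter(label for _, label in neighbors)
    return votes.most_common(1)[0][0]

# illustrative 2-attribute training tuples with class labels
train_X = [(1.0, 1.1), (1.2, 0.9), (4.0, 4.2), (4.1, 3.9)]
train_y = ['yes', 'yes', 'no', 'no']
print(knn_classify(train_X, train_y, (1.1, 1.0), k=3))    # -> 'yes'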

Case-Based Reasoning
Case-based reasoning (CBR) classifiers use a database of problem solutions to solve new
problems. Unlike nearest-neighbor classifiers, which store training tuples as points in Euclidean
space, CBR stores the tuples or "cases" for problem solving as complex symbolic descriptions.
Business applications of CBR include problem resolution for customer service help desks, where
cases describe product-related diagnostic problems. CBR has also been applied to areas such as
engineering and law, where cases are either technical designs or legal rulings, respectively.
Medical education is another area for CBR, where patient case histories and treatments are used to
help diagnose and treat new patients.
When given a new case to classify, a case-based reasoner will first check if an identical training
case exists. If one is found, then the accompanying solution to that case is returned. If no identical
case is found, then the case-based reasoner will search for training cases having
components that are similar to those of the new case. Conceptually,
these training cases may be considered as neighbors of the new case. If cases are represented as
graphs, this involves searching for sub graphs that are similar to sub graphs within the new case.
The case-based reasoner tries to combine the solutions of the neighboring training cases in order
to propose a solution for the new case. If incompatibilities arise with the individual solutions, then
backtracking to search for other solutions may be necessary. The case-based reasoner may employ
background knowledge and problem-solving strategies in order to propose a feasible combined
solution.
UNIT 3

CLUSTERING ANALYSIS: Types of Data in Cluster Analysis, Partitioning Methods,


Hierarchical methods, Density-Based Methods, Grid-Based Methods, Probabilistic Model-
Based Clustering, Clustering High-Dimensional Data, Clustering with Constraint, Outliers and Outlier
Analysis.
Cluster Analysis:
The process of grouping a set of physical or abstract objects into classes of similar
objects is called clustering.
A cluster is a collection of data objects that are similar to one another within the same
cluster and are dissimilar to the objects in other clusters.
A cluster of data objects can be treated collectively as one group and so may be
considered as a form of data compression.
Cluster analysis tools based on k-means, k-medoids, and several other methods have also been
built into many statistical analysis software packages or systems, such as S-Plus, SPSS, and
SAS.

Applications:
Cluster analysis has been widely used in numerous applications, including market research,
pattern recognition, data analysis, and image processing.
In business, clustering can help marketers discover distinct groups in their customer bases
and characterize customer groups based on purchasing patterns.
In biology, it can be used to derive plant and animal taxonomies, categorize genes with
similar functionality, and gain insight into structures inherent in populations.
Clustering may also help in the identification of areas of similar land use in an earth
observation database and in the identification of groups of houses in a city according to
house type, value, and geographic location, as well as the identification of groups of
automobile insurance policy holders with a high average claim cost.
Clustering is also called data segmentation in some applications because clustering
partitions large data sets into groups according to their similarity.
Clustering can also be used for outlier detection. Applications of outlier detection include
the detection of credit card fraud and the monitoring of criminal activities in electronic
commerce.

Typical Requirements of Clustering In Data Mining:


 Scalability:
Many clustering algorithms work well on small data sets containing fewer than several
hundred data objects; however, a large database may contain millions of objects. Clustering
on a sample of a given large data set may lead to biased results.
Highly scalable clustering algorithms are needed.
 Ability to deal with different types of attributes:
Many algorithms are designed to cluster interval-based (numerical) data. However,
applications may require clustering other types of data, such as binary, categorical
(nominal), and ordinal data, or mixtures of these data types.
 Discovery of clusters with arbitrary shape:
Many clustering algorithms determine clusters based on Euclidean or Manhattan distance
measures. Algorithms based on such distance measures tend to find spherical clusters with
similar size and density.
However, a cluster could be of any shape. It is important to develop algorithms that can
detect clusters of arbitrary shape.
 Minimal requirements for domain knowledge to determine input parameters:
Many clustering algorithms require users to input certain parameters in cluster analysis
(such as the number of desired clusters). The clustering results can be quite sensitive to
input parameters. Parameters are often difficult to determine, especially for data sets
containing high-dimensional objects. This not only burdens users, but it also makes the
quality of clustering difficult to control.
 Ability to deal with noisy data:
Most real-world databases contain outliers or missing, unknown, or erroneous data.
Some clustering algorithms are sensitive to such data and may lead to clusters of poor
quality.
 Incremental clustering and insensitivity to the order of input records:
Some clustering algorithms cannot incorporate newly inserted data (i.e., database updates)
into existing clustering structures and, instead, must determine a new clustering from
scratch. Some clustering algorithms are sensitive to the order of input data.
That is, given a set of data objects, such an algorithm may return dramatically different
clustering depending on the order of presentation of the input objects.
It is important to develop incremental clustering algorithms and algorithms that are
insensitive to the order of input.
 High dimensionality:
A database or a data warehouse can contain several dimensions or attributes. Many
clustering algorithms are good at handling low-dimensional data, involving only two to
three dimensions. Human eyes are good at judging the quality of clustering for up to three
dimensions. Finding clusters of data objects in high dimensional space is challenging,
especially considering that such data can be sparse and highly skewed.
 Constraint-based clustering:
Real-world applications may need to perform clustering under various kinds of constraints.
Suppose that your job is to choose the locations for a given number of new automatic
banking machines (ATMs) in a city. To decide upon this, you may cluster households
while considering constraints such as the city's rivers and highway networks, and the type
and number of customers per cluster. A challenging task is to find groups of data with good
clustering behavior that satisfy specified constraints.
 Interpretability and usability:
Users expect clustering results to be interpretable, comprehensible, and usable. That is,
clustering may need to be tied to specific semantic interpretations and applications. It is
important to study how an application goal may influence the selection of clustering
features and methods.

Major Clustering Methods:

 Partitioning Methods
 Hierarchical Methods
 Density-Based Methods
 Grid-Based Methods
 Model-Based Methods

Partitioning Methods:
A partitioning method constructs k partitions of the data, where each partition represents a
cluster and k <= n. That is, it classifies the data into k groups, which together satisfy the
following requirements:
Each group must contain at least one object, and
Each object must belong to exactly one group.

A partitioning method creates an initial partitioning. It then uses an iterative relocation


technique that attempts to improve the partitioning by moving objects from one group to
another.

The general criterion of a good partitioning is that objects in the same cluster are close or
related to each other, whereas objects of different clusters are far apart or very different.
Hierarchical Methods:
A hierarchical method creates a hierarchical decomposition of the given set of data objects.
A hierarchical method can be classified as being either agglomerative or divisive, based on
how the hierarchical decomposition is formed.

 The agglomerative approach, also called the bottom-up approach, starts with each object
forming a separate group. It successively merges the objects or groups that are close to
one another, until all of the groups are merged into one or until a termination condition
holds.
 The divisive approach, also called the top-down approach, starts with all of the objects in
the same cluster. In each successive iteration, a cluster is split up into smaller clusters,
until eventually each object is in one cluster, or until a termination condition holds.
Hierarchical methods suffer from the fact that once a step (merge or split) is done, it can
never be undone. This rigidity is useful in that it leads to smaller computation costs by not
having to worry about a combinatorial number of different choices.

There are two approaches to improving the quality of hierarchical clustering:

 Perform careful analysis of object "linkages" at each hierarchical partitioning, such as in


Chameleon, or
 Integrate hierarchical agglomeration and other approaches by first using a hierarchical
agglomerative algorithm to group objects into micro clusters, and then performing macro
clustering on the micro clusters using another clustering method such as iterative
relocation.
Density-based methods:
 Most partitioning methods cluster objects based on the distance between objects. Such
methods can find only spherical-shaped clusters and encounter difficulty at discovering
clusters of arbitrary shapes.
 Other clustering methods have been developed based on the notion of density. Their
general idea is to continue growing the given cluster as long as the density in the
neighborhood exceeds some threshold; that is, for each data point within a given
cluster, the neighborhood of a given radius has to contain at least a minimum number of
points. Such a method can be used to filter out noise (outliers)and discover clusters of
arbitrary shape.
 DBSCAN and its extension, OPTICS, are typical density-based methods that grow
clusters according to a density-based connectivity analysis. DENCLUE is a method that
clusters objects based on the analysis of the value distributions of density functions.
Grid-Based Methods:
 Grid-based methods quantize the object space into a finite number of cells that form a
grid structure.
 All of the clustering operations are performed on the grid structure i.e., on the quantized
space. The main advantage of this approach is its fast processing time, which is
typically independent of the number of data objects and dependent only on the number
of cells in each dimension in the quantized space.
 STING is a typical example of a grid-based method. Wave Cluster applies wavelet
transformation for clustering analysis and is both grid-based and density-based.

Model-Based Methods:
 Model-based methods hypothesize a model for each of the clusters and find the best fit
of the data to the given model.
 A model-based algorithm may locate clusters by constructing a density function that
reflects the spatial distribution of the data points.
 It also leads to a way of automatically determining the number of clusters based on
standard statistics, taking "noise" or outliers into account and thus yielding robust
clustering methods.
Tasks in Data Mining:
 Clustering High-Dimensional Data
 Constraint-Based Clustering
Clustering High-Dimensional Data:
It is a particularly important task in cluster analysis because many applications
require the analysis of objects containing a large number of features or dimensions.
For example, text documents may contain thousands of terms or keywords as
features, and DNA micro array data may provide information on the expression
levels of thousands of genes under hundreds of conditions.
Clustering high-dimensional data is challenging due to the curse of dimensionality.
Many dimensions may not be relevant. As the number of dimensions increases, the data
become increasingly sparse, so that the distance measurement between pairs of points
becomes meaningless and the average density of points anywhere in the data is likely to
be low. Therefore, a different clustering methodology needs to
be developed for high-dimensional data.
CLIQUE and PROCLUS are two influential subspace clustering methods, which
search for clusters in subspaces of the data, rather than over the entire data space.
Frequent pattern–based clustering, another clustering methodology, extracts distinct
frequent patterns among subsets of dimensions that occur frequently. It uses such
patterns to group objects and generate meaningful clusters.

Constraint-Based Clustering:
It is a clustering approach that performs clustering by incorporation of user-specified
or application-oriented constraints.
A constraint expresses a user's expectation or describes properties of the desired
clustering results, and provides an effective means for communicating with the
clustering process.
Various kinds of constraints can be specified, either by a user or as per application
requirements.
Spatial clustering deals with the existence of obstacles and with clustering under user-
specified constraints. In addition, semi-supervised clustering employs pairwise
constraints in order to improve the quality of the resulting clustering.

Classical Partitioning Methods:


The most well-known and commonly used partitioning methods are
 The k-Means Method
 k-Medoids Method
Centroid-Based Technique: The K-Means Method:
The k-means algorithm takes the input parameter, k, and partitions a set of n objects into k
clusters so that the resulting intracluster similarity is high but the intercluster similarity is
low.
Cluster similarity is measured in regard to the mean value of the objects in a cluster, which
can be viewed as the cluster's centroid or center of gravity.
The k-means algorithm proceeds as follows.
First, it randomly selects k of the objects, each of which initially represents a cluster
mean or center.
For each of the remaining objects, an object is assigned to the cluster to which it is the
most similar, based on the distance between the object and the cluster mean.
It then computes the new mean for each cluster.
This process iterates until the criterion function converges.

Typically, the square-error criterion is used, defined as

E = Σ (i = 1 to k) Σ (p ∈ Ci) |p − mi|²

where E is the sum of the square error for all objects in the data set, p is the point in space
representing a given object, and mi is the mean of cluster Ci.

The k-means partitioning algorithm:


The k-means algorithm for partitioning, where each cluster's center is represented by the mean
value of the objects in the cluster.
Clustering of a set of objects based on the k-means method
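A short NumPy sketch of this iteration (assuming NumPy is available; the initial means are picked at random from the objects and the loop stops when the assignments no longer change):

import numpy as np

def k_means(X, k, n_iter=100, seed=0):
    """Partition the rows of X into k clusters, each represented by its mean (centroid)."""
    rng = np.random.default_rng(seed)
    centers = X[rng.choice(len(X), size=k, replace=False)]     # random initial means
    labels = None
    for _ in range(n_iter):
        # assignment step: each object goes to the cluster with the closest mean
        dists = np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=2)
        new_labels = dists.argmin(axis=1)
        if labels is not None and np.array_equal(new_labels, labels):
            break                                              # assignments stable: converged
        labels = new_labels
        # update step: recompute each cluster mean (keep the old center if a cluster empties)
        centers = np.array([X[labels == j].mean(axis=0) if np.any(labels == j) else centers[j]
                            for j in range(k)])
    return labels, centers

# illustrative usage with two obvious groups of 2-D points
X = np.array([[1.0, 1.0], [1.2, 0.8], [0.9, 1.1], [8.0, 8.0], [8.2, 7.9], [7.8, 8.1]])
labels, centers = k_means(X, k=2)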

The k-Medoids Method:

The k-means algorithm is sensitive to outliers because an object with an extremely large
value may substantially distort the distribution of data. This effect is particularly
exacerbated due to the use of the square-error function.
Instead of taking the mean value of the objects in a cluster as a reference point, we can pick
actual objects to represent the clusters, using one representative object per cluster. Each
remaining object is clustered with the representative object to which it is the most similar.
The partitioning method is then performed based on the principle of minimizing the sum of
the dissimilarities between each object and its corresponding reference point. That is, an
absolute-error criterion is used, defined as

E = Σ (j = 1 to k) Σ (p ∈ Cj) |p − oj|

where E is the sum of the absolute error for all objects in the data set, p is the point in space
representing a given object in cluster Cj, and oj is the representative object of Cj.


The initial representative objects are chosen arbitrarily. The iterative process of replacing
representative objects by non-representative objects continues as long as the quality of the
resulting clustering is improved.
This quality is estimated using a cost function that measures the average dissimilarity
between an object and the representative object of its cluster.
To determine whether a non-representative object, o_random, is a good replacement for a
current representative object, oj, the following four cases are examined for each of the
non-representative objects.

Case 1:

p currently belongs to representative object oj. If oj is replaced by o_random as a representative object
and p is closest to one of the other representative objects oi, i ≠ j, then p is reassigned to oi.

Case 2:

p currently belongs to representative object oj. If oj is replaced by o_random as a representative object
and p is closest to o_random, then p is reassigned to o_random.

Case 3:

p currently belongs to representative object oi, i ≠ j. If oj is replaced by o_random as a representative
object and p is still closest to oi, then the assignment does not change.

Case 4:

p currently belongs to representative object oi, i ≠ j. If oj is replaced by o_random as a representative
object and p is closest to o_random, then p is reassigned to o_random.
Four cases of the cost function for k-medoids clustering

The k-Medoids Algorithm:

The k-medoids algorithm for partitioning based on medoid or central objects.


The k-medoids method is more robust than k-means in the presence of noise and outliers,
because a medoid is less influenced by outliers or other extreme values than a mean. However,
its processing is more costly than the k-means method.
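A rough PAM-style sketch of the k-medoids idea (assuming NumPy; medoids are actual objects, and a swap with a non-representative object is kept only when it lowers the total absolute error):

import numpy as np

def k_medoids(X, k, n_iter=50, seed=0):
    """PAM-style partitioning: medoids are actual objects; swaps are kept
    only when they lower the sum of absolute errors (total dissimilarity)."""
    rng = np.random.default_rng(seed)
    dist = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=2)   # pairwise distances
    medoids = list(rng.choice(len(X), size=k, replace=False))

    def cost(meds):
        # each object contributes its distance to the closest representative object
        return dist[:, meds].min(axis=1).sum()

    best = cost(medoids)
    for _ in range(n_iter):
        improved = False
        for i in range(k):
            for o in range(len(X)):
                if o in medoids:
                    continue
                trial = medoids.copy()
                trial[i] = o                       # try replacing medoid i by object o
                c = cost(trial)
                if c < best:
                    medoids, best, improved = trial, c, True
        if not improved:
            break                                  # no beneficial swap: quality no longer improves
    labels = dist[:, medoids].argmin(axis=1)       # assign each object to its nearest medoid
    return labels, medoids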

Constraint-Based Cluster Analysis:


Constraint-based clustering finds clusters that satisfy user-specified preferences or constraints.
Depending on the nature of the constraints, constraint-based clustering may adopt rather different
approaches.
There are a few categories of constraints.
 Constraints on individual objects:

We can specify constraints on the objects to be clustered. In a real estate application, for
example, one may like to spatially cluster only those luxury mansions worth over a million
dollars. This constraint confines the set of objects to be clustered. It can easily be
handled by preprocessing, after which the problem reduces to an instance of unconstrained
clustering.

 Constraints on the selection of clustering parameters:

A user may like to set a desired range for each clustering parameter. Clustering parameters
are usually quite specific to the given clustering algorithm. Examples of parameters include
k, the desired number of clusters in a k-means algorithm; or ε (the neighborhood radius) and the
minimum number of points in the DBSCAN algorithm. Although such user-specified parameters may
strongly influence the clustering results, they are usually confined to the algorithm itself.
Thus, their fine tuning and processing are usually not considered a form of constraint-based
clustering.
 Constraints on distance or similarity functions:

We can specify different distance or similarity functions for specific attributes of the
objects to be clustered, or different distance measures for specific pairs of objects. When
clustering sportsmen, for example, we may use different weighting schemes for height,
body weight, age, and skill level. Although this will likely change the mining results, it
may not alter the clustering process per se. However, in some cases, such changes may
make the evaluation of the distance function nontrivial, especially when it is tightly
intertwined with the clustering process.
 User-specified constraints on the properties of individual clusters:
A user may like to specify desired characteristics of the resulting clusters, which may
strongly influence the clustering process.
 Semi-supervised clustering based on partial supervision:
The quality of unsupervised clustering can be significantly improved using some weak
form of supervision. This may be in the form of pairwise constraints (i.e., pairs of objects
labeled as belonging to the same or different cluster). Such a constrained clustering process
is called semi-supervised clustering.
Outlier Analysis:

There exist data objects that do not comply with the general behavior or model of the data.
Such data objects, which are grossly different from or inconsistent with the remaining set
of data, are called outliers.
Many data mining algorithms try to minimize the influence of outliers or eliminate them all
together. This, however, could result in the loss of important hidden information because
one person's noise could be another person's signal. In other words, the outliers may be of
particular interest, such as in the case of fraud detection, where outliers may indicate
fraudulent activity. Thus, outlier detection and analysis is an interesting data mining task,
referred to as outlier mining.
It can be used in fraud detection, for example, by detecting unusual usage of credit cards or
telecommunication services. In addition, it is useful in customized marketing for
identifying the spending behavior of customers with extremely low or extremely high
incomes, or in medical analysis for finding unusual responses to various medical
treatments.

Outlier mining can be described as follows: Given a set of n data points or objects and k, the
expected number of outliers, find the top k objects that are considerably dissimilar,
exceptional, or inconsistent with respect to the remaining data. The outlier mining problem
can be viewed as two sub problems:
Define what data can be considered as inconsistent in a given data set, and
find an efficient method to mine the outliers so defined.
Types of outlier detection:

 Statistical Distribution-Based Outlier Detection


 Distance-Based Outlier Detection
 Density-Based Local Outlier Detection
 Deviation-Based Outlier Detection

Statistical Distribution-Based Outlier Detection:


The statistical distribution-based approach to outlier detection assumes a distribution or
probability model for the given data set (e.g., a normal or Poisson distribution) and then
identifies outliers with respect to the model using a discordancy test. Application of the
test requires knowledge of the data set parameters (such as the assumed data distribution),
knowledge of distribution parameters (such as the mean and variance), and the expected number of outliers.
A statistical discordancy test examines two hypotheses:
A working hypothesis
An alternative hypothesis
A working hypothesis, H, is a statement that the entire data set of n objects comes from
an initial distribution model, F, that is,

H: oi ∈ F, where i = 1, 2, ..., n
The hypothesis is retained if there is no statistically significant evidence supporting its


rejection. A discordancy test verifies whether an object, oi, is significantly large (or
small) in relation to the distribution F. Different test statistics have been proposed for use
as a discordancy test, depending on the available knowledge of the data. Assuming that
some statistic, T, has been chosen for discordancy testing, and the value of the statistic
for object oi is vi, then the distribution of T is constructed. Significance probability,
SP(vi)=Prob(T > vi), is evaluated. If SP(vi) is sufficiently small, then oi is discordant and
the working hypothesis is rejected.
An alternative hypothesis, H, which states that oi comes from another distribution model,
G, is adopted. The result is very much dependent on which model F is chosen because
oi may be an outlier under one model and a perfectly valid value under another. The
alternative distribution is very important in determining the power of the test, that is, the
probability that the working hypothesis is rejected when oi is really an outlier.
There are different kinds of alternative distributions.
Inherent alternative distribution:
In this case, the working hypothesis that all of the objects come from distribution F is
rejected in favor of the alternative hypothesis that all of the objects arise from another
distribution, G:
H: oi ∈ G, where i = 1, 2, ..., n
F and G may be different distributions or differ only in parameters of the same
distribution.
There are constraints on the form of the G distribution in that it must have potential to
produce outliers. For example, it may have a different mean or dispersion, or a longer
tail.
Mixture alternative distribution:
The mixture alternative states that discordant values are not outliers in the F population,
but contaminants from some other population,
G. In this case, the alternative hypothesis is

H: oi ∈ (1 − λ)F + λG, where i = 1, 2, ..., n
Slippage alternative distribution:


This alternative states that all of the objects (apart from some prescribed small number)
arise independently from the initial model, F, with its given parameters, whereas the
remaining objects are independent observations from a modified version of F in which
the parameters have been shifted.
There are two basic types of procedures for detecting outliers:
Block procedures:
In this case, either all of the suspect objects are treated as outliers or all of them are accepted
as consistent.

Consecutive procedures:
An example of such a procedure is the inside out procedure. Its main idea is that the object
that is least likely to be an outlier is tested first. If it is found to be an outlier, then all of the
more extreme values are also considered outliers; otherwise, the next most extreme object is
tested, and so on. This procedure tends to be more effective than block procedures.

Distance-Based Outlier Detection:


The notion of distance-based outliers was introduced to counter the main limitations imposed
by statistical methods. An object, o, in a data set, D, is a distance-based (DB) outlier with
parameters pct and dmin, that is, a DB(pct, dmin)-outlier, if at least a fraction, pct, of the
objects in D lie at a distance greater than dmin from o. In other words, rather than relying on
statistical tests, we can think of distance-based outliers as those objects that do not have
enough neighbors, where neighbors are defined based on distance from the given object. In
comparison with statistical-based methods, distance-based outlier detection generalizes the
ideas behind discordancy testing for various standard distributions. Distance-based outlier
detection avoids the excessive computation that can be associated with fitting the observed
distribution into some standard distribution and in selecting discordancy tests.
For many discordancy tests, it can be shown that if an object, o, is an outlier according to the
given test, then o is also a DB(pct, dmin)-outlier for some suitably defined pct and dmin.
For example, if objects that lie three or more standard deviations from the mean
are considered to be outliers, assuming a normal distribution, then this definition can
be generalized by a DB(0.9988, 0.13σ)-outlier, where σ is the standard deviation.
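A naive sketch of this DB(pct, dmin) definition (it computes all pairwise distances, so it is quadratic in the number of objects; the index-based, nested-loop and cell-based algorithms described next exist precisely to avoid this cost):

import numpy as np

def db_outliers(X, pct, dmin):
    """Indices of DB(pct, dmin)-outliers: objects for which at least a fraction
    pct of the remaining objects lie at a distance greater than dmin."""
    n = len(X)
    dist = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=2)   # all pairwise distances
    outliers = []
    for i in range(n):
        far = np.sum(dist[i] > dmin) / (n - 1)    # fraction of the other objects beyond dmin
        if far >= pct:
            outliers.append(i)
    return outliers

# illustrative data: a tight group plus one isolated point
X = np.array([[1.0, 1.0], [1.1, 0.9], [0.9, 1.1], [1.0, 1.2], [9.0, 9.0]])
print(db_outliers(X, pct=0.95, dmin=3.0))    # -> [4]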
Several efficient algorithms for mining distance-based outliers have been developed.
Index-based algorithm:
Given a data set, the index-based algorithm uses multidimensional indexing structures, such
as R-trees or k-d trees, to search for neighbors of each object o within radius dmin around
that object. Let M be the maximum number of objects within the dmin-neighborhood of an
outlier. Therefore, once M + 1 neighbors of object o are found, it is clear that o is not an outlier.
This algorithm has a worst-case complexity of O(k·n²), where n is the number of objects in
the data set and k is the dimensionality. The index-based algorithm scales well as k increases.
However, this complexity evaluation takes only the search time into account, even though the
task of building an index in itself can be computationally intensive.
Nested-loop algorithm:
The nested-loop algorithm has the same computational complexity as the index-based
algorithm but avoids index structure construction and tries to minimize the number of I/Os. It
divides the memory buffer space into two halves and the data set into several logical blocks.
By carefully choosing the order in which blocks are loaded into each half, I/O efficiency can
be achieved.
Cell-based algorithm:
To avoid O(n²) computational complexity, a cell-based algorithm was developed for memory-
resident data sets. Its complexity is O(c^k + n), where c is a constant depending on the number
of cells and k is the dimensionality.

In this method, the data space is partitioned into cells with a side length equal to dmin/(2·√k).
Each cell has two layers surrounding it. The first layer is one cell thick, while the second is
⌈2·√k − 1⌉ cells thick (rounded up to the closest integer). The algorithm counts outliers on a
cell-by-cell rather than an object-by-object basis. For a given cell, it accumulates three
counts: the number of objects in the cell, in the cell and the first layer together, and in the
cell and both layers together. Let's refer to these counts as cell count, cell + 1 layer count,
and cell + 2 layers count, respectively.

Let M be the maximum number of outliers that can exist in the dmin-neighborhood of an
outlier.
An object, o, in the current cell is considered an outlier only if cell + 1 layer count is less
than or equal to M. If this condition does not hold, then all of the objects in the cell can
be removed from further investigation as they cannot be outliers.
If cell + 2 layers count is less than or equal to M, then all of the objects in the cell are
considered outliers. Otherwise, if this number is more than M, then it is possible that
some of the objects in the cell may be outliers. To detect these outliers, object-by-object
processing is used where, for each object, o, in the cell, objects in the second layer of o
are examined. For objects in the cell, only those objects having no more than M points in
their dmin-neighborhoods are outliers. The dmin-neighborhood of an object consists
of the object's cell, all of its first layer, and some of its second layer.
A variation to the algorithm is linear with respect to n and guarantees that no more than three passes
over the data set are required. It can be used for large disk-resident data sets, yet does not scale well
for high dimensions.
UNIT 4
WEB AND TEXT MINING: Introduction, web mining, web content mining, web structure
mining, web usage mining, Text mining, unstructured text, episode rule discovery for texts,
hierarchy of categories, text clustering.

Introduction to Web mining:


Web Mining is the process of Data Mining techniques to automatically discover and extract
information from Web documents and services. The main purpose of web mining is discovering
useful information from the World-Wide Web and its usage patterns.

There are three types of web mining: web content mining, web structure mining, and web usage mining.


1. Web Content Mining

 Web content mining can be used for mining useful data, information and knowledge
from web page content.
 Web structure mining helps to find useful knowledge or information patterns from the
structure of hyperlinks.
 Due to the heterogeneity and absence of structure in web data, automated discovery of new
knowledge patterns can be challenging to some extent.
 Web content mining performs scanning and mining of the text, images and groups of web
pages according to the content of the input (query), by displaying the list in search engines.
For example, if a user wants to search for a particular book, the search engine provides a
list of suggestions.

2. Web Usage Mining

 Web usage mining is used for mining the web log records (access information of web
pages) and helps to discover the user access patterns of web pages.
 Web server registers a web log entry for every web page.
 Analysis of similarities in web log records can be useful to identify the potential customers
for e-commerce companies.

Web content mining:


Web content mining is sometimes referred to as text mining of the Web. Content mining is the browsing
and mining of the text, images, and graphs of a Web page to decide the relevance of the content to the
search query. This browsing is done after the clustering of web pages through structure mining, and it
supports the results according to their relevance to the submitted query.
With the large amount of data available on the World Wide Web, content mining supplies search
engines with result lists ordered by relevance to the keywords in the query.
It can be defined as the process of extracting essential information from natural language text. Much
of the data we generate via text messages, files, emails and documents is written in ordinary
language, and text mining can draw useful insights or patterns from such data.
Text mining is an automatic procedure that uses natural language processing to derive valuable
insights from unstructured text. By changing data into information that machines can process, text
mining automates the task of classifying texts by sentiment, subject, and intent.
On the Web, text mining is directed by the user's search terms in search engines. The search engine
browses clusters of web content and scans specific web pages within those clusters.
The results are pages ranked from the highest level of relevance to the search content down to the
lowest. Though a search engine may return hundreds of web pages related to the search content,
this kind of web mining helps reduce the amount of irrelevant data. Web text mining is most
effective when used on a content database dealing with specific subjects.
For instance, online universities need a library system that can recall articles related to their main
areas of study. Such a focused content database pulls only the data within those subjects,
supporting more specific results for search queries in search engines.
Returning only the most relevant data gives higher-quality results. This increase in productivity is
exactly the motivation for content mining of text and visuals. The purpose of this type of data
mining is to gather, classify, organize, and deliver the best possible data available on the WWW to
the user requesting it.

Web structure mining:


Web structure mining is a tool that recognizes the relationships between web pages that are linked,
either by information or by direct hyperlink connections. This link structure can be described by a
web structure schema and analyzed with database techniques for Web pages.

Web usage mining:


 Extends work of basic search engines
 Search Engines
– IR application
– Keyword based
– Similarity between query and document
– Crawlers
– Indexing
– Profiles
– Link analysis
Web Usage Mining Applications
 Personalization
 Improve structure of a site's Web pages
 Aid in caching and prediction of future page references
 Improve design of individual pages
 Improve effectiveness of e-commerce (sales and advertising)
Web Usage Mining Activities
 Preprocessing Web log
o Cleanse
o Remove extraneous information
o Sessionize
Session: Sequence of pages referenced by one user at a sitting.
 Pattern Discovery
o Count patterns that occur in sessions
o Pattern is sequence of pages references in session.
o Similar to association rules
 Transaction: session
 Itemset: pattern (or subset)
 Order is important
 Pattern Analysis

Web Usage Mining Issues


 Identification of exact user not possible.
 Exact sequence of pages referenced by a user not possible due to caching.
 Session not well defined
 Security, privacy, and legal issues

Text mining:
Text mining is the process of exploring and analyzing large amounts of unstructured text data aided
by software that can identify concepts, patterns, topics, keywords and other attributes in the data.
It's also known as text analytics, although some people draw a distinction between the two terms;
in that view, text analytics refers to the application that uses text mining techniques to sort through
data sets.

Text mining has become more practical for data scientists and other users due to the development
of big data platforms and deep learning algorithms that can analyze massive sets of unstructured
data.

Mining and analyzing text helps organizations find potentially valuable business insights in
corporate documents, customer emails, call center logs, verbatim survey comments, social network
posts, medical records and other sources of text-based data. Increasingly, text mining capabilities
are also being incorporated into AI chatbots and virtual agents that companies deploy to provide
automated responses to customers as part of their marketing, sales and customer service operations.

How text mining works


Text mining is similar in nature to data mining, but with a focus on text instead of more structured
forms of data. However, one of the first steps in the text mining process is to organize and structure
the data in some fashion so it can be subjected to both qualitative and quantitative analysis.

Doing so typically involves the use of natural language processing (NLP) technology, which
applies computational linguistics principles to parse and interpret data sets.

The upfront work includes categorizing, clustering and tagging text; summarizing data sets;
creating taxonomies; and extracting information about things like word frequencies and
relationships between data entities. Analytical models are then run to generate findings that can
help drive business strategies and operational actions.
Applications of text mining
Sentiment analysis is a widely used text mining application that can track customer sentiment about
a company. Also known as opinion mining, sentiment analysis mines text from online reviews,
social networks, emails, call center interactions and other data sources to identify common threads
that point to positive or negative feelings on the part of customers. Such information can be used to
fix product issues, improve customer service and plan new marketing campaigns, among other
things.

Benefits of text mining


Using text mining and analytics to gain insight into customer sentiment can help companies detect
product and business problems and then address them before they become big issues that affect
sales. Mining the text in customer reviews and communications can also identify desired new
features to help strengthen product offerings. In each case, the technology provides an opportunity
to improve the overall customer experience, which will hopefully result in increased revenue and
profits.

Text mining challenges and issues


Text mining can be challenging because the data is often vague, inconsistent and contradictory.
Efforts to analyze it are further complicated by ambiguities that result from differences
in syntax and semantics, as well as the use of slang, sarcasm, regional dialects and technical
language specific to individual vertical industries. As a result, text mining algorithms must be
trained to parse such ambiguities and inconsistencies when they categorize, tag and summarize sets
of text data.
Unstructured text:
Mining of unstructured text delivers new insights by uncovering previously unknown
information, detecting patterns and trends, and identifying connections between seemingly
unrelated pieces of data.

Episode rule discovery for texts:
An episode is a collection of events (for example, feature terms or words) that occur together within a
given window of a sequence, possibly with an ordering imposed on them. Episode rules generalize
association rules to such sequential data: a rule states that when one episode occurs within a window
of the text, another episode tends to occur as well, with support and confidence measured over the
windows. Applied to text, a document is treated as a sequence of feature terms, and the discovered
episode rules capture co-occurrence and ordering regularities among those terms.

Hierarchy of categories:
Concept Hierarchy

 A concept hierarchy defines a sequence of mappings from a set of low-level concepts to


higher-level, more general concepts

Methods for concept hierarchy generation for numerical data

1. Binning
o In binning, we first sort the data and partition it into (equi-depth) bins; we can then
smooth by bin means, bin medians, or bin boundaries (see the sketch after this list).
2. Histogram analysis
o Histogram is a popular data reduction technique
o Divide data into buckets and store average (sum) for each bucket
o Can be constructed optimally in one dimension using dynamic programming
o Related to quantization problems.
3. Clustering analysis
o Partition data set into clusters, and one can store cluster representation only
o Can be very effective if data is clustered, but not if data is "smeared"
o Can have hierarchical clustering and be stored in multi-dimensional index tree
structures
4. Entropy-based discretization
5. Segmentation by natural partitioning
o The 3-4-5 rule can be used to segment numeric data into relatively uniform, "natural"
intervals.
o If an interval covers 3, 6, 7 or 9 distinct values at the most significant digit, partition
the range into 3 equi-width intervals
o If it covers 2, 4, or 8 distinct values at the most significant digit, partition the range
into 4 intervals
o If it covers 1, 5, or 10 distinct values at the most significant digit, partition the range
into 5 intervals
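A small sketch of the binning method from item 1 above: equi-depth (equal-frequency) bins followed by smoothing by bin means (the price values are illustrative only):

def equi_depth_bins(values, n_bins):
    """Sort the data and partition it into (roughly) equal-frequency bins."""
    data = sorted(values)
    size = len(data) // n_bins
    return [data[i * size: (i + 1) * size] if i < n_bins - 1 else data[i * size:]
            for i in range(n_bins)]

def smooth_by_bin_means(values, n_bins):
    """Replace every value in a bin by the bin mean (one simple smoothing choice)."""
    return [[sum(b) / len(b)] * len(b) for b in equi_depth_bins(values, n_bins)]

prices = [4, 8, 15, 21, 21, 24, 25, 28, 34]            # illustrative sorted values
print(smooth_by_bin_means(prices, 3))
# [[9.0, 9.0, 9.0], [22.0, 22.0, 22.0], [29.0, 29.0, 29.0]]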

Text clustering:
Text clustering groups similar documents (or smaller units such as paragraphs or sentences) into
clusters without using predefined labels. Documents are usually represented as term vectors (for
example, with TF-IDF weights), a similarity measure such as cosine similarity is chosen, and a
clustering method such as k-means or hierarchical clustering is applied. The resulting clusters can be
used to organize search results, build topic hierarchies, and support browsing of large document
collections.
UNIT 5

TEMPORAL AND SPATIAL DATA MINING: Introduction; Temporal Data Mining ,


Temporal Association Rules, Sequence Mining, GSP algorithm, SPADE,SPIRIT Episode
Discovery, Time Series Analysis, Spatial Mining, Spatial Mining Tasks, Spatial Clustering,
Data Mining Applications: Data Mining for Retail and Telecommunication industries.
Temporal Data Mining:

Temporal Data Mining is the process of extracting useful information from the pool of temporal
data. It is concerned with analyzing temporal data to extract and find the temporal patterns and
regularities in the datasets.

The various tasks of Temporal Data Mining are as follows:

 Data Characterization and Comparison


 Cluster Analysis
 Classification
 Association rules
 Prediction and Trend Analysis
 Pattern Analysis

The main objective of Temporal Data Mining is to find the temporal patterns, trends, and relations
within the data and extract meaningful information from the data to visualize how the data trend
has changed over a course of time.

Temporal Data Mining includes the processing of time-series data, and sequences of data to
determine and compute the values of the same attributes over multiple time points

Temporal Association Rules:


A temporal association rule expresses that a set of items tends to appear along with another set of
items in the same transactions, in a specific time frame.
A Temporal Association Rule for d is an expression of the form X ⇒ Y [t1, t2], where X ⊆ R, Y ⊆ R
\ X, and [t1, t2] is a time frame corresponding to the lifespan of X ∪ Y expressed in a granularity
determined by the user.
A temporal association rule has three factors associated with it: support, temporal support, both
already defined, and confidence, which will be defined next.

Phase 1. Find every set of items (itemsets) X ⊆ R that is frequent, i.e. whose frequency exceeds the
established minimum support s.

Phase 2. Use the frequent itemsets X to find the rules: test, for every Y ⊂ X with Y ≠ ∅, whether the
rule X \ Y ⇒ Y is satisfied with enough confidence, i.e. whether it exceeds the established minimum
confidence q.

In the following paragraph, we introduce suitable modifications to support temporal association


rules discovery:
Phase 1T. Find every itemset X ⊆ R such that X is frequent in its lifespan lX, i.e. s(X, lX, d) ≥ s and
|lX| ≥ t.

Phase 2T. Use the frequent itemsets X to find the rules: verify, for every Y ⊂ X with Y ≠ ∅, whether
the rule X \ Y ⇒ Y [t1, t2] is satisfied with enough confidence, in other words, whether it exceeds the
minimum confidence q established in the interval [t1, t2].
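A small sketch of the idea behind Phase 1T, under the simplifying assumptions that each transaction carries a timestamp and that the lifespan of an itemset is the interval between its first and last occurrence (granularity handling is omitted):

def lifespan(itemset, transactions):
    """[t1, t2]: first and last timestamps of transactions containing the itemset."""
    times = [t for t, items in transactions if itemset <= items]
    return (min(times), max(times)) if times else None

def temporal_support(itemset, transactions):
    """Support of the itemset measured only over the transactions in its lifespan."""
    span = lifespan(itemset, transactions)
    if span is None:
        return 0.0
    in_span = [items for t, items in transactions if span[0] <= t <= span[1]]
    return sum(1 for items in in_span if itemset <= items) / len(in_span)

# transactions as (timestamp, itemset) pairs -- illustrative only
transactions = [(1, {'bread', 'milk'}), (2, {'milk'}), (3, {'bread', 'milk'}),
                (4, {'beer'}), (5, {'bread'})]
print(temporal_support({'bread', 'milk'}, transactions))   # 2 of 3 transactions in [1, 3]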

Sequence Mining:
Sequential pattern mining is the mining of frequently appearing series of events or subsequences as
patterns. An instance of a sequential pattern is that users who purchase a Canon digital camera are
likely to purchase an HP color printer within a month.
For retail data, sequential patterns are useful for shelf placement and promotions. This industry,
along with telecommunications and other businesses, can also use sequential patterns for targeted
marketing, customer retention, and many other tasks.
There are several areas in which sequential patterns can be used such as Web access pattern
analysis, weather prediction, production processes, and web intrusion detection.
Given a set of sequences, where each sequence consists of a list of events (or elements) and each
event consists of a set of items, and given a user-specified minimum support threshold min_sup,
sequential pattern mining discovers all frequent subsequences, i.e., the subsequences whose
occurrence frequency in the set of sequences is no less than min_sup.
Let I = {I1, I2, ..., Ip} be the set of all items. An itemset is a nonempty set of items. A sequence is an
ordered list of events. A sequence s is denoted {e1, e2, e3, ..., el}, where event e1 occurs before e2,
which occurs before e3, and so on. Event ej is also called an element of s.
In the case of customer purchase data, an event defines a shopping trip in which a customer
purchases items at a specific store. The event is an itemset, i.e., an unordered list of items that the
customer purchased during the trip. The itemset (or event) is denoted (x1x2···xq), where xk is an
item.
An item can occur at most once in an event of a sequence, but can occur multiple times in different
events of a sequence. The number of instances of items in a sequence is called the length of the
sequence. A sequence with length l is called an l-sequence.
A sequence database, S, is a set of tuples, (SID, s), where SID is a sequence_ID and s is a
sequence. For instance, S contains sequences for all customers of the store. A tuple (SID, s) is said
to contain a sequence α if α is a subsequence of s.
This formulation of sequential pattern mining is an abstraction of customer-shopping sequence
analysis; scalable techniques for mining such data are discussed below. However, several sequential
pattern mining applications are not covered by this formulation. For instance, when analyzing Web
clickstream sequences, gaps between clicks become essential if one wants to predict what the next
click might be.
In DNA sequence analysis, approximate patterns become useful because DNA sequences can
contain (symbol) insertions, deletions, and mutations. Such diverse requirements can be handled
through constraint relaxation or application-specific extensions.
GSP algorithm:
GSP is a very important algorithm in data mining. It is used in sequence mining from large
databases. Almost all sequence mining algorithms are basically based on the Apriori algorithm. GSP
uses a level-wise paradigm for finding all the sequence patterns in the data. It starts with finding
the frequent items of size one and then passes that as input to the next iteration of the GSP
algorithm. The database is passed multiple times to this algorithm. In each iteration, GSP
removes all the non-frequent itemsets. This is done based on a threshold frequency which is
called support. Only those itemsets are kept whose frequency is greater than the support count.
After the first pass, GSP finds all the frequent sequences of length 1, which are called 1-sequences.
These form the input to the next pass, in which the candidate 2-sequences are generated. At the end
of that pass, GSP has generated all frequent 2-sequences, which form the input for the candidate
3-sequences. The algorithm is called recursively until no more frequent itemsets are found.
Basic of Sequential Pattern (GSP) Mining:
 Sequence: A sequence is formally defined as the ordered set of items {s1, s2, s3, …, sn}. As
the name suggests, it is the sequence of items occurring together. It can be considered as a
transaction or purchased items together in a basket.
 Subsequence: The subset of the sequence is called a subsequence. Suppose {a, b, g, q, y, e,
c} is a sequence. The subsequence of this can be {a, b, c} or {y, e}. Observe that the
subsequence is not necessarily consecutive items of the sequence. From the sequences of
databases, subsequences are found from which the generalized sequence patterns are found at
the end.
 Sequence pattern: A sub-sequence is called a pattern when it is found in multiple sequences.
The goal of the GSP algorithm is to mine the sequence patterns from the large database. The
database consists of the sequences. A subsequence becomes a pattern when its frequency is equal
to or greater than the "support" value. For example, the pattern <a, b> is a sequence pattern mined
from the sequences {b, x, c, a}, {a, b, q}, and {a, u, b}.
Methods for Sequential Pattern Mining:
 Apriori-based Approaches
 GSP
 SPADE
 Pattern-Growth-based Approaches
 FreeSpan
 PrefixSpan
Sequence Database: A database that consists of ordered elements or events is called a sequence
database. Example of a sequence database:
S.No.   SID    Sequence
1.      100    <a(ab)(ac)d(cef)> or <a{ab}{ac}d{cef}>
2.      200    <(ad)c(bcd)(abe)>
3.      300    <(ef)(ab)(def)cb>
4.      400    <eg(adf)cbc>
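For illustration, the sequence database above can be held in a program as an ordered list of elements per sequence. The Python sketch below is an assumption made only for illustration (it is not part of GSP itself): each sequence is a list of elements, and each element is a frozenset of items, so item order within an element does not matter while the order of elements is preserved.

```python
# A hypothetical in-memory representation of the sequence database above:
# each sequence is a list of elements (transactions), and each element is a
# frozenset of items (frozenset("ad") is the set {'a', 'd'}).
sequence_db = {
    100: [frozenset("a"), frozenset("ab"), frozenset("ac"), frozenset("d"), frozenset("cef")],
    200: [frozenset("ad"), frozenset("c"), frozenset("bcd"), frozenset("abe")],
    300: [frozenset("ef"), frozenset("ab"), frozenset("def"), frozenset("c"), frozenset("b")],
    400: [frozenset("e"), frozenset("g"), frozenset("adf"), frozenset("c"), frozenset("b"), frozenset("c")],
}
```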
Transaction: A sequence consists of many elements, which are also called transactions.
<a(ab)(ac)d(cef)> is a sequence, whereas (a), (ab), (ac),
(d) and (cef) are the elements of the sequence.
These elements are sometimes referred to as transactions.
An element may contain a set of items. Items within an element are unordered, and we list them
alphabetically.
For example, (cef) is an element consisting of the 3 items c, e and f.
Since all three items belong to the same element, their order does not matter, but we prefer to put
them in alphabetical order for convenience.
The order of the elements of a sequence matters, unlike the order of items within the same transaction.
k-length Sequence:
The number of items involved in a sequence is denoted by k. A sequence of 2 items is called a
2-length sequence; this term comes into use while finding the 2-length candidate sequences.
Examples of 2-length sequences are {ab}, {(ab)}, {bc} and {(bc)}.
 {bc} denotes a 2-length sequence where b and c are two different transactions. This can also
be written as {(b)(c)}
 {(bc)} denotes a 2-length sequence where b and c are the items belonging to the same
transaction, therefore enclosed in the same parenthesis. This can also be written as {(cb)},
because the order of items in the same transaction does not matter.
Support in k-length Sequences:
Support means frequency. The support of a given k-length sequence is the number of sequences in
the sequence database that contain it. While finding the support, the order of the elements is taken
into account.
Illustration:
Suppose we have 2 sequences in the database.
s1: <a(bc)b(cd)>
s2: <b(ab)abc(de)>
We need to find the support of {ab} and {(bc)}
Finding the support of {ab}, i.e., <(a)(b)>:
It is present in the first sequence:
s1: <a(bc)b(cd)>
Since a and b belong to different elements, their order matters: a occurs in the first element and b
occurs in a later element.
It is also present in the second sequence:
s2: <b(ab)abc(de)>
Here the a inside the element (ab) occurs before a later element containing b, so <(a)(b)> is
contained in s2 as well. Note that the leading b of s2 cannot be used, because it occurs before every a.
Hence, the support of {ab} is 2.
Finding the support of {(bc)}:
Since b and c must be present in the same element, their order does not matter.
s1: <a(bc)b(cd)>, the element (bc) gives an occurrence.
s2: <b(ab)abc(de)>, this may look like an occurrence, but it is not: b and c are present in different
elements here, so we do not count it.
Hence, the support of {(bc)} is 1.
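The containment test and support count described above can be sketched in code. The Python fragment below is an illustrative sketch (not taken from any particular library): it checks whether a candidate sequence is contained in a data sequence, counts support over the database, and reproduces the illustration above, giving support 2 for {ab} and 1 for {(bc)}.

```python
def contains(data_seq, candidate):
    """Return True if candidate (a list of frozensets) is a subsequence of
    data_seq (also a list of frozensets), preserving the order of elements."""
    pos = 0
    for element in candidate:
        # advance through the data sequence until an element covers this one
        while pos < len(data_seq) and not element <= data_seq[pos]:
            pos += 1
        if pos == len(data_seq):
            return False
        pos += 1                      # the next candidate element must come later
    return True

def support(db, candidate):
    """Number of data sequences in db that contain the candidate."""
    return sum(contains(seq, candidate) for seq in db)

db = [
    [frozenset("a"), frozenset("bc"), frozenset("b"), frozenset("cd")],   # s1
    [frozenset("b"), frozenset("ab"), frozenset("a"), frozenset("b"),
     frozenset("c"), frozenset("de")],                                    # s2
]

print(support(db, [frozenset("a"), frozenset("b")]))   # {ab}   -> 2
print(support(db, [frozenset("bc")]))                  # {(bc)} -> 1
```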
How to join L(k-1) with L(k-1) to give Ck?
L(k-1) is the final set of frequent (k-1)-length sequences after pruning; every entry left in the set
has support greater than or equal to the threshold. Two sequences s1 and s2 are joinable if
removing the first item of s1 and the last item of s2 leaves the same sequence. The cases below
join 2-length sequences to produce candidate 3-length sequences.
Case 1: Join {ab} and {ac}
s1: {ab}, s2: {ac}
After removing a from s1 and c from s2:
s1' = {b}, s2' = {a}
s1' and s2' are not the same, so s1 and s2 cannot be joined.
Case 2: Join {ab} and {be}
s1: {ab}, s2: {be}
After removing a from s1 and e from s2:
s1' = {b}, s2' = {b}
s1' and s2' are exactly the same, so s1 and s2 can be joined:
s1 + s2 = {abe}
Case 3: Join {(ab)} and {be}
s1: {(ab)}, s2: {be}
After removing a from s1 and e from s2:
s1' = {(b)}, s2' = {(b)}
s1' and s2' are exactly the same, so s1 and s2 can be joined:
s1 + s2 = {(ab)e}
s1 and s2 are joined in such a way that items are placed in the correct elements (transactions).
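The join step can be sketched programmatically. The fragment below is an illustrative Python sketch of the rule just described; here each element is represented as an alphabetically sorted tuple of items (rather than a frozenset) so that "first item" and "last item" are well defined. The helper names are assumptions made for this sketch.

```python
def drop_first(seq):
    """Remove the first item of a sequence (items inside an element are kept
    in alphabetical order, as in the text)."""
    head, rest = seq[0], seq[1:]
    return ([head[1:]] if len(head) > 1 else []) + rest

def drop_last(seq):
    """Remove the last item of a sequence."""
    rest, tail = seq[:-1], seq[-1]
    return rest + ([tail[:-1]] if len(tail) > 1 else [])

def join(s1, s2):
    """Join s1 and s2 when dropping the first item of s1 and the last item of
    s2 leaves the same sequence.  The last item of s2 is then appended to s1:
    as a new element if it formed an element of its own in s2, otherwise
    merged into the last element of s1."""
    if drop_first(s1) != drop_last(s2):
        return None
    last_item = s2[-1][-1]
    if len(s2[-1]) == 1:                      # last item was its own element
        return s1 + [(last_item,)]
    return s1[:-1] + [tuple(sorted(s1[-1] + (last_item,)))]

# Reproducing the three cases from the text:
print(join([('a',), ('b',)], [('a',), ('c',)]))   # Case 1 -> None (not joinable)
print(join([('a',), ('b',)], [('b',), ('e',)]))   # Case 2 -> [('a',), ('b',), ('e',)]  i.e. {abe}
print(join([('a', 'b')],     [('b',), ('e',)]))   # Case 3 -> [('a', 'b'), ('e',)]      i.e. {(ab)e}
```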
Pruning Phase: While building Ck (the candidate set of k-length sequences), we delete any
candidate sequence that has a contiguous (k-1)-subsequence whose support count is less than the
minimum support (threshold). In the absence of gap constraints, we may also delete a candidate
sequence that has any (k-1)-subsequence without minimum support.
For example, suppose {abg} is a candidate sequence in C3.
To check whether {abg} is a proper candidate, we do not compute its support directly; instead we
check the support of its subsequences. Because the candidate sets are built incrementally
(1-length, 2-length, and so on), the subsequences of a 3-length sequence are 1- and 2-length
sequences whose supports are already known.
The 2-length subsequences of {abg} are {ab}, {bg} and {ag}.
We check the support of all three. If any of them has support less than the minimum support, we
delete the sequence {abg} from the set C3, otherwise we keep it.
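A minimal sketch of this pruning check, using the same tuple-based representation as the join sketch above, is given below. The (k-1)-subsequences are generated here by deleting one item at a time; this is an illustrative simplification of the rule for the case without gap constraints, not the exact contiguous-subsequence test GSP uses when gaps are constrained.

```python
def sub_sequences(seq):
    """All (k-1)-subsequences obtained by deleting exactly one item."""
    subs = []
    for i, elem in enumerate(seq):
        for j in range(len(elem)):
            new_elem = elem[:j] + elem[j + 1:]
            new_seq = seq[:i] + ([new_elem] if new_elem else []) + seq[i + 1:]
            subs.append(new_seq)
    return subs

def prune(candidates, previous_level):
    """Keep only candidates all of whose (k-1)-subsequences are frequent."""
    frequent = {tuple(s) for s in previous_level}
    return [c for c in candidates
            if all(tuple(s) in frequent for s in sub_sequences(c))]

# {abg} is kept only if {ab}, {ag} and {bg} are all in L2.
L2 = [[('a',), ('b',)], [('a',), ('g',)], [('b',), ('g',)]]
C3 = [[('a',), ('b',), ('g',)]]
print(prune(C3, L2))   # kept, since all three 2-length subsequences are frequent
```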
SPADE:
SPADE (Sequential PAttern Discovery using Equivalence classes) mines the same frequent
sequences as GSP but works on a vertical database layout: each item is associated with an id-list of
(sequence-id, position) pairs, and longer patterns are obtained by joining the id-lists of their prefixes.
The search space is decomposed into prefix-based equivalence classes that can be processed
independently in memory, so the database is usually scanned only a few times.
SPIRIT Episode Discovery:
SPIRIT (Sequential Pattern mining with Regular expressIon consTraints) is a family of algorithms
that push user-specified regular-expression constraints into the mining process, so that only
sequential patterns matching the constraint are generated and counted. Episode discovery, in
contrast, looks for frequent episodes, i.e., collections of events that occur close together within a
time window in a single long event sequence, and is widely used for analyzing alarm and event logs.
Time Series Analysis:
Time series analysis is a specific way of analyzing a sequence of data points collected over an
interval of time. In time series analysis, analysts record data points at consistent intervals over a set
period of time rather than just recording the data points intermittently or randomly. However, this
type of analysis is not merely the act of collecting data over time.

What sets time series data apart from other data is that the analysis can show how variables change
over time. In other words, time is a crucial variable because it shows how the data adjusts over the
course of the data points as well as the final results. It provides an additional source of information
and a set order of dependencies between the data.

Time series analysis typically requires a large number of data points to ensure consistency and
reliability. An extensive data set ensures you have a representative sample size and that analysis
can cut through noisy data. It also ensures that any trends or patterns discovered are not outliers
and can account for seasonal variance. Additionally, time series data can be used for forecasting,
that is, predicting future data based on historical data.
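As a small illustration of forecasting from historical data, the sketch below (assuming pandas is available, with made-up monthly sales figures) computes a simple moving average and uses it as a naive forecast for the next period; real applications would typically use models such as ARIMA or exponential smoothing.

```python
import pandas as pd

# Made-up monthly sales recorded at consistent intervals.
sales = pd.Series(
    [100, 110, 120, 130, 125, 140, 150, 160, 155, 170, 180, 190],
    index=pd.date_range("2023-01-01", periods=12, freq="MS"),
)

# A 3-month moving average smooths out short-term noise and reveals the trend.
moving_avg = sales.rolling(window=3).mean()

# Naive forecast for the next month: the latest moving-average value.
print(moving_avg.tail(3))
print("forecast for next month:", moving_avg.iloc[-1])
```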

Time series analysis examples

Time series analysis is used for non-stationary data—things that are constantly fluctuating over
time or are affected by time. Industries like finance, retail, and economics frequently use time
series analysis because currency and sales are always changing. Stock market analysis is an
excellent example of time series analysis in action, especially with automated trading algorithms.
Likewise, time series analysis is ideal for forecasting weather changes, helping meteorologists
predict everything from tomorrow's weather report to future years of climate change. Examples of
time series analysis in action include:

 Weather data
 Rainfall measurements
 Temperature readings
 Heart rate monitoring (EKG)
 Brain monitoring (EEG)
 Quarterly sales
 Stock prices
 Automated stock trading
 Industry forecasts
 Interest rates
Spatial Mining:

Spatial Data Mining is the process of discovering interesting and useful patterns and spatial
relationships that were not explicitly stored in spatial databases. In spatial data mining, analysts use
geographical or spatial information to produce business intelligence or other results. Challenges
involved in spatial data mining include identifying patterns and finding objects relevant to the
research project.

The general tools used for spatial data mining are Clementine, See5/C5.0, and Enterprise Miner.
These tools are preferable for analyzing scientific and engineering data, astronomical data,
multimedia data, genomic data, and web data.

Spatial data must contain:

 Latitude and longitude information.
 UTM easting or northing coordinates.
 Other coordinates that denote a point's location in space and help in identifying the location.

Apart from this information, spatial data may contain various attributes that help identify a
geographical location and its characteristics.
Spatial Mining Tasks:
Typical spatial data mining tasks include spatial classification, spatial association rule mining,
spatial trend and outlier analysis, and spatial clustering, which is described next.
Spatial Clustering:
Spatial clustering aims to partition spatial data into a series of meaningful subclasses, called spatial
clusters, such that spatial objects in the same cluster are similar to each other, and are dissimilar to
those in different clusters.
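As an illustration, density-based algorithms such as DBSCAN are often used for spatial clustering because they find arbitrarily shaped clusters and isolate noise points. The sketch below assumes scikit-learn is available; the latitude/longitude coordinates are made-up sample points, and the eps value is chosen only for this toy example.

```python
import numpy as np
from sklearn.cluster import DBSCAN

# Made-up (latitude, longitude) points: two dense groups plus one stray point.
points = np.array([
    [17.44, 78.35], [17.45, 78.36], [17.44, 78.37],   # group 1
    [12.97, 77.59], [12.98, 77.60], [12.96, 77.58],   # group 2
    [28.61, 77.21],                                    # isolated point
])

# eps is the neighbourhood radius (in degrees here, only for illustration);
# real applications usually convert to metres or use the haversine metric.
labels = DBSCAN(eps=0.1, min_samples=2).fit_predict(points)
print(labels)   # e.g. [0 0 0 1 1 1 -1]; -1 marks a noise point
```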

Data Mining for Retail and Telecommunication industries:


Data Mining plays a major role in segregating useful data from a heap of big data. By analyzing
the patterns and peculiarities, it enables us to find the relationship between data sets. When the
unprocessed raw data is processed into useful information, it can be applied to enhance the
growth of many fields we depend on in our day-to-day life.
This section describes the role of data mining in the retail and telecommunication industries.
Role of data mining in retail industries
In the dynamic and fast-growing retail industry, the consumption of goods increases day by day,
which in turn increases the data collected and used. The retail industry covers the sale of goods to
customers through retailers, from a local booth in the street to the big malls in cities. For example,
a grocery shop owner in a given area gets to know their customers' details after a few months of
sales; once the owner notes what the customers need, it becomes easy to enhance sales. The same
happens in large retail chains: they collect customers' responses to a product, the time zone, their
location, shopping-cart history, and so on. Preferences for brands and products help the company
create targeted advertisements to increase sales and profit.

Knowing the customers:

What is the purpose of sales if the retailer doesn't know who their customers are? Retailers
definitely need to understand their customers, and this starts by analyzing them along various
factors. Finding the source through which a customer got to know about the retailing platform
helps in tuning the retailer's advertising to attract a completely new set of people. Finding the days
on which customers purchase most frequently can guide discount sales or special promotions on
festival days. The time customers spend per order yields useful statistics for enhancing growth.
The amount of money spent per order helps the retailer separate the customer base into groups of
high-paid, medium-paid, and low-paid orders; this sharpens customer targeting and helps in
introducing customized packages based on price. By knowing language and payment-method
preferences, retailers can provide the services needed to satisfy their customers. Managing a good
business relationship with the customer builds the trust and loyalty that bring steady profit to the
retailer, and retaining customers helps the company withstand competition from similar companies.

RFM Value:

RFM stands for Recency, Frequency, Monetary value. Recency is how recently the customer made
a purchase, Frequency is how often purchases take place, and Monetary value is the amount the
customer spends on purchases. RFM can increase monetization by holding on to regular and
potential customers and keeping them happy with satisfying results, and it can also help win back
trailing customers whose purchases are declining. The higher the RFM score, the greater the sales
growth. RFM also prevents sending excessive requests to already engaged customers and helps
apply new marketing techniques to low-ordering customers. In this way, RFM helps in identifying
innovative solutions.
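A common way to compute RFM in practice is to aggregate a transaction table by customer and score each of the three measures. The snippet below is a sketch assuming pandas is available; the column names (customer_id, order_date, amount) and the 1-to-3 scoring scheme are assumptions made for illustration.

```python
import pandas as pd

# Made-up transaction data: one row per order.
orders = pd.DataFrame({
    "customer_id": [1, 1, 2, 2, 2, 3],
    "order_date":  pd.to_datetime(["2024-01-05", "2024-03-01", "2024-02-10",
                                   "2024-02-20", "2024-03-05", "2023-11-30"]),
    "amount":      [50.0, 20.0, 35.0, 60.0, 15.0, 200.0],
})

today = pd.Timestamp("2024-03-10")
rfm = orders.groupby("customer_id").agg(
    recency=("order_date", lambda d: (today - d.max()).days),  # days since last order
    frequency=("order_date", "count"),                          # number of orders
    monetary=("amount", "sum"),                                  # total spend
)

# Score each measure from 1 (worst) to 3 (best) using rank-based binning.
rfm["R"] = pd.qcut(rfm["recency"].rank(method="first"), 3, labels=[3, 2, 1]).astype(int)
rfm["F"] = pd.qcut(rfm["frequency"].rank(method="first"), 3, labels=[1, 2, 3]).astype(int)
rfm["M"] = pd.qcut(rfm["monetary"].rank(method="first"), 3, labels=[1, 2, 3]).astype(int)
rfm["RFM_score"] = rfm["R"] + rfm["F"] + rfm["M"]
print(rfm)
```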

Market basket analysis:

Market basket analysis is a technique used to study and analyze the shopping patterns of a
customer in order to increase revenue and sales. It is done by analyzing a particular customer's
transaction data sets: their shopping history, frequently bought items, and items bought together as
a combination.
A very good example is the loyalty card issued by a retailer to its customers. From the customer's
point of view, the card is a way to keep track of future discounts, incentive criteria, and the history
of transactions. From the retailer's point of view, however, the same loyalty card quietly collects
the transaction details on which market basket analysis is run.
This analysis can be carried out with data science techniques or various algorithms, and even
without deep technical skills: the Microsoft Excel platform can be used to analyze customer
purchases and frequently bought or frequently grouped items, with the spreadsheets organized by
transaction ID. The analysis helps in suggesting products that may pair well with a customer's
current purchase, which leads to cross-selling and improved profits. It also helps track the purchase
rate per month or year, and it shows the retailer the right time to make offers so that the right
customers are attracted to the targeted products.
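For illustration, the core of a simple market basket analysis is counting how often pairs of items are bought together. The sketch below (plain Python, with made-up baskets) counts pair frequencies, which is the starting point for computing the support and confidence of association rules.

```python
from itertools import combinations
from collections import Counter

# Made-up transactions: each basket is the set of items bought together.
baskets = [
    {"bread", "milk", "butter"},
    {"bread", "butter"},
    {"milk", "eggs"},
    {"bread", "milk", "eggs", "butter"},
]

pair_counts = Counter()
for basket in baskets:
    for pair in combinations(sorted(basket), 2):
        pair_counts[pair] += 1

# Pairs bought together most often suggest cross-selling opportunities.
for pair, count in pair_counts.most_common(3):
    support = count / len(baskets)
    print(pair, f"count={count}", f"support={support:.2f}")
```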

Potent sales campaign:

Everything nowadays needs advertising, because advertising helps people know about a product's
existence, use, and features; it takes the product from the warehouse to the real world. If
advertising has to attract the right customers, the data must be analyzed, and this is where the sales
or marketing campaigns performed by retailers come in. Marketing campaigns must be initiated
with the right plans, otherwise the company may suffer losses by over-investing in untargeted
advertisements. A sales campaign depends on the time, location, and preferences of the customer,
and the platform on which the campaign takes place also plays a major role in pulling the right
customers in. This requires regular analysis of the sales, and of the associated data, on a particular
platform at a certain time. Traffic on social and network platforms indicates whether the
campaigned product is being favored or not. The retailer can adjust the campaign using previous
statistics, which rapidly increases sales profit and prevents overspending. Learning about both
customer value and company profit improves how campaigns are used, and the number of sales per
campaign can guide the retailer on whether to keep investing in it. A trial-and-error method can
thus be converted into a well-informed method through efficient handling of data. A multi-channel
sales campaign also helps to analyze purchases and increases revenue, profit, and the number of
customers.
Role of data mining in telecommunication industries
In a highly evolving and competitive environment, the telecommunication industry plays a major
role in handling huge data sets of customer, network, and call data. To thrive in such an
environment, the telecommunication industry must find a way to handle this data easily. Data
mining is preferred for enhancing the business and solving problems in this industry. Its major
functions include fraudulent call identification and spotting defects in a network so that faults can
be isolated. Data mining can also enable effective marketing techniques. However, this industry
faces challenges in dealing with the scale and time aspects of data mining, which calls for the
ability to detect rare events in telecommunication data, such as network faults or customer fraud,
in real time.

Call detail data:

Whenever a call is made on the telecommunication network, the details of the call are recorded:
the date and instant of time at which it starts, the duration of the call, and the time when it ends.
Since all the data about a call is collected in real time, it is ready to be processed with data mining
techniques. However, the data should be aggregated at the customer level, not at the level of
isolated individual phone calls. By efficiently extracting and aggregating the data in this way, one
can find each customer's calling pattern.
Some of the data that help to find the pattern are:
 Average duration of calls
 Time at which the call took place (daytime/night-time)
 Average number of calls on weekdays
 Calls generated to varied area codes
 Calls generated per day, etc.
By studying these customer call details, one can improve business growth. If a customer makes
more calls during daytime working hours, they can be identified as part of a business firm; if the
night-time call rate is high, the connection is probably used only for residential or domestic
purposes. Frequent variation in the area codes called also helps to single out business calls,
because people calling for residential purposes tend to call a limited set of area codes over a
period. Data collected in the evening hours, however, cannot tell exactly whether the customer
belongs to a business or a residential household.
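As a sketch of how such calling patterns can be derived from raw call detail records, the following fragment (assuming pandas is available, with made-up column names such as caller, start_time and duration_sec) aggregates per-customer features such as average call duration and the share of daytime calls.

```python
import pandas as pd

# Made-up call detail records (CDRs): one row per call.
cdr = pd.DataFrame({
    "caller":       ["C1", "C1", "C1", "C2", "C2"],
    "start_time":   pd.to_datetime(["2024-03-01 10:15", "2024-03-01 14:40",
                                    "2024-03-02 11:05", "2024-03-01 21:30",
                                    "2024-03-02 23:10"]),
    "duration_sec": [300, 120, 480, 60, 900],
    "area_code":    ["040", "044", "080", "040", "040"],
})

cdr["daytime"] = cdr["start_time"].dt.hour.between(8, 18)   # 8:00-18:59 counted as daytime

patterns = cdr.groupby("caller").agg(
    avg_duration=("duration_sec", "mean"),
    total_calls=("duration_sec", "count"),
    daytime_share=("daytime", "mean"),              # fraction of calls during working hours
    distinct_area_codes=("area_code", "nunique"),
)
print(patterns)   # a high daytime_share and many area codes suggest a business customer
```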

Data of customers:

When it comes to the telecommunication industry, there is an enormous number of customers.
This customer database is maintained for any further queries in the data mining process. For
example, when a customer fraud case is encountered, these customer details help identify the
person, using fields in the customer database such as name and address, making it easy to trace
them and resolve the issue. This data set can also be enriched from external sources, because much
of the information is commonly available. It also includes the plan chosen for subscription and the
payment history. By using this data set, we can escalate growth in the telecommunication industry.
Network Data:

Due to the well-developed, complex equipment used in telecommunication networks, every part of
the system may generate errors and status messages, which leads to a large amount of network data
being processed. This data must be separated, grouped, and stored in an orderly way so that
network faults can be isolated when any part of the system fails. This ensures that the error or
status message from any part of the network reaches a technical specialist who can rectify it. Since
the database is enormous, when a large number of status or error messages are generated it
becomes difficult to solve the problems manually, so the handling of some classes of errors and
messages can be automated to reduce the strain. A methodical data mining approach can manage
the network system efficiently and enhance its functioning.
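As a small illustration of such automation, the sketch below (plain Python with made-up log entries and a made-up threshold) groups status and error messages by network element and severity, so that only elements with repeated critical alarms are escalated to a specialist.

```python
from collections import Counter

# Made-up alarm log entries: (network_element, severity, message)
alarms = [
    ("switch-7", "CRITICAL", "link down"),
    ("switch-7", "CRITICAL", "link down"),
    ("switch-7", "WARNING",  "high temperature"),
    ("router-2", "INFO",     "config reloaded"),
    ("router-2", "CRITICAL", "packet loss threshold exceeded"),
]

critical_counts = Counter(elem for elem, severity, _ in alarms if severity == "CRITICAL")

# Escalate only elements that crossed a (made-up) threshold of 2 critical alarms.
for element, count in critical_counts.items():
    if count >= 2:
        print(f"escalate {element}: {count} critical alarms")
```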

Preparing and clustering data:

Even though raw data is processed in data mining, it must be in a well-understood and properly
arranged format before processing, and in the telecommunication industry, which deals with giant
databases, this is an important need. First, clashing and contradictory data must be identified to
avoid inconsistency, and undesired data fields that merely occupy space must be removed. The
data must be organized and mapped by finding the relationships between data sets to avoid
redundancy. Clustering, i.e., grouping similar data, can then be done with data mining algorithms;
it helps in analyzing patterns such as calling patterns or customer behavior patterns. Groups are
formed by analyzing the similarities between records. By doing this, the data can easily be
understood, which leads to easy manipulation and use.
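For example, once per-customer calling features have been prepared (such as the patterns table sketched earlier), a clustering algorithm can group similar customers. The fragment below assumes scikit-learn is available and uses made-up feature values; k-means is used purely as an illustration of the idea.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.preprocessing import StandardScaler

# Made-up per-customer features: [avg call duration (sec), calls per day, daytime share]
features = np.array([
    [300, 12, 0.90],
    [280, 10, 0.85],
    [60,   3, 0.20],
    [75,   4, 0.25],
    [500, 20, 0.95],
])

# Scale the features so that no single attribute dominates the distance measure.
scaled = StandardScaler().fit_transform(features)
labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(scaled)
print(labels)   # e.g. business-like vs. residential-like calling patterns
```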

Customer profiling:

The telecommunication industry deals with customer details on a large scale. It starts by observing
customer patterns from call data in order to profile the customers and predict future trends. By
knowing a customer's pattern, the company can decide which promotion methods to offer. If a
customer's calls stay within one area code, a promotion targeted at that area can gain a whole
group of customers. This makes promotion techniques cost-effective: instead of investing in a
single subscriber, the company can attract a group of people with the right plan. Privacy issues
arise, however, when a customer's call history or details are monitored.
One of the significant problems the telecommunication industry faces is customer churn, also
called customer turnover, in which the company loses its clients: a client leaves and switches to
another telecommunication company. If the churn rate is high, the company experiences a severe
loss of revenue and profit, which leads to a decline in growth. This issue can be addressed with
data mining techniques that collect customer patterns and profile the customers. Incentive offers
provided by companies attract the regular users of other companies. By profiling the data,
customer churn can be effectively forecast from behavior such as subscription history and the plan
chosen. While collecting data from paying customers, it is also possible to collect data about the
call receivers, who may not be customers, but only under a set of restrictions.
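A minimal sketch of churn prediction from such customer profiles might look as follows; it assumes scikit-learn is available, and the features (tenure in months, monthly spend, number of complaints) and labels are made up for illustration.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Made-up customer profiles: [tenure_months, monthly_spend, complaints]
X = np.array([
    [36, 40.0, 0],
    [24, 35.0, 1],
    [3,  20.0, 4],
    [6,  25.0, 3],
    [48, 55.0, 0],
    [2,  18.0, 5],
])
y = np.array([0, 0, 1, 1, 0, 1])   # 1 = churned, 0 = stayed

model = LogisticRegression().fit(X, y)

# Score a new customer: a high probability flags a churn risk worth acting on.
new_customer = np.array([[4, 22.0, 3]])
print(model.predict_proba(new_customer)[0, 1])
```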
Fraud detection:

Fraud is a critical problem for telecommunication industries; it causes loss of revenue and
deterioration in customer relations. The two major fraud activities involved
are subscription fraud and superimposed fraud. Subscription fraud involves collecting the
details of customers, mostly from KYC (Know Your Customer) documents such as name,
address, and ID proof, and using them to sign up for telecom services with apparently
authentic approval but without any intention of paying for the services used on the
account. Some offenders do not stop at the illegitimate use of services but also perform bypass
fraud, diverting voice traffic between local and international routes, which causes destructive
losses to the telecommunication company. Superimposed fraud starts with a legitimate account
and legal activity, but later some other person illegally uses the services on top of the genuine
account holder's usage. By collecting the behavioral pattern of the account holder, suspected
superimposed fraudulent activity can trigger immediate actions, such as blocking or deactivating
the account, which prevents further damage to the company.
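One simple way to flag behavior that departs from an account holder's usual pattern is a z-score check on daily usage. The fragment below (plain Python with made-up daily call minutes and a made-up threshold) is only a sketch of the idea; production fraud-detection systems use far richer features and models.

```python
import statistics

# Made-up daily call minutes for one account; the last value is today's usage.
daily_minutes = [32, 28, 35, 30, 27, 33, 31, 29, 34, 30, 28, 32, 31, 260]

baseline = daily_minutes[:-1]          # history used to model "normal" behaviour
mean = statistics.mean(baseline)
stdev = statistics.stdev(baseline)

today = daily_minutes[-1]
z_score = (today - mean) / stdev
if z_score > 3:                        # made-up threshold
    print(f"possible superimposed fraud: today's usage is {z_score:.1f} std devs above normal")
```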