CSC 425 Data Mining and Warehousing 2024
By
DR. OLADIPO, ID
Department of Computer Science,
University of Ilorin, Ilorin.
LESSON ONE:
There is a huge amount of data available in the Information Industry. This data is of no use until it
is converted into useful information. It is necessary to analyze this huge amount of data and extract
useful information from it.
Extraction of information is not the only process we need to perform; data mining also involves
other processes such as Data Cleaning, Data Integration, Data Transformation, Data Mining,
Pattern Evaluation and Data Presentation. Once all these processes are over, we would be able to
use this information in many applications such as Fraud Detection, Market Analysis, Production
Control, Science Exploration, etc.
Data Mining is defined as extracting information from huge sets of data. In other words, we can
say that data mining is the procedure of mining knowledge from data. The information or
knowledge extracted in this way can be used for any of the following applications −
• Market Analysis
• Fraud Detection
• Customer Retention
• Production Control
• Science Exploration
Apart from these, data mining can also be used in areas such as sports, astrology, and Internet
Web surf-aid.
Listed below are the various fields of market where data mining is used −
• Customer Profiling − Data mining helps determine what kind of people buy what kind of
products.
• Identifying Customer Requirements − Data mining helps in identifying the best products
for different customers. It uses prediction to find the factors that may attract new customers.
• Cross Market Analysis − Data mining performs association/correlation analysis between
product sales.
• Target Marketing − Data mining helps to find clusters of model customers who share the
same characteristics such as interests, spending habits, income, etc.
• Determining Customer Purchasing Patterns − Data mining helps in determining
customer purchasing patterns.
• Providing Summary Information − Data mining provides us various multidimensional
summary reports.
• Finance Planning and Asset Evaluation − It involves cash flow analysis and prediction,
contingent claim analysis to evaluate assets.
• Resource Planning − It involves summarizing and comparing the resources and spending.
• Competition − It involves monitoring competitors and market directions.
Data mining is also used in the fields of credit card services and telecommunication to detect
fraud. For fraudulent telephone calls, it helps to find the destination of the call, the duration of the
call, the time of the day or week, etc. It also analyzes patterns that deviate from expected norms.
Data mining deals with the kinds of patterns that can be mined. On the basis of the kind of patterns
to be mined, there are two categories of functions involved in data mining −
• Descriptive
• Classification and Prediction
The descriptive function deals with the general properties of data in the database. Here is the list
of descriptive functions −
• Class/Concept Description
• Mining of Frequent Patterns
• Mining of Associations
• Mining of Correlations
• Mining of Clusters
Class/Concept Description
Class/Concept refers to the data to be associated with the classes or concepts. For example, in a
company, the classes of items for sales include computer and printers, and concepts of customers
include big spenders and budget spenders. Such descriptions of a class or a concept are called
class/concept descriptions. These descriptions can be derived by the following two ways −
• Data Characterization − This refers to summarizing the data of the class under study. This
class under study is called the Target Class.
• Data Discrimination − This refers to comparing the target class with one or more
predefined contrasting groups or classes.
Mining of Frequent Patterns
Frequent patterns are those patterns that occur frequently in transactional data. The kinds of
frequent patterns include −
• Frequent Item Set − It refers to a set of items that frequently appear together, for example,
milk and bread.
• Frequent Subsequence − A sequence of patterns that occurs frequently, such as purchasing
a camera being followed by purchasing a memory card.
• Frequent Substructure − Substructure refers to different structural forms, such as
graphs, trees, or lattices, which may be combined with itemsets or subsequences.
Mining of Association
Associations are used in retail sales to identify items that are frequently purchased together.
Association mining refers to the process of uncovering the relationships among data and
determining association rules.
For example, a retailer might generate an association rule showing that 70% of the time milk is
sold with bread, and only 30% of the time biscuits are sold with bread.
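To make support and confidence concrete, here is a minimal Python sketch (not part of the original notes; the transactions and numbers are invented for illustration) that computes both measures for the rule bread ⇒ milk:

# Illustrative only: tiny hand-made transaction data, not real sales records.
transactions = [
    {"bread", "milk"},
    {"bread", "milk", "butter"},
    {"bread", "biscuits"},
    {"milk", "butter"},
    {"bread", "milk", "biscuits"},
]

def support(itemset, transactions):
    """Fraction of transactions that contain every item in the itemset."""
    hits = sum(1 for t in transactions if itemset <= t)
    return hits / len(transactions)

def confidence(antecedent, consequent, transactions):
    """Support of (antecedent U consequent) divided by support of the antecedent."""
    return support(antecedent | consequent, transactions) / support(antecedent, transactions)

# Rule: bread => milk
print("support(bread => milk):", support({"bread", "milk"}, transactions))
print("confidence(bread => milk):", confidence({"bread"}, {"milk"}, transactions))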
Mining of Correlations
Correlation mining is a kind of additional analysis performed to uncover interesting statistical
correlations between associated attribute–value pairs or between two itemsets, in order to analyze
whether they have a positive, negative, or no effect on each other.
Mining of Clusters
Cluster refers to a group of similar kind of objects. Cluster analysis refers to forming group of
objects that are very similar to each other but are highly different from the objects in other clusters.
Classification and Prediction
Classification is the process of finding a model that describes the data classes or concepts. The
purpose is to be able to use this model to predict the class of objects whose class label is unknown.
This derived model is based on the analysis of sets of training data. The functions in this category
include the following −
• Classification − It predicts the class of objects whose class label is unknown. Its objective
is to find a derived model that describes and distinguishes data classes or concepts. The
derived model is based on the analysis of a set of training data, i.e., data objects whose class
labels are well known.
• Prediction − It is used to predict missing or unavailable numerical data values rather than
class labels. Regression Analysis is generally used for prediction. Prediction can also be
used for identification of distribution trends based on available data.
• Outlier Analysis − Outliers may be defined as the data objects that do not comply with the
general behavior or model of the data available.
• Evolution Analysis − Evolution analysis refers to the description and modeling of regularities
or trends for objects whose behavior changes over time.
• We can specify a data mining task in the form of a data mining query.
• This query is input to the system.
• A data mining query is defined in terms of data mining task primitives.
Note − These primitives allow us to communicate in an interactive manner with the data mining
system. Here is the list of Data Mining Task Primitives −
Set of Task-Relevant Data to Be Mined
This is the portion of the database in which the user is interested. This portion includes the following −
• Database Attributes
• Data Warehouse dimensions of interest
Kind of Knowledge to Be Mined
This refers to the data mining functions to be performed. These functions include −
• Characterization
• Discrimination
• Association and Correlation Analysis
• Classification
• Prediction
• Clustering
• Outlier Analysis
• Evolution Analysis
Background Knowledge
The background knowledge allows data to be mined at multiple levels of abstraction. For example,
concept hierarchies are one kind of background knowledge that allows data to be mined at
multiple levels of abstraction.
Interestingness Measures and Thresholds for Pattern Evaluation
This is used to evaluate the patterns that are discovered by the process of knowledge discovery.
There are different interestingness measures for different kinds of knowledge.
Representation for Visualizing the Discovered Patterns
This refers to the form in which discovered patterns are to be displayed. These representations may
include the following −
• Rules
• Tables
• Charts
• Graphs
• Decision Trees
• Cubes
LESSON TWO
Data mining is not an easy task, as the algorithms used can get very complex and data is not always
available in one place; it needs to be integrated from various heterogeneous data sources. These
factors also create some issues. Here we will discuss the major issues regarding −
• Efficiency and scalability of data mining algorithms − In order to effectively extract
information from huge amounts of data in databases, data mining algorithms must be
efficient and scalable.
• Parallel, distributed, and incremental mining algorithms − Factors such as the huge
size of databases, the wide distribution of data, and the complexity of data mining methods
motivate the development of parallel and distributed data mining algorithms. These
algorithms divide the data into partitions, which are further processed in a parallel fashion,
and the results from the partitions are then merged. Incremental algorithms update the
mining results when the database changes, without mining the entire data again from scratch.
• Handling of relational and complex types of data − The database may contain complex
data objects, multimedia data objects, spatial data, temporal data, etc. It is not possible for
one system to mine all these kinds of data.
• Mining information from heterogeneous databases and global information systems −
The data is available at different data sources on a LAN or WAN. These data sources may be
structured, semi-structured, or unstructured. Therefore, mining knowledge from them
adds challenges to data mining.
LESSON THREE
Data Warehouse
A data warehouse exhibits the following characteristics to support the management's decision-making process −
• Subject Oriented − It is organized around major subjects such as customers, products, and sales rather than day-to-day operations.
• Integrated − It is constructed by integrating data from multiple heterogeneous sources into a consistent format.
• Time Variant − The data is identified with a particular time period and provides a historical perspective.
• Non-volatile − Data, once loaded into the warehouse, is not removed or updated by operational transactions.
Data warehousing is the process of constructing and using the data warehouse. A data warehouse
is constructed by integrating the data from multiple heterogeneous sources. It supports analytical
reporting, structured and/or ad hoc queries, and decision making.
Data warehousing involves data cleaning, data integration, and data consolidation. To integrate
heterogeneous databases, we have the following two approaches −
• Query-Driven Approach
• Update-Driven Approach
Query-Driven Approach
This is the traditional approach to integrating heterogeneous databases. This approach is used to
build wrappers and integrators on top of multiple heterogeneous databases. These integrators are
also known as mediators.
• When a query is issued on the client side, a metadata dictionary translates the query into
queries appropriate for the individual heterogeneous sites involved.
• Now these queries are mapped and sent to the local query processor.
• The results from heterogeneous sites are integrated into a global answer set.
Disadvantages
• The Query Driven Approach needs complex integration and filtering processes.
• It is very inefficient and very expensive for frequent queries.
• This approach is expensive for queries that require aggregations.
Update-Driven Approach
Today's data warehouse systems follow the update-driven approach rather than the traditional
approach discussed earlier. In the update-driven approach, the information from multiple
heterogeneous sources is integrated in advance and stored in a warehouse. This information is
available for direct querying and analysis.
Advantages
Query processing does not require an interface with processing at local sources.
Online Analytical Mining (OLAM) integrates Online Analytical Processing (OLAP) with data
mining, so that knowledge can be mined in multidimensional databases.
1.4 Importance of OLAM
• High quality of data in data warehouses − The data mining tools are required to work
on integrated, consistent, and cleaned data. These steps are very costly in the preprocessing
of data. The data warehouses constructed by such preprocessing are valuable sources of
high quality data for OLAP and data mining as well.
• Available information processing infrastructure surrounding data warehouses −
Information processing infrastructure refers to accessing, integration, consolidation, and
transformation of multiple heterogeneous databases, web-accessing and service facilities,
reporting and OLAP analysis tools.
• OLAP−based exploratory data analysis − Exploratory data analysis is required for
effective data mining. OLAM provides facility for data mining on various subset of data
and at different levels of abstraction.
• Online selection of data mining functions − Integrating OLAP with multiple data mining
functions and online analytical mining provides users with the flexibility to select desired
data mining functions and swap data mining tasks dynamically.
LESSON FOUR
Data mining is defined as extracting information from a huge set of data. In other words, we
can say that data mining is mining knowledge from data. This information can be used for any
of the following applications −
• Market Analysis
• Fraud Detection
• Customer Retention
• Production Control
• Science Exploration
The data mining engine is essential to the data mining system. It consists of a set of functional
modules that perform the following functions −
• Characterization
• Association and Correlation Analysis
• Classification
• Prediction
• Cluster analysis
• Outlier analysis
• Evolution analysis
Knowledge Base
This is the domain knowledge. This knowledge is used to guide the search or to evaluate the
interestingness of the resulting patterns.
Some people treat data mining the same as knowledge discovery, while others view data mining as an
essential step in the process of knowledge discovery. Here is the list of steps involved in the
knowledge discovery process −
• Data Cleaning
• Data Integration
• Data Selection
• Data Transformation
• Data Mining
• Pattern Evaluation
• Knowledge Presentation
The user interface is the module of the data mining system that enables communication between users
and the data mining system. It allows users to specify data mining queries and tasks, browse database
and data warehouse schemas, evaluate mined patterns, and visualize the patterns in different forms.
Data Integration is a data preprocessing technique that merges the data from multiple
heterogeneous data sources into a coherent data store. Data integration may involve inconsistent
data and therefore needs data cleaning.
Data cleaning is a technique that is applied to remove the noisy data and correct the inconsistencies
in data. Data cleaning involves transformations to correct the wrong data. Data cleaning is
performed as a data preprocessing step while preparing the data for a data warehouse.
Data Selection is the process where data relevant to the analysis task are retrieved from the
database. Sometimes data transformation and consolidation are performed before the data selection
process.
Data Transformation is the step in which data is transformed or consolidated into forms appropriate
for mining, by performing summary or aggregation operations.
LESSON FIVE
Some people don’t differentiate data mining from knowledge discovery while others view data
mining as an essential step in the process of knowledge discovery. Here is the list of steps involved
in the knowledge discovery process −
• Data Cleaning − In this step, the noise and inconsistent data is removed.
• Data Integration − In this step, multiple data sources are combined.
• Data Selection − In this step, data relevant to the analysis task are retrieved from the
database.
• Data Transformation − In this step, data is transformed or consolidated into forms
appropriate for mining by performing summary or aggregation operations.
• Data Mining − In this step, intelligent methods are applied in order to extract data patterns.
• Pattern Evaluation − In this step, data patterns are evaluated.
• Knowledge Presentation − In this step, knowledge is represented.
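As a rough illustration of the cleaning, integration, selection, and transformation steps listed above, the following Python/pandas sketch (assuming pandas is available; the tables, column names, and values are invented) walks a toy data set through each step:

import pandas as pd

# Two toy, invented source tables standing in for heterogeneous sources.
sales = pd.DataFrame({"cust_id": [1, 2, 2, 3], "amount": [120.0, None, 80.0, 45.0]})
customers = pd.DataFrame({"cust_id": [1, 2, 3], "age": [34, 51, 28]})

# Data Cleaning: fill the missing amount with the column mean (a simple smoothing choice).
sales["amount"] = sales["amount"].fillna(sales["amount"].mean())

# Data Integration: merge the two sources into one coherent data store.
data = sales.merge(customers, on="cust_id")

# Data Selection: keep only the records relevant to the analysis task.
selected = data[data["amount"] >= 50]

# Data Transformation: aggregate (summarize) per customer into a form suitable for mining.
summary = selected.groupby("cust_id")["amount"].sum().reset_index()
print(summary)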
There is a large variety of data mining systems available. Data mining systems may integrate
techniques from the following −
• Database Technology
• Statistics
• Machine Learning
• Information Science
• Visualization
• Other Disciplines
Apart from these, a data mining system can also be classified based on the kind of (a) databases
mined, (b) knowledge mined, (c) techniques utilized, and (d) applications adapted.
We can classify a data mining system according to the kind of databases mined. Database systems
can be classified according to different criteria such as data models, types of data, etc., and the
data mining system can be classified accordingly.
For example, if we classify a database according to the data model, then we may have a relational,
transactional, object-relational, or data warehouse mining system.
We can classify a data mining system according to the kind of knowledge mined. It means the data
mining system is classified on the basis of functionalities such as −
• Characterization
• Discrimination
• Association and Correlation Analysis
• Classification
• Prediction
• Outlier Analysis
• Evolution Analysis
We can classify a data mining system according to the kind of techniques used. We can describe
these techniques according to the degree of user interaction involved or the methods of analysis
employed.
We can classify a data mining system according to the applications adapted. These applications
are as follows −
• Finance
• Telecommunications
• DNA
• Stock Markets
• E-mail
If a data mining system is not integrated with a database or a data warehouse system, then there
will be no system to communicate with. This scheme is known as the no-coupling scheme. In this
scheme, the main focus is on data mining design and on developing efficient and effective
algorithms for mining the available data sets.
(a) No Coupling − In this scheme, the data mining system does not utilize any of the database
or data warehouse functions. It fetches the data from a particular source and processes that
data using some data mining algorithms. The data mining result is stored in another file.
(b) Loose Coupling − In this scheme, the data mining system may use some of the functions
of the database and data warehouse system. It fetches the data from the data repository
managed by these systems and performs data mining on that data. It then stores the mining
result either in a file or in a designated place in a database or data warehouse.
(c) Semi−tight Coupling − In this scheme, the data mining system is linked with a database
or a data warehouse system and in addition to that, efficient implementations of a few data
mining primitives can be provided in the database.
(d) Tight coupling − In this coupling scheme, the data mining system is smoothly integrated
into the database or data warehouse system. The data mining subsystem is treated as one
functional component of an information system.
LESSON SIX:
The Data Mining Query Language (DMQL) was proposed by Han, Fu, Wang, et al. for the
DBMiner data mining system. The Data Mining Query Language is actually based on the
Structured Query Language (SQL). Data Mining Query Languages can be designed to support ad
hoc and interactive data mining. DMQL provides commands for specifying primitives. DMQL can
work with databases and data warehouses as well, and can be used to define data mining tasks; in
particular, we examine how to define data warehouses and data marts in DMQL.
Here we will discuss the syntax for Characterization, Discrimination, Association, Classification,
and Prediction.
Characterization
The analyze clause specifies aggregate measures, such as count, sum, or count%.
For example (using the standard DMQL form) −
mine characteristics as customerPurchasing
analyze count%
Discrimination
For example, a user may define big spenders as customers who purchase items that cost $100 or
more on average, and budget spenders as customers who purchase items costing less than $100 on
average. The mining of discriminant descriptions for customers from each of these categories
can be specified in DMQL.
Association
For example, the following metarule can be specified −
P(X: customer, W) ^ Q(X, Y) => buys(X, Z)
where X is a key of the customer relation; P and Q are predicate variables; and W, Y, and Z are
object variables.
Classification
For example, to mine patterns classifying customer credit rating, where the classes are determined
by the attribute credit_rating and the mining task is named classifyCustomerCreditRating, we
write −
mine classification as classifyCustomerCreditRating
analyze credit_rating
Prediction
Syntax for Concept Hierarchy Specification
- schema hierarchies
define hierarchy time_hierarchy on date as [date, month, quarter, year]
- set-grouping hierarchies
define hierarchy age_hierarchy for age on customer as
level1: {young, middle_aged, senior} < level0: all
level2: {20, ..., 39} < level1: young
level2: {40, ..., 59} < level1: middle_aged
level2: {60, ..., 89} < level1: senior
-operation-derived hierarchies
define hierarchy age_hierarchy for age on customer as
{age_category(1), ..., age_category(5)}
:= cluster(default, age, 5) < all(age)
-rule-based hierarchies
define hierarchy profit_margin_hierarchy on item as
level_1: low_profit_margin < level_0: all
Interestingness measures and thresholds can be specified by the user with the statement −
with <interest_measure_name> threshold = threshold_value
For example −
with support threshold = 0.05
with confidence threshold = 0.7
We have a syntax, which allows users to specify the display of discovered patterns in one or more
forms.
display as <result_form>
For Example −
display as table
As a market manager of a company, you would like to characterize the buying habits of customers
who purchase items priced at no less than $100, with respect to the customer's age, the type of
item purchased, and the place where the item was purchased. You would like to know the
percentage of customers having that characteristic. In particular, you are only interested in
purchases made in Canada, and paid with an American Express credit card. You would like to
view the resulting descriptions in the form of a table.
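DMQL itself is not directly executable here, so the following Python/pandas sketch approximates the query described above; the DataFrame, the column names (price, place_made, payment_method), and the records are assumptions made purely for illustration:

import pandas as pd

# Invented purchase records standing in for the company's database.
purchases = pd.DataFrame({
    "age":            [25, 41, 36, 52, 29],
    "item_type":      ["laptop", "printer", "laptop", "camera", "laptop"],
    "price":          [1200, 90, 450, 300, 150],
    "place_made":     ["Canada", "Canada", "USA", "Canada", "Canada"],
    "payment_method": ["AmEx", "AmEx", "Visa", "AmEx", "Cash"],
})

# Task-relevant data: items priced at no less than $100, bought in Canada, paid with AmEx.
relevant = purchases[(purchases["price"] >= 100) &
                     (purchases["place_made"] == "Canada") &
                     (purchases["payment_method"] == "AmEx")]

# Characterization: count% of such purchases by age and item type, shown as a table.
table = (relevant.groupby(["age", "item_type"]).size() / len(relevant) * 100).rename("count%")
print(table.reset_index())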
Standardizing data mining languages will serve the following purposes −
• It helps the systematic development of data mining solutions.
• It improves interoperability among multiple data mining systems and functions.
• It promotes education and rapid learning.
• It promotes the use of data mining systems in industry and society.
LESSON SEVEN
There are two forms of data analysis that can be used to extract models describing important
classes or to predict future data trends. These two forms are as follows −
• Classification
• Prediction
Classification models predict categorical class labels; and prediction models predict continuous
valued functions. For example, we can build a classification model to categorize bank loan
applications as either safe or risky, or a prediction model to predict the expenditures in dollars of
potential customers on computer equipment given their income and occupation.
Following are examples of cases where the data analysis task is Classification −
• A bank loan officer wants to analyze the data in order to know which customers (loan
applicants) are risky and which are safe.
• A marketing manager at a company needs to analyze which customers with a given profile
will buy a new computer.
In both of the above examples, a model or classifier is constructed to predict the categorical labels.
These labels are risky or safe for loan application data and yes or no for marketing data.
Following are examples of cases where the data analysis task is Prediction −
Suppose the marketing manager needs to predict how much a given customer will spend during a
sale at his company. In this example, we need to predict a numeric value; therefore, the data
analysis task is an example of numeric prediction. In this case, a model or predictor is
constructed that predicts a continuous-valued function, or an ordered value.
Note − Regression analysis is a statistical methodology that is most often used for numeric
prediction.
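As an illustration of the two tasks, here is a brief scikit-learn sketch (assuming scikit-learn is installed; the feature values and labels are invented): a decision tree classifier predicts the categorical label safe/risky, while a linear regression model predicts a continuous spending amount.

from sklearn.tree import DecisionTreeClassifier
from sklearn.linear_model import LinearRegression

# Classification: loan applications described by [income, debt], labeled safe/risky (invented).
X_cls = [[50, 5], [20, 15], [80, 2], [25, 20]]
y_cls = ["safe", "risky", "safe", "risky"]
clf = DecisionTreeClassifier().fit(X_cls, y_cls)
print(clf.predict([[60, 4]]))        # predicted class label for a new applicant

# Prediction (numeric): expenditure predicted from [income, years_as_customer] (invented).
X_reg = [[30, 1], [60, 3], [90, 5], [45, 2]]
y_reg = [200.0, 450.0, 700.0, 320.0]
reg = LinearRegression().fit(X_reg, y_reg)
print(reg.predict([[75, 4]]))        # predicted continuous spending value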
With the help of the bank loan application that we have discussed above, let us understand the
working of classification. The data classification process includes the following two steps −
• Building the Classifier or Model − This is the learning step, in which a classification
algorithm builds the classifier by analyzing a training set made up of database tuples and
their associated class labels.
• Using the Classifier for Classification − In this step, the classifier is used for classification.
Here the test data is used to estimate the accuracy of the classification rules. The
classification rules can be applied to new data tuples if the accuracy is considered acceptable.
The major issue is preparing the data for classification and prediction. Preparing the data involves
the following activities −
(a) Data Cleaning − Data cleaning involves removing the noise and treating missing
values. The noise is removed by applying smoothing techniques, and the problem of
missing values is solved by replacing a missing value with the most commonly occurring
value for that attribute.
(b) Relevance Analysis − The database may also have irrelevant attributes. Correlation
analysis is used to know whether any two given attributes are related.
(c) Data Transformation and reduction − The data can be transformed by any of the
following methods.
(i) Normalization − The data is transformed using normalization. Normalization
involves scaling all values for a given attribute so that they fall within a
small specified range. Normalization is used when, in the learning step, neural
networks or methods involving distance measurements are used.
(ii) Generalization − The data can also be transformed by generalizing it to a higher-level
concept. For this purpose, we can use concept hierarchies.
Note − Data can also be reduced by some other methods such as wavelet transformation, binning,
histogram analysis, and clustering.
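For instance, min-max normalization, one common form of the normalization mentioned above, rescales an attribute into a small range such as [0, 1]; the values in this Python sketch are invented:

ages = [20, 35, 50, 65]          # invented attribute values

lo, hi = min(ages), max(ages)
normalized = [(a - lo) / (hi - lo) for a in ages]   # min-max scaling to the range [0, 1]
print(normalized)                # [0.0, 0.333..., 0.666..., 1.0]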
Here are the criteria for comparing the methods of classification and prediction −
• Accuracy − The accuracy of a classifier refers to its ability to predict the class label
correctly; the accuracy of a predictor refers to how well a given predictor can guess
the value of the predicted attribute for new data.
• Speed − This refers to the computational cost in generating and using the classifier or
predictor.
• Robustness − It refers to the ability of classifier or predictor to make correct predictions
from given noisy data.
• Scalability − Scalability refers to the ability to construct the classifier or predictor
efficiently, given a large amount of data.
• Interpretability − It refers to the extent to which the classifier or predictor can be understood.
LESSON EIGHT
A decision tree is a structure that includes a root node, branches, and leaf nodes. Each internal
node denotes a test on an attribute, each branch denotes the outcome of a test, and each leaf node
holds a class label. The topmost node in the tree is the root node.
The following decision tree is for the concept buy_computer that indicates whether a customer at
a company is likely to buy a computer or not. Each internal node represents a test on an attribute.
Each leaf node represents a class.
A machine learning researcher named J. Ross Quinlan developed a decision tree algorithm known
as ID3 (Iterative Dichotomiser) in 1980. Later, he presented C4.5, which was the successor of ID3.
ID3 and C4.5 adopt a greedy approach. In these algorithms, there is no backtracking; the trees are
constructed in a top-down, recursive, divide-and-conquer manner.
Input:
D, a data partition (a set of training tuples and their associated class labels);
attribute_list, the set of candidate attributes.
Output:
A Decision Tree
Method (Generate_decision_tree):
create a node N;
if the tuples in D are all of the same class C then
   return N as a leaf node labeled with class C;
if attribute_list is empty then
   return N as a leaf node labeled with the majority class in D;
select the best splitting criterion and label node N with it;
for each outcome j of the splitting criterion
   let Dj be the set of tuples in D satisfying outcome j;
   if Dj is empty then
      attach a leaf labeled with the majority class in D to node N;
   else
      attach the node returned by Generate_decision_tree(Dj, attribute_list) to node N;
end for
return N;
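The following scikit-learn sketch (an illustration added to these notes, with invented, integer-encoded data) builds a small decision tree of the buy_computer kind using an entropy-based splitting criterion:

from sklearn.tree import DecisionTreeClassifier, export_text

# Features: [age_group, income_level, student] encoded as small integers (invented data).
X = [[0, 2, 0], [1, 2, 0], [2, 1, 1], [0, 0, 1], [2, 2, 0], [1, 0, 1]]
y = ["no", "yes", "yes", "yes", "no", "yes"]     # buys_computer

tree = DecisionTreeClassifier(criterion="entropy").fit(X, y)   # information-gain style splits
print(export_text(tree, feature_names=["age_group", "income_level", "student"]))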
Tree Pruning
Tree pruning is performed in order to remove anomalies in the training data due to noise or outliers.
The pruned trees are smaller and less complex.
Cost Complexity
The cost complexity of a pruned tree is measured by the following two parameters −
• the number of leaves in the tree, and
• the error rate of the tree.
LESSON NINE
Bayesian classification is based on Bayes' Theorem. Bayesian classifiers are statistical
classifiers. Bayesian classifiers can predict class membership probabilities, such as the probability
that a given tuple belongs to a particular class.
Bayes' Theorem is named after Thomas Bayes. There are two types of probabilities −
• Posterior Probability [P(H/X)]
• Prior Probability [P(H)]
where X is a data tuple and H is some hypothesis. According to Bayes' Theorem,
P(H/X) = P(X/H) P(H) / P(X)
Bayesian Belief Networks specify joint conditional probability distributions. They are also known
as Belief Networks, Bayesian Networks, or Probabilistic Networks.
The following diagram shows a directed acyclic graph for six Boolean variables.
The arc in the diagram allows representation of causal knowledge. For example, lung cancer is
influenced by a person's family history of lung cancer, as well as whether or not the person is a
smoker. It is worth noting that the variable PositiveXray is independent of whether the patient has
a family history of lung cancer or that the patient is a smoker, given that we know the patient has
lung cancer.
The conditional probability table for the values of the variable LungCancer (LC) shows each
possible combination of the values of its parent nodes, FamilyHistory (FH) and Smoker (S).
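To illustrate Bayes' Theorem numerically, here is a tiny Python sketch; the probabilities are invented and are not taken from the conditional probability table mentioned above:

# Invented illustrative numbers: H = "tuple belongs to class C", X = the observed evidence.
p_h = 0.3          # prior probability P(H)
p_x_given_h = 0.8  # likelihood P(X/H)
p_x = 0.5          # evidence P(X)

p_h_given_x = p_x_given_h * p_h / p_x   # Bayes' Theorem: P(H/X) = P(X/H)P(H)/P(X)
print(p_h_given_x)                       # posterior probability, 0.48 here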
LESSON TEN
IF-THEN Rules
A rule-based classifier makes use of a set of IF-THEN rules for classification. We can express a rule
in the following form −
IF condition THEN conclusion
Points to remember −
• The IF part of the rule is called the rule antecedent or precondition.
• The THEN part of the rule is called the rule consequent.
• If the condition holds true for a given tuple, then the antecedent is satisfied.
Rule Extraction
Here we will learn how to build a rule-based classifier by extracting IF-THEN rules from a decision
tree.
Points to remember −
• One rule is created for each path from the root to the leaf node.
• To form a rule antecedent, each splitting criterion is logically ANDed.
• The leaf node holds the class prediction, forming the rule consequent.
Rule Induction Using Sequential Covering Algorithm
A sequential covering algorithm can be used to extract IF-THEN rules from the training data. We
do not need to generate a decision tree first. In this algorithm, each rule for a given class covers
many of the tuples of that class.
Some of the sequential covering algorithms are AQ, CN2, and RIPPER. As per the general
strategy, the rules are learned one at a time. Each time a rule is learned, the tuples covered by the
rule are removed, and the process continues for the remaining tuples.
Note − Decision tree induction, in contrast, can be considered as learning a set of rules
simultaneously, because the path to each leaf in a decision tree corresponds to a rule.
The following is the sequential learning algorithm, where rules are learned for one class at a time.
When learning a rule for a class C, we want the rule to cover all the tuples of class C only
and no tuples of any other class.
Input:
D, a data set of class-labeled tuples;
Att_vals, the set of all attributes and their possible values.
Output: A set of IF-THEN rules.
Method:
Rule_set = { };                  // the set of learned rules is initially empty
for each class c do
   repeat
      Rule = Learn_One_Rule(D, Att_vals, c);
      remove tuples covered by Rule from D;
      Rule_set = Rule_set + Rule; // add the new rule to the rule set
   until termination condition;
end for
return Rule_set;
Rule Pruning
Rules are pruned for the following reasons −
• The assessment of quality is made on the original set of training data. The rule may
perform well on training data but less well on subsequent data. That is why rule pruning
is required.
• A rule is pruned by removing a conjunct. The rule R is pruned if the pruned version of R has
greater quality, as assessed on an independent set of tuples.
FOIL is one of the simple and effective methods for rule pruning. For a given rule R,
FOIL_Prune(R) = (pos − neg) / (pos + neg)
where pos and neg are the number of positive and negative tuples covered by R, respectively.
Note − This value will increase with the accuracy of R on the pruning set. Hence, if the
FOIL_Prune value is higher for the pruned version of R, then we prune R.
LESSON ELEVEN
Here we will discuss other classification methods such as Genetic Algorithms, Rough Set
Approach, and Fuzzy Set Approach.
The idea of the genetic algorithm is derived from natural evolution. In a genetic algorithm, first of all,
the initial population is created. This initial population consists of randomly generated rules. We
can represent each rule by a string of bits.
For example, in a given training set, the samples are described by two Boolean attributes such as
A1 and A2. And this given training set contains two classes such as C1 and C2.
We can encode the rule IF A1 AND NOT A2 THEN C2 as the bit string 100. In this bit
representation, the two leftmost bits represent the attributes A1 and A2, respectively, and the
rightmost bit represents the class.
Likewise, the rule IF NOT A1 AND NOT A2 THEN C1 can be encoded as 001.
Note − If an attribute has K values where K > 2, then K bits can be used to encode the attribute's
values. Classes are encoded in the same manner.
Points to remember −
• Based on the notion of the survival of the fittest, a new population is formed that consists
of the fittest rules in the current population and offspring values of these rules as well.
• The fitness of a rule is assessed by its classification accuracy on a set of training samples.
• The genetic operators such as crossover and mutation are applied to create offspring.
• In crossover, substrings from a pair of rules are swapped to form a new pair of rules.
• In mutation, randomly selected bits in a rule's string are inverted.
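A small Python sketch of the crossover and mutation operators described above, applied to the 3-bit rule strings from the example (the crossover point and the single random bit flip are illustrative choices):

import random

rule_a = "100"   # IF A1 AND NOT A2 THEN C2  (from the example above)
rule_b = "001"   # IF NOT A1 AND NOT A2 THEN C1

def crossover(r1, r2, point=1):
    """Swap the substrings after the crossover point to form two offspring rules."""
    return r1[:point] + r2[point:], r2[:point] + r1[point:]

def mutate(rule):
    """Invert one randomly selected bit of the rule string."""
    i = random.randrange(len(rule))
    flipped = "1" if rule[i] == "0" else "0"
    return rule[:i] + flipped + rule[i + 1:]

child1, child2 = crossover(rule_a, rule_b)
print(child1, child2, mutate(child1))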
We can use the rough set approach to discover structural relationships within imprecise and noisy
data.
Note − This approach can only be applied to discrete-valued attributes. Therefore, continuous-
valued attributes must be discretized before it is used.
Rough Set Theory is based on the establishment of equivalence classes within the given
training data. The tuples that form an equivalence class are indiscernible, meaning the samples
are identical with respect to the attributes describing the data.
There are some classes in the given real world data, which cannot be distinguished in terms of
available attributes. We can use the rough sets to roughly define such classes.
For a given class C, the rough set definition is approximated by two sets as follows −
• Lower Approximation of C − The set of all data tuples that, based on the knowledge of the
attributes, certainly belong to class C.
• Upper Approximation of C − The set of all tuples that, based on the knowledge of the
attributes, cannot be described as not belonging to C.
The following diagram shows the Upper and Lower Approximation of class C −
Fuzzy Set Theory is also called Possibility Theory. This theory was proposed by Lotfi Zadeh in
1965 as an alternative to two-value logic and probability theory. It allows us to work at a high
level of abstraction and also provides the means for dealing with imprecise measurement of data.
Fuzzy set theory also allows us to deal with vague or inexact facts. For example, membership in a
set of high incomes is inexact (e.g., if $50,000 is high, then what about $49,000 or $48,000?).
Unlike a traditional crisp set, where an element either belongs to S or to its complement, in fuzzy
set theory an element can belong to more than one fuzzy set.
For example, the income value $49,000 belongs to both the medium and high fuzzy sets but to
differing degrees. Fuzzy set notation for this income value is as follows −
m_medium_income($49K) = 0.15 and m_high_income($49K) = 0.96
where ‘m’ is the membership function that operates on the fuzzy sets medium_income and
high_income, respectively.
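The following Python sketch mimics such membership values with simple piecewise-linear membership functions; the breakpoints ($30K–$60K and $40K–$50K) are invented for illustration and do not reproduce the exact 0.15 and 0.96 values above:

def membership(x, start, end):
    """Piecewise-linear membership rising from 0 at 'start' to 1 at 'end'."""
    if x <= start:
        return 0.0
    if x >= end:
        return 1.0
    return (x - start) / (end - start)

income = 49_000
m_medium = 1.0 - membership(income, 30_000, 60_000)   # medium_income fades out as income grows
m_high = membership(income, 40_000, 50_000)           # high_income ramps up between 40K and 50K
print(round(m_medium, 2), round(m_high, 2))           # one income value, two fuzzy sets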
LESSON TWELVE
A cluster is a group of objects that belong to the same class. In other words, similar objects are
grouped in one cluster and dissimilar objects are grouped in another cluster.
Clustering is the process of grouping a set of abstract objects into classes of similar objects.
Points to Remember
(a) Clustering analysis is broadly used in many applications such as market research, pattern
recognition, data analysis, and image processing.
(b) Clustering can also help marketers discover distinct groups in their customer base. And
they can characterize their customer groups based on the purchasing patterns.
(c) In the field of biology, it can be used to derive plant and animal taxonomies, categorize
genes with similar functionalities and gain insight into structures inherent to populations.
(d) Clustering also helps in identification of areas of similar land use in an earth observation
database. It also helps in the identification of groups of houses in a city according to house
type, value, and geographic location.
(e) Clustering also helps in classifying documents on the web for information discovery.
(f) Clustering is also used in outlier detection applications such as detection of credit card
fraud.
(g) As a data mining function, cluster analysis serves as a tool to gain insight into the
distribution of data to observe characteristics of each cluster.
The following points throw light on why clustering is required in data mining −
(a) Scalability − We need highly scalable clustering algorithms to deal with large databases.
(b) Ability to deal with different kinds of attributes − Algorithms should be capable of being
applied to any kind of data, such as interval-based (numerical) data, categorical data, and binary
data.
(c) Discovery of clusters with arbitrary shape − The clustering algorithm should be capable
of detecting clusters of arbitrary shape. It should not be bounded to distance
measures that tend to find spherical clusters of small size.
(d) High dimensionality − The clustering algorithm should be able to handle not only low-
dimensional data but also high-dimensional space.
(e) Ability to deal with noisy data − Databases contain noisy, missing or erroneous data.
Some algorithms are sensitive to such data and may lead to poor quality clusters.
(f) Interpretability − The clustering results should be interpretable, comprehensible, and
usable.
Clustering methods can be classified into the following categories −
• Partitioning Method
• Hierarchical Method
• Density-based Method
• Grid-Based Method
• Model-Based Method
• Constraint-based Method
Suppose we are given a database of ‘n’ objects, and the partitioning method constructs ‘k’ partitions
of the data. Each partition will represent a cluster, and k ≤ n. This means it classifies the data into
k groups, which satisfy the following requirements −
• Each group contains at least one object.
• Each object must belong to exactly one group.
Points to remember −
• For a given number of partitions (say k), the partitioning method will create an initial
partitioning.
• Then it uses an iterative relocation technique to improve the partitioning by moving
objects from one group to another.
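As a brief illustration of the partitioning idea, here is a scikit-learn sketch using k-means, a well-known partitioning algorithm (scikit-learn and the toy points are assumptions of this illustration):

from sklearn.cluster import KMeans

# Six invented 2-D points that visually form two groups.
points = [[1, 2], [1, 4], [2, 3], [8, 8], [9, 10], [8, 9]]

kmeans = KMeans(n_clusters=2, n_init=10, random_state=0).fit(points)
print(kmeans.labels_)            # cluster assignment for each point
print(kmeans.cluster_centers_)   # the k cluster centers found by iterative relocation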
This method creates a hierarchical decomposition of the given set of data objects. We can classify
hierarchical methods on the basis of how the hierarchical decomposition is formed. There are two
approaches here −
• Agglomerative Approach
• Divisive Approach
Agglomerative Approach
This approach is also known as the bottom-up approach. In this approach, we start with each object
forming a separate group. It keeps on merging the objects or groups that are close to one another.
It keeps doing so until all of the groups are merged into one, or until the termination condition holds.
Divisive Approach
This approach is also known as the top-down approach. In this approach, we start with all of the
objects in the same cluster. In each iteration, a cluster is split into smaller clusters. This continues
until each object is in its own cluster or the termination condition holds. This method is rigid, i.e.,
once a merging or splitting is done, it can never be undone.
Here are two approaches that are used to improve the quality of hierarchical clustering −
• Perform careful analysis of object linkages at each hierarchical partitioning.
• Integrate hierarchical agglomeration with iterative relocation, first grouping objects with a
hierarchical agglomerative algorithm and then refining the result by iterative relocation.
This method is based on the notion of density. The basic idea is to continue growing the given
cluster as long as the density in the neighborhood exceeds some threshold, i.e., for each data point
within a given cluster, the radius of a given cluster has to contain at least a minimum number of
points.
In this method, the objects together form a grid. The object space is quantized into a finite number
of cells that form a grid structure.
Advantages
• The major advantage of this method is fast processing time.
• It is dependent only on the number of cells in each dimension in the quantized space.
In this method, a model is hypothesized for each cluster to find the best fit of data for a given
model. This method locates the clusters by clustering the density function. It reflects spatial
distribution of the data points.
This method also provides a way to automatically determine the number of clusters based on
standard statistics, taking outlier or noise into account. It therefore yields robust clustering
methods.
Text databases consist of huge collections of documents. They collect this information from several
sources such as news articles, books, digital libraries, e-mail messages, web pages, etc. Due to the
increase in the amount of information, text databases are growing rapidly. In many text
databases, the data is semi-structured.
For example, a document may contain a few structured fields, such as title, author,
publishing_date, etc. But along with the structured data, the document also contains unstructured
text components, such as the abstract and contents. Without knowing what could be in the documents,
it is difficult to formulate effective queries for analyzing and extracting useful information from
the data. Users require tools to compare the documents and rank their importance and relevance.
Therefore, text mining has become popular and an essential theme in data mining.
Information retrieval deals with the retrieval of information from a large number of text-based
documents. Information retrieval systems differ from database systems because the two handle
different kinds of data. Examples of information retrieval systems include −
• Online library catalogue systems
• Online document management systems
• Web search systems, etc.
Note − The main problem in an information retrieval system is to locate relevant documents in a
document collection based on a user's query. This kind of user's query consists of some keywords
describing an information need.
In such search problems, the user takes an initiative to pull relevant information out from a
collection. This is appropriate when the user has ad-hoc information need, i.e., a short-term need.
But if the user has a long-term information need, then the retrieval system can also take an initiative
to push any newly arrived information item to the user.
This kind of access to information is called Information Filtering. And the corresponding systems
are known as Filtering Systems or Recommender Systems.
We need to check the accuracy of a system when it retrieves a number of documents on the basis
of user's input. Let the set of documents relevant to a query be denoted as {Relevant} and the set
of retrieved document as {Retrieved}. The set of documents that are relevant and retrieved can be
denoted as {Relevant} ∩ {Retrieved}. This can be shown in the form of a Venn diagram as follows
−
There are three fundamental measures for assessing the quality of text retrieval −
• Precision
• Recall
• F-score
Precision
Precision is the percentage of retrieved documents that are in fact relevant to the query. Precision
can be defined as −
Precision = |{Relevant} ∩ {Retrieved}| / |{Retrieved}|
Recall
Recall is the percentage of documents that are relevant to the query and were in fact retrieved.
Recall is defined as −
Recall = |{Relevant} ∩ {Retrieved}| / |{Relevant}|
F-score
An information retrieval system often needs to trade off precision for recall, or vice versa. The
F-score is the commonly used trade-off measure; it is defined as the harmonic mean of recall and
precision −
F-score = 2 × (Precision × Recall) / (Precision + Recall)
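A small Python sketch of the three measures, using invented document identifiers for the relevant and retrieved sets:

relevant = {"d1", "d2", "d3", "d4"}       # documents truly relevant to the query (invented)
retrieved = {"d2", "d3", "d5"}            # documents the system returned (invented)

overlap = relevant & retrieved            # {Relevant} ∩ {Retrieved}
precision = len(overlap) / len(retrieved)
recall = len(overlap) / len(relevant)
f_score = 2 * precision * recall / (precision + recall)
print(precision, recall, round(f_score, 3))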
The World Wide Web contains huge amounts of information that provides a rich source for data
mining.
The web poses great challenges for resource and knowledge discovery based on the following
observations −
(a) The web is too huge − The size of the web is very large and rapidly increasing. It seems
that the web is too huge for data warehousing and data mining.
(b) Complexity of web pages − Web pages do not have a unifying structure. They are very
complex compared to traditional text documents. There are huge numbers of documents
in the digital library of the web, and these libraries are not arranged in any particular sorted
order.
(c) Web is dynamic information source − The information on the web is rapidly updated.
The data such as news, stock markets, weather, sports, shopping, etc., are regularly
updated.
(d) Diversity of user communities − The user community on the web is rapidly expanding.
These users have different backgrounds, interests, and usage purposes. There are more than
100 million workstations connected to the Internet, and the number is still rapidly increasing.
(e) Relevancy of information − It is considered that a particular person is generally interested
in only a small portion of the web, while the rest of the web contains information that is not
relevant to the user and may swamp desired results.
The basic structure of a web page is based on the Document Object Model (DOM). The DOM
structure refers to a tree-like structure in which each HTML tag in the page corresponds to a node
in the DOM tree. We can segment a web page by using predefined tags in HTML. Because the
HTML syntax is flexible, many web pages do not follow the W3C specifications, and not following
the W3C specifications may cause errors in the DOM tree structure.
The DOM structure was initially introduced for presentation in the browser and not for description
of semantic structure of the web page. The DOM structure cannot correctly identify the semantic
relationship between the different parts of a web page.
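A minimal sketch using Python's standard html.parser module to show how HTML tags map onto a tree of nodes; the tiny HTML string is invented, and real pages are far messier, which is exactly the problem described above:

from html.parser import HTMLParser

class DomSketch(HTMLParser):
    """Prints each start tag indented by its depth, approximating the DOM tree."""
    def __init__(self):
        super().__init__()
        self.depth = 0
    def handle_starttag(self, tag, attrs):
        print("  " * self.depth + tag)
        self.depth += 1
    def handle_endtag(self, tag):
        self.depth -= 1

DomSketch().feed("<html><body><h1>Title</h1><p>Some text</p></body></html>")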
Vision-Based Page Segmentation (VIPS)
• The purpose of VIPS is to extract the semantic structure of a web page based on its visual
presentation.
• Such a semantic structure corresponds to a tree structure. In this tree each node corresponds
to a block.
• A value is assigned to each node. This value is called the Degree of Coherence. This value
is assigned to indicate the coherent content in the block based on visual perception.
• The VIPS algorithm first extracts all the suitable blocks from the HTML DOM tree. After
that it finds the separators between these blocks.
• The separators refer to the horizontal or vertical lines in a web page that visually cross with
no blocks.
• The semantics of the web page is constructed on the basis of these blocks.
Data mining is widely used in diverse areas. There are a number of commercial data mining systems
available today, and yet there are many challenges in this field. In this lesson, we will discuss the
applications of and the trends in data mining.
(a) Financial Data Analysis
The financial data in the banking and financial industry is generally reliable and of high quality, which
facilitates systematic data analysis and data mining. Some of the typical cases are as follows −
• Design and construction of data warehouses for multidimensional data analysis and data
mining.
• Loan payment prediction and customer credit policy analysis.
• Classification and clustering of customers for targeted marketing.
• Detection of money laundering and other financial crimes.
(b) Retail Industry
Data mining has great application in the retail industry because it collects large amounts of data
on sales, customer purchasing history, goods transportation, consumption, and services. It is
natural that the quantity of data collected will continue to expand rapidly because of the increasing
ease, availability, and popularity of the web.
Data mining in retail industry helps in identifying customer buying patterns and trends that lead to
improved quality of customer service and good customer retention and satisfaction. Here is the list
of examples of data mining in the retail industry −
• Design and Construction of data warehouses based on the benefits of data mining.
• Multidimensional analysis of sales, customers, products, time and region.
• Analysis of effectiveness of sales campaigns.
• Customer Retention.
• Product recommendation and cross-referencing of items.
(c) Telecommunication Industry
Today the telecommunication industry is one of the fastest-growing industries, providing various
services such as fax, pager, cellular phone, Internet messenger, images, e-mail, web data
transmission, etc. Due to the development of new computer and communication technologies, the
telecommunication industry is rapidly expanding. This is the reason why data mining has become
very important in helping to understand the business.
(d) Biological Data Analysis
In recent times, we have seen tremendous growth in the field of biology, in areas such as genomics,
proteomics, functional genomics, and biomedical research. Biological data mining is a very
important part of bioinformatics, contributing to biological data analysis in areas such as the
semantic integration of heterogeneous genomic and proteomic databases, the alignment and
comparison of genetic sequences, and the discovery of structural patterns.
(e) Other Scientific Applications
The applications discussed above tend to handle relatively small and homogeneous data sets for
which statistical techniques are appropriate. Huge amounts of data have been collected from
scientific domains such as the geosciences, astronomy, etc. Large data sets are also being generated
by fast numerical simulations in fields such as climate and ecosystem modeling, chemical
engineering, and fluid dynamics. Data mining in scientific applications includes data warehouses
and data preprocessing, graph-based mining, and visualization combined with domain-specific
knowledge.
(f) Intrusion Detection
Intrusion refers to any kind of action that threatens the integrity, confidentiality, or availability of
network resources. In this world of connectivity, security has become a major issue. The increased
usage of the Internet and the availability of tools and tricks for intruding on and attacking networks
have prompted intrusion detection to become a critical component of network administration.
Data mining technology may be applied to intrusion detection in areas such as the development of
data mining algorithms for intrusion detection, association and correlation analysis to help select
discriminating attributes, analysis of stream data, distributed data mining, and visualization and
query tools.
There are many data mining system products and domain-specific data mining applications. New
data mining systems and applications are being added to the previous systems. Also, efforts are
being made to standardize data mining languages. When choosing a data mining system, the
following points should be considered −
• Data Types − The data mining system may handle formatted text, record-based data, and
relational data. The data could also be in ASCII text, relational database data or data
warehouse data. Therefore, we should check what exact format the data mining system can
handle.
• System Issues − We must consider the compatibility of a data mining system with different
operating systems. One data mining system may run on only one operating system or on
several. There are also data mining systems that provide web-based user interfaces and
allow XML data as input.
• Data Sources − Data sources refer to the data formats on which the data mining system will
operate. Some data mining systems may work only on ASCII text files, while others work on
multiple relational sources. The data mining system should also support ODBC connections
or OLE DB for ODBC connections.
• Data mining functions and methodologies − Some data mining systems
provide only one data mining function, such as classification, while some provide multiple
data mining functions such as concept description, discovery-driven OLAP analysis,
association mining, linkage analysis, statistical analysis, classification, prediction,
clustering, outlier analysis, similarity search, etc.
• Coupling data mining with databases or data warehouse systems − Data mining
systems need to be coupled with a database or a data warehouse system. The coupled
components are integrated into a uniform information processing environment. Here are
the types of coupling listed below −
o No coupling
o Loose Coupling
o Semi tight Coupling
o Tight Coupling
• Data mining query language and graphical user interface − An easy-to-use graphical
user interface is important to promote user-guided, interactive data mining. Unlike
relational database systems, data mining systems do not share an underlying data mining
query language.
Data mining concepts are still evolving. Here are the latest trends that we get to see in this field −
• Application Exploration.
• Scalable and interactive data mining methods.
• Integration of data mining with database systems, data warehouse systems and web
database systems.
• Standardization of data mining query language.
• Visual data mining.
• New methods for mining complex types of data.
• Biological data mining.
• Data mining and software engineering.
• Web mining.
• Distributed data mining.
• Real time data mining.
• Multi database data mining.
• Privacy protection and information security in data mining.
LESSON SIXTEEN
1.1 Theoretical Foundations of Data Mining
(a) Data Reduction − The basic idea of this theory is to reduce the data representation, trading
accuracy for speed in response to the need to obtain quick approximate answers to
queries on very large databases. Some of the data reduction techniques are as follows −
o Singular value Decomposition
o Wavelets
o Regression
o Log-linear models
o Histograms
o Clustering
o Sampling
o Construction of Index Trees
(b) Data Compression − The basic idea of this theory is to compress the given data by
encoding in terms of the following −
o Bits
o Association Rules
o Decision Trees
o Clusters
(c) Pattern Discovery − The basic idea of this theory is to discover patterns occurring in a
database. Following are the areas that contribute to this theory −
o Machine Learning
o Neural Network
o Association Mining
o Sequential Pattern Matching
o Clustering
(d) Probability Theory − This theory is based on statistical theory. The basic idea behind this
theory is to discover joint probability distributions of random variables.
(e) Microeconomic View − According to this theory, data mining finds the patterns that are
interesting only to the extent that they can be used in the decision-making process of some
enterprise.
(f) Inductive Databases − As per this theory, a database schema consists of the data and the
patterns that are stored in the database. Therefore, data mining is the task of performing
induction on databases.
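To make the data reduction theory in (a) above concrete, here is a small Python/numpy sketch in which singular value decomposition builds a reduced, rank-1 approximation of an invented matrix:

import numpy as np

# Invented 4x3 data matrix; in practice this would be a large attribute table.
X = np.array([[2.0, 4.0, 6.0],
              [1.0, 2.0, 3.1],
              [3.0, 6.0, 9.2],
              [0.5, 1.0, 1.4]])

U, s, Vt = np.linalg.svd(X, full_matrices=False)

k = 1                                   # keep only the largest singular value
X_reduced = U[:, :k] * s[:k] @ Vt[:k]   # rank-k approximation: less data, small loss of accuracy
print(np.round(X_reduced, 2))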
1.2 Statistical Data Mining
Apart from the database-oriented techniques, there are statistical techniques available for data
analysis. These techniques can be applied to scientific data and to data from the economic and
social sciences as well. Some of these statistical techniques are listed below −
(a) Regression − Regression methods are used to predict the value of the response variable
from one or more predictor variables where the variables are numeric. Listed below are the
forms of Regression −
o Linear
o Multiple
o Weighted
o Polynomial
o Nonparametric
o Robust
(d) Mixed-effect Models − These models are used for analyzing grouped data. These models
describe the relationship between a response variable and some co-variates in the data
grouped according to one or more factors.
(e) Factor Analysis − Factor analysis is used to find latent factors that explain the correlations
among a set of observed variables. This method assumes that the variables follow a
multivariate normal distribution.
(f) Time Series Analysis − Following are the methods for analyzing time-series data −
o Auto-regression Methods.
o Univariate ARIMA (AutoRegressive Integrated Moving Average) Modeling.
o Long-memory time-series modeling.
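As a quick illustration of the simple linear (least-squares) form listed under Regression above, the following Python/numpy sketch fits y = a·x + b to invented points:

import numpy as np

x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])       # predictor values (invented)
y = np.array([2.1, 4.0, 6.2, 7.9, 10.1])      # response values (invented)

a, b = np.polyfit(x, y, deg=1)                # least-squares fit of y = a*x + b
print(round(a, 3), round(b, 3))               # fitted slope and intercept
print(round(a * 6.0 + b, 3))                  # predicted response for x = 6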
1.3 Visual Data Mining
Visual Data Mining uses data and/or knowledge visualization techniques to discover implicit
knowledge from large data sets. Visual data mining can be viewed as an integration of the
following disciplines −
• Data Visualization
• Data Mining
• Computer Graphics
• Multimedia Systems
• Human Computer Interaction
• Pattern Recognition
• High-performance Computing
Generally, data visualization and data mining can be integrated in the following ways −
(a) Data Visualization − The data in a database or a data warehouse can be viewed in several
visual forms that are listed below −
o Boxplots
o 3-D Cubes
o Data distribution charts
o Curves
o Surfaces
o Link graphs etc.
(b) Data Mining Result Visualization − Data mining result visualization is the presentation
of the results of data mining in visual forms. These visual forms could be scatter plots,
boxplots, etc.
(c) Data Mining Process Visualization − Data Mining Process Visualization presents the
several processes of data mining. It allows the users to see how the data is extracted. It also
allows the users to see from which database or data warehouse the data is cleaned,
integrated, preprocessed, and mined.
1.4 Audio Data Mining
Audio data mining makes use of audio signals to indicate the patterns of data or the features of
data mining results. By transforming patterns into sound and music, we can listen to pitches and
tunes, instead of watching pictures, in order to identify anything interesting.
1.5 Data Mining and Collaborative Filtering
Consumers today come across a variety of goods and services while shopping. During live
customer transactions, a Recommender System helps the consumer by making product
recommendations. The Collaborative Filtering Approach is generally used for recommending
products to customers. These recommendations are based on the opinions of other customers.