Module 4
Sequence or path analysis: This technique identifies patterns in which one event
leads to a subsequent event. For example, consumers may ask for a backpack or carry bag
depending on the items, and the quantity of items, they are buying.
Classification: This technique identifies new groups in the stored data and uncovers
previously unknown facts. For example, a restaurant could mine its customer data to
identify when the maximum number of customers visit and what they order. Based on
this information, special daily offers can be introduced to increase customers and revenue.
Forecasting: This technique is used for discovering patterns in data that can lead to
actionable predictions about the future. For example, life insurance companies frame
policies on the basis of predictions about human life expectancy.
Architecture of data mining
Data Source:
The actual sources of data are databases, data warehouses, the World Wide Web (WWW), text
files, and other documents. A huge amount of historical data is needed for data mining to be
successful. Organizations typically store data in databases or data warehouses. Data
warehouses may comprise one or more databases, text files, spreadsheets, or other
repositories of data. Sometimes, even plain text files or spreadsheets may contain useful
information. Another primary source of data is the World Wide Web, or the internet.
Different processes:
Before being passed to the database or data warehouse server, the data must be
cleaned, integrated, and selected. Because the information comes from various sources and in
different formats, it cannot be used directly for the data mining procedure: the data
may be incomplete or inaccurate. So the data first needs to be cleaned and unified.
More information than needed is usually collected from the various data sources, and only the
data of interest has to be selected and passed to the server. These procedures are not
as easy as they sound; several methods may be applied to the data as part of selection,
integration, and cleaning.
Database or Data Warehouse Server:
The database or data warehouse server contains the actual data that is ready to be
processed. The server is responsible for retrieving the relevant data, based on the
user's data mining request.
Data Mining Engine:
The data mining engine is the major component of any data mining system. It contains several
modules for performing data mining tasks, including association, characterization, classification,
clustering, prediction, and time-series analysis. In other words, the data mining engine is the
core of the data mining architecture. It comprises the instruments and software used to obtain
insights and knowledge from data collected from various data sources and stored within the
data warehouse.
Knowledge Base:
The knowledge base supports the entire data mining process. It may be used
to guide the search or to evaluate the interestingness of the resulting patterns. The
knowledge base may even contain user beliefs and data from user experiences that
can be helpful in the data mining process. The data mining engine may receive
inputs from the knowledge base to make the results more accurate and reliable. The
pattern evaluation module regularly interacts with the knowledge base to get
inputs, and also to update it.
Functionalities of data mining
Class/Concept Descriptions
A class or concept implies that there is a data set, or a set of features, that defines the
class or concept. A class can be a category of items on a shop floor, and a
concept could be the abstract idea by which data may be categorized, such as
products to be put on clearance sale versus non-sale products. There are two
concepts here: one that helps with grouping and one that helps with
differentiating.
Frequent item set: This term refers to a group of items that are commonly
found together, such as milk and sugar.
Frequent substructure: This refers to structured patterns, such as trees and
graphs, that can be combined with an item set or subsequences.
Frequent subsequence: A regular pattern of events in sequence, such as buying a phone
followed by a cover.
Association Analysis
It analyses the sets of items that generally occur together in a transactional dataset. It
is also known as Market Basket Analysis because of its wide use in retail sales. Two parameters
are used for determining the association rules:
Support identifies how frequently the item set occurs in the database.
Confidence is the conditional probability that an item occurs given that another item occurs
in a transaction.
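As a sketch of the two parameters above, the following computes support and confidence over a small, hypothetical set of market baskets (the transactions and item names are made up for illustration):

```python
# Hypothetical transactional dataset: each set is one market basket.
transactions = [
    {"milk", "sugar", "bread"},
    {"milk", "sugar"},
    {"bread", "butter"},
    {"milk", "bread", "sugar"},
    {"butter", "sugar"},
]

def support(itemset, transactions):
    """Fraction of transactions that contain every item in `itemset`."""
    itemset = set(itemset)
    return sum(itemset <= t for t in transactions) / len(transactions)

def confidence(antecedent, consequent, transactions):
    """Conditional probability: support(A and C) / support(A)."""
    return (support(set(antecedent) | set(consequent), transactions)
            / support(antecedent, transactions))

print(support({"milk", "sugar"}, transactions))       # 0.6 (3 of 5 baskets)
print(confidence({"milk"}, {"sugar"}, transactions))  # 1.0 (sugar in every milk basket)
```

A rule such as milk => sugar would be reported when both its support and its confidence exceed user-chosen thresholds.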
Classification
It predicts unavailable data values, such as missing labels or spending trends. An object can be
anticipated based on the attribute values of the object and the attribute values of the classes. This
can be a prediction of missing numerical values, or of increasing or decreasing trends in time-related
information. There are primarily two types of predictions in data mining: numeric and class
predictions.
Numeric predictions are made by creating a linear regression model that is based on historical
data. Prediction of numeric values helps businesses ramp up for a future event that might
impact the business positively or negatively.
Class predictions are used to fill in missing class information for products using a training data
set where the class for products is known.
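A minimal sketch of a numeric prediction via linear regression, using only the standard library; the month/sales figures are hypothetical:

```python
# Least-squares fit of y = slope * x + intercept on historical data.
xs = [1, 2, 3, 4, 5]          # e.g. month index (hypothetical)
ys = [10, 12, 14, 16, 18]     # e.g. sales figures (hypothetical)

n = len(xs)
mean_x = sum(xs) / n
mean_y = sum(ys) / n
# Standard closed-form solution for simple linear regression.
slope = (sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
         / sum((x - mean_x) ** 2 for x in xs))
intercept = mean_y - slope * mean_x

def predict(x):
    """Numeric prediction for an unseen x (e.g. a future month)."""
    return slope * x + intercept

print(predict(6))  # 20.0
```

A class prediction would instead use a training set with known class labels, as the text describes, and output a discrete category rather than a number.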
Cluster Analysis
Clustering is a popular data mining functionality in image processing, pattern recognition,
and bioinformatics. It is similar to classification, but the classes are not predefined; the
data attributes themselves represent the classes. Similar data are grouped together, the
difference being that a class label is not known. Clustering algorithms group data based on
similarities and dissimilarities in their features.
Outlier analysis
Outlier analysis is important for understanding the quality of data. If there are too many outliers, you
cannot trust the data or the patterns drawn from it. An outlier analysis determines whether there is
something out of the ordinary in the data and whether it indicates a situation that a business needs to
consider and take measures to mitigate. Data that cannot be grouped into any class by the
algorithms is flagged as outliers.
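One simple way to flag outliers, sketched here with made-up measurements, is to mark any value more than two standard deviations from the mean (the threshold of 2 is a common rule of thumb, not a fixed standard):

```python
import statistics

values = [10, 12, 11, 13, 12, 95, 11, 10]  # 95 is the anomaly
mean = statistics.mean(values)
stdev = statistics.stdev(values)

# Flag values whose z-score exceeds 2 in absolute value.
outliers = [v for v in values if abs(v - mean) / stdev > 2]
print(outliers)  # [95]
```

In practice the flagged points are then inspected: they may be data-entry errors to clean, or genuine events the business must act on.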
Evolution Analysis
Evolution analysis pertains to the study of data sets that change over time. Evolution analysis
models are designed to capture evolutionary trends in data, helping to characterize, classify,
cluster, or discriminate time-related data.
Correlation Analysis
Correlation is a mathematical technique for determining whether and how strongly two attributes are
related to one another. It determines how well two numerically measured continuous variables are
linked.
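The standard measure is the Pearson correlation coefficient, which ranges from -1 (perfect negative relationship) to +1 (perfect positive relationship); a minimal stdlib-only sketch:

```python
import math

def pearson(xs, ys):
    """Pearson correlation coefficient between two numeric sequences."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)

print(pearson([1, 2, 3, 4], [2, 4, 6, 8]))  # ≈ 1.0  (perfect positive)
print(pearson([1, 2, 3, 4], [8, 6, 4, 2]))  # ≈ -1.0 (perfect negative)
```

A value near 0 would indicate no linear relationship between the two attributes.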
Types of data
Kinds of data to be mined
Basically, data sets fall into the following kinds:
1. Record data
a) Transactional data
b) Data matrix
c) Sparse data matrix
2. Graph-based data
3. Ordered data
i) Sequential data
ii) Sequence data
iii) Time-series data
iv) Spatial data
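To make the data-matrix and sparse-data-matrix kinds concrete, here is a small sketch (the matrix values are arbitrary): a dense data matrix stores every cell, while a sparse representation keeps only the non-zero entries.

```python
# Dense data matrix: every value is stored, including zeros.
dense = [
    [0, 0, 3],
    [0, 5, 0],
    [0, 0, 0],
]

# Sparse representation: keep only (row, column) -> value for non-zero cells.
sparse = {(r, c): v
          for r, row in enumerate(dense)
          for c, v in enumerate(row) if v != 0}

print(sparse)  # {(0, 2): 3, (1, 1): 5}
```

When most cells are zero (as in market-basket or document-term data), the sparse form saves a great deal of space.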
Unsupervised Classification
Here, no attention is paid to predetermined attributes, and there is little concern with a
target value. It is used only to map out hidden relations and structures in the data.
Semi-Supervised Classification
This type of data mining classification is the middle ground between supervised and
unsupervised classification. Here, a mixture of labelled and unlabelled datasets is used
during the training period.
Reinforcement Classification
This data mining classification type involves trial and error to determine the best way to
react to situations. It allows a software agent to adjust its behaviour based on feedback
from the environment.
Applications of Data Mining
Banking/crediting
Data mining aids financial institutions in areas such as credit-default prediction and loan delivery.
It can also support credit card issuers in detecting potentially fraudulent credit card transactions.
Law enforcement
Data mining assists law enforcement agencies in identifying criminal suspects, as well as in
catching them by investigating trends in location, habits, crime type and other behaviour patterns.
Researchers
Data mining supports researchers by increasing the pace of their data analysis
process, giving them more time to work on other projects.
Manufacturing
Data mining is applied widely to determine the range of control parameters in
the manufacturing sector. These optimal control parameters are then used to
manufacture products with the desired quality.
Government
Data mining supports government agencies by extracting and analysing records
of financial transactions. For example, it helps banks to discover patterns that
can identify money laundering or other criminal activities.
Challenges of Data mining
Security and social issues
User interface issues
Mining methodology issues
Performance issues
Data source issues
Ethical considerations
Suitability and validity
Guard against the possibility that a predisposition by investigators or data
providers might predetermine the analytic result. Employ data selection or
sampling methods and analytic approaches that are designed to ensure valid
analyses in either frequentist or Bayesian approaches.
Privacy and Confidentiality
The aims of the data mining effort
Moral imperatives
Global issues
Classification
Classification is used to classify each item in a set of data into one of predefined set of
classes or groups.
Classification is a data analysis task in which a model, or classifier, is constructed to
predict categorical labels (the class label attribute).
Classification is a data mining function that assigns items in a collection to target
categories or classes.
The goal of classification is to accurately predict the target class for each case in the
data.
For example, a classification model could be used to identify loan applicants as low,
medium, or high credit risks.
A classification task begins with a data set in which the class assignments are known.
For example, a classification model that predicts credit risk could be developed based
on observed data for many loan applicants over a period of time.
In addition to the historical credit rating, the data might track employment history,
home ownership or rental, years of residence, number and type of investments, and
so on.
Classifications are discrete and do not imply order.
Continuous, floating-point values would indicate a numerical, rather than a categorical,
target.
A predictive model with a numerical target uses a regression algorithm, not a classification
algorithm.
The simplest type of classification problem is binary classification.
In binary classification, the target attribute has only two possible values: for example, high
credit rating or low credit rating.
Multiclass targets have more than two values: for example, low, medium, high, or unknown
credit rating.
In the model build (training) process, a classification algorithm finds relationships between
the values of the predictors and the values of the target.
Different classification algorithms use different techniques for finding relationships.
These relationships are summarized in a model, which can then be applied to a different data
set in which the class assignments are unknown.
Classification has many applications in customer segmentation, business modeling,
marketing, credit analysis, and biomedical and drug response modeling.
Figure: Classification model illustration
Step 1: A classifier is built describing a predetermined set of data classes or
concepts. (This is the learning step, a form of supervised learning.)
Step 2: Here, the model is used for classification. First, the predictive accuracy
of the classifier is estimated on test data; if acceptable, the model is used to
classify new data.
The commonly used methods for data mining classification tasks can be
classified into the following groups.
1. Decision tree induction methods,
2. Rule-based methods,
3. Memory-based learning,
4. Neural networks,
5. Bayesian networks,
6. Support vector machines.
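As an illustration of the idea behind the first method in the list, here is a one-level decision tree (a "decision stump") for binary classification; the credit-risk training records are invented for the sketch:

```python
# Hypothetical training records: (years_employed, credit-risk label).
train = [(1, "high"), (2, "high"), (3, "high"),
         (6, "low"), (8, "low"), (10, "low")]

def best_split(data):
    """Try every observed value as a threshold; keep the one with
    the fewest misclassifications on the training data."""
    best = None
    for threshold, _ in data:
        # Candidate rule: value >= threshold -> "low" risk, else "high".
        errors = sum((x >= threshold) != (label == "low")
                     for x, label in data)
        if best is None or errors < best[1]:
            best = (threshold, errors)
    return best[0]

threshold = best_split(train)  # relationship learned from the training set

def classify(years_employed):
    """Apply the learned rule to a case whose class is unknown."""
    return "low" if years_employed >= threshold else "high"

print(classify(7))  # low
print(classify(2))  # high
```

A full decision tree repeats this splitting recursively on each branch; the other listed methods find the predictor-target relationships in their own ways.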
Clustering
The task of grouping data points based on their similarity with each other is called
Clustering or Cluster Analysis.
This method is defined under the branch of Unsupervised Learning, which aims at
gaining insights from unlabelled data points, that is, unlike supervised learning we
don’t have a target variable.
Clustering aims at forming groups of homogeneous data points from a
heterogeneous dataset.
It evaluates similarity based on a metric such as Euclidean distance, cosine
similarity, or Manhattan distance, and then groups the points with the highest
similarity scores together.
Example: Clustering
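A minimal sketch of the idea, using a hand-rolled 1-D k-means with k = 2 on made-up points (real clustering would use a library implementation and multi-dimensional data):

```python
import statistics

points = [1.0, 1.5, 2.0, 8.0, 8.5, 9.0]   # hypothetical unlabelled data
centroids = [points[0], points[-1]]        # naive initialisation

for _ in range(10):  # a few refinement iterations
    clusters = [[], []]
    for p in points:
        # Assign each point to the nearest centroid (distance metric).
        nearest = min(range(2), key=lambda i: abs(p - centroids[i]))
        clusters[nearest].append(p)
    # Recompute each centroid as the mean of its assigned points.
    centroids = [statistics.mean(c) for c in clusters]

print(clusters)   # [[1.0, 1.5, 2.0], [8.0, 8.5, 9.0]]
print(centroids)  # [1.5, 8.5]
```

No target variable is involved: the two groups emerge purely from the similarity (here, 1-D distance) between the points, which is exactly what distinguishes clustering from classification.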
Types of clustering
Advantages of Decision Trees
1. Ease of Understanding
Because data mining tools can represent this model visually in a very practical way, people can
understand how it works after a short explanation. Extensive knowledge of data mining or web
programming languages is not necessary.
5. Uses of Statistics
Decision trees and statistics work hand in hand to give greater reliability to the model
that is being developed. Since each result is supported by various statistical tests, the
probability of each of the options analyzed can be known exactly.