Unit 4 Data Warehousing and Data Mining
Classification
Classification is the task of identifying the category, or class label, of a new observation. First, a set of data is
used as training data: the input records and their corresponding class labels are given to the
algorithm. Using this training dataset, the algorithm learns a model (a classifier) that can assign
class labels to new, unseen records.
How Classification Works
The working of classification can be illustrated with a bank loan application: the classifier learns
from past applications to label a new applicant as, say, safe or risky. There are two stages in the
data classification process: building the classifier (model creation) and applying the classifier for
classification.
1. Developing the classifier (model creation): This stage is the learning step, in which a
classification algorithm constructs the classifier. The classifier is built from a training
set composed of database records and their corresponding class labels.
2. Applying the classifier for classification: At this stage the classifier is used to classify
data. Test data are used here to estimate the accuracy of the classifier. If the
accuracy is deemed acceptable, the classifier can be applied to new, unlabeled data
records (a minimal sketch of both stages appears after this list).
3. Data classification process: The overall data classification process can be categorized into five steps, beginning with:
o Defining the goals, strategy, workflows, and architecture of data
classification.
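To make the two stages concrete, here is a minimal sketch in Python using scikit-learn. The loan-style features, labels, and the new application at the end are invented purely for illustration; any classification algorithm could stand in for the decision tree.

```python
# Minimal sketch of the two-stage classification process (illustrative data).
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import accuracy_score

# Hypothetical records: [income (thousands), years_employed, existing_debt]
X = [[45, 2, 10], [80, 8, 5], [30, 1, 20], [95, 10, 2],
     [50, 4, 15], [70, 6, 8], [25, 1, 25], [85, 9, 3]]
y = ["reject", "approve", "reject", "approve",
     "reject", "approve", "reject", "approve"]  # class labels

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=0)

# Stage 1: learning -- build the classifier from the training set
model = DecisionTreeClassifier().fit(X_train, y_train)

# Stage 2: classification -- estimate accuracy on held-out test data
print("accuracy:", accuracy_score(y_test, model.predict(X_test)))

# If accuracy is acceptable, classify a new, unlabeled application
print(model.predict([[60, 5, 12]]))
```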
Data Generalization
Data Generalization is the process of summarizing data by replacing relatively low-level
values with higher level concepts. It is a form of descriptive data mining.
There are two basic approaches to data generalization:
1. Data cube approach:
It is also known as the OLAP approach.
It is an efficient approach, since results are precomputed and can be reused, for
example to chart past sales.
In this approach, computations are carried out and the results are stored in the data cube.
It uses roll-up and drill-down operations on the data cube.
These operations typically involve aggregate functions, such as count(), sum(),
average(), and max().
These materialized views can then be used for decision support, knowledge discovery,
and many other applications.
2. Attribute-oriented induction (AOI):
It is an online, query-oriented, generalization-based data analysis approach.
In this approach, generalization is performed on the basis of the distinct values of each
attribute within the relevant data set; identical generalized tuples are then merged
and their respective counts accumulated in order to perform aggregation.
In contrast, the data cube approach performs off-line aggregation before an OLAP or
data mining query is submitted for processing; the attribute-oriented induction
approach, at least in its initial proposal, is a relational database query-oriented,
generalization-based, on-line data analysis technique.
It is not limited to particular measures, nor to categorical data.
The attribute-oriented induction approach uses two methods (a small sketch follows below):
(i). Attribute removal
(ii). Attribute generalization.
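As a rough illustration of attribute-oriented induction, the following Python sketch (using pandas) generalizes a toy relation. The city-to-country mapping and the age bands play the role of concept hierarchies and are assumed purely for illustration.

```python
# Sketch of AOI-style generalization on a toy relation (illustrative data).
import pandas as pd

df = pd.DataFrame({
    "city": ["Vancouver", "Toronto", "Seattle", "Chicago", "Montreal"],
    "age":  [21, 23, 35, 41, 22],
})

city_to_country = {"Vancouver": "Canada", "Toronto": "Canada",
                   "Montreal": "Canada", "Seattle": "USA", "Chicago": "USA"}

# Attribute generalization: climb each attribute's concept hierarchy
df["country"] = df["city"].map(city_to_country)
df["age_band"] = pd.cut(df["age"], bins=[0, 30, 60], labels=["young", "middle_aged"])

# Attribute removal: drop the low-level attributes, then merge identical
# generalized tuples and accumulate their counts
generalized = (df.drop(columns=["city", "age"])
                 .groupby(["country", "age_band"], observed=True)
                 .size().reset_index(name="count"))
print(generalized)
```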
Analytical Characterization
The class characterization that includes the analysis of attribute/dimension relevance is called
analytical characterization.
The class comparison that includes such analysis is called analytical comparison.
The procedure starts with data collection (step 1), which gathers the task-relevant data for the
target class and, where needed, a contrasting class, and a preliminary relevance analysis
(step 2), which produces a candidate relation. The remaining steps are:
3. Remove irrelevant and weakly relevant attributes using the selected relevance
analysis measure:
Each attribute in the candidate relation is evaluated using the selected relevance
analysis measure.
This step results in an initial target class working relation (ITCWR) and an initial
contrasting class working relation (ICCWR).
4. Generate the concept description using AOI:
Attribute-oriented induction is performed using a less conservative set of
attribute generalization thresholds.
For class characterization, only the ITCWR is included; for class comparison,
both the ITCWR and the ICCWR are included.
Relevance Measures
Information Gain (ID3)
Gain Ratio (C4.5)
Gini Index
χ² (chi-square) contingency table statistic
Uncertainty Coefficient
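To make one of these measures concrete, here is a minimal sketch of information gain (the measure used by ID3) in Python; the tiny weather-style dataset is invented for illustration.

```python
# Minimal information-gain computation (the ID3 relevance measure).
from collections import Counter
from math import log2

def entropy(labels):
    n = len(labels)
    return -sum((c / n) * log2(c / n) for c in Counter(labels).values())

def info_gain(rows, labels, attr_index):
    """Expected reduction in entropy from partitioning on one attribute."""
    n = len(labels)
    partitions = {}
    for row, label in zip(rows, labels):
        partitions.setdefault(row[attr_index], []).append(label)
    remainder = sum(len(p) / n * entropy(p) for p in partitions.values())
    return entropy(labels) - remainder

rows = [("sunny", "hot"), ("sunny", "mild"), ("rain", "mild"), ("rain", "hot")]
labels = ["no", "no", "yes", "yes"]
print(info_gain(rows, labels, 0))  # outlook: 1.0, perfectly separates the classes
print(info_gain(rows, labels, 1))  # temperature: 0.0, irrelevant here
```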
Correlation Analysis
Correlation analysis is a statistical technique for determining the strength of the relationship
between two variables. It is used to detect patterns and trends in data and to forecast future
occurrences.
Consider a problem in which several factors must be weighed to reach an optimal
conclusion: correlation explains how these variables depend on one another.
The sign of the correlation coefficient indicates the direction of the relationship
between the variables: it can be positive, negative, or zero.
The Pearson correlation coefficient is the most often used metric of correlation. It expresses
the linear relationship between two variables in numerical terms. The Pearson correlation
coefficient, written as "r," is as follows:
r = \frac{\sum_i (x_i - \bar{x})(y_i - \bar{y})}{\sqrt{\sum_i (x_i - \bar{x})^2 \; \sum_i (y_i - \bar{y})^2}}
where,
r : correlation coefficient
x_i : i-th value of the first dataset X
\bar{x} : mean of the first dataset X
y_i : i-th value of the second dataset Y
\bar{y} : mean of the second dataset Y
The correlation coefficient, denoted by "r", ranges between -1 and 1.
r = -1 indicates a perfect negative correlation.
r = 0 indicates no linear correlation between the variables.
r = 1 indicates a perfect positive correlation.
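As a quick check of the formula above, the following Python sketch computes r directly from the definition; the two small datasets are invented for illustration.

```python
# Pearson correlation coefficient computed from the definition (toy data).
from math import sqrt

x = [1, 2, 3, 4, 5]
y = [2, 4, 5, 4, 5]

n = len(x)
mx, my = sum(x) / n, sum(y) / n
num = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y))
den = sqrt(sum((xi - mx) ** 2 for xi in x) * sum((yi - my) ** 2 for yi in y))
print(num / den)  # about 0.775: a fairly strong positive correlation
```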
Types of Correlation
There are three types of correlation:
1. Positive Correlation: Positive correlation indicates that two variables have a direct
relationship. As one variable increases, the other variable also increases. For example,
there is a positive correlation between height and weight. As people get taller, they
also tend to weigh more.
2. Negative Correlation: Negative correlation indicates that two variables have an
inverse relationship. As one variable increases, the other variable decreases. For
example, there is a negative correlation between price and demand. As the price of a
product increases, the demand for that product decreases.
3. Zero Correlation: Zero correlation indicates that there is no relationship between two
variables. The changes in one variable do not affect the other variable. For example,
there is zero correlation between shoe size and intelligence.
Regression Analysis
Regression is a data mining technique used to predict numeric values in a given data set. For
example, regression might be used to predict the cost of a product or service, or other
continuous variables. It is also used across industries for business and marketing planning,
trend analysis, and financial forecasting.
Types of Regression
Common types include linear regression, multiple linear regression, polynomial regression,
and logistic regression.
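As a minimal illustration of simple linear regression, the sketch below fits a line by least squares; the advertising-spend and sales figures are invented for illustration.

```python
# Simple linear regression by least squares (illustrative data).
x = [10, 20, 30, 40, 50]   # e.g., advertising spend
y = [25, 45, 62, 85, 105]  # e.g., units sold

n = len(x)
mx, my = sum(x) / n, sum(y) / n
slope = (sum((xi - mx) * (yi - my) for xi, yi in zip(x, y))
         / sum((xi - mx) ** 2 for xi in x))
intercept = my - slope * mx

print(f"y = {slope:.2f}x + {intercept:.2f}")        # y = 2.00x + 4.40
print("predicted y at x=60:", slope * 60 + intercept)  # 124.4
```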
Distance-Based Algorithms
Distance-based algorithms are a class of machine learning algorithms that operate by
measuring the distance between data points in a feature space. These algorithms are
commonly used for clustering, classification, and anomaly detection tasks. Some notable
distance-based algorithms include the following (a minimal KNN sketch appears after the list):
1. K-nearest Neighbours (KNN):
A simple yet effective algorithm for classification and regression tasks. It
classifies a data point by a majority vote of its k nearest neighbours, where the
class label or output value is determined based on the majority class or
average of the neighbours.
2. K-means Clustering:
A popular clustering algorithm that partitions a dataset into k clusters by
iteratively assigning data points to the nearest cluster centroid and updating
centroids based on the mean of data points in each cluster. It aims to minimize
the within-cluster sum of squares.
3. Hierarchical Clustering:
A clustering algorithm that builds a hierarchy of clusters by recursively
merging or splitting clusters based on proximity measures such as Euclidean
distance or linkage criteria. It produces a dendrogram representing the
clustering structure of the data.
4. DBSCAN (Density-Based Spatial Clustering of Applications with Noise):
A density-based clustering algorithm that identifies clusters based on regions
of high density separated by regions of low density. It does not require
specifying the number of clusters in advance and is robust to noise and
outliers.
5. OPTICS (Ordering Points To Identify the Clustering Structure):
A variation of DBSCAN that produces a reachability plot representing the
clustering structure of the data. It provides more flexibility in identifying
clusters of varying densities and shapes.
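Here is the KNN sketch referenced above: a minimal Python implementation with k = 3 and Euclidean distance. The 2-D points and class labels are invented for illustration.

```python
# Minimal k-nearest-neighbours classifier (k = 3, Euclidean distance).
from collections import Counter
from math import dist

train = [((1.0, 1.0), "A"), ((1.5, 2.0), "A"), ((2.0, 1.5), "A"),
         ((6.0, 6.0), "B"), ((6.5, 5.5), "B"), ((7.0, 6.5), "B")]

def knn_classify(query, train, k=3):
    # Sort training points by distance to the query, take the k closest,
    # and return the majority class among them.
    neighbours = sorted(train, key=lambda item: dist(item[0], query))[:k]
    votes = Counter(label for _, label in neighbours)
    return votes.most_common(1)[0][0]

print(knn_classify((2.0, 2.0), train))  # "A" -- near the first cluster
print(knn_classify((6.2, 6.0), train))  # "B" -- near the second cluster
```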
Decision Tree-Based Algorithms
Decision tree-based algorithms are supervised learning algorithms that recursively partition
the feature space into subsets based on attribute values, aiming to maximize predictive
accuracy or minimize impurity. These algorithms are commonly used for classification and
regression tasks. Some prominent decision tree-based algorithms include the following (a
short sketch comparing two of them appears after the list):
1. CART (Classification and Regression Trees):
A versatile algorithm that builds binary decision trees by recursively splitting
the feature space based on attribute-value tests. It can be used for both
classification and regression tasks and supports various splitting criteria such
as Gini impurity and mean squared error.
2. Random Forest:
An ensemble learning method that constructs multiple decision trees and
combines their predictions through voting or averaging. It improves prediction
accuracy and generalization by reducing overfitting and capturing diverse
patterns in the data.
3. Gradient Boosting Machines (GBM):
A boosting algorithm that builds an ensemble of weak learners, typically
decision trees, in a sequential manner. It minimizes a loss function by fitting
each new tree to the residual errors of the previous trees, leading to improved
predictive performance.
4. XGBoost (Extreme Gradient Boosting):
An optimized implementation of gradient boosting that uses a more efficient
tree construction algorithm and regularization techniques to improve speed
and accuracy. It is widely used in competitions and real-world applications for
its performance and scalability.
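Here is the sketch referenced above, contrasting a single CART-style tree with a random forest, using scikit-learn on its bundled iris dataset; the hyperparameters are arbitrary illustrative choices.

```python
# Single CART-style tree vs. a random forest ensemble (scikit-learn).
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# CART-style tree: binary splits chosen by Gini impurity
tree = DecisionTreeClassifier(criterion="gini", random_state=0).fit(X_train, y_train)
# Random forest: many trees, predictions combined by voting
forest = RandomForestClassifier(n_estimators=100, random_state=0).fit(X_train, y_train)

print("single tree  :", tree.score(X_test, y_test))
print("random forest:", forest.score(X_test, y_test))
```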
Distance-based algorithms measure the similarity or dissimilarity between data points in a
feature space, while decision tree-based algorithms recursively partition the feature space
based on attribute values. Both types of algorithms are widely used in machine learning and
data mining for various tasks, offering different strengths and trade-offs.
Association Rule Mining
Consider the following transaction table:
TID  Items
1    Bread, Milk
2    Bread, Diaper, Beer, Eggs
3    Milk, Diaper, Beer, Coke
4    Bread, Milk, Diaper, Beer
5    Bread, Milk, Diaper, Coke
Before we start defining the rule, let us first see the basic definitions.
Support count (σ) – the frequency of occurrence of an itemset.
Here σ({Milk, Bread, Diaper}) = 2, since two of the five transactions contain all three items.
Frequent itemset – an itemset whose support is greater than or equal to a minsup
threshold. Association rule – an implication expression of the form X -> Y, where X and Y
are any two disjoint itemsets.
Example: {Milk, Diaper} -> {Beer}
Rule Evaluation Metrics –
Support (s) – the number of transactions that include all items in both X and Y, as a
fraction of the total number of transactions. It measures how frequently the items
occur together across all transactions:
s = σ(X ∪ Y) / |T|, where |T| is the total number of transactions.
Confidence (c) – the ratio of the number of transactions that include all items in both
X and Y to the number of transactions that include all items in X:
c = σ(X ∪ Y) / σ(X)
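Assuming the five-transaction table above, the following Python sketch computes support and confidence for the rule {Milk, Diaper} -> {Beer} directly from the definitions.

```python
# Support and confidence for {Milk, Diaper} -> {Beer} over the table above.
transactions = [
    {"Bread", "Milk"},
    {"Bread", "Diaper", "Beer", "Eggs"},
    {"Milk", "Diaper", "Beer", "Coke"},
    {"Bread", "Milk", "Diaper", "Beer"},
    {"Bread", "Milk", "Diaper", "Coke"},
]

X = {"Milk", "Diaper"}
Y = {"Beer"}

count_xy = sum(1 for t in transactions if (X | Y) <= t)  # transactions with X and Y
count_x = sum(1 for t in transactions if X <= t)         # transactions with X

print("support   :", count_xy / len(transactions))  # 2/5 = 0.4
print("confidence:", count_xy / count_x)            # 2/3 ≈ 0.67
```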