
Unit 4 Data warehousing and Data mining

Classification

Classification is the task of identifying the category or class label of a new observation. First, a set of data is used as training data: the input records and their corresponding class labels are given to the algorithm, so the training dataset includes the input data and their associated class labels. Using this training dataset, the algorithm learns a model (a classifier) that can then assign class labels to new, unseen records.

How Classification Works
The working of classification can be illustrated with a bank loan application example: given records of past applicants and their outcomes, a classifier learns to label new applications as, say, "safe" or "risky". There are two stages in a data classification system: building the classifier (model creation) and applying the classifier for classification.

1. Building the Classifier (model creation): This is the learning stage or learning process. The classification algorithm constructs the classifier in this stage. The classifier is built from a training set composed of database records and their corresponding class labels.

2. Applying the classifier for classification: The classifier is used for classification at this stage. Test data are used here to estimate the accuracy of the classifier. If the accuracy is considered acceptable, the classification rules can be applied to new data records. Typical applications include (a small code sketch of this two-stage workflow appears after the list below):

o Sentiment Analysis: Sentiment analysis is highly helpful in social media monitoring. We can use it to extract social media insights. With advanced machine learning algorithms, sentiment analysis models can even read and analyse misspelled words. Well-trained models provide consistently accurate results in a fraction of the time a manual analysis would take.

o Document Classification: We can use document classification to organize documents into sections according to their content. Document classification is a form of text classification; the words of the entire document are used to assign it to a class.
o Image Classification: Image classification assigns an image to one of a set of trained categories. These could be the caption of the image, a statistical value, or a theme. You can tag images to train your model for the relevant categories by applying supervised learning algorithms.

o Machine Learning Classification: It uses statistically demonstrable algorithmic rules to execute analytical tasks that would take humans hundreds of hours to perform.

3. Data Classification Process: The data classification process typically involves the following steps:

o Define the goals, strategy, workflows, and architecture of data classification.

o Identify and classify the confidential data that is stored.

o Apply labels (classification marks) to the data.

o Use the results to improve security and compliance.
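To make the two stages concrete, here is a minimal sketch in Python using scikit-learn (an assumed library; the notes do not prescribe one). The decision tree classifier and the iris dataset are only illustrative choices. It builds a classifier from a training set and then applies it to held-out test records to estimate accuracy, mirroring the model creation and classification stages above.

```python
# Minimal sketch of the two-stage classification process (assumes scikit-learn is installed).
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import accuracy_score

# Training data: input records (X) and their associated class labels (y).
X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

# Stage 1: build the classifier (model creation) from the training set.
clf = DecisionTreeClassifier()
clf.fit(X_train, y_train)

# Stage 2: apply the classifier to test data and estimate its accuracy.
y_pred = clf.predict(X_test)
print("Estimated accuracy:", accuracy_score(y_test, y_pred))
```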

Data Generalization
Data Generalization is the process of summarizing data by replacing relatively low-level
values with higher level concepts. It is a form of descriptive data mining.
There are two basic approaches to data generalization.
1. Data cube approach:
- It is also known as the OLAP approach.
- It is an efficient approach, as it is useful, for example, for charting past sales.
- In this approach, computations are performed and the results are stored in the data cube.
- It uses roll-up and drill-down operations on a data cube (a small roll-up sketch follows this list).
- These operations typically involve aggregate functions, such as count(), sum(), average(), and max().
- These materialized views can then be used for decision support, knowledge discovery, and many other applications.
2. Attribute-oriented induction (AOI):
- It is an online, query-oriented, generalization-based data analysis approach.
- In this approach, generalization is performed on the distinct values of each attribute within the relevant data set; identical generalized tuples are then merged and their respective counts accumulated in order to perform aggregation.
- By contrast, the data cube approach performs off-line aggregation before an OLAP or data mining query is submitted for processing.
- The attribute-oriented induction approach, at least in its initial proposal, is a relational database, query-oriented, generalization-based, on-line data analysis technique.
- It is not limited to particular measures or to categorical data.
- The attribute-oriented induction approach uses two methods:
(i) Attribute removal
(ii) Attribute generalization
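As a small illustration of the roll-up operation in the data cube approach, the sketch below aggregates a toy sales table from the (city, month) level up to the city level using pandas (an assumed tool; the column names and values are invented for illustration).

```python
# Sketch of a roll-up: aggregate detailed sales up to a coarser level (assumes pandas is installed).
import pandas as pd

# Hypothetical base (detailed) data at the (city, month) level.
sales = pd.DataFrame({
    "city":   ["Delhi", "Delhi", "Mumbai", "Mumbai"],
    "month":  ["Jan", "Feb", "Jan", "Feb"],
    "amount": [100, 150, 200, 250],
})

# Roll-up from (city, month) to city: apply aggregate functions count(), sum(), avg(), max().
rollup = sales.groupby("city")["amount"].agg(["count", "sum", "mean", "max"])
print(rollup)
```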

Analytical Characterization
The class characterization that includes the analysis of attribute/dimension relevance is called
analytical characterization.

The class comparison that includes such analysis is called analytical comparison.

Attribute Relevance Analysis


1. Data Collection:
- Collect data for both the target class and the contrasting class by query processing.

2. Preliminary relevance analysis using conservative AOI:
- This step identifies a set of dimensions and attributes on which the selected relevance measure is to be applied.
- The relation obtained by this application of attribute-oriented induction is called the candidate relation of the mining task.

3. Remove irrelevant and weakly relevant attributes using the selected relevance analysis measure:
- Evaluate each attribute in the candidate relation using the selected relevance measure.
- This step results in an initial target class working relation (ITCWR) and an initial contrasting class working relation (ICCWR).

4. Generate the concept description using AOI:
- Perform attribute-oriented induction using a less conservative set of attribute generalization thresholds.
- For class characterization, only the ITCWR is included.
- For class comparison, both the ITCWR and the ICCWR are included.

Relevance Measures
- Information Gain (ID3)
- Gain Ratio (C4.5)
- Gini Index
- χ² (chi-squared) contingency table statistics
- Uncertainty Coefficient
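As an illustration of one relevance measure, the sketch below computes information gain (the ID3 measure) for a single categorical attribute against a class label; the tiny weather-style dataset is invented purely for illustration.

```python
# Sketch: information gain of an attribute with respect to a class label.
from collections import Counter
import math

def entropy(labels):
    """Shannon entropy of a list of class labels."""
    total = len(labels)
    return -sum((c / total) * math.log2(c / total) for c in Counter(labels).values())

def information_gain(attribute_values, labels):
    """Entropy of the labels minus the expected entropy after splitting on the attribute."""
    total = len(labels)
    split_entropy = 0.0
    for value in set(attribute_values):
        subset = [lab for val, lab in zip(attribute_values, labels) if val == value]
        split_entropy += (len(subset) / total) * entropy(subset)
    return entropy(labels) - split_entropy

# Hypothetical data: an "outlook" attribute versus a "play" class label.
outlook = ["sunny", "sunny", "overcast", "rain", "rain", "overcast"]
play    = ["no",    "no",    "yes",      "yes",  "no",   "yes"]
print("Information gain:", information_gain(outlook, play))
```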

Analysis of attribute relevance, Mining Class Comparison


In data mining, there are two techniques that are commonly used for exploring data and
uncovering patterns: attribute relevance analysis and mining class comparison.
Attribute Relevance Analysis: This technique is used to determine the importance of
different attributes or variables in a dataset. Attribute relevance analysis can help in
identifying the variables that have the greatest impact on a target variable, or the variables
that are most closely related to each other. There are several methods for conducting attribute
relevance analysis.

Mining Class Comparison: This technique is used to compare different classes or groups within a dataset. For example, it can be used to compare the purchasing habits of male and female customers, or to compare the performance of different products in a market.
Mining class comparison involves identifying patterns or differences between different
classes, and can be used to identify factors that contribute to these differences. Some
common techniques used in mining class comparison include association rule mining,
clustering, and decision tree analysis.
Both attribute relevance analysis and mining class comparison are important techniques in
data mining that can help in identifying patterns and relationships within data. By
understanding these patterns and relationships, businesses can make informed decisions and
improve their operations.
Statistical Measures in Large Databases
Relational database systems support five built-in aggregate functions: count(), sum(), avg(), max(), and min(). These aggregate functions can be used as basic measures in the descriptive mining of multidimensional data. Two kinds of descriptive statistical measures, measures of central tendency and measures of data dispersion, can be used effectively in large multidimensional databases.
Measures of central tendency − Measures of central tendency include the mean, median, mode, and mid-range.
Mean − The arithmetic average is computed simply by adding together all values and dividing by the number of values; it uses every single value. Let x1, x2, ..., xN be a set of N values or observations, such as salaries. The mean of this set of values is

x̄ = (1/N) Σ_{i=1}^{N} xi = (x1 + x2 + ⋯ + xN) / N

This corresponds to the built-in aggregate function average (avg()) supported in relational database systems. In many data cubes, sum and count are saved during pre-computation, so the average is derived simply as

average = sum / count
Median − There are two cases for computing the median, depending on the number of values.
If x1, x2, ..., xn are arranged in ascending order and n is odd, the median is the ((n + 1) / 2)-th value.
For example, for 1, 4, 6, 7, 12, 14, 18:
Median = 7
When n is even, the median is the average of the two middle values:
Median = [ (n / 2)-th value + (n / 2 + 1)-th value ] / 2
For example, for 1, 4, 6, 7, 8, 12, 14, 16:
Median = (7 + 8) / 2 = 7.5
The median is neither a distributive measure nor an algebraic measure; it is a holistic measure. Although it is not easy to compute the exact median value in a huge database, an approximate median can be computed efficiently.
Mode − The mode is the most common value in a set of values. Distributions can be unimodal, bimodal, or multimodal. If the data are categorical (measured on the nominal scale), only the mode can be computed. The mode can also be computed for ordinal and higher-level data, although it is usually less informative there.
Measuring the dispersion of data − The degree to which numerical data tend to spread out is called the dispersion, or variance, of the data. The most common measures of data dispersion are the range, the interquartile range, and the standard deviation.
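A minimal sketch of these measures using Python's standard statistics module (numpy is an assumed extra dependency for the interquartile range); the sample values are invented for illustration.

```python
# Sketch: measures of central tendency and dispersion for a small sample.
import statistics
import numpy as np

salaries = [30, 36, 47, 50, 52, 52, 52, 56, 60, 63, 70, 110]  # hypothetical values

print("mean:  ", statistics.mean(salaries))
print("median:", statistics.median(salaries))
print("mode:  ", statistics.mode(salaries))

# Dispersion: range, interquartile range, standard deviation.
print("range: ", max(salaries) - min(salaries))
q1, q3 = np.percentile(salaries, [25, 75])
print("IQR:   ", q3 - q1)
print("stdev: ", statistics.stdev(salaries))
```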

Statistical Based Algorithms

1. Correlation Analysis
2. Regression Analysis
3. Bayesian Model

1. Correlation Analysis
Correlation analysis is a statistical technique for determining the strength of the relationship between two variables. It is used to detect patterns and trends in data and to forecast future occurrences.
- Consider a problem in which several factors must be weighed to reach an optimal conclusion.
- Correlation explains how these variables depend on each other.
- The sign of the correlation coefficient indicates the direction of the relationship between the variables: it can be positive, negative, or zero.
The Pearson correlation coefficient is the most commonly used measure of correlation. It expresses the linear relationship between two variables in numerical terms. The Pearson correlation coefficient, written as "r", is:

r = Σ(xi − x̄)(yi − ȳ) / √[ Σ(xi − x̄)² · Σ(yi − ȳ)² ]

where,
- r: correlation coefficient
- xi: i-th value of the first dataset X
- x̄: mean of the first dataset X
- yi: i-th value of the second dataset Y
- ȳ: mean of the second dataset Y
The correlation coefficient r ranges between -1 and 1:
r = -1 indicates a perfect negative correlation.
r = 0 indicates no linear correlation between the variables.
r = 1 indicates a perfect positive correlation.
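A small sketch computing r for two hypothetical samples, both directly from the formula above and with numpy's built-in routine as a cross-check (numpy is an assumed dependency).

```python
# Sketch: Pearson correlation coefficient computed from the formula above.
import math
import numpy as np

x = [1, 2, 3, 4, 5]          # hypothetical first dataset X
y = [2, 4, 5, 4, 5]          # hypothetical second dataset Y

x_bar, y_bar = sum(x) / len(x), sum(y) / len(y)
num = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
den = math.sqrt(sum((xi - x_bar) ** 2 for xi in x) * sum((yi - y_bar) ** 2 for yi in y))
r = num / den
print("r (by formula):", r)

# Cross-check with numpy's built-in correlation matrix.
print("r (numpy):     ", np.corrcoef(x, y)[0, 1])
```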
Types of Correlation
There are three types of correlation:

1. Positive Correlation: Positive correlation indicates that two variables have a direct
relationship. As one variable increases, the other variable also increases. For example,
there is a positive correlation between height and weight. As people get taller, they
also tend to weigh more.
2. Negative Correlation: Negative correlation indicates that two variables have an
inverse relationship. As one variable increases, the other variable decreases. For
example, there is a negative correlation between price and demand. As the price of a
product increases, the demand for that product decreases.
3. Zero Correlation: Zero correlation indicates that there is no relationship between two
variables. The changes in one variable do not affect the other variable. For example,
there is zero correlation between shoe size and intelligence.

2. Regression Analysis

Regression is a data mining technique used to predict numeric values in a given data set. For example, regression might be used to predict the cost of a product or service, or other continuous variables. It is also used in various industries to analyse business and marketing behaviour, trends, and financial forecasts.
Types of Regression

Regression is divided into five different types


1. Linear Regression
2. Logistic Regression
3. Lasso Regression
4. Ridge Regression
5. Polynomial Regression
Linear Regression
Linear regression is the type of regression that models the relationship between the target variable and one or more independent variables using a straight line. The equation of linear regression is:
Y = a + b*X + e
where
a represents the intercept,
b represents the slope (the regression coefficient), and
e represents the error term.
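A minimal sketch fitting the line Y = a + b*X to a few hypothetical points by least squares with numpy (an assumed approach; the notes do not name a fitting method).

```python
# Sketch: fit Y = a + b*X to sample data by least squares.
import numpy as np

X = np.array([1, 2, 3, 4, 5], dtype=float)       # hypothetical independent variable
Y = np.array([2.1, 4.3, 6.2, 8.1, 10.4])         # hypothetical target variable

b, a = np.polyfit(X, Y, deg=1)                    # degree-1 fit returns [slope, intercept]
print(f"intercept a = {a:.3f}, slope b = {b:.3f}")

# Predict the target value for a new observation.
x_new = 6.0
print("predicted Y:", a + b * x_new)
```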

Distance-Based Algorithms
Distance-based algorithms are a class of machine learning algorithms that operate by
measuring the distance between data points in a feature space. These algorithms are
commonly used for clustering, classification, and anomaly detection tasks. Some notable
distance-based algorithms include:
1. K-nearest Neighbours (KNN):
- A simple yet effective algorithm for classification and regression tasks. It labels a data point based on its k nearest neighbours: the majority class of the neighbours for classification, or their average output value for regression (see the distance sketch after this list).
2. K-means Clustering:
- A popular clustering algorithm that partitions a dataset into k clusters by iteratively assigning data points to the nearest cluster centroid and updating centroids based on the mean of the data points in each cluster. It aims to minimize the within-cluster sum of squares.
3. Hierarchical Clustering:
- A clustering algorithm that builds a hierarchy of clusters by recursively merging or splitting clusters based on proximity measures such as Euclidean distance or linkage criteria. It produces a dendrogram representing the clustering structure of the data.
4. DBSCAN (Density-Based Spatial Clustering of Applications with Noise):
- A density-based clustering algorithm that identifies clusters as regions of high density separated by regions of low density. It does not require specifying the number of clusters in advance and is robust to noise and outliers.
5. OPTICS (Ordering Points To Identify the Clustering Structure):
- A variation of DBSCAN that produces a reachability plot representing the clustering structure of the data. It provides more flexibility in identifying clusters of varying densities and shapes.
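A compact sketch of the distance-based idea behind KNN: compute Euclidean distances from a query point to all training points and take a majority vote among the k closest. The tiny two-dimensional dataset is invented for illustration.

```python
# Sketch: k-nearest neighbours classification using Euclidean distance.
import math
from collections import Counter

def knn_predict(train_points, train_labels, query, k=3):
    """Return the majority class among the k training points closest to the query."""
    distances = sorted(
        (math.dist(p, query), label) for p, label in zip(train_points, train_labels)
    )
    k_labels = [label for _, label in distances[:k]]
    return Counter(k_labels).most_common(1)[0][0]

# Hypothetical 2-D training data with two classes.
points = [(1, 1), (1, 2), (2, 1), (6, 6), (7, 7), (6, 7)]
labels = ["A", "A", "A", "B", "B", "B"]
print(knn_predict(points, labels, query=(2, 2), k=3))   # expected: "A"
```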
Decision Tree-Based Algorithms
Decision tree-based algorithms are supervised learning algorithms that recursively partition
the feature space into subsets based on attribute values, aiming to maximize predictive
accuracy or minimize impurity. These algorithms are commonly used for classification and
regression tasks. Some prominent decision tree-based algorithms include:
1. CART (Classification and Regression Trees):
- A versatile algorithm that builds binary decision trees by recursively splitting the feature space based on attribute-value tests. It can be used for both classification and regression tasks and supports various splitting criteria such as Gini impurity and mean squared error.
2. Random Forest:
- An ensemble learning method that constructs multiple decision trees and combines their predictions through voting or averaging. It improves prediction accuracy and generalization by reducing overfitting and capturing diverse patterns in the data.
3. Gradient Boosting Machines (GBM):
- A boosting algorithm that builds an ensemble of weak learners, typically decision trees, in a sequential manner. It minimizes a loss function by fitting each new tree to the residual errors of the previous trees, leading to improved predictive performance.
4. XGBoost (Extreme Gradient Boosting):
- An optimized implementation of gradient boosting that uses a more efficient tree construction algorithm and regularization techniques to improve speed and accuracy. It is widely used in competitions and real-world applications for its performance and scalability.
Distance-based algorithms measure the similarity or dissimilarity between data points in a feature space, while decision tree-based algorithms recursively partition the feature space based on attribute values. Both types of algorithms are widely used in machine learning and data mining, offering different trade-offs between interpretability, accuracy, and scalability.
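To illustrate the Gini impurity splitting criterion mentioned under CART, the sketch below computes the impurity of a set of class labels and the weighted impurity of one candidate binary split (the labels are invented for illustration).

```python
# Sketch: Gini impurity, the splitting criterion used by CART for classification.
from collections import Counter

def gini(labels):
    """Gini impurity: 1 minus the sum of squared class proportions."""
    total = len(labels)
    return 1.0 - sum((c / total) ** 2 for c in Counter(labels).values())

def split_gini(left, right):
    """Weighted Gini impurity of a binary split."""
    n = len(left) + len(right)
    return (len(left) / n) * gini(left) + (len(right) / n) * gini(right)

parent = ["yes", "yes", "yes", "no", "no", "no"]          # hypothetical class labels
left, right = ["yes", "yes", "yes", "no"], ["no", "no"]    # one candidate split

print("parent Gini:", gini(parent))                        # 0.5
print("split Gini: ", split_gini(left, right))             # lower is better
```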

Model Based Method


Statistical Approach
Statistical approaches are model-based: a model is built for the data, and objects are evaluated with respect to how well they fit the model. Most statistical approaches to outlier detection depend on building a probability distribution model.
Identifying the specific distribution of a data set − While many data sets can be described by a small number of common distributions, such as the Gaussian, Poisson, or binomial, data sets with non-standard distributions are relatively common.
The number of attributes used − Some statistical outlier detection techniques apply to a single attribute, while others have been developed for multivariate data.
Mixtures of distributions − The data can be modelled as a mixture of distributions, and outlier detection schemes can be built on such models. Although potentially more powerful, such models are more complex, both to learn and to use; for example, the component distributions must be identified before objects can be labelled as outliers.
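A minimal sketch of a distribution-based outlier check: assume the data follow a single Gaussian distribution and flag values whose z-score exceeds a threshold. The sample values and the threshold of 2 are assumptions chosen purely for illustration.

```python
# Sketch: outlier detection under an assumed Gaussian model using z-scores.
import statistics

values = [10.1, 9.8, 10.3, 10.0, 9.9, 10.2, 25.0]   # hypothetical data; 25.0 looks suspicious

mean = statistics.mean(values)
stdev = statistics.stdev(values)

# Flag points that lie more than 2 standard deviations from the mean.
outliers = [v for v in values if abs(v - mean) / stdev > 2]
print("outliers:", outliers)
```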

Association Rules: Introduction


Association rule mining finds interesting associations and relationships among large sets of data items. An association rule shows how frequently an itemset occurs in a set of transactions. A typical example is Market Basket Analysis, one of the key techniques used by large retailers to show associations between items. It allows retailers to identify relationships between items that people frequently buy together. Given a set of transactions, we can find rules that will predict the occurrence of an item based on the occurrences of other items in the transaction.

TID Items

1 Bread, Milk

2 Bread, Diaper, Beer, Eggs

3 Milk, Diaper, Beer, Coke

4 Bread, Milk, Diaper, Beer

5 Bread, Milk, Diaper, Coke

Before we start defining the rule, let us first review the basic definitions.
Support Count (σ) – the frequency of occurrence of an itemset.
Here σ({Milk, Bread, Diaper}) = 2
Frequent Itemset – an itemset whose support is greater than or equal to the minsup threshold.
Association Rule – an implication expression of the form X -> Y, where X and Y are any two itemsets.
Example: {Milk, Diaper} -> {Beer}
Rule Evaluation Metrics –
- Support(s) – The number of transactions that include all items in {X} and {Y} as a fraction of the total number of transactions. It measures how frequently the collection of items occurs together across all transactions.
Support(X -> Y) = σ(X ∪ Y) / |T| – interpreted as the fraction of transactions that contain both X and Y.
- Confidence(c) – The ratio of the number of transactions that include all items in both {X} and {Y} to the number of transactions that include all items in {X}.
Conf(X -> Y) = Supp(X ∪ Y) / Supp(X) – it measures how often the items in Y appear in transactions that also contain the items in X.
- Lift(l) – The lift of the rule X -> Y is the confidence of the rule divided by the expected confidence, assuming that the itemsets X and Y are independent of each other. The expected confidence is simply the support (frequency) of {Y}.
Lift(X -> Y) = Conf(X -> Y) / Supp(Y) – A lift value near 1 indicates that X and Y appear together about as often as expected, greater than 1 means they appear together more often than expected, and less than 1 means they appear together less often than expected. Greater lift values indicate a stronger association.
Example – From the above table, consider the rule {Milk, Diaper} -> {Beer}:

s = σ({Milk, Diaper, Beer}) / |T|
  = 2/5
  = 0.4

c = σ({Milk, Diaper, Beer}) / σ({Milk, Diaper})
  = 2/3
  = 0.67

l = Supp({Milk, Diaper, Beer}) / ( Supp({Milk, Diaper}) * Supp({Beer}) )
  = 0.4 / (0.6 * 0.6)
  = 1.11
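The same numbers can be reproduced with a short script over the transaction table above (plain Python; set containment is used to test whether a transaction includes an itemset).

```python
# Sketch: support, confidence, and lift for {Milk, Diaper} -> {Beer} over the table above.
transactions = [
    {"Bread", "Milk"},
    {"Bread", "Diaper", "Beer", "Eggs"},
    {"Milk", "Diaper", "Beer", "Coke"},
    {"Bread", "Milk", "Diaper", "Beer"},
    {"Bread", "Milk", "Diaper", "Coke"},
]

def support(itemset):
    """Fraction of transactions containing every item in the itemset."""
    return sum(itemset <= t for t in transactions) / len(transactions)

X, Y = {"Milk", "Diaper"}, {"Beer"}
s = support(X | Y)                      # 0.4
c = support(X | Y) / support(X)         # 0.67
l = c / support(Y)                      # 1.11
print(f"support={s:.2f} confidence={c:.2f} lift={l:.2f}")
```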
Association rules are very useful in analyzing datasets. The data are typically collected using bar-code scanners in supermarkets. Such databases consist of a large number of transaction records, each listing all items bought by a customer in a single purchase. From such rules a manager can learn whether certain groups of items are consistently purchased together and use this information for adjusting store layouts, cross-selling, and promotions.

Parallel and Distributed Algorithm


Parallel and distributed data mining: The enormity and high dimensionality of the datasets typically available as input to association rule discovery make it an ideal problem to solve with multiple processors in parallel. The primary reasons are the memory and CPU speed limitations faced by single processors, so it is critical to design efficient parallel algorithms for the task. Another argument for parallel algorithms is that many transaction databases are already stored in parallel databases, or are distributed across multiple sites to begin with, and the cost of bringing them all to one site or one computer for serial discovery of association rules can be prohibitively expensive. For compute-intensive applications, parallelisation is an obvious means of improving performance and achieving scalability. A variety of techniques may be used to distribute the workload involved in data mining over multiple processors.

Neural Network approach


A neural network is an information processing paradigm inspired by the human nervous system. Just as the human nervous system has biological neurons, neural networks have artificial neurons: mathematical functions modelled on biological neurons. The human brain is estimated to have around 10 billion neurons, each connected on average to 10,000 other neurons. Each neuron receives signals through synapses that control the effect of the signal on the neuron.
Neural Network Method in Data Mining
The neural network method is used for classification, clustering, feature mining, prediction, and pattern recognition. The Hebbian learning rule is one of the earliest and simplest learning rules for neural networks. Neural network models can be broadly divided into the following types:
- Feed-Forward Neural Networks: In a feed-forward network, the output values cannot be traced back to the inputs, an output is calculated for every input, information flows forward, and there is no feedback between the layers. In simple words, the information moves in only one direction (forward), from the input nodes, through the hidden nodes (if any), to the output nodes. Such a network is known as a feedforward network (a tiny forward-pass sketch follows this list).

- Feedback Neural Networks: Signals can travel in both directions in a feedback network. Feedback neural networks can be very powerful and may become very complex; they are dynamic. Feedback neural network architectures are also known as interactive or recurrent networks. Feedback loops are allowed in such networks, and they are used for content-addressable memory.
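A minimal sketch of a single forward pass through a tiny feed-forward network with one hidden layer, using numpy; the layer sizes and random weights are purely illustrative.

```python
# Sketch: one forward pass through a tiny feed-forward neural network.
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical layer sizes: 3 inputs -> 4 hidden units -> 2 outputs.
W1, b1 = rng.normal(size=(3, 4)), np.zeros(4)
W2, b2 = rng.normal(size=(4, 2)), np.zeros(2)

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

x = np.array([0.5, -1.2, 3.0])           # one input record
hidden = sigmoid(x @ W1 + b1)             # information flows forward through the hidden layer
output = sigmoid(hidden @ W2 + b2)        # ... and on to the output layer; no feedback
print("output:", output)
```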
