PPTCS 1661
DATA MINING AND DATA WAREHOUSING
ARYA S V
Lecturer in Computer Science
School of Distance Education
University of Kerala
OVERVIEW
• Introduction
• Data Mining
• Data pre-processing
• Data Warehousing
• Data Cube
• OLAP
• Market Basket Analysis
• Association Rule
• Apriori Algorithm
• Classification vs Prediction
• Decision Tree
• Bayesian Classifier
• Lazy Classifier
• K-Nearest Neighbor method
• Rule based Classification
• Cluster Analysis
• Partition Methods
• K-means and K-medoids
• Outlier Detection
Introduction to Data Mining
• Data is a raw or disconnected fact.
• Information is processed data.
• Knowledge is derived from information by applying rules to it.
• The data processing cycle consists of the following stages:
  • Collection
  • Preparation
  • Input
  • Processing
  • Output
  • Storage
Major tasks in Data Preprocessing
• Data cleaning
  Clean the data by filling in missing values, smoothing noisy data, identifying or removing outliers, and resolving inconsistencies.
• Data integration
  Integrate multiple databases, data cubes, or files.
• Data reduction
  Reduce the size of the data and make it suitable and feasible for analysis.
• Data transformation
  Convert data from one format or structure into another format or structure (see the sketch after this list).
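As a rough illustration of some of these tasks, here is a minimal pandas sketch (not part of the original slides); the column names, example values, and specific cleaning rules are hypothetical choices.

```python
# Minimal data pre-processing sketch using pandas.
# The columns ("age", "income", "city") and values are hypothetical examples.
import pandas as pd

df = pd.DataFrame({
    "age":    [25, None, 47, 35, 200],        # a missing value and an outlier
    "income": [30000, 42000, None, 58000, 61000],
    "city":   ["Kochi", "kochi", "Trivandrum", "Kollam", "Kochi"],
})

# Data cleaning: fill in missing values and resolve inconsistent spellings.
df["age"] = df["age"].fillna(df["age"].median())
df["income"] = df["income"].fillna(df["income"].mean())
df["city"] = df["city"].str.title()            # "kochi" -> "Kochi"

# Data cleaning: drop an obvious outlier (age outside a plausible range).
df = df[df["age"].between(0, 120)]

# Data transformation: min-max normalisation of income to the range [0, 1].
df["income_scaled"] = (df["income"] - df["income"].min()) / (
    df["income"].max() - df["income"].min()
)

print(df)
```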
Data Warehouse: OLAP vs. OLTP
• Users and system orientation
  An OLAP system is market-oriented and is used for data analysis by knowledge workers, including managers, executives, and analysts. An OLTP system is customer-oriented and is used for transaction and query processing by clerks, clients, and information technology professionals.
• Data contents
  An OLAP system manages large amounts of historic data. An OLTP system manages current data.
• Database design
  An OLAP system typically adopts either a star or a snowflake model and a subject-oriented database design. An OLTP system usually adopts an entity-relationship (ER) data model and an application-oriented database design.
• View
  An OLAP system often spans multiple versions of a database schema, due to the evolutionary process of an organization. An OLTP system focuses mainly on the current data within an enterprise or department, without referring to historic data or data in different organizations.
• Access patterns
  Accesses to OLAP systems are mostly read-only operations. The access patterns of an OLTP system consist mainly of short, atomic transactions.
Data Warehouse Architecture
Data Cube
• Data warehouses and OLAP tools are based on a
multidimensional data model. This model views data in
the form of a data cube.
• A data cube allows data to be modeled and viewed in
multiple dimensions.
• It is defined by dimensions and facts.
• In general terms, dimensions are the perspectives or
entities with respect to which an organization wants to keep
records. Each dimension may have a table associated with
it, called a dimension table.
• Facts are numeric measures. The fact table contains the
names of the facts, or measures, as well as keys to each of
the related dimension tables.
(Figure: a 3-D data cube representation of the data according to time, item, and location. The measure displayed is dollars_sold, in thousands.)
(Figure: a 4-D data cube representation of sales data, according to time, item, location, and supplier. The measure displayed is dollars_sold, in thousands; only some of the cube values are shown.)
(Figure: the lattice of cuboids making up a 4-D data cube for time, item, location, and supplier: the 0-D (apex) cuboid; the 1-D cuboids (time, item, location, supplier); up through the 3-D cuboids such as (time, item, location), (time, item, supplier), and (item, location, supplier); and the 4-D base cuboid.)
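As a small, concrete illustration (not part of the original slides), the pandas sketch below builds a few cuboids of such a cube from hypothetical sales records, with dimensions time, item, and location and the measure dollars_sold.

```python
# Minimal data-cube sketch: dimensions (time, item, location) and the
# measure dollars_sold.  The records below are hypothetical.
import pandas as pd

sales = pd.DataFrame({
    "time":         ["Q1", "Q1", "Q2", "Q2", "Q1"],
    "item":         ["phone", "tv", "phone", "tv", "phone"],
    "location":     ["Chicago", "Chicago", "New York", "New York", "Toronto"],
    "dollars_sold": [605, 825, 968, 38, 746],
})

# The 3-D (base) cuboid: aggregate the measure over all three dimensions.
base_cuboid = sales.groupby(["time", "item", "location"])["dollars_sold"].sum()

# A 2-D cuboid (time, item): roll up over location.
cuboid_time_item = sales.groupby(["time", "item"])["dollars_sold"].sum()

# The 0-D (apex) cuboid: the grand total of dollars_sold.
apex = sales["dollars_sold"].sum()

print(base_cuboid, cuboid_time_item, apex, sep="\n\n")
```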
Requirements for Cluster Analysis
• Scalability
• Ability to deal with different types of attributes
• Discovery of clusters with arbitrary shape
• Requirements for domain knowledge to determine input
parameters
• Ability to deal with noisy data
• Incremental clustering and insensitivity to input order
• Capability of clustering high-dimensionality data
• Constraint-based clustering
• Interpretability and usability
Partitioning Methods
• The simplest and most fundamental version of cluster
analysis is partitioning, which organizes the objects of a set
into several exclusive groups or clusters.
• The general criterion of a good partitioning is that objects in
the same cluster are “close” or related to each other,
whereas objects in different clusters are “far apart” or very
different.
• General Characteristics of Partitioning methods
Find mutually exclusive clusters of spherical shape
Distance-based
May use mean or medoid (etc.) to represent cluster center
Effective for small- to medium-size data sets
K-Means: A Centroid-Based Technique
Input:
• k: the number of clusters
• D: a data set containing n objects
Output: A set of k clusters.
Method:
(1) arbitrarily choose k objects from D as the initial cluster centers;
(2) repeat
(3) (re)assign each object to the cluster to which the object is the most similar, based on the mean value of the objects in the cluster;
(4) update the cluster means, that is, calculate the mean value of the objects for each cluster;
(5) until no change;
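A minimal NumPy sketch of this method follows (not part of the original slides); the function name, the random initialisation with a fixed seed, the Euclidean distance, and the small example data set are illustrative assumptions.

```python
# Minimal k-means sketch following steps (1)-(5) above.
# Assumes no cluster becomes empty during the iterations.
import numpy as np

def k_means(D, k, seed=0):
    """Partition the rows of D (an n x d array) into k clusters."""
    rng = np.random.default_rng(seed)
    # (1) arbitrarily choose k objects from D as the initial cluster centers
    centers = D[rng.choice(len(D), size=k, replace=False)]
    labels = None
    while True:                                            # (2) repeat
        # (3) (re)assign each object to the most similar cluster,
        #     based on the current cluster means (centers)
        dists = np.linalg.norm(D[:, None, :] - centers[None, :, :], axis=2)
        new_labels = dists.argmin(axis=1)
        if labels is not None and np.array_equal(new_labels, labels):
            break                                          # (5) until no change
        labels = new_labels
        # (4) update the cluster means
        centers = np.array([D[labels == j].mean(axis=0) for j in range(k)])
    return labels, centers

# Example: six 2-D points forming two obvious groups.
D = np.array([[1.0, 1.0], [1.5, 2.0], [1.2, 0.8],
              [8.0, 8.0], [9.0, 8.5], [8.5, 9.0]])
print(k_means(D, k=2))
```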
k-Medoids
• It is also called the Partitioning Around Medoids (PAM) algorithm.
• A medoid is the point in a cluster whose total dissimilarity to all the other points in the cluster is minimum.
• The dissimilarity between an object Pi and its medoid Ci is calculated using E = |Pi − Ci|; the total cost of a clustering is the sum of these dissimilarities over all objects.
Algorithm:
1. Initialize: select k random points out of the n data points as the medoids.
2. Associate each data point with the closest medoid using any common distance metric.
3. While the cost decreases:
   For each medoid m and for each data point o that is not a medoid:
   1. Swap m and o, associate each data point with the closest medoid, and recompute the cost.
   2. If the total cost is more than that in the previous step, undo the swap.
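Below is a rough NumPy sketch of this swap-based procedure (a simplified PAM-style loop, not an optimised implementation and not part of the original slides); the Manhattan distance, the function names, and the example data are assumptions.

```python
# Simplified k-medoids (PAM-style) sketch: try swapping each medoid with each
# non-medoid point, and keep a swap only if the total cost (sum of distances
# of every point to its closest medoid) decreases.
import numpy as np

def total_cost(D, medoid_idx):
    # Manhattan (L1) dissimilarity of every point to its closest medoid.
    dists = np.abs(D[:, None, :] - D[medoid_idx][None, :, :]).sum(axis=2)
    return dists.min(axis=1).sum()

def k_medoids(D, k, seed=0):
    rng = np.random.default_rng(seed)
    # 1. Initialize: select k random points as the medoids.
    medoids = list(rng.choice(len(D), size=k, replace=False))
    cost = total_cost(D, medoids)
    improved = True
    while improved:                       # 3. while the cost decreases
        improved = False
        for i in range(k):
            for o in range(len(D)):
                if o in medoids:
                    continue
                candidate = medoids.copy()
                candidate[i] = o          # swap medoid m with non-medoid o
                new_cost = total_cost(D, candidate)
                if new_cost < cost:       # keep the swap only if the cost drops
                    medoids, cost = candidate, new_cost
                    improved = True
    # 2. Associate each data point with its closest medoid.
    dists = np.abs(D[:, None, :] - D[medoids][None, :, :]).sum(axis=2)
    return np.array(medoids), dists.argmin(axis=1), cost

# Example: the same six 2-D points used for the k-means sketch.
D = np.array([[1.0, 1.0], [1.5, 2.0], [1.2, 0.8],
              [8.0, 8.0], [9.0, 8.5], [8.5, 9.0]])
print(k_medoids(D, k=2))
```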
Outlier Detection in Clustering
• An outlier is a data object that deviates significantly from the rest of the objects, as if it were generated by a different mechanism.
• Outliers are referred to as "abnormal" data.
• Outliers are different from noisy data. Noise is a random error or variance in a measured variable.
• Outliers are interesting because they are suspected of not being generated by the same mechanisms as the rest of the data.
• Outlier detection is also related to novelty detection in evolving data sets.
(Figure: the objects in region R are outliers.)
Types of Outliers:
• Global Outliers
• Contextual Outliers
• Collective Outliers
Outlier Detection Techniques
Supervised, Semi-Supervised, and Unsupervised Methods
• Supervised methods model data normality and abnormality.
• In some application scenarios, objects labeled as "normal" or "outlier" are not available; an unsupervised learning method then has to be used.
• In some cases, only a small set of the normal and/or outlier objects are labeled, but most of the data are unlabeled; semi-supervised methods can make use of such partially labeled data.
Statistical, Proximity-Based, and Clustering-Based Methods
• Statistical methods (also known as model-based methods) make assumptions of data normality.
• The effectiveness of proximity-based methods relies heavily on the proximity (or distance) measure used.
• Clustering-based methods assume that the normal data objects belong to large and dense clusters, whereas outliers belong to small or sparse clusters, or do not belong to any clusters.
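As a small illustration of a proximity-based approach (one possible technique, not prescribed by the slides), the sketch below scores each point by the distance to its k-th nearest neighbour; the example data and the cut-off rule are arbitrary choices.

```python
# Minimal proximity-based outlier detection sketch: score each point by the
# distance to its k-th nearest neighbour; unusually large scores suggest
# outliers.  The cut-off used below is an arbitrary illustrative choice.
import numpy as np

def knn_outlier_scores(D, k=2):
    # Pairwise Euclidean distances between all points (an n x n matrix).
    dists = np.linalg.norm(D[:, None, :] - D[None, :, :], axis=2)
    # Column 0 of each sorted row is the distance to the point itself (0.0),
    # so column k is the distance to the k-th nearest other point.
    return np.sort(dists, axis=1)[:, k]

# Four points in a dense group and one far-away point.
D = np.array([[1.0, 1.0], [1.2, 0.9], [0.8, 1.1], [1.1, 1.0], [5.0, 5.0]])
scores = knn_outlier_scores(D, k=2)
flags = scores > 3 * np.median(scores)     # arbitrary cut-off for this sketch
print(scores)
print("outlier flags:", flags)
```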
Thank You