
Data Warehousing & Modeling
Subject Code: 18CS641
Module 2: Data warehouse implementation & Data mining
Efficient Data Cube Computation: An Overview; Indexing OLAP Data: Bitmap Index and Join Index; Efficient Processing of OLAP Queries; OLAP Server Architectures: ROLAP versus MOLAP versus HOLAP.
Introduction: What Is Data Mining, Challenges, Data Mining Tasks; Data: Types of Data, Data Quality, Data Preprocessing, Measures of Similarity and Dissimilarity.
Textbook 2: Ch. 4.4
Textbook 1: Ch. 1.1, 1.2, 1.4, 2.1 to 2.4
Data Warehouse Implementation
🞂 Data warehouses contain huge amounts of data.
🞂 OLAP servers must answer decision-support queries within seconds.
🞂 It is therefore crucial for data warehouse systems to support highly efficient cube computation techniques, access methods, and query processing techniques.
Efficient Data Cube Computation: An Overview
🞂 At the core of multidimensional data analysis is the efficient computation of aggregations across many sets of dimensions.
🞂 In SQL terms, these aggregations are referred to as group-bys.
🞂 Each group-by can be represented by a cuboid.
🞂 The set of group-bys forms a lattice of cuboids defining a data cube.
🞂 Issues related to the efficient computation of data cubes are as follows:
Efficient Data Cube Computation: "The compute cube Operator"
🞂 SQL extends cube computation by including a compute cube operator.
🞂 The compute cube operator computes aggregates over all subsets of the dimensions specified in the operation.
🞂 This can require excessive storage space, especially for a large number of dimensions.
🞂 A data cube is a lattice of cuboids.
Efficient Data Cube Computation: "The compute cube" Operator
🞂 Example: You would like to create a data cube for All_Electronics that contains the following: item, city, year, and sales_in_dollars.
🞂 Analyze the data with the following queries:
◦ Compute the sum of sales, grouping by item and city
◦ Compute the sum of sales, grouping by item
◦ Compute the sum of sales, grouping by city
Efficient Data Cube Computation: "The compute cube" Operator
🞂 The total number of cuboids, or group-bys, that can be formed is 2^3 = 8:
◦ {(city, item, year),
◦ (city, item), (city, year), (item, year),
◦ (city), (item), (year),
◦ ()}
🞂 (): the group-by is empty (the dimensions are not grouped).
◦ These group-bys form a lattice of cuboids for the data cube, as shown in the figure below.
◦ The base cuboid contains all three dimensions.
◦ The apex cuboid (0-D) corresponds to the empty group-by.
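A minimal, illustrative Python sketch (not from the textbook) of how the 2^n group-bys produced by compute cube could be enumerated and aggregated; the pandas DataFrame, its column names, and the sample rows are assumptions made only for this example.

```python
# Illustrative sketch: enumerating all 2^n group-bys that "compute cube" would
# materialize, using pandas. Column names and sample data are assumed.
from itertools import combinations
import pandas as pd

sales = pd.DataFrame({
    "item": ["TV", "TV", "phone", "phone"],
    "city": ["Chicago", "Toronto", "Chicago", "Toronto"],
    "year": [2009, 2009, 2010, 2010],
    "sales_in_dollars": [400, 350, 300, 250],
})

dims = ["item", "city", "year"]
cuboids = {}
for k in range(len(dims), -1, -1):          # from the base cuboid down to the apex
    for group in combinations(dims, k):
        if group:                            # an ordinary group-by
            cuboids[group] = sales.groupby(list(group))["sales_in_dollars"].sum()
        else:                                # apex cuboid: the empty group-by ()
            cuboids[group] = sales["sales_in_dollars"].sum()

print(len(cuboids))                          # 2**3 = 8 cuboids
print(cuboids[("item", "city")])             # one of the 2-D cuboids
```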
Efficient Data Cube Computation: "The compute cube" Operator
[Figure: lattice of cuboids for the dimensions city, item, and year.]
● Apex (0-D) cuboid: holds the total sum of all sales; it is the most generalized (least specific) cuboid. Moving downward from the apex corresponds to drilling down.
● Base cuboid: the least generalized (most specific) cuboid. Moving upward toward the apex corresponds to rolling up.
Efficient Data Cube Computation: "The compute cube" Operator
🞂 An SQL query containing no group-by is a 0-D operation.
  Example: compute the sum of total sales.
🞂 A cube operator on n dimensions is equivalent to a collection of group-by statements, one for each subset of the n dimensions.
🞂 Therefore, the cube operator is the n-dimensional generalization of the group-by operator.
🞂 Similar to the SQL syntax, the data cube could be defined in DMQL as:
◦ define cube sales [item, city, year]: sum (sales_in_dollars)
🞂 Compute the sales aggregate cuboids as:
◦ compute cube sales
A Data Cube: sales

City \ Item     I1   I2   I3   I4   I5   I6   All
New York City   10   11   12    3   10    1    47
Chicago         11    9    6    9    6    7    48
Toronto         12    9    8    5    7    3    44
Vancouver       13    8   10    5    6    3    45
All             46   37   36   22   29   14   184

● The 4 x 6 block of values is the item–city (base) cuboid; each of its entries is a base cell.
● The "All" column is the city cuboid and the "All" row is the item cuboid; their entries are aggregate cells.
● The single value 184 at (All, All) is the apex cuboid.
Efficient Data Cube Computation: "The Curse of Dimensionality"
• Online analytical processing is fastest when the aggregates for all the cuboids are precomputed:
🞂 fast response time
🞂 avoids redundant computation
🞂 However, pre-computation of the full cube requires an excessive amount of memory, which grows with the number of dimensions and the cardinality of each dimension.
🞂 This is called the curse of dimensionality.
Efficient Data Cube Computation: "The Curse of Dimensionality"
🞂 How many cuboids are there in an n-dimensional data cube, taking dimension cardinality (the number of elements in a given dimension) and concept hierarchies into account?
🞂 Many dimensions may have hierarchies, for example time:
● day < month < quarter < year
🞂 The total number of cuboids that can be generated is:
  Total cuboids = (L1 + 1) × (L2 + 1) × … × (Ln + 1)
🞂 where Li is the number of levels for dimension i.
🞂 The "+1" accounts for the virtual top level "all", i.e., removal of the dimension.
🞂 For the time hierarchy above there are 4 levels, so 5 choices (including "all") for that dimension.
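A small illustrative Python sketch of the cuboid-count formula above; the hierarchy sizes used are assumptions for the example, not values from the textbook.

```python
# Illustrative sketch: counting cuboids for an n-dimensional cube with concept
# hierarchies, using Total = (L1 + 1) * (L2 + 1) * ... * (Ln + 1).
from math import prod

# Assumed hierarchy levels per dimension, e.g. time: day<month<quarter<year (4),
# item: item_name<brand<type (3), location: street<city<state<country (4).
levels = [4, 3, 4]

total_cuboids = prod(l + 1 for l in levels)   # "+1" is the virtual "all" level
print(total_cuboids)                          # (4+1)*(3+1)*(4+1) = 100
```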
Materialization
🞂 A materialized view physically stores the precomputed results of a query (unlike an ordinary view, which is virtual).
🞂 Data cube materialization/pre-computation involves three choices:
◦ No materialization: do not precompute any of the non-base cuboids. This leads to multidimensional aggregation on the fly, which can be very slow.
◦ Full materialization: precompute all of the cuboids (the full cube). Queries run very fast, but this requires huge amounts of memory.
◦ Partial materialization: selectively compute a proper subset of the cuboids, or a subcube that contains only those cells that satisfy some user-specified criterion.
Types of cubes
■ Full cube: all cells of all cuboids are materialized, i.e., every possible combination of dimensions and values.
■ Iceberg cube: partial materialization; only the cells of a cuboid whose aggregate value (e.g., count) is above a minimum support threshold are materialized.
  count(*) >= min_sup   (the iceberg condition)
■ Shell cube: precompute the cuboids for only a small number of dimensions (e.g., three to five) of a data cube; the remaining queries are computed on the fly.
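An illustrative sketch, assuming pandas, of materializing an iceberg cuboid by applying the iceberg condition; the sample table, column names, and min_sup value are invented for the example.

```python
# Illustrative sketch: keep only group-by cells whose count meets a minimum
# support threshold (the iceberg condition).
import pandas as pd

sales = pd.DataFrame({
    "item": ["TV", "TV", "TV", "phone", "phone", "laptop"],
    "city": ["Chicago", "Chicago", "Toronto", "Chicago", "Toronto", "Toronto"],
})

min_sup = 2
counts = sales.groupby(["item", "city"]).size()     # full (item, city) cuboid
iceberg_cuboid = counts[counts >= min_sup]           # iceberg condition: count >= min_sup
print(iceberg_cuboid)                                # only (TV, Chicago) survives
```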
Indexing OLAP Data
🞂 To support efficient OLAP query processing, data warehouses use index structures such as:
🞂 Bitmap index — popular in OLAP products because it allows quick searching in data cubes.
🞂 Join index
Indexing OLAP Data: Bitmap Index
🞂 The bitmap index is an alternative representation of the record ID
(RID) list.

🞂 In the bitmap index for a given attribute, there is a distinct bit


vector, Bv, for each value v in the attribute’s domain.

🞂 If a given attribute’s domain consists of n values, then n bits are


needed for each entry in the bitmap index (i.e., there are n bit
vectors).

🞂 If the attribute has the value v for a given row in the data table,
then the bit representing that value is set to 1 in the corresponding
row of the bitmap index. All other bits for that row are set to 0.
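A minimal illustrative Python sketch of the bitmap index just described; the sample attribute values are made up for the example.

```python
# Illustrative sketch: a simple bitmap index for one attribute. For each value v
# in the attribute's domain there is one bit vector; bit i is 1 if row i has v.
rows = ["Chicago", "Toronto", "Chicago", "Vancouver", "Toronto"]   # attribute "city"

domain = sorted(set(rows))
bitmap_index = {
    value: [1 if city == value else 0 for city in rows]
    for value in domain
}

for value, bits in bitmap_index.items():
    print(value, bits)
# Chicago   [1, 0, 1, 0, 0]
# Toronto   [0, 1, 0, 0, 1]
# Vancouver [0, 0, 0, 1, 0]
```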
Bitmap Index Advantages
🞂 It is efficient compared with hash and tree indices.
🞂 It is especially useful for low-cardinality domains (few unique values, e.g., semester, gender), because the processing time for comparison, join, and aggregation operations is reduced.
🞂 It leads to significant reductions in space and I/O, since a string of characters can be represented by a single bit.
🞂 It can be adapted to higher-cardinality domains (many possible values for an attribute, e.g., customer ID, USN) by using compression techniques.
Indexing OLAP Data: Join Index
🞂 Traditional indexing maps the value in a given column to a list of rows having that value.
🞂 In contrast, join indexing registers the joinable rows of two relations from a relational database.
🞂 In data warehouses, a join index relates the values of the dimensions of a star schema to rows in the fact table.
🞂 E.g., fact table Sales and two dimensions, location and item:
🞂 a join index on location maintains, for each distinct location, the list of record IDs (RIDs) of the fact-table tuples recording sales in that city.
🞂 Join indices can span multiple dimensions, forming a composite join index.
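A short illustrative Python sketch of a join index on the location dimension; the fact-table rows and RIDs are assumptions for the example.

```python
# Illustrative sketch: a join index mapping each dimension value (location) to the
# list of fact-table row IDs (RIDs) that join with it. Table contents are assumed.
sales_fact = [
    {"rid": "T57",  "location": "Main Street", "item": "Sony-TV",      "amount": 400},
    {"rid": "T238", "location": "Main Street", "item": "Panasonic-TV", "amount": 350},
    {"rid": "T884", "location": "Lakeview",    "item": "Sony-TV",      "amount": 300},
]

join_index_on_location = {}
for row in sales_fact:
    join_index_on_location.setdefault(row["location"], []).append(row["rid"])

print(join_index_on_location)
# {'Main Street': ['T57', 'T238'], 'Lakeview': ['T884']}
```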
Efficient Processing of OLAP Queries
🞂 The purpose of materializing cuboids and constructing OLAP index structures is to speed up query processing on data cubes. Given a query:
1. Determine which operations should be performed on the available cuboids:
◦ transform the selection, projection, roll-up (group-by), and drill-down operations specified in the query into corresponding OLAP operations.
Efficient Processing of OLAP Queries
2. Determine to which materialized cuboid(s) the relevant OLAP operations should be applied:
◦ choose the one with the lowest cost.

Example:
🞂 Suppose that we define a data cube for AllElectronics of the form:
  "sales cube [time, item, location]: sum(sales in dollars)."
🞂 The dimension hierarchies used are:
🞂 "day < month < quarter < year" for time
🞂 "item name < brand < type" for item
🞂 "street < city < province or state < country" for location.
Efficient Processing of OLAP Queries
Let the query to be processed be on {brand, province_or_state} with the condition "year = 2010", and suppose there are 4 materialized cuboids available:

1) {year, item_name, city}
2) {year, brand, country}
3) {year, brand, province_or_state}
4) {item_name, province_or_state} where year = 2010

Which should be selected to process the query?
Cuboid 2 cannot be used, because country cannot be drilled down to province_or_state. Cuboids 1, 3, and 4 can all answer the query; in general, the finer-granularity cuboids (1 and 4) involve more records, so cuboid 3 is usually preferred unless efficient indexes make cuboid 4 cheaper.

OLAP Server Architectures:

🞂 Implementations of a warehouse server for OLAP


processing includes:

🞂 Relational OLAP (ROLAP)Servers


🞂 Multidimensional OLAP (MOLAP) Servers
🞂Hybrid OLAP (HOLAP) Servers
🞂Specialized SQL Servers
OLAP Server Architectures: Relational OLAP
(ROLAP)
🞂 These are the intermediate servers that stand between a relational
back-end server and client front-end tools.
🞂 Uses a relational or extended-relational DBMS to store and manage
warehouse data.
🞂 ROLAP works directly with relational database.
🞂 Has greater scalability than MOLAP.
🞂 ROLAP Server includes optimization for each DBMS back end,
implementation of aggregation navigation logic, and additional tools
and services
🞂 ROLAP tools do not use pre-calculated data cubes but instead pose
the query to the standard relational database.
OLAP Server Architectures: Relational OLAP (ROLAP)
🞂 ROLAP tools feature the ability to ask any question, because the methodology is not limited to the contents of a cube.
🞂 ROLAP also has the ability to drill down to the lowest level of detail in the database.
🞂 ROLAP uses relational tables to store data for OLAP.
🞂 The fact table associated with a base cuboid is called a base fact table.
🞂 The base fact table stores data at the abstraction level indicated by the join keys in the schema for the given data cube.
🞂 Aggregated data stored in fact tables are called summary fact tables.
OLAP Server Architectures: Multidimensional OLAP (MOLAP)
🞂 MOLAP stores data in optimized multidimensional array storage (the data cube); these are array-based multidimensional storage engines.
🞂 The advantage of using a data cube is that it allows fast indexing to precomputed summarized data.
🞂 MOLAP tools have a very fast response time.
🞂 The data cube contains all the possible answers to a given range of questions.
🞂 MOLAP servers adopt a two-level storage representation to handle dense data sets (most cells populated) and sparse data sets (most cells empty).
🞂 Sparse subcubes employ compression technology for efficient storage utilization.
OLAP Server Architectures: Hybrid OLAP (HOLAP)
🞂 Combines ROLAP and MOLAP technology.
🞂 A HOLAP server may allow large volumes of detailed data to be stored in a relational database,
🞂 while aggregations are kept in a separate MOLAP store.
🞂 HOLAP tools can utilize both pre-calculated cubes and relational data sources.
🞂 The hybrid OLAP approach benefits from the greater scalability of ROLAP and the faster computation of MOLAP.
🞂 Microsoft SQL Server 2000 supports a hybrid OLAP server.
Specialized SQL servers
🞂 To meet the growing demand for OLAP processing in relational databases, some database system vendors implement specialized SQL servers.
🞂 They provide advanced query language and query processing support for SQL queries over star and snowflake schemas in a read-only environment.
What Is Data Mining?( Definition)
🞂Data mining is the process of automatically discovering
useful information in large data repositories.

🞂Finding hidden information in a database.

🞂Data mining techniques are deployed to scour large


databases in order to find novel and useful patterns that
might otherwise remain unknown.

🞂They also provide capabilities to predict the outcome of


a future observation.
What is (not) Data Mining?
🞂 Looking up individual records using a database
management system.

🞂 Finding particular Web pages via a query to an Internet


search engine.

🞂 Above are tasks related to the area of information


retrieval.
Data Mining and Knowledge Discovery
● Data mining is an integral part of
Knowledge Discovery in Databases (KDD), which is the
overall process of converting raw data into useful information.
Data Preprocessing
🞂 The input data can be stored in a variety of formats (flat files, spreadsheets, or relational tables).
🞂 The purpose of preprocessing is to transform the raw input data into an appropriate format for subsequent analysis.
🞂 It includes:
🞂 fusing data from multiple sources,
🞂 cleaning data to remove noise and duplicate observations,
🞂 selecting records and features that are relevant to the data mining task at hand.
Post processing
🞂 Ensures that only valid and useful results are incorporated into the Decision Support System (DSS).
🞂 Visualization, a common postprocessing step, allows analysts to explore the data and the data mining results from a variety of viewpoints.
🞂 Statistical testing methods can also be applied during postprocessing to eliminate spurious data mining results.
Challenges that motivated the development of data mining
🞂 Scalability: data mining algorithms must handle massive data sets and provide efficient data access.
🞂 High dimensionality: data sets with thousands of attributes.
  Ex: temperature measurements at many locations, bioinformatics data.
🞂 Heterogeneous and complex data.
  Ex: climate time series (temperature, pressure, etc.); web pages and social media with semi-structured text and hyperlinks.
🞂 Data ownership and distribution: data is not stored in one location or owned by one organization.
  Ex: the Flipkart data warehouse.
🞂 Non-traditional analysis: the need to automate the process of hypothesis generation and evaluation.
Data Mining Tasks
🞂 Predictive tasks: predict the value of a particular attribute based on the values of other attributes.
◦ The attribute to be predicted is the target or dependent variable;
◦ the attributes used for prediction are the explanatory or independent variables.
🞂 Descriptive tasks: derive patterns that summarize the underlying relationships in data.
◦ Postprocessing techniques are used to validate and explain the results.
Data Mining Tasks …
[Figure: the four core data mining tasks — predictive modeling, cluster analysis, association analysis, and anomaly detection — applied to a data set.]
1. Predictive Modeling
🞂 Refers to the task of building a model for the target variable as a function of the explanatory variables.
🞂 There are two types of predictive modeling tasks:
🞂 classification, which is used for discrete target variables (e.g., number of students in a class);
🞂 regression, which is used for continuous target variables (e.g., weight and height of students in a class).
🞂 Example of classification: predicting whether a web user will make a purchase at an online book store.
🞂 Example of regression: forecasting the future price of a stock.
Predictive Modeling: Classification
🞂 Find a model for the class attribute as a function of the values of the other attributes (e.g., a model for predicting credit worthiness).
🞂 Given a collection of records (training set )

🞂 Each record contains a set of attributes, one of the attributes is


the class.

🞂 Find a model for class attribute as a function of the values of


other attributes.

🞂 Goal: previously unseen records should be assigned a class as


accurately as possible.

🞂 A test set is used to determine the accuracy of the model.

🞂 Usually, the given data set is divided into training and test sets,
with training set used to build the model and test set used to
validate it.
Classification Example
[Figure: a training set of records with categorical and quantitative attributes plus a class label is used to learn a classifier model; the model is then applied to a test set whose class labels are unknown.]
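An illustrative sketch, assuming scikit-learn is available, of the train/test workflow described above; the toy records and attribute meanings are invented for the example.

```python
# Illustrative sketch: learn a classifier from a training set and evaluate it on a
# held-out test set. The toy data is made up.
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import accuracy_score

# Attributes: [annual_income_in_k, years_employed]; class: 1 = creditworthy, 0 = not
X = [[60, 5], [20, 1], [85, 10], [30, 2], [95, 12], [25, 1], [70, 7], [15, 0]]
y = [1, 0, 1, 0, 1, 0, 1, 0]

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=0)

model = DecisionTreeClassifier().fit(X_train, y_train)   # learn a model from the training set
predictions = model.predict(X_test)                       # assign classes to unseen records
print("test accuracy:", accuracy_score(y_test, predictions))
```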
2. Cluster Analysis
🞂 Finding groups of objects such that the objects in a group are similar (or related) to one another and different from (or unrelated to) the objects in other groups.
🞂 Intra-cluster distances are minimized; inter-cluster distances are maximized.

Example: Document Clustering
🞂 Clustering has been used to group sets of related customers, find areas of the ocean that have a significant impact on the Earth's climate, and compress data.
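A brief illustrative sketch, assuming scikit-learn, of grouping objects so that intra-cluster distances are small and inter-cluster distances are large; the 2-D points are made up.

```python
# Illustrative sketch: k-means clustering on a few 2-D points.
from sklearn.cluster import KMeans

points = [[1.0, 1.1], [1.2, 0.9], [0.8, 1.0],     # one group of nearby points
          [5.0, 5.2], [5.1, 4.9], [4.8, 5.0]]     # a second, well-separated group

kmeans = KMeans(n_clusters=2, n_init=10, random_state=0).fit(points)
print(kmeans.labels_)           # e.g. [0 0 0 1 1 1] — each point's cluster assignment
print(kmeans.cluster_centers_)  # the two cluster centroids
```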
3. Association Analysis: Definition
🞂 Given a set of records, each of which contains some number of items from a given collection,
🞂 association analysis is used to discover patterns that describe strongly associated features in the data.
🞂 The discovered patterns are typically represented in the form of implication rules or feature subsets.
Association Analysis: Applications
🞂 Market-basket analysis:
◦ Rules are used for sales promotion, shelf management, and inventory management.

Rules discovered:
{Milk} --> {Bread}
{Diapers} --> {Milk}
{Diapers, Milk} --> {Coke}

Example:
🞂
Finding groups of genes that have related functionality

🞂
Identifying Web pages that are accessed together

🞂
Understanding the relationships between different elements of Earth's
climate system.
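An illustrative Python sketch of how one association rule can be evaluated by its support and confidence; the toy transactions are invented for the example.

```python
# Illustrative sketch: support and confidence of the rule {Diapers} -> {Milk}
# over toy market-basket transactions.
transactions = [
    {"Bread", "Milk"},
    {"Bread", "Diapers", "Beer", "Eggs"},
    {"Milk", "Diapers", "Beer", "Coke"},
    {"Bread", "Milk", "Diapers", "Beer"},
    {"Bread", "Milk", "Diapers", "Coke"},
]

antecedent, consequent = {"Diapers"}, {"Milk"}
n = len(transactions)
count_both = sum(1 for t in transactions if antecedent | consequent <= t)
count_antecedent = sum(1 for t in transactions if antecedent <= t)

support = count_both / n                      # fraction of transactions containing both
confidence = count_both / count_antecedent    # how often the rule holds when Diapers appear
print(f"support={support:.2f}, confidence={confidence:.2f}")   # support=0.60, confidence=0.75
```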
4. Deviation/Anomaly/Change Detection
🞂 The task of identifying observations whose characteristics are significantly different from the rest of the data.
🞂 Such observations are known as anomalies or outliers.
🞂 The goal of an anomaly detection algorithm is to discover the real anomalies and avoid falsely labeling normal objects as anomalies.
Applications:
1. Credit card fraud detection
2. Network intrusion detection
Types of data – What is a data set?
🞂 Data set: a collection of data objects.
🞂 Data objects are described by a number of attributes (the columns of the data set); the objects themselves are the rows.
🞂 An attribute is a property or characteristic of an object.
◦ Examples: eye color of a person, temperature, etc.
◦ An attribute is also known as a variable, field, characteristic, dimension, or feature.
🞂 A collection of attributes describes an object.
◦ An object is also known as a record, point, case, event, vector, pattern, observation, sample, entity, or instance.
Data set – Example
● Each row is an object (a student).
● Each column is an attribute (an aspect of a student).
● Record-based data sets are stored in flat files or relational database systems.
Attributes and Measurement
🞂 An attribute is a property or characteristic of an object that
may vary; either from one object to another or from one time
to another.

🞂 Ex: Eye color varies from person to person, while the


temperature of an object varies over time

🞂 Note that eye color is a symbolic attribute with a small


number of possible values {black,blue, brown,green,hazel,
etc.}

🞂 while temperature is a numerical attribute with a potentially


unlimited number of values.
Attributes and Measurement
🞂 A measurement scale is a rule (function) that associates a
numerical or symbolic value with an attribute of an object.

🞂 The process of measurement is the application of a


measurement scale to associate a value with a particular
attribute of a specific object.

Example: gender- Male/Female- Symbolic value


weight-kg/gms-Numerical value
Type of an Attribute
🞂 The values used to represent an attribute may have
properties that are not properties of the attribute itself, and
vice versa

🞂 Distinction between attributes and attribute values

◦ Same attribute can be mapped to different attribute values


● Example: height can be measured in feet or meters

◦ Different attributes can be mapped to the same set of values
● Example: attribute values for ID and age are both integers
● But the properties of the attribute values can be different:
● age has a maximum and minimum value, while ID does not
● it makes no sense to compute the average of employee IDs
The Different Types of Attributes
🞂 The following properties (operations) of numbers are typically used to describe attributes:
◦ Distinctness: = and ≠
◦ Order: < and >
◦ Differences are meaningful: + and −
◦ Ratios are meaningful: × and /

Types are:
◦ Nominal attribute: distinctness
◦ Ordinal attribute: distinctness & order
◦ Interval attribute: distinctness, order & meaningful differences
◦ Ratio attribute: all 4 properties/operations
Example: length is a ratio attribute.
Types of Attributes
🞂 There are different types of attributes:
◦ Nominal: names (integers or symbols); no order.
● Examples: ID numbers, eye color, zip codes
◦ Ordinal: values are ordered, but differences between them are not meaningful.
● Examples: rankings (e.g., place in a competition), grades, height in {tall, medium, short}
◦ Interval: ordered, differences are meaningful, but there is no true zero point.
● Examples: calendar dates, temperatures in Celsius or Fahrenheit
◦ Ratio: ordered, differences are meaningful, there is a true zero point, and ratios are meaningful.
● Examples: height, weight, length, time, counts
Different Attribute Types: Permissible Transformations
Transformations that do not change the meaning of an attribute: nominal — any one-to-one mapping (e.g., a permutation of values); ordinal — any order-preserving change of values; interval — new_value = a × old_value + b, where a and b are constants; ratio — new_value = a × old_value.
Describing Attributes by the Number of Values
🞂Discrete Attribute
◦ Has only a finite or countably infinite set of values
◦ Examples: zip codes, counts, or the set of words in a collection of
documents
◦ Often represented as integer variables.
◦ Note: binary attributes are a special case of discrete attributes
◦ Examples:True/False,0/1,Male/Female,Y/N

🞂Continuous Attribute
◦ Has real numbers as attribute values
◦ Examples: temperature, height, or weight.
◦ Practically, real values can only be measured and represented
using a finite number of digits.
◦ Continuous attributes are typically represented as floating-
point variables.
Asymmetric Attributes
🞂 Only presence (a non-zero attribute value) is regarded as important.
🞂 Consider a data set where each object is a student and each attribute records whether or not the student took a particular course at a university.
🞂 Since most students take only a small fraction of all courses, it is more meaningful and more efficient to focus on the non-zero values.
🞂 Binary attributes where only non-zero values are important are called asymmetric binary attributes.
Important Characteristics of a Data Set
◦ Dimensionality (number of attributes)
● Data sets with too few attributes may not capture enough information for quality mining results, while high-dimensional data suffer from the curse of dimensionality.
● An important motivation for preprocessing the data is dimensionality reduction.

◦ Distribution
◦ The frequency of occurrence of various values for the attributes of the data objects.
◦ Statisticians have enumerated many distributions, e.g., Gaussian (normal), and their properties.
◦ Many data sets are not well captured by standard statistical distributions, so their statistical distribution is often not analyzed.
◦ Skewness in the distribution makes classification difficult, e.g., a categorical attribute with 95% "Y" and 5% "N".
Important Characteristics of a Data Set
◦ Sparsity
● Only the non-zero values need to be stored and manipulated, which improves computation time and storage.
◦ Resolution
● It is possible to obtain data at different levels of resolution, and the properties of the data are often different at different resolutions.
● Ex: the surface of the Earth is uneven at a resolution of a few meters, but relatively smooth at a resolution of tens of kilometers.
● If the resolution is too fine or too coarse, a pattern may not be visible.
● Ex: atmospheric pressure measured on a scale of hours reflects the movement of storms and other weather systems; on a scale of months, such phenomena are not detectable.
Types of data sets
🞂 Record Data
◦ Transaction Data (Market Basket Data)
◦ Data Matrix(Pattern Matrix)
◦ Sparse Data Matrix (Document-term Data Matrix)

🞂 Graph-Based Data
◦ World Wide Web (Data with Relationships among Objects)
◦ Molecular Structures (Data with Objects that are Graphs)

🞂 Ordered Data
◦ Sequential Transaction Data
◦ Genomic Sequence Data
◦ Temperature time series data
◦ Spatial Temperature Data
Record Data
🞂
Data that consists of a collection of records, each of which consists of a fixed set
of attributes
🞂
Stored in flat files or relational databases
Transaction Data
🞂A special type of record data, where
◦ Each record (transaction) involves a set of items.
◦ For example, consider a grocery store. The set of products
purchased by a customer during one shopping trip constitute a
transaction, while the individual products that were purchased
are the items.
Data Matrix
🞂 If data objects have the same fixed set of numeric attributes, then
the data objects can be thought of as points in a multi-dimensional
space, where each dimension represents a distinct attribute

🞂 Such data set can be represented by an m by n matrix, where there


are m rows, one for each object, and n columns, one for each
attribute
🞂 Numeric attributes only- standard matrix operations can be applied
to transform and manipulate the data
The Sparse Data Matrix
🞂 A sparse data matrix is a special case of a data matrix in which the attributes are of the same type and are asymmetric,
🞂 i.e., only non-zero values are important.
🞂 Transaction data is an example of a sparse data matrix that has only 0–1 entries.
🞂 Another common example is document data.
🞂 It is also called a document-term matrix.
Document-term matrix
🞂Each document becomes a ‘term’ vector
◦ Each term is a component (attribute) of the vector
◦ The value of each component is the number of times the
corresponding term occurs in the document.
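A small illustrative Python sketch of building such a document-term matrix; the documents are made up for the example.

```python
# Illustrative sketch: a document-term matrix where each document becomes a term
# vector and each entry counts how often the term occurs in the document.
from collections import Counter

documents = [
    "team coach play ball score game",
    "coach season team win lost",
    "ball game score season",
]

vocabulary = sorted({word for doc in documents for word in doc.split()})
matrix = [[Counter(doc.split())[term] for term in vocabulary] for doc in documents]

print(vocabulary)
for row in matrix:
    print(row)          # one sparse row of term counts per document
```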
Graph-Based Data
Data with relationships among objects
🞂 The relationships among objects frequently convey important information. In such cases, the data is often represented as a graph.
🞂 In particular, the data objects are mapped to nodes of the graph, while the relationships among objects are captured by the links between objects and by link properties, such as direction and weight.
Graph Data
🞂 Examples: web pages; social networks (objects: people, relationships: interactions via social media).
Graph-Based Data
Data with objects that are graphs
🞂 When objects contain sub-objects that have relationships, such objects are represented as graphs.
🞂 Example: the structure of a chemical compound, e.g., the benzene molecule C6H6.
Ordered Data
🞂 For some types of data, the attributes have relationships that involve order in time or space.
🞂 Different types of ordered data are:
◦ Sequential transaction data
◦ Genomic sequence data
◦ Temperature time series data
◦ Spatial temperature data

Sequential transaction data
🞂 Transaction data extended with time, where each record is a time-stamped sequence of items/events.

Genomic sequence data
🞂 Ex: human genetic information encoded as DNA sequences.

Time series data
🞂 Ex: daily prices of various stocks.

Spatial data
🞂 Has spatial attributes, such as positions or areas (longitude, latitude).
🞂 Ex: weather data (temperature, pressure, precipitation) collected from a variety of geographical locations.
[Figure: average monthly temperature of land and ocean.]
🞂 Data with both spatial and time components is spatio-temporal data.
Handling Non-Record Data
🞂 Most data mining algorithms are designed for record data or its variations.
🞂 Record-oriented techniques can be applied to non-record data by extracting features from the data objects and using these features to create a record corresponding to each object.
🞂 Ex: chemical structure data can be converted to record data with a binary attribute for each common substructure.
Data Quality
🞂 Data mining focuses on

(1) The detection and correction of data quality


Problems - Data Cleaning

(2) The use of robust algorithms that can tolerate poor data
quality.
Measurement and Data Collection Issues
🞂 There may be problems due to human error, limitations of measuring devices, or flaws in the data collection process.
🞂 The issues may be:
1. values, or even entire data objects, may be missing;
2. there may be spurious or duplicate objects;
3. there may be inconsistent values.
🞂 Example (duplicates): two different records for a person who has recently lived at two different addresses.
🞂 Example (inconsistent values): a person's record lists a height of 2 meters but a weight of only 2 kg.
Measurement and Data Collection Errors
🞂 Measurement error refers to any problem resulting from the measurement process.
🞂 For continuous attributes, the numerical difference between the measured and true value is called the error.
🞂 The term data collection error refers to errors such as omitting data objects or attribute values, or inappropriately including a data object.
  Example: a study of animals of a certain species might inappropriately include animals of a related species that are similar in appearance.
🞂 Problems that involve measurement error include:
◦ noise, artifacts, bias, precision, and accuracy.
◦ Data quality issues that involve both measurement and data collection problems include: outliers, missing values, inconsistent values, and duplicate data.
Noise
🞂 Noise is the random component of a measurement error.
🞂 It may involve the distortion of a value or the addition of spurious objects.
🞂 The term noise is often used in connection with data that has a spatial or temporal component, such as signal or image processing data.
Artifacts
🞂Data errors may be the result of a more deterministic
phenomenon, such as a streak in the same place on a
set of photographs.

🞂Such deterministic distortions of the data are often


referred to as artifacts.
Precision, Bias, and Accuracy
🞂 Precision: the closeness of repeated measurements (of the same quantity) to one another.
🞂 Bias: a systematic variation of the measurements from the quantity being measured.
🞂 Accuracy: the closeness of measurements to the true value of the quantity being measured.

Example: a laboratory standard weight of 1 g is measured 5 times, giving {1.015, 0.990, 1.013, 1.001, 0.986}.
Mean → 1.001
Bias → 1.001 − 1 = 0.001
Precision → the standard deviation of the set of values = 0.013
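A small illustrative Python sketch reproducing the bias and precision computation for the weight-scale example above.

```python
# Illustrative sketch: bias and precision for the 1 g weight-scale example.
from statistics import mean, stdev

measurements = [1.015, 0.990, 1.013, 1.001, 0.986]
true_value = 1.0

bias = mean(measurements) - true_value    # systematic deviation from the true value
precision = stdev(measurements)           # spread of the repeated measurements

print(f"mean={mean(measurements):.3f}, bias={bias:.3f}, precision={precision:.3f}")
# mean=1.001, bias=0.001, precision=0.013
```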
Outliers
🞂 Outliers are data objects whose characteristics are considerably different from most of the other data objects in the data set, or whose attribute values are unusual.
🞂 They are also called anomalous objects or values.
◦ Case 1: outliers are noise that interferes with data analysis.
◦ Case 2: outliers are the goal of our analysis:
● credit card fraud
● intrusion detection
Missing Values
🞂 Reasons for missing values:
◦ information was not collected (e.g., people decline to give their age or weight);
◦ attributes may not be applicable to all cases (e.g., annual income is not applicable to children).
🞂 Handling missing values:
◦ Eliminate data objects or attributes.
◦ Estimate missing values.
  Example: in a time series of temperature, a missing value can be estimated from neighbouring values.
◦ Ignore the missing value during analysis.
  Example: clustering of data objects can proceed even if one or two attribute values are missing.

Inconsistent values
◦ Example: a height should not be negative; an address field whose zip code does not match the stated city.
Duplicate Data
🞂 A data set may include data objects that are duplicates, or almost duplicates, of one another.
◦ This is a major issue when merging data from heterogeneous sources.
🞂 Examples:
◦ the same person with multiple e-mail addresses;
◦ two distinct people who share the same name but have different addresses should not be merged during deduplication.
Data Quality Issues Related to Applications
🞂 Timeliness: some data begin to age as soon as they are collected.
  Example: the purchasing behaviour of customers changes over time.
🞂 Relevance: the data must contain the information necessary for the application.
  Example: a model that predicts accident rates for drivers should not omit the age and gender attributes.
🞂 Knowledge about the data: documentation of aspects of the data, such as the type of each attribute and the scale of measurement.
Data Preprocessing
🞂Different strategies/steps for data preprocessing:

🞂Aggregation
🞂Sampling
🞂Dimensionality reduction
🞂Feature subset selection
🞂Feature creation
🞂Discretization and Binarization
🞂Variable transformation
Aggregation
● Combining two or more attributes (or objects) into a
single attribute (or object)

● Purpose
– Data reduction
◆ Reduce the number of attributes or objects
– Change of scale
◆ Cities aggregated into regions, states, countries, etc.
◆ Days aggregated into weeks, months, or years
– More “stable” data
◆ Aggregated data(avg,total) tends to have less
variability than individual values being aggregated.
Aggregation

● Reducing days to months


● Reduce item to higher category, i.e. Electronics
● Price reduced to sum or an average
● Reduce over store locations
Example: Precipitation in Australia
[Figure: histograms of the standard deviation of average monthly precipitation and of average yearly precipitation for locations in Australia.]
🞂 Yearly precipitation has less variability than monthly precipitation in Australia.
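An illustrative sketch, assuming pandas and numpy, of the same idea on synthetic data: values aggregated by averaging vary less than the individual values being aggregated. The daily series is invented for the example.

```python
# Illustrative sketch: aggregating daily values into monthly averages reduces variability.
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
daily = pd.Series(
    rng.gamma(shape=2.0, scale=3.0, size=365),
    index=pd.date_range("2023-01-01", periods=365, freq="D"),
)

monthly_avg = daily.groupby(daily.index.month).mean()   # aggregate days into monthly averages

print("std of daily values:    ", round(daily.std(), 2))
print("std of monthly averages:", round(monthly_avg.std(), 2))
# The aggregated (averaged) values vary much less than the individual daily values.
```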
Sampling
🞂 Sampling is used for selecting a subset of the data objects
to be analyzed.
🞂 Sampling is the main technique employed for data
reduction.
◦ It is often used for both the preliminary investigation of the data and
the final data analysis.

🞂 Statisticians often sample because obtaining the entire set


of data of interest is too expensive or time consuming.

🞂 Sampling is typically used in data mining because


processing the entire set of data of interest is too expensive
or time consuming.
Sampling …
🞂The key principle for effective sampling is the following:

◦ Using a sample will work almost as well as using the entire


data set, if the sample is representative

◦ A sample is representative if it has approximately the same


properties (of interest) as the original set of data
Sampling Approaches-Types of Sampling
🞂Simple Random Sampling
◦ There is an equal probability of selecting any particular item
◦ Sampling without replacement
● As each item is selected, it is removed from the
population
◦ Sampling with replacement
● Objects are not removed from the population as they are
selected for the sample.
● In sampling with replacement, the same object can be
picked up more than once
🞂Stratified sampling
◦ Split the data into several groups; then draw random samples
from each group
🞂 Progressive sampling
◦ Used when the proper sample size is difficult to determine.
◦ Start with a small sample and increase the size until a sample of sufficient size is obtained.
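A short illustrative Python sketch of the sampling approaches listed above; the population and group labels are made up for the example.

```python
# Illustrative sketch: simple random sampling with and without replacement, and a
# basic stratified sample.
import random

random.seed(0)
population = list(range(100))                       # 100 data objects (by index)
groups = ["A"] * 70 + ["B"] * 30                    # an imbalanced grouping attribute

without_replacement = random.sample(population, k=10)    # each object picked at most once
with_replacement = random.choices(population, k=10)      # the same object can repeat

# Stratified sampling: draw the same number of objects from each group.
strata = {}
for obj, grp in zip(population, groups):
    strata.setdefault(grp, []).append(obj)
stratified = [obj for members in strata.values() for obj in random.sample(members, k=5)]

print(without_replacement)
print(with_replacement)
print(stratified)
```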
Sample Size
[Figure: the same data set sampled at 8000 points, 2000 points, and 500 points — an example of the loss of structure with sampling.]

Sample Size
🞂 What sample size is necessary to get at least one object from each of 10 equal-sized groups?
[Figure: ten groups of points, and the chance of finding at least one representative point from each of the 10 groups as a function of sample size.]
Dimensionality Reduction
🞂Purpose
◦ Avoid curse of dimensionality
◦ Reduce amount of time and memory required by data mining
algorithms
◦ Allow data to be more easily visualized
◦ May help to eliminate irrelevant features or reduce noise

🞂Techniques:Linear algebra techniques for


dimensionality reduction
◦ Principal Components Analysis (PCA)-from the set of
attributes find new attributes(principal component)
◦ Singular Value Decomposition(SVD)
Curse of Dimensionality
🞂 When dimensionality increases, data becomes increasingly sparse in the space that it occupies.
🞂 The data objects we have are then not a representative sample of all possible objects;
🞂 in other words, there are not enough data objects to cover the space.
Feature Subset Selection
🞂 Another way to reduce the dimensionality of data.
🞂 Select useful attributes to create an accurate model.
🞂 Use only a subset of the features. No information is lost if the removed features are redundant or irrelevant:

🞂Redundant features
◦ Duplicate much or all of the information contained in one or more
other attributes
◦ Example: purchase price of a product and the amount of sales tax
paid contain much of the same information.

🞂Irrelevant features
◦ Contain no information that is useful for the data mining task at
hand
◦ Example: students' ID is often irrelevant to the task of predicting
students' GPA
An Architecture for Feature Subset
Selection
Feature Subset Selection
🞂 Three standard approaches to feature selection:
🞂 Embedded approaches: the data mining algorithm itself decides which attributes to use and which to ignore.
🞂 Filter approaches: features are selected before the data mining algorithm is run, e.g., based on the correlation of attributes.
🞂 Wrapper approaches: the target data mining algorithm is used as a black box to find the best subset of attributes.
Feature Weighting
🞂 An alternative to keeping or eliminating features.
🞂 A high weight means a more important feature;
🞂 a low weight means a less important feature.
🞂 Weights are often based on domain knowledge.
Feature Creation
🞂 Possible to create new set of attributes from the
original attributes and captures the information
effectively.

🞂 Number of new attributes are smaller than the number


of original attributes.

🞂 2 methodologies to create new attributes are:


◦ Feature Extraction
◦ Mapping the data to a new space
Feature Creation
🞂 Feature extraction: creating a new set of features from the original raw data.
🞂 Example: a data set contains information on historical artifacts, including material (wood, clay, bronze, gold), mass, and volume. A new feature density = mass / volume can be created, and classification based on density can identify the material.
🞂 Simple feature extraction can be done by mathematical combination of attributes or by using domain expertise.
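A tiny illustrative Python sketch of the artifact example above; the mass and volume values are made up.

```python
# Illustrative sketch: creating a derived feature (density = mass / volume).
artifacts = [
    {"material": "gold",   "mass_g": 193.0, "volume_cm3": 10.0},
    {"material": "bronze", "mass_g": 88.0,  "volume_cm3": 10.0},
    {"material": "wood",   "mass_g": 6.5,   "volume_cm3": 10.0},
]

for a in artifacts:
    a["density"] = a["mass_g"] / a["volume_cm3"]   # derived feature

print([(a["material"], a["density"]) for a in artifacts])
# density separates the materials far better than mass or volume alone
```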
Feature Creation
🞂 Mapping the data to a new space: a different view of the data can reveal important and interesting information.
🞂 Example: applying the Fourier transform to time series data changes the representation to one based on frequency, so that periodic noise patterns can be detected.
Discretization and Binarization
🞂 Some data mining algorithms, such as certain classification algorithms, need the data in the form of categorical attributes; association pattern mining needs the data in the form of binary attributes.
🞂 Transforming a continuous attribute into a categorical attribute → discretization.
🞂 Transforming continuous and discrete attributes into binary attributes → binarization.
Binarization
🞂 Assigning numerical values: if there are m categorical values, assign each original value to an integer in the interval [0, m−1].
🞂 Finding the number of binary attributes required:
  n = ⌈log2(m)⌉
🞂 Conversion into binary: encode each integer using n bits.
🞂 Such transformations can lead to unintended relationships among the new attributes (e.g., two of the bit attributes, say x2 and x3, may appear correlated even though no relationship exists between the original values).
🞂 To overcome this issue, use one binary attribute per categorical value (number of binary attributes = number of values) → asymmetric binary attributes.
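A short illustrative Python sketch of both binarization schemes above; the categorical values used are an assumption for the example.

```python
# Illustrative sketch: binarizing a categorical attribute, first with ceil(log2(m))
# bits and then with one asymmetric binary attribute per value.
from math import ceil, log2

values = ["awful", "poor", "OK", "good", "great"]       # m = 5 categorical values
m = len(values)
n_bits = ceil(log2(m))                                   # n = 3 bits

compact = {v: format(i, f"0{n_bits}b") for i, v in enumerate(values)}
one_hot = {v: [1 if i == j else 0 for j in range(m)] for i, v in enumerate(values)}

print(compact)   # e.g. 'OK' -> '010'  (can introduce spurious correlations between bits)
print(one_hot)   # e.g. 'OK' -> [0, 0, 1, 0, 0]  (asymmetric binary attributes)
```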
Discretization of Continuous Attributes
🞂 Discretization is applied to attributes that are used in classification or association analysis.
🞂 Transformation of a continuous attribute to a categorical attribute involves two subtasks:
◦ deciding how many categories (n) to have: after the values of the continuous attribute are sorted, they are divided into n intervals by specifying n−1 split points;
◦ determining how to map the continuous values to the categorical values: the result is a set of intervals {(x0, x1], (x1, x2], ..., (xn−1, xn]}.
🞂 Discretization is unsupervised if class labels are not used and supervised if class labels are used.
Discretization Without Using Class Labels
● Unsupervised discretization does not use class information.
● Example: discretizing people into low-income, middle-income, and high-income groups based on economic factors.
[Figure: equal interval width approach used to obtain 4 values.]
[Figure: equal frequency approach used to obtain 4 values.]
[Figure: K-means approach used to obtain 4 values.]
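An illustrative sketch, assuming numpy, of the equal-width and equal-frequency approaches just mentioned; the continuous attribute values are generated only for the example.

```python
# Illustrative sketch: equal-width and equal-frequency discretization into 4 categories.
import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(loc=50, scale=15, size=200)      # a continuous attribute
n_bins = 4

# Equal interval width: split the range [min, max] into 4 equal-width intervals.
width_edges = np.linspace(x.min(), x.max(), n_bins + 1)
equal_width_labels = np.digitize(x, width_edges[1:-1])          # categories 0..3

# Equal frequency: choose split points so each interval holds about the same number of values.
freq_edges = np.quantile(x, [0.25, 0.5, 0.75])
equal_freq_labels = np.digitize(x, freq_edges)                  # categories 0..3

print(np.bincount(equal_width_labels))   # uneven counts per bin
print(np.bincount(equal_freq_labels))    # roughly 50 values per bin
```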


Measures of Similarity and Dissimilarity
🞂 Used in clustering, some classification methods, and anomaly detection.
🞂 Similarity measure
◦ A numerical measure of how alike two data objects are.
◦ It is higher for pairs of objects that are more alike.
◦ It often falls in the range [0, 1]:
◦ 0 → no similarity, 1 → completely similar.
🞂 Dissimilarity (distance) measure
◦ A numerical measure of how different two data objects are.
◦ It is lower for pairs of similar objects.
◦ The minimum dissimilarity is often 0;
◦ the upper limit varies.
🞂 Proximity refers to either a similarity or a dissimilarity.
Measures of Similarity and Dissimilarity
🞂 Transformations
◦ Applied to convert a similarity to a dissimilarity, or vice versa,
◦ or to transform a proximity measure to fall within the range [0, 1].
◦ Example: if similarities range from 1 (not similar) to 10 (very similar), we can transform them to [0, 1] using s' = (s − 1)/9,
  where s is the original value and s' is the new similarity value.
◦ In general, a similarity measure can be transformed to the interval [0, 1] by
  s' = (s − min_s) / (max_s − min_s)
◦ A dissimilarity measure with a finite range can be mapped to the interval [0, 1] by
  d' = (d − min_d) / (max_d − min_d)
Measures of Similarity and Dissimilarity
🞂 Transformations
◦ Example: consider the transformation d' = d / (1 + d) for a dissimilarity measure that ranges from 0 to ∞. The dissimilarities 0, 0.5, 2, 10, 100, 1000 become 0, 0.33, 0.67, 0.91, 0.99, 0.999 respectively.
◦ Transforming similarities to dissimilarities, and vice versa, is straightforward, but the meaning of the proximity measure should be preserved.
◦ One approach: if the similarity (dissimilarity) falls in the interval [0, 1], then the dissimilarity (similarity) can be defined as d = 1 − s (or s = 1 − d).
◦ Another approach is negation: the dissimilarities 0, 1, 10, 100 can be transformed into the similarities 0, −1, −10, −100 respectively.
Measures of Similarity and Dissimilarity
🞂 Transformations
◦ Besides negation, other transformations from dissimilarity to similarity include:
◦ s = 1 / (d + 1)
◦ s = e^(−d)
◦ s = 1 − (d − min_d) / (max_d − min_d)
◦ Example: the dissimilarities 0, 1, 10, 100 are transformed
  by s = 1 / (d + 1) into 1, 0.5, 0.09, 0.01;
  by s = e^(−d) into 1.00, 0.37, 0.00, 0.00;
  by s = 1 − (d − min_d)/(max_d − min_d) into 1.00, 0.99, 0.90, 0.00.
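A small illustrative Python sketch applying the three transformations above to the same dissimilarities.

```python
# Illustrative sketch: dissimilarity-to-similarity transformations for 0, 1, 10, 100.
import math

d_values = [0, 1, 10, 100]
min_d, max_d = min(d_values), max(d_values)

for d in d_values:
    s1 = 1 / (d + 1)
    s2 = math.exp(-d)
    s3 = 1 - (d - min_d) / (max_d - min_d)
    print(f"d={d:>3}  1/(d+1)={s1:.2f}  e^-d={s2:.2f}  linear={s3:.2f}")
# d=  0  1/(d+1)=1.00  e^-d=1.00  linear=1.00
# d=  1  1/(d+1)=0.50  e^-d=0.37  linear=0.99
# d= 10  1/(d+1)=0.09  e^-d=0.00  linear=0.90
# d=100  1/(d+1)=0.01  e^-d=0.00  linear=0.00
```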
Similarity/Dissimilarity for Simple Attributes
The similarity and dissimilarity between two objects, x and y, with respect to a single, simple attribute depend on the attribute type:
◦ Nominal: d = 0 if x = y, d = 1 if x ≠ y; s = 1 − d.
◦ Ordinal: map the values to integers 0 to n−1; d = |x − y| / (n − 1); s = 1 − d.
◦ Interval or ratio: d = |x − y|; s = −d, s = 1/(1 + d), s = e^(−d), or s = 1 − (d − min_d)/(max_d − min_d).
Similarity and Dissimilarity Between Simple Attributes
◦ Example: consider objects with a single ordinal attribute.
◦ A dairy milk chocolate has an attribute quality on the scale {poor, fair, OK, good, wonderful}.
◦ There are 3 objects, P1, P2, P3, where P1 → good, P2 → OK, P3 → fair.
◦ To make observations quantitative, the values of the ordinal attribute are mapped to integers:
  {poor = 0, fair = 1, OK = 2, good = 3, wonderful = 4}
◦ d(P1, P2) = 3 − 2 = 1,
  or
◦ d(P1, P2) = (3 − 2)/(5 − 1) = 0.25, if we want the dissimilarity to fall between 0 and 1.
Dissimilarities Between Data Objects – Distance
🞂 Euclidean distance:
  d(x, y) = sqrt( Σ_{k=1..n} (x_k − y_k)^2 )
🞂 where n is the number of dimensions (attributes) and x_k and y_k are, respectively, the kth attributes (components) of data objects x and y.
Minkowski Distance
🞂 The Minkowski distance is a generalization of the Euclidean distance:
  d(x, y) = ( Σ_{k=1..n} |x_k − y_k|^r )^(1/r)
🞂 where r is a parameter, n is the number of dimensions (attributes), and x_k and y_k are, respectively, the kth attributes (components) of data objects x and y.
Minkowski Distance: Examples
🞂 r = 1: city block (Manhattan, taxicab, L1 norm) distance.
● Example: the Hamming distance, the number of positions in which two binary vectors differ.
🞂 r = 2: Euclidean (L2 norm) distance.
🞂 r → ∞: "supremum" (Lmax norm, L∞ norm) distance, the maximum difference between any attribute of the two objects.
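A small illustrative Python sketch, assuming numpy, of the three special cases of the Minkowski distance; the two points are made up.

```python
# Illustrative sketch: Minkowski distances for r = 1, r = 2, and r -> infinity.
import numpy as np

x = np.array([0.0, 2.0])
y = np.array([3.0, 6.0])
diff = np.abs(x - y)

l1 = diff.sum()                     # r = 1: 3 + 4 = 7
l2 = np.sqrt((diff ** 2).sum())     # r = 2: sqrt(9 + 16) = 5
l_inf = diff.max()                  # r -> inf: max(3, 4) = 4

print(l1, l2, l_inf)                # 7.0 5.0 4.0
```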


Common Properties of a Distance
🞂 Distances, such as the Euclidean distance, have some well-known properties:
1. d(x, y) ≥ 0 for all x and y, and d(x, y) = 0 only if x = y. (Positivity)
2. d(x, y) = d(y, x) for all x and y. (Symmetry)
3. d(x, z) ≤ d(x, y) + d(y, z) for all points x, y, and z. (Triangle inequality)
🞂 where d(x, y) is the distance (dissimilarity) between points (data objects) x and y.
🞂 A distance measure that satisfies these properties is called a metric.
Examples
🞂 Non-metric dissimilarities: set differences
◦ We can define the distance d between two sets A and B as
  d(A, B) = size(A − B) + size(B − A),
  where size returns the number of elements in the set.
🞂 A = {1, 2, 3, 4} and B = {2, 3, 4}
  Then A − B = {1} and B − A = ∅ (the empty set),
  so d(A, B) = 1 + 0 = 1.
Similarities Between Data Objects
🞂 Similarities also have some well-known properties:
1. s(x, y) = 1 (or the maximum similarity) only if x = y (typically 0 ≤ s ≤ 1).
2. s(x, y) = s(y, x) for all x and y. (Symmetry)
🞂 where s(x, y) is the similarity between points (data objects) x and y.
Examples of Proximity Measures
Similarity Measures for Binary Data
🞂 Similarity measures between objects that contain only binary attributes are called similarity coefficients, and typically have values between 0 (not similar) and 1 (completely similar).
🞂 Let x and y be two objects that consist of n binary attributes, and let
  f00 = number of attributes where x = 0 and y = 0
  f01 = number of attributes where x = 0 and y = 1
  f10 = number of attributes where x = 1 and y = 0
  f11 = number of attributes where x = 1 and y = 1

Simple Matching Coefficient:
  SMC = (f11 + f00) / (f00 + f01 + f10 + f11)
Example: find students who answered questions similarly on a test with True/False questions.
Jaccard Coefficient
🞂 Used to handle objects consisting of asymmetric binary attributes:
  J = f11 / (f01 + f10 + f11)
🞂 Suppose that x and y are data objects that represent two rows (two transactions) of a transaction matrix.
🞂 If each asymmetric binary attribute corresponds to an item in a store, then a 1 indicates that the item was purchased, while a 0 indicates that it was not.
SMC versus Jaccard: Example
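An illustrative Python sketch of the comparison; the two binary vectors are made up, but mostly-zero vectors show why Jaccard ignores 0–0 matches while SMC does not.

```python
# Illustrative sketch: SMC versus Jaccard for one pair of binary vectors.
x = [1, 0, 0, 0, 0, 0, 0, 0, 0, 0]
y = [0, 0, 0, 0, 0, 0, 1, 0, 0, 1]

f00 = sum(1 for a, b in zip(x, y) if a == 0 and b == 0)
f01 = sum(1 for a, b in zip(x, y) if a == 0 and b == 1)
f10 = sum(1 for a, b in zip(x, y) if a == 1 and b == 0)
f11 = sum(1 for a, b in zip(x, y) if a == 1 and b == 1)

smc = (f11 + f00) / (f00 + f01 + f10 + f11)
jaccard = f11 / (f01 + f10 + f11)

print(f"SMC={smc:.2f}, Jaccard={jaccard:.2f}")   # SMC=0.70, Jaccard=0.00
```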
Cosine Similarity
🞂 Documents are often represented as vectors,
🞂 where each component (attribute) represents the frequency with which a particular term (word) occurs in the document.
🞂 The similarity between two documents depends on the words that appear in both documents.
🞂 cos(0°) = 1: the documents are similar;
🞂 cos(90°) = 0: the documents are dissimilar (share no terms).
Cosine Similarity
🞂 Cosine similarity between two document vectors x and y:
  cos(x, y) = (x · y) / (||x|| ||y||)
  where · is the vector dot product and ||x|| is the length (Euclidean norm) of vector x.
● Dividing x and y by their lengths normalizes them to have length 1, so cosine similarity does not take the magnitudes of the two data objects into account when computing similarity.
● When magnitude is important, a computation using the Euclidean distance is a better choice.
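A brief illustrative sketch, assuming numpy, of the cosine formula above on two term-frequency vectors made up for the example.

```python
# Illustrative sketch: cosine similarity between two document term-frequency vectors.
import numpy as np

doc1 = np.array([3, 2, 0, 5, 0, 0, 0, 2, 0, 0])
doc2 = np.array([1, 0, 0, 0, 0, 0, 0, 1, 0, 2])

cosine = doc1 @ doc2 / (np.linalg.norm(doc1) * np.linalg.norm(doc2))
print(round(cosine, 3))    # 0.315 — the documents share only a few terms
```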
Extended Jaccard Coefficient (Tanimoto Coefficient)
🞂 Used for document data:
  EJ(x, y) = (x · y) / (||x||² + ||y||² − x · y)
🞂 It reduces to the Jaccard coefficient in the case of binary attributes.
Correlation
● The correlation between two data objects measures the linear relationship between the attributes of the objects.
● Correlation can measure the relationship
○ between 2 variables, e.g., height and weight;
○ between 2 objects, e.g., two temperature time series.
● It is more often used to measure the similarity between attributes, since the values in two data objects come from different attributes, which can have different attribute types and scales.

Pearson's correlation between two sets of numerical values x and y:
  corr(x, y) = covariance(x, y) / (std_dev(x) · std_dev(y)) = s_xy / (s_x s_y)
  where s_xy = (1/(n−1)) Σ_{k=1..n} (x_k − mean(x)) (y_k − mean(y)).
Visually Evaluating Correlation
[Figure: scatter plots of 30 pairs of values, randomly generated from a normal distribution, showing correlations ranging from −1 to 1.]
● If x and y are transformed into x' and y' by shifting and scaling (with a positive scale factor), the correlation is unchanged: corr(x, y) = corr(x', y').
● Cosine similarity, in contrast, is not invariant to such transformations: in general cos(x, y) ≠ cos(x', y').
Drawback of Correlation
🞂 x = (−3, −2, −1, 0, 1, 2, 3)
🞂 y = (9, 4, 1, 0, 1, 4, 9)
🞂 mean(x) = 0, mean(y) = 4
🞂 std(x) = 2.16, std(y) = 3.74

cov(x, y) = [(−3)(5) + (−2)(0) + (−1)(−3) + (0)(−4) + (1)(−3) + (2)(0) + (3)(5)] / (7 − 1) = 0

corr(x, y) = cov(x, y) / (std(x) · std(y)) = 0

🞂 If corr = 0, there is no linear relationship between the two sets of values.
🞂 Here, even though y_i = x_i² (a perfect nonlinear relationship), the correlation is 0.
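A one-line illustrative check, assuming numpy, of the example above.

```python
# Illustrative sketch: the correlation of x and y = x**2 is 0 even though the two
# are perfectly (nonlinearly) related.
import numpy as np

x = np.array([-3, -2, -1, 0, 1, 2, 3], dtype=float)
y = x ** 2

corr = np.corrcoef(x, y)[0, 1]     # Pearson correlation from the 2x2 correlation matrix
print(round(corr, 4))              # 0.0 — correlation only detects linear relationships
```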
Bregman Divergence
● Used as loss or distortion functions.
● Assume x and y are two points, where y is the original point and x is a distortion or approximation of it.
● The Bregman divergence measures the resulting distortion or loss if y is approximated by x; the more similar x and y are, the smaller the loss.
● It is also used as a dissimilarity function.
● Formally, given a strictly convex function φ, the Bregman divergence is
  D(x, y) = φ(x) − φ(y) − ⟨∇φ(y), x − y⟩,
  i.e., the error of approximating φ(x) by the first-order Taylor expansion of φ around y.
Issues in Proximity Calculation
● Issues related to proximity measures are:

○ How to handle when attributes have different scales (range


of values)

○ How to calculate proximity when different attribute types


i.e quantitative, qualitative

○ How to handle proximity calculations when attributes


have different weights(not all attributes contribute equally
to the proximity of objects)
Standardization and Correlation for Distance Measures: Mahalanobis Distance
🞂 A generalization of the Euclidean distance.
🞂 Useful when the attributes are correlated, have different ranges of values (different variances), and the distribution of the data is approximately Gaussian.
🞂 The Mahalanobis distance between two objects (vectors) x and y is:
  mahalanobis(x, y) = (x − y) Σ⁻¹ (x − y)ᵀ
🞂 where Σ is the covariance matrix of the data.
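An illustrative sketch, assuming numpy, of the Mahalanobis formula above; the data set used to estimate the covariance matrix is generated only for the example.

```python
# Illustrative sketch: Mahalanobis distance between two points, using the covariance
# matrix estimated from a made-up data set.
import numpy as np

rng = np.random.default_rng(0)
data = rng.multivariate_normal(mean=[0, 0], cov=[[3, 1], [1, 2]], size=500)

cov_inv = np.linalg.inv(np.cov(data, rowvar=False))   # inverse covariance of the data

x = np.array([1.0, 0.5])
y = np.array([-1.0, 1.5])
diff = x - y
mahal = diff @ cov_inv @ diff                         # (x - y) Sigma^-1 (x - y)^T
print(round(float(mahal), 3))
```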


Combining Similarities for Heterogeneous Attributes
🞂 When the attributes are of different types, the overall similarity is computed by combining per-attribute similarities:
🞂 the similarity of each attribute is computed using the similarity/dissimilarity definitions for nominal, ordinal, interval, and ratio attributes discussed previously, and is transformed to the range 0 to 1;
🞂 the per-attribute similarities are then averaged. A plain average does not work well for asymmetric attributes, so attributes where both objects have a value of 0 are typically skipped in the average.
Using Weights
🞂 So far, all attributes were treated equally when computing proximity.
🞂 However, some attributes are more important than others.
🞂 To reflect this, the previous formulas can be modified by weighting the contribution of each attribute, e.g.,
  similarity(x, y) = Σ_k w_k s_k(x, y), where the weights w_k sum to 1.
Selecting the Right proximity Measure
🞂The following are a few general observations that may be helpful:

🞂First, the type of proximity measure should fit the type of data.
For many types of dense, continuous data, metric distance
measures such as Euclidean distance are often used.

🞂 Proximity between continuous attributes is most often expressed in terms of differences, and distance measures provide a well-defined way of combining these differences into an overall proximity measure.

🞂For sparse data, which often consists of asymmetric attributes, we


employ similarity measures that ignore 0–0 matches.
Selecting the Right Proximity Measure
🞂 Such sparse objects have only a few of the characteristics described by the attributes, and thus are highly similar in terms of the characteristics they do not have. The cosine, Jaccard, and extended Jaccard measures are appropriate for such data.

🞂There are other characteristics of data vectors that may need to


be considered. Suppose, for example, that we are interested in
comparing time series.

🞂If the magnitude of the time series is important (for example,


each time series represent total sales of the same organization
for a different year), then we could use Euclidean distance.
Selecting the Right proximity Measure

🞂 If the time series represent different quantities (for


example, blood pressure and oxygen consumption), then
we usually want to determine if the time series have the
same shape, not the same magnitude.

🞂 Correlation, which uses a built-in normalization that


accounts for differences in magnitude and level, would be
more appropriate.
Selecting the Right proximity Measure

🞂 In some cases, transformation or normalization of the data


is important for obtaining a proper similarity measure
since such transformations are not always present in
proximity measures.

🞂 For instance, time series may have trends or periodic


patterns that significantly impact similarity. Also, a proper
computation of similarity may require that time lags be
taken into account.
