
Data Warehousing & Modeling
Subject Code: 18CS641
Module 2: Data warehouse implementation & Data mining
Efficient Data Cube Computation: An Overview; Indexing OLAP Data: Bitmap Index and Join Index; Efficient Processing of OLAP Queries; OLAP Server Architectures: ROLAP versus MOLAP versus HOLAP.
Introduction: What Is Data Mining, Challenges, Data Mining Tasks; Data: Types of Data, Data Quality, Data Preprocessing, Measures of Similarity and Dissimilarity.
Textbook 2: Ch. 4.4
Textbook 1: Ch. 1.1, 1.2, 1.4, 2.1 to 2.4
Data Warehouse Implementation
🞂 Data warehouses contain huge amounts of data.
🞂 OLAP servers must answer decision-support queries within seconds.
🞂 It is therefore crucial for data warehouse systems to support highly efficient cube computation techniques, access methods, and query processing techniques.
Efficient Data Cube Computation: An Overview
🞂 At the core of multidimensional data analysis is the efficient computation of aggregations across many sets of dimensions.
🞂 In SQL terms, these aggregations are referred to as group-bys.
🞂 Each group-by can be represented by a cuboid.
🞂 The set of group-bys forms a lattice of cuboids defining a data cube.
🞂 Issues related to the efficient computation of data cubes are as follows:
Efficient Data Cube Computation: "The compute cube Operator"
🞂 SQL extends cube computation by including a compute cube operator.
🞂 The compute cube operator computes aggregates over all subsets of the dimensions specified in the operation.
🞂 This can require excessive storage space, especially for a large number of dimensions.
🞂 A data cube is a lattice of cuboids.
Efficient Data Cube Computation: "The compute cube" Operator
🞂 Example: You would like to create a data cube for All_Electronics that contains the following: item, city, year, and sales_in_dollars.
🞂 Analyze the data with the following queries:
◦ Compute the sum of sales, grouping by item and city
◦ Compute the sum of sales, grouping by item
◦ Compute the sum of sales, grouping by city
Efficient Data Cube Computation: "The compute cube" Operator
🞂 The total number of cuboids, or group-bys, that can be formed is 2^3 = 8:
◦ {(city, item, year),
◦ (city, item), (city, year), (item, year),
◦ (city), (item), (year),
◦ ()}
🞂 (): the group-by is empty (the dimensions are not grouped).
◦ These group-bys form a lattice of cuboids for the data cube, as shown in the figure below.
◦ The base cuboid contains all three dimensions.
◦ The apex cuboid (0-D) corresponds to the empty group-by.
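A minimal, illustrative Python sketch (not from the textbook) of how the 2^n group-bys produced by compute cube could be enumerated and aggregated; the pandas DataFrame, its column names, and the sample rows are assumptions made only for this example.

```python
# Illustrative sketch: enumerating all 2^n group-bys that "compute cube" would
# materialize, using pandas. Column names and sample data are assumed.
from itertools import combinations
import pandas as pd

sales = pd.DataFrame({
    "item": ["TV", "TV", "phone", "phone"],
    "city": ["Chicago", "Toronto", "Chicago", "Toronto"],
    "year": [2009, 2009, 2010, 2010],
    "sales_in_dollars": [400, 350, 300, 250],
})

dims = ["item", "city", "year"]
cuboids = {}
for k in range(len(dims), -1, -1):          # from the base cuboid down to the apex
    for group in combinations(dims, k):
        if group:                            # an ordinary group-by
            cuboids[group] = sales.groupby(list(group))["sales_in_dollars"].sum()
        else:                                # apex cuboid: the empty group-by ()
            cuboids[group] = sales["sales_in_dollars"].sum()

print(len(cuboids))                          # 2**3 = 8 cuboids
print(cuboids[("item", "city")])             # one of the 2-D cuboids
```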
Efficient Data Cube Computation: "The compute cube" Operator
[Figure: lattice of cuboids for the dimensions city, item, and year.]
● Apex (0-D) cuboid: holds the total sum of all sales; it is the most generalized (least specific) cuboid. Moving downward from the apex corresponds to drilling down.
● Base cuboid: the least generalized (most specific) cuboid. Moving upward toward the apex corresponds to rolling up.
Efficient Data Cube Computation: "The compute cube" Operator
🞂 An SQL query containing no group-by is a 0-D operation.
  Example: compute the sum of total sales.
🞂 A cube operator on n dimensions is equivalent to a collection of group-by statements, one for each subset of the n dimensions.
🞂 Therefore, the cube operator is the n-dimensional generalization of the group-by operator.
🞂 Similar to the SQL syntax, the data cube could be defined in DMQL as:
◦ define cube sales [item, city, year]: sum (sales_in_dollars)
🞂 Compute the sales aggregate cuboids as:
◦ compute cube sales
A Data Cube: sales

City \ Item     I1   I2   I3   I4   I5   I6   All
New York City   10   11   12    3   10    1    47
Chicago         11    9    6    9    6    7    48
Toronto         12    9    8    5    7    3    44
Vancouver       13    8   10    5    6    3    45
All             46   37   36   22   29   14   184

● The 4 x 6 block of values is the item–city (base) cuboid; each of its entries is a base cell.
● The "All" column is the city cuboid and the "All" row is the item cuboid; their entries are aggregate cells.
● The single value 184 at (All, All) is the apex cuboid.
Efficient Data Cube Computation: "The Curse of Dimensionality"
• Online analytical processing is fastest when the aggregates for all the cuboids are precomputed:
🞂 fast response time
🞂 avoids redundant computation
🞂 However, pre-computation of the full cube requires an excessive amount of memory, which grows with the number of dimensions and the cardinality of each dimension.
🞂 This is called the curse of dimensionality.
Efficient Data Cube Computation: "The Curse of Dimensionality"
🞂 How many cuboids are there in an n-dimensional data cube, taking dimension cardinality (the number of elements in a given dimension) and concept hierarchies into account?
🞂 Many dimensions may have hierarchies, for example time:
● day < month < quarter < year
🞂 The total number of cuboids that can be generated is:
  Total cuboids = (L1 + 1) × (L2 + 1) × … × (Ln + 1)
🞂 where Li is the number of levels for dimension i.
🞂 The "+1" accounts for the virtual top level "all", i.e., removal of the dimension.
🞂 For the time hierarchy above there are 4 levels, so 5 choices (including "all") for that dimension.
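A small illustrative Python sketch of the cuboid-count formula above; the hierarchy sizes used are assumptions for the example, not values from the textbook.

```python
# Illustrative sketch: counting cuboids for an n-dimensional cube with concept
# hierarchies, using Total = (L1 + 1) * (L2 + 1) * ... * (Ln + 1).
from math import prod

# Assumed hierarchy levels per dimension, e.g. time: day<month<quarter<year (4),
# item: item_name<brand<type (3), location: street<city<state<country (4).
levels = [4, 3, 4]

total_cuboids = prod(l + 1 for l in levels)   # "+1" is the virtual "all" level
print(total_cuboids)                          # (4+1)*(3+1)*(4+1) = 100
```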
Materialization
🞂 A materialized view physically stores the precomputed results of a query (unlike an ordinary view, which is virtual).
🞂 Data cube materialization/pre-computation involves three choices:
◦ No materialization: do not precompute any of the non-base cuboids. This leads to multidimensional aggregation on the fly, which can be very slow.
◦ Full materialization: precompute all of the cuboids (the full cube). Queries run very fast, but this requires huge amounts of memory.
◦ Partial materialization: selectively compute a proper subset of the cuboids, or a subcube that contains only those cells that satisfy some user-specified criterion.
Types of cubes
■ Full cube: all cells of all cuboids are materialized, i.e., every possible combination of dimensions and values.
■ Iceberg cube: partial materialization; only the cells of a cuboid whose aggregate value (e.g., count) is above a minimum support threshold are materialized.
  count(*) >= min_sup   (the iceberg condition)
■ Shell cube: precompute the cuboids for only a small number of dimensions (e.g., three to five) of a data cube; the remaining queries are computed on the fly.
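An illustrative sketch, assuming pandas, of materializing an iceberg cuboid by applying the iceberg condition; the sample table, column names, and min_sup value are invented for the example.

```python
# Illustrative sketch: keep only group-by cells whose count meets a minimum
# support threshold (the iceberg condition).
import pandas as pd

sales = pd.DataFrame({
    "item": ["TV", "TV", "TV", "phone", "phone", "laptop"],
    "city": ["Chicago", "Chicago", "Toronto", "Chicago", "Toronto", "Toronto"],
})

min_sup = 2
counts = sales.groupby(["item", "city"]).size()     # full (item, city) cuboid
iceberg_cuboid = counts[counts >= min_sup]           # iceberg condition: count >= min_sup
print(iceberg_cuboid)                                # only (TV, Chicago) survives
```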
Indexing OLAP Data
🞂 To support efficient OLAP query processing, data warehouses use index structures such as:
🞂 Bitmap index — popular in OLAP products because it allows quick searching in data cubes.
🞂 Join index
Indexing OLAP Data: Bitmap Index
🞂 The bitmap index is an alternative representation of the record ID
(RID) list.

🞂 In the bitmap index for a given attribute, there is a distinct bit


vector, Bv, for each value v in the attribute’s domain.

🞂 If a given attribute’s domain consists of n values, then n bits are


needed for each entry in the bitmap index (i.e., there are n bit
vectors).

🞂 If the attribute has the value v for a given row in the data table,
then the bit representing that value is set to 1 in the corresponding
row of the bitmap index. All other bits for that row are set to 0.
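A minimal illustrative Python sketch of the bitmap index just described; the sample attribute values are made up for the example.

```python
# Illustrative sketch: a simple bitmap index for one attribute. For each value v
# in the attribute's domain there is one bit vector; bit i is 1 if row i has v.
rows = ["Chicago", "Toronto", "Chicago", "Vancouver", "Toronto"]   # attribute "city"

domain = sorted(set(rows))
bitmap_index = {
    value: [1 if city == value else 0 for city in rows]
    for value in domain
}

for value, bits in bitmap_index.items():
    print(value, bits)
# Chicago   [1, 0, 1, 0, 0]
# Toronto   [0, 1, 0, 0, 1]
# Vancouver [0, 0, 0, 1, 0]
```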
Bitmap Index Advantages
🞂 It is efficient compared with hash and tree indices.
🞂 It is especially useful for low-cardinality domains (few unique values, e.g., semester, gender), because the processing time for comparison, join, and aggregation operations is reduced.
🞂 It leads to significant reductions in space and I/O, since a string of characters can be represented by a single bit.
🞂 It can be adapted to higher-cardinality domains (many possible values for an attribute, e.g., customer ID, USN) by using compression techniques.
Indexing OLAP Data: Join Index
🞂 Traditional indexing maps the value in a given column to a list of rows having that value.
🞂 In contrast, join indexing registers the joinable rows of two relations from a relational database.
🞂 In data warehouses, a join index relates the values of the dimensions of a star schema to rows in the fact table.
🞂 E.g., fact table Sales and two dimensions, location and item:
🞂 a join index on location maintains, for each distinct location, the list of record IDs (RIDs) of the fact-table tuples recording sales in that city.
🞂 Join indices can span multiple dimensions, forming a composite join index.
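A short illustrative Python sketch of a join index on the location dimension; the fact-table rows and RIDs are assumptions for the example.

```python
# Illustrative sketch: a join index mapping each dimension value (location) to the
# list of fact-table row IDs (RIDs) that join with it. Table contents are assumed.
sales_fact = [
    {"rid": "T57",  "location": "Main Street", "item": "Sony-TV",      "amount": 400},
    {"rid": "T238", "location": "Main Street", "item": "Panasonic-TV", "amount": 350},
    {"rid": "T884", "location": "Lakeview",    "item": "Sony-TV",      "amount": 300},
]

join_index_on_location = {}
for row in sales_fact:
    join_index_on_location.setdefault(row["location"], []).append(row["rid"])

print(join_index_on_location)
# {'Main Street': ['T57', 'T238'], 'Lakeview': ['T884']}
```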
Efficient Processing of OLAP Queries
🞂 The purpose of materializing cuboids and constructing OLAP index structures is to speed up query processing on data cubes. Given a query:
1. Determine which operations should be performed on the available cuboids:
◦ transform the selection, projection, roll-up (group-by), and drill-down operations specified in the query into corresponding OLAP operations.
Efficient Processing of OLAP Queries
2. Determine to which materialized cuboid(s) the relevant OLAP operations should be applied:
◦ choose the one with the lowest cost.

Example:
🞂 Suppose that we define a data cube for AllElectronics of the form:
  "sales cube [time, item, location]: sum(sales in dollars)."
🞂 The dimension hierarchies used are:
🞂 "day < month < quarter < year" for time
🞂 "item name < brand < type" for item
🞂 "street < city < province or state < country" for location.
Efficient Processing of OLAP Queries
Let the query to be processed be on {brand, province_or_state} with the condition "year = 2010", and suppose there are 4 materialized cuboids available:

1) {year, item_name, city}
2) {year, brand, country}
3) {year, brand, province_or_state}
4) {item_name, province_or_state} where year = 2010

Which should be selected to process the query?
Cuboid 2 cannot be used, because country cannot be drilled down to province_or_state. Cuboids 1, 3, and 4 can all answer the query; in general, the finer-granularity cuboids (1 and 4) involve more records, so cuboid 3 is usually preferred unless efficient indexes make cuboid 4 cheaper.

OLAP Server Architectures:

🞂 Implementations of a warehouse server for OLAP


processing includes:

🞂 Relational OLAP (ROLAP)Servers


🞂 Multidimensional OLAP (MOLAP) Servers
🞂Hybrid OLAP (HOLAP) Servers
🞂Specialized SQL Servers
OLAP Server Architectures: Relational OLAP
(ROLAP)
🞂 These are the intermediate servers that stand between a relational
back-end server and client front-end tools.
🞂 Uses a relational or extended-relational DBMS to store and manage
warehouse data.
🞂 ROLAP works directly with relational database.
🞂 Has greater scalability than MOLAP.
🞂 ROLAP Server includes optimization for each DBMS back end,
implementation of aggregation navigation logic, and additional tools
and services
🞂 ROLAP tools do not use pre-calculated data cubes but instead pose
the query to the standard relational database.
OLAP Server Architectures: Relational OLAP (ROLAP)
🞂 ROLAP tools feature the ability to ask any question, because the methodology is not limited to the contents of a cube.
🞂 ROLAP also has the ability to drill down to the lowest level of detail in the database.
🞂 ROLAP uses relational tables to store data for OLAP.
🞂 The fact table associated with a base cuboid is called a base fact table.
🞂 The base fact table stores data at the abstraction level indicated by the join keys in the schema for the given data cube.
🞂 Aggregated data stored in fact tables are called summary fact tables.
OLAP Server Architectures: Multidimensional OLAP (MOLAP)
🞂 MOLAP stores data in optimized multidimensional array storage (the data cube); these are array-based multidimensional storage engines.
🞂 The advantage of using a data cube is that it allows fast indexing to precomputed summarized data.
🞂 MOLAP tools have a very fast response time.
🞂 The data cube contains all the possible answers to a given range of questions.
🞂 MOLAP servers adopt a two-level storage representation to handle dense data sets (most cells populated) and sparse data sets (most cells empty).
🞂 Sparse subcubes employ compression technology for efficient storage utilization.
OLAP Server Architectures: Hybrid OLAP (HOLAP)
🞂 Combines ROLAP and MOLAP technology.
🞂 A HOLAP server may allow large volumes of detailed data to be stored in a relational database,
🞂 while aggregations are kept in a separate MOLAP store.
🞂 HOLAP tools can utilize both pre-calculated cubes and relational data sources.
🞂 The hybrid OLAP approach benefits from the greater scalability of ROLAP and the faster computation of MOLAP.
🞂 Microsoft SQL Server 2000 supports a hybrid OLAP server.
Specialized SQL servers
🞂 To meet the growing demand for OLAP processing in relational databases, some database system vendors implement specialized SQL servers.
🞂 They provide advanced query language and query processing support for SQL queries over star and snowflake schemas in a read-only environment.
What Is Data Mining?( Definition)
🞂Data mining is the process of automatically discovering
useful information in large data repositories.

🞂Finding hidden information in a database.

🞂Data mining techniques are deployed to scour large


databases in order to find novel and useful patterns that
might otherwise remain unknown.

🞂They also provide capabilities to predict the outcome of


a future observation.
What is (not) Data Mining?
🞂 Looking up individual records using a database
management system.

🞂 Finding particular Web pages via a query to an Internet


search engine.

🞂 Above are tasks related to the area of information


retrieval.
Data Mining and Knowledge Discovery
● Data mining is an integral part of
Knowledge Discovery in Databases (KDD), which is the
overall process of converting raw data into useful information.
Data Preprocessing
🞂 The input data can be stored in a variety of formats (flat files, spreadsheets, or relational tables).
🞂 The purpose of preprocessing is to transform the raw input data into an appropriate format for subsequent analysis.
🞂 It includes:
🞂 fusing data from multiple sources,
🞂 cleaning data to remove noise and duplicate observations,
🞂 selecting records and features that are relevant to the data mining task at hand.
Post processing
🞂 Ensures that only valid and useful results are incorporated into the Decision Support System (DSS).
🞂 Visualization, a common postprocessing step, allows analysts to explore the data and the data mining results from a variety of viewpoints.
🞂 Statistical testing methods can also be applied during postprocessing to eliminate spurious data mining results.
Challenges that motivated the development of data mining
🞂 Scalability: data mining algorithms must handle massive data sets and provide efficient data access.
🞂 High dimensionality: data sets with thousands of attributes.
  Ex: temperature measurements at many locations, bioinformatics data.
🞂 Heterogeneous and complex data.
  Ex: climate time series (temperature, pressure, etc.); web pages and social media with semi-structured text and hyperlinks.
🞂 Data ownership and distribution: data is not stored in one location or owned by one organization.
  Ex: the Flipkart data warehouse.
🞂 Non-traditional analysis: the need to automate the process of hypothesis generation and evaluation.
Data Mining Tasks
🞂 Predictive tasks: predict the value of a particular attribute based on the values of other attributes.
◦ The attribute to be predicted is the target or dependent variable;
◦ the attributes used for prediction are the explanatory or independent variables.
🞂 Descriptive tasks: derive patterns that summarize the underlying relationships in data.
◦ Postprocessing techniques are used to validate and explain the results.
Data Mining Tasks …
[Figure: the four core data mining tasks — predictive modeling, cluster analysis, association analysis, and anomaly detection — applied to a data set.]
1. Predictive Modeling
🞂 Refers to the task of building a model for the target variable as a function of the explanatory variables.
🞂 There are two types of predictive modeling tasks:
🞂 classification, which is used for discrete target variables (e.g., number of students in a class);
🞂 regression, which is used for continuous target variables (e.g., weight and height of students in a class).
🞂 Example of classification: predicting whether a web user will make a purchase at an online book store.
🞂 Example of regression: forecasting the future price of a stock.
Predictive Modeling: Classification
🞂 Find a model for the class attribute as a function of the values of the other attributes (e.g., a model for predicting credit worthiness).
🞂 Given a collection of records (training set )

🞂 Each record contains a set of attributes, one of the attributes is


the class.

🞂 Find a model for class attribute as a function of the values of


other attributes.

🞂 Goal: previously unseen records should be assigned a class as


accurately as possible.

🞂 A test set is used to determine the accuracy of the model.

🞂 Usually, the given data set is divided into training and test sets,
with training set used to build the model and test set used to
validate it.
Classification Example
[Figure: a training set of records with categorical and quantitative attributes plus a class label is used to learn a classifier model; the model is then applied to a test set whose class labels are unknown.]
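An illustrative sketch, assuming scikit-learn is available, of the train/test workflow described above; the toy records and attribute meanings are invented for the example.

```python
# Illustrative sketch: learn a classifier from a training set and evaluate it on a
# held-out test set. The toy data is made up.
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import accuracy_score

# Attributes: [annual_income_in_k, years_employed]; class: 1 = creditworthy, 0 = not
X = [[60, 5], [20, 1], [85, 10], [30, 2], [95, 12], [25, 1], [70, 7], [15, 0]]
y = [1, 0, 1, 0, 1, 0, 1, 0]

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=0)

model = DecisionTreeClassifier().fit(X_train, y_train)   # learn a model from the training set
predictions = model.predict(X_test)                       # assign classes to unseen records
print("test accuracy:", accuracy_score(y_test, predictions))
```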
2. Cluster Analysis
🞂 Finding groups of objects such that the objects in a group are similar (or related) to one another and different from (or unrelated to) the objects in other groups.
🞂 Intra-cluster distances are minimized; inter-cluster distances are maximized.

Example: Document Clustering
🞂 Clustering has been used to group sets of related customers, find areas of the ocean that have a significant impact on the Earth's climate, and compress data.
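A brief illustrative sketch, assuming scikit-learn, of grouping objects so that intra-cluster distances are small and inter-cluster distances are large; the 2-D points are made up.

```python
# Illustrative sketch: k-means clustering on a few 2-D points.
from sklearn.cluster import KMeans

points = [[1.0, 1.1], [1.2, 0.9], [0.8, 1.0],     # one group of nearby points
          [5.0, 5.2], [5.1, 4.9], [4.8, 5.0]]     # a second, well-separated group

kmeans = KMeans(n_clusters=2, n_init=10, random_state=0).fit(points)
print(kmeans.labels_)           # e.g. [0 0 0 1 1 1] — each point's cluster assignment
print(kmeans.cluster_centers_)  # the two cluster centroids
```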
3. Association Analysis: Definition
🞂 Given a set of records, each of which contains some number of items from a given collection,
🞂 association analysis is used to discover patterns that describe strongly associated features in the data.
🞂 The discovered patterns are typically represented in the form of implication rules or feature subsets.
Association Analysis: Applications
🞂 Market-basket analysis:
◦ Rules are used for sales promotion, shelf management, and inventory management.

Rules discovered:
{Milk} --> {Bread}
{Diapers} --> {Milk}
{Diapers, Milk} --> {Coke}

Example:
🞂
Finding groups of genes that have related functionality

🞂
Identifying Web pages that are accessed together

🞂
Understanding the relationships between different elements of Earth's
climate system.
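An illustrative Python sketch of how one association rule can be evaluated by its support and confidence; the toy transactions are invented for the example.

```python
# Illustrative sketch: support and confidence of the rule {Diapers} -> {Milk}
# over toy market-basket transactions.
transactions = [
    {"Bread", "Milk"},
    {"Bread", "Diapers", "Beer", "Eggs"},
    {"Milk", "Diapers", "Beer", "Coke"},
    {"Bread", "Milk", "Diapers", "Beer"},
    {"Bread", "Milk", "Diapers", "Coke"},
]

antecedent, consequent = {"Diapers"}, {"Milk"}
n = len(transactions)
count_both = sum(1 for t in transactions if antecedent | consequent <= t)
count_antecedent = sum(1 for t in transactions if antecedent <= t)

support = count_both / n                      # fraction of transactions containing both
confidence = count_both / count_antecedent    # how often the rule holds when Diapers appear
print(f"support={support:.2f}, confidence={confidence:.2f}")   # support=0.60, confidence=0.75
```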
4. Deviation/Anomaly/Change Detection
🞂 The task of identifying observations whose characteristics are significantly different from the rest of the data.
🞂 Such observations are known as anomalies or outliers.
🞂 The goal of an anomaly detection algorithm is to discover the real anomalies and avoid falsely labeling normal objects as anomalies.
Applications:
1. Credit card fraud detection
2. Network intrusion detection
Types of data – What is a data set?
🞂 Data set: a collection of data objects.
🞂 Data objects are described by a number of attributes (the columns of the data set); the objects themselves are the rows.
🞂 An attribute is a property or characteristic of an object.
◦ Examples: eye color of a person, temperature, etc.
◦ An attribute is also known as a variable, field, characteristic, dimension, or feature.
🞂 A collection of attributes describes an object.
◦ An object is also known as a record, point, case, event, vector, pattern, observation, sample, entity, or instance.
Data set – Example
● Each row is an object (a student).
● Each column is an attribute (an aspect of a student).
● Record-based data sets are stored in flat files or relational database systems.
Attributes and Measurement
🞂 An attribute is a property or characteristic of an object that
may vary; either from one object to another or from one time
to another.

🞂 Ex: Eye color varies from person to person, while the


temperature of an object varies over time

🞂 Note that eye color is a symbolic attribute with a small


number of possible values {black,blue, brown,green,hazel,
etc.}

🞂 while temperature is a numerical attribute with a potentially


unlimited number of values.
Attributes and Measurement
🞂 A measurement scale is a rule (function) that associates a
numerical or symbolic value with an attribute of an object.

🞂 The process of measurement is the application of a


measurement scale to associate a value with a particular
attribute of a specific object.

Example: gender- Male/Female- Symbolic value


weight-kg/gms-Numerical value
Type of an Attribute
🞂 The values used to represent an attribute may have
properties that are not properties of the attribute itself, and
vice versa

🞂 Distinction between attributes and attribute values

◦ Same attribute can be mapped to different attribute values


● Example: height can be measured in feet or meters

◦ Different attributes can be mapped to the same set of values
● Example: attribute values for ID and age are both integers
● But the properties of the attribute values can be different:
● age has a maximum and minimum value, while ID does not
● it makes no sense to compute the average of employee IDs
The Different Types of Attributes
🞂 The following properties (operations) of numbers are typically used to describe attributes:
◦ Distinctness: = and ≠
◦ Order: < and >
◦ Differences are meaningful: + and −
◦ Ratios are meaningful: × and /

Types are:
◦ Nominal attribute: distinctness
◦ Ordinal attribute: distinctness & order
◦ Interval attribute: distinctness, order & meaningful differences
◦ Ratio attribute: all 4 properties/operations
Example: length is a ratio attribute.
Types of Attributes
🞂 There are different types of attributes:
◦ Nominal: names (integers or symbols); no order.
● Examples: ID numbers, eye color, zip codes
◦ Ordinal: values are ordered, but differences between them are not meaningful.
● Examples: rankings (e.g., place in a competition), grades, height in {tall, medium, short}
◦ Interval: ordered, differences are meaningful, but there is no true zero point.
● Examples: calendar dates, temperatures in Celsius or Fahrenheit
◦ Ratio: ordered, differences are meaningful, there is a true zero point, and ratios are meaningful.
● Examples: height, weight, length, time, counts
Different Attribute Types: Permissible Transformations
Transformations that do not change the meaning of an attribute: nominal — any one-to-one mapping (e.g., a permutation of values); ordinal — any order-preserving change of values; interval — new_value = a × old_value + b, where a and b are constants; ratio — new_value = a × old_value.
Describing Attributes by the Number of Values
🞂Discrete Attribute
◦ Has only a finite or countably infinite set of values
◦ Examples: zip codes, counts, or the set of words in a collection of
documents
◦ Often represented as integer variables.
◦ Note: binary attributes are a special case of discrete attributes
◦ Examples:True/False,0/1,Male/Female,Y/N

🞂Continuous Attribute
◦ Has real numbers as attribute values
◦ Examples: temperature, height, or weight.
◦ Practically, real values can only be measured and represented
using a finite number of digits.
◦ Continuous attributes are typically represented as floating-
point variables.
Asymmetric Attributes
🞂 Only presence (a non-zero attribute value) is regarded as important.
🞂 Consider a data set where each object is a student and each attribute records whether or not the student took a particular course at a university.
🞂 Since most students take only a small fraction of all courses, it is more meaningful and more efficient to focus on the non-zero values.
🞂 Binary attributes where only non-zero values are important are called asymmetric binary attributes.
Important Characteristics of a Data Set
◦ Dimensionality (number of attributes)
● Data sets with too few attributes may not capture enough information for quality mining results, while high-dimensional data suffer from the curse of dimensionality.
● An important motivation for preprocessing the data is dimensionality reduction.

◦ Distribution
◦ The frequency of occurrence of various values for the attributes of the data objects.
◦ Statisticians have enumerated many distributions, e.g., Gaussian (normal), and their properties.
◦ Many data sets are not well captured by standard statistical distributions, so their statistical distribution is often not analyzed.
◦ Skewness in the distribution makes classification difficult, e.g., a categorical attribute with 95% "Y" and 5% "N".
Important Characteristics of a Data Set
◦ Sparsity
● Only the non-zero values need to be stored and manipulated, which improves computation time and storage.
◦ Resolution
● It is possible to obtain data at different levels of resolution, and the properties of the data are often different at different resolutions.
● Ex: the surface of the Earth is uneven at a resolution of a few meters, but relatively smooth at a resolution of tens of kilometers.
● If the resolution is too fine or too coarse, a pattern may not be visible.
● Ex: atmospheric pressure measured on a scale of hours reflects the movement of storms and other weather systems; on a scale of months, such phenomena are not detectable.
Types of data sets
🞂 Record Data
◦ Transaction Data (Market Basket Data)
◦ Data Matrix(Pattern Matrix)
◦ Sparse Data Matrix (Document-term Data Matrix)

🞂 Graph-Based Data
◦ World Wide Web (Data with Relationships among Objects)
◦ Molecular Structures (Data with Objects that are Graphs)

🞂 Ordered Data
◦ Sequential Transaction Data
◦ Genomic Sequence Data
◦ Temperature time series data
◦ Spatial Temperature Data
Record Data
🞂
Data that consists of a collection of records, each of which consists of a fixed set
of attributes
🞂
Stored in flat files or relational databases
Transaction Data
🞂A special type of record data, where
◦ Each record (transaction) involves a set of items.
◦ For example, consider a grocery store. The set of products
purchased by a customer during one shopping trip constitute a
transaction, while the individual products that were purchased
are the items.
Data Matrix
🞂 If data objects have the same fixed set of numeric attributes, then
the data objects can be thought of as points in a multi-dimensional
space, where each dimension represents a distinct attribute

🞂 Such data set can be represented by an m by n matrix, where there


are m rows, one for each object, and n columns, one for each
attribute
🞂 Numeric attributes only- standard matrix operations can be applied
to transform and manipulate the data
The Sparse Data Matrix
🞂 A sparse data matrix is a special case of a data matrix in which the attributes are of the same type and are asymmetric,
🞂 i.e., only non-zero values are important.
🞂 Transaction data is an example of a sparse data matrix that has only 0–1 entries.
🞂 Another common example is document data.
🞂 It is also called a document-term matrix.
Document-term matrix
🞂Each document becomes a ‘term’ vector
◦ Each term is a component (attribute) of the vector
◦ The value of each component is the number of times the
corresponding term occurs in the document.
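A small illustrative Python sketch of building such a document-term matrix; the documents are made up for the example.

```python
# Illustrative sketch: a document-term matrix where each document becomes a term
# vector and each entry counts how often the term occurs in the document.
from collections import Counter

documents = [
    "team coach play ball score game",
    "coach season team win lost",
    "ball game score season",
]

vocabulary = sorted({word for doc in documents for word in doc.split()})
matrix = [[Counter(doc.split())[term] for term in vocabulary] for doc in documents]

print(vocabulary)
for row in matrix:
    print(row)          # one sparse row of term counts per document
```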
Graph-Based Data
Data with relationships among objects
🞂 The relationships among objects frequently convey important information. In such cases, the data is often represented as a graph.
🞂 In particular, the data objects are mapped to nodes of the graph, while the relationships among objects are captured by the links between objects and by link properties, such as direction and weight.
Graph Data
🞂 Examples: web pages; social networks (objects: people, relationships: interactions via social media).
Graph-Based Data
Data with objects that are graphs
🞂 When objects contain sub-objects that have relationships, such objects are represented as graphs.
🞂 Example: the structure of a chemical compound, e.g., the benzene molecule C6H6.
Ordered Data
🞂 For some types of data, the attributes have relationships that involve order in time or space.
🞂 Different types of ordered data are:
◦ Sequential transaction data
◦ Genomic sequence data
◦ Temperature time series data
◦ Spatial temperature data

Sequential transaction data
🞂 Transaction data extended with time, where each record is a time-stamped sequence of items/events.

Genomic sequence data
🞂 Ex: human genetic information encoded as DNA sequences.

Time series data
🞂 Ex: daily prices of various stocks.

Spatial data
🞂 Has spatial attributes, such as positions or areas (longitude, latitude).
🞂 Ex: weather data (temperature, pressure, precipitation) collected from a variety of geographical locations.
[Figure: average monthly temperature of land and ocean.]
🞂 Data with both spatial and time components is spatio-temporal data.
Handling Non-Record Data
🞂 Most data mining algorithms are designed for record data or its variations.
🞂 Record-oriented techniques can be applied to non-record data by extracting features from the data objects and using these features to create a record corresponding to each object.
🞂 Ex: chemical structure data can be converted to record data with a binary attribute for each common substructure.
Data Quality
🞂 Data mining focuses on

(1) The detection and correction of data quality


Problems - Data Cleaning

(2) The use of robust algorithms that can tolerate poor data
quality.
Measurement and Data Collection Issues
🞂 There may be problems due to human error, limitations of measuring devices, or flaws in the data collection process.
🞂 The issues may be:
1. values, or even entire data objects, may be missing;
2. there may be spurious or duplicate objects;
3. there may be inconsistent values.
🞂 Example (duplicates): two different records for a person who has recently lived at two different addresses.
🞂 Example (inconsistent values): a person's record lists a height of 2 meters but a weight of only 2 kg.
Measurement and Data Collection Errors
🞂 Measurement error refers to any problem resulting from the measurement process.
🞂 For continuous attributes, the numerical difference between the measured and true value is called the error.
🞂 The term data collection error refers to errors such as omitting data objects or attribute values, or inappropriately including a data object.
  Example: a study of animals of a certain species might inappropriately include animals of a related species that are similar in appearance.
🞂 Problems that involve measurement error include:
◦ noise, artifacts, bias, precision, and accuracy.
◦ Data quality issues that involve both measurement and data collection problems include: outliers, missing values, inconsistent values, and duplicate data.
Noise
🞂 Noise is the random component of a measurement error.
🞂 It may involve the distortion of a value or the addition of spurious objects.
🞂 The term noise is often used in connection with data that has a spatial or temporal component, such as signal or image processing data.
Artifacts
🞂Data errors may be the result of a more deterministic
phenomenon, such as a streak in the same place on a
set of photographs.

🞂Such deterministic distortions of the data are often


referred to as artifacts.
Precision, Bias, and Accuracy
🞂 Precision: the closeness of repeated measurements (of the same quantity) to one another.
🞂 Bias: a systematic variation of the measurements from the quantity being measured.
🞂 Accuracy: the closeness of measurements to the true value of the quantity being measured.

Example: a laboratory standard weight of 1 g is measured 5 times, giving {1.015, 0.990, 1.013, 1.001, 0.986}.
Mean → 1.001
Bias → 1.001 − 1 = 0.001
Precision → the standard deviation of the set of values = 0.013
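A small illustrative Python sketch reproducing the bias and precision computation for the weight-scale example above.

```python
# Illustrative sketch: bias and precision for the 1 g weight-scale example.
from statistics import mean, stdev

measurements = [1.015, 0.990, 1.013, 1.001, 0.986]
true_value = 1.0

bias = mean(measurements) - true_value    # systematic deviation from the true value
precision = stdev(measurements)           # spread of the repeated measurements

print(f"mean={mean(measurements):.3f}, bias={bias:.3f}, precision={precision:.3f}")
# mean=1.001, bias=0.001, precision=0.013
```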
Outliers
🞂 Outliers are data objects whose characteristics are considerably different from most of the other data objects in the data set, or whose attribute values are unusual.
🞂 They are also called anomalous objects or values.
◦ Case 1: outliers are noise that interferes with data analysis.
◦ Case 2: outliers are the goal of our analysis:
● credit card fraud
● intrusion detection
Missing Values
🞂 Reasons for missing values:
◦ information was not collected (e.g., people decline to give their age or weight);
◦ attributes may not be applicable to all cases (e.g., annual income is not applicable to children).
🞂 Handling missing values:
◦ Eliminate data objects or attributes.
◦ Estimate missing values.
  Example: in a time series of temperature, a missing value can be estimated from neighbouring values.
◦ Ignore the missing value during analysis.
  Example: clustering of data objects can proceed even if one or two attribute values are missing.

Inconsistent values
◦ Example: a height should not be negative; an address field whose zip code does not match the stated city.
Duplicate Data
🞂 A data set may include data objects that are duplicates, or almost duplicates, of one another.
◦ This is a major issue when merging data from heterogeneous sources.
🞂 Examples:
◦ the same person with multiple e-mail addresses;
◦ two distinct people who share the same name but have different addresses should not be merged during deduplication.
Data Quality Issues Related to Applications
🞂 Timeliness: some data begin to age as soon as they are collected.
  Example: the purchasing behaviour of customers changes over time.
🞂 Relevance: the data must contain the information necessary for the application.
  Example: a model that predicts accident rates for drivers should not omit the age and gender attributes.
🞂 Knowledge about the data: documentation of aspects of the data, such as the type of each attribute and the scale of measurement.
Data Preprocessing
🞂Different strategies/steps for data preprocessing:

🞂Aggregation
🞂Sampling
🞂Dimensionality reduction
🞂Feature subset selection
🞂Feature creation
🞂Discretization and Binarization
🞂Variable transformation
Aggregation
● Combining two or more attributes (or objects) into a
single attribute (or object)

● Purpose
– Data reduction
◆ Reduce the number of attributes or objects
– Change of scale
◆ Cities aggregated into regions, states, countries, etc.
◆ Days aggregated into weeks, months, or years
– More “stable” data
◆ Aggregated data(avg,total) tends to have less
variability than individual values being aggregated.
Aggregation

● Reducing days to months


● Reduce item to higher category, i.e. Electronics
● Price reduced to sum or an average
● Reduce over store locations
Example: Precipitation in Australia
[Figure: histograms of the standard deviation of average monthly precipitation and of average yearly precipitation for locations in Australia.]
🞂 Yearly precipitation has less variability than monthly precipitation in Australia.
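An illustrative sketch, assuming pandas and numpy, of the same idea on synthetic data: values aggregated by averaging vary less than the individual values being aggregated. The daily series is invented for the example.

```python
# Illustrative sketch: aggregating daily values into monthly averages reduces variability.
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
daily = pd.Series(
    rng.gamma(shape=2.0, scale=3.0, size=365),
    index=pd.date_range("2023-01-01", periods=365, freq="D"),
)

monthly_avg = daily.groupby(daily.index.month).mean()   # aggregate days into monthly averages

print("std of daily values:    ", round(daily.std(), 2))
print("std of monthly averages:", round(monthly_avg.std(), 2))
# The aggregated (averaged) values vary much less than the individual daily values.
```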
Sampling
🞂 Sampling is used for selecting a subset of the data objects
to be analyzed.
🞂 Sampling is the main technique employed for data
reduction.
◦ It is often used for both the preliminary investigation of the data and
the final data analysis.

🞂 Statisticians often sample because obtaining the entire set


of data of interest is too expensive or time consuming.

🞂 Sampling is typically used in data mining because


processing the entire set of data of interest is too expensive
or time consuming.
Sampling …
🞂The key principle for effective sampling is the following:

◦ Using a sample will work almost as well as using the entire


data set, if the sample is representative

◦ A sample is representative if it has approximately the same


properties (of interest) as the original set of data
Sampling Approaches-Types of Sampling
🞂Simple Random Sampling
◦ There is an equal probability of selecting any particular item
◦ Sampling without replacement
● As each item is selected, it is removed from the
population
◦ Sampling with replacement
● Objects are not removed from the population as they are
selected for the sample.
● In sampling with replacement, the same object can be
picked up more than once
🞂Stratified sampling
◦ Split the data into several groups; then draw random samples
from each group
🞂 Progressive sampling
◦ Used when the proper sample size is difficult to determine.
◦ Start with a small sample and increase the size until a sample of sufficient size is obtained.
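A short illustrative Python sketch of the sampling approaches listed above; the population and group labels are made up for the example.

```python
# Illustrative sketch: simple random sampling with and without replacement, and a
# basic stratified sample.
import random

random.seed(0)
population = list(range(100))                       # 100 data objects (by index)
groups = ["A"] * 70 + ["B"] * 30                    # an imbalanced grouping attribute

without_replacement = random.sample(population, k=10)    # each object picked at most once
with_replacement = random.choices(population, k=10)      # the same object can repeat

# Stratified sampling: draw the same number of objects from each group.
strata = {}
for obj, grp in zip(population, groups):
    strata.setdefault(grp, []).append(obj)
stratified = [obj for members in strata.values() for obj in random.sample(members, k=5)]

print(without_replacement)
print(with_replacement)
print(stratified)
```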
Sample Size
[Figure: the same data set sampled at 8000 points, 2000 points, and 500 points — an example of the loss of structure with sampling.]

Sample Size
🞂 What sample size is necessary to get at least one object from each of 10 equal-sized groups?
[Figure: ten groups of points, and the chance of finding at least one representative point from each of the 10 groups as a function of sample size.]
Dimensionality Reduction
🞂Purpose
◦ Avoid curse of dimensionality
◦ Reduce amount of time and memory required by data mining
algorithms
◦ Allow data to be more easily visualized
◦ May help to eliminate irrelevant features or reduce noise

🞂Techniques:Linear algebra techniques for


dimensionality reduction
◦ Principal Components Analysis (PCA)-from the set of
attributes find new attributes(principal component)
◦ Singular Value Decomposition(SVD)
Curse of Dimensionality
🞂 When dimensionality increases, data becomes increasingly sparse in the space that it occupies.
🞂 The data objects we have are then not a representative sample of all possible objects;
🞂 in other words, there are not enough data objects to cover the space.
Feature Subset Selection
🞂 Another way to reduce the dimensionality of data.
🞂 Select useful attributes to create an accurate model.
🞂 Use only a subset of the features. No information is lost if the removed features are redundant or irrelevant:

🞂Redundant features
◦ Duplicate much or all of the information contained in one or more
other attributes
◦ Example: purchase price of a product and the amount of sales tax
paid contain much of the same information.

🞂Irrelevant features
◦ Contain no information that is useful for the data mining task at
hand
◦ Example: students' ID is often irrelevant to the task of predicting
students' GPA
An Architecture for Feature Subset
Selection
Feature Subset Selection
🞂 Three standard approaches to feature selection:
🞂 Embedded approaches: the data mining algorithm itself decides which attributes to use and which to ignore.
🞂 Filter approaches: features are selected before the data mining algorithm is run, e.g., based on the correlation of attributes.
🞂 Wrapper approaches: the target data mining algorithm is used as a black box to find the best subset of attributes.
Feature Weighting
🞂 An alternative to keeping or eliminating features.
🞂 A high weight means a more important feature;
🞂 a low weight means a less important feature.
🞂 Weights are often based on domain knowledge.
Feature Creation
🞂 Possible to create new set of attributes from the
original attributes and captures the information
effectively.

🞂 Number of new attributes are smaller than the number


of original attributes.

🞂 2 methodologies to create new attributes are:


◦ Feature Extraction
◦ Mapping the data to a new space
Feature Creation
🞂 Feature extraction: creating a new set of features from the original raw data.
🞂 Example: a data set contains information on historical artifacts, including material (wood, clay, bronze, gold), mass, and volume. A new feature density = mass / volume can be created, and classification based on density can identify the material.
🞂 Simple feature extraction can be done by mathematical combination of attributes or by using domain expertise.
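A tiny illustrative Python sketch of the artifact example above; the mass and volume values are made up.

```python
# Illustrative sketch: creating a derived feature (density = mass / volume).
artifacts = [
    {"material": "gold",   "mass_g": 193.0, "volume_cm3": 10.0},
    {"material": "bronze", "mass_g": 88.0,  "volume_cm3": 10.0},
    {"material": "wood",   "mass_g": 6.5,   "volume_cm3": 10.0},
]

for a in artifacts:
    a["density"] = a["mass_g"] / a["volume_cm3"]   # derived feature

print([(a["material"], a["density"]) for a in artifacts])
# density separates the materials far better than mass or volume alone
```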
Feature Creation
🞂 Mapping the data to a new space: a different view of the data can reveal important and interesting information.
🞂 Example: applying the Fourier transform to time series data changes the representation to one based on frequency, so that periodic noise patterns can be detected.
Discretization and Binarization
🞂 Some data mining algorithms, such as certain classification algorithms, need the data in the form of categorical attributes; association pattern mining needs the data in the form of binary attributes.
🞂 Transforming a continuous attribute into a categorical attribute → discretization.
🞂 Transforming continuous and discrete attributes into binary attributes → binarization.
Binarization
🞂 Assigning numerical values: if there are m categorical values, assign each original value to an integer in the interval [0, m−1].
🞂 Finding the number of binary attributes required:
  n = ⌈log2(m)⌉
🞂 Conversion into binary: encode each integer using n bits.
🞂 Such transformations can lead to unintended relationships among the new attributes (e.g., two of the bit attributes, say x2 and x3, may appear correlated even though no relationship exists between the original values).
🞂 To overcome this issue, use one binary attribute per categorical value (number of binary attributes = number of values) → asymmetric binary attributes.
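A short illustrative Python sketch of both binarization schemes above; the categorical values used are an assumption for the example.

```python
# Illustrative sketch: binarizing a categorical attribute, first with ceil(log2(m))
# bits and then with one asymmetric binary attribute per value.
from math import ceil, log2

values = ["awful", "poor", "OK", "good", "great"]       # m = 5 categorical values
m = len(values)
n_bits = ceil(log2(m))                                   # n = 3 bits

compact = {v: format(i, f"0{n_bits}b") for i, v in enumerate(values)}
one_hot = {v: [1 if i == j else 0 for j in range(m)] for i, v in enumerate(values)}

print(compact)   # e.g. 'OK' -> '010'  (can introduce spurious correlations between bits)
print(one_hot)   # e.g. 'OK' -> [0, 0, 1, 0, 0]  (asymmetric binary attributes)
```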
Discretization of Continuous Attributes
🞂 Discretization is applied to attributes that are used in classification or association analysis.
🞂 Transformation of a continuous attribute to a categorical attribute involves two subtasks:
◦ deciding how many categories (n) to have: after the values of the continuous attribute are sorted, they are divided into n intervals by specifying n−1 split points;
◦ determining how to map the continuous values to the categorical values: the result is a set of intervals {(x0, x1], (x1, x2], ..., (xn−1, xn]}.
🞂 Discretization is unsupervised if class labels are not used and supervised if class labels are used.
Discretization Without Using Class Labels
● Unsupervised discretization does not use class information.
● Example: discretizing people into low-income, middle-income, and high-income groups based on economic factors.
[Figure: equal interval width approach used to obtain 4 values.]
[Figure: equal frequency approach used to obtain 4 values.]
[Figure: K-means approach used to obtain 4 values.]
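An illustrative sketch, assuming numpy, of the equal-width and equal-frequency approaches just mentioned; the continuous attribute values are generated only for the example.

```python
# Illustrative sketch: equal-width and equal-frequency discretization into 4 categories.
import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(loc=50, scale=15, size=200)      # a continuous attribute
n_bins = 4

# Equal interval width: split the range [min, max] into 4 equal-width intervals.
width_edges = np.linspace(x.min(), x.max(), n_bins + 1)
equal_width_labels = np.digitize(x, width_edges[1:-1])          # categories 0..3

# Equal frequency: choose split points so each interval holds about the same number of values.
freq_edges = np.quantile(x, [0.25, 0.5, 0.75])
equal_freq_labels = np.digitize(x, freq_edges)                  # categories 0..3

print(np.bincount(equal_width_labels))   # uneven counts per bin
print(np.bincount(equal_freq_labels))    # roughly 50 values per bin
```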


Measures of Similarity and Dissimilarity
🞂 Used in clustering, some classification methods, and anomaly detection.
🞂 Similarity measure
◦ A numerical measure of how alike two data objects are.
◦ It is higher for pairs of objects that are more alike.
◦ It often falls in the range [0, 1]:
◦ 0 → no similarity, 1 → completely similar.
🞂 Dissimilarity (distance) measure
◦ A numerical measure of how different two data objects are.
◦ It is lower for pairs of similar objects.
◦ The minimum dissimilarity is often 0;
◦ the upper limit varies.
🞂 Proximity refers to either a similarity or a dissimilarity.
Measures of Similarity and Dissimilarity
🞂 Transformations
◦ Applied to convert a similarity to a dissimilarity, or vice versa,
◦ or to transform a proximity measure to fall within the range [0, 1].
◦ Example: if similarities range from 1 (not similar) to 10 (very similar), we can transform them to [0, 1] using s' = (s − 1)/9,
  where s is the original value and s' is the new similarity value.
◦ In general, a similarity measure can be transformed to the interval [0, 1] by
  s' = (s − min_s) / (max_s − min_s)
◦ A dissimilarity measure with a finite range can be mapped to the interval [0, 1] by
  d' = (d − min_d) / (max_d − min_d)
Measures of Similarity and Dissimilarity
🞂 Transformations
◦ Example: consider the transformation d' = d / (1 + d) for a dissimilarity measure that ranges from 0 to ∞. The dissimilarities 0, 0.5, 2, 10, 100, 1000 become 0, 0.33, 0.67, 0.91, 0.99, 0.999 respectively.
◦ Transforming similarities to dissimilarities, and vice versa, is straightforward, but the meaning of the proximity measure should be preserved.
◦ One approach: if the similarity (dissimilarity) falls in the interval [0, 1], then the dissimilarity (similarity) can be defined as d = 1 − s (or s = 1 − d).
◦ Another approach is negation: the dissimilarities 0, 1, 10, 100 can be transformed into the similarities 0, −1, −10, −100 respectively.
Measures of Similarity and Dissimilarity
🞂 Transformations
◦ Besides negation, other transformations from dissimilarity to similarity include:
◦ s = 1 / (d + 1)
◦ s = e^(−d)
◦ s = 1 − (d − min_d) / (max_d − min_d)
◦ Example: the dissimilarities 0, 1, 10, 100 are transformed
  by s = 1 / (d + 1) into 1, 0.5, 0.09, 0.01;
  by s = e^(−d) into 1.00, 0.37, 0.00, 0.00;
  by s = 1 − (d − min_d)/(max_d − min_d) into 1.00, 0.99, 0.90, 0.00.
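A small illustrative Python sketch applying the three transformations above to the same dissimilarities.

```python
# Illustrative sketch: dissimilarity-to-similarity transformations for 0, 1, 10, 100.
import math

d_values = [0, 1, 10, 100]
min_d, max_d = min(d_values), max(d_values)

for d in d_values:
    s1 = 1 / (d + 1)
    s2 = math.exp(-d)
    s3 = 1 - (d - min_d) / (max_d - min_d)
    print(f"d={d:>3}  1/(d+1)={s1:.2f}  e^-d={s2:.2f}  linear={s3:.2f}")
# d=  0  1/(d+1)=1.00  e^-d=1.00  linear=1.00
# d=  1  1/(d+1)=0.50  e^-d=0.37  linear=0.99
# d= 10  1/(d+1)=0.09  e^-d=0.00  linear=0.90
# d=100  1/(d+1)=0.01  e^-d=0.00  linear=0.00
```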
Similarity/Dissimilarity for Simple Attributes
The similarity and dissimilarity between two objects, x and y, with respect to a single, simple attribute depend on the attribute type:
◦ Nominal: d = 0 if x = y, d = 1 if x ≠ y; s = 1 − d.
◦ Ordinal: map the values to integers 0 to n−1; d = |x − y| / (n − 1); s = 1 − d.
◦ Interval or ratio: d = |x − y|; s = −d, s = 1/(1 + d), s = e^(−d), or s = 1 − (d − min_d)/(max_d − min_d).
Similarity and Dissimilarity Between Simple Attributes
◦ Example: consider objects with a single ordinal attribute.
◦ A dairy milk chocolate has an attribute quality on the scale {poor, fair, OK, good, wonderful}.
◦ There are 3 objects, P1, P2, P3, where P1 → good, P2 → OK, P3 → fair.
◦ To make observations quantitative, the values of the ordinal attribute are mapped to integers:
  {poor = 0, fair = 1, OK = 2, good = 3, wonderful = 4}
◦ d(P1, P2) = 3 − 2 = 1,
  or
◦ d(P1, P2) = (3 − 2)/(5 − 1) = 0.25, if we want the dissimilarity to fall between 0 and 1.
Dissimilarities Between Data Objects – Distance
🞂 Euclidean distance:
  d(x, y) = sqrt( Σ_{k=1..n} (x_k − y_k)^2 )
🞂 where n is the number of dimensions (attributes) and x_k and y_k are, respectively, the kth attributes (components) of data objects x and y.
Minkowski Distance
🞂 The Minkowski distance is a generalization of the Euclidean distance:
  d(x, y) = ( Σ_{k=1..n} |x_k − y_k|^r )^(1/r)
🞂 where r is a parameter, n is the number of dimensions (attributes), and x_k and y_k are, respectively, the kth attributes (components) of data objects x and y.
Minkowski Distance: Examples
🞂 r = 1: city block (Manhattan, taxicab, L1 norm) distance.
● Example: the Hamming distance, the number of positions in which two binary vectors differ.
🞂 r = 2: Euclidean (L2 norm) distance.
🞂 r → ∞: "supremum" (Lmax norm, L∞ norm) distance, the maximum difference between any attribute of the two objects.
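A small illustrative Python sketch, assuming numpy, of the three special cases of the Minkowski distance; the two points are made up.

```python
# Illustrative sketch: Minkowski distances for r = 1, r = 2, and r -> infinity.
import numpy as np

x = np.array([0.0, 2.0])
y = np.array([3.0, 6.0])
diff = np.abs(x - y)

l1 = diff.sum()                     # r = 1: 3 + 4 = 7
l2 = np.sqrt((diff ** 2).sum())     # r = 2: sqrt(9 + 16) = 5
l_inf = diff.max()                  # r -> inf: max(3, 4) = 4

print(l1, l2, l_inf)                # 7.0 5.0 4.0
```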


Common Properties of a Distance
🞂 Distances, such as the Euclidean distance, have some well-known properties:
1. d(x, y) ≥ 0 for all x and y, and d(x, y) = 0 only if x = y. (Positivity)
2. d(x, y) = d(y, x) for all x and y. (Symmetry)
3. d(x, z) ≤ d(x, y) + d(y, z) for all points x, y, and z. (Triangle inequality)
🞂 where d(x, y) is the distance (dissimilarity) between points (data objects) x and y.
🞂 A distance measure that satisfies these properties is called a metric.
Examples
🞂 Non-metric dissimilarities: set differences
◦ We can define the distance d between two sets A and B as
  d(A, B) = size(A − B) + size(B − A),
  where size returns the number of elements in the set.
🞂 A = {1, 2, 3, 4} and B = {2, 3, 4}
  Then A − B = {1} and B − A = ∅ (the empty set),
  so d(A, B) = 1 + 0 = 1.
Similarities Between Data Objects
🞂 Similarities also have some well-known properties:
1. s(x, y) = 1 (or the maximum similarity) only if x = y (typically 0 ≤ s ≤ 1).
2. s(x, y) = s(y, x) for all x and y. (Symmetry)
🞂 where s(x, y) is the similarity between points (data objects) x and y.
Examples of Proximity Measures
Similarity Measures for Binary Data
🞂 Similarity measures between objects that contain only binary attributes are called similarity coefficients, and typically have values between 0 (not similar) and 1 (completely similar).
🞂 Let x and y be two objects that consist of n binary attributes, and let
  f00 = number of attributes where x = 0 and y = 0
  f01 = number of attributes where x = 0 and y = 1
  f10 = number of attributes where x = 1 and y = 0
  f11 = number of attributes where x = 1 and y = 1

Simple Matching Coefficient:
  SMC = (f11 + f00) / (f00 + f01 + f10 + f11)
Example: find students who answered questions similarly on a test with True/False questions.
Jaccard Coefficient
🞂 Used to handle objects consisting of asymmetric binary attributes:
  J = f11 / (f01 + f10 + f11)
🞂 Suppose that x and y are data objects that represent two rows (two transactions) of a transaction matrix.
🞂 If each asymmetric binary attribute corresponds to an item in a store, then a 1 indicates that the item was purchased, while a 0 indicates that it was not.
SMC versus Jaccard: Example
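An illustrative Python sketch of the comparison; the two binary vectors are made up, but mostly-zero vectors show why Jaccard ignores 0–0 matches while SMC does not.

```python
# Illustrative sketch: SMC versus Jaccard for one pair of binary vectors.
x = [1, 0, 0, 0, 0, 0, 0, 0, 0, 0]
y = [0, 0, 0, 0, 0, 0, 1, 0, 0, 1]

f00 = sum(1 for a, b in zip(x, y) if a == 0 and b == 0)
f01 = sum(1 for a, b in zip(x, y) if a == 0 and b == 1)
f10 = sum(1 for a, b in zip(x, y) if a == 1 and b == 0)
f11 = sum(1 for a, b in zip(x, y) if a == 1 and b == 1)

smc = (f11 + f00) / (f00 + f01 + f10 + f11)
jaccard = f11 / (f01 + f10 + f11)

print(f"SMC={smc:.2f}, Jaccard={jaccard:.2f}")   # SMC=0.70, Jaccard=0.00
```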
Cosine Similarity
🞂 Documents are often represented as vectors,
🞂 where each component (attribute) represents the frequency with which a particular term (word) occurs in the document.
🞂 The similarity between two documents depends on the words that appear in both documents.
🞂 cos(0°) = 1: the documents are similar;
🞂 cos(90°) = 0: the documents are dissimilar (share no terms).
Cosine Similarity
🞂 Cosine similarity between two document vectors x and y:
  cos(x, y) = (x · y) / (||x|| ||y||)
  where · is the vector dot product and ||x|| is the length (Euclidean norm) of vector x.
● Dividing x and y by their lengths normalizes them to have length 1, so cosine similarity does not take the magnitudes of the two data objects into account when computing similarity.
● When magnitude is important, a computation using the Euclidean distance is a better choice.
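A brief illustrative sketch, assuming numpy, of the cosine formula above on two term-frequency vectors made up for the example.

```python
# Illustrative sketch: cosine similarity between two document term-frequency vectors.
import numpy as np

doc1 = np.array([3, 2, 0, 5, 0, 0, 0, 2, 0, 0])
doc2 = np.array([1, 0, 0, 0, 0, 0, 0, 1, 0, 2])

cosine = doc1 @ doc2 / (np.linalg.norm(doc1) * np.linalg.norm(doc2))
print(round(cosine, 3))    # 0.315 — the documents share only a few terms
```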
Extended Jaccard Coefficient (Tanimoto Coefficient)
🞂 Used for document data:
  EJ(x, y) = (x · y) / (||x||² + ||y||² − x · y)
🞂 It reduces to the Jaccard coefficient in the case of binary attributes.
Correlation
● The correlation between two data objects measures the linear relationship between the attributes of the objects.
● Correlation can measure the relationship
○ between 2 variables, e.g., height and weight;
○ between 2 objects, e.g., two temperature time series.
● It is more often used to measure the similarity between attributes, since the values in two data objects come from different attributes, which can have different attribute types and scales.

Pearson's correlation between two sets of numerical values x and y:
  corr(x, y) = covariance(x, y) / (std_dev(x) · std_dev(y)) = s_xy / (s_x s_y)
  where s_xy = (1/(n−1)) Σ_{k=1..n} (x_k − mean(x)) (y_k − mean(y)).
Visually Evaluating Correlation
[Figure: scatter plots of 30 pairs of values, randomly generated from a normal distribution, showing correlations ranging from −1 to 1.]
● If x and y are transformed into x' and y' by shifting and scaling (with a positive scale factor), the correlation is unchanged: corr(x, y) = corr(x', y').
● Cosine similarity, in contrast, is not invariant to such transformations: in general cos(x, y) ≠ cos(x', y').
Drawback of Correlation
🞂 x = (−3, −2, −1, 0, 1, 2, 3)
🞂 y = (9, 4, 1, 0, 1, 4, 9)
🞂 mean(x) = 0, mean(y) = 4
🞂 std(x) = 2.16, std(y) = 3.74

cov(x, y) = [(−3)(5) + (−2)(0) + (−1)(−3) + (0)(−4) + (1)(−3) + (2)(0) + (3)(5)] / (7 − 1) = 0

corr(x, y) = cov(x, y) / (std(x) · std(y)) = 0

🞂 If corr = 0, there is no linear relationship between the two sets of values.
🞂 Here, even though y_i = x_i² (a perfect nonlinear relationship), the correlation is 0.
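A one-line illustrative check, assuming numpy, of the example above.

```python
# Illustrative sketch: the correlation of x and y = x**2 is 0 even though the two
# are perfectly (nonlinearly) related.
import numpy as np

x = np.array([-3, -2, -1, 0, 1, 2, 3], dtype=float)
y = x ** 2

corr = np.corrcoef(x, y)[0, 1]     # Pearson correlation from the 2x2 correlation matrix
print(round(corr, 4))              # 0.0 — correlation only detects linear relationships
```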
Bregman Divergence
● Used as loss or distortion functions.
● Assume x and y are two points, where y is the original point and x is a distortion or approximation of it.
● The Bregman divergence measures the resulting distortion or loss if y is approximated by x; the more similar x and y are, the smaller the loss.
● It is also used as a dissimilarity function.
● Formally, given a strictly convex function φ, the Bregman divergence is
  D(x, y) = φ(x) − φ(y) − ⟨∇φ(y), x − y⟩,
  i.e., the error of approximating φ(x) by the first-order Taylor expansion of φ around y.
Issues in Proximity Calculation
● Issues related to proximity measures are:

○ How to handle when attributes have different scales (range


of values)

○ How to calculate proximity when different attribute types


i.e quantitative, qualitative

○ How to handle proximity calculations when attributes


have different weights(not all attributes contribute equally
to the proximity of objects)
Standardization and Correlation for Distance Measures: Mahalanobis Distance
🞂 A generalization of the Euclidean distance.
🞂 Useful when the attributes are correlated, have different ranges of values (different variances), and the distribution of the data is approximately Gaussian.
🞂 The Mahalanobis distance between two objects (vectors) x and y is:
  mahalanobis(x, y) = (x − y) Σ⁻¹ (x − y)ᵀ
🞂 where Σ is the covariance matrix of the data.
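An illustrative sketch, assuming numpy, of the Mahalanobis formula above; the data set used to estimate the covariance matrix is generated only for the example.

```python
# Illustrative sketch: Mahalanobis distance between two points, using the covariance
# matrix estimated from a made-up data set.
import numpy as np

rng = np.random.default_rng(0)
data = rng.multivariate_normal(mean=[0, 0], cov=[[3, 1], [1, 2]], size=500)

cov_inv = np.linalg.inv(np.cov(data, rowvar=False))   # inverse covariance of the data

x = np.array([1.0, 0.5])
y = np.array([-1.0, 1.5])
diff = x - y
mahal = diff @ cov_inv @ diff                         # (x - y) Sigma^-1 (x - y)^T
print(round(float(mahal), 3))
```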


Combining Similarities for Heterogeneous Attributes
🞂 When the attributes are of different types, the overall similarity is computed by combining per-attribute similarities:
🞂 the similarity of each attribute is computed using the similarity/dissimilarity definitions for nominal, ordinal, interval, and ratio attributes discussed previously, and is transformed to the range 0 to 1;
🞂 the per-attribute similarities are then averaged. A plain average does not work well for asymmetric attributes, so attributes where both objects have a value of 0 are typically skipped in the average.
Using Weights
🞂 So far, all attributes were treated equally when computing proximity.
🞂 However, some attributes are more important than others.
🞂 To reflect this, the previous formulas can be modified by weighting the contribution of each attribute, e.g.,
  similarity(x, y) = Σ_k w_k s_k(x, y), where the weights w_k sum to 1.
Selecting the Right proximity Measure
🞂The following are a few general observations that may be helpful:

🞂First, the type of proximity measure should fit the type of data.
For many types of dense, continuous data, metric distance
measures such as Euclidean distance are often used.

🞂 Proximity between continuous attributes is most often expressed in terms of differences, and distance measures provide a well-defined way of combining these differences into an overall proximity measure.

🞂For sparse data, which often consists of asymmetric attributes, we


employ similarity measures that ignore 0–0 matches.
Selecting the Right Proximity Measure
🞂 Such sparse objects have only a few of the characteristics described by the attributes, and thus are highly similar in terms of the characteristics they do not have. The cosine, Jaccard, and extended Jaccard measures are appropriate for such data.

🞂There are other characteristics of data vectors that may need to


be considered. Suppose, for example, that we are interested in
comparing time series.

🞂If the magnitude of the time series is important (for example,


each time series represent total sales of the same organization
for a different year), then we could use Euclidean distance.
Selecting the Right proximity Measure

🞂 If the time series represent different quantities (for


example, blood pressure and oxygen consumption), then
we usually want to determine if the time series have the
same shape, not the same magnitude.

🞂 Correlation, which uses a built-in normalization that


accounts for differences in magnitude and level, would be
more appropriate.
Selecting the Right proximity Measure

🞂 In some cases, transformation or normalization of the data


is important for obtaining a proper similarity measure
since such transformations are not always present in
proximity measures.

🞂 For instance, time series may have trends or periodic


patterns that significantly impact similarity. Also, a proper
computation of similarity may require that time lags be
taken into account.
