Unit-5 DM
1. Definition: Relational data is organized as tables consisting of rows (tuples) and columns
(attributes). It is based on set theory and predicate logic.
2. RDBMS: Relational Database Management System implements the relational model for
storing, retrieving, and manipulating data.
3. Keys: Tables must have a key that uniquely identifies each row; the primary key is one key designated for this purpose.
4. Example:
A student table includes columns like roll_no, name, gender, state, and marks.
Schema: students(roll_no: integer, name: string, gender: boolean, maths: integer, ...)
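A minimal sketch of this schema in SQLite (the sample rows are made up, and column types are adapted to SQLite's types):

```python
import sqlite3

# Minimal sketch of the students relation from the example above.
conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE students (
        roll_no INTEGER PRIMARY KEY,   -- primary key uniquely identifies each tuple
        name    TEXT,
        gender  TEXT,
        state   TEXT,
        marks   INTEGER
    )
""")
conn.execute("INSERT INTO students VALUES (1, 'Asha', 'F', 'Telangana', 91)")
conn.execute("INSERT INTO students VALUES (2, 'Ravi', 'M', 'Kerala', 84)")

# A relational query: select tuples (rows) satisfying a predicate on an attribute (column).
for row in conn.execute("SELECT name, marks FROM students WHERE marks > 85"):
    print(row)
```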
1. Approaches:
One approach is to read the data out of the DBMS into the mining tool: fast, but it involves overhead in data transfer, especially for large datasets.
Conclusion:
The relational model is simple, efficient, and widely used. Among the four approaches to
pattern discovery, tightly-coupled vendor implementations are the most efficient, combining
high performance with minimal overhead.
Transactional Data Overview
Definition:
Transactional data consists of records where each record contains a set of items. Examples
include supermarket purchases or search engine queries. This is also called market-basket
data as it models item sets purchased by customers.
○ Example dataset:
| TID | Items Sold |
|-----|-----------------------------|
| 1 | tomato, potato, onion |
| 2 | tomato, potato, brinjal, pumpkin |
| 3 | tomato, potato, onion, chilly |
| 4 | lemon, tamarind, chilly |
Data Formats
1. Horizontal Item-List Format:
○ Most common format. Items are listed sequentially under their TIDs.
○ Example:
| TID | Items |
|-----|-------------------------|
| 1 | tomato, potato, onion |
2. Horizontal Vector Format:
○ Each transaction is stored as a binary vector over the full set of items, with 1 marking the items present.
Richer Attributes:
○ Deals with richer attributes like age, income (quantitative) or postal codes
(categorical).
○ Converts these into binary attributes for mining.
5. Hierarchical Rules:
Comparison of Formats
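To make the comparison concrete, here is a minimal sketch (in Python) that converts the example transactions from the horizontal item-list format into a binary item-vector format; this conversion is also what mining techniques expecting binary attributes rely on:

```python
# Horizontal item-list format: each TID maps to the list of items sold
# (the example transactions from the table above).
transactions = {
    1: ["tomato", "potato", "onion"],
    2: ["tomato", "potato", "brinjal", "pumpkin"],
    3: ["tomato", "potato", "onion", "chilly"],
    4: ["lemon", "tamarind", "chilly"],
}

# Universe of items, in a fixed order, so every transaction can be
# expressed as a binary vector of the same length.
items = sorted({item for itemset in transactions.values() for item in itemset})

# Horizontal vector format: one 0/1 vector per TID.
vectors = {
    tid: [1 if item in itemset else 0 for item in items]
    for tid, itemset in transactions.items()
}

print(items)
for tid, vec in vectors.items():
    print(tid, vec)
```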
The multi-dimensional data model is primarily used in data warehouses, which are
repositories of integrated data collected from multiple volatile sources. Unlike these sources,
a data warehouse is non-volatile and time-variant, accumulating data over several years.
---
Organization:
Data is structured as measures (quantitative data, e.g., sales) and dimensions (qualitative
attributes, e.g., product, location, time).
Hierarchies:
Defined over dimensions (e.g., product hierarchies like specific model → product type →
category).
Storage:
---
Drill-down: View data at a more detailed level (e.g., specific product level).
Slice and Dice: Analyze specific dimension values (e.g., sales for 2003 and 2005 only).
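A minimal sketch of these operations over a small in-memory fact table (the products, locations, and sales figures are made up for illustration):

```python
from collections import defaultdict

# Hypothetical fact table: (product, location, year) dimensions with a sales measure.
facts = [
    ("laptop", "Hyderabad", 2003, 120),
    ("laptop", "Chennai",   2003, 150),
    ("phone",  "Hyderabad", 2005, 300),
    ("phone",  "Chennai",   2005, 280),
]

# Slice/dice: restrict a dimension to chosen values (sales for 2003 and 2005 only).
sliced = [row for row in facts if row[2] in (2003, 2005)]

# Drill-down: view the measure at a more detailed level, e.g., per product per year.
detail = defaultdict(int)
for product, location, year, sales in sliced:
    detail[(product, year)] += sales

print(detail)
```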
---
Regression
Clustering
Preprocessing:
Conversion to binary attributes may be necessary for certain techniques like frequent itemset
mining.
The role of dimensions/measures must be defined (e.g., identifying class attributes for
classification).
---
Key Takeaways
Multi-dimensional data models integrate seamlessly with relational databases, supporting
various data mining techniques for actionable insights.
With technological advancements, datasets have grown in size and complexity, often
distributed geographically. For example, sales records of chain stores may reside in multiple
locations. To mine such datasets, there are two approaches:
1. Data Warehousing Approach: Collect and integrate data into a single repository.
2. Distributed Data Mining Approach: Devise algorithms to mine patterns or run queries at
each site separately, then combine results.
---
Distributed algorithms divide tasks into smaller parts, executed locally at each site.
Local patterns (smaller in size) are transmitted to the central server for global pattern
computation.
Advantages:
Reduces response time, even when data originates from a central server.
---
Aggregates are used as patterns, divided into three types based on ease of computation:
1. Distributive Aggregates: Computed by combining a single summary value from each site
(e.g., maximum).
2. Algebraic Aggregates: Computed using multiple pieces of information from each site (e.g.,
average).
3. Holistic Aggregates: Require full local data for computation (e.g., median).
---
Aggregate Functions:
1. Maximum Items Purchased: Compute local maximum at each site; send to the central
server for the global maximum.
2. Average Items Purchased: Send total items and total customers from each site; central
server calculates the global average.
3. Median Items Purchased: Local medians cannot determine the global median; full local
data must be transferred.
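A minimal sketch of how the central server could combine what each site sends, assuming each site reports only the summaries described above (the per-site numbers are hypothetical):

```python
# Distributive: each site sends only its local maximum.
local_maxima = [12, 9, 17]
global_max = max(local_maxima)

# Algebraic: each site sends (total items, total customers); the server
# combines them into the global average.
local_sums = [(1200, 300), (800, 250), (1500, 400)]
total_items = sum(items for items, _ in local_sums)
total_customers = sum(customers for _, customers in local_sums)
global_avg = total_items / total_customers

# Holistic: the median cannot be combined from local medians; the server
# needs the full local data from every site.
local_data = [[1, 4, 7], [2, 2, 9], [3, 5, 6, 8]]
all_values = sorted(v for site in local_data for v in site)
n = len(all_values)
global_median = (all_values[(n - 1) // 2] + all_values[n // 2]) / 2

print(global_max, round(global_avg, 2), global_median)
```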
---
Local frequent itemsets are computed and combined globally using the Partition Algorithm.
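A minimal sketch of the idea behind this combination (the itemsets and counts are hypothetical): any itemset that is frequent globally must be locally frequent at some site, so the union of the local results is a complete candidate set, whose exact global counts are then gathered in a second round.

```python
# Hypothetical locally frequent itemsets reported by each site.
local_frequents = [
    {frozenset({"tomato", "potato"}), frozenset({"onion"})},      # site 1
    {frozenset({"tomato", "potato"}), frozenset({"chilly"})},     # site 2
]

# Union of local results: a complete candidate set for the global answer.
candidates = set().union(*local_frequents)

# Second round: each site counts every candidate over its local data and
# sends the counts; the server adds them and applies the global threshold.
local_counts = [
    {frozenset({"tomato", "potato"}): 40, frozenset({"onion"}): 25, frozenset({"chilly"}): 5},
    {frozenset({"tomato", "potato"}): 35, frozenset({"onion"}): 4,  frozenset({"chilly"}): 30},
]
min_support = 50
global_frequents = {
    c for c in candidates
    if sum(counts.get(c, 0) for counts in local_counts) >= min_support
}
print(global_frequents)
```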
Other Tasks:
---
Conclusion
Distributed data mining efficiently handles large and distributed datasets by leveraging
parallel processing and reducing data transfer. While some aggregates (like holistic
functions) pose challenges, most tasks can be accomplished with distributive or algebraic
functions, ensuring the feasibility of mining distributed data.
Spatial data refers to information where the location of objects in space holds significance. It
includes data types such as geographic maps, medical images, engineering drawings, and
architectural designs. A significant portion of real-world data has a spatial component,
making it essential to handle spatial data in specialized ways for pattern discovery.
---
Spatial Databases:
A spatial database supports spatial data types and queries. It enables efficient access to
spatial objects through spatial indexes and allows the use of spatial predicates (e.g., near,
adjacent, inside) in selection and join queries.
Spatial Data Types:
1. Points: Spatial objects with a location but no extent (e.g., cities on a small-scale map).
2. Lines: Spatial objects with length but negligible width (e.g., roads, rivers).
3. Regions: Spatial objects with an area that may have holes or consist of multiple disjoint
parts (e.g., lakes, countries).
Spatial Predicates:
Predicates allow the comparison of spatial objects and include topological relations (e.g., adjacent, inside), distance relations (e.g., near), and direction relations (e.g., north of).
---
2. Spatial Queries
Selection Queries:
Used to select spatial objects that satisfy a spatial predicate (e.g., all objects inside a given region).
Join Queries:
Link spatial objects from different tables based on spatial relationships (e.g., pairing each object with the objects near or adjacent to it).
Spatial Index:
A spatial index is used to enable efficient retrieval of spatial objects within a specified
bounding region. This reduces the need to retrieve all objects and improves query
performance.
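As a minimal sketch (not an actual spatial index, just the bounding-box test such an index accelerates), a selection query over a bounding region could look like this; the object names and coordinates are hypothetical:

```python
# Each spatial object is approximated by its bounding box (xmin, ymin, xmax, ymax).
objects = {
    "lake_a": (2, 2, 4, 3),
    "park_b": (10, 10, 12, 14),
    "plot_c": (3, 1, 5, 2),
}

def inside(box, region):
    """True if the object's bounding box lies inside the query region."""
    xmin, ymin, xmax, ymax = box
    rxmin, rymin, rxmax, rymax = region
    return rxmin <= xmin and rymin <= ymin and xmax <= rxmax and ymax <= rymax

# Selection query with a spatial predicate: objects inside a bounding region.
query_region = (0, 0, 6, 5)
print([name for name, box in objects.items() if inside(box, query_region)])
```

A spatial index (such as an R-tree) avoids testing every object by grouping nearby bounding boxes, so only candidate objects near the query region are examined.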
---
Location Prediction:
Predicting the location of objects (e.g., species habitats) based on known data. An example
could be discovering that red-winged blackbirds live in humid and cold wetlands.
Spatial Outliers:
Identifying objects that significantly differ from their neighboring objects. For instance, a
newly constructed building in an older area of a city could be considered a spatial outlier.
---
Conclusion
Spatial data mining involves efficiently querying and analyzing large spatial datasets, often
through the use of spatial indexes and predicates. By augmenting spatial databases with
additional attributes, similar techniques to relational databases can be applied to discover
meaningful patterns, such as location predictions and spatial outliers. Moreover, OLAP
queries can reveal spatial hierarchies, enabling deeper insights into spatial data.
A data stream is a continuous flow of data records, which can include web logs, stock prices,
network traffic, phone conversations, ATM transactions, and sensor data. The challenge in
working with data streams is that the data is too large and arrives too quickly to be stored
entirely, requiring efficient methods for processing and pattern discovery.
Synopses (or sketches) are used to store small summaries of the stream due to limited
storage. Pattern discovery is based on these synopses, not on the entire data.
Approximate Results: Exact pattern discovery is not always feasible, but approximate results
are acceptable in most scenarios, with some error margin (ɛ, δ) in the computed patterns.
Continuous Queries: These are queries that execute continuously as the data updates, such
as tracking running totals or frequent itemsets.
Recent Patterns: In many applications, recent data records are more important than older
ones, and algorithms must focus on finding emerging patterns over recent data, rather than
the entire stream.
2. Kinds of Synopses
Random Sampling: A small set of randomly chosen records from the stream serves as a
sample. Algorithms like reservoir sampling are used when the size of the stream is unknown
in advance (see the sketch after this list).
Sliding Windows: Only the most recent w records are considered, where w is the window
size, useful when only recent records are relevant.
Wavelets: A transform applied to data that reduces the data’s size while retaining important
patterns. It can be used to summarize a stream efficiently.
Snapshots and Time Frames: To capture recent patterns, snapshots of the stream are stored
over different time frames. These time frames may follow natural, logarithmic, or pyramidal
designs, giving more weight to recent data.
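A minimal sketch of two of these synopses, a reservoir sample and a sliding window (the stream values are hypothetical):

```python
import random
from collections import deque

def reservoir_sample(stream, k):
    """Keep a uniform random sample of k records from a stream of unknown length."""
    sample = []
    for i, record in enumerate(stream):
        if i < k:
            sample.append(record)
        else:
            # Replace an existing element with probability k / (i + 1).
            j = random.randint(0, i)
            if j < k:
                sample[j] = record
    return sample

# Sliding window: keep only the most recent w records.
w = 3
window = deque(maxlen=w)
for record in [5, 1, 8, 2, 9, 4]:
    window.append(record)

print(reservoir_sample(range(1000), 10))
print(list(window))   # only the last w records are retained
```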
Pattern Discovery as Aggregates: Most pattern discovery tasks can be treated as computing
distributive or algebraic aggregates (e.g., maximum, average) over the stream.
Frequent Itemset Mining: Using techniques like the Partition algorithm, local frequent
itemsets from partitions of the stream are combined to estimate global patterns.
Classification and Clustering: Techniques like bagging and boosting can be applied in
classification by learning classifiers for each data chunk, and clustering can be performed
using efficient algorithms like Birch that work well with single-pass data.
Handling holistic aggregates (e.g., median), which require the entire dataset, is difficult in
data streams. These aggregates require special algorithms to be computed effectively.
Data streams may require incremental algorithms that update pattern discovery results as
new data arrives.
An analyst tracking stock market data may want to compute the maximum, average, and
median number of shares bought at a time. Here's how these aggregates can be computed:
Maximum (Distributive Aggregate): Store the current maximum share count. When new data
arrives, update if the new value is larger.
Average (Algebraic Aggregate): Maintain a running total of shares and the count of records.
Compute the average by dividing the total by the count.
Median (Holistic Aggregate): This is harder to compute in data streams, as maintaining the
current median requires storing all past records.
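A minimal sketch of the incremental updates described above, assuming the stream delivers one share count at a time (the values are hypothetical):

```python
# Hypothetical stream of share counts.
stream = [120, 75, 300, 50, 410, 90]

running_max = float("-inf")   # distributive: one stored value suffices
running_sum = 0               # algebraic: (sum, count) suffice for the average
count = 0

for shares in stream:
    running_max = max(running_max, shares)
    running_sum += shares
    count += 1

print(running_max, running_sum / count)
# The median would need all past values (or a special-purpose sketch),
# which is why it is hard to maintain over a stream.
```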
6. Conclusion
Discovering patterns in data streams involves efficiently summarizing the data into
manageable synopses, performing approximate computations, and leveraging incremental
algorithms. Though challenges exist, especially with holistic aggregates, modern algorithms
allow for effective pattern discovery in real-time data streams.
Time-series data consists of measurements recorded over time, such as stock prices or sensor readings.
Data Model:
A time-series can be represented as pairs of <time, measurement>.
Visualizing time-series data is often done graphically, with time on the x-axis and measured
values on the y-axis.
Pattern Discovery in Time-Series: There are three main tasks involved in pattern discovery
within time-series data:
1. Trend Analysis:
Long-term trends: Overall movement of the time-series (e.g., upward or downward trends),
often detected by averaging nearby measurements.
Cycles: Repetitive behaviors identified using statistical correlation, such as recurring patterns
at fixed intervals.
Outliers: Measurements significantly different from the norm, indicating rare or significant
events.
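A minimal sketch of detecting the long-term trend by averaging nearby measurements (a simple moving average; the series values are hypothetical):

```python
def moving_average(series, window):
    """Smooth a time-series by averaging each value with its neighbours."""
    smoothed = []
    for i in range(len(series) - window + 1):
        smoothed.append(sum(series[i:i + window]) / window)
    return smoothed

series = [10, 12, 9, 14, 15, 13, 18, 17, 21]
print(moving_average(series, window=3))   # the upward long-term movement becomes clearer
```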
2. Prediction:
Curve-fitting: Statistical techniques like regression used to predict future values, though this
can be inaccurate due to noise.
Auto-regression: A method where future values are predicted based on a linear combination
of previous measurements. This technique uses autoregressive parameters computed from
the data.
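A minimal sketch of auto-regression, estimating the autoregressive parameters by least squares so that each value is predicted from a linear combination of the previous p measurements (the series is hypothetical; numpy is assumed to be available, and no intercept term is fitted):

```python
import numpy as np

def fit_ar(series, p):
    """Least-squares estimate of AR(p) parameters from past measurements."""
    X = np.array([series[i:i + p] for i in range(len(series) - p)])
    y = np.array(series[p:])
    params, *_ = np.linalg.lstsq(X, y, rcond=None)
    return params

def predict_next(series, params):
    """Predict the next value as a linear combination of the last p values."""
    p = len(params)
    return float(np.dot(series[-p:], params))

series = [10.0, 12.0, 11.0, 13.0, 12.5, 14.0, 13.5, 15.0]
params = fit_ar(series, p=2)
print(predict_next(series, params))
```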
3. Similarity Search:
Important for applications like clustering or comparing time-series, such as stock prices over
time.
Sequences may have different lengths, missing values, or outliers, making similarity search
challenging.
Similarity is determined by how many transformations (e.g., adding, deleting, or scaling
values) are needed to make one sequence match another. Techniques like multidimensional
index structures (e.g., R-trees) help efficiently find similar sequences.
Text and web data comprise unstructured or semi-structured information like books, articles,
emails, web pages, and XML documents. The key challenge in this domain is mining
patterns effectively.
1. Data Model:
Text documents are structured as sequences of characters, often organized into titles,
sections, and paragraphs. Web pages include links, providing additional context.
Techniques like hierarchical clustering can infer relationships between documents and
create pseudo-links.
2. Document Representation:
Bag-of-Words Model: Documents are represented as collections of words, optionally with
their frequencies.
Preprocessing:
Stop-Words Removal: Eliminates frequent but uninformative words like "the" or "and".
Stemming: Reduces words to their root forms (e.g., "learned" → "learn").
TF-IDF: Balances term frequency in a document against its rarity across the collection.
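A minimal sketch of the bag-of-words representation with one common TF-IDF weighting (stop-word removal and stemming are omitted; the documents are made up):

```python
import math
from collections import Counter

docs = [
    "data mining finds patterns in data",
    "web mining analyses web pages and links",
    "students study data mining and databases",
]

# Bag-of-words: each document becomes a multiset of its words.
bags = [Counter(doc.split()) for doc in docs]
vocab = sorted(set(word for bag in bags for word in bag))

def tf_idf(word, bag, bags):
    tf = bag[word] / sum(bag.values())        # frequency of the term within the document
    df = sum(1 for b in bags if word in b)    # number of documents containing the term
    idf = math.log(len(bags) / df)            # rarer terms get higher weight
    return tf * idf

vectors = [[tf_idf(w, bag, bags) for w in vocab] for bag in bags]
print(vocab)
print([round(x, 3) for x in vectors[0]])
```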
3. Classification:
Categorizes documents into predefined classes using features like keywords, authors, or
metadata.
Popular methods: Naïve Bayes, Support Vector Machines, etc.
Applications: Spam detection, email routing, topic categorization.
4. Clustering:
Groups similar documents into clusters, often for hierarchical classification.
Measures like cosine similarity evaluate document relationships.
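A minimal sketch of cosine similarity between two document vectors (for instance, TF-IDF vectors like those in the earlier sketch):

```python
import math

def cosine_similarity(u, v):
    """Cosine of the angle between two document vectors; 1.0 means identical direction."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)

print(cosine_similarity([1.0, 0.0, 2.0], [0.5, 0.0, 1.0]))   # 1.0: same direction
print(cosine_similarity([1.0, 0.0, 0.0], [0.0, 1.0, 0.0]))   # 0.0: no shared terms
```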
5. Summarization:
Automates extraction of key sentences for a concise summary.
Heuristics include sentence position and keyword relevance.
7. Document Understanding:
- Uses computational linguistics (e.g., WordNet) for semantic enrichment.
- Enhances statistical features by adding synonyms, categories, and sentence meanings.
Conclusion:
Text and web mining techniques enable efficient handling of vast unstructured data. By
leveraging models like TF-IDF, clustering, and classification, and integrating linguistic
insights, these techniques address challenges in organization, retrieval, and understanding.
Multimedia data includes images, audio, and video.
Feature Extraction: Features are extracted and stored numerically or categorically for tasks
like classification, clustering, and frequent pattern mining.
1. Multimedia Search:
- Users query with example documents.
- If textual metadata exists, search resembles text-based techniques.
- Without metadata, similarity relies on weighted Euclidean distances between extracted
features.
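A minimal sketch of similarity search without metadata, ranking stored items by weighted Euclidean distance between extracted feature vectors (the feature values and weights are hypothetical):

```python
import math

def weighted_euclidean(u, v, weights):
    """Distance between two feature vectors, with per-feature importance weights."""
    return math.sqrt(sum(w * (a - b) ** 2 for a, b, w in zip(u, v, weights)))

# Hypothetical extracted features, e.g., (average colour, brightness, edge density).
database = {
    "image_1": [0.2, 0.7, 0.1],
    "image_2": [0.8, 0.3, 0.5],
    "image_3": [0.25, 0.65, 0.15],
}
query = [0.22, 0.68, 0.12]          # features of the user's example document
weights = [1.0, 0.5, 2.0]

ranked = sorted(database, key=lambda name: weighted_euclidean(database[name], query, weights))
print(ranked)   # most similar items first
```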
2. Classification:
- Multimedia is classified into predefined categories (e.g., movie genre, singer, actor).
- For video, features like audio volume, color histograms, and shot changes are analyzed
to differentiate ads, movies, or news.
3. Clustering:
- Multimedia clustering organizes data hierarchically, akin to text clustering.
- Frames in a video sequence can be clustered based on similarity to identify meaningful
shots.
Applications:
The discussed tasks are foundational for multimedia organization, retrieval, and analysis,
making them crucial areas of active research.