Unit-5 DM

The document discusses the relational data model, pattern discovery techniques in relational databases, and various data types including transactional, multi-dimensional, distributed, spatial, and data streams. It outlines methods for pattern discovery such as OLAP, frequent itemset mining, and the challenges faced in processing large datasets. The conclusion emphasizes the importance of these models and techniques in efficiently extracting meaningful insights from diverse data sources.

Relational Data Model and Pattern Discovery in Relational Databases

Relational Data Model

1. Definition: Relational data is organized as tables consisting of rows (tuples) and columns
(attributes). It is based on set theory and predicate logic.

2. RDBMS: Relational Database Management System implements the relational model for
storing, retrieving, and manipulating data.

3. Schema and Keys:

Each table has attributes with fixed domains (schema).

Tables must have a key to uniquely identify rows; the primary key is a designated key.

4. Example:

A student table includes columns like roll_no, name, gender, state, and marks.

Schema: students(roll_no: integer, name: string, gender: boolean, maths: integer, ...)

Primary Key: roll_no.

Pattern Discovery in Relational Databases

1. Approaches:

(a) Using SQL:

SQL allows data retrieval, sorting, grouping, and aggregation.

Manually discovering patterns is possible but tedious and error-prone.
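A minimal illustration of this manual approach, using Python's built-in sqlite3 with a hypothetical students table (column names and marks are made up for the example):

```python
import sqlite3

# In-memory database with a hypothetical students table.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE students (roll_no INTEGER PRIMARY KEY,"
    " name TEXT, state TEXT, maths INTEGER)"
)
conn.executemany("INSERT INTO students VALUES (?, ?, ?, ?)", [
    (1, "Asha",  "Kerala",    82),
    (2, "Ravi",  "Karnataka", 67),
    (3, "Meena", "Kerala",    91),
    (4, "Kiran", "Karnataka", 55),
])

# Grouping and aggregation: average maths marks per state.
# Any "pattern" (e.g., which state scores higher) must be read
# off the result by hand -- hence tedious and error-prone at scale.
rows = conn.execute(
    "SELECT state, AVG(maths) FROM students"
    " GROUP BY state ORDER BY AVG(maths) DESC"
).fetchall()
```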

(b) OLAP and Data Mining in RDBMS:

Embed SQL in a host programming language (e.g., Python, C++) via language extensions.

SQL cursor interfaces and stored procedures for query execution.


User-defined functions for specific needs.

Requires expertise in algorithm implementation.

(c) Data Transfer to Other Software:

Relational data is transferred to external data mining/warehousing tools.

The external tools themselves are fast, but transferring the data adds overhead, especially for large datasets.

(d) Tightly-Coupled Vendor Implementations:

Directly execute OLAP and data mining queries within RDBMS.

Efficient due to direct access to relational data.

Requires identification of roles (e.g., dimensions, hierarchies, measures).

Includes preprocessing techniques (e.g., normalization, binning).

Most promising method due to no transfer overhead and high efficiency.

Steps for OLAP and Data Mining in RDBMS:

1. Identify data warehouse dimensions and define hierarchies.

2. Create measure tables and cube-objects.

3. Execute queries using created structures.

4. Preprocess data for mining (e.g., normalization, binary conversion).

Conclusion:

The relational model is simple, efficient, and widely used. Among the four approaches to
pattern discovery, tightly-coupled vendor implementations are the most efficient, combining
high performance with minimal overhead.
Transactional Data Overview

Definition:
Transactional data consists of records where each record contains a set of items. Examples include supermarket purchases or search engine queries. This is also called market-basket data as it models item sets purchased by customers.

Key Concepts in Transactional Data

1. Data Model:

○ Comprises records with unique transaction IDs (TIDs) and associated items.

○ Example dataset:

| TID | Items Sold |
|-----|------------|
| 1 | tomato, potato, onion |
| 2 | tomato, potato, brinjal, pumpkin |
| 3 | tomato, potato, onion, chilly |
| 4 | lemon, tamarind, chilly |

○ Data needs normalization for relational databases.

Data Formats

1. Horizontal List Format:

○ Most common format. Items are listed sequentially under their TIDs.
○ Example:

| TID | Items |
|-----|-------|
| 1 | tomato, potato, onion |

2. Horizontal Vector Format:

○ Represents each record as a bit vector.
○ Example:

| TID | tomato | potato | onion | brinjal | ... |
|-----|--------|--------|-------|---------|-----|
| 1 | 1 | 1 | 1 | 0 | ... |

3. Vertical List Format:

○ Stores TIDs for each item.
○ Example:

| Item | TIDs |
|------|------|
| tomato | 1, 2, 3 |

4. Vertical Vector Format:

○ Each item has a bit vector showing presence in transactions.
○ Example:

| Item | TID Vector |
|------|------------|
| tomato | 1, 1, 1, 0 |
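The four formats can be derived from one another; a small sketch using the example dataset (plain Python; the dictionary names are illustrative):

```python
# The example transactions, keyed by TID (horizontal list format).
horizontal = {
    1: ["tomato", "potato", "onion"],
    2: ["tomato", "potato", "brinjal", "pumpkin"],
    3: ["tomato", "potato", "onion", "chilly"],
    4: ["lemon", "tamarind", "chilly"],
}
items = sorted({i for row in horizontal.values() for i in row})
tids = sorted(horizontal)

# Horizontal vector format: one bit per item, per transaction.
h_vec = {t: [1 if i in horizontal[t] else 0 for i in items] for t in tids}

# Vertical list format: for each item, the TIDs that contain it.
v_list = {i: [t for t in tids if i in horizontal[t]] for i in items}

# Vertical vector format: one bit per transaction, per item.
v_vec = {i: [1 if t in v_list[i] else 0 for t in tids] for i in items}
```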

Pattern Discovery in Transactional Data

1. Frequent Itemsets:

○ An itemset is frequent if its support ≥ minsup (threshold).
○ Indicates correlation between items.

2. Association Rules:

○ Rule format: X → Y (e.g., "tomato → potato").
○ Confidence: Fraction of transactions with X that also have Y.

3. Implication Rules:

○ Based on conviction (a better measure than confidence).

4. Quantitative & Categorical Rules:

○ Deals with richer attributes like age, income (quantitative) or postal codes (categorical).
○ Converts these into binary attributes for mining.

5. Hierarchical Rules:

○ Utilizes item hierarchies (e.g., biological classification trees).
○ Rules may involve internal nodes (general items) or leaves (specific items).

6. Cyclic/Periodic Rules:

○ Detects patterns with regular time-based variations (e.g., seasonal shopping trends).

7. Sequential Rules:

○ Focuses on sequences of items (e.g., customer visits, webpage navigation).

8. Classification & Clustering:

○ Uses frequent itemsets for classifying objects or summarizing data classes.
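Support and confidence over the running example can be computed directly; a minimal sketch (function names are illustrative):

```python
# The four example transactions as sets of items.
transactions = [
    {"tomato", "potato", "onion"},
    {"tomato", "potato", "brinjal", "pumpkin"},
    {"tomato", "potato", "onion", "chilly"},
    {"lemon", "tamarind", "chilly"},
]

def support(itemset):
    """Fraction of transactions that contain every item in itemset."""
    return sum(itemset <= t for t in transactions) / len(transactions)

def confidence(x, y):
    """Confidence of rule X -> Y: support(X union Y) / support(X)."""
    return support(x | y) / support(x)

sup = support({"tomato", "potato"})        # 3 of 4 transactions
conf = confidence({"tomato"}, {"potato"})  # every tomato basket also has potato
```

With minsup = 0.5, {tomato, potato} is frequent (support 0.75), and the rule "tomato → potato" has confidence 1.0.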


Applications of Transactional Data

● Market Basket Analysis: Identifies frequently purchased items together.
● Recommendation Systems: Suggests products based on user preferences.
● Business Insights: Helps understand customer behavior.
● Web Usage Mining: Optimizes navigation by analyzing web activity.

Comparison of Formats

● Horizontal List: Common input format for mining.
● Horizontal Vector: Quick presence checking.
● Vertical List/Vector: Efficient for counting and set operations using intersections or bitwise AND.
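The counting advantage of the vertical formats can be shown in a few lines; the TID lists below come from the running example:

```python
# Vertical lists: TIDs per item (from the example dataset).
tid_lists = {
    "tomato": {1, 2, 3},
    "potato": {1, 2, 3},
    "onion":  {1, 3},
    "chilly": {3, 4},
}

# Support count of {tomato, onion} = size of the TID-set intersection.
count = len(tid_lists["tomato"] & tid_lists["onion"])

# Same count via vertical bit vectors: bitwise AND, then a popcount.
def to_bits(tids):
    """Pack a TID set into an integer bit vector (bit t-1 set for TID t)."""
    return sum(1 << (t - 1) for t in tids)

anded = to_bits(tid_lists["tomato"]) & to_bits(tid_lists["onion"])
count_bits = bin(anded).count("1")
```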

By understanding these concepts, transactional data mining helps in decision-making and discovering hidden patterns, making it an essential tool for businesses and research.

Multi-Dimensional Data Model

The multi-dimensional data model is primarily used in data warehouses, which are
repositories of integrated data collected from multiple volatile sources. Unlike these sources,
a data warehouse is non-volatile and time-variant, accumulating data over several years.

---

1. Data Model Structure

Organization:

Data is structured as measures (quantitative data, e.g., sales) and dimensions (qualitative
attributes, e.g., product, location, time).

Users define measures and dimensions to facilitate exploration.

Data is conceptually stored as a multi-dimensional array, where:

Each dimension corresponds to a warehouse dimension.

Each cell stores measure values.


Concept Hierarchies:

Defined over dimensions (e.g., product hierarchies like specific model → product type →
category).

Granularity determines the level of detail (finer granularity = more detail).

Storage:

Specialized storage structures or plain relational databases.

Fact table: Stores measure values for dimensions.

Other tables describe dimension attributes and hierarchies.

---

2. Pattern Discovery Using OLAP

Pattern discovery involves user interaction through OLAP (Online Analytical Processing) queries:

Selection of Dimensions: Users select a subset of dimensions to explore (typically 3–4 at a time).

Roll-up and Drill-down:

Roll-up: View data at a less detailed level (e.g., category level).

Drill-down: View data at a more detailed level (e.g., specific product level).

Slice and Dice: Analyze specific dimension values (e.g., sales for 2003 and 2005 only).
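The roll-up and slice operations can be sketched in plain Python, assuming a hypothetical fact table and product hierarchy:

```python
# Hypothetical sales at the finest granularity: (product, year) -> units sold.
sales = {
    ("alpha-100", 2003): 120, ("alpha-200", 2003): 80,
    ("beta-5",    2003): 200, ("alpha-100", 2005): 150,
    ("beta-5",    2005): 90,
}
# Concept hierarchy: specific model -> product type.
category = {"alpha-100": "phones", "alpha-200": "phones", "beta-5": "tablets"}

# Roll-up: aggregate the measure one level up the product hierarchy.
rollup = {}
for (product, year), units in sales.items():
    key = (category[product], year)
    rollup[key] = rollup.get(key, 0) + units

# Slice: restrict the cube to a single dimension value (year == 2003).
slice_2003 = {k: v for k, v in rollup.items() if k[1] == 2003}
```

Drill-down is simply the reverse: moving from the rolled-up category view back to the per-product cells.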

---

3. Pattern Discovery Techniques

Multi-dimensional data supports all relational database pattern discovery techniques, including:

Frequent pattern mining


Classification

Regression

Clustering

Preprocessing:

Conversion to binary attributes may be necessary for certain techniques like frequent itemset
mining.

The role of dimensions/measures must be defined (e.g., identifying class attributes for
classification).

---

Key Takeaways

Multi-dimensional data models enable subject-oriented data organization and facilitate pattern discovery using OLAP operations.

They integrate seamlessly with relational databases, supporting various data mining
techniques for actionable insights.

Distributed Data Mining

With technological advancements, datasets have grown in size and complexity, often
distributed geographically. For example, sales records of chain stores may reside in multiple
locations. To mine such datasets, there are two approaches:

1. Data Warehousing Approach: Collect and integrate data into a single repository.

2. Distributed Data Mining Approach: Devise algorithms to mine patterns or run queries at
each site separately, then combine results.

---

1. Distributed Data Model


Structure:

Portions of data are stored at different sites.

A central server initiates mining or OLAP queries and collects results.

Challenges in Data Transfer:

Transferring large datasets over networks is inefficient.

Distributed algorithms divide tasks into smaller parts, executed locally at each site.

Local patterns (smaller in size) are transmitted to the central server for global pattern
computation.

Advantages:

Utilizes parallel processing of multiple machines.

Reduces response time, even when data originates from a central server.

---

2. Pattern Discovery in Distributed Data

Combining Local Patterns:

Combining local patterns into global patterns is complex.

Aggregates are used as patterns, divided into three types based on computation ease:

1. Distributive Aggregates: Easily computed from local results (e.g., maximum).

2. Algebraic Aggregates: Computed using multiple pieces of information from each site (e.g.,
average).

3. Holistic Aggregates: Require full local data for computation (e.g., median).
---

3. Example: Supermarket Sales Data

Aggregate Functions:

1. Maximum Items Purchased: Compute local maximum at each site; send to the central
server for the global maximum.

2. Average Items Purchased: Send total items and total customers from each site; central
server calculates the global average.

3. Median Items Purchased: Local medians cannot determine the global median; full local
data must be transferred.
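The three aggregate types behave as follows in code; the per-site values are made up for illustration:

```python
# Hypothetical per-site data: items purchased per customer at three stores.
sites = [
    [4, 7, 2, 9],     # site A
    [3, 8, 8],        # site B
    [5, 1, 6, 2, 4],  # site C
]

# Distributive: the global maximum needs only the local maxima.
global_max = max(max(s) for s in sites)

# Algebraic: the global average needs a (sum, count) pair per site.
pairs = [(sum(s), len(s)) for s in sites]
global_avg = sum(p[0] for p in pairs) / sum(p[1] for p in pairs)

# Holistic: local medians cannot be combined; the full data must be
# pooled at the central server (upper median taken for an even count).
pooled = sorted(x for s in sites for x in s)
global_median = pooled[len(pooled) // 2]
```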

---

4. Distributed Data Mining Tasks

Frequent Itemset Mining:

Data across sites treated as partitions.

Local frequent itemsets are computed and combined globally using the Partition Algorithm.

Other Tasks:

Techniques for classification, clustering, and regression adapt distributive or algebraic aggregates, minimizing data transfer.

---

Conclusion
Distributed data mining efficiently handles large and distributed datasets by leveraging
parallel processing and reducing data transfer. While some aggregates (like holistic
functions) pose challenges, most tasks can be accomplished with distributive or algebraic
functions, ensuring the feasibility of mining distributed data.

Spatial Data and Pattern Discovery

Spatial data refers to information where the location of objects in space holds significance. It
includes data types such as geographic maps, medical images, engineering drawings, and
architectural designs. A significant portion of real-world data has a spatial component,
making it essential to handle spatial data in specialized ways for pattern discovery.

---

1. The Spatial Data Model

Spatial Databases:
A spatial database supports spatial data types and queries. It enables efficient access to
spatial objects through spatial indexes and allows the use of spatial predicates (e.g., near,
adjacent, inside) in selection and join queries.

Spatial Data Types:


Spatial databases support three primary data types:

1. Points: Represented by coordinates (x, y, z) or latitude and longitude for geographical


locations.

2. Lines: Polylines representing connected line segments (e.g., roads, bridges).

3. Regions: Spatial objects with an area that may have holes or consist of multiple disjoint
parts (e.g., lakes, countries).

Spatial Predicates:
Predicates allow the comparison of spatial objects and are classified into:

1. Topological Predicates: Invariant under transformations (e.g., adjacent, inside, disjoint).

2. Direction Predicates: Based on direction (e.g., above, below, left of).


3. Metric Predicates: Based on distance (e.g., distance ≤10 km).

---

2. Spatial Queries

Selection Queries:
Used to select spatial objects based on spatial predicates. Examples include:

Listing all cities in a region.

Identifying blood vessels passing through specific organs.

Join Queries:
Involves linking spatial objects from different tables based on spatial relationships. Examples
include:

Mapping cities to their states.

Finding ATMs near petrol stations.

Spatial Index:
A spatial index is used to enable efficient retrieval of spatial objects within a specified
bounding region. This reduces the need to retrieve all objects and improves query
performance.
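A metric-predicate selection ("objects within distance r") can be sketched with a brute-force scan; a spatial index would answer the same query without examining every object. Names and coordinates are hypothetical:

```python
import math

# Hypothetical point objects: name -> (x, y) coordinates.
points = {
    "ATM-1":    (1.0, 1.0),
    "ATM-2":    (4.0, 5.0),
    "ATM-3":    (9.0, 9.0),
    "Petrol-A": (0.0, 0.0),
}

def within(center, radius, objects):
    """Metric-predicate selection: objects at distance <= radius."""
    cx, cy = center
    return sorted(
        name for name, (x, y) in objects.items()
        if math.hypot(x - cx, y - cy) <= radius
    )

# "ATMs near the petrol station": distance <= 2 units from Petrol-A.
near = within(points["Petrol-A"], 2.0, points)
```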

---

3. Pattern Discovery in Spatial Data

Preprocessing Spatial Data:


Spatial data tables are augmented with additional attributes, such as the length or area of
spatial objects, or the predicates they satisfy (e.g., being near an ATM). This preprocessing
can be performed logically during query execution rather than physically stored on disk.

The "First Law of Geography":


This law states that objects near each other are more related than distant ones. Thus, predicates are generally considered only for objects close to the object of interest.
---

4. Types of Patterns in Spatial Data

Location Prediction:
Predicting the location of objects (e.g., species habitats) based on known data. An example
could be discovering that red-winged blackbirds live in humid and cold wetlands.

Spatial Outliers:
Identifying objects that significantly differ from their neighboring objects. For instance, a
newly constructed building in an older area of a city could be considered a spatial outlier.

---

5. OLAP Queries for Spatial Data

Spatial OLAP (On-Line Analytical Processing):


OLAP queries are particularly useful for spatial data because of the inherent hierarchical
structure. For example, a colony belongs to a city, which is in a state, and ultimately in a
country. Spatial attributes can be used to roll-up and drill-down in these hierarchical
structures, enabling advanced spatial analysis.

---

Conclusion

Spatial data mining involves efficiently querying and analyzing large spatial datasets, often
through the use of spatial indexes and predicates. By augmenting spatial databases with
additional attributes, similar techniques to relational databases can be applied to discover
meaningful patterns, such as location predictions and spatial outliers. Moreover, OLAP
queries can reveal spatial hierarchies, enabling deeper insights into spatial data.

Data Streams and Pattern Discovery

A data stream is a continuous flow of data records, which can include web logs, stock prices,
network traffic, phone conversations, ATM transactions, and sensor data. The challenge in
working with data streams is that the data is too large and arrives too quickly to be stored
entirely, requiring efficient methods for processing and pattern discovery.

1. Data Stream Model


A data stream consists of a sequence of records, each having numeric or categorical
attributes or sets of items.

Synopses (or sketches) are used to store small summaries of the stream due to limited
storage. Pattern discovery is based on these synopses, not on the entire data.

Approximate Results: Exact pattern discovery is not always feasible, but approximate results
are acceptable in most scenarios, with some error margin (ɛ, δ) in the computed patterns.

Continuous Queries: These are queries that execute continuously as the data updates, such
as tracking running totals or frequent itemsets.

Recent Patterns: In many applications, recent data records are more important than older
ones, and algorithms must focus on finding emerging patterns over recent data, rather than
the entire stream.

2. Kinds of Synopses

Several techniques for creating synopses to represent data streams include:

Random Sampling: A small set of randomly chosen records from the stream serves as a
sample. Algorithms like reservoir sampling are used for this when the size of the stream is
unknown.
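Reservoir sampling keeps a uniform sample in one pass over a stream of unknown length; a compact sketch of the classic Algorithm R:

```python
import random

def reservoir_sample(stream, k, rng=None):
    """Maintain a uniform random sample of k records from a stream of
    unknown length, using O(k) memory."""
    rng = rng or random.Random()
    reservoir = []
    for n, record in enumerate(stream):
        if n < k:
            reservoir.append(record)   # fill the reservoir first
        else:
            j = rng.randrange(n + 1)   # replace a slot with prob. k/(n+1)
            if j < k:
                reservoir[j] = record
    return reservoir

sample = reservoir_sample(range(10_000), k=100, rng=random.Random(42))
```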

Histograms: A frequency distribution of attributes is stored in buckets, representing how records fall within different attribute ranges.

Sliding Windows: Only the most recent w records are considered, where w is the window
size, useful when only recent records are relevant.

Wavelets: A transform applied to data that reduces the data’s size while retaining important
patterns. It can be used to summarize a stream efficiently.

Snapshots and Time Frames: To capture recent patterns, snapshots of the stream are stored
over different time frames. These time frames may follow natural, logarithmic, or pyramidal
designs, giving more weight to recent data.

3. Pattern Discovery in Data Streams

Pattern discovery in data streams is typically done using aggregates or approximate summaries:

Pattern Discovery as Aggregates: Most pattern discovery tasks can be treated as computing distributive or algebraic aggregates (e.g., maximum, average) over the stream.

Frequent Itemset Mining: Using techniques like the Partition algorithm, local frequent itemsets from partitions of the stream are combined to estimate global patterns.

Classification and Clustering: Techniques like bagging and boosting can be applied in
classification by learning classifiers for each data chunk, and clustering can be performed
using efficient algorithms like Birch that work well with single-pass data.

4. Challenges and Algorithms

Handling holistic aggregates (e.g., median), which require the entire dataset, is difficult in
data streams. These aggregates require special algorithms to be computed effectively.

Data streams may require incremental algorithms that update pattern discovery results as
new data arrives.

5. Example: Stock Market Data Analysis

An analyst tracking stock market data may want to compute the maximum, average, and
median number of shares bought at a time. Here's how these aggregates can be computed:

Maximum (Distributive Aggregate): Store the current maximum share count. When new data
arrives, update if the new value is larger.

Average (Algebraic Aggregate): Maintain a running total of shares and the count of records.
Compute the average by dividing the total by the count.

Median (Holistic Aggregate): This is harder to compute in data streams, as maintaining the
current median requires storing all past records.
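The first two aggregates need only constant state per stream; a minimal sketch:

```python
class RunningStats:
    """O(1) state for a distributive (max) and an algebraic (average)
    aggregate over a stream; a median would need all past records."""

    def __init__(self):
        self.maximum = None
        self.total = 0
        self.count = 0

    def update(self, shares):
        if self.maximum is None or shares > self.maximum:
            self.maximum = shares
        self.total += shares
        self.count += 1

    @property
    def average(self):
        return self.total / self.count

stats = RunningStats()
for shares in [100, 250, 80, 400, 170]:   # hypothetical trade sizes
    stats.update(shares)
```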

6. Conclusion

Discovering patterns in data streams involves efficiently summarizing the data into
manageable synopses, performing approximate computations, and leveraging incremental
algorithms. Though challenges exist, especially with holistic aggregates, modern algorithms
allow for effective pattern discovery in real-time data streams.

Time-Series Data and Pattern Discovery

Time-Series Data: A time-series is a sequence of measurements taken at different time points, commonly used in various fields like stock prices, weather, and health parameters (e.g., blood pressure). These measurements are typically recorded at regular intervals such as seconds, hours, or days.

Data Model:
A time-series can be represented as pairs of <time, measurement>.

The data model can be generalized to include complex measurements (multiple parameters), data streams (continuous flow), and sequence databases (ordered records without explicit time).

Visualizing time-series data is often done graphically, with time on the x-axis and measured
values on the y-axis.

Pattern Discovery in Time-Series: There are three main tasks involved in pattern discovery
within time-series data:

1. Trend Analysis:

Long-term trends: Overall movement of the time-series (e.g., upward or downward trends),
often detected by averaging nearby measurements.

Cycles: Repetitive behaviors identified using statistical correlation, such as recurring patterns
at fixed intervals.

Seasonal Movements: Periodic fluctuations based on seasons or events (e.g., increased sales during festivals).

Outliers: Measurements significantly different from the norm, indicating rare or significant
events.
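Detecting a long-term trend by averaging nearby measurements can be sketched with a simple moving average (the window size is an arbitrary choice here):

```python
def moving_average(series, w=3):
    """Smooth a time-series by averaging each window of w
    consecutive measurements."""
    return [sum(series[i:i + w]) / w for i in range(len(series) - w + 1)]

# Noisy but upward-trending measurements (made-up values).
series = [10, 12, 11, 14, 13, 16, 15, 18]
trend = moving_average(series, w=3)

# The smoothed series rises monotonically even though the raw one dips.
rising = all(a <= b for a, b in zip(trend, trend[1:]))
```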

2. Prediction:

Curve-fitting: Statistical techniques like regression used to predict future values, though this
can be inaccurate due to noise.

Auto-regression: A method where future values are predicted based on a linear combination
of previous measurements. This technique uses autoregressive parameters computed from
the data.

3. Similarity Search:

Important for applications like clustering or comparing time-series, such as stock prices over
time.

Sequences may have different lengths, missing values, or outliers, making similarity search
challenging.
Similarity is determined by how many transformations (e.g., adding, deleting, or scaling
values) are needed to make one sequence match another. Techniques like multidimensional
index structures (e.g., R-trees) help efficiently find similar sequences.

Conclusion: Time-series data is widespread in various fields, and discovering patterns through trend analysis, prediction, and similarity search is crucial for understanding and forecasting behaviors. The ability to automatically detect trends, predict future events, and find similar sequences can provide valuable insights in many real-world applications.

Text and Web Data

Text and web data comprise unstructured or semi-structured information like books, articles,
emails, web pages, and XML documents. The key challenge in this domain is mining
patterns effectively.

1. Data Model:
Text documents are structured as sequences of characters, often organized into titles,
sections, and paragraphs. Web pages include links, providing additional context.
Techniques like hierarchical clustering can infer relationships between documents and
create pseudo-links.

2. Document Representation:
Bag-of-Words Model: Documents are represented as collections of words, optionally with
their frequencies.

Vector-Space Model: Words form dimensions of a high-dimensional space where documents are vectors, enabling linear algebra techniques.

Preprocessing:
Stop-Words Removal: Eliminates frequent but uninformative words like "the" or "and".
Stemming: Reduces words to their root forms (e.g., "learned" → "learn").
TF-IDF: Balances term frequency in a document against its rarity across the collection.
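A compact TF-IDF sketch over three toy documents (the stop-word list and corpus are illustrative):

```python
import math

STOP_WORDS = {"the", "a", "and", "of", "is"}   # tiny illustrative list

def tokenize(text):
    """Lowercase, split on whitespace, drop stop-words."""
    return [w for w in text.lower().split() if w not in STOP_WORDS]

docs = [
    "the cat sat on the mat",
    "the dog sat on the log",
    "cats and dogs",
]
bags = [tokenize(d) for d in docs]   # bag-of-words per document
n = len(bags)

def tf_idf(term, bag):
    tf = bag.count(term) / len(bag)    # frequency within the document
    df = sum(term in b for b in bags)  # documents containing the term
    return tf * math.log(n / df)       # rarer terms weigh more

w_cat = tf_idf("cat", bags[0])   # "cat" occurs in 1 of 3 documents
w_sat = tf_idf("sat", bags[0])   # "sat" occurs in 2 of 3 documents
```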

3. Pattern Discovery Tasks:

1. Search:
- Basic keyword search retrieves matching documents.
- Ranking: based on keyword relevance or web page link structures (e.g., PageRank algorithm).
- Efficiency: inverted indexes organize document data for quick retrieval.
2. Frequent Patterns:
Extract frequent itemsets or sequences from documents.
Useful for query characterization, user profiling, and refining search results.

3. Classification:
Categorizes documents into predefined classes using features like keywords, authors, or
metadata.
Popular methods: Naïve Bayes, Support Vector Machines, etc.
Applications: Spam detection, email routing, topic categorization.

4. Clustering:
Groups similar documents into clusters, often for hierarchical classification.
Measures like cosine similarity evaluate document relationships.
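Cosine similarity over term-frequency vectors takes only a few lines (the vectors below are hypothetical counts over a shared vocabulary):

```python
import math

def cosine(u, v):
    """Cosine similarity between two term-frequency vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

doc_a = [2, 1, 0, 1]   # same word distribution as doc_b
doc_b = [2, 1, 0, 1]
doc_c = [0, 0, 3, 1]   # mostly different terms

sim_ab = cosine(doc_a, doc_b)   # 1.0: identical direction
sim_ac = cosine(doc_a, doc_c)   # near 0: little overlap
```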

5. Summarization:
Automates extraction of key sentences for a concise summary.
Heuristics include sentence position and keyword relevance.

6. Theme Evolution and Hot-Topic Mining:
- Tracks thematic changes over time (e.g., news trends, research focus).
- Identifies current and emerging topics of interest.

7. Document Understanding:
- Uses computational linguistics (e.g., WordNet) for semantic enrichment.
- Enhances statistical features by adding synonyms, categories, and sentence meanings.

Conclusion:
Text and web mining techniques enable efficient handling of vast unstructured data. By
leveraging models like TF-IDF, clustering, and classification, and integrating linguistic
insights, these techniques address challenges in organization, retrieval, and understanding.

Multimedia Data Summary

Definition and Importance:
Multimedia refers to information combining various media types such as images, audio, and video. It is increasingly generated due to advancements in digital technologies. Mining multimedia data involves identifying patterns and insights to handle its complexity and volume.

5.9.1 Data Model


Structure: Multimedia documents may contain homogeneous (e.g., only images) or
heterogeneous (e.g., images, audio, and video) content. These can be tagged with metadata
like author, date, or topic.
Integration with Text: Multimedia data may exist within web documents, described by contextual text.

Feature Extraction: Features are extracted and stored numerically or categorically for tasks like classification, clustering, and frequent pattern mining.

Time-Series Representation: Audio and video are modeled as time-series data, enabling mining of interesting subsequences.

5.9.2 Pattern Discovery


Pattern discovery in multimedia parallels text and web data mining but adapts for time-series
properties.

1. Multimedia Search:
- Users query with example documents.
- If textual metadata exists, search resembles text-based techniques.
- Without metadata, similarity relies on weighted Euclidean distances between extracted
features.
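A sketch of feature-based similarity search using a weighted Euclidean distance; the features, weights, and clip names are all hypothetical:

```python
import math

def weighted_euclidean(u, v, w):
    """Weighted Euclidean distance between two feature vectors."""
    return math.sqrt(sum(wi * (a - b) ** 2 for wi, a, b in zip(w, u, v)))

# Hypothetical extracted features: (audio volume, colour, shot rate).
library = {
    "clip-news": (0.40, 0.20, 0.10),
    "clip-ad":   (0.90, 0.80, 0.90),
    "clip-film": (0.50, 0.30, 0.20),
}
weights = (1.0, 2.0, 1.0)        # colour judged more discriminative
query = (0.42, 0.22, 0.12)       # features of the user's example document

# Rank library clips by distance to the query's features.
ranked = sorted(library, key=lambda k: weighted_euclidean(library[k], query, weights))
```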

2. Classification:
- Multimedia is classified into predefined categories (e.g., movie genre, singer, actor).
- For video, features like audio volume, color histograms, and shot changes are analyzed
to differentiate ads, movies, or news.

3. Clustering:
- Multimedia clustering organizes data hierarchically, akin to text clustering.
- Frames in a video sequence can be clustered based on similarity to identify meaningful
shots.

Applications:
The discussed tasks are foundational for multimedia organization, retrieval, and analysis,
making them crucial areas of active research.
