Unit-5 DM
1. Definition: Relational data is organized as tables consisting of rows (tuples) and columns
(attributes). It is based on set theory and predicate logic.
2. RDBMS: Relational Database Management System implements the relational model for
storing, retrieving, and manipulating data.
3. Keys: Tables must have a key that uniquely identifies each row; the primary key is one key designated for this purpose.
4. Example:
A student table includes columns like roll_no, name, gender, state, and marks.
Schema: students(roll_no: integer, name: string, gender: boolean, maths: integer, ...)
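A minimal sketch of this schema in SQLite (the sample rows are made up, and column types are adapted to SQLite's types):

```python
import sqlite3

# Minimal sketch of the students relation from the example above.
conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE students (
        roll_no INTEGER PRIMARY KEY,   -- primary key uniquely identifies each tuple
        name    TEXT,
        gender  TEXT,
        state   TEXT,
        marks   INTEGER
    )
""")
conn.execute("INSERT INTO students VALUES (1, 'Asha', 'F', 'Telangana', 91)")
conn.execute("INSERT INTO students VALUES (2, 'Ravi', 'M', 'Kerala', 84)")

# A relational query: select tuples (rows) satisfying a predicate on an attribute (column).
for row in conn.execute("SELECT name, marks FROM students WHERE marks > 85"):
    print(row)
```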
1. Approaches:
One approach is to read the data out of the DBMS into the mining tool: fast, but it involves overhead in data transfer, especially for large datasets.
Conclusion:
The relational model is simple, efficient, and widely used. Among the four approaches to
pattern discovery, tightly-coupled vendor implementations are the most efficient, combining
high performance with minimal overhead.
Transactional Data Overview
Definition:
Transactional data consists of records where each record contains a set of items. Examples
include supermarket purchases or search engine queries. This is also called market-basket
data as it models item sets purchased by customers.
○ Example dataset:
| TID | Items Sold |
|-----|-----------------------------|
| 1 | tomato, potato, onion |
| 2 | tomato, potato, brinjal, pumpkin |
| 3 | tomato, potato, onion, chilly |
| 4 | lemon, tamarind, chilly |
Data Formats
1. Horizontal Item-List Format:
○ Most common format. Items are listed sequentially under their TIDs.
○ Example:
| TID | Items |
|-----|-------------------------|
| 1 | tomato, potato, onion |
2. Horizontal Vector Format:
○ Each transaction is stored as a binary vector over the full set of items, with 1 marking the items present.
Richer Attributes:
○ Deals with richer attributes like age, income (quantitative) or postal codes
(categorical).
○ Converts these into binary attributes for mining.
5. Hierarchical Rules:
Comparison of Formats
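To make the comparison concrete, here is a minimal sketch (in Python) that converts the example transactions from the horizontal item-list format into a binary item-vector format; this conversion is also what mining techniques expecting binary attributes rely on:

```python
# Horizontal item-list format: each TID maps to the list of items sold
# (the example transactions from the table above).
transactions = {
    1: ["tomato", "potato", "onion"],
    2: ["tomato", "potato", "brinjal", "pumpkin"],
    3: ["tomato", "potato", "onion", "chilly"],
    4: ["lemon", "tamarind", "chilly"],
}

# Universe of items, in a fixed order, so every transaction can be
# expressed as a binary vector of the same length.
items = sorted({item for itemset in transactions.values() for item in itemset})

# Horizontal vector format: one 0/1 vector per TID.
vectors = {
    tid: [1 if item in itemset else 0 for item in items]
    for tid, itemset in transactions.items()
}

print(items)
for tid, vec in vectors.items():
    print(tid, vec)
```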
The multi-dimensional data model is primarily used in data warehouses, which are
repositories of integrated data collected from multiple volatile sources. Unlike these sources,
a data warehouse is non-volatile and time-variant, accumulating data over several years.
---
Organization:
Data is structured as measures (quantitative data, e.g., sales) and dimensions (qualitative
attributes, e.g., product, location, time).
Hierarchies:
Defined over dimensions (e.g., product hierarchies like specific model → product type →
category).
Storage:
---
Drill-down: View data at a more detailed level (e.g., specific product level).
Slice and Dice: Analyze specific dimension values (e.g., sales for 2003 and 2005 only).
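A minimal sketch of these operations over a small in-memory fact table (the products, locations, and sales figures are made up for illustration):

```python
from collections import defaultdict

# Hypothetical fact table: (product, location, year) dimensions with a sales measure.
facts = [
    ("laptop", "Hyderabad", 2003, 120),
    ("laptop", "Chennai",   2003, 150),
    ("phone",  "Hyderabad", 2005, 300),
    ("phone",  "Chennai",   2005, 280),
]

# Slice/dice: restrict a dimension to chosen values (sales for 2003 and 2005 only).
sliced = [row for row in facts if row[2] in (2003, 2005)]

# Drill-down: view the measure at a more detailed level, e.g., per product per year.
detail = defaultdict(int)
for product, location, year, sales in sliced:
    detail[(product, year)] += sales

print(detail)
```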
---
Regression
Clustering
Preprocessing:
Conversion to binary attributes may be necessary for certain techniques like frequent itemset
mining.
The role of dimensions/measures must be defined (e.g., identifying class attributes for
classification).
---
Key Takeaways
Multi-dimensional data models integrate seamlessly with relational databases, supporting
various data mining techniques for actionable insights.
With technological advancements, datasets have grown in size and complexity, often
distributed geographically. For example, sales records of chain stores may reside in multiple
locations. To mine such datasets, there are two approaches:
1. Data Warehousing Approach: Collect and integrate data into a single repository.
2. Distributed Data Mining Approach: Devise algorithms to mine patterns or run queries at
each site separately, then combine results.
---
Distributed algorithms divide tasks into smaller parts, executed locally at each site.
Local patterns (smaller in size) are transmitted to the central server for global pattern
computation.
Advantages:
Reduces response time, even when data originates from a central server.
---
Aggregates are used as patterns, divided into three types based on ease of computation:
1. Distributive Aggregates: Computed by combining a single summary value from each site
(e.g., maximum).
2. Algebraic Aggregates: Computed using multiple pieces of information from each site (e.g.,
average).
3. Holistic Aggregates: Require full local data for computation (e.g., median).
---
Aggregate Functions:
1. Maximum Items Purchased: Compute local maximum at each site; send to the central
server for the global maximum.
2. Average Items Purchased: Send total items and total customers from each site; central
server calculates the global average.
3. Median Items Purchased: Local medians cannot determine the global median; full local
data must be transferred.
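A minimal sketch of how the central server could combine what each site sends, assuming each site reports only the summaries described above (the per-site numbers are hypothetical):

```python
# Distributive: each site sends only its local maximum.
local_maxima = [12, 9, 17]
global_max = max(local_maxima)

# Algebraic: each site sends (total items, total customers); the server
# combines them into the global average.
local_sums = [(1200, 300), (800, 250), (1500, 400)]
total_items = sum(items for items, _ in local_sums)
total_customers = sum(customers for _, customers in local_sums)
global_avg = total_items / total_customers

# Holistic: the median cannot be combined from local medians; the server
# needs the full local data from every site.
local_data = [[1, 4, 7], [2, 2, 9], [3, 5, 6, 8]]
all_values = sorted(v for site in local_data for v in site)
n = len(all_values)
global_median = (all_values[(n - 1) // 2] + all_values[n // 2]) / 2

print(global_max, round(global_avg, 2), global_median)
```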
---
Local frequent itemsets are computed and combined globally using the Partition Algorithm.
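A minimal sketch of the idea behind this combination (the itemsets and counts are hypothetical): any itemset that is frequent globally must be locally frequent at some site, so the union of the local results is a complete candidate set, whose exact global counts are then gathered in a second round.

```python
# Hypothetical locally frequent itemsets reported by each site.
local_frequents = [
    {frozenset({"tomato", "potato"}), frozenset({"onion"})},      # site 1
    {frozenset({"tomato", "potato"}), frozenset({"chilly"})},     # site 2
]

# Union of local results: a complete candidate set for the global answer.
candidates = set().union(*local_frequents)

# Second round: each site counts every candidate over its local data and
# sends the counts; the server adds them and applies the global threshold.
local_counts = [
    {frozenset({"tomato", "potato"}): 40, frozenset({"onion"}): 25, frozenset({"chilly"}): 5},
    {frozenset({"tomato", "potato"}): 35, frozenset({"onion"}): 4,  frozenset({"chilly"}): 30},
]
min_support = 50
global_frequents = {
    c for c in candidates
    if sum(counts.get(c, 0) for counts in local_counts) >= min_support
}
print(global_frequents)
```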
Other Tasks:
---
Conclusion
Distributed data mining efficiently handles large and distributed datasets by leveraging
parallel processing and reducing data transfer. While some aggregates (like holistic
functions) pose challenges, most tasks can be accomplished with distributive or algebraic
functions, ensuring the feasibility of mining distributed data.
Spatial data refers to information where the location of objects in space holds significance. It
includes data types such as geographic maps, medical images, engineering drawings, and
architectural designs. A significant portion of real-world data has a spatial component,
making it essential to handle spatial data in specialized ways for pattern discovery.
---
Spatial Databases:
A spatial database supports spatial data types and queries. It enables efficient access to
spatial objects through spatial indexes and allows the use of spatial predicates (e.g., near,
adjacent, inside) in selection and join queries.
Spatial Data Types:
1. Points: Spatial objects with a location but no extent (e.g., cities on a small-scale map).
2. Lines: Spatial objects with length but negligible width (e.g., roads, rivers).
3. Regions: Spatial objects with an area that may have holes or consist of multiple disjoint
parts (e.g., lakes, countries).
Spatial Predicates:
Predicates allow the comparison of spatial objects and include topological relations (e.g., adjacent, inside), distance relations (e.g., near), and direction relations (e.g., north of).
---
2. Spatial Queries
Selection Queries:
Used to select spatial objects that satisfy a spatial predicate (e.g., all objects inside a given region).
Join Queries:
Link spatial objects from different tables based on spatial relationships (e.g., pairing each object with the objects near or adjacent to it).
Spatial Index:
A spatial index is used to enable efficient retrieval of spatial objects within a specified
bounding region. This reduces the need to retrieve all objects and improves query
performance.
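As a minimal sketch (not an actual spatial index, just the bounding-box test such an index accelerates), a selection query over a bounding region could look like this; the object names and coordinates are hypothetical:

```python
# Each spatial object is approximated by its bounding box (xmin, ymin, xmax, ymax).
objects = {
    "lake_a": (2, 2, 4, 3),
    "park_b": (10, 10, 12, 14),
    "plot_c": (3, 1, 5, 2),
}

def inside(box, region):
    """True if the object's bounding box lies inside the query region."""
    xmin, ymin, xmax, ymax = box
    rxmin, rymin, rxmax, rymax = region
    return rxmin <= xmin and rymin <= ymin and xmax <= rxmax and ymax <= rymax

# Selection query with a spatial predicate: objects inside a bounding region.
query_region = (0, 0, 6, 5)
print([name for name, box in objects.items() if inside(box, query_region)])
```

A spatial index (such as an R-tree) avoids testing every object by grouping nearby bounding boxes, so only candidate objects near the query region are examined.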
---
Location Prediction:
Predicting the location of objects (e.g., species habitats) based on known data. An example
could be discovering that red-winged blackbirds live in humid and cold wetlands.
Spatial Outliers:
Identifying objects that significantly differ from their neighboring objects. For instance, a
newly constructed building in an older area of a city could be considered a spatial outlier.
---
Conclusion
Spatial data mining involves efficiently querying and analyzing large spatial datasets, often
through the use of spatial indexes and predicates. By augmenting spatial databases with
additional attributes, similar techniques to relational databases can be applied to discover
meaningful patterns, such as location predictions and spatial outliers. Moreover, OLAP
queries can reveal spatial hierarchies, enabling deeper insights into spatial data.
A data stream is a continuous flow of data records, which can include web logs, stock prices,
network traffic, phone conversations, ATM transactions, and sensor data. The challenge in
working with data streams is that the data is too large and arrives too quickly to be stored
entirely, requiring efficient methods for processing and pattern discovery.
Synopses (or sketches) are used to store small summaries of the stream due to limited
storage. Pattern discovery is based on these synopses, not on the entire data.
Approximate Results: Exact pattern discovery is not always feasible, but approximate results
are acceptable in most scenarios, with some error margin (ɛ, δ) in the computed patterns.
Continuous Queries: These are queries that execute continuously as the data updates, such
as tracking running totals or frequent itemsets.
Recent Patterns: In many applications, recent data records are more important than older
ones, and algorithms must focus on finding emerging patterns over recent data, rather than
the entire stream.
2. Kinds of Synopses
Random Sampling: A small set of randomly chosen records from the stream serves as a
sample. Algorithms like reservoir sampling are used when the size of the stream is unknown
in advance (see the sketch after this list).
Sliding Windows: Only the most recent w records are considered, where w is the window
size, useful when only recent records are relevant.
Wavelets: A transform applied to data that reduces the data’s size while retaining important
patterns. It can be used to summarize a stream efficiently.
Snapshots and Time Frames: To capture recent patterns, snapshots of the stream are stored
over different time frames. These time frames may follow natural, logarithmic, or pyramidal
designs, giving more weight to recent data.
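A minimal sketch of two of these synopses, a reservoir sample and a sliding window (the stream values are hypothetical):

```python
import random
from collections import deque

def reservoir_sample(stream, k):
    """Keep a uniform random sample of k records from a stream of unknown length."""
    sample = []
    for i, record in enumerate(stream):
        if i < k:
            sample.append(record)
        else:
            # Replace an existing element with probability k / (i + 1).
            j = random.randint(0, i)
            if j < k:
                sample[j] = record
    return sample

# Sliding window: keep only the most recent w records.
w = 3
window = deque(maxlen=w)
for record in [5, 1, 8, 2, 9, 4]:
    window.append(record)

print(reservoir_sample(range(1000), 10))
print(list(window))   # only the last w records are retained
```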
Pattern Discovery as Aggregates: Most pattern discovery tasks can be treated as computing
distributive or algebraic aggregates (e.g., maximum, average) over the stream.
Frequent Itemset Mining: Using techniques like the Partition algorithm, local frequent
itemsets from partitions of the stream are combined to estimate global patterns.
Classification and Clustering: Techniques like bagging and boosting can be applied in
classification by learning classifiers for each data chunk, and clustering can be performed
using efficient algorithms like Birch that work well with single-pass data.
Handling holistic aggregates (e.g., median), which require the entire dataset, is difficult in
data streams. These aggregates require special algorithms to be computed effectively.
Data streams may require incremental algorithms that update pattern discovery results as
new data arrives.
An analyst tracking stock market data may want to compute the maximum, average, and
median number of shares bought at a time. Here's how these aggregates can be computed:
Maximum (Distributive Aggregate): Store the current maximum share count. When new data
arrives, update if the new value is larger.
Average (Algebraic Aggregate): Maintain a running total of shares and the count of records.
Compute the average by dividing the total by the count.
Median (Holistic Aggregate): This is harder to compute in data streams, as maintaining the
current median requires storing all past records.
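A minimal sketch of the incremental updates described above, assuming the stream delivers one share count at a time (the values are hypothetical):

```python
# Hypothetical stream of share counts.
stream = [120, 75, 300, 50, 410, 90]

running_max = float("-inf")   # distributive: one stored value suffices
running_sum = 0               # algebraic: (sum, count) suffice for the average
count = 0

for shares in stream:
    running_max = max(running_max, shares)
    running_sum += shares
    count += 1

print(running_max, running_sum / count)
# The median would need all past values (or a special-purpose sketch),
# which is why it is hard to maintain over a stream.
```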
6. Conclusion
Discovering patterns in data streams involves efficiently summarizing the data into
manageable synopses, performing approximate computations, and leveraging incremental
algorithms. Though challenges exist, especially with holistic aggregates, modern algorithms
allow for effective pattern discovery in real-time data streams.
Time-series data consists of measurements recorded over time, such as stock prices or sensor readings.
Data Model:
A time-series can be represented as pairs of <time, measurement>.
Visualizing time-series data is often done graphically, with time on the x-axis and measured
values on the y-axis.
Pattern Discovery in Time-Series: There are three main tasks involved in pattern discovery
within time-series data:
1. Trend Analysis:
Long-term trends: Overall movement of the time-series (e.g., upward or downward trends),
often detected by averaging nearby measurements.
Cycles: Repetitive behaviors identified using statistical correlation, such as recurring patterns
at fixed intervals.
Outliers: Measurements significantly different from the norm, indicating rare or significant
events.
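A minimal sketch of detecting the long-term trend by averaging nearby measurements (a simple moving average; the series values are hypothetical):

```python
def moving_average(series, window):
    """Smooth a time-series by averaging each value with its neighbours."""
    smoothed = []
    for i in range(len(series) - window + 1):
        smoothed.append(sum(series[i:i + window]) / window)
    return smoothed

series = [10, 12, 9, 14, 15, 13, 18, 17, 21]
print(moving_average(series, window=3))   # the upward long-term movement becomes clearer
```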
2. Prediction:
Curve-fitting: Statistical techniques like regression used to predict future values, though this
can be inaccurate due to noise.
Auto-regression: A method where future values are predicted based on a linear combination
of previous measurements. This technique uses autoregressive parameters computed from
the data.
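A minimal sketch of auto-regression, estimating the autoregressive parameters by least squares so that each value is predicted from a linear combination of the previous p measurements (the series is hypothetical; numpy is assumed to be available, and no intercept term is fitted):

```python
import numpy as np

def fit_ar(series, p):
    """Least-squares estimate of AR(p) parameters from past measurements."""
    X = np.array([series[i:i + p] for i in range(len(series) - p)])
    y = np.array(series[p:])
    params, *_ = np.linalg.lstsq(X, y, rcond=None)
    return params

def predict_next(series, params):
    """Predict the next value as a linear combination of the last p values."""
    p = len(params)
    return float(np.dot(series[-p:], params))

series = [10.0, 12.0, 11.0, 13.0, 12.5, 14.0, 13.5, 15.0]
params = fit_ar(series, p=2)
print(predict_next(series, params))
```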
3. Similarity Search:
Important for applications like clustering or comparing time-series, such as stock prices over
time.
Sequences may have different lengths, missing values, or outliers, making similarity search
challenging.
Similarity is determined by how many transformations (e.g., adding, deleting, or scaling
values) are needed to make one sequence match another. Techniques like multidimensional
index structures (e.g., R-trees) help efficiently find similar sequences.
Text and web data comprise unstructured or semi-structured information like books, articles,
emails, web pages, and XML documents. The key challenge in this domain is mining
patterns effectively.
1. Data Model:
Text documents are structured as sequences of characters, often organized into titles,
sections, and paragraphs. Web pages include links, providing additional context.
Techniques like hierarchical clustering can infer relationships between documents and
create pseudo-links.
2. Document Representation:
Bag-of-Words Model: Documents are represented as collections of words, optionally with
their frequencies.
Preprocessing:
Stop-Words Removal: Eliminates frequent but uninformative words like "the" or "and".
Stemming: Reduces words to their root forms (e.g., "learned" → "learn").
TF-IDF: Balances term frequency in a document against its rarity across the collection.
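A minimal sketch of the bag-of-words representation with one common TF-IDF weighting (stop-word removal and stemming are omitted; the documents are made up):

```python
import math
from collections import Counter

docs = [
    "data mining finds patterns in data",
    "web mining analyses web pages and links",
    "students study data mining and databases",
]

# Bag-of-words: each document becomes a multiset of its words.
bags = [Counter(doc.split()) for doc in docs]
vocab = sorted(set(word for bag in bags for word in bag))

def tf_idf(word, bag, bags):
    tf = bag[word] / sum(bag.values())        # frequency of the term within the document
    df = sum(1 for b in bags if word in b)    # number of documents containing the term
    idf = math.log(len(bags) / df)            # rarer terms get higher weight
    return tf * idf

vectors = [[tf_idf(w, bag, bags) for w in vocab] for bag in bags]
print(vocab)
print([round(x, 3) for x in vectors[0]])
```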
3. Classification:
Categorizes documents into predefined classes using features like keywords, authors, or
metadata.
Popular methods: Naïve Bayes, Support Vector Machines, etc.
Applications: Spam detection, email routing, topic categorization.
4. Clustering:
Groups similar documents into clusters, often for hierarchical classification.
Measures like cosine similarity evaluate document relationships.
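A minimal sketch of cosine similarity between two document vectors (for instance, TF-IDF vectors like those in the earlier sketch):

```python
import math

def cosine_similarity(u, v):
    """Cosine of the angle between two document vectors; 1.0 means identical direction."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)

print(cosine_similarity([1.0, 0.0, 2.0], [0.5, 0.0, 1.0]))   # 1.0: same direction
print(cosine_similarity([1.0, 0.0, 0.0], [0.0, 1.0, 0.0]))   # 0.0: no shared terms
```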
5. Summarization:
Automates extraction of key sentences for a concise summary.
Heuristics include sentence position and keyword relevance.
7. Document Understanding:
- Uses computational linguistics (e.g., WordNet) for semantic enrichment.
- Enhances statistical features by adding synonyms, categories, and sentence meanings.
Conclusion:
Text and web mining techniques enable efficient handling of vast unstructured data. By
leveraging models like TF-IDF, clustering, and classification, and integrating linguistic
insights, these techniques address challenges in organization, retrieval, and understanding.
Multimedia data includes images, audio, and video.
Feature Extraction: Features are extracted and stored numerically or categorically for tasks
like classification, clustering, and frequent pattern mining.
1. Multimedia Search:
- Users query with example documents.
- If textual metadata exists, search resembles text-based techniques.
- Without metadata, similarity relies on weighted Euclidean distances between extracted
features.
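A minimal sketch of similarity search without metadata, ranking stored items by weighted Euclidean distance between extracted feature vectors (the feature values and weights are hypothetical):

```python
import math

def weighted_euclidean(u, v, weights):
    """Distance between two feature vectors, with per-feature importance weights."""
    return math.sqrt(sum(w * (a - b) ** 2 for a, b, w in zip(u, v, weights)))

# Hypothetical extracted features, e.g., (average colour, brightness, edge density).
database = {
    "image_1": [0.2, 0.7, 0.1],
    "image_2": [0.8, 0.3, 0.5],
    "image_3": [0.25, 0.65, 0.15],
}
query = [0.22, 0.68, 0.12]          # features of the user's example document
weights = [1.0, 0.5, 2.0]

ranked = sorted(database, key=lambda name: weighted_euclidean(database[name], query, weights))
print(ranked)   # most similar items first
```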
2. Classification:
- Multimedia is classified into predefined categories (e.g., movie genre, singer, actor).
- For video, features like audio volume, color histograms, and shot changes are analyzed
to differentiate ads, movies, or news.
3. Clustering:
- Multimedia clustering organizes data hierarchically, akin to text clustering.
- Frames in a video sequence can be clustered based on similarity to identify meaningful
shots.
Applications:
The discussed tasks are foundational for multimedia organization, retrieval, and analysis,
making them crucial areas of active research.