
A Distributed File System (DFS) is a networked storage solution that allows multiple computers to access and manage files as if they were stored locally, even though the data may be distributed across different physical locations. This architecture is fundamental for scalable, fault-tolerant, and high-performance storage in modern computing environments.

🔍 Core Features of Distributed File Systems

 Transparency: Users interact with files without needing to know their physical locations, providing location and access transparency.
 Scalability: DFS can handle increasing workloads by adding more nodes to the network, accommodating more data and users without significant performance degradation.
 Fault Tolerance: Data is replicated across multiple nodes, ensuring system functionality even if some nodes fail. Techniques like data replication and erasure coding are used to ensure data availability and durability.
 Concurrency: Supports multiple users or applications accessing and modifying files simultaneously, with mechanisms to manage concurrent access and ensure data consistency.
 Data Encryption: Incorporates robust encryption protocols to bolster data security, protecting sensitive information from unauthorized interception and tampering.

🧩 Popular Distributed File Systems

1. LizardFS

 An open-source, POSIX-compliant DFS that offers scalability and fault tolerance.


 Supports data replication, snapshots, and geo-replication.
 Compatible with Linux, FreeBSD, macOS, and Solaris.
 Provides native Windows client support and integration with Hadoop.
Wikipedia+1Wikipedia+1

2. Lustre

 A high-performance DFS used in large-scale computing environments.
 Employed in top supercomputers worldwide, including Frontier and Fugaku.
 Scalable to handle petabytes of data and high throughput.

3. MooseFS

 An open-source DFS designed for fault tolerance and high availability.
 Utilizes chunk servers for data storage and metadata servers for file management.
 Supports replication and erasure coding for data protection.
4. Google Colossus

 The successor to Google File System (GFS), designed to address scalability and reliability issues.
 Eliminates single points of failure by incorporating multiple master nodes.
 Supports real-time operations and integration with Google's search infrastructure.

5. InterPlanetary File System (IPFS)

 A decentralized, peer-to-peer DFS that uses content-addressing for file identification.
 Enables efficient and reliable data distribution across a global network.
 Widely used in decentralized applications and blockchain ecosystems.

🧩 Use Cases

 Cloud Storage: Services like Google Drive and Dropbox utilize DFS to provide scalable and reliable file storage solutions.
 Big Data Processing: Frameworks like Hadoop and Spark rely on DFS for storing and processing large datasets across distributed clusters.
 High-Performance Computing: Supercomputers use DFS to manage vast amounts of data with high throughput and low latency.
 Decentralized Applications: IPFS is used for building decentralized web applications and content distribution networks.

MapReduce is a programming model and processing technique that enables the distributed processing of large datasets across a cluster of computers. It is particularly effective in data mining tasks where scalability and efficiency are paramount.

🔄 How MapReduce Works

MapReduce operates through three main phases:

1. Map Phase: Input data is divided into chunks, and a mapper function processes each chunk to produce a set of intermediate key-value pairs.
2. Shuffle and Sort Phase: The intermediate key-value pairs are grouped by key, and the values associated with each key are sorted. This phase ensures that all values for a particular key are brought together.
3. Reduce Phase: A reducer function processes each group of key-value pairs to produce the final output.

This model allows for parallel processing, making it suitable for handling large-scale data mining tasks.
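
To make the three phases concrete, here is a minimal word-count sketch in plain Python that simulates the map, shuffle-and-sort, and reduce steps on a small in-memory collection (an illustration, not a Hadoop job):

from collections import defaultdict

def map_phase(document):
    # Emit (word, 1) for every word in the document chunk
    return [(word.lower(), 1) for word in document.split()]

def shuffle_and_sort(mapped_pairs):
    # Group all values by key, mimicking the framework's shuffle step
    groups = defaultdict(list)
    for key, value in mapped_pairs:
        groups[key].append(value)
    return dict(sorted(groups.items()))

def reduce_phase(key, values):
    # Aggregate the grouped values for one key
    return key, sum(values)

documents = ["the quick brown fox", "the lazy dog", "the fox"]
mapped = [pair for doc in documents for pair in map_phase(doc)]
grouped = shuffle_and_sort(mapped)
counts = dict(reduce_phase(k, v) for k, v in grouped.items())
print(counts)   # {'brown': 1, 'dog': 1, 'fox': 2, 'lazy': 1, 'quick': 1, 'the': 3}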

🧩 MapReduce in Data Mining

MapReduce is widely used in data mining for tasks such as:

 Frequent Itemset Mining: Identifying sets of items that frequently co-occur in transactions.
 Classification: Assigning items to predefined categories.
 Clustering: Grouping similar items together.
 Regression Analysis: Predicting numerical values based on input data.

For instance, in frequent itemset mining, MapReduce can efficiently process large transaction datasets to identify frequently occurring itemsets. By distributing the computation across multiple nodes, MapReduce accelerates the mining process, making it feasible to analyze massive datasets.

🛠️ Example: Weather Data Analysis

A practical example of MapReduce in data mining is analyzing weather data to determine hot and cold days. In this scenario, the mapper function processes weather data records to extract relevant information, while the reducer function aggregates the results to identify patterns such as the number of hot or cold days within a given period. This approach leverages the parallel processing capabilities of MapReduce to handle large volumes of weather data efficiently.
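
A minimal Python sketch of this idea, assuming each record is a (date, maximum temperature) pair and hypothetical thresholds of 35 °C for a hot day and 10 °C for a cold day:

from collections import defaultdict

HOT_THRESHOLD = 35    # assumed threshold for a "hot" day (degrees Celsius)
COLD_THRESHOLD = 10   # assumed threshold for a "cold" day

def weather_mapper(record):
    # Emit a (label, 1) pair for days that qualify as hot or cold
    date, max_temp = record
    if max_temp >= HOT_THRESHOLD:
        yield ("hot", 1)
    elif max_temp <= COLD_THRESHOLD:
        yield ("cold", 1)

def weather_reducer(label, counts):
    # Sum the occurrences of each label
    return label, sum(counts)

records = [("2024-06-01", 38), ("2024-06-02", 9), ("2024-06-03", 36), ("2024-06-04", 22)]
grouped = defaultdict(list)
for record in records:
    for label, one in weather_mapper(record):
        grouped[label].append(one)
print(dict(weather_reducer(k, v) for k, v in grouped.items()))   # {'hot': 2, 'cold': 1}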

⚙️ Tools and Frameworks

Several tools and frameworks have been developed to implement MapReduce:

 Apache Hadoop: An open-source framework that provides a distributed storage and processing platform for big data applications.
 Apache Pig: A high-level platform for creating programs that run on Hadoop, using a language called Pig Latin. Pig abstracts the complexity of writing MapReduce programs, making it easier to develop data analysis tasks.
 Apache Hive: A data warehouse infrastructure built on top of Hadoop that provides data summarization, querying, and analysis.

In summary, MapReduce is a powerful model for distributed data processing, enabling efficient data mining on large datasets. By leveraging the parallel processing capabilities of MapReduce, organizations can extract valuable insights from big data more effectively.


Algorithms Used in MapReduce

MapReduce is a powerful programming model for processing large-scale datasets in parallel across distributed systems. Several data mining algorithms have been adapted to leverage MapReduce's capabilities, enabling efficient analysis of massive datasets. Here are some notable algorithms and their applications:

1. Frequent Itemset Mining (Apriori Algorithm)

 Description: Identifies sets of items that frequently appear together in transactions.
 MapReduce Implementation: The Apriori algorithm has been adapted for MapReduce to handle large-scale data. Implementations like AprioriPMR (Power-set MapReduce) and AprioriS (Simple) utilize MapReduce to efficiently process and identify frequent itemsets in massive datasets.

2. K-Means Clustering

 Description: Partitions data into clusters based on similarity.
 MapReduce Implementation: K-Means clustering can be implemented using MapReduce by assigning each data point to the nearest centroid in the Map phase and recalculating centroids in the Reduce phase. This process is repeated iteratively until convergence.
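
A minimal Python sketch of a single K-Means iteration expressed in map/reduce style (points and centroids are small in-memory lists here; a real job would distribute them across nodes):

import math
from collections import defaultdict

def nearest_centroid(point, centroids):
    # Map step: assign the point to the index of its closest centroid
    distances = [math.dist(point, c) for c in centroids]
    return distances.index(min(distances))

def kmeans_iteration(points, centroids):
    # "Map": emit (centroid_index, point) pairs
    groups = defaultdict(list)
    for p in points:
        groups[nearest_centroid(p, centroids)].append(p)
    # "Reduce": recompute each centroid as the mean of its assigned points
    new_centroids = list(centroids)
    for idx, assigned in groups.items():
        new_centroids[idx] = tuple(sum(coord) / len(assigned) for coord in zip(*assigned))
    return new_centroids

points = [(1.0, 1.0), (1.5, 2.0), (8.0, 8.0), (9.0, 9.5)]
centroids = [(1.0, 1.0), (9.0, 9.0)]
print(kmeans_iteration(points, centroids))   # centroids move toward the two clusters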

3. PageRank

 Description: Measures the importance of web pages based on link structures.
 MapReduce Implementation: PageRank can be implemented using MapReduce by distributing the computation of page ranks across multiple nodes. Each MapReduce iteration updates the page ranks based on the contributions from linked pages.
4. Decision Tree Learning (C4.5 Algorithm)

 Description: Builds a decision tree for classification tasks.
 MapReduce Implementation: The C4.5 algorithm has been adapted for MapReduce by dividing the process into multiple MapReduce phases. The first phase counts attribute occurrences, the second calculates information gain, and the third constructs the decision tree.

5. Maximal Biclique Enumeration

 Description: Identifies maximal bicliques (complete bipartite subgraphs) in large graphs.
 MapReduce Implementation: A MapReduce-based algorithm clusters the input graph into smaller subgraphs, processes them in parallel, and minimizes redundancy and load imbalance to efficiently enumerate maximal bicliques.

6. K-Medoids Clustering

 Description: Similar to K-Means but uses actual data points as cluster centers.
 MapReduce Implementation: The K-Medoids++ algorithm has been adapted for MapReduce to efficiently cluster large spatial datasets by initializing medoids and assigning data points to the nearest medoid in parallel.

7. k-Nearest Neighbor (k-NN) Join

 Description: Finds the k nearest neighbors for each point in a dataset.
 MapReduce Implementation: MapReduce can process k-NN joins by clustering objects into groups and performing the k-NN join on each group separately, utilizing pruning rules to reduce computational costs.

8. Inverted Index

 Description: Creates an index mapping words to their locations in a set of documents.
 MapReduce Implementation: In the Map phase, each document is parsed, and key-value pairs (word, document ID) are emitted. In the Reduce phase, these pairs are aggregated to form the inverted index.
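
A small Python sketch of this map/reduce pipeline over an in-memory document collection (illustrative only):

from collections import defaultdict

def index_mapper(doc_id, text):
    # Map: emit (word, doc_id) for every word in the document
    return [(word.lower(), doc_id) for word in text.split()]

def build_inverted_index(documents):
    # Shuffle + Reduce: collect the document IDs observed for each word
    index = defaultdict(set)
    for doc_id, text in documents.items():
        for word, d in index_mapper(doc_id, text):
            index[word].add(d)
    return {word: sorted(ids) for word, ids in index.items()}

documents = {1: "big data mining", 2: "data stream mining", 3: "big data"}
print(build_inverted_index(documents))
# {'big': [1, 3], 'data': [1, 2, 3], 'mining': [1, 2], 'stream': [2]}
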
These algorithms demonstrate the versatility of MapReduce in handling various data mining
tasks, from clustering and classification to graph analysis and indexing. By leveraging the
parallel processing capabilities of MapReduce, these algorithms can efficiently process and
analyze large-scale datasets.


Efficiency of Cluster Computing Techniques

Cluster computing techniques have significantly enhanced computational efficiency by enabling parallel processing, fault tolerance, and scalability. These advancements are particularly evident in high-performance computing (HPC) environments, where the integration of specialized hardware and optimized algorithms plays a crucial role.

⚙️ Key Factors Influencing Cluster Computing Efficiency

1. Parallel Processing

By distributing tasks across multiple nodes, cluster computing accelerates data processing. Algorithms like K-Means clustering and hierarchical clustering benefit from parallelism, reducing computation time and handling large datasets more effectively. Caching strategies further enhance efficiency by storing intermediate results, minimizing redundant calculations, and expediting convergence.

2. Load Balancing

Efficient load balancing ensures that computational resources are utilized optimally. Dynamic load balancing algorithms adjust to the current state of the system, redistributing tasks to prevent bottlenecks and underutilization. This adaptability is crucial for maintaining performance in real-time applications.

3. Energy Efficiency

Energy consumption is a critical consideration in cluster computing. Techniques like energy-proportional computing aim to align power usage with workload demands, reducing energy waste. For instance, consolidating workloads onto fewer nodes during periods of low activity can minimize power consumption without compromising performance.

4. Fault Tolerance

Cluster systems are designed to handle node failures gracefully. Redundant configurations and distributed file systems ensure that data remains accessible and computations continue uninterrupted, enhancing the reliability of the system.

5. Scalability

Clusters can be expanded by adding more nodes, allowing systems to scale according to workload requirements. This scalability is vital for adapting to growing data volumes and increasing computational demands.

🔬 Real-World Applications

 Artificial Intelligence (AI) and Machine Learning: Clusters equipped with GPUs and
specialized accelerators like TPUs facilitate the training of complex models, significantly
reducing processing time.
 Scientific Simulations: HPC clusters enable simulations of physical phenomena, such as
climate modeling and molecular dynamics, which require substantial computational
resources.
 Big Data Analytics: Cluster computing supports the analysis of large datasets, extracting
valuable insights for decision-making in various industries.
📈 Performance Metrics

The efficiency of cluster computing can be assessed using metrics such as throughput, latency, and energy consumption. For example, integrating GPUs into supercomputing architectures has led to significant improvements in performance per watt, as demonstrated by the Green500 rankings.

In summary, cluster computing techniques have revolutionized computational efficiency by leveraging parallel processing, optimizing resource utilization, and ensuring system reliability. As technology continues to advance, these techniques will play an increasingly vital role in addressing the growing computational challenges across various domains.


Unit 2: Similar Items

Nearest Neighbor Search

Nearest Neighbor Search (NNS) is a fundamental operation in data science and machine learning, enabling tasks like classification, recommendation, and anomaly detection. Given the challenges posed by high-dimensional data, especially the "curse of dimensionality," Approximate Nearest Neighbor (ANN) search techniques have been developed to offer efficient solutions.

What Is Nearest Neighbor Search?

NNS involves finding the closest data point(s) to a given query point in a dataset, based on a
defined distance metric (e.g., Euclidean, cosine). While exact NNS guarantees the most accurate
results, it becomes computationally expensive as data size and dimensionality increase. ANN
methods trade off some accuracy for significant gains in speed and scalability.

⚙️ Key ANN Algorithms

1. Hierarchical Navigable Small World (HNSW)

 Overview: A graph-based approach that constructs a multi-layered navigable small world graph, facilitating efficient nearest neighbor searches.
 Strengths: High accuracy and scalability, particularly effective in high-dimensional spaces.
 Limitations: Memory-intensive and may require careful parameter tuning.
 Use Cases: Widely adopted in vector databases and search engines.

2. Approximate Nearest Neighbors Oh Yeah (ANNOY)

 Overview: Utilizes random projection trees to partition the data space, enabling efficient search through approximate nearest neighbors.
 Strengths: Scalable and works well with high-dimensional data.
 Limitations: May require significant memory for large datasets.
 Use Cases: Suitable for recommendation systems and large-scale search applications.

3. Inverted File Index (IVF)

 Overview: Partitions the dataset into clusters and performs searches within relevant clusters to find nearest neighbors.
 Strengths: Offers high search quality with reasonable speed.
 Limitations: Lower query throughput compared to graph-based methods.
 Use Cases: Effective in scenarios where data can be naturally partitioned into clusters.

4. Best Bin First (BBF)

 Overview: An algorithm designed to efficiently find an approximate solution to the nearest neighbor search problem in very-high-dimensional spaces.
 Strengths: Significant speedup in high-dimensional spaces.
 Limitations: Approximate results may not be suitable for all applications.
 Use Cases: Shape indexing and high-dimensional data retrieval.

🧩 Why Use Approximate Nearest Neighbor Search?

In high-dimensional spaces, traditional exact NNS methods become inefficient due to the curse of dimensionality, where the volume of the space increases exponentially with the number of dimensions. ANN techniques address this by employing strategies like dimensionality reduction, clustering, and hashing to limit the search scope, enabling real-time performance even with large datasets.

🔧 Practical Considerations

 Data Characteristics: Choose an ANN algorithm that aligns with your data's dimensionality and distribution.
 Performance Metrics: Balance between accuracy, speed, and memory usage based on application requirements.
 Implementation Tools: Libraries like FAISS, HNSWlib, and Annoy provide optimized implementations for various platforms.


Shingling of Documents

Shingling is a fundamental technique in text mining and natural language processing (NLP) that transforms documents into sets of contiguous subsequences, called shingles or n-grams. This method is pivotal for tasks like document similarity detection, plagiarism identification, and near-duplicate content detection.
🔍 What Is Shingling?

Shingling involves extracting all possible contiguous subsequences of length k from a document. These subsequences can be based on characters, words, or other tokens, depending on the application.

Examples:

 Character-based Shingling:
o Document: "abcde"
o k = 2:
 Shingles: {"ab", "bc", "cd", "de"}
 Word-based Shingling:
o Document: "The quick brown fox"
o k = 2:
 Shingles: {"The quick", "quick brown", "brown fox"}

The choice of k influences the granularity of the similarity measure. Smaller values of k capture finer details, while larger values focus on broader patterns.
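
For illustration, a short Python sketch of character-based and word-based k-shingling matching the examples above:

def char_shingles(text, k):
    # All contiguous character substrings of length k
    return {text[i:i + k] for i in range(len(text) - k + 1)}

def word_shingles(text, k):
    # All contiguous word sequences of length k
    words = text.split()
    return {" ".join(words[i:i + k]) for i in range(len(words) - k + 1)}

print(char_shingles("abcde", 2))                 # {'ab', 'bc', 'cd', 'de'}
print(word_shingles("The quick brown fox", 2))   # {'The quick', 'quick brown', 'brown fox'}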

🧩 Applications of Shingling

 Document Similarity Detection: By comparing the sets of shingles between two documents using similarity measures like Jaccard similarity, one can assess how similar the documents are.
 Plagiarism Detection: Identifying near-duplicate content by detecting overlapping shingles between documents.
 Near-Duplicate Detection: Efficiently finding documents that are nearly identical, even if they have slight variations.
 Text Clustering and Classification: Grouping similar documents together based on shared shingles.

⚙️ Steps in Shingling a Document

1. Preprocessing:
o Remove punctuation, convert to lowercase, and handle special characters to ensure uniformity.
o Tokenize the document into characters or words based on the chosen shingling method.
2. Generating Shingles:
o Extract contiguous substrings (shingles) of length k from the preprocessed document.
o Store these shingles as a set or list for further processing.
3. Hashing and Representation:
o Optionally, hash the shingles to represent them as numerical values.
o This helps in efficient storage and quick computation.

🔐 Advanced Techniques

To handle large datasets and improve efficiency, advanced techniques are employed:

 MinHashing: A method to estimate the Jaccard similarity between sets efficiently by using hash functions.
 Locality Sensitive Hashing (LSH): A technique that hashes similar input items into the same "buckets" with high probability, facilitating approximate nearest neighbor searches.
 Weighted Shingling: Assigns different importance weights to shingles based on factors like position, frequency, or information content, allowing the model to focus on the most relevant shingles.
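
As an illustration of MinHashing, here is a toy Python sketch that builds signatures from salted MD5 hashes and estimates Jaccard similarity from them (real systems would use a dedicated library and carefully chosen hash families):

import hashlib

def minhash_signature(shingles, num_hashes=50):
    # One minimum hash value per (salted) hash function
    signature = []
    for seed in range(num_hashes):
        signature.append(min(
            int(hashlib.md5(f"{seed}:{s}".encode()).hexdigest(), 16) for s in shingles
        ))
    return signature

def estimated_jaccard(sig_a, sig_b):
    # Fraction of positions where the signatures agree approximates Jaccard similarity
    return sum(a == b for a, b in zip(sig_a, sig_b)) / len(sig_a)

doc_a = {"the quick", "quick brown", "brown fox"}
doc_b = {"the quick", "quick brown", "brown dog"}
sig_a, sig_b = minhash_signature(doc_a), minhash_signature(doc_b)
print(round(estimated_jaccard(sig_a, sig_b), 2))   # roughly 0.5 (true Jaccard = 2/4)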

Similarity-Preserving Summaries

Similarity-preserving summaries in text summarization aim to generate concise summaries that accurately reflect the semantic content of the original text. This approach is crucial for applications like information retrieval, content recommendation, and document clustering, where maintaining the core meaning is essential.

🧩 Key Techniques for Semantic Preservation

1. Siamese Generative Adversarial Networks (SSPGAN)

SSPGAN employs a dual-discriminator setup to ensure that generated summaries maintain semantic fidelity. The generator produces summaries, while the Siamese discriminator evaluates both the summary's quality and its semantic alignment with the source text. This adversarial training enhances the model's ability to produce summaries that are both fluent and semantically consistent.

2. Semantic Extractor-Paraphraser Models

These models focus on extracting key semantic units from the source text and then paraphrasing them to generate summaries. By emphasizing semantic overlap rather than syntactic similarity, these models produce summaries that preserve the original meaning more effectively. They have shown superior performance in terms of ROUGE, METEOR, and word mover similarity metrics.

3. Semantic Content Generalization

This approach involves transforming the source text into a generalized form that captures its core meaning, which is then used to generate a summary. By focusing on semantic content rather than surface-level details, this method ensures that the summary retains the essential information of the original text. It has been shown to improve the handling of out-of-vocabulary or rare words.

4. Semantic Similarity Metrics (SSAS)

The SSAS metric evaluates the semantic similarity between system-generated summaries and human-written references. By incorporating natural language inference and paraphrasing techniques, SSAS provides a more accurate assessment of how well a summary preserves the original meaning. This metric helps in fine-tuning models to produce semantically faithful summaries.
🔧 Practical Applications

 Information Retrieval: Ensures that search results are summarized in a way that accurately reflects the original content, improving relevance.
 Content Recommendation: Provides summaries that capture the essence of items, aiding in better recommendation accuracy.
 Document Clustering: Facilitates grouping of similar documents by maintaining semantic consistency in summaries.
 Legal and Medical Text Analysis: Critical for generating summaries that preserve the precise meaning of complex documents.

In summary, similarity-preserving summaries are vital for applications where maintaining the
core meaning of the original text is essential. By employing advanced techniques like SSPGAN,
semantic extractor-paraphraser models, semantic content generalization, and semantic similarity
metrics, it's possible to generate summaries that are both concise and semantically faithful.

LOCALITY SENSITIVE HASHING FOR DOCUMENTS

Locality Sensitive Hashing (LSH) is a technique used to efficiently find similar documents in a large corpus by grouping them into buckets based on similarity, reducing the need for pairwise comparisons. It works by hashing documents into buckets such that similar documents are more likely to hash into the same bucket. This approach significantly reduces the search space, making it feasible to find approximate nearest neighbors or near-duplicate documents.

Here's a more detailed explanation:

1. Similarity Representation:
Documents are first represented as vectors or sets of features that reflect their similarity. For example, a document might be represented by a set of its unique words (shingles), or by a vector of word frequencies.
2. Hashing:
A hash function is used to map these similarity representations to "buckets" or hash values. LSH is "locality-sensitive" because similar items are more likely to hash to the same bucket than dissimilar items.
3. Bucket Comparison:
Only the documents that hash to the same bucket are compared for similarity. This significantly reduces the number of pairwise comparisons needed, especially in large datasets.

Steps in LSH for Document Similarity:

1. Shingling:
Text documents are converted into sets of shingles (short sequences of words or characters).
2. MinHashing:
Shingles are then converted into short signatures (often called minhashes) that represent the document while preserving similarity.
3. Banding:
Minhash signatures are divided into bands, which are then hashed individually to map them to buckets (see the sketch after this list).
4. Bucket Comparison:
Documents that hash to the same buckets are compared for similarity.
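
A minimal Python sketch of the banding step, assuming MinHash signatures have already been computed; documents that share a bucket in at least one band become candidate pairs:

from collections import defaultdict
from itertools import combinations

def lsh_candidate_pairs(signatures, bands, rows_per_band):
    # signatures: {doc_id: [minhash values]}, with len(signature) == bands * rows_per_band
    candidates = set()
    for b in range(bands):
        buckets = defaultdict(list)
        for doc_id, sig in signatures.items():
            band_slice = tuple(sig[b * rows_per_band:(b + 1) * rows_per_band])
            buckets[hash(band_slice)].append(doc_id)   # hash the band to a bucket
        for docs in buckets.values():
            candidates.update(combinations(sorted(docs), 2))
    return candidates

signatures = {
    "doc1": [3, 7, 1, 9, 4, 2],
    "doc2": [3, 7, 1, 8, 4, 2],   # shares its first band with doc1
    "doc3": [5, 0, 6, 2, 8, 7],
}
print(lsh_candidate_pairs(signatures, bands=2, rows_per_band=3))   # {('doc1', 'doc2')}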

Benefits of LSH:

 Efficiency: LSH significantly reduces the computational cost of similarity searches, especially in large datasets.
 Scalability: LSH can be used to search through very large collections of documents without needing to compute all possible pairwise comparisons.
 Approximate Similarity: LSH can efficiently find documents that are similar, even if they are not exact duplicates.
 Applications: LSH is used in various applications, including near-duplicate detection, nearest neighbor search, and clustering.
DISTANCE MEASURES

Measures of Distance in Data Mining

Clustering consists of grouping objects that are similar to each other; it can be used to decide whether two items are similar or dissimilar in their properties. In a data mining sense, the similarity measure is a distance with dimensions describing object features. That means if the distance between two data points is small, then there is a high degree of similarity among the objects, and vice versa. Similarity is subjective and depends heavily on the context and application. For example, similarity among vegetables can be determined from their taste, size, colour, etc. Most clustering approaches use distance measures to assess the similarities or differences between a pair of objects. The most popular distance measures are:

1. Euclidean Distance:

Euclidean distance is considered the traditional metric for problems with geometry. It can be simply explained as the ordinary distance between two points and is one of the most used measures in cluster analysis. One of the algorithms that uses this formula is K-Means. Mathematically, it computes the root of the squared differences between the coordinates of two objects:

d(p, q) = d(q, p) = \sqrt{(q_1 - p_1)^2 + (q_2 - p_2)^2 + \cdots + (q_n - p_n)^2} = \sqrt{\sum_{i=1}^{n} (q_i - p_i)^2}

Figure: Euclidean Distance

2. Manhattan Distance:

This determines the sum of the absolute differences between the pairs of coordinates. Suppose we have two points P and Q; to determine the distance between them we sum the distances travelled along each axis. In a plane with P at coordinate (x1, y1) and Q at (x2, y2):

Manhattan distance between P and Q = |x1 - x2| + |y1 - y2|

Here the total length of the red path (moving only parallel to the axes) gives the Manhattan distance between the two points.

3. Jaccard Index:

The Jaccard index measures the similarity of two sets as the size of their intersection divided by the size of their union:

J(A, B) = \frac{|A \cap B|}{|A \cup B|} = \frac{|A \cap B|}{|A| + |B| - |A \cap B|}

Figure: Jaccard Index

4. Minkowski Distance:

It is the generalized form of the Euclidean and Manhattan distance measures. In an N-dimensional space, a point is represented as (x1, x2, ..., xN). Consider two points P1 = (X1, X2, ..., XN) and P2 = (Y1, Y2, ..., YN). Then, the Minkowski distance between P1 and P2 is given as:

D(P1, P2) = \left( |X_1 - Y_1|^p + |X_2 - Y_2|^p + \ldots + |X_N - Y_N|^p \right)^{1/p}

 When p = 2, the Minkowski distance is the same as the Euclidean distance.
 When p = 1, the Minkowski distance is the same as the Manhattan distance.

5. Cosine Index:

The cosine measure for clustering determines the cosine of the angle between two vectors, given by the following formula:

sim(A, B) = \cos(\theta) = \frac{A \cdot B}{\|A\| \, \|B\|}

Here θ is the angle between the two vectors, and A, B are n-dimensional vectors.

Figure: Cosine Distance
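
For concreteness, a short Python sketch that computes these measures for small example vectors and sets:

import math

def euclidean(p, q):
    return math.sqrt(sum((pi - qi) ** 2 for pi, qi in zip(p, q)))

def manhattan(p, q):
    return sum(abs(pi - qi) for pi, qi in zip(p, q))

def minkowski(p, q, power):
    return sum(abs(pi - qi) ** power for pi, qi in zip(p, q)) ** (1 / power)

def jaccard(a, b):
    return len(a & b) / len(a | b)

def cosine_similarity(p, q):
    dot = sum(pi * qi for pi, qi in zip(p, q))
    return dot / (math.sqrt(sum(pi ** 2 for pi in p)) * math.sqrt(sum(qi ** 2 for qi in q)))

p, q = (1.0, 2.0, 3.0), (4.0, 6.0, 8.0)
print(euclidean(p, q), manhattan(p, q), minkowski(p, q, 3))
print(jaccard({"a", "b", "c"}, {"b", "c", "d"}))     # 0.5
print(round(cosine_similarity(p, q), 4))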

THEORY OF LOCALITY SENSITIVE FUNCTIONS

Locality Sensitive Hashing (LSH) uses hash functions that, by design, are more likely to place similar items into the same "buckets" or hash values than dissimilar items. This allows for efficient approximate nearest neighbor search and other similarity-based tasks by focusing on potentially similar items.

Key Concepts:

 Locality-Sensitive Functions: These are hash functions designed to preserve a certain level of similarity between data points.
 Hashing for Similarity: LSH maps items into buckets such that similar items are more likely to end up in the same bucket compared to dissimilar items.
 Reduced Comparisons: By grouping potentially similar items, LSH significantly reduces the number of pairwise comparisons needed to find near-duplicate items.
 Approximate Nearest Neighbor Search: LSH is a popular technique for finding approximate nearest neighbors, especially in high-dimensional data.
 Efficiency: LSH offers a good trade-off between speed and accuracy, making it suitable for large datasets.
 False Positives and Negatives: LSH can have false positives (dissimilar items in the same bucket) and false negatives (similar items in different buckets), but these can be controlled by adjusting parameters.

How LSH Works:

1. Hashing: Input items are hashed using LSH functions.
2. Bucket Grouping: Similar items are more likely to be mapped to the same hash value (bucket).
3. Comparison: Only items within the same buckets are compared to find similar items.

Applications:

 Approximate Nearest Neighbor Search: Finding the nearest neighbor in a large dataset.
 Similarity Search: Identifying similar items (e.g., documents, images, etc.).
 Data Clustering: Grouping similar items together.
 Duplicate Detection: Finding near-duplicate items in large datasets.

METHODS FOR HIGH DEGREE OF SIMILARITIES:

Several methods can be used to measure the high degree of similarity between data points or
objects. These include Euclidean distance, cosine similarity, Jaccard index, and various graph-
based algorithms. Euclidean distance calculates the straight-line distance between two points,
while cosine similarity measures the angle between two vectors, focusing on their orientation
rather than magnitude. The Jaccard index is used to compare the similarity of two sets by
measuring the overlap of their elements. Graph-based algorithms, like graph matching and graph
neural networks, are used for analyzing complex relationships and structures.
Here's a more detailed look at some of these methods:
1. Euclidean Distance:

 Concept: Calculates the straight-line distance between two points in a multi-dimensional space.
 Formula: √((x1 - y1)² + (x2 - y2)² + ... + (xn - yn)²)
 Application: Useful for comparing numerical data points and is commonly used in clustering and classification tasks.

2. Cosine Similarity:

 Concept: Measures the cosine of the angle between two vectors, regardless of their magnitude.
 Formula: (A ⋅ B) / (||A|| * ||B||)
 Application: Effectively compares text documents or other high-dimensional data where the orientation of vectors is more important than their length.


3. Jaccard Index:

 Concept: Calculates the similarity between two sets by dividing the number of elements common to both sets by the total number of unique elements in both sets.
 Formula: |A ∩ B| / |A ∪ B|
 Application: Useful for comparing sets of items or documents where the presence or absence of elements is significant.

4. Graph-Based Algorithms:

 Concept: Utilize graph structures to represent relationships between data points and analyze their similarities.
 Types:
o Graph Matching: Finds similar structures between graphs, often used in chemical compound similarity search.
o Graph Neural Networks (GNNs): Learn complex relationships within graph structures, used in social network analysis and recommendation systems.
o Community Detection: Identifies groups of similar nodes within a graph, useful in social network analysis and clustering.
 Application: Effective for analyzing complex data with intricate relationships, such as social networks, protein interactions, and recommendation systems.

Other Similarity Measures:

 Minkowski Distance: A generalization of Euclidean and Manhattan distances, allowing for different power parameters.
 Manhattan Distance: Calculates the sum of absolute differences between corresponding coordinates of two points.
 Hamming Distance: Measures the number of differing bits between two binary strings.
 Correlation: Measures the strength and direction of the linear relationship between two variables.
 Mahalanobis Distance: Considers the covariance between variables when measuring distance.
UNIT III MINING DATA STREAMS
STREAM DATA MODEL
The Stream Data Model is a framework for processing continuous, real-time data flows—often
referred to as data streams. Unlike traditional batch processing, which handles finite datasets,
stream processing deals with unbounded, never-ending sequences of data events. This model is
fundamental in applications requiring real-time analytics, such as fraud detection,
recommendation systems, and IoT monitoring.

🔍 Core Concepts of the Stream Data Model

1. Continuous Data Flow
Data is ingested and processed in real-time, enabling immediate insights and actions. This continuous flow is essential for applications like financial fraud detection and real-time traffic monitoring.
2. Event Time vs. Processing Time
o Event Time: The timestamp when an event actually occurred.
o Processing Time: The timestamp when the event is processed by the system.
Differentiating between these is crucial for accurate time-based analysis, especially when events may arrive out of order.
3. Windowing
To manage infinite data streams, windowing divides the stream into finite chunks:
o Tumbling Windows: Fixed, non-overlapping intervals.
o Sliding Windows: Overlapping intervals that move over time.
o Session Windows: Based on periods of activity, useful for session-based analytics.
4. Stateful Processing
Maintaining state allows systems to perform operations like aggregations and joins over time. This is achieved through:
o Internal State: Managed within the application instance.
o External State: Stored in external systems like NoSQL databases, offering scalability and durability.
5. Fault Tolerance
Stream processing systems implement mechanisms like checkpointing and state snapshots to ensure data consistency and recoverability in case of failures.
🛠️ Components of a Stream Data Architecture

1. Data Sources
Various producers such as sensors, applications, and databases generate data streams.
2. Stream Ingestion
Components like Kafka or Pulsar collect and transport data to processing engines.
3. Stream Processing Engine
Frameworks like Apache Flink, Spark Streaming, or Google Cloud Dataflow process the
data in real-time.
4. Data Storage
Systems like HDFS, Amazon S3, or cloud-native storage solutions store processed data
for further analysis.
5. Analytics and Visualization
Tools and dashboards that provide insights and visual representations of the processed
data.
6. Data Sink
Final destinations where processed data is stored or acted upon, such as databases or data
warehouses.

SAMPLING DATA IN THE STREAM


Sampling data in a stream is a critical technique for efficiently managing and analyzing
continuous, unbounded data flows. Given the constraints of limited memory and the need for
real-time processing, various sampling methods have been developed to address these
challenges.

🔍 Key Sampling Techniques for Data Streams

1. Reservoir Sampling

Reservoir Sampling is a randomized algorithm designed to select a representative sample of size k from a stream of unknown or infinite length. It ensures that each item in the stream has an equal probability of being included in the sample.

How it works:

 Initialization: Store the first k items in a reservoir.
 Processing Subsequent Items: For each new item at position i (where i > k, counting from 1), generate a random index j between 0 and i - 1. If j is less than k, replace the item at index j in the reservoir with the new item.

Advantages:

 Efficient with a time complexity of O(n), where n is the number of items processed.
 Requires only O(k) space, making it suitable for large or unbounded data streams.

Applications:

 Maintaining a representative sample for real-time analytics.
 Sampling data for machine learning model training on streaming data.
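
A minimal Python sketch of this procedure (Algorithm R), using an in-memory iterable to stand in for the stream:

import random

def reservoir_sample(stream, k):
    # Keep a uniform random sample of k items from a stream of unknown length
    reservoir = []
    for i, item in enumerate(stream):        # i is the 0-based position in the stream
        if i < k:
            reservoir.append(item)           # fill the reservoir with the first k items
        else:
            j = random.randint(0, i)         # random index in [0, i]
            if j < k:
                reservoir[j] = item          # replace with probability k / (i + 1)
    return reservoir

print(reservoir_sample(range(1_000_000), k=5))   # five values sampled uniformly at random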

2. Sliding Window Sampling

This method maintains a fixed-size window of the most recent n items in the stream. As new
items arrive, the oldest items are discarded, ensuring that the sample reflects the most recent
data.

How it works:

 Use a queue or circular buffer to store the current window of items.
 When a new item arrives, remove the oldest item and add the new item to the window.

Advantages:

 Simple to implement and understand.
 Provides a snapshot of the most recent data, useful for detecting short-term trends.

Applications:

 Monitoring real-time metrics and alerts.
 Detecting recent anomalies or changes in data patterns.

3. Weighted Reservoir Sampling

An extension of reservoir sampling that assigns different probabilities to items based on their importance or relevance, allowing for biased sampling.

How it works:

 Initialization: Store the first k items in the reservoir.
 Processing Subsequent Items: For each new item at position i, generate a random number and determine whether the item should replace an existing item in the reservoir, considering its weight.

Advantages:

 Allows for prioritization of certain items based on predefined criteria.
 Useful for applications where some data points are more significant than others.

Applications:

 Sampling for biased machine learning models.
 Prioritizing critical events in monitoring systems.

4. Histograms and Quantile Sketches

These methods approximate the distribution of data streams by dividing the data into intervals or buckets and maintaining counts or summaries for each bucket.

How it works:

 Divide the data range into fixed intervals, or use algorithms like the V-optimal histogram to determine bucket boundaries.
 Update the counts or summaries as new data arrives.

Advantages:

 Provides insights into the distribution and frequency of data.
 Useful for estimating quantiles and percentiles in streaming data.

Applications:

 Estimating percentiles for real-time analytics.
 Summarizing data distributions for monitoring and reporting.

🛠️ Practical Considerations

 Memory Constraints: Choose sampling methods that fit within available memory,
especially when dealing with large or unbounded data streams.
 Real-Time Processing: Ensure that the sampling technique supports real-time data
processing to meet application requirements.
 Bias and Representativeness: Select appropriate sampling methods to avoid bias and
ensure that the sample accurately represents the data stream.

Filtering streams
Filtering streams is a fundamental technique in data processing, allowing systems to efficiently
handle and analyze continuous data flows by selecting only relevant information. This approach
is particularly crucial in scenarios involving large-scale data streams, such as network traffic
analysis, real-time analytics, and event-driven architectures.

🔍 Key Filtering Techniques in Stream Processing

1. Threshold-Based Filtering

This method involves applying predefined criteria to filter out data points that do not meet
specific thresholds. It's commonly used to detect anomalies or focus on significant events.

Example: Filtering sensor data to exclude readings below a certain value.

Considerations:

 Pros: Simple to implement and understand.
 Cons: Requires careful selection of thresholds to avoid missing important data.

2. Pattern Matching and Regular Expressions

Utilizing regular expressions allows for filtering data streams based on specific patterns, such as
keywords or formats. This is particularly useful in text processing and log analysis.

Example: Filtering log entries that match a particular error pattern.

Considerations:

 Pros: Powerful for complex pattern matching.
 Cons: Can be computationally intensive; requires careful crafting of regular expressions.

3. Bloom Filters

A Bloom filter is a space-efficient probabilistic data structure used to test whether an element is a member of a set. It's particularly useful for filtering out known elements without storing the entire dataset.

Example: Filtering out already-seen user IDs in a recommendation system.

Considerations:

 Pros: Very low memory usage; fast membership tests.
 Cons: Allows for false positives; does not support deletion of elements.
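
A toy Bloom filter in Python built from a few salted hash functions (an illustrative sketch; production systems size the bit array and the number of hashes from the expected item count and target false-positive rate):

import hashlib

class BloomFilter:
    def __init__(self, size=1024, num_hashes=3):
        self.size = size
        self.num_hashes = num_hashes
        self.bits = [False] * size

    def _positions(self, item):
        # Derive num_hashes bit positions from salted MD5 digests
        for seed in range(self.num_hashes):
            digest = hashlib.md5(f"{seed}:{item}".encode()).hexdigest()
            yield int(digest, 16) % self.size

    def add(self, item):
        for pos in self._positions(item):
            self.bits[pos] = True

    def might_contain(self, item):
        # False means definitely absent; True means possibly present (false positives possible)
        return all(self.bits[pos] for pos in self._positions(item))

seen = BloomFilter()
seen.add("user_42")
print(seen.might_contain("user_42"), seen.might_contain("user_99"))   # True False (very likely)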

4. Deep Packet Inspection (DPI)


DPI involves examining the data part (and sometimes the header) of a packet as it passes an inspection point. It's used for network traffic filtering, security monitoring, and policy enforcement.

Example: Blocking malicious payloads or enforcing content policies.

Considerations:

 Pros: Allows for detailed inspection and filtering based on content.
 Cons: Can introduce latency; may raise privacy concerns.

🛠️ Implementing Stream Filtering in Java

Java's Stream API provides powerful tools for filtering collections and data streams, for example by passing a predicate to a stream's filter() method within a stream pipeline.

Counting Distinct Elements in a Stream

Counting distinct elements in a stream refers to the problem of estimating the number of distinct elements (cardinality) in a data stream using limited memory. This is a fundamental problem in data stream processing, especially when dealing with large-scale data where storing all elements is impractical.

🔍 Key Algorithms for Counting Distinct Elements

1. Flajolet–Martin Algorithm

The Flajolet–Martin algorithm is a probabilistic counting algorithm that estimates the number of distinct elements in a stream. It uses a hash function to map each element to a bit string and tracks the length of the longest run of trailing zeros (the tail length) observed among the hash values. If the maximum tail length seen is R, the cardinality is estimated to be on the order of 2^R. This algorithm uses logarithmic space and provides an approximation with a known error bound.
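
A rough Python sketch of the idea using a single hash function and the 2^R estimate (one estimator has high variance, so practical versions combine many estimators or use HyperLogLog):

import hashlib

def trailing_zeros(n):
    # Number of trailing zero bits in n (tail length); 0 for n == 0 by convention here
    if n == 0:
        return 0
    count = 0
    while n & 1 == 0:
        n >>= 1
        count += 1
    return count

def flajolet_martin_estimate(stream):
    max_tail = 0
    for item in stream:
        h = int(hashlib.md5(str(item).encode()).hexdigest(), 16)
        max_tail = max(max_tail, trailing_zeros(h))
    return 2 ** max_tail   # crude cardinality estimate

stream = [x % 500 for x in range(10_000)]   # 500 distinct values repeated many times
print(flajolet_martin_estimate(stream))     # a power of two in the rough vicinity of 500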

2. HyperLogLog Algorithm

An improvement over the Flajolet–Martin algorithm, HyperLogLog reduces the variance of the estimate by splitting the stream across many registers (stochastic averaging) and combining their results with a harmonic mean. It provides a more accurate estimate with a fixed amount of memory, making it suitable for large-scale applications.

3. CVM Algorithm

The CVM (Chakraborty–Vinodchandran–Meel) algorithm is a recent approach that uses sampling instead of hashing. It provides an unbiased estimator for the number of distinct elements in a stream and offers (ε, δ)-approximation guarantees. This algorithm is particularly useful when unbiased estimation is critical.

4. Lossy Counting Algorithm

The Lossy Counting algorithm maintains approximate counts of elements in a data stream. It divides the stream into buckets and prunes elements whose counts fall below a certain threshold. This approach is effective for estimating the frequency of elements and can be adapted for distinct counting by adjusting the threshold.

🧩 Practical Considerations

 Memory Usage: Algorithms like Flajolet–Martin and HyperLogLog are designed to use logarithmic memory, making them suitable for large-scale applications.
 Accuracy: While these algorithms provide approximate counts, they come with known error bounds, allowing users to trade off between memory usage and accuracy.
 Implementation: Many of these algorithms have been implemented in various programming languages and are available in libraries, making them accessible for practical use.

Estimating Moments

Estimating moments in data streams is a fundamental task in streaming algorithms, enabling the analysis of data characteristics such as distribution, variance, and higher-order statistics without storing the entire dataset. Moments are statistical measures that provide insights into the shape and spread of a distribution.

📊 Understanding Frequency Moments

In the context of data streams, the p-th frequency moment (denoted F_p) is defined as:

F_p = \sum_{i=1}^{n} f_i^p

Where:

 f_i is the frequency of the i-th distinct element in the stream.
 p is a positive integer.

The first moment (F_1) represents the total number of elements in the stream, while the second moment (F_2) provides information about the distribution of frequencies.

🧩 Estimating Moments in Data Streams

Due to the constraints of data streams—such as limited memory and high throughput—exact
computation of moments is often infeasible. Therefore, approximate methods are employed:

1. AMS Sketch (Alon-Matias-Szegedy Sketch)

 Purpose: Estimates the second frequency moment F_2.
 Method: Uses hash functions to map elements to random ±1 variables, maintains the running signed sum of these variables, and squares it to estimate F_2.
 Advantages: Requires sublinear space and provides probabilistic guarantees.
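
A toy AMS-style estimator for F_2 in Python, using salted hashes as a stand-in for 4-wise independent ±1 hash functions and simple averaging instead of the median-of-means used in practice:

import hashlib

def sign_hash(item, seed):
    # Map an item to +1 or -1 (a stand-in for a 4-wise independent hash function)
    h = int(hashlib.md5(f"{seed}:{item}".encode()).hexdigest(), 16)
    return 1 if h & 1 else -1

def ams_f2_estimate(stream, num_estimators=50):
    # Each estimator keeps a running signed sum Z; E[Z^2] equals F_2,
    # so averaging Z^2 over independent estimators gives an unbiased estimate
    sums = [0] * num_estimators
    for item in stream:
        for seed in range(num_estimators):
            sums[seed] += sign_hash(item, seed)
    return sum(z * z for z in sums) / num_estimators

stream = ["a"] * 3 + ["b"] * 2 + ["c"]
print(ams_f2_estimate(stream))   # noisy estimate of the true F_2 = 3^2 + 2^2 + 1^2 = 14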

2. Count-Min Sketch

 Purpose: Estimates the frequency of elements and can be adapted to estimate moments.
 Method: Maintains a set of hash tables to count occurrences of elements.
 Advantages: Offers fast updates and queries with a trade-off between accuracy and
memory usage.

3. Reservoir Sampling

 Purpose: Provides a representative sample from the stream.
 Method: Randomly selects elements to maintain a sample of fixed size.
 Advantages: Enables unbiased estimation of moments from the sample.

🔬 Advanced Techniques and Considerations

 Weighted Sampling: Incorporates weights into sampling to improve accuracy in moment estimation.
 Sublinear Algorithms: Algorithms that use less memory while providing accurate estimates for higher-order moments.
 Trade-offs: There is a balance between memory usage, computational complexity, and estimation accuracy.

Counting Ones in a Window

Counting the number of ones within a sliding window is a common problem in data stream processing, where you need to maintain a count of ones over a fixed-size window as it moves through a stream of binary data. This is particularly useful in applications like network traffic analysis, real-time analytics, and monitoring systems.

🔄 Sliding Window Approach

The sliding window technique allows you to efficiently calculate the number of ones in a
window of size k as it moves through the stream. Here's how it works:

1. Initialize the Window:
o Start by counting the number of ones in the first window of size k.
2. Slide the Window:
o For each new element added to the window (i.e., as the window slides to the right):
 Add the new element's value to the count.
 Remove the value of the element that is sliding out of the window from the count.
3. Update the Count:
o After each slide, the count reflects the number of ones in the current window.

This approach ensures that each element is processed only once, leading to an efficient O(n) time complexity, where n is the number of elements in the stream.

🧩 Example Implementation in Python

Here's a Python implementation of the sliding window technique to count the number of ones in a binary stream:

def count_ones_in_window(stream, window_size):
    count = 0
    result = []
    # Initialize the count with the first window
    for i in range(window_size):
        count += stream[i]
    result.append(count)
    # Slide the window over the stream
    for i in range(window_size, len(stream)):
        count += stream[i] - stream[i - window_size]
        result.append(count)
    return result

# Example usage
stream = [1, 0, 1, 1, 0, 1, 1, 0, 1]
window_size = 3
print(count_ones_in_window(stream, window_size))

Decaying Windows

Decaying windows, also known as exponentially decaying windows or sliding time windows with exponential weighting, are a powerful technique in data stream processing. They assign exponentially decreasing weights to data points over time, emphasizing more recent data while gradually diminishing the influence of older data. This approach is particularly useful for analyzing time-sensitive patterns and trends in streams, where recent observations are considered more relevant than past ones.
🔍 Key Concepts of Decaying Windows

 Exponential Weighting: Each data point is assigned a weight that decreases exponentially with age. The weight w(t) of a data point observed at time t is given by:

w(t) = e^{-\lambda \cdot t}

where λ is a decay factor controlling the rate of decrease. A higher λ results in faster decay, giving more importance to recent data.

 Time Sensitivity: Decaying windows are ideal for applications where recent data is more indicative of current trends, such as fraud detection, real-time analytics, and monitoring systems.
 Memory Efficiency: By reducing the weight of older data points, decaying windows help in maintaining a compact representation of the stream, making them suitable for high-volume data environments.

🧩 Applications of Decaying Windows

 Trend Detection: Identifying shifts in data patterns by giving more weight to recent observations.
 Anomaly Detection: Spotting outliers or unusual events that deviate from recent trends.
 Time-Series Forecasting: Predicting future values based on weighted historical data.
 Real-Time Analytics: Providing up-to-date insights by continuously processing the latest data.

⚙️ Implementing a Decaying Window

To implement a decaying window, you can maintain a data structure that stores the current weighted sum and count of data points. Upon receiving a new data point:

1. Decay Existing Weights: Multiply the current weighted sum by e^{-\lambda \cdot \Delta t}, where Δt is the time difference since the last update.
2. Add New Data Point: Incorporate the new data point with its weight into the sum.
3. Update Metrics: Recalculate any necessary metrics (e.g., moving averages, totals) using the updated weighted sum.

This method ensures that the influence of older data diminishes over time, allowing the system to adapt to recent changes in the data stream.
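
A compact Python sketch of this update rule, assuming each observation arrives with a numeric timestamp and an assumed decay factor λ:

import math

class DecayingSum:
    def __init__(self, decay_lambda):
        self.decay_lambda = decay_lambda   # assumed decay factor (per unit of time)
        self.weighted_sum = 0.0
        self.last_time = None

    def add(self, value, timestamp):
        if self.last_time is not None:
            dt = timestamp - self.last_time
            self.weighted_sum *= math.exp(-self.decay_lambda * dt)   # decay the old sum
        self.weighted_sum += value                                   # add the new observation
        self.last_time = timestamp
        return self.weighted_sum

window = DecayingSum(decay_lambda=0.1)
for t, v in [(0, 5.0), (1, 3.0), (5, 10.0)]:
    print(round(window.add(v, t), 3))   # older values contribute less as time passes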

✅ Advantages of Decaying Windows

 Adaptability: Quickly responds to changes in data patterns, making them suitable for dynamic environments.
 Efficiency: Reduces memory usage by focusing on recent data, which is crucial for high-velocity data streams.
 Robustness: Mitigates the impact of outliers or anomalies by giving them less weight over time.

Unit 4: Link Analysis and Frequent Itemsets

PageRank

🔍 What Is PageRank?

PageRank assigns a numerical value to each web page, reflecting its relative importance. The
core idea is that a page is more important if it is linked to by other important pages. This concept
is akin to academic citation analysis, where a paper's significance is partly determined by how
often and by whom it is cited.

How It Works

PageRank operates on the principle of a random surfer model:

1. Random Surfer Model: Imagine a person randomly clicking on links across the web. The probability that this person lands on a particular page is its PageRank.
2. Mathematical Formula:

PR(A) = \frac{1 - d}{N} + d \sum_{i=1}^{n} \frac{PR(T_i)}{C(T_i)}

Where:

o PR(A) is the PageRank of page A.
o d is the damping factor, typically set to 0.85.
o N is the total number of pages.
o PR(T_i) is the PageRank of each page T_i that links to A.
o C(T_i) is the number of outbound links on page T_i.

This formula ensures that a page's rank is influenced by both the quantity and quality of links pointing to it.

3. Iterative Calculation: PageRank is computed iteratively. Starting with an initial guess (e.g., equal rank for all pages), the algorithm repeatedly updates the ranks until they converge to stable values.

⚙️ Practical Implementation

In practice, PageRank is computed using the power iteration method, which is efficient for large-scale graphs like the web. This method involves repeatedly multiplying the normalized link matrix of the web graph by a vector representing the current PageRank estimates and normalizing the result.

For example, in Python, one might use NumPy to implement this iterative process, adjusting the PageRank values until they converge, as in the sketch below.
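
A small NumPy sketch of this power-iteration process on a hypothetical four-page graph (illustrative only; the graph, damping factor, and convergence tolerance are assumptions):

import numpy as np

def pagerank(adjacency, d=0.85, tol=1e-8, max_iter=100):
    # adjacency[i, j] = 1 if page j links to page i
    n = adjacency.shape[0]
    out_degree = adjacency.sum(axis=0)
    out_degree[out_degree == 0] = 1           # crude handling of dangling nodes
    M = adjacency / out_degree                # column-stochastic transition matrix
    ranks = np.full(n, 1.0 / n)               # start with equal rank for every page
    for _ in range(max_iter):
        new_ranks = (1 - d) / n + d * M @ ranks
        if np.abs(new_ranks - ranks).sum() < tol:
            break
        ranks = new_ranks
    return ranks

# Hypothetical 4-page web: columns are source pages, rows are link targets
A = np.array([[0, 0, 1, 0],
              [1, 0, 0, 0],
              [1, 1, 0, 1],
              [0, 1, 0, 0]], dtype=float)
print(pagerank(A).round(3))   # pages with more incoming links receive higher rank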

🌐 Applications Beyond Web Search

While originally designed for ranking web pages, PageRank has been adapted for various other
applications:

 Social Networks: Identifying influential users or communities.
 Citation Networks: Ranking academic papers based on citation patterns.
 Recommendation Systems: Suggesting products or content by evaluating interconnectedness.
 Biological Networks: Analyzing protein interactions or ecological networks.

⚠️ Limitations and Challenges

 Dangling Nodes: Pages with no outbound links can disrupt the calculation. This is typically handled by redistributing their rank evenly among all pages.
 Spider Traps: Closed loops of pages that link only to each other can cause rank to accumulate without distribution.
 Scalability: Computing PageRank for the entire web requires significant computational resources.

🧩 Variants of PageRank

To address specific needs, several variations of PageRank have been developed:

 Personalized PageRank: Tailors the ranking based on a user's preferences or behavior.
 Topic-Sensitive PageRank: Adjusts ranks according to specific topics or categories.
 Weighted PageRank: Assigns different weights to links based on their perceived importance.

These variants enable more nuanced and context-aware ranking systems.

Efficient computation
Efficient computation is a cornerstone of computer science, aiming to solve problems using
minimal resources—time, memory, and energy. Optimizing algorithms and leveraging
appropriate computational models can significantly enhance performance, especially in large-
scale systems and real-time applications.

🔧 Strategies for Efficient Computation

1. Algorithmic Optimization

 Big-O Analysis: Assessing the time and space complexity of algorithms helps in
understanding their scalability and efficiency.
 Precomputation: Performing expensive computations ahead of time and storing the
results in lookup tables can save time during runtime. For instance, using precomputed
mathematical constants like π and e instead of calculating them repeatedly.
 Memoization: Storing the results of expensive function calls and reusing them when the
same inputs occur again, thereby avoiding redundant calculations (see the short example
below).
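
As an illustration of memoization, the snippet below caches Fibonacci results with
functools.lru_cache; the Fibonacci example itself is an assumption, chosen only because its naive
recursion repeats the same subproblems many times.

from functools import lru_cache

@lru_cache(maxsize=None)      # memoize: each distinct n is computed only once
def fib(n):
    if n < 2:
        return n
    return fib(n - 1) + fib(n - 2)

print(fib(80))                # fast, because intermediate results are reused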

2. Advanced Computational Models

 CORDIC Algorithm: An efficient method for computing trigonometric, hyperbolic,


logarithmic, and exponential functions using only addition, subtraction, bitshift, and
lookup tables, making it ideal for hardware implementations.
 Divide and Conquer: Breaking a problem into smaller subproblems, solving them
independently, and combining their solutions. This approach is exemplified in algorithms
like QuickSort and MergeSort.
 Fast Fourier Transform (FFT): A divide-and-conquer algorithm that computes the
Discrete Fourier Transform (DFT) and its inverse efficiently, reducing the computational
complexity from O(n²) to O(n log n).

3. Parallel and Distributed Computing

 Parallel Computing: Dividing a problem into subproblems that can be solved


concurrently on multiple processors, leading to significant speedup.
 Distributed Systems: Utilizing multiple machines to solve a problem, which can handle
larger datasets and more complex computations than a single machine.

4. Hardware-Level Optimizations

 Specialized Hardware: Using hardware accelerators like Graphics Processing Units


(GPUs) and Field-Programmable Gate Arrays (FPGAs) to perform computations more
efficiently than general-purpose CPUs.
 Energy-Efficient Computing: Designing algorithms and systems that minimize energy
consumption, which is crucial for mobile devices and large-scale data centers.

Topic-Sensitive PageRank


Topic-Sensitive PageRank (TSPR) is an enhancement of the original PageRank algorithm,
designed to provide more relevant search results by considering the topical context of a query.
Introduced by Taher H. Haveliwala in 2002, TSPR adjusts the PageRank computation to reflect
the importance of web pages concerning specific topics, rather than using a single, generic
ranking vector.

🔍 How Topic-Sensitive PageRank Works

In traditional PageRank, a single vector is computed to capture the relative importance of web
pages based solely on their link structure. However, this approach does not account for the
topical relevance of pages to a given query. TSPR addresses this by computing a set of
PageRank vectors, each biased using a representative topic. This allows the algorithm to capture
the notion of importance with respect to a particular topic.

At query time, TSPR computes the topic-sensitive PageRank scores for pages satisfying the
query using the topic of the query keywords. For searches performed in context (e.g., when the
search query is highlighted within a web page), TSPR computes the topic-sensitive PageRank
scores using the topic of the context in which the query appeared.
🧩 Key Concepts

 Biasing the Random Walk: TSPR introduces artificial links into the web graph during
the offline rank computation, biasing the random walk towards pages related to a specific
topic.
 Personalization Vector: A non-uniform personalization vector is used in TSPR,
differing from the uniform vector used in traditional PageRank. This vector introduces
bias in all iterations of the iterative computation of the PageRank vector.
 Query-Time Processing: At query time, TSPR computes the total score of a page with
respect to a query by taking a linear combination of the topic-sensitive PageRank vectors,
weighted by the relevance of the query to each topic (a sketch of both steps follows this
list).
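
The sketch below shows the two ingredients just described: an offline, topic-biased PageRank
computed with a non-uniform personalization vector, and a query-time linear combination of the
resulting vectors. The tiny graph, the two hypothetical topics, and the query-to-topic weights are
all illustrative assumptions.

import numpy as np

def biased_pagerank(M, v, d=0.85, iters=100):
    # v is the personalization vector: teleports land only on the topic's pages.
    r = np.full(len(v), 1.0 / len(v))
    for _ in range(iters):
        r = (1 - d) * v + d * M @ r
    return r

# Tiny 4-page graph (column-stochastic out-link matrix), for illustration only.
M = np.array([[0.0, 0.0, 1.0, 0.5],
              [0.5, 0.0, 0.0, 0.0],
              [0.5, 1.0, 0.0, 0.5],
              [0.0, 0.0, 0.0, 0.0]])

# Hypothetical topic sets: topic 0 ~ pages {0, 1}, topic 1 ~ pages {2, 3}.
v_topics = np.array([[0.5, 0.5, 0.0, 0.0],
                     [0.0, 0.0, 0.5, 0.5]])
topic_ranks = np.array([biased_pagerank(M, v) for v in v_topics])

# Query time: weight each topic vector by the query's (assumed) affinity to the topic.
query_topic_weights = np.array([0.8, 0.2])
scores = query_topic_weights @ topic_ranks
print(scores)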

📈 Applications and Benefits

 Improved Relevance: TSPR provides more accurate rankings by considering the topical
relevance of pages, ensuring that search results are more aligned with user interests.
 Context-Aware Search: By computing topic-sensitive PageRank scores based on the
context of the query, TSPR enhances the accuracy of search results in specific contexts.
 Personalized Search: TSPR can be adapted to personalize search results based on user
preferences and browsing history, providing a more tailored search experience.

⚠️ Challenges and Considerations

 Computational Complexity: Computing multiple PageRank vectors for each topic can
be computationally intensive, especially for large-scale web graphs.
 Topic Selection: The effectiveness of TSPR depends on the selection of representative
topics. Poorly chosen topics may not accurately reflect the user's intent.
 Dynamic Content: The web is constantly evolving, and maintaining up-to-date topic-
sensitive PageRank vectors requires continuous computation and updates.

Link spam
Link Spam refers to the practice of acquiring or placing backlinks with the primary intent of
manipulating a website's ranking in search engine results. These backlinks are often irrelevant,
low-quality, or artificially generated, and they violate search engine guidelines.

🔍 Common Types of Link Spam

1. Paid Links: Purchasing backlinks from other websites, regardless of relevance or content
quality.
2. Automated Linking: Using bots or software to generate backlinks at scale without
regard for quality or relevance.
3. Excessive Link Exchanges: Engaging in reciprocal linking schemes where websites
exchange links solely to boost rankings.
4. Forum and Blog Comment Spam: Posting keyword-rich links in forums or blog
comments without contributing meaningful content.
5. Link Farms and Private Blog Networks (PBNs): Creating or utilizing networks of
websites that exist solely to link to each other, artificially inflating link profiles.
6. Widely Distributed Links: Embedding links in widgets, footers, or templates that are
distributed across multiple websites.

⚠️ Risks and Consequences

Engaging in link spam can lead to several negative outcomes:

 Ranking Penalties: Search engines may lower a website's ranking or remove it from search
results entirely.
 Deindexing: Complete removal of a website from the search engine's index.
 Loss of Trust: Users may perceive the website as untrustworthy, leading to decreased
traffic and conversions.
 Wasted Resources: Investing time and money in link spam tactics that ultimately harm the
website's performance.

Market basket model

Market Basket Analysis in Data Mining





Market Basket Analysis is a data mining technique used to uncover purchase patterns in a retail
setting. It involves analyzing the combinations of products that are bought together.

The technique studies the purchases a customer makes in a supermarket and identifies the items
that are frequently purchased together. Companies can use this analysis to design deals, offers,
and sales, and data mining techniques make the analysis practical at scale. Example:

 Sales and marketing teams use data mining to provide better customer service, improve
cross-selling opportunities, and increase direct mail response rates.

 Customer retention benefits from data mining through pattern identification and prediction
of likely defections.

 Risk assessment and fraud detection use data mining to identify inappropriate or unusual
behavior.

Market basket analysis mainly works with the ASSOCIATION RULE {IF} -> {THEN}.

 IF means Antecedent: An antecedent is an item found within the data

 THEN means Consequent: A consequent is an item found in combination with the antecedent.

Let's see how the association rule {IF} -> {THEN} is used in Market Basket Analysis. For
example, customers who buy a domain are likely to also need extra plugins/extensions to make
the site easier for their users.

As noted above, the antecedent is the itemset found in the data; it corresponds to the {IF}
component, and in this example it is the domain.

Likewise, the consequent is the item found in combination with the antecedent; it corresponds to
the {THEN} component, and in this example it is the extra plugins/extensions.

With these rules, we can predict customer behavioral patterns and bundle products into offers
that customers are likely to buy, which in turn increases the company's sales and revenue.

With the help of the Apriori Algorithm, we can further classify and simplify the item sets which
are frequently bought by the consumer.

There are three components in APRIORI ALGORITHM:


 SUPPORT

 CONFIDENCE

 LIFT

Now take an example: suppose 5,000 transactions have been made through a popular eCommerce
website, and we want to calculate the support, confidence, and lift for two products, say a pen and
a notebook. Out of the 5,000 transactions, 500 contain only a pen, 700 contain only a notebook,
and 1,000 contain both, so a pen appears in 1,500 transactions and a notebook in 1,700.

SUPPORT: the fraction of all transactions that contain the itemset:

Support(A, B) = freq(A, B) / N

support(pen, notebook) = transactions containing both / total transactions = 1000/5000 = 20 percent

CONFIDENCE: how often the consequent is bought when the antecedent is bought, calculated as
the combined transactions divided by the antecedent's transactions:

Confidence(A → B) = freq(A, B) / freq(A)

confidence(pen → notebook) = 1000/1500 ≈ 67 percent

LIFT: the ratio between the observed confidence and the support of the consequent, which tells us
how much more likely the two items are bought together than by chance:

Lift(A → B) = Confidence(A → B) / Support(B)

lift(pen → notebook) = 0.67 / (1700/5000) ≈ 2

A lift value below 1 means the combination is not frequently bought together. Here the lift of
about 2 shows that the probability of buying both items together is high compared with the
transactions for the individual items.
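
The same three calculations can be written in a few lines of Python, using the transaction counts
from the example above (500 pen-only, 700 notebook-only, 1,000 with both, out of 5,000); the
variable names are simply illustrative.

N = 5000                                     # total transactions
pen_only, notebook_only, both = 500, 700, 1000
pen = pen_only + both                        # transactions containing a pen
notebook = notebook_only + both              # transactions containing a notebook

support_both = both / N                      # Support(pen, notebook)
confidence = both / pen                      # Confidence(pen -> notebook)
lift = confidence / (notebook / N)           # Lift(pen -> notebook)

print(f"support={support_both:.2f}, confidence={confidence:.2f}, lift={lift:.2f}")
# support=0.20, confidence=0.67, lift=1.96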

With this, we come to an overall view of the Market Basket Analysis in Data Mining and how to
calculate the sales for combination products.

Types of Market Basket Analysis

There are three types of Market Basket Analysis. They are as follow:
1. Descriptive market basket analysis: This sort of analysis looks for patterns and connections in
the data that exist between the components of a market basket. This kind of study is mostly
used to understand consumer behavior, including which products are purchased in combination
and what the most typical item combinations are. Retailers can place products in their stores more
profitably by understanding which products are frequently bought together with the aid of
descriptive market basket analysis.

2. Predictive Market Basket Analysis: Market basket analysis that predicts future purchases based
on past purchasing patterns is known as predictive market basket analysis. Large volumes of
data are analyzed using machine learning algorithms in this sort of analysis in order to create
predictions about which products are most likely to be bought together in the future. Retailers
may make data-driven decisions about which products to carry, how to price them, and how to
optimize shop layouts with the use of predictive market basket research.

3. Differential Market Basket Analysis: Differential market basket analysis analyses two sets of
market basket data to identify variations between them. Comparing the behavior of various
client segments or the behavior of customers over time is a common usage for this kind of
study. Retailers can respond to shifting consumer behavior by modifying their marketing and
sales tactics with the help of differential market basket analysis.

Benefits of Market Basket Analysis

1. Enhanced Customer Understanding: Market basket research offers insights into customer
behavior, including what products they buy together and which products they buy the most
frequently. Retailers can use this information to better understand their customers and make
informed decisions.

2. Improved Inventory Management: By examining market basket data, retailers can determine
which products are sluggish sellers and which ones are commonly bought together. Retailers
can use this information to make well-informed choices about what products to stock and how
to manage their inventory most effectively.

3. Better Pricing Strategies: A better understanding of the connection between product prices and
consumer behavior might help merchants develop better pricing strategies. Using this
knowledge, pricing plans that boost sales and profitability can be created.

4. Sales Growth: Market basket analysis can assist businesses in determining which products are
most frequently bought together and where they should be positioned in the store to grow
sales. Retailers may boost revenue and enhance customer shopping experiences by improving
store layouts and product positioning.

Applications of Market Basket Analysis

1. Retail: Market basket research is frequently used in the retail sector to examine consumer
buying patterns and inform decisions about product placement, inventory management, and
pricing tactics. Retailers can utilize market basket research to identify which items are sluggish
sellers and which ones are commonly bought together, and then modify their inventory
management strategy accordingly.

2. E-commerce: Market basket analysis can help online merchants better understand the customer
buying habits and make data-driven decisions about product recommendations and targeted
advertising campaigns. The behaviour of visitors to a website can be examined using market
basket analysis to pinpoint problem areas.

3. Finance: Market basket analysis can be used to evaluate investor behaviour and forecast the
types of investment items that investors will likely buy in the future. The performance of
investment portfolios can be enhanced by using this information to create tailored investment
strategies.

5. Telecommunications: To evaluate consumer behaviour and make data-driven decisions about
which goods and services to provide, the telecommunications industry can employ market
basket analysis. This data can be used to improve customer satisfaction and the overall
customer experience.

5. Manufacturing: To evaluate consumer behaviour and make data-driven decisions about which
products to produce and which materials to employ in the production process, the
manufacturing sector might use market basket analysis. Utilizing this knowledge will increase
effectiveness and cut costs.

Apriori Algorithm



Apriori Algorithm is a foundational method in data mining used for discovering frequent
itemsets and generating association rules. Its significance lies in its ability to identify
relationships between items in large datasets which is particularly valuable in market basket
analysis.

For example, if a grocery store finds that customers who buy bread often also buy butter, it can
use this information to optimise product placement or marketing strategies.

How the Apriori Algorithm Works?

The Apriori Algorithm operates through a systematic process that involves several key steps:

1. Identifying Frequent Itemsets: The algorithm begins by scanning the dataset to identify
individual items (1-item) and their frequencies. It then establishes a minimum support
threshold, which determines whether an itemset is considered frequent.
2. Creating Candidate Itemsets: Once frequent 1-itemsets (single items) are identified, the
algorithm generates candidate 2-itemsets by combining frequent items. This process
continues iteratively, forming larger candidate itemsets (k-itemsets) until no more frequent
itemsets can be found.

3. Removing Infrequent Itemsets: The algorithm employs a pruning technique based on
the Apriori Property, which states that if an itemset is infrequent, all its supersets must also be
infrequent. This significantly reduces the number of combinations that need to be evaluated.

4. Generating Association Rules: After identifying frequent itemsets, the algorithm generates
association rules that illustrate how items relate to one another, using metrics like support,
confidence, and lift to evaluate the strength of these relationships.

Key Metrics of Apriori Algorithm

 Support: This metric measures how frequently an item appears in the dataset relative to the
total number of transactions. A higher support indicates a more significant presence of the
itemset in the dataset. Support tells us how often a particular item or combination of items
appears in all the transactions ("Bread is bought in 20% of all transactions.")

 Confidence: Confidence assesses the likelihood that an item Y is purchased when item X is
purchased. It provides insight into the strength of the association between two items.
Confidence tells us how often items go together. ("If bread is bought, butter is bought 75% of
the time.")

 Lift: Lift evaluates how much more likely two items are to be purchased together compared to
being purchased independently. A lift greater than 1 suggests a strong positive association. Lift
shows how strong the connection is between items. ("Bread and butter are much more likely to
be bought together than by chance.")

Let's understand the Apriori Algorithm with the help of an example. Consider the
following dataset; we will find frequent itemsets and generate association rules for it:

Transactions of a Grocery Shop
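
The original transaction table is an image and is not reproduced in this text. The list below is one
possible set of five transactions that is consistent with every support count used in the steps that
follow; it is a reconstruction for illustration, not the original data.

# One possible transaction set consistent with the counts used below.
transactions = [
    {"Bread", "Milk"},            # T1
    {"Bread", "Butter"},          # T2
    {"Bread", "Butter", "Milk"},  # T3
    {"Bread", "Milk"},            # T4
    {"Butter", "Milk"},           # T5
]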

Step 1 : Setting the parameters

 Minimum Support Threshold: 50% (an itemset must appear in at least 3 of the 5 transactions).
Support is computed as:

Support(A) = (Number of transactions containing itemset A) / (Total number of transactions)

 Minimum Confidence Threshold: 70% (you can change the value of these parameters to suit the
use case and problem statement). Confidence is computed as:

Confidence(X → Y) = Support(X ∪ Y) / Support(X)

Step 2: Find Frequent 1-Itemsets

Let's count how many transactions include each item in the dataset (calculating the frequency of
each item).

Frequent 1-Itemsets

All items have support ≥ 50%, so they qualify as frequent 1-itemsets. Any item with
support < 50% would be omitted from the frequent 1-itemsets.

Step 3: Generate Candidate 2-Itemsets

Combine the frequent 1-itemsets into pairs and calculate their support.

For this use case, we get 3 item pairs: (Bread, Butter), (Bread, Milk) and (Butter, Milk), and we
calculate their support in the same way as in Step 2.

Candidate 2-Itemsets

Frequent 2-itemsets:

 {Bread, Milk} meets the 50% threshold, but {Butter, Milk} and {Bread, Butter} do not meet the
threshold, so they are omitted.

Step 4: Generate Candidate 3-Itemsets

Combine the frequent 2-itemsets into groups of 3 and calculate their support.

For the triplet, there is only one candidate, {Bread, Butter, Milk}, and we calculate its support.
Candidate 3-Itemsets

Since this does not meet the 50% threshold, there are no frequent 3-itemsets.

Step 5: Generate Association Rules

Now we generate rules from the frequent itemsets and calculate confidence.

Rule 1: Bread → Butter (if a customer buys bread, the customer will also buy butter)

 Support of {Bread, Butter} = 2.

 Support of {Bread} = 4.

 Confidence = 2/4 = 50% (fails the threshold).

Rule 2: Butter → Bread (if a customer buys butter, the customer will also buy bread)

 Support of {Bread, Butter} = 2.

 Support of {Butter} = 3.

 Confidence = 2/3 ≈ 67% (fails the threshold).

(Strictly speaking, Apriori generates rules only from frequent itemsets, so Rules 1 and 2, which
come from the infrequent itemset {Bread, Butter}, are shown here only to illustrate the
confidence calculation.)

Rule 3: If Bread → Milk (if customer buys bread, the customer will buy milk also)

 Support of {Bread, Milk} = 3.

 Support of {Bread} = 4.

 Confidence = 3/4 = 75% (Passes threshold).
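
Putting the steps together, the short sketch below reproduces these counts on the reconstructed
transaction list from Step 1, using the example's thresholds of 50% support and 70% confidence;
it prints the rules Bread → Milk and Milk → Bread, both with 75% confidence.

from itertools import combinations

# Same reconstructed transactions as in Step 1 (an assumed dataset).
transactions = [
    {"Bread", "Milk"}, {"Bread", "Butter"},
    {"Bread", "Butter", "Milk"}, {"Bread", "Milk"}, {"Butter", "Milk"},
]
N = len(transactions)
min_support, min_confidence = 0.5, 0.7

def support(itemset):
    return sum(itemset <= t for t in transactions) / N

# Frequent 1-itemsets, then candidate 2-itemsets built from them (Apriori property).
items = {i for t in transactions for i in t}
freq1 = [frozenset([i]) for i in items if support(frozenset([i])) >= min_support]
cand2 = [a | b for a, b in combinations(freq1, 2)]
freq2 = [c for c in cand2 if support(c) >= min_support]

# Rules from each frequent 2-itemset, kept only if confidence clears the threshold.
for itemset in freq2:
    for antecedent in itemset:
        conf = support(itemset) / support(frozenset([antecedent]))
        if conf >= min_confidence:
            consequent = next(iter(itemset - {antecedent}))
            print(f"{antecedent} -> {consequent}: confidence {conf:.0%}")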

The Apriori Algorithm, as demonstrated in the bread-butter example, is widely used in modern
startups like Zomato, Swiggy, and other food delivery platforms. These companies use it to
perform market basket analysis, which helps them identify customer behaviour patterns and
optimise recommendations.

Applications of Apriori Algorithm

Below are some applications of Apriori algorithm used in today's companies and startups

1. E-commerce: Used to recommend products that are often bought together, like laptop + laptop
bag, increasing sales.
2. Food Delivery Services: Identifies popular combos, such as burger + fries, to offer combo deals
to customers.

3. Streaming Services: Recommends related movies or shows based on what users often watch
together, like action + superhero movies.

4. Financial Services: Analyzes spending habits to suggest personalised offers, such as credit card
deals based on frequent purchases.

5. Travel & Hospitality: Creates travel packages (e.g., flight + hotel) by finding commonly
purchased services together.

6. Health & Fitness: Suggests workout plans or supplements based on users' past activities, like
protein shakes + workouts.

Handling larger datasets in main memory

Handling Large data in Data Science





Large data workflows refer to the process of working with and analyzing large datasets using the
Pandas library in Python. Pandas is a popular library commonly used for data analysis and
manipulation. However, when dealing with large datasets, standard Pandas procedures can
become resource-intensive and inefficient.

In this guide, we'll explore strategies and tools to tackle large datasets effectively, from
optimizing Pandas to leveraging alternative packages.

Optimizing Pandas for Large Datasets

Even though Pandas thrives on in-memory manipulation, we can leverage more performance out
of it for massive datasets:

Selective Column Reading

When dealing with large datasets stored in CSV files, it's prudent to be selective about which
columns you load into memory. By utilizing the usecols parameter in Pandas when reading
CSVs, you can specify exactly which columns you need. This approach avoids the unnecessary
loading of irrelevant data, thereby reducing memory consumption and speeding up the parsing
process.
For example, if you're only interested in a subset of columns such as "name," "age," and
"gender," you can instruct Pandas to only read these columns, rather than loading the entire
dataset into memory.

Engine Selection

The choice of engine when reading data can significantly impact performance, especially with
large datasets. Opting for the pyarrow engine parameter can lead to notable improvements in
loading speed. PyArrow is a cross-language development platform for in-memory analytics, and
utilizing it as the engine for reading data in Pandas can leverage its optimized processing
capabilities. This choice is particularly beneficial when working with large datasets where
efficient loading is crucial for maintaining productivity.

Efficient DataTypes Usage

Efficient management of data types can greatly impact memory usage when working with large
datasets. By specifying appropriate data types, such as category for columns with a limited
number of unique values or int8/16 for integer columns with a small range of values, you can
significantly reduce memory overhead. Conversely, using generic data types like object or
float64 can lead to unnecessary memory consumption, especially when dealing with large
datasets. Therefore, optimizing data types based on the nature of your data can help conserve
memory and improve overall performance.

Chunked Reading

Loading large datasets into memory all at once can be resource-intensive and may lead to
memory errors, particularly on systems with limited RAM. To address this challenge, Pandas
offers the ability to read data in chunks. This allows you to lazily load data in manageable
chunks, processing each chunk iteratively without the need to load the entire dataset into
memory simultaneously.

By applying operations chunk-by-chunk, you can effectively handle large datasets while
minimizing memory usage and optimizing performance. Use the chunksize parameter of
pd.read_csv to obtain an iterator over chunks; within a chunk, DataFrame.iterrows() or
DataFrame.itertuples() can be used when row-by-row processing is genuinely needed, although
vectorized operations are usually faster.
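
A short sketch combining the ideas above: selective columns, explicit dtypes, and chunked
reading. The file name, column names, and chosen dtypes are placeholders, not values from the
text.

import pandas as pd

# Placeholder file and columns; adjust to your own dataset.
usecols = ["name", "age", "gender"]
dtypes = {"age": "int16", "gender": "category"}

total_rows = 0
age_sum = 0
# Read the CSV lazily in 100,000-row chunks instead of loading it all at once.
for chunk in pd.read_csv("large_dataset.csv", usecols=usecols,
                         dtype=dtypes, chunksize=100_000):
    total_rows += len(chunk)
    age_sum += chunk["age"].sum()

print("mean age:", age_sum / total_rows)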

Vectorization

Vectorized operations, which involve applying operations to entire arrays or dataframes at once
using optimized routines, can significantly improve computational efficiency compared to
traditional Python loops. By leveraging vectorized Pandas/NumPy operations, you can perform
complex computations on large datasets more efficiently, taking advantage of underlying
optimizations and parallelization. This approach not only speeds up processing but also enhances
scalability, making it well-suited for handling large datasets with high performance
requirements.
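
For instance, the loop and the vectorized line below compute the same result, with NumPy doing
the work in compiled code; the array contents and sizes are arbitrary.

import numpy as np

prices = np.random.rand(1_000_000)
quantities = np.random.randint(1, 10, size=1_000_000)

# Loop version: one Python-level multiplication per row.
totals_loop = [p * q for p, q in zip(prices, quantities)]

# Vectorized version: a single array multiplication in optimized native code.
totals_vec = prices * quantities

assert np.allclose(totals_loop, totals_vec)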
Copy Avoidance

When performing operations on DataFrame objects in Pandas, it's essential to be mindful of
memory usage, particularly when dealing with large datasets. Modifying data in place through
the .loc[] or .iloc[] indexers, instead of creating intermediate copies, can help minimize
memory overhead.

By avoiding unnecessary duplication of data, you can optimize memory usage and prevent
potential memory errors, especially when working with large datasets that exceed available
memory capacity. This practice is crucial for maintaining efficiency and scalability when
processing large datasets in Python.

Packages for Extreme Large Datasets

When Pandas isn't sufficient, these alternative packages come to the rescue:

Dask

Positioned as a true champion, Dask revolutionizes data handling by distributing DataFrames


across a network of machines. This distributed computing paradigm enables seamless scaling of
Pandas workflows, allowing you to tackle even the most mammoth datasets with ease. By
leveraging parallelism and efficient task scheduling, Dask optimizes resource utilization and
empowers users to perform complex operations on datasets that surpass traditional memory
limits.
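
A minimal illustration of the Pandas-like API Dask exposes is shown below; the file pattern and
column names are placeholders, not taken from the text.

import dask.dataframe as dd

# Lazily builds a task graph over many CSV partitions; nothing is read yet.
df = dd.read_csv("sales-*.csv", dtype={"amount": "float64"})

# Familiar Pandas-style operations, executed in parallel when .compute() is called.
result = df.groupby("region")["amount"].sum().compute()
print(result)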

Vaex

Renowned for its prowess in exploration, Vaex adopts a unique approach to processing colossal
DataFrames. Through the technique of lazy evaluation, Vaex efficiently manages large datasets
by dividing them into manageable segments, processing them on-the-fly as needed. This method
not only conserves memory but also accelerates computation, making Vaex an invaluable tool
for uncovering insights within massive datasets. With its ability to handle data exploration tasks
seamlessly, Vaex facilitates efficient analysis and discovery, even in the face of daunting data
sizes.

Modin

Modin accelerates Pandas operations by automatically distributing computations across multiple


CPU cores or even clusters of machines. It seamlessly integrates with existing Pandas code,
allowing users to scale up their data processing workflows without needing to rewrite their
codebase.

Spark
Apache Spark is a distributed computing framework that provides high-level APIs in Java, Scala,
Python, and R for parallel processing of large datasets. Spark's DataFrame API allows users to
perform data manipulation and analysis tasks at scale, leveraging distributed computing across
clusters of machines. It excels in handling big data scenarios where traditional single-node
processing is not feasible.

Efficient memory management is essential when dealing with large datasets. Techniques like
chunking, lazy evaluation, and data type optimization help in minimizing memory usage and
improving performance.

To delve further, please refer to:

 Handling Large Datasets in Python

 Handling Large Datasets in Pandas

 Working with large CSV files in Python

Conclusion

Handling large datasets in Python demands a tailored approach. While Pandas serves as a
foundation, optimizing its usage and exploring alternatives can unlock superior performance and
scalability. Don't hesitate to venture beyond conventional techniques to conquer the challenges
of large-scale data analysis.

Limited pass algorithm


A limited pass algorithm refers to a type of data processing algorithm that reads its input data
stream a limited number of times, typically one or two, to complete a task. This contrasts with
algorithms that require multiple passes, especially when dealing with very large datasets that
don't fit entirely in memory.
Key Characteristics of Limited Pass Algorithms:

 Reduced Memory Usage: By processing the data in a limited number of passes, limited pass
algorithms minimize the need for large amounts of memory storage.
 Efficiency for Large Datasets: They are often preferred when dealing with datasets that are too
large to fit into main memory.
 Examples: The Apriori algorithm is a classic example of an algorithm that uses multiple passes.
In contrast, algorithms like the Eclat algorithm are designed to minimize the number of passes.
 Single-Pass Algorithms: One-pass algorithms, a specific type of limited pass algorithm, read the
input data stream only once, processing items in order without unbounded buffering.

In essence, a limited pass algorithm is a strategy for efficient data processing, especially when
dealing with large datasets where memory limitations are a concern.
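
As a concrete illustration of a one-pass strategy, the sketch below counts item frequencies while
streaming over a set of baskets exactly once; the input format (one comma-separated basket per
line) and the toy baskets are assumptions.

from collections import Counter

def count_items_one_pass(lines):
    # Single pass: each basket is read once and never buffered.
    counts = Counter()
    for line in lines:
        counts.update(item.strip() for item in line.split(","))
    return counts

baskets = ["bread,milk", "bread,butter", "bread,butter,milk"]
print(count_items_one_pass(baskets))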
Counting frequent itemsets involves identifying sets of items that appear together frequently in a
dataset, based on a minimum support threshold. This is a core process in association rule mining,
used to find relationships between items.
Here's a breakdown of the process:
1. Identifying Frequent Itemsets:

 Minimum Support Threshold: A threshold is set, determining the minimum frequency
(support) an itemset must have to be considered frequent.
 Support Count: The support count of an itemset is the number of transactions or records that
contain that specific itemset.
 Algorithms: Algorithms like Apriori, FP-Growth, and Eclat are used to efficiently find frequent
itemsets.

 Apriori is a candidate generation algorithm that iteratively identifies frequent itemsets.
 FP-Growth avoids candidate generation and uses a tree-based approach for efficiency.
 Eclat uses a depth-first search approach and equivalence classes to reduce the search space.

2. Counting Frequent Itemsets:

 Algorithm Efficiency: Efficient algorithms are crucial for handling large datasets, as finding
all frequent itemsets can be computationally intensive.
 Candidate Generation: Some algorithms (like Apriori) generate potential frequent itemsets as
candidates and then count their support.
 Pruning: Pruning techniques, like the Apriori property, can reduce the number of candidate
itemsets that need to be evaluated.
 Tree-Based Approaches: FP-Growth uses an FP-tree to store and extract patterns, reducing the
need for multiple scans of the data.

3. Beyond Counting:

 Association Rule Mining: Once frequent itemsets are identified, association rules can be
generated, revealing relationships between items.
 Maximal Frequent Itemsets: Identifying maximal frequent itemsets can help reduce the
number of generated itemsets while still capturing all relevant relationships.
 Closed Itemsets: Closed itemsets preserve the support count of their subsets.

Unit 5: Clustering

Clustering techniques
Clustering techniques are a class of unsupervised learning algorithms that group similar data
points into clusters based on their characteristics, according to Analytics Vidhya. These
techniques help uncover hidden patterns and structures within datasets, revealing natural
groupings that might not be immediately apparent. They are used in various fields like image
analysis, bioinformatics, and machine learning.
Here's a more detailed look at some common clustering techniques:
1. Hierarchical Clustering:

 Agglomerative:

Starts with each data point as its own cluster and iteratively merges the closest clusters
until only one cluster remains.

 Divisive:

Starts with all data points in one cluster and iteratively splits it into smaller clusters until
each data point is in its own cluster.

2. Centroid-based Clustering (Partitioning Methods):

 K-means:

Divides data into K clusters based on minimizing the distance of data points to their
nearest cluster centroid.

 K-medians: Similar to K-means, but uses medians instead of means to represent cluster
centers.

3. Density-based Clustering (Model-based methods):

 DBSCAN: Groups data points based on their density in the feature space, forming
clusters from dense regions.
 OPTICS: Similar to DBSCAN, but orders points so that clusters of varying density can be
extracted.

4. Distribution-based Clustering:

 Gaussian Mixture Models (GMM): Models clusters using a mixture of probability


distributions, allowing data points to belong to multiple clusters with varying
probabilities.

5. Other Techniques:

 Fuzzy Clustering: Allows data points to belong to multiple clusters with varying degrees
of membership.
 Spectral Clustering: Uses the spectral properties of a similarity graph to find clusters.
 Grid-based Clustering: Divides the data space into a grid and uses the grid cells as
clusters.
 Brown Clustering: A hierarchical clustering approach that uses distributional information
to construct clusters, often applied in natural language processing.

Applications of Clustering:

 Customer Segmentation: Grouping customers based on their purchasing behavior for
targeted marketing.
 Anomaly Detection: Identifying unusual data points that don't belong to any cluster.
 Image Segmentation: Dividing an image into different regions based on pixel similarity.
 Bioinformatics: Clustering genes or proteins based on their expression patterns.

Hierarchical clustering
Hierarchical clustering is a method of clustering that creates a hierarchy of clusters, starting with
individual data points and progressively merging or splitting them based on similarity. This
creates a tree-like structure, where each node represents a cluster at a different level of
granularity. Unlike methods like k-means that require a pre-specified number of clusters,
hierarchical clustering allows for the exploration of different levels of clustering.
Here's a more detailed breakdown:
Key Concepts:

 Hierarchy: Hierarchical clustering builds a hierarchy of clusters, allowing for multiple levels
of grouping.
 Similarity: Data points are grouped based on their similarity, creating clusters of related
objects.
 Dendrogram: The hierarchical structure is typically visualized using a dendrogram, a tree
diagram that shows the relationships between clusters.
 Agglomerative vs. Divisive: There are two main types of hierarchical clustering:

 Agglomerative (Bottom-Up): Starts with each data point as its own cluster and merges
similar clusters until a single cluster remains.
 Divisive (Top-Down): Starts with all data points in a single cluster and repeatedly splits it
into smaller clusters.

 No Need for a Pre-defined Number of Clusters: Unlike some other clustering algorithms,
hierarchical clustering doesn't require you to specify the number of clusters beforehand. You
can choose the level of clustering based on the dendrogram.
 Unsupervised Learning: Hierarchical clustering is an unsupervised learning technique,
meaning it doesn't require labeled data for training.
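
A brief sketch of agglomerative clustering with SciPy follows, producing the linkage structure
that a dendrogram is drawn from and then cutting it into flat clusters. The random 2-D points and
the choice of 'ward' linkage are assumptions made for illustration.

import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0, 1, (20, 2)), rng.normal(5, 1, (20, 2))])

# Agglomerative (bottom-up) merging; Z encodes the full merge hierarchy.
Z = linkage(X, method="ward")

# Cut the tree into 2 flat clusters; any level could be chosen instead.
labels = fcluster(Z, t=2, criterion="maxclust")
print(labels)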

Applications:

 Data Mining: Hierarchical clustering can be used to identify patterns and relationships in
data.

 Image Analysis: Grouping similar pixels or features in images.
 Social Sciences: Identifying groups of individuals with similar characteristics or behaviors.
 Biology: Classifying populations or species.
 Customer Segmentation: Identifying different customer groups based on their purchasing
behavior.

Algorithms: K-means

K means Clustering – Introduction





K-Means Clustering is an unsupervised machine learning algorithm which groups an unlabeled
dataset into different clusters. It is used to organize data into groups based on their similarity.

Understanding K-means Clustering

For example, an online store can use K-Means to group customers based on purchase frequency
and spending, creating segments like Budget Shoppers, Frequent Buyers and Big Spenders for
personalised marketing.

The algorithm works by first randomly picking some central points called centroids and each
data point is then assigned to the closest centroid forming a cluster. After all the points are
assigned to a cluster the centroids are updated by finding the average position of the points in
each cluster. This process repeats until the centroids stop changing forming clusters. The goal of
clustering is to divide the data points into clusters so that similar data points belong to same
group.

How k-means clustering works?

We are given a data set of items with certain features and values for these features like a vector.
The task is to categorize those items into groups. To achieve this we will use the K-means
algorithm. 'K' in the name of the algorithm represents the number of groups/clusters we want to
classify our items into.

The algorithm will categorize the items into k groups or clusters of similarity. To calculate that
similarity we will use the Euclidean distance as a measurement. The algorithm works as follows:

1. First we randomly initialize k points called means or cluster centroids.

2. We categorize each item to its closest mean and we update the mean's coordinates, which are
the averages of the items categorized in that cluster so far.

3. We repeat the process for a given number of iterations and at the end, we have our clusters.

The "points" mentioned above are called means because they are the mean values of the items
categorized in them. To initialize these means, we have a lot of options. An intuitive method is to
initialize the means at random items in the data set. Another method is to initialize the means at
random values between the boundaries of the data set. For example for a feature x the items have
values in [0,3] we will initialize the means with values for x at [0,3].
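
The three steps above can be written directly in NumPy. This sketch assumes random
initialization from the data points, Euclidean distance, and a fixed number of iterations; the toy
data is generated only for illustration.

import numpy as np

def kmeans(X, k, iters=50, seed=0):
    rng = np.random.default_rng(seed)
    centroids = X[rng.choice(len(X), size=k, replace=False)]   # step 1: initialize
    for _ in range(iters):
        # Step 2: assign each point to its nearest centroid (Euclidean distance).
        dists = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # Step 3: move each centroid to the mean of its assigned points.
        for j in range(k):
            if np.any(labels == j):
                centroids[j] = X[labels == j].mean(axis=0)
    return centroids, labels

X = np.vstack([np.random.randn(50, 2), np.random.randn(50, 2) + 5])
centroids, labels = kmeans(X, k=2)
print(centroids)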

Implementation of K-Means Clustering in Python

We will use blobs datasets and show how clusters are made.

Step 1: Importing the necessary libraries


We are importing Numpy, Matplotlib and scikit learn.

import numpy as np
import matplotlib.pyplot as plt
from sklearn.datasets import make_blobs

# Generate 500 two-dimensional points around 3 centers.
X, y = make_blobs(n_samples=500, n_features=2, centers=3, random_state=23)

# Plot the raw, unlabeled data.
fig = plt.figure(0)
plt.grid(True)
plt.scatter(X[:, 0], X[:, 1])
plt.show()
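
The snippet above only generates and plots the blob data. A natural continuation, fitting
scikit-learn's KMeans to the same X and colouring the points by cluster, might look like the
following; k = 3 is an assumption chosen to match the three generated centers, and X and plt are
reused from the snippet above.

from sklearn.cluster import KMeans

# Fit k-means with k = 3, matching the number of generated blob centers.
kmeans = KMeans(n_clusters=3, random_state=23, n_init=10)
labels = kmeans.fit_predict(X)

plt.grid(True)
plt.scatter(X[:, 0], X[:, 1], c=labels)
plt.scatter(kmeans.cluster_centers_[:, 0], kmeans.cluster_centers_[:, 1],
            marker="*", s=200, color="k")
plt.show()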

Basic Understanding of CURE Algorithm





CURE(Clustering Using Representatives)

 It is a hierarchical clustering technique that adopts a middle ground between the centroid-based
and the all-point extremes. Hierarchical clustering starts with single-point clusters and keeps
merging clusters until the desired number of clusters is formed.
 It is used for identifying both spherical and non-spherical clusters.
 It is useful for discovering groups and identifying interesting distributions in the underlying data.
 Instead of using one point centroid, as in most of data mining algorithms, CURE uses a set of
well-defined representative points, for efficiently handling the clusters and eliminating the
outliers.
Representation of Clusters and Outliers

Six steps in CURE algorithm:

CURE Architecture

 Idea: A random sample, say 's', is drawn from the given data. This random sample is divided
into, say, 'p' partitions of size s/p each. Each partition is partially clustered into, say, s/(pq)
clusters. Outliers are discarded/eliminated from these partially clustered partitions. The partially
clustered partitions are then clustered again, and finally the remaining data on disk is labeled
with the resulting clusters.
Representation of partitioning and clustering

 Procedure :
1. Select target sample number 'gfg'.
2. Choose 'gfg' well scattered points in a cluster.
3. These scattered points are shrunk towards centroid.
4. These points are used as representatives of clusters and used in 'Dmin' cluster merging
approach. In Dmin(distance minimum) cluster merging approach, the minimum distance
from the scattered point inside the sample 'gfg' and the points outside 'gfg sample, is
calculated. The point having the least distance to the scattered point inside the sample,
when compared to other points, is considered and merged into the sample.
5. After every such merging, new sample points will be selected to represent the new
cluster.
6. Cluster merging stops when the target number of clusters, say 'k', is reached.
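
Steps 2 and 3 of the procedure (choosing well-scattered representatives and shrinking them
toward the centroid) can be illustrated in a few lines. The shrink factor alpha, the number of
representatives, and the toy cluster are assumptions for illustration only.

import numpy as np

rng = np.random.default_rng(1)
cluster = rng.normal(loc=[2.0, 2.0], scale=1.0, size=(100, 2))
centroid = cluster.mean(axis=0)

# Pick a few well-scattered points: greedily take the point farthest
# from the representatives chosen so far.
reps = [cluster[np.argmax(np.linalg.norm(cluster - centroid, axis=1))]]
for _ in range(3):
    dists = np.min([np.linalg.norm(cluster - r, axis=1) for r in reps], axis=0)
    reps.append(cluster[np.argmax(dists)])
reps = np.array(reps)

# Shrink the representatives toward the centroid (alpha is illustrative).
alpha = 0.3
shrunk_reps = reps + alpha * (centroid - reps)
print(shrunk_reps)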
Clustering in non-Euclidean spaces

Clustering in non-Euclidean spaces refers to grouping data points into clusters based on a non-
Euclidean distance measure, which can be used to model data with complex structures or
relationships. This approach is particularly useful when traditional Euclidean distances fail to
capture the inherent structure of the data.
Elaboration:

 Non-Euclidean Spaces:

Unlike Euclidean spaces where distances are measured using the Pythagorean theorem,
non-Euclidean spaces can have curved geometries or distances based on other criteria like
shortest paths or functional relationships.

 Applications: Clustering in non-Euclidean spaces is used in various fields, including:

 Data Analysis: Analyzing data with complex relationships, like social networks or
biological systems.
 Machine Learning: Developing algorithms that can handle non-Euclidean data structures.
 Image Processing: Clustering pixels based on spatial relationships or visual features.


 Alternative Clustering Algorithms:

 K-means: The classic K-means algorithm can be adapted to non-Euclidean spaces by
using different distance metrics and potentially modifying the centroid update process.
 Hierarchical Clustering: Hierarchical clustering methods can also be adapted to use non-
Euclidean distances and create dendrograms that reflect the data's structure.


 Challenges:

 Finding Appropriate Distance Metrics: Choosing the right non-Euclidean distance
measure is crucial and can be challenging.
 Computational Complexity: Some non-Euclidean distance metrics and clustering
algorithms can be computationally intensive.


 Examples of Non-Euclidean Spaces:

 Graph Networks: Nodes and edges can be used to represent data, and the distance
between nodes can be the length of the shortest path.
 Manifold Learning: Non-linear structures in data can be approximated using manifolds,
and clustering can be performed on the manifold space.
 Semantic Spaces: Words or concepts can be clustered based on their semantic
similarity, which is a non-Euclidean distance measure.
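
As one concrete possibility, the sketch below clusters the nodes of a small graph using
shortest-path distances instead of Euclidean ones, feeding the precomputed distance matrix to
hierarchical clustering. The small weighted graph and the choice of average linkage are
assumptions made for illustration.

import numpy as np
from scipy.sparse.csgraph import shortest_path
from scipy.spatial.distance import squareform
from scipy.cluster.hierarchy import linkage, fcluster

# Small undirected graph: two triangles joined by one long edge (weights = lengths).
adj = np.array([
    [0, 1, 1, 0, 0, 0],
    [1, 0, 1, 0, 0, 0],
    [1, 1, 0, 5, 0, 0],
    [0, 0, 5, 0, 1, 1],
    [0, 0, 0, 1, 0, 1],
    [0, 0, 0, 1, 1, 0],
], dtype=float)

# Non-Euclidean distance: length of the shortest path between nodes.
D = shortest_path(adj, directed=False)

# Hierarchical clustering on the precomputed distance matrix.
Z = linkage(squareform(D), method="average")
print(fcluster(Z, t=2, criterion="maxclust"))   # expected: the two triangles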

Streams in parallel

Streams in parallel refer to processing data using multiple threads simultaneously, leveraging the
parallel capabilities of modern CPUs. This contrasts with sequential streams, where data is
processed one element at a time. Parallel streams can significantly improve performance for
computationally intensive tasks by distributing the workload across multiple cores.
Key Concepts:

 Parallel Streams:

A parallel stream processes data elements concurrently using multiple threads, typically
determined by the number of available CPU cores.

 Sequential Streams: A sequential stream processes data elements one after another, in the
order they appear in the source.
 Performance: Parallel streams can offer significant performance gains for tasks that benefit
from parallelization, such as filtering, mapping, and aggregating large datasets.
 Order: Parallel streams do not guarantee the order of elements in the output, as elements may
be processed and collected by different threads.
 Fork/Join Framework: Parallel streams in Java are built on top of the Fork/Join Framework.
This framework decomposes the stream's data into smaller chunks, which are then processed
concurrently by multiple threads.

When to use parallel streams:

 Large datasets:

Parallel streams are most effective when dealing with large datasets that can be
efficiently partitioned and processed by multiple threads.

 Computationally intensive tasks: Parallel streams can significantly reduce the execution time
of tasks that involve complex calculations or operations.
 When order is not crucial: If the order of elements in the output is not important, parallel
streams can be a suitable choice for optimizing performance.

When to use sequential streams:

 Small datasets:

For small datasets, the overhead of parallelization may outweigh the performance
benefits.

 Tasks requiring specific order: If the order of elements in the output is critical, sequential
streams should be used.
 Tasks with side effects: If the operations in the stream have side effects (e.g., modifying
shared mutable state), sequential streams are generally preferred to avoid concurrency issues.

import java.util.Arrays;
import java.util.List;
import java.util.stream.Collectors;

public class ParallelStreamsExample {
    public static void main(String[] args) {
        List<Integer> numbers = Arrays.asList(1, 2, 3, 4, 5, 6, 7, 8, 9, 10);

        // Sequential stream: elements are filtered one after another on one thread.
        List<Integer> evenNumbersSequential = numbers.stream()
                .filter(n -> n % 2 == 0)
                .collect(Collectors.toList());

        // Parallel stream: the same pipeline, executed across multiple threads.
        List<Integer> evenNumbersParallel = numbers.parallelStream()
                .filter(n -> n % 2 == 0)
                .collect(Collectors.toList());

        System.out.println("Sequential: " + evenNumbersSequential);
        System.out.println("Parallel: " + evenNumbersParallel);
    }
}
