Data Mining Series 2 Important Topics
https://rtpnotes.vercel.app
1. Web usage mining
2. Web structure mining
3. Web content mining
4. TF-IDF
Term Frequency (TF)
Types of Term Frequency:
Inverse Document Frequency (IDF)
5. Text retrieval methods
1. Document Selection Methods
Examples of Boolean Queries:
2. Document Ranking Methods
How Ranking Works?
1. Web usage mining
Web usage mining discovers patterns in how users interact with websites:
Tracks users' browsing history (which pages they visit and in what order).
Finds patterns in user behavior (like frequently visited pages, search trends, and
associations).
Helps predict what users might be searching for on the Internet.
Example:
Imagine a shopping website tracking what products users view. If many users check mobile
phones → reviews → price comparison, the site can recommend reviews whenever
someone views a mobile phone.
2. Web structure mining
Web structure mining analyzes the hyperlink structure between web pages.
Why is it useful?
Helps in search engine optimization (SEO) (Google ranks pages based on links).
Improves website navigation by organizing links efficiently.
Assists in detecting spam websites (unnatural link networks).
4. TF-IDF
Term Frequency (TF)
Types of Term Frequency:
1. Binary TF:
If a word appears in a document, TF = 1; otherwise, TF = 0.
Example: If the word "data" appears in a document, TF(data) = 1.
2. Raw Count TF:
Counts how many times a word appears.
Example: If "data" appears 3 times in a document, TF(data) = 3.
3. Relative Term Frequency:
Adjusts for document length by dividing the raw count by the total number of words in the document.
Example: If "data" appears 3 times in a 100-word document, TF(data) = 3 / 100 = 0.03.
Inverse Document Frequency (IDF)
IDF measures how important a word is by checking how common it is across multiple documents. A word that appears in many documents gets a lower IDF:
IDF(t) = log(N / n_t), where N is the total number of documents and n_t is the number of documents containing the term t.
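A minimal Python sketch of computing relative TF × IDF (the sample documents and the tf_idf helper are invented for illustration):

```python
import math
from collections import Counter

def tf_idf(docs):
    """Compute relative-TF x IDF scores for every term in every document."""
    n = len(docs)
    tokenized = [doc.lower().split() for doc in docs]
    # Document frequency: in how many documents does each term appear?
    df = Counter(term for tokens in tokenized for term in set(tokens))
    scores = []
    for tokens in tokenized:
        counts = Counter(tokens)
        scores.append({
            term: (count / len(tokens)) * math.log(n / df[term])
            for term, count in counts.items()
        })
    return scores

docs = ["data mining finds patterns in data",
        "web mining analyzes web data",
        "cooking recipes and kitchen tips"]
print(tf_idf(docs)[0]["data"])  # high TF, but "data" is in 2 of 3 docs, so its IDF is modest
```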
5. Text retrieval methods
1. Document Selection Methods
These methods select documents that exactly match the query conditions.
They use the Boolean retrieval model, where documents are represented by sets of keywords.
Users provide a Boolean expression (AND, OR, NOT) to filter documents.
✅ "car AND repair shops" → Retrieves documents containing both "car" and "repair shops".
✅ "tea OR coffee" → Retrieves documents containing either "tea" or "coffee".
✅ "database systems BUT NOT Oracle" → Retrieves documents about database systems
but excludes those mentioning "Oracle".
🔹 Limitations:
Only returns documents that fully satisfy the query.
Doesn't rank results by relevance.
Works well for precise searches but not great for exploratory searches.
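A tiny sketch of Boolean document selection in Python (the sample documents and the matches helper are invented for illustration):

```python
docs = {
    1: "car repair shops near the city center",
    2: "car dealerships and new car prices",
    3: "database systems from Oracle",
    4: "open source database systems",
}

def matches(text, include, exclude=()):
    """True if the text contains all 'include' terms and none of the 'exclude' terms."""
    words = set(text.lower().split())
    return all(t in words for t in include) and not any(t in words for t in exclude)

# "car AND repair" -> only documents containing both terms
print([i for i, t in docs.items() if matches(t, include=["car", "repair"])])  # [1]
# "database BUT NOT oracle" -> database documents that never mention Oracle
print([i for i, t in docs.items() if matches(t, include=["database"], exclude=["oracle"])])  # [4]
```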
2. Document Ranking Methods
These methods rank documents by how relevant they are to the query, instead of requiring an exact match.
How Ranking Works?
Each document gets a relevance score for the query (for example, based on TF-IDF weights for the query terms), and results are returned in decreasing order of score.
🔹 Why is it useful?
Helps in search engines, recommendation systems, and large databases.
More user-friendly for ordinary users compared to strict Boolean searches.
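A minimal sketch of ranking, scoring documents by how often the query terms occur (the sample documents are invented; a real system would weight terms with TF-IDF as described above):

```python
from collections import Counter

docs = ["car repair shops in town",
        "cheap car prices, car reviews, car insurance",
        "gardening tips for spring"]

def score(doc, query):
    """Relevance = how often the query terms occur in the document."""
    counts = Counter(doc.lower().split())
    return sum(counts[t] for t in query.lower().split())

query = "car repair"
for d in sorted(docs, key=lambda d: score(d, query), reverse=True):
    print(score(d, query), d)
# The second document mentions "car" three times, so it outranks the others;
# real systems use TF-IDF weighting to correct exactly this kind of bias.
```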
6. Text Mining: Text Data Analysis and Information Retrieval
Text Mining
Text mining is an interdisciplinary field that applies techniques from data mining, machine
learning, statistics, and computational linguistics to extract meaningful information from textual
data. It is commonly used in various domains such as digital libraries, web pages, emails, and
news articles.
In text mining, various basic measures are used to analyze and retrieve information from text-based datasets. The basic measures used to evaluate text retrieval include:
1. Precision – The percentage of retrieved documents that are actually relevant to the query.
2. Recall – The percentage of relevant documents that were successfully retrieved from the dataset.
3. F-Score – A metric that balances precision and recall to provide a single evaluation score.
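In formula form (Relevant = set of relevant documents, Retrieved = set of retrieved documents):
Precision = |Relevant ∩ Retrieved| / |Retrieved|
Recall = |Relevant ∩ Retrieved| / |Relevant|
F-Score = 2 × Precision × Recall / (Precision + Recall)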
The TF-IDF measure is commonly used in text retrieval and ranking systems:
Term Frequency (TF): Measures how frequently a term appears in a document.
Inverse Document Frequency (IDF): Measures the importance of a term. A term that
appears in many documents has a lower IDF value.
7. Apriori algorithm
The Apriori Algorithm is used in data mining to find frequent itemsets in a large dataset of
transactions. It is commonly used for market basket analysis, where we find items that are
often bought together.
Imagine you own a supermarket, and you have 9 transactions (customer purchases). Your
goal is to find which products are frequently bought together.
1. Count how many times each item appears across all transactions.
2. Keep only the itemsets that meet the minimum support count (minimum support count = 2).
Frequent 1-itemsets:

| Item | Count |
| --- | --- |
| Milk | 6 |
| Bread | 6 |
| Butter | 4 |

Frequent 2-itemsets:

| Itemset | Count |
| --- | --- |
| (Milk, Bread) | 4 |
| (Milk, Butter) | 3 |
| (Bread, Butter) | 3 |

Frequent 3-itemsets:

| Itemset | Count |
| --- | --- |
| (Milk, Bread, Butter) | 2 |
Now that we have found frequent itemsets, we can generate association rules to understand
how items are related.
Antecedent (Left-hand side - LHS): The item(s) that appear first (e.g., Milk, Bread).
Consequent (Right-hand side - RHS): The item(s) that might appear next (e.g., Butter).
Rule 1: (Milk, Bread) ⇒ Butter
We already know:
Support(Milk, Bread) = 4 and Support(Milk, Bread, Butter) = 2.
Confidence = 2 / 4 = 50%
This means: if a customer buys Milk and Bread together, there is a 50% chance they will also buy Butter.
Rule 2: Milk ⇒ Bread
We already know:
Support(Milk) = 6 and Support(Milk, Bread) = 4.
Confidence = 4 / 6 = 66.7%
This means: if a customer buys Milk, there is a 66.7% chance they will also buy Bread.
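A compact Python sketch of the Apriori loop (the transaction list below is invented for illustration, not the 9-transaction example above):

```python
from itertools import combinations
from collections import Counter

def apriori(transactions, min_support=2):
    """Return all itemsets whose support count is at least min_support."""
    frequent = {}
    # Level 1: candidate single items that meet the minimum support
    counts = Counter(item for t in transactions for item in t)
    current = {frozenset([i]) for i, c in counts.items() if c >= min_support}
    k = 1
    while current:
        # Count support of the current candidates with one pass over the data
        support = Counter()
        for t in transactions:
            for cand in current:
                if cand <= t:
                    support[cand] += 1
        level = {c: s for c, s in support.items() if s >= min_support}
        frequent.update(level)
        # Join step: build (k+1)-item candidates whose k-item subsets are all frequent
        items = set().union(*level) if level else set()
        k += 1
        current = {frozenset(c) for c in combinations(sorted(items), k)
                   if all(frozenset(s) in level for s in combinations(c, k - 1))}
    return frequent

# Hypothetical transactions
transactions = [frozenset(t) for t in (
    {"Milk", "Bread"}, {"Milk", "Bread", "Butter"}, {"Milk", "Butter"},
    {"Bread", "Butter"}, {"Milk", "Bread"}, {"Milk"},
)]
for itemset, count in sorted(apriori(transactions).items(), key=lambda kv: -kv[1]):
    print(set(itemset), count)
```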
8. FP-Growth
Apriori scans the database multiple times, which makes it slow for large datasets.
FP-Growth scans the database only twice and stores the data in a compact tree format,
making it much faster.
No need to generate large candidate itemsets, reducing computational effort.
Just like in Apriori, we first scan the database to find the frequency (support count) of individual items.
Only keep items that meet the minimum support count.
Sort items in descending order of frequency.
Milk: 6 times
Bread: 6 times
Butter: 4 times
The sorted transactions are then inserted into an FP-tree:
null
├── Milk (6)
│ ├── Bread (4)
│ │ ├── Butter (2)
│ ├── Butter (2)
├── Bread (2)
│ ├── Butter (2)
├── Butter (1)
Start from the least frequent item (Butter) and move upward.
Construct conditional pattern bases (paths leading to an item).
Generate frequent patterns from these paths.
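For example, reading the tree above from the least frequent item: every prefix path that leads to Butter forms part of Butter's conditional pattern base, e.g. (Milk, Bread : 2) from the leftmost branch and (Milk : 2), (Bread : 2) from the other branches. Items that are frequent within this base are combined with Butter to emit patterns such as (Milk, Butter), and the same procedure is then repeated for Bread and Milk.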
Summary
FP-Growth finds the same frequent itemsets as Apriori, but with only two database scans and no candidate generation, by compressing the transactions into an FP-tree and mining it recursively.
9. Dynamic Itemset Counting (DIC)
Imagine you own a supermarket and you want to find which products are frequently bought
together (like "Milk & Bread").
With Apriori, new candidate itemsets can only start being counted after a full pass over the database. It would be better if we could add new item combinations while scanning, instead of waiting for a full database scan.
DIC solves this problem by allowing new candidate itemsets to be added at checkpoints during the scan, instead of waiting for a full pass.
Step-by-Step Explanation
Suppose a bookstore has 4 transactions, several of which contain the books "Math" and "Science" together.
Apriori waits until it has scanned all 4 transactions before checking which book combinations are common.
DIC starts counting "Math & Science" as a candidate earlier, maybe after just 2 transactions, making it much faster.
10. Partitioning algorithm
The Partitioning Algorithm finds frequent itemsets by:
1. Dividing the database into smaller partitions and processing them separately.
2. Finding frequent itemsets locally in each partition.
3. Merging the results to find global frequent itemsets in just two database scans.
✅ Advantage: Faster than Apriori because it only scans the database twice.
❌ Disadvantage: Requires extra memory for partition management.
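A rough Python sketch of the two-scan idea, simplified to 1- and 2-itemsets (the partition count, helper names, and transactions are invented for illustration):

```python
from collections import Counter
from itertools import combinations

def local_frequent(partition, min_support):
    """Phase 1 helper: frequent 1- and 2-itemsets within a single partition."""
    counts = Counter()
    for t in partition:
        for item in t:
            counts[frozenset([item])] += 1
        for pair in combinations(sorted(t), 2):
            counts[frozenset(pair)] += 1
    return {s for s, c in counts.items() if c >= min_support}

def partitioned_frequent(transactions, n_parts=2, min_support=2):
    size = (len(transactions) + n_parts - 1) // n_parts
    parts = [transactions[i:i + size] for i in range(0, len(transactions), size)]
    # Phase 1 (scan 1): a globally frequent itemset must be locally frequent
    # in at least one partition, so the union of local results is a safe candidate set.
    local_min = max(1, min_support // n_parts)
    candidates = set().union(*(local_frequent(p, local_min) for p in parts))
    # Phase 2 (scan 2): count the surviving candidates over the whole database
    support = Counter()
    for t in transactions:
        for cand in candidates:
            if cand <= frozenset(t):
                support[cand] += 1
    return {c: s for c, s in support.items() if s >= min_support}

transactions = [{"Milk", "Bread"}, {"Milk", "Butter"},
                {"Milk", "Bread", "Butter"}, {"Bread", "Butter"}]
print(partitioned_frequent(transactions))
```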
3. Efficiency Comparison
✅ Partitioning Algorithm is more efficient than Apriori because it reduces database scans
and processes smaller partitions instead of the full dataset.
Apriori is simple but slow because it scans the database multiple times.
Partitioning Algorithm is faster and more efficient because it scans the database only
twice.
11. DBSCAN
DBSCAN (Density-Based Spatial Clustering of Applications with Noise) is a smart way to
group similar data points together without needing to know how many clusters there are
beforehand! 🎯
Imagine you are looking at a map of a city and trying to group areas based on crowd density:
1. A busy shopping mall → Many people are close together → Forms a cluster.
2. A small shop in the countryside → Very few people around → Considered noise
(outlier).
3. A market with multiple stalls → Another crowded area → Forms another cluster.
ε (epsilon): The radius of a small circle around each point (like a small area on the map).
MinPts (minimum points): The minimum number of points that must fall inside that circle to form a dense region (a cluster).
Step-by-Step Explanation 📌
1️⃣ Start with a Random Point
If there are enough nearby points (at least MinPts within ε), this becomes a core point and a new cluster starts.
If not enough points are nearby, the point is marked as noise (outlier).
2️⃣ Expand the Cluster
Every point within ε of a core point joins its cluster; neighbors that are core points themselves bring in their own neighbors too.
3️⃣ Repeat
Pick the next unvisited point and repeat until every point belongs to a cluster or is marked as noise.
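A from-scratch Python sketch of these steps (the point coordinates and helper names are invented; real projects would typically use sklearn.cluster.DBSCAN):

```python
import math

def region_query(points, i, eps):
    """Indices of all points within eps of points[i] (its epsilon-neighborhood)."""
    return [j for j, q in enumerate(points) if math.dist(points[i], q) <= eps]

def dbscan(points, eps=1.0, min_pts=3):
    """Return a label per point: a cluster id (0, 1, ...) or -1 for noise."""
    labels = [None] * len(points)
    cluster = -1
    for i in range(len(points)):
        if labels[i] is not None:
            continue
        neighbors = region_query(points, i, eps)
        if len(neighbors) < min_pts:
            labels[i] = -1           # not a core point: mark as noise (for now)
            continue
        cluster += 1                 # i is a core point: start a new cluster
        labels[i] = cluster
        queue = list(neighbors)
        while queue:
            j = queue.pop()
            if labels[j] == -1:
                labels[j] = cluster  # noise reachable from a core point becomes a border point
            if labels[j] is not None:
                continue
            labels[j] = cluster
            if len(region_query(points, j, eps)) >= min_pts:
                queue.extend(region_query(points, j, eps))  # j is also core: expand further
    return labels

# Two dense blobs and one far-away outlier (hypothetical 2D points)
points = [(0, 0), (0.5, 0), (0, 0.5), (5, 5), (5.5, 5), (5, 5.5), (20, 20)]
print(dbscan(points, eps=1.0, min_pts=3))  # e.g. [0, 0, 0, 1, 1, 1, -1]
```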
12. Measures: Confidence, Accuracy, Precision, and Recall
Confidence (Used in Association Rule Mining)
Confidence tells us how often a rule is correct when a certain item is present.
Confidence(A ⇒ B) = Support(A, B) / Support(A)
🔹 Example:
"Milk & Bread" appear together in 20 transactions.
"Milk" appears in 40 transactions.
Confidence(Milk ⇒ Bread) = 20 / 40 = 50%
💡 Meaning: If a customer buys Milk, there is a 50% chance they will also buy Bread.
Accuracy (Used in Classification & Prediction Models)
Accuracy measures how many predictions were correct out of all predictions.
Accuracy = (Correct predictions) / (Total predictions)
Precision (Used in Classification & Information Retrieval)
Precision measures how many of the predicted positives are actually positive.
🔹 Example:
A medical test identifies 10 people as having a disease; 8 of them actually have it, and 2 are false alarms.
Precision = 8 / (8 + 2) = 8 / 10 = 80%
💡 Meaning: When the test says a person has the disease, it is 80% correct.
Recall (Sensitivity) (Used in Classification & Information Retrieval)
Recall measures how many of the actual positives were correctly identified.
🔹 Example (continued):
In reality, 12 people have the disease, and the test correctly found 8 of them.
Recall = 8 / (8 + 4) = 8 / 12 = 66.7%
💡 Meaning: The test catches 66.7% of the people who actually have the disease.
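A few lines of Python reproducing the arithmetic above (the counts are the ones from the example):

```python
tp, fp, fn = 8, 2, 4   # true positives, false alarms, missed cases from the example

precision = tp / (tp + fp)   # 8 / 10 = 0.8
recall = tp / (tp + fn)      # 8 / 12 ≈ 0.667
f_score = 2 * precision * recall / (precision + recall)

print(f"Precision={precision:.1%}  Recall={recall:.1%}  F-Score={f_score:.1%}")
```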