Data Mining Module 5 Important Topics PYQs
For more notes visit
https://rtpnotes.vercel.app
Data-Mining-Module-5-Important-Topics-PYQs
1. Describe any two text retrieval indexing techniques.
What is Text Retrieval?
1. Inverted Index (Like a word-to-document dictionary)
Advantages:
Drawbacks:
2. Signature Files (Like a digital fingerprint for each doc)
Why fetch and check?
Advantages:
Drawbacks:
2. Compare and contrast the focused crawling and regular crawling techniques.
Introduction to Crawlers
Regular Crawlers (a.k.a. General or Periodic Crawlers)
What they do:
Use:
Types:
Focused Crawlers (Topic-Specific Crawlers)
What they do:
Example:
How they work:
3. Describe the following activities involved in web usage mining i) Pre-processing activity ii) Pattern analysis
i) Pre-processing Activity
Key Steps in Pre-processing:
ii) Pattern Analysis
Key Activities in Pattern Analysis:
4. Differentiate between web content mining and web structure mining.
5. Compare web structure mining and web usage mining.
6. Explain HITS algorithm with an example.
What are Hubs and Authorities?
How does HITS work?
Example:
7. Describe different Text retrieval methods. Explain the relationship between text mining, information retrieval and information extraction.
What is Text Mining?
What is Information Retrieval (IR)?
What is Information Extraction (IE)?
Text retrieval methods
1. Document Selection Methods
2. Document Ranking Methods
8. Explain how web structure mining is different from web usage mining and web content mining? Write a CLEVER algorithm for web structure mining.
1. Web Content Mining
2. Web Structure Mining
3. Web Usage Mining
CLEVER Algorithm
What are Authorities and Hubs?
Goal of CLEVER:
Basic Idea (How it Works):
Algorithm Steps
9. Term frequency matrix given in the table shows the frequency of terms per document
Step 1: What is TF-IDF?
Step 2: Find TF (Term Frequency of T4 in D3)
Step 3: Find how common T4 is in all documents (IDF part)
Step 1: Term Frequency (TF)
Step 2: Inverse Document Frequency (IDF)
Step 3: Calculate TF-IDF
Step 4: Final answer
10. List and explain the different data structures used for web usage mining?
What is Web Usage Mining?
Data Structures Used in Web Usage Mining
1. Trie (Prefix Tree)
What is a Trie?
How It Helps in Web Usage Mining:
Example:
Problem with Standard Trie:
2. Compressed Trie / Suffix Tree
What is a Suffix Tree?
Why Use It?
11. Write any three applications of web usage mining and explain
1. Personalization
How Web Usage Mining helps:
Real-life Example:
2. Improving Website Design
How Web Usage Mining helps:
Example:
4. Text Classification
Examples:
5. Information Extraction (IE)
1. Describe any two text retrieval indexing techniques.

What is Text Retrieval?
Text retrieval means finding the documents in a large collection that are relevant to a user's query. To answer a query quickly, you need a system that can find documents fast, without opening each one manually.
There are many indexing techniques, but the two most popular are:
1. Inverted Index (Like a word-to-document dictionary)
Imagine this: you have 3 documents. Instead of scanning each document for every query word, you build an index that maps each word to the list of documents containing it. When a query comes in, you simply look the word up and instantly get the matching documents.
Advantages:
Fast searching
Simple to implement
Used in real search engines (like Google)
Drawbacks:
The posting lists can become very long, so the index needs a lot of storage.
It does not handle synonymy (different words, same meaning) or polysemy (one word, many meanings) well.
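A minimal sketch of building and querying an inverted index (the three documents below are made up for illustration):

```python
# Map every word to the set of document IDs that contain it.
docs = {
    1: "data mining finds patterns in data",
    2: "web mining analyzes web pages",
    3: "text mining reads text documents",
}

index = {}
for doc_id, text in docs.items():
    for word in text.split():
        index.setdefault(word, set()).add(doc_id)

# A query is now a dictionary lookup instead of a scan of every document.
print(index["mining"])  # {1, 2, 3}
print(index["web"])     # {2}
```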
2. Signature Files (Like a digital fingerprint for each doc)
Each document gets a signature: a fixed-width string of bits.
Each word is converted into a set of bits using a method called hashing.
If a word is in the document, some bits are turned ON (1).
The result is a compact signature for that document.
Why fetch and check?
Because two different words might activate the same bits (due to many-to-one hashing), some documents may look like a match even if they aren't. This is called a false positive, and that's why checking is needed.
Advantages:
Signatures are compact, so they save space and are quick to compare.
Drawbacks:
False positives mean candidate documents must still be fetched and verified.
Every document's signature has to be scanned for each query, which is slow for very large collections.
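A toy sketch of the signature idea (the 16-bit width and Python's built-in hash are arbitrary illustrative choices):

```python
# Build a tiny bit-signature for each document by hashing words to bit positions.
SIG_BITS = 16

def signature(text):
    sig = 0
    for word in text.split():
        sig |= 1 << (hash(word) % SIG_BITS)  # turn ON one bit per word
    return sig

docs = {1: "cheap mobile phones", 2: "coffee and tea", 3: "mobile reviews"}
sigs = {d: signature(t) for d, t in docs.items()}

query = signature("mobile")
# A document *may* contain the query word only if all the query's bits are
# ON in its signature; each candidate must then be fetched and checked,
# because bit collisions can produce false positives.
candidates = [d for d, s in sigs.items() if s & query == query]
print(candidates)  # docs 1 and 3, plus possibly false positives
```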
2. Compare and contrast the focused crawling and regular crawling techniques.

Introduction to Crawlers
Imagine you want to collect customer reviews from various shopping websites.
You could:
Start at Amazon
Go to the product page
Open every review page
Collect text, rating, and date
Then go to Flipkart, do the same
Repeat this for every site
This automated program that visits pages, follows links, and gathers data is called a web
crawler (or spider or robot). Crawlers are the backbone of search engines, helping build and
update their page indexes.
Regular Crawlers (a.k.a. General or Periodic Crawlers)
What they do: Visit pages across the whole web, without favoring any particular topic.
Use: Keeping a general-purpose search engine's index complete and up to date.
Types:
Periodic Crawlers: Crawl the entire web again after a fixed time to update the index.
Incremental Crawlers: Only update parts of the index that have changed recently.
Focused Crawlers (Topic-Specific Crawlers)
What they do: Crawl only pages relevant to a chosen topic, judging each link before following it.
Example:
If the crawler is focused on “customer reviews,” it won’t waste time going into sports news or
recipe blogs.
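A toy sketch of the difference between the two, using an in-memory link graph instead of real HTTP requests (all page names, texts, and links are made up):

```python
from collections import deque

# A made-up mini-web: page -> (page text, outgoing links).
web = {
    "home":          ("welcome portal",              ["sports", "reviews"]),
    "sports":        ("football cricket scores",     ["news"]),
    "reviews":       ("customer reviews of laptops", ["laptop-review"]),
    "laptop-review": ("detailed customer review",    []),
    "news":          ("daily headlines",             []),
}

def crawl(start, focused=False, topic="review"):
    """Breadth-first crawl. A regular crawler follows every link; a focused
    crawler only follows links out of pages that look relevant to the topic."""
    visited, queue = set(), deque([start])
    while queue:
        page = queue.popleft()
        if page in visited:
            continue
        visited.add(page)
        text, links = web[page]
        # Focused crawling: don't expand irrelevant pages (start is always expanded).
        if focused and page != start and topic not in text:
            continue
        queue.extend(links)
    return visited

print(crawl("home"))                # regular: reaches all 5 pages
print(crawl("home", focused=True))  # focused: never reaches the "news" page
```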
3. Describe the following activities involved in web usage mining: i) Pre-processing activity ii) Pattern analysis

i) Pre-processing Activity
Pre-processing is the initial and essential step in Web Usage Mining where raw web log data is
cleansed, structured, and transformed into a format suitable for mining.
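Key Steps in Pre-processing: data cleaning (dropping image/script requests and failed hits), user identification, and session identification. A small sketch, assuming a simplified log format of IP, timestamp, and requested URL:

```python
# Raw log lines in an assumed simplified format: IP, timestamp, URL.
raw_log = [
    "10.0.0.1 2024-01-01T10:00:00 /home",
    "10.0.0.1 2024-01-01T10:00:01 /logo.png",   # image request: noise
    "10.0.0.1 2024-01-01T10:00:30 /products",
    "10.0.0.2 2024-01-01T10:05:00 /home",
]

sessions = {}
for line in raw_log:
    ip, ts, url = line.split()
    # Data cleaning: drop requests for images/scripts, keep page views.
    if url.endswith((".png", ".jpg", ".css", ".js")):
        continue
    # User/session identification: group page views by user (here, by IP).
    sessions.setdefault(ip, []).append(url)

print(sessions)  # {'10.0.0.1': ['/home', '/products'], '10.0.0.2': ['/home']}
```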
ii) Pattern Analysis
Pattern analysis is the process of interpreting the mined patterns (from logs) to discover
meaningful, actionable insights.
6. Explain HITS algorithm with an example.

What are Hubs and Authorities?
Hubs are web pages that link to many other pages. They're like "directories" that point to useful content. Think of a hub as a page that links to a lot of articles, resources, or useful websites.
Authorities are web pages that are linked to by many other pages. They're like "experts" or
"important pages" on a particular topic. These are pages that people tend to link to because
they contain valuable content.
Example: Suppose you search for information about the best laptops.
You might find a page that links to several reviews, product details, and comparison sites.
This page would be a hub because it points to many useful resources.
Then, there might be a page that has a detailed review of the top laptops, and many other
pages link to it. This page would be an authority because it is the expert or highly
recommended by other sources.
The HITS algorithm helps find these types of pages, so that the results you see come from reputable sources (authorities) and useful directories (hubs).
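How does HITS work?
Every page starts with a hub score and an authority score of 1. The scores are then updated repeatedly: a page's authority score becomes the sum of the hub scores of the pages linking to it, and its hub score becomes the sum of the authority scores of the pages it links to, with normalization after each round, until the scores stabilize. A minimal sketch on a made-up four-page graph, where P1 and P2 both link to P3, and P3 links to P4:

```python
# Link graph: page -> pages it links to (an assumed toy example).
links = {"P1": ["P3"], "P2": ["P3"], "P3": ["P4"], "P4": []}
pages = list(links)

hub = {p: 1.0 for p in pages}
auth = {p: 1.0 for p in pages}

for _ in range(20):  # repeat until the scores stabilize
    # Authority update: sum of hub scores of pages linking to me.
    auth = {p: sum(hub[q] for q in pages if p in links[q]) for p in pages}
    # Hub update: sum of authority scores of pages I link to.
    hub = {p: sum(auth[q] for q in links[p]) for p in pages}
    # Normalize so the scores stay bounded.
    a_norm = sum(v * v for v in auth.values()) ** 0.5
    h_norm = sum(v * v for v in hub.values()) ** 0.5
    auth = {p: v / a_norm for p, v in auth.items()}
    hub = {p: v / h_norm for p, v in hub.items()}

print(auth)  # P3 gets the highest authority score (two hubs point to it)
print(hub)   # P1 and P2 get the highest hub scores
```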
7. Describe different Text retrieval methods. Explain the relationship between text mining, information retrieval and information extraction.
What is Text Mining?
Text mining is like teaching a computer to read, understand, and find useful info from lots of
text documents.
Imagine you have thousands of news articles or research papers. You can't read them all, right? Text mining helps you pull out useful patterns, topics, and facts from them automatically.
What is Information Retrieval (IR)?
IR is about searching through a large collection of documents and retrieving only the relevant ones based on a user's query (search).
Like Google Search: you type "top movies 2024" → it shows only the relevant pages.
IR is used to:
Locate documents that match a keyword
Rank documents based on how relevant they are to the query
It’s pull-based: You pull information when you search.
What is Information Extraction (IE)?
IE is a part of text mining. Once you have the documents, IE is used to pull specific pieces of data from them (names, dates, prices, locations, and so on).
Text retrieval methods
When you ask a question or type a search (like "best smartphones"), the system needs to retrieve relevant documents. There are two types of text retrieval methods:
1. Document Selection Methods
Here, the system uses Boolean logic (AND, OR, NOT) to find documents that exactly match the user's conditions.
Example Queries:
“mobile AND cheap”
“data science OR machine learning”
“coffee NOT tea”
Drawbacks:
Requires exact logic
Not beginner-friendly
Not flexible (no ranking)
Use Case: Works well if the user knows exactly what they want.
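A minimal sketch of document selection with set operations over an inverted index (the documents and terms below are made up, reusing the example queries above):

```python
# Which documents contain each term (an assumed toy index).
index = {
    "mobile": {1, 2},
    "cheap":  {2, 3},
    "coffee": {1, 3},
    "tea":    {3},
}

print(index["mobile"] & index["cheap"])  # "mobile AND cheap" -> {2}
print(index["mobile"] | index["cheap"])  # "mobile OR cheap"  -> {1, 2, 3}
print(index["coffee"] - index["tea"])    # "coffee NOT tea"   -> {1}
# Note: results are unranked; a document either matches or it doesn't.
```

2. Document Ranking Methods
Instead of a strict match, each document gets a relevance score for the query (for example, based on TF-IDF weights) and the results are sorted by that score; this is what search engines like Google do (see the TF-IDF formulas at the end of these notes).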
8. Explain how web structure mining is different from web usage mining and web content mining? Write a CLEVER algorithm for web structure mining.

1. Web Content Mining
What it is: Extracting useful information from the actual content of web pages.
Focus: Text, images, videos, metadata, etc.
Example: Searching for "healthy recipes" on Google and getting actual articles, blog posts,
or videos related to recipes.
Techniques Used: Text mining, NLP (Natural Language Processing), multimedia mining.
2. Web Structure Mining
What it is: Discovering relationships and structures from the hyperlink structure of the web.
Focus: The links between web pages—like a web graph.
Example: Figuring out which websites are authoritative (like Wikipedia or a government site) by looking at how many other pages link to them.
Techniques Used: Graph theory, link analysis algorithms like PageRank or CLEVER.
3. Web Usage Mining
What it is: Mining data from web user behavior—how users interact with websites.
Focus: Clickstreams, browsing patterns, session logs.
Example: Netflix analyzing your viewing habits to suggest movies.
Techniques Used: Log file analysis, pattern recognition, clustering.
CLEVER Algorithm
The CLEVER algorithm is a famous algorithm used in Web Structure Mining. It helps find
authoritative pages and hub pages.
What are Authorities and Hubs?
Authority Page: A page that is a credible source of information (ex: official documentation, Wikipedia).
Wikipedia).
Hub Page: A page that links to many authoritative pages (like a curated list of resources
or a directory).
Basic Idea (How it Works):
A good hub points to good authorities, and a good authority is pointed to by good hubs.
Goal of CLEVER:
Give every page two scores:
Authority Score
Hub Score
These scores help us rank the pages.
Algorithm Steps
1. Initialize: set the hub and authority scores of all pages to 1.
2. Authority Update Rule: each page's authority score = sum of the hub scores of all pages linking to it.
3. Hub Update Rule: each page's hub score = sum of the authority scores of all pages it links to.
4. Normalize the scores to prevent values from growing too large.
5. Repeat the update steps until the scores converge (i.e., stop changing significantly).
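For example, take an assumed toy graph where pages A and B both link to page C, and C links back to A. Starting from all scores equal to 1, one round of updates gives:

auth(C) = hub(A) + hub(B) = 1 + 1 = 2, auth(A) = hub(C) = 1, auth(B) = 0
hub(A) = auth(C) = 2, hub(B) = auth(C) = 2, hub(C) = auth(A) = 1

After normalizing and repeating, C ends up with the highest authority score, and A and B with the highest hub scores.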
9. Term frequency matrix given in the table shows the frequency of terms per document

Step 1: What is TF-IDF?
Think of TF-IDF like a score that tells us how important a word (or term) is in a document, compared to other documents.
TF (Term Frequency) – How many times the word appears in this document?
IDF (Inverse Document Frequency) – Is this word special? Or is it found in almost every
document?
We use this to calculate a small number (called IDF) using a formula, but you can just think: the fewer documents a term appears in, the more special it is.
Step 2: Find TF (Term Frequency of T4 in D3)
From the table, term T4 appears 6 times in document D3.
So,
TF(D3, T4) = 6
Step 3: Find how common T4 is in all documents (IDF part)
We need the number of documents containing T4:
D1: 0 → not present
D2: 3 → present
D3: 6 → present
D4: 8 → present
So,
df(T4) = 3
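Step 4: Final answer
Assuming the table covers 4 documents in total and using a base-10 logarithm:

IDF(T4) = log(4 / 3) ≈ 0.125
TF-IDF(D3, T4) = TF × IDF = 6 × 0.125 ≈ 0.75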
10. List and explain the different data structures used for web usage mining?
What is Web Usage Mining?
Web Usage Mining is the process of analyzing user behavior by studying logs such as:
Clickstreams
Browser history
Session data
To make sense of this data, we need efficient data structures to store and process usage
patterns.
To efficiently track and discover patterns, especially sequences of web pages visited by
users, the following data structures are commonly used:
1. Trie (Prefix Tree)
What is a Trie?
A trie is a tree that stores strings character by character, so that words sharing a prefix share the same path from the root.
Example: storing the words
"ABOUT"
"CAT"
"CATEGORY"
A standard trie would look like this:
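```
(root)
 ├─ A ─ B ─ O ─ U ─ T                "ABOUT"
 └─ C ─ A ─ T                        "CAT" (word ends here)
             └─ E ─ G ─ O ─ R ─ Y    "CATEGORY"
```

Each letter is a separate node, and "CAT" shares its path with "CATEGORY".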
Problem with Standard Trie:
Every character gets its own node, so long chains of single-child nodes waste memory.

2. Compressed Trie / Suffix Tree
A compressed trie (or suffix tree) merges each single-child chain into one edge. This:
Saves memory.
Still supports fast lookups and pattern matching.
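A minimal sketch of a (standard) trie over user navigation paths, using nested dictionaries (the page names are made up):

```python
# Each session is a sequence of visited pages; the trie shares common
# prefixes, so frequent starting paths are stored only once.
trie = {}

def insert(path):
    node = trie
    for page in path:
        node = node.setdefault(page, {})
    node["#end"] = True  # marks that a complete session ends here

insert(["home", "products", "checkout"])
insert(["home", "products", "reviews"])
insert(["home", "about"])

# "home" -> "products" is stored once even though two sessions share it.
print(list(trie["home"].keys()))  # ['products', 'about']
```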
11. Write any three applications of web usage mining and explain

1. Personalization
How Web Usage Mining helps: by learning each user's browsing habits, the site can tailor what that user sees.
Real-life Example:
If a user frequently visits tech-related news articles, the homepage can be customized to show
tech news at the top next time.
2. Improving Website Design
How Web Usage Mining helps: by analyzing how users interact with a website, designers can identify problems and optimize the user experience.
Example:
If most users never go beyond the product listing page, the website might need a better call-to-action or more intuitive navigation to the checkout page.
3. Making Business Decisions
How Web Usage Mining helps: web usage data can be used to make data-driven business decisions for increasing sales and improving marketing strategies.
Example:
An e-commerce site may discover that users who search for "budget laptops" often end up
buying mid-range ones. The business can then highlight mid-range laptops more
prominently.
12. Explain the different traversal patterns and discovery methods in web usage data.
Different traversal patterns
Traversal patterns describe how users move through a website — from one page to another.
1. Sequential Patterns: pages visited in a specific order across sessions (e.g., Home → Product → Checkout).
2. Frequent Patterns: pages or paths that appear often across many users' sessions.
3. Cyclic Patterns: navigation that loops back to an earlier page (e.g., returning to a listing page after viewing each product).
Discovery methods
These are techniques used to analyze and find patterns from web usage data (usually from log files or session data).
1. Association Rules: find pages that are frequently visited together in the same session.
2. Clustering: group users (or pages) with similar behavior, without predefined labels.
3. Classification: assign users to predefined categories based on their browsing profiles.
Web content mining techniques depend on how structured the content is.

Unstructured Text
Used when content is plain text (without a defined structure), like articles, blogs, comments.
Techniques:
Text Classification:
Assigns categories to content.
Example: Classify a blog as "tech", "health", or "news"
Text Clustering:
Groups similar texts together (like news articles on the same topic).
Keyword/Concept Extraction:
Finds important terms or topics from the text.
Structured Data
Used when content has a fixed, regular structure, like product listings or tables.
Techniques:
Wrapper Induction:
Used to extract data from structured parts of HTML (like product listings, tables).
DOM Tree Analysis:
Analyzes the Document Object Model (DOM) structure to extract relevant parts (like
prices, headings).
Pattern Matching (Regular Expressions):
Used to find patterns in HTML source code (like emails, phone numbers, prices).
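A small sketch of regex-based extraction (the HTML snippet and patterns are illustrative):

```python
import re

html = "<div>Contact: sales@example.com, Price: $499.99</div>"

# Toy patterns for email addresses and dollar prices.
emails = re.findall(r"[\w.+-]+@[\w-]+\.[\w.]+", html)
prices = re.findall(r"\$\d+(?:\.\d{2})?", html)

print(emails)  # ['sales@example.com']
print(prices)  # ['$499.99']
```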
Semi-Structured Data
Used for content that is partly structured, such as:
XML files
JSON from APIs
HTML with tags but mixed text
Techniques:
Tree-Based Mining:
Uses tree-like structures (like XML) for mining.
Schema Extraction:
Identifies patterns or templates used across pages (e.g., product layout on shopping
websites).
To summarize the two text retrieval methods:
Document Selection: uses Boolean logic (AND, OR, NOT) to filter documents.
Document Ranking: ranks documents based on a relevance score (used in Google, etc.).
Before mining, text must be cleaned and converted into a usable format.
Tokenization: break text into words or tokens.
Stop Word Removal: remove common but uninformative words like "the", "is", "of".
Stemming: reduce words to their base/root form (e.g., "running" → "run").
Lemmatization: more advanced than stemming; returns an actual dictionary word.
Formula:
TF(d, t) = frequency of term t in document d
IDF(t) = log(total number of documents / number of documents containing t)
TF-IDF(d, t) = TF(d, t) × IDF(t)
TF-IDF increases with term frequency in a document and decreases with its frequency
across all documents.
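A minimal sketch of these formulas in code (toy term counts; the base-10 logarithm is one common choice):

```python
import math

# Term frequency of each term in each document (toy numbers).
tf = {"D1": {"data": 5, "web": 0}, "D2": {"data": 2, "web": 7}}
N = len(tf)  # total number of documents

def idf(term):
    # Number of documents that actually contain the term.
    df = sum(1 for doc in tf.values() if doc.get(term, 0) > 0)
    return math.log10(N / df)

def tf_idf(doc, term):
    return tf[doc].get(term, 0) * idf(term)

print(tf_idf("D1", "data"))  # 0.0: "data" appears in every doc, so IDF = 0
print(tf_idf("D2", "web"))   # 7 * log10(2/1) ≈ 2.11
```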