AIML Sem 8

Paper 1

Q1)

a)​ What is stemming? Explain its role in text preprocessing?

Stemming is a process in text preprocessing where we reduce words to their base or root form.

Stemming means cutting off the ends of words to remove prefixes or suffixes and get the base/root form of a word.

🧹 Example:
Words like:

●​ "playing", "played", and "plays"​


All become → "play"​

🛠️ Role in Text Preprocessing (Importance):


1.​ Reduces word forms: Helps group similar words with the same meaning.​

2.​ Decreases vocabulary size: Makes the dataset smaller and easier to analyze.​

3.​ Improves model performance: Makes machine learning models faster and
more accurate.​

4.​ Makes comparison easy: All similar words are treated as one.
5.​ Improves search results: In search engines, stemming helps match different
forms of a word. For example, searching for "connect" will also find "connected",
"connecting", etc.​

6.​ Supports text mining and NLP tasks: Stemming helps in tasks like sentiment
analysis, classification, and information retrieval by reducing noise from word
variations.
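A minimal code sketch of stemming, assuming the NLTK library (pip install nltk) is available; PorterStemmer is one widely used stemming algorithm:

```python
# A minimal stemming sketch using NLTK's PorterStemmer.
from nltk.stem import PorterStemmer

stemmer = PorterStemmer()
words = ["playing", "played", "plays", "connection", "connected"]

for word in words:
    # stem() strips common suffixes to produce the root form
    print(word, "->", stemmer.stem(word))
# playing -> play, played -> play, plays -> play,
# connection -> connect, connected -> connect
```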

b) What is the principle of maximum likelihood estimation (MLE)?

Maximum Likelihood Estimation (MLE) is a method used to find the best values for
the parameters of a model by maximizing the likelihood of observing the given data.

MLE picks the model that gives the highest probability of producing the data we
actually have.

How it works:

1.​ Assume a probability distribution (like Normal, Binomial, etc.).​

2.​ Use your data to build a likelihood function — a formula showing how likely the
data is for different parameter values.​

3.​ Find the parameter values that maximize this function.



Simple Example:

Suppose you flip a coin 10 times and get 7 heads and 3 tails.

You want to estimate the probability of getting heads (p).

🧠 How MLE works:


1.​ Assume that the coin has some unknown probability p of getting heads.​

2.​ The likelihood of getting this result (7 heads, 3 tails) depends on the value of p.​

3.​ Try different values of p (like 0.1, 0.5, 0.7, etc.) and calculate how likely this result
is.​

4.​ The value of p that gives the highest likelihood is the MLE.​

🎯 In our case:
If we try p = 0.7, it gives the highest chance of getting 7 heads and 3 tails.

So, MLE = 0.7 → our best estimate of the probability of heads is 0.7.

Conclusion:

MLE finds the value of a parameter (like probability) that makes the observed data
most likely.​
It is used in machine learning and statistics to build the best models from data.
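A small sketch of the coin example in Python; the grid of candidate p values is an illustrative choice:

```python
# MLE for 7 heads out of 10 flips: try candidate values of p and keep
# the one maximizing the binomial likelihood L(p) = C(10,7) p^7 (1-p)^3.
from math import comb

heads, tails = 7, 3
n = heads + tails

def likelihood(p):
    return comb(n, heads) * p**heads * (1 - p)**tails

candidates = [i / 100 for i in range(1, 100)]  # p = 0.01 ... 0.99
best_p = max(candidates, key=likelihood)
print(best_p)  # 0.7 -> matches the closed-form MLE heads/n = 7/10
```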

c) Explain how Bayesian Networks represent probabilistic relationships between variables

A Bayesian Network is a type of diagram (graph) that shows how different variables
are related using probability.

It is used to represent uncertainty and dependencies between variables.

🧩 Structure:
●​ It is a Directed Acyclic Graph (DAG)​
(arrows go one way, and there are no loops).​

●​ Each node represents a random variable (like Rain, Traffic).​

●​ Each arrow shows a direct relationship (like Rain → Traffic).​

🔗 How it represents relationships:


●​ If there’s an arrow from A to B, it means B depends on A.​

●​ Each variable has a Conditional Probability Table (CPT) that tells the chance
of that variable, given the values of its parent(s).​

✅ Example:
Imagine 3 variables:
●​ Rain
●​ Sprinkler
●​ Wet Grass​

The Bayesian Network might look like:

Rain → Wet Grass ← Sprinkler

This shows:

●​ Wet Grass depends on whether it rained or the sprinkler was on.​

●​ The network tells us how likely the grass is wet based on those causes.​

Conclusion:

Bayesian Networks use a graph to visually and mathematically show how variables
influence each other using probabilities.​
They are useful for prediction, decision making, and handling uncertainty in data.

1️⃣ Joint Probability Distribution

Definition:​
It shows the probability of all variables happening together.

Example:​
For variables Rain and Traffic:

| Rain | Traffic | P(Rain, Traffic) |
| --- | --- | --- |
| Yes | Yes | 0.3 |
| Yes | No | 0.2 |
| No | Yes | 0.1 |
| No | No | 0.4 |

👉 This table is the joint probability distribution of Rain and Traffic.


2️⃣ Inference in Bayesian Networks

Definition:​
Inference means finding unknown probabilities based on known evidence.

Example:​
If you know it rained, what’s the chance traffic is bad?

Use the network to calculate:​


P(Traffic = Yes | Rain = Yes)

👉 Inference helps us predict or update beliefs using the network.


3️⃣ Conditional Independence

Definition:​
Two variables are conditionally independent if knowing a third variable makes
them unrelated.

Example:​
Consider the network:

●​ Rain → Wet Grass ← Sprinkler​

Here, Rain and Sprinkler are independent as long as Wet Grass is not observed.

👉 That means:​
Before we look at the grass, knowing about the sprinkler tells us nothing about rain. But once we observe Wet Grass = Yes, Rain and Sprinkler become dependent (this is called "explaining away"): learning the sprinkler was on makes rain a less likely explanation. In a chain such as Rain → Wet Grass → Slippery Path, the opposite holds: Rain and Slippery Path are conditionally independent once Wet Grass is known.
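To make inference on this network concrete, here is a small sketch that computes P(Rain = Yes | Wet Grass = Yes) by enumeration; all CPT numbers are illustrative assumptions, not values from the text:

```python
# Inference by enumeration on Rain -> WetGrass <- Sprinkler.
# All probabilities below are invented for illustration.
P_rain = {True: 0.2, False: 0.8}
P_sprinkler = {True: 0.3, False: 0.7}
# Assumed CPT: P(WetGrass = yes | Rain, Sprinkler)
P_wet = {(True, True): 0.99, (True, False): 0.9,
         (False, True): 0.8, (False, False): 0.05}

def joint(rain, sprinkler, wet):
    # Joint probability factorizes along the graph structure
    p_w = P_wet[(rain, sprinkler)]
    return P_rain[rain] * P_sprinkler[sprinkler] * (p_w if wet else 1 - p_w)

# Sum out Sprinkler for the numerator; sum out both parents for the denominator
num = sum(joint(True, s, True) for s in (True, False))
den = sum(joint(r, s, True) for r in (True, False) for s in (True, False))
print(round(num / den, 3))  # posterior belief that it rained, given wet grass
```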

d) Compare sentiment analysis with opinion mining.

Sentiment Analysis

Sentiment Analysis is a method to find out if a sentence or text shows a positive, negative, or neutral feeling.​
👉 Example: "I love this phone" → Positive

Opinion Mining

Opinion Mining is a method to find out what people think about something, like a product or service, including which part they like or dislike.​
👉 Example: "The phone's battery is bad" → Opinion on battery is negative

| No. | Parameter | Sentiment Analysis | Opinion Mining |
| --- | --- | --- | --- |
| 1 | Main Focus | Feeling or emotion | Thought or opinion |
| 2 | Purpose | To find mood (positive, negative, neutral) | To understand what people think about something |
| 3 | Type of Data | Emotional statements | Opinion-based statements |
| 4 | Output | Positive, Negative, or Neutral | Opinion + related topic or feature |
| 5 | Detail Level | Less detailed | More detailed |
| 6 | Example | "I love this phone." → Positive | "The phone battery is bad" → Opinion on battery |
| 7 | Use Case | Social media, news analysis | Product reviews, customer feedback |
| 8 | Target | Overall text feeling | Specific product feature or topic |
| 9 | Process Complexity | Simple | Slightly complex |
| 10 | Includes Topics? | No | Yes |
| 11 | Includes Sentiment? | Yes | Yes |
| 12 | Used For | General emotion tracking | Business and market research |
| 13 | Relation Between Both | A part of opinion mining | A broader process including sentiment analysis |

e) What is social media mining? Explain various types of social media graphs.

Social Media Mining is the process of collecting and analyzing data from social
media platforms (like Facebook, Twitter, Instagram) to find useful patterns, opinions,
or trends.

It helps businesses, researchers, and organizations understand what people are talking about, how they feel, and what's popular.

Example:

If a company checks thousands of tweets to see what people are saying about their new
product, and finds that most people like it, that’s social media mining.

👉 It includes reading comments, likes, shares, hashtags, etc.


In social media mining, social networks are often represented as graphs, where:

●​ Nodes (Vertices) represent individuals, organizations, or entities.


●​ Edges (Links) represent relationships, interactions, or connections between
nodes.

1. Undirected Graph

●​ Definition: A graph where connections between nodes are bidirectional.



●​ Example: Facebook friendships (if A is a friend of B, B is also a friend of A).


●​ Use Case: Studying mutual relationships in online communities.

2. Directed Graph (Digraph)

●​ Definition: A graph where edges have a direction, indicating one-way relationships.
●​ Example: Twitter follow relationships (A can follow B, but B may not follow A).
●​ Use Case: Influence analysis, information diffusion, and viral marketing.

3. Weighted Graph

●​ Definition: A graph where the connections have weights that show the strength
of the relationship between users.​

●​ Example: WhatsApp messages—the more frequent the messages, the stronger the connection between users.​

●​ Use Case: Helps identify strong and weak connections between users.

4. Multigraph

●​ Definition: A graph where there can be multiple connections (edges) between the same pair of nodes, each representing different types of relationships.​

●​ Example: A user on Instagram can like, comment, and share a post (3 different interactions).​

●​ Use Case: Helps understand different ways users engage with content.

5. Dynamic Graph (Evolving Graph)

●​ Definition: A graph that changes over time. Nodes and edges can be added or
removed as relationships or interactions evolve.​

●​ Example: A social network where friendships form and break over time.​

●​ Use Case: Used for studying trends, how communities grow, and how content
spreads.

6. Ego Network (Personal Network)

●​ Definition: A graph that focuses on a single user (ego) and shows all their
direct connections (alters).​

●​ Example: A LinkedIn user's connections—this shows all the people they are
directly connected to.​

●​ Use Case: Helps analyze personal influence and recommend connections.
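A minimal sketch of the first three graph types, assuming the networkx library (pip install networkx); the node names and the weight are invented:

```python
# Undirected, directed, and weighted social graphs with networkx.
import networkx as nx

# Undirected graph: mutual friendships
friends = nx.Graph()
friends.add_edge("A", "B")           # A and B are friends both ways

# Directed graph: follow relationships
follows = nx.DiGraph()
follows.add_edge("A", "B")           # A follows B, not necessarily back

# Weighted graph: interaction strength (e.g., message counts)
chats = nx.Graph()
chats.add_edge("A", "B", weight=42)  # 42 messages exchanged

print(friends.has_edge("B", "A"))    # True  (bidirectional)
print(follows.has_edge("B", "A"))    # False (one-way)
print(chats["A"]["B"]["weight"])     # 42
```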

Q2)
a)​ Write a short note on web usage mining.

Web Usage Mining is the process of analyzing user behavior on websites by looking
at the data generated when people use a website.

It helps website owners understand how users interact with their site, such as:

●​ Which pages users visit,​

●​ How much time they spend,​

●​ What links they click,​

●​ What path they follow on the site.

Sources of Data in Web Usage Mining:


| Source | Explanation (Simple) |
| --- | --- |
| 1. Web Server Logs | Files stored on a server that record every page a user visits on a website. Example: time of visit, IP address, pages clicked. |
| 2. Cookies | Small files stored in the user's browser to track repeated visits and preferences. |
| 3. Browser History | The list of websites and pages visited by the user, which can help understand their interests. |
| 4. Clickstream Data | The sequence of clicks or paths the user follows while using the website. |

🎯 Goals of Web Usage Mining (Expanded):


1.​ ✅ Improve Website Design:​
Helps redesign websites based on how users actually use them (like rearranging
buttons or menus).​

2.​ ✅ Personalization:​
Shows users content or products based on their past behavior (like YouTube or
Amazon recommendations).​

3.​ ✅ Enhance User Experience:​


By understanding user flow, site speed, and popular pages, the website can be
made easier and faster to use.​

4.​ ✅ Increase Sales or Conversions:​


If it's an e-commerce website, mining helps identify why users abandon carts or
which products are most viewed.​

5.​ ✅ Targeted Advertising:​


Shows users ads based on their browsing behavior (e.g., visited a mobile page
→ sees mobile ads).​

6.​ ✅ Detect Fraud or Unusual Behavior:​


Abnormal user actions like repeated failed logins can be spotted for security
reasons.​

7.​ ✅ Content Optimization:​


Finds which articles or pages are most liked or ignored, helping decide what kind
of content to create more of.​

8.​ ✅ Better Business Decisions:​


The data helps businesses plan strategies, marketing campaigns, and customer
support based on real user data.​

📌 Example:
If a lot of users leave the site without buying, web usage mining might show they all left
on the payment page → the business can then fix that page to increase sales.

📝 Conclusion:
Web Usage Mining is a powerful tool to understand users, improve website
performance, and make smarter business decisions using data from user activity.
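A toy sketch of the idea, counting page visits from invented server-log-style records:

```python
# Counting page popularity from simplified web-server-log records.
# The log lines below are invented for illustration.
from collections import Counter

log_lines = [
    "10.0.0.1 2024-01-05T10:00 /home",
    "10.0.0.1 2024-01-05T10:01 /product",
    "10.0.0.2 2024-01-05T10:02 /home",
    "10.0.0.2 2024-01-05T10:03 /payment",
]

page_counts = Counter()
for line in log_lines:
    ip, timestamp, page = line.split()
    page_counts[page] += 1          # tally visits per page

print(page_counts.most_common(1))  # most visited page: [('/home', 2)]
```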

b) Discuss any two Distance-based Clustering Algorithms.

Clustering means grouping similar items together. In distance-based clustering, we use distance or similarity to decide which items are in the same group. Here are two popular clustering algorithms:

🔷 1. K-Means Clustering
✅ Concept (Simple):​
K-Means groups data into K clusters based on how close they are to the center (called
centroid) of each group.​
It works by assigning items to the nearest centroid, then updating centroids based
on the average of all items in the group.

✅ Steps of K-Means:
1.​ Choose the number of clusters (K).​

2.​ Pick K random points as centroids.​



3.​ Assign each point to the closest centroid.​

4.​ Update centroids by taking the average.​

5.​ Repeat steps 3 and 4 until groups stop changing.​

✅ Distance Used:​
Mostly Euclidean distance (straight-line distance).

✅ Example (Simple):​
If we have students with height and weight:

●​ K=2 → Two clusters.​

●​ K-Means groups tall-heavy students together and short-light students together.​

✅ Text Applications:​
Words or phrases are changed into vectors using TF-IDF or Word2Vec.​
Then, K-Means groups similar meaning words together.

📘 Used for:

●​ Topic modeling
●​ Word grouping
●​ Customer segmentation​
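A small sketch of K-Means on text, assuming scikit-learn is installed; the documents and K = 2 are illustrative choices:

```python
# K-Means over TF-IDF vectors of short texts.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.cluster import KMeans

docs = ["football match today", "football match highlights",
        "stock market falls", "stock market rises"]

X = TfidfVectorizer().fit_transform(docs)   # text -> TF-IDF vectors
labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(X)
print(labels)  # e.g., [0 0 1 1]: sports docs in one cluster, finance in the other
```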

🔷 2. Hierarchical Clustering
✅ Concept (Simple):​
This algorithm builds a tree of clusters. It can work in two ways:

●​ Bottom-Up (Agglomerative): Start with each item alone, and merge similar
ones step-by-step.​

●​ Top-Down (Divisive): Start with all items in one group and keep splitting them.​

✅ Distance Used:​
To decide how close two clusters are, we use:

●​ Single linkage (closest points)​

●​ Complete linkage (farthest points)​

●​ Average linkage (average of all points)​

●​ KL-Divergence (for document comparison)​

✅ Example (Simple):​
If we have news articles:

●​ Hierarchical clustering groups articles about sports, politics, and technology into separate folders, then groups similar folders into bigger folders.​

✅ Text Applications (Documents):​


Documents are treated as word probability distributions (like using LDA).​
Similar documents are grouped together in a tree structure.

📘 Used for:
●​ Organizing large text collections​

●​ Document clustering​

●​ Topic modeling​

●​ Building folder-like structures (taxonomy)​
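A sketch of bottom-up (agglomerative) clustering, assuming SciPy and scikit-learn; the documents and the 2-cluster cut are illustrative:

```python
# Agglomerative clustering of documents with average linkage.
from sklearn.feature_extraction.text import TfidfVectorizer
from scipy.cluster.hierarchy import linkage, fcluster

docs = ["election results announced", "election results delayed",
        "team wins final match", "team loses final match"]

X = TfidfVectorizer().fit_transform(docs).toarray()
Z = linkage(X, method="average")                 # bottom-up merges
labels = fcluster(Z, t=2, criterion="maxclust")  # cut tree into 2 clusters
print(labels)  # e.g., [1 1 2 2]: politics articles vs sports articles
```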

✅ Final Comparison:
| Feature | K-Means | Hierarchical Clustering |
| --- | --- | --- |
| Output | Fixed K groups | Tree-like structure (dendrogram) |
| Type | Flat clustering | Hierarchical clustering |
| Distance Method | Euclidean | Linkage or KL Divergence |
| Best For | Grouping data fast | Organizing documents, understanding structure |
| Example Application | Grouping words or customers | Grouping news articles or documents |

📝 Conclusion:
Both algorithms are useful for finding hidden patterns in data.

●​ K-Means is simple and fast.​

●​ Hierarchical is detailed and better for documents.​

Q3)
a)​ Explain classical recommendation algorithms based on social media.

Recommendation algorithms are used by social media platforms like Facebook, Instagram, YouTube, or TikTok to suggest:

●​ Friends
●​ Posts
●​ Videos
●​ Pages or Products

These suggestions are based on your activity, likes, interests, and connections.

🔷 1. Collaborative Filtering (CF)

Definition: Suggests content based on what other similar users like.​
There are 3 types:

●​ ✅ User-Based CF:​
Finds users like you and recommends what they like.​
🧠 Example: If you and another person like the same posts, you might also like their other liked posts.​

●​ ✅ Item-Based CF:​
Suggests items similar to what you liked.​
🧠 Example: If you liked a fitness video, the app suggests other fitness videos.​

●​ ✅ Matrix Factorization:​
Finds hidden patterns in user preferences.​
🧠 Example: Helps platforms recommend posts even if there's little direct data about you.​
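A toy sketch of user-based CF with NumPy; the small ratings matrix is invented (0 = not rated):

```python
# User-based collaborative filtering via cosine similarity.
import numpy as np

ratings = np.array([[5, 4, 0, 1],    # user 0 (the target)
                    [4, 5, 4, 1],    # user 1 (similar taste)
                    [1, 0, 5, 4]])   # user 2 (different taste)

def cosine(u, v):
    return u @ v / (np.linalg.norm(u) * np.linalg.norm(v))

target = 0
sims = [cosine(ratings[target], ratings[u]) for u in range(len(ratings))]
# Pick the most similar other user
neighbor = max((u for u in range(len(ratings)) if u != target),
               key=lambda u: sims[u])
# Recommend items the neighbor rated highly that the target hasn't seen
recs = [i for i in range(ratings.shape[1])
        if ratings[target, i] == 0 and ratings[neighbor, i] >= 4]
print(neighbor, recs)  # 1 [2] -> recommend item 2 to user 0
```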

🔷 2. Content-Based Filtering

Definition: Recommends content similar to what you've liked before based on keywords, tags, etc.​
🧠 Example: If you liked travel posts, you will see more posts with hashtags like #travel, #beach.

🔷 3. Hybrid Recommendation System

Definition: Combines both collaborative filtering and content-based filtering for better suggestions.​
🧠 Example: Netflix uses your watch history + what others like to recommend movies.

🔷 4. Graph-Based Recommendation (Social Graph)

Definition: Uses your network (friends/followers) to give suggestions.​
🧠 Example: Facebook shows "People You May Know" using mutual friends.

🔷 5. Association Rule Mining (ARM)

Definition: Finds patterns between items people often use together.​
🧠 Example: People who like a fitness page also like healthy food pages.

🔷 6. Deep Learning-Based Recommendation

Definition: Uses neural networks to learn from your activity (likes, watch time, comments).​
🧠 Example: TikTok shows videos based on what you watch and how long you watch.

🔹 7. Popularity-Based Recommendation

Definition: Recommends trending or popular posts.​
🧠 Example: Instagram shows viral Reels that are liked by many people.

🔹 8. Demographic-Based Recommendation

Definition: Recommends based on age, gender, or location.​
🧠 Example: Teenagers may get different music recommendations than adults.

🔹 9. Knowledge-Based Recommendation

Definition: Suggests based on user preferences or needs, even without past activity.​
🧠 Example: A user selects "I like action movies" → app suggests action films directly.

🔹 10. Context-Aware Recommendation

Definition: Uses your current context like time, location, or device.​
🧠 Example: Suggests restaurants near you during lunch hours.

These algorithms help make social media smarter by learning what users like and
suggesting the most relevant content, improving user engagement and satisfaction.

b)​ Explain the concept of language modeling, N-gram models and their applications.

Language modeling is the process of predicting the next word in a sentence or checking
how likely a sentence is in a language.

It learns from large amounts of text to predict the next word or fill in missing words.

There are two main types of language models:

1. Statistical Language Models

These use math and counting (like N-grams) to find patterns in text.​
Example: Bigram model predicts "you" after "thank" if "thank you" appears often.

2. Neural Language Models

These use machine learning and deep learning to understand more complex
language patterns.​
Example: LSTM or Transformer models (like GPT) that can write long, meaningful
sentences.

✅ Why Language Modeling is Important


●​ Helps machines understand grammar and meaning​

●​ Improves performance of applications like:​

○​ Translators
○​ Voice assistants
○​ Autocorrect tools
○​ Search engines​

✅ Example (with missing word):


Sentence: "I want to eat ___."​
Language model looks at common word sequences and suggests: "pizza" or "food"
based on what it learned from data.

✅ What is an N-Gram Model?


An N-gram is a sequence of N words used together.​
For example:

●​ Unigram (1-gram): "I", "am", "happy"​

●​ Bigram (2-gram): "I am", "am happy"​

●​ Trigram (3-gram): "I am happy"​

🟩 N-Gram Modeling predicts the next word by looking at the previous (N-1) words.
🔹 How it Works:
1.​ Collect a large amount of text (called a corpus).​

2.​ Count how often word pairs or groups (N-grams) appear.​

3.​ Use that frequency to guess the next word in a sentence.​

🧠 Example (Bigram/2-gram):​
If the bigram “thank you” appears often in text, and you type “thank”, the model will
suggest “you”.
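A minimal bigram-model sketch in plain Python; the tiny corpus is invented:

```python
# Count word pairs in a corpus and predict the most frequent follower.
from collections import defaultdict, Counter

corpus = "thank you very much . thank you so much . thank you"
tokens = corpus.split()

following = defaultdict(Counter)
for w1, w2 in zip(tokens, tokens[1:]):
    following[w1][w2] += 1      # count how often w2 follows w1

def predict_next(word):
    # pick the most frequent continuation seen in the corpus
    return following[word].most_common(1)[0][0]

print(predict_next("thank"))    # 'you' (seen 3 times after 'thank')
```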

Applications of N-gram Modeling (8+ Examples):


1.​ Text Prediction/Autocompletion: Used in keyboards and search engines to
predict the next word or phrase a user is likely to type.
2.​ Speech Recognition: Helps improve the accuracy of speech-to-text systems by
predicting likely word sequences.
3.​ Machine Translation: Aids in generating more fluent and natural-sounding
translations by predicting likely word sequences in the target language.
4.​ Spelling Correction: Detects and corrects spelling errors by identifying
improbable word sequences.
5.​ Sentiment Analysis: Can be used to identify patterns of words associated with
positive or negative sentiment.
6.​ Language Identification: Determines the language of a text by analyzing the
frequency of N-grams.

7.​ Text Generation: Generates new text that resembles the style and content of a
training corpus.
8.​ Information Retrieval: Improves search results by considering the context of
search queries.
9.​ Plagiarism Detection: Identifies similarities between documents by comparing
their N-gram distributions.
10.​Chatbots: Helps chatbots generate more natural and contextually relevant
responses

N-gram models are simple but powerful tools in NLP that help machines read, write, and
understand text like humans.

Q4)
a)​ Explain various types of web spamming techniques?

Web spamming means using unfair tricks to fool search engines and get a higher
ranking for a website. These tricks make the website look important, but often give a
bad experience to users.

📌 1. Content Spamming
What it is: Adding useless, copied, or repeated content just to match keywords.​
Example: “Buy cheap phone. Cheap phone deals. Phone online cheap…” — same
sentence repeated with keywords.​
Why it's bad: Confuses users and doesn’t provide real value.

📌 2. Link Spamming
What it is: Creating many fake or unrelated links to your website to trick search
engines.​
Example: Posting your site link again and again in blog comments or forums that are
not related.​
Why it's bad: Misleads search engines with false popularity.

📌 3. Keyword Stuffing
What it is: Repeating the same keyword many times unnaturally.​
Example: “Shoes online. Buy shoes. Cheap shoes. Best shoes. Shoes sale…”​
Why it's bad: Makes the page look spammy and hard to read.

📌 4. Cloaking
What it is: Showing different content to users and search engines.​
Example: You see a news article, but search engine sees only keyword-loaded junk.​
Why it's bad: It hides the real page from search engines.

📌 5. Hidden Text or Links


What it is: Putting invisible text or links by using white text on a white background.​
Example: A page with hidden links or keywords that only search engines can see.​
Why it's bad: Users don’t see it, but search engines are tricked.

📌 6. Doorway Pages
What it is: Special pages made to trap search engines and redirect users to another
page.​
Example: A user clicks a result for “best laptops” and is taken to an unrelated ad page.​
Why it's bad: Misleads users by not giving what was promised.

📌 7. Scraping Content
What it is: Copying content from other websites without permission.​
Example: A blog that steals articles from other blogs to get more traffic.​
Why it's bad: It’s unethical and gives no original value.

📌 8. Clickbait Titles
What it is: Using catchy or fake titles just to make people click.​
Example: “You won’t believe what happened next!” but the article is boring or
unrelated.​
Why it's bad: Tricks users and wastes their time.

These spamming methods are used to trick search engines but hurt real users. Search
engines like Google now use smart algorithms to detect and punish such spammy
behavior.

b)​ Explain various behavior analytics techniques used for social media
mining.

Behavior analytics means studying what users do on social media — like what they
click, like, share, or comment — to understand their interests, predict trends, and
improve user experience.

📌 1. Sentiment Analysis
What it does: Checks if a post or comment is positive, negative, or neutral.​
Example: If someone writes, “I love this movie,” it detects the positive sentiment.​
Used for: Product reviews, customer feedback, public opinion.

Techniques Used:

●​ Lexicon-Based Approach: Uses predefined dictionaries of positive and


negative words.
●​ Machine Learning Approach: Uses classification models like Naïve Bayes,
SVM, or deep learning models (e.g., LSTMs, BERT) to classify sentiments.
●​ Hybrid Approach: Combines lexicon-based and machine-learning techniques
for better accuracy.

📌 2. Clickstream Analysis

What it does: Tracks where users click, what they view, and how long they stay.​
Example: Knowing that most people click “like” on food posts.​
Used for: Improving website/app layout, recommendations.

📌 3. Social Network Analysis (SNA)


What it does: Analyzes connections between users — who talks to who, who is
popular.​
Example: Finding influencers in a friend group.​

Techniques Used:

●​ Graph Theory: Represents social networks as graphs where nodes = users and
edges = relationships.
●​ Centrality Measures:
○​ Degree Centrality: Number of connections a user has.
○​ Betweenness Centrality: How often a user acts as a bridge in the network.
○​ Closeness Centrality: How fast a user can reach others in the network.
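A sketch of the three centrality measures, assuming networkx; the friendship edges are invented:

```python
# Degree, betweenness, and closeness centrality on a tiny network.
import networkx as nx

G = nx.Graph([("A", "B"), ("A", "C"), ("A", "D"), ("D", "E")])

print(nx.degree_centrality(G))       # A has the most connections
print(nx.betweenness_centrality(G))  # A and D act as bridges
print(nx.closeness_centrality(G))    # A reaches everyone fastest
```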

📌 4. Anomaly Detection
What it does: Finds unusual behavior, like spam or fake accounts.​
Example: A new account posting 100 links in 5 minutes.​
Used for: Detecting bots, fraud, or fake followers.

📌 5. Topic Modeling
What it does: Finds main topics or themes in lots of posts.​
Example: Grouping posts into topics like "sports," "politics," or "travel."​
Used for: Trend tracking, organizing content.

📌 6. Engagement and Interaction Analysis



What it does: Measures how people interact with posts — likes, shares, comments.​
Example: A post with 1,000 likes and 200 comments is highly engaging.​
Used for: Knowing which content works best.

📌 7. Emotion Recognition
What it does: Detects specific emotions like happiness, anger, sadness.​
Example: A comment like “I’m so angry with this service” shows anger.​
Used for: Mental health monitoring, emotional analysis.

Techniques Used:

●​ Natural Language Processing (NLP): Analyzes text-based emotions.


●​ Computer Vision: Detects facial expressions in images and videos.
●​ Audio Analysis: Recognizes emotions from voice tone and pitch.

📌 8. Virality and Trend Analysis


What it does: Studies why some posts go viral and how trends spread.​
Example: A meme shared by 10,000 people in 2 hours.​
Used for: Marketing, campaign success tracking.

📌 9. Churn Prediction
What it does: Predicts if a user is about to stop using the app or platform.​
Example: A user hasn't liked or posted anything in 10 days.​
Used for: Sending re-engagement messages or offers.

📌 10. User Profiling


What it does: Builds a profile of users based on behavior — interests, age, location.​
Example: Someone liking only fitness pages is tagged as a “fitness enthusiast.”​
Used for: Personalized ads and content.

Q5)
a)​ Discuss the techniques used for information extraction from text.

Information extraction means pulling out useful facts like names, dates, locations, and
relationships from unstructured text (plain sentences). It is used in AI, search engines,
chatbots, and data analysis.

1. Named Entity Recognition (NER)


●​ What it does? Identifies and categorizes important names, places, dates, etc.
●​ Example:
○​ Text: "Elon Musk founded Tesla in 2003 in the USA."
○​ NER Output:
■​ Person: Elon Musk
■​ Organization: Tesla
■​ Year: 2003
■​ Location: USA
●​ Used in: Chatbots, search engines, news categorization.

2. Relation Extraction
●​ What it does? Identifies relationships between words/entities in a sentence.
●​ Example:
○​ Text: "Barack Obama was born in Hawaii."
○​ Relation: (Person → Birthplace → Location)
○​ Output: (Barack Obama, born in, Hawaii)
●​ Used in: Knowledge graphs, automatic question answering.

3. Unsupervised Information Extraction


●​ What it does? Extracts information without labeled data, using clustering or pattern detection.
●​ Example:

○​ Text: "Apple released the iPhone 15 in 2023."


○​ The model automatically detects that "Apple" is a company, "iPhone 15"
is a product, and "2023" is a release year without predefined rules.
●​ Used in: Automatic fact extraction, data mining.

✨ 4. Coreference Resolution
What it does: Finds which words refer to the same thing.​
Example:​
Text: "Priya bought a car. She loves it."​
Output: “She” = Priya, “it” = car​
Used in: Chatbots, summarization, question answering.

✨ 5. Part-of-Speech Tagging (POS Tagging)


What it does: Labels each word with its grammar type — noun, verb, adjective, etc.​
Example:​
Text: "The cat sleeps."​
Output: The (Det), cat (Noun), sleeps (Verb)​
Used in: Grammar correction, sentence parsing.

✨ 6. Chunking (Shallow Parsing)


What it does: Groups words into meaningful phrases like noun phrases or verb
phrases.​
Example:​
Text: "The big brown dog"​

Output: [Noun Phrase: The big brown dog]​


Used in: Sentence structure understanding.

✨ 7. Event Extraction
What it does: Detects events or actions in a sentence.​
Example:​
Text: "India won the World Cup in 2011."​
Output: (India, won, World Cup, 2011)​
Used in: News summarization, historical data extraction.

✨ 8. Template Filling
What it does: Extracts specific data to fill predefined fields.​
Example:​
Text: "Samsung launched Galaxy S23 in 2024."​
Output (Template):

●​ Company: Samsung
●​ Product: Galaxy S23
●​ Launch Year: 2024​
Used in: Reports, product catalogs.

b)​ Explain the working of sentiment analysis systems and its applications for business intelligence

Sentiment Analysis is a technique in Natural Language Processing (NLP) that finds out the emotion or opinion in a piece of text.

It checks whether the text is Positive, Negative, or Neutral.

Example:

●​ "I love this phone!" → Positive​

●​ "The service was terrible." → Negative​

●​ "The food was okay." → Neutral​

⚙️ How Sentiment Analysis Works (Step-by-Step):


1.​ Text Collection:​
Gather text data from social media, reviews, surveys, etc.​

2.​ Preprocessing:​
Clean the text — remove emojis, stop words, punctuations.​
Example: "I loved the movie!" → "loved movie"​

3.​ Tokenization:​
Break text into words or phrases.​
Example: "bad product" → ["bad", "product"]​

4.​ Feature Extraction:​


Convert words into numbers using:​

○​ Bag of Words (BoW)​

○​ TF-IDF (Term Frequency-Inverse Document Frequency)​

○​ Word Embeddings (Word2Vec, GloVe)​

5.​ Sentiment Classification:​


Use machine learning or deep learning models to classify the text.​
Common models:​

○​ Naïve Bayes​

○​ Support Vector Machine (SVM)​

○​ LSTM, BERT (Deep Learning)​

6.​ Output Sentiment:​


The system gives the final sentiment label:​

○​ Positive
○​ Negative
○​ Neutral
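A compact sketch of steps 3-6, assuming scikit-learn; the training texts are invented:

```python
# Bag-of-words Naive Bayes sentiment classifier.
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB

train_texts = ["I love this phone", "great battery and screen",
               "terrible service", "the product is awful"]
train_labels = ["positive", "positive", "negative", "negative"]

vectorizer = CountVectorizer()                 # steps 3-4: tokenize + BoW features
X = vectorizer.fit_transform(train_texts)
model = MultinomialNB().fit(X, train_labels)   # step 5: train the classifier

test = vectorizer.transform(["I love the screen"])
print(model.predict(test))                     # step 6: ['positive']
```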

Applications of Sentiment Analysis

1.​ Customer Feedback Understanding​


→ Helps companies know what people like or don’t like about their product.​
Example: Reviews on Amazon help a company improve its product.​

2.​ Social Media Monitoring​


→ Tracks what people say about a brand on Twitter, Instagram, etc.​
Example: If many tweets are angry, the company knows there's a problem.​

3.​ Product Improvement​


→ Finds common complaints or praises to make products better.​
Example: If many people say "battery drains fast," the company can fix it.​

4.​ Marketing Strategy​


→ Helps in planning ads based on how customers feel.​
Example: If people feel positive about eco-friendly products, ads can focus on
that.​

5.​ Competitor Analysis​


→ See what people are saying about other brands.​
Example: If a competitor is getting negative reviews, your company can do
better.​

6.​ Crisis Detection​


→ Detects problems early when people suddenly post many negative
comments.​
Example: Airline gets complaints about delays—company can respond quickly.​

7.​ Sales Prediction​


→ Positive reviews = more chances of sales.​
Example: A phone with good reviews is likely to sell more.​

8.​ Chatbots and Customer Support​


→ Helps chatbots understand if the customer is happy, angry, or confused.​
Example: If user sounds angry, chatbot can offer help fast or alert a human.

Q6)
a)​ Explain in detail rule based and probabilistic classifiers for text
classification.

1. Rule-Based Classifiers

🔸 What is it?
Rule-based classifiers use "if-then" rules to classify text. These rules are manually
created or automatically generated based on patterns in the text.

🔸 How it works?
●​ The system checks for specific keywords, phrases, or patterns in the text.​

●​ If a rule matches, the text is assigned to a class.​

✅ Example:
●​ Rule:​
If a sentence contains words like “free”, “win”, “prize” → classify as spam.

Used in:

●​ Spam filters
●​ Medical text classification
●​ Simple chatbot intent detection​

🔸 Pros:

●​ Easy to understand
●​ Gives exact reasons for classification​

🔸 Cons:
●​ Hard to maintain many rules
●​ Doesn’t handle unseen or complex cases well​
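A minimal rule-based classifier sketch in plain Python, using the spam rule above; the keyword list is illustrative:

```python
# Hand-written if-then keyword rules for spam detection.
SPAM_WORDS = {"free", "win", "prize"}

def classify(text):
    words = set(text.lower().split())
    if words & SPAM_WORDS:        # rule: any spam keyword present -> spam
        return "spam"
    return "not spam"

print(classify("Win a free prize now"))   # spam
print(classify("Meeting at 10 am"))       # not spam
```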

(a) Decision Tree Classifier

Easy Definition:​
It asks a series of "yes/no" questions to decide the final class.

How it works:

●​ It builds a tree where each node checks for a feature (like a word).​

●​ Based on the answer, it moves left or right in the tree.​

●​ The final leaf node gives the predicted class.​

Example:​
If text contains “buy” → yes​
If text contains “free” → yes → spam​
Else → not spam

Used in: Email filtering, news classification

(b) Rule-Based Pattern Classifier

Easy Definition:​
It uses manually written rules or regular expressions to match patterns.

How it works:

●​ Rules like: If the text has “refund” and “delay” → label as complaint​

●​ Works well in limited domains like helpdesk or finance.​

Example:​
“If text has 'cancel' and 'booking' → class = cancellation request”

Used in: Customer service automation, chatbots

2. Probabilistic Classifiers

🔸 What is it?
Probabilistic classifiers use math and probability to predict the most likely category for
a text. The most common example is the Naive Bayes Classifier.

🔸 How it works?
●​ It learns from a training dataset with labeled examples.​

●​ It calculates the probability of each class given the words in the text.​

●​ Chooses the class with the highest probability.

Used in:

●​ Email spam detection


●​ Sentiment analysis
●​ News or topic categorization​

🔸 Pros:
●​ Works well even with small data
●​ Fast and simple to implement​

🔸 Cons:
●​ Assumes words are independent (not always true)
●​ Can struggle with sarcasm or complex language

a) Probability-Based Classifier: Naïve Bayes


●​ Concept: Uses probability to classify text based on word occurrences.
●​ Works On: Bayes’ Theorem, assuming words are independent.
●​ Example:
○​ If a document contains words like "buy," "offer," and "discount," it is more
likely to be classified as spam.
○​ If it contains "project," "research," and "development," it is likely a
technical document.
●​ Used In: Spam detection, sentiment analysis.

(b) Logistic Regression

Easy Definition:​
It calculates a score using a formula and converts it to a probability between 0 and 1.

How it works:

●​ Each word is given a weight (positive/negative impact)


●​ The total score decides the final class​

Example:​
Text: “Amazing product”​
Weights: “amazing” = +2 → High score = positive review

Used in: Review classification, product feedback analysis
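A sketch of this idea with scikit-learn; the reviews and labels are invented:

```python
# Logistic regression for review classification: one learned weight per word.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression

reviews = ["amazing product", "really amazing quality",
           "poor build", "poor and disappointing"]
labels = [1, 1, 0, 0]   # 1 = positive, 0 = negative

vec = TfidfVectorizer()
X = vec.fit_transform(reviews)
clf = LogisticRegression().fit(X, labels)   # learns a weight per word

print(clf.predict(vec.transform(["amazing quality"])))  # [1]
```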

b)​ Explain with block diagram working of web search engines and
significance of semantic indexing.

Search engines are programs that allow users to search and retrieve information
from the vast amount of content available on the internet. They use algorithms to
index and rank web pages based on relevance to a user’s query, providing a list of
results for users to explore. Popular search engines include Google, Bing, and Yahoo.

Working of a Web Search Engine

Search engines work in three main steps:

🔹 1. Crawling (Finding the content)


●​ Search engines send out bots or spiders to visit websites across the internet.​

●​ These bots follow links and collect information from web pages such as text,
titles, images, and keywords.​

●​ They also check how recently the page was updated.​

●​ Example: Googlebot visits your site and reads your pages.​

🟢 Why it's important:​


If your website is not crawled, it won’t appear in search results.

🔹 2. Indexing (Storing the content)



●​ After crawling, the content is analyzed and organized in a huge database.


●​ Important data like title, headings, keywords, and links are stored.
●​ Duplicate pages or broken pages may be removed.​

🟢 Why it's important:​


Only indexed pages can be shown in search results.

🔹 3. Ranking (Showing best results to users)


●​ When a user searches, the search engine:​

1.​ Understands the query (uses keywords and context)


2.​ Finds the best matches from its index
3.​ Ranks results based on relevance, content quality, and popularity​

Factors affecting ranking:

●​ Keyword relevance
●​ Page speed
●​ Mobile-friendliness
●​ Number of backlinks
●​ Freshness of content​

🟢 Why it's important:​


Ranking decides which websites show up on top of search results and get more
clicks.

✅ Components of a Search Engine


| Component | Function |
| --- | --- |
| 🔸 Web Crawler | Software bots (like spiders) that visit and collect data from web pages |
| 🔸 Database | Stores all the indexed data (title, content, links, etc.) |
| 🔸 Indexing System | Organizes and categorizes the content using keywords and topics |
| 🔸 Ranking Engine | Calculates scores and decides which pages appear first |
| 🔸 Search Interface | The user interface (like the Google search bar) used to enter queries |
| 🔸 Query Processor | Analyzes the user's input and fetches relevant results from the index |
| 🔸 Results Display Module | Shows results with titles, links, descriptions, and sometimes ads |

Latent Semantic Indexing (LSI) is a technique used by search engines to understand the meaning behind words in a web page. It helps find related words and topics, even if the exact keyword is not present.

For example, if the page is about "cars", LSI can understand related words like
"vehicle", "engine", "automobile", etc.

Benefits of LSI :

1. Better Search Results

LSI helps search engines understand the real topic of a page, so users get more
accurate and helpful search results.

2. Helps Avoid Keyword Stuffing

Websites don’t have to repeat the same keyword again and again. Using related words
is enough, which makes the content more natural.

3. Improves SEO Ranking

LSI keywords make the content richer and more relevant, which helps the page rank
higher in search results.

4. Increases Content Quality

Using LSI terms adds variety and depth to the content, making it more useful and
interesting for readers.

5. Understands User Intent

LSI helps search engines know what the user is really looking for, even if the exact
words are not typed.

6. Supports Voice Search


LSI helps search engines understand natural language, which is helpful for voice-based
searches like “Where can I buy cheap laptops near me?”

7. Finds Related Content Easily​


It helps users discover more useful pages that are connected to their search topic.​

8. Reduces Spammy Content​


LSI makes it harder for low-quality or fake sites to appear on top just by repeating
words.​

9. Supports Multilingual Search​


It can understand meaning even if the content is in different languages or has mixed
phrases.​

10. Improves Ad Targeting​


Search engines and ads platforms use LSI to show more relevant ads to users based
on page content.
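A sketch of the LSI idea, assuming scikit-learn: TF-IDF followed by truncated SVD (commonly called LSA/LSI); the documents are invented:

```python
# Latent semantic indexing sketch: TF-IDF + truncated SVD.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.decomposition import TruncatedSVD

docs = ["car engine repair", "automobile engine service",
        "fresh fruit market", "buy fruit and vegetables"]

X = TfidfVectorizer().fit_transform(docs)
lsi = TruncatedSVD(n_components=2, random_state=0)  # 2 latent "topics"
topics = lsi.fit_transform(X)

# Documents about cars land near each other in the latent space even
# when they share few exact keywords ("car" vs "automobile").
print(topics.round(2))
```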

Paper 2

Q1)
a)​ What is opinion mining? List the challenges of opinion mining

Opinion Mining, also called Sentiment Analysis, is a method used to find out what
people feel (positive, negative, or neutral) when they write something, like a review,
comment, or tweet.

📌 Example:​
If someone writes, “I love this phone!”, opinion mining will detect that the feeling is
positive.

It is commonly used for:

●​ Analyzing product reviews​

●​ Checking customer feedback​

●​ Understanding social media posts​

Challenges of Opinion Mining (With Simple Examples)

1.​ Sarcasm Detection​


People say the opposite of what they mean.​

🧠 Example:​
“This phone is just amazing… it only hangs 10 times a day.” (This sounds positive, but
it’s actually sarcastic and negative.)

2.​ Negation Handling​


Words like “not” can change the meaning of the sentence.​

🧠 Example:​
“This phone is not good.” → The system must understand this is negative, not
positive.

3.​ Context Understanding​


Words can have different meanings based on the sentence.​

🧠 Example:​
“This movie was dark.”

●​ In a horror movie review, “dark” might be positive.​



●​ In a kids' movie review, it might be negative.​

4.​ Multiple Meanings (Ambiguity)​


Some words mean different things in different situations.​

🧠 Example:​
“She is so cool.” → Positive​
“It’s cool outside.” → Just about temperature, neutral

5.​ Spelling and Grammar Mistakes​


People often make typos, especially online.​

🧠 Example:​
“This phone is awesum!” → Should be “awesome” (positive), but the system might not
understand it.

6.​ Emojis and Slang​


People use emojis or casual words not found in regular dictionaries.​

🧠 Example:​
“I 💖 this!” or “This phone is lit!” → These mean positive, but the system must learn
emojis and slang.

7.​ Domain-Specific Sentiment​


Same words mean different things in different industries.​

🧠 Example:​
“Long battery life”

●​ In phones: positive​

●​ In electric cars (if it means taking long to charge): could be negative​

8.​ Fake Reviews and Spam​


Some reviews are written by bots or paid users, not real customers.​

🧠 Example:​
“This is the best product ever!!! Buy now!!!” → Sounds too fake or promotional, might
not be real.

9.​ Mixed Sentiments in One Sentence​


One sentence may have both good and bad opinions.​

🧠 Example:​
“I like the screen, but the battery is terrible.”​
→ Part is positive, part is negative – hard to label it as just one.

b) What are the types of spamming techniques? Explain any two techniques in
detail.

Same as Paper 1, Q4(a): Explain various types of web spamming techniques.

c) Compare Hidden Markov Models (HMM) with Conditional Random Fields (CRF)

HMM is a statistical model used to predict a sequence of hidden (unknown) states based on what we can observe.​
It assumes that each state depends only on the previous state and produces an observable output.

📌 Example:​
Guessing whether someone is happy or sad (hidden state) based on their facial
expressions (observable data), step by step.

CRF is a machine learning model used for predicting sequences, where it looks at the
whole sentence or sequence at once to decide the best set of labels.
It considers the relationship between neighboring words and features together.

Example: Named Entity Recognition (NER) using CRF

Sentence:​
John lives in New York.

Goal: Identify names of people and places.

CRF Output (labels):

●​ John → B-PER (Beginning of Person)


●​ lives → O (Other)
●​ in → O
●​ New → B-LOC (Beginning of Location)
●​ York → I-LOC (Inside Location)​

💡 Explanation:​
CRF looks at the whole sentence to decide that "New" and "York" together form a
location, not separately. It learns the pattern and relationships between words.

| Point | Hidden Markov Model (HMM) | Conditional Random Fields (CRF) |
| --- | --- | --- |
| 1. Type | Probabilistic generative model | Probabilistic discriminative model |
| 2. Purpose | Models both observations and hidden states | Models the relationship between input and output labels |
| 3. Dependencies | Assumes each state depends only on the previous state (Markov property) | Considers the whole sequence context for predictions |
| 4. Feature Usage | Limited feature usage | Can use multiple features for better accuracy |
| 5. Transition | Uses transition and emission probabilities | Uses feature functions to define transitions |
| 6. Flexibility | Less flexible due to independence assumptions | More flexible and captures complex dependencies |
| 7. Performance | Works well for simple sequences | Works better for complex sequences like NLP tasks |
| 8. Common Uses | Speech recognition, Part-of-Speech (POS) tagging | Named Entity Recognition (NER), POS tagging |
| 9. Learning Method | Uses Maximum Likelihood Estimation (MLE) | Uses conditional probability for learning |
| 10. Accuracy | May give lower accuracy due to independence assumptions | Higher accuracy as it considers the entire sequence |
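To make the HMM side concrete, here is a toy Viterbi decoder in plain Python for the happy/sad example above; all probabilities are illustrative assumptions:

```python
# Viterbi decoding: most likely hidden mood sequence from expressions.
states = ["happy", "sad"]
start = {"happy": 0.6, "sad": 0.4}
trans = {"happy": {"happy": 0.7, "sad": 0.3},
         "sad":   {"happy": 0.4, "sad": 0.6}}
emit  = {"happy": {"smile": 0.8, "frown": 0.2},
         "sad":   {"smile": 0.1, "frown": 0.9}}

def viterbi(observations):
    # best[s] = (prob of best path ending in s, that path)
    best = {s: (start[s] * emit[s][observations[0]], [s]) for s in states}
    for obs in observations[1:]:
        best = {s: max(((p * trans[prev][s] * emit[s][obs], path + [s])
                        for prev, (p, path) in best.items()),
                       key=lambda t: t[0])
                for s in states}
    return max(best.values(), key=lambda t: t[0])[1]

print(viterbi(["smile", "smile", "frown"]))  # ['happy', 'happy', 'sad']
```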

d) Explain N-gram modeling and its applications.

Same as Paper 1, Q3(b): Explain the concept of language modeling, N-gram models, and their applications.

e) What are the challenges of social media mining?

Social Media Mining is the process of analyzing and extracting useful information
from social media platforms (like Facebook, Twitter, Instagram) to understand patterns,
behaviors, and trends

1. Big Data Complexity

●​ Definition: Social media generates a huge amount of data very quickly (e.g.,
millions of posts every second). This makes it hard to store, manage, and
analyze.​

2. Data Privacy and Security

●​ Definition: Protecting user information is difficult because there are concerns about privacy and data leaks. Laws like GDPR and CCPA ensure personal data is safe.​

3. Noisy and Unstructured Data

●​ Definition: Social media data is often messy with things like slang, typos,
emojis, and irrelevant content. It’s also full of fake accounts and spam.​

4. Dynamic and Evolving Content

●​ Definition: Social media trends change fast. It’s hard to keep up because
topics, hashtags, and popular content keep changing, requiring real-time
analysis.​

5. Misinformation and Fake News

●​ Definition: False information spreads easily on social media, causing


confusion. It’s hard to stop or detect fake news before it spreads widely.​

6. Sentiment and Context Understanding

●​ Definition: Understanding how people feel (sentiment) in their posts is hard


because of things like sarcasm, humor, or cultural differences. For example,
"This movie is sick" could mean it's good or bad.​

7. Scalability Issues

●​ Definition: Analyzing millions of posts and user interactions in real-time


requires powerful computers and distributed systems like Hadoop and
Spark.​

8. Data Integration from Multiple Sources

●​ Definition: People use different social media platforms (like Facebook,


Twitter, and Instagram). Gathering and combining all this data into one format is
challenging.​

9. Influence and Community Detection



●​ Definition: It’s difficult to find the key influencers in large networks and to
understand how communities form and change over time.​

10. Ethical and Legal Issues

●​ Definition: Using social media data for research or business has to follow laws
and be ethical, including getting user consent and ensuring data is used
responsibly.

Q2)

a)​ Explain with a block diagram Named Entity Recognition application.

NER is a technique in Natural Language Processing (NLP) that helps find and label
names of people, places, organizations, dates, etc. in text.

🔍 Example:
Input sentence:​
"Apple Inc. is located in Cupertino."

👉 NER Output:

●​ Apple Inc. → Organization​

●​ Cupertino → Location

Block Diagram:

Input Text
↓
Text Preprocessing (Tokenization, Stopword Removal)
↓
Feature Extraction (Word Embeddings, POS Tags)
↓
NER Model (Machine Learning or Deep Learning)
↓
Classified Entities (Person, Organization, Location, Date, etc.)

NER Process – Step-by-Step (Made Easy)

1.​ Input Text​


This is the raw sentence or paragraph you want to analyze.​

2.​ Text Preprocessing​
Clean and prepare the text.​
✔️ Break into sentences​
✔️ Break into words (tokenization)​
✔️ Remove unwanted words (stopwords)​

3.​ Feature Extraction​
Pull useful information from words.​
✔️ POS Tags: (e.g., Noun, Verb)​
✔️ Word Embeddings: Convert words into numbers the computer understands​
Word Embeddings: Convert words into numbers the computer understands​

4.​ NER Model (Machine Learning or Deep Learning)​


Use trained models (like CRF, RNN, Transformer) to find and tag names.​

5.​ Output: Named Entities​


You get the same sentence, but with important words labeled.​

Applications of NER

1.​ Chatbots & Virtual Assistants – Helps in understanding user queries by


recognizing names, dates, and locations.
2.​ Search Engines – Improves search accuracy by identifying key entities in user
searches.
3.​ Healthcare Industry – Extracts patient names, diseases, and medications from
medical records.
4.​ News Classification – Helps in categorizing news articles based on people,
places, and organizations.
5.​ Fraud Detection – Identifies suspicious activities by analyzing financial
transactions.
6.​ Resume Parsing​
Extracts candidate names, skills, and experience from resumes automatically.​

7.​ Legal Document Analysis​


Finds dates, case names, and involved people in legal papers.​

8.​ Social Media Monitoring​


Tracks brands and places people mention in tweets or posts.​

9.​ E-commerce​
Helps identify product names, brands, and prices in reviews and product listings.​

10.​Email Filtering​
Tags important names, dates, or companies in business emails for better
organization.​

NER Methods – Easy Explanation

1.​ Lexicon-Based Method​


Uses a dictionary of names/terms to match with the text.​
Not used much — needs regular updates.​

2.​ Rule-Based Method​


Uses if-then rules and patterns to find entities.​

●​ Pattern-based: Looks at word forms (e.g., capital letters).​

●​ Context-based: Looks at surrounding words.​

3.​ Machine Learning-Based Method​


Trains a model using labeled examples.​


✔ Learns from data​
Needs lots of labeled data and context understanding.​

4.​ Deep Learning-Based Method​


Most powerful method.​
✔ Uses word embeddings to understand word meaning and relationships​
✔ Learns automatically from large data​
✔ Best for handling complex sentences
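A sketch of NER in practice, assuming the spaCy library and its small English model (python -m spacy download en_core_web_sm); the printed labels depend on the model:

```python
# Named Entity Recognition with spaCy's pretrained pipeline.
import spacy

nlp = spacy.load("en_core_web_sm")
doc = nlp("Apple Inc. is located in Cupertino.")

for ent in doc.ents:
    print(ent.text, "->", ent.label_)
# Expected (model-dependent):
#   Apple Inc. -> ORG
#   Cupertino -> GPE
```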

b) Discuss any two Distance-based Clustering Algorithms.

Clustering is a method of grouping similar data points together. Distance-based clustering uses distance (like how far two points are) to decide if they belong in the same group. Two common distance-based clustering algorithms are:

1. K-Means Clustering

Concept:​
K-Means groups data into K clusters based on centroids (center points).​
Each data point (e.g., a word, phrase, or user) is assigned to the nearest centroid. The
centroids are updated by averaging the data points in each cluster.

Steps:

1.​ Choose the number of clusters (K).​

2.​ Pick K random points as starting centroids.​

3.​ Assign each data point to the nearest centroid.​

4.​ Update the centroid based on the average of points in the cluster.​

5.​ Repeat until centroids stop changing.​

Distance Used:​
Usually Euclidean distance (straight-line distance).

Example:​
Imagine students grouped by height and weight. K-Means will cluster students with
similar height and weight together.

Application (Words and Phrases):

●​ Words/phrases can be turned into vectors using methods like TF-IDF or Word2Vec.​

●​ K-Means can then group similar words (e.g., {apple, banana, mango} in one
group and {car, bus, bike} in another).​

●​ Used in topic modeling, document grouping, and customer segmentation.​

2. Hierarchical Clustering

Concept:​
It builds a tree-like structure (called a dendrogram) to show how data points (like
documents) are related.

●​ Agglomerative (Bottom-Up): Start with each item as its own group → merge
the most similar groups until one big group remains.​

●​ Divisive (Top-Down): Start with one big group → keep splitting based on
similarity.​

Distance Used:

●​ Single Linkage: Minimum distance between groups.​

●​ Complete Linkage: Maximum distance between groups.​

●​ Average Linkage: Average distance between all pairs.​

●​ KL Divergence: Used for comparing documents based on word distributions.​

Example:​
Think of grouping news articles. Start with each article separately, then merge similar
ones into clusters like Sports, Politics, Technology, etc.

Application (Documents):

●​ Used in document clustering, topic discovery, and organizing libraries or archives.​

●​ Documents are represented as word probabilities (e.g., using LDA), and then
grouped based on how similar their word usage is.​

Conclusion:

●​ K-Means is fast and works well when you know the number of clusters.​

●​ Hierarchical Clustering is better when you want to see the relationship between
clusters in a tree form.​
Both are useful in text mining, document grouping, and language
applications.

Q3)

a)​ What is Latent Semantic Indexing? What are the benefits of Latent
Semantic Indexing?

Same as Paper 1, Q6(b): significance of Latent Semantic Indexing.

b)​ Explain the working of opinion spam detection application

What is Opinion Spam?

Opinion spam means fake or misleading reviews written to fool people.​


There are two types:

●​ Fake positive reviews – to make a bad product look good.​

●​ Fake negative reviews – to harm a competitor’s product.​

Why Is Opinion Spam Detection Important?



●​ Helps customers trust online reviews.​

●​ Stops businesses from cheating or getting unfair reviews.​

●​ Improves the quality of product and service recommendations.​

How Does Opinion Spam Detection Work?

1.​ Collect Reviews​

○​ Reviews are collected from websites like Amazon, Flipkart, Yelp,


TripAdvisor, etc.​

2.​ Text Analysis​

○​ Analyze the text of reviews.​

○​ Look for signs like:​

■​ Repeated phrases across multiple reviews.​

■​ Too many extreme words ("awesome", "worst", "excellent").​

■​ Poor grammar or unnatural patterns.​

3.​ User Behavior Analysis​

○​ Track how users post.​

○​ Suspicious behavior includes:​

■​ Posting many reviews quickly.​

■​ Reviewing similar products again and again.​

■​ Users with no purchase history.​



4.​ Sentiment Analysis​

○​ Check if the emotions or opinions in the review match with other real
reviews.​

○​ For example, if most people say a product is bad and one review says it’s
amazing, it might be fake.​

5.​ Machine Learning Models​

○​ Use models like:​

■​ Naïve Bayes​

■​ Decision Trees​

■​ Neural Networks​

○​ These models learn patterns in fake vs real reviews and automatically


classify them.​

6.​ Anomaly Detection​

○​ Catch unusual patterns, like:​

■​ All 5-star reviews from new accounts.​

■​ Too many reviews for a product in a short time.​

■​ Review written before product launch.​

Where Is It Used?

●​ E-commerce sites – Amazon, Flipkart​

●​ Hotel and food reviews – TripAdvisor, Zomato​



●​ App Stores – Google Play Store, Apple App Store​

●​ Product comparison websites​

Benefits:

●​ Builds customer trust​

●​ Stops review fraud​

●​ Helps users make better buying decisions​

●​ Helps businesses maintain fair competition​

Q4)

a)​ Write a short note on the K-NN classifier.

K-NN (K-Nearest Neighbors) is a simple machine learning algorithm that classifies data based on similarity. It finds the K closest data points and assigns the most common category.

✅ Basic Idea (With Example)


Imagine you have dots on a graph—some are red (apples) and some are blue
(oranges).​
Now a new dot appears. To decide if it’s red or blue, K-NN does this:

●​ Looks at the K nearest dots.​

●​ If most of them are red, it labels the new dot as red (apple).​

●​ If most are blue, it labels it blue (orange).​



✅ How Does It Work?


1.​ Choose a number ‘K’ (e.g., K = 3 or K = 5).​

2.​ Measure the distance between the new point and all existing data points
(usually using Euclidean distance).​

3.​ Find the K closest neighbors (based on distance).​

4.​ Check the category (label) of these neighbors.​

5.​ Assign the new point to the most common label among them.​

✅ Example:
Suppose we have data about fruits like:

●​ Apples: small, red​

●​ Oranges: medium, orange​


Now, a new fruit comes in with medium size and reddish color.​
K-NN checks the closest fruits in the dataset.​
If more are apples, it labels the new fruit as an apple.​

✅ Applications of K-NN:
●​ Spam Email Detection (spam or not spam)​

●​ Handwriting Recognition (e.g., recognizing digits from images)​

●​ Recommender Systems (suggesting similar movies, products, etc.)​

●​ Medical Diagnosis (e.g., classifying diseases based on symptoms)​



✅ Advantages:
●​ Easy to understand and implement​

●​ No training step – just stores the data​

●​ Works well with small datasets​

●​ Makes predictions based on real examples, not assumptions​

✅ Limitations:
●​ Slow for large datasets (because it checks distance from every point)​

●​ Choice of K is important – too small can be noisy, too large can mix categories​

●​ Doesn’t work well if data is not properly scaled​

✅ Key Points:
●​ K-NN is a lazy learner – it doesn’t build a model in advance.​

●​ It depends on distance calculation and majority voting.​

●​ Best used when you have labeled data and need to make simple predictions.​

b) Explain different data sources and the web usage mining process in detail.

Web Usage Mining is the process of analyzing how users behave on a website.​
It helps website owners understand:

●​ What pages users visit​



●​ How long they stay​

●​ What they click on​


This helps in improving website design, content, and user experience.

Different Data Sources for Web Usage Mining

1.​ Web Server Logs:​


These are files that record everything a user does on a website, such as which pages they visit and at what time. They help identify the most visited pages.​

2.​ Client-side Data:​


Collected from the user’s browser using cookies or JavaScript. It tracks clicks,
scrolling, and how long users stay on a page.​

3.​ Proxy Server Logs:​


These come from proxy servers (used in schools, offices, etc.). They record user
activity even if the page is loaded from cache.​

4.​ Database Data:​


This includes user information like login details, profiles, and shopping history
stored in the website’s database.​

5.​ Clickstream Data:​


It shows the step-by-step path a user follows by clicking from one page to
another on the website.​

6.​ Cookies & Sessions:​


Cookies store user information to track returning users and their behavior during
visits.​

7.​ User Profiles & Registration Data:​


These are details given by users when they sign up, like name, age, and
interests. It helps in giving personalized content.

Steps in Web Usage Mining



🔹 1. Data Preprocessing
(Clean and prepare the data)

●​ Raw data is messy (with errors, duplicates).​

●​ We remove unwanted data.​


●​ Identify who the user is, when they came, and what they did.​
Example: Removing repeated page visits or missing values.​

🔹 2. Pattern Discovery (Data Modeling)


(Find user behavior patterns)

●​ After cleaning, we study how users behave.​


●​ Make patterns/models to show which pages they visit and in what order.​
Example: Many users go from "Home" → "Product" → "Cart".​

🔹 3. Session and Visitor Analysis


(Study each user's visit)

●​ A session means one visit to the website.​


●​ We check how long users stayed, what they clicked, and if they came back.​
Example: A user stays for 10 minutes and views 5 pages.​

🔹 4. Clustering / Grouping Users


(Group similar users together)

●​ Put users in groups based on behavior.​


●​ Helpful in giving personalized content or ads.​
Example: Group A – people who buy often, Group B – people who just
browse.​

🔹 5. Association and Correlation


(Find page connections)

●​ Find out which pages are visited together.​


●​ Shows what users like to do.​
Example: Users who visit “Mobile Phones” also visit “Accessories”.​

🔹 6. Sequential Pattern Analysis


(Find visit order)

●​ Study the step-by-step path users follow.​


●​ Helps improve navigation.​
Example: Most users follow this path: Homepage → Search → Product →
Checkout.

7. Classification and Prediction

This step predicts user behavior based on past activity. It helps in identifying user
interests and future actions.

Classification techniques predict user behavior based on past actions.

●​ Decision Trees: Classifies users into categories (e.g., buyers vs. non-buyers).
●​ Neural Networks 🧠: Identifies complex patterns in user behavior.
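
As a concrete illustration of steps 1 and 3 above, here is a minimal sketch of session identification from raw (user, timestamp, page) log records; the 30-minute inactivity timeout and the toy log entries are assumptions.

```python
# A minimal sessionization sketch: a new session starts after 30 idle minutes.
from datetime import datetime, timedelta

log = [
    ("u1", datetime(2024, 1, 1, 10, 0), "/home"),
    ("u1", datetime(2024, 1, 1, 10, 5), "/product"),
    ("u1", datetime(2024, 1, 1, 12, 0), "/home"),   # gap > 30 min -> new session
]

TIMEOUT = timedelta(minutes=30)
sessions = {}  # user -> list of sessions; each session is a list of (time, page)

for user, ts, page in sorted(log, key=lambda r: (r[0], r[1])):
    user_sessions = sessions.setdefault(user, [])
    # Start a new session if this is the first hit or the idle gap is too long.
    if not user_sessions or ts - user_sessions[-1][-1][0] > TIMEOUT:
        user_sessions.append([])
    user_sessions[-1].append((ts, page))

print(sessions)  # u1 gets two sessions: one with 2 pages, one with 1 page
```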

Q5)
a)​ Explain feature selection techniques for text document classification.

In text classification, documents have lots of words (features).​


Not all words are important.​
Feature selection means picking only the most useful words for classification (like
"virus", "discount", "offer", etc.)​
This makes the model faster and more accurate.

✅ Why Use Feature Selection?


●​ Removes unwanted or repeated words​

●​ Reduces training time​

●​ Increases accuracy​

●​ Helps avoid overfitting​

✏️ Common Feature Selection Techniques:


1. Term Frequency (TF)

●​ Count how many times a word appears in a document.​


●​ More frequent = more important (sometimes).​
Example: In a sports article, the word "goal" may appear often → useful
word.​

2. Document Frequency (DF)

●​ Count how many documents contain the word.​


●​ If a word appears in almost all documents (like "the", "is"), it may not be helpful.​
Example: "Laptop" appears in tech articles only → good feature.​

3. TF-IDF (Term Frequency - Inverse Document Frequency)

●​ Combines both TF and DF.​

●​ Gives a high score to words that are frequent in one document but rare in others.​
Example: In a document about “Diabetes,” words like “insulin” get a high TF-IDF score.​

4. Chi-Square Test (χ²)

●​ Checks if a word is related to a class (category).​


●​ Measures how much a word and category are dependent.​
Example: Word “complaint” appears mostly in “negative reviews” → selected.​

5. Information Gain (IG)

●​ Measures how much a word helps in predicting the category.​


●​ High IG = Good feature​
Example: The word “free” may help separate spam emails from normal ones.​

6. Mutual Information (MI)

●​ Similar to Information Gain.​


●​ Measures how much knowing a word helps in knowing the category.​
Example: Word “refund” → very useful in identifying complaints.​

✅ Summary Table

| Technique        | What It Does                            | Example Word Use                |
|------------------|-----------------------------------------|---------------------------------|
| TF               | Counts word frequency                   | “Goal” in sports news           |
| DF               | Counts how many docs contain the word   | “Laptop” in tech articles       |
| TF-IDF           | High for rare but important words       | “Insulin” in medical text       |
| Chi-Square       | Measures word-category relationship     | “Complaint” in negative reviews |
| Information Gain | Finds most useful words for prediction  | “Free” in spam emails           |
| Mutual Info      | Measures word + class dependency        | “Refund” in complaints          |
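
A minimal sketch combining two of these techniques, TF-IDF weighting and the Chi-Square test, with scikit-learn; the four-document corpus and its labels are illustrative only.

```python
# TF-IDF features + chi-square feature selection (toy corpus).
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.feature_selection import SelectKBest, chi2

docs = ["the insulin dose was adjusted", "the goal was scored late",
        "insulin controls blood sugar", "a late goal won the match"]
labels = ["medical", "sports", "medical", "sports"]

vec = TfidfVectorizer(stop_words="english")
X = vec.fit_transform(docs)                # TF-IDF matrix: docs x words

selector = SelectKBest(chi2, k=3)          # keep the 3 most class-dependent words
selector.fit(X, labels)

mask = selector.get_support()
print([w for w, keep in zip(vec.get_feature_names_out(), mask) if keep])
```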

b)​ What are the different types of social media graphs? Explain recommendation using social context in detail.

Types of Social Media Graphs:

●​ Friendship (social) graph – undirected connections between users, e.g., Facebook friends.​

●​ Follower graph – directed connections where one user follows another, e.g., Twitter follows.​

●​ Interaction graph – edges formed by likes, comments, shares, or messages between users.​

●​ Interest (content) graph – links between users and the items (pages, posts, products) they engage with.​

Recommendation Using Social Context

💡 What is It?
Recommendation using social context means giving suggestions to users (like movies,
products, or friends) based on their social connections — like friends, followers, likes,
or group behavior.

Instead of just looking at user preferences, it also uses who the user knows and
interacts with.

📌 Example:
If your friend liked a movie, there’s a high chance you might like it too.​
So the system will recommend that movie to you — using your social connection.

🔹 How It Works (Simple Steps):


1.​ Collect Social Data​
→ User’s friends, followers, likes, group memberships, etc.​

✅ Example: Facebook friends or Twitter follows.​


2.​ Track User Activities​
→ What users and their friends are liking, watching, buying, or rating.​

3.​ Analyze Social Influence​


→ If many of your friends like the same thing, it becomes more likely to be
shown to you.​

4.​ Make Recommendations​


→ Suggest items (movies, posts, products) that are popular in your social circle.​
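
A minimal sketch of this idea, with toy friendship and like data assumed: recommend items that a user's friends liked but the user has not seen yet.

```python
# Social-context recommendation sketch: one vote per friend who liked an item.
from collections import Counter

friends = {"alice": ["bob", "carol"], "bob": ["alice"]}
likes = {"alice": {"Inception"},
         "bob": {"Inception", "Dune"},
         "carol": {"Dune", "Up"}}

def recommend(user, top_n=2):
    votes = Counter()
    for friend in friends.get(user, []):
        for item in likes.get(friend, set()):
            if item not in likes.get(user, set()):  # skip items already seen
                votes[item] += 1
    return [item for item, _ in votes.most_common(top_n)]

print(recommend("alice"))  # ['Dune', 'Up'] -> 'Dune' liked by two friends
```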

✅ Benefits:
●​ More personalized suggestions​

●​ Helps in cold start problem (when a new user has no data)​

●​ Adds trust (people trust recommendations from friends)​

✅ Real-Life Examples:
●​ Netflix: Shows “Trending among your friends”​

●​ Facebook/Instagram: Suggests pages or people your friends follow​

●​ Amazon: “People you follow bought this”​

Final Line:

Recommendation using social context makes suggestions smarter and more personal
by using not just your choices, but also the influence of your friends and social group.

Q6)
a)​ Explain the working of web search engines.
Same as “Explain with block diagram the working of a web search engine” (answered earlier in this document).

b)​ Explain the supervised techniques of sentiment classification.


Or Explain algorithms for text mining.

Sentiment classification means finding whether a given text (like a tweet or review) shows a positive, negative, or neutral feeling.

Supervised techniques use labeled data — where each text already has a known
sentiment — to train a model that can predict the sentiment of new text.

🔹 Steps in Supervised Sentiment Classification:


1.​ Collect Labeled Data​
→ Example:​

○​ "This phone is amazing!" → Positive​

○​ "Worst service ever." → Negative​

2.​ Text Preprocessing​


→ Remove unwanted things like punctuation, stopwords, and convert text into a
suitable format.​
(Example: turning "I love it!" into ["love"])​

3.​ Feature Extraction​


→ Convert words into numbers using:​

○​ Bag of Words​

○​ TF-IDF​

○​ Word Embeddings (Word2Vec, etc.)​

4.​ Train a Machine Learning Model​


→ Use labeled data to train a model that learns the pattern of positive/negative
words.​
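
A minimal sketch of steps 1-4 with scikit-learn, assuming a tiny hand-labeled set; here TfidfVectorizer handles preprocessing and feature extraction in one step.

```python
# Supervised sentiment classification sketch (toy labeled data).
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

texts = ["This phone is amazing!", "Worst service ever.",
         "I love the camera", "Terrible battery life"]
labels = ["positive", "negative", "positive", "negative"]

# Lowercasing + stop word removal + TF-IDF features, then a classifier.
clf = make_pipeline(TfidfVectorizer(lowercase=True, stop_words="english"),
                    LogisticRegression())
clf.fit(texts, labels)

print(clf.predict(["The screen is amazing"]))  # likely ['positive']
```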

Supervised Techniques of Sentiment Classification

Supervised learning means we use pre-labeled data (texts that are already marked as
Positive, Negative, or Neutral) to train a model that can predict the sentiment of new
text.

🔹 1. Naïve Bayes Classifier


●​ Based on probability and Bayes' Theorem.​

●​ Assumes that all words are independent of each other (which is not always true,
but it works well for text).​

●​ Counts how often each word appears in positive and negative reviews.​

●​ Then uses that count to predict the class of a new review.​

✅ Advantages:
●​ Fast and simple.​

●​ Works well even with less training data.​

●​ Often used in spam detection, sentiment analysis.​

📌 Example:​
If the words “awesome”, “great”, “love” appear mostly in positive texts, then a review
with those words will likely be predicted as positive.

🔹 2. Support Vector Machine (SVM)


●​ SVM tries to draw the best boundary between positive and negative texts in a
high-dimensional space.​

●​ It uses the most important examples (support vectors) to decide this boundary.​

●​ Good for large feature spaces, like text where there are thousands of words.​

✅ Advantages:

●​ Very accurate.​

●​ Works well with high-dimensional text data.​

📌 Example:​
A tweet is converted into numbers (features), and SVM draws a line that separates
positive and negative tweets based on these features.

🔹 3. Logistic Regression
●​ A type of regression that predicts the probability of an output class (like positive or negative).​

●​ It uses a sigmoid function, σ(z) = 1 / (1 + e^(−z)), to convert the result into a value between 0 and 1.​

●​ Often used when you want to predict binary outcomes (yes/no, true/false,
positive/negative).​

✅ Advantages:
●​ Simple to implement.​

●​ Gives probability, not just class label.​

●​ Works well with clean data.​

📌 Example:​
If a review contains many positive words, the model might output 0.85, meaning there’s
an 85% chance the review is positive.

🔹 4. Decision Tree
●​ Uses a tree-like structure to make decisions.​

●​ At each step, it asks a question like:​


"Does the review contain the word ‘bad’?"​
If yes, go left; if no, go right.​

●​ Finally, it reaches a leaf node that gives the class.​

✅ Advantages:
●​ Easy to understand and visualize.​

●​ Handles both numerical and text features.​

📌 Disadvantage:
●​ May overfit (memorize the training data and perform poorly on new data).​

📌 Example:​
If the text contains the word "excellent", go one way; if it contains "terrible", go another
way.

🔹 5. Neural Networks
●​ Made of layers of neurons (input, hidden, and output layers).​

●​ It can learn complex patterns in text data.​

●​ Popular networks include RNN (Recurrent Neural Networks) and LSTM (Long
Short-Term Memory) for text tasks.​

✅ Advantages:
●​ Very powerful and accurate.​

●​ Learns deep patterns that simpler models miss.​

📌 Disadvantages:

●​ Needs more data and time to train.​

●​ Harder to understand (black-box model).​

📌 Example:​
Given a review like “I hated the movie, but the ending was great”, a neural network can
understand the mixed tone and possibly classify it as neutral or slightly positive.

Other important topics

Q1) Explain the following processes in Text Preprocessing:

Tokenization, Stemming, Stop Word Removal, NER

Text preprocessing prepares raw text for analysis in Natural Language Processing
(NLP). Below are four key steps:

1. Tokenization (Splitting text into words or sentences)


●​ Definition: Breaking a sentence into smaller parts (tokens).
●​ Example:
○​ Input: "I love coding!"
○​ Output: ["I", "love", "coding", "!"]

2. Stemming (Reducing words to their root form)


●​ Definition: Cutting words to their base form by removing endings.
●​ Example:
○​ Input: "running", "runner", "runs"
○​ Output: "run"

3. Stop Word Removal (Removing common words that add little meaning)
●​ Definition: Removing words like "is", "the", "and" to keep important words only.
●​ Example:
○​ Input: "I am learning machine learning"
○​ Output: "learning machine learning"

4. Named Entity Recognition (NER) (Identifying important names/places)


●​ Definition: Detecting names of people, places, organizations, etc.
●​ Example:
○​ Input: "Elon Musk founded Tesla in the USA."

○​ Output: {Person: "Elon Musk", Organization: "Tesla", Location: "USA"}

These steps clean and structure text data, making it ready for machine learning and
NLP tasks like chatbots, search engines, and sentiment analysis!
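
A minimal sketch of all four steps; it assumes the NLTK data packages (punkt, stopwords) and the spaCy model en_core_web_sm have been downloaded beforehand.

```python
# Text preprocessing sketch: tokenize, stem, remove stop words, run NER.
# Assumed one-time setup:
#   nltk.download("punkt"); nltk.download("stopwords")
#   python -m spacy download en_core_web_sm
from nltk.tokenize import word_tokenize
from nltk.stem import PorterStemmer
from nltk.corpus import stopwords
import spacy

text = "Elon Musk founded Tesla in the USA."

tokens = word_tokenize(text)                              # 1. Tokenization
stemmer = PorterStemmer()
stems = [stemmer.stem(t) for t in tokens]                 # 2. Stemming
stop = set(stopwords.words("english"))
filtered = [t for t in tokens if t.lower() not in stop]   # 3. Stop word removal
entities = [(e.text, e.label_)                            # 4. NER
            for e in spacy.load("en_core_web_sm")(text).ents]

print(tokens, stems, filtered, entities, sep="\n")
```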

Q2) Word and phrase based clustering, Probabilistic clustering

Clustering means grouping similar things together. In text mining, we group similar
words, phrases, or documents. This helps organize large text data without needing
labels.

🔹 1. Word and Phrase Based Clustering


This method groups similar words or phrases based on their meaning or how often
they appear together.

💡 Key Idea:
If words or phrases often appear in the same context, they are similar and grouped into
the same cluster.

✅ Example:
●​ Words like "car", "bike", "truck" can be in one vehicle cluster.​

●​ Words like "laptop", "mouse", "keyboard" in one electronics cluster.​

●​ Phrases like "make a call", "send a message" → communication cluster.​

✅ Used in:
●​ Creating word clouds​

●​ Topic modeling​

●​ Making text summaries​



✅ Benefits:
●​ Helps understand hidden topics in large text.​

●​ Makes search engines and chatbots more accurate.​

🔹 2. Probabilistic Clustering (e.g., LDA – Latent Dirichlet Allocation)


This is a more advanced method.​
Instead of putting a document in just one group, it says a document can belong to
multiple clusters with some probability.

💡 Key Idea:
Each document is a mixture of topics, and each topic is a mixture of words.

✅ Example:
●​ A news article might be:​

○​ 70% Politics​

○​ 30% Sports​

●​ Another article could be:​

○​ 60% Technology​

○​ 40% Business​

So it gives probability scores for each topic.

✅ Algorithm Used:
●​ LDA (Latent Dirichlet Allocation) is the most famous algorithm for probabilistic
clustering.​

✅ Used in:

●​ Topic detection in news and articles.​

●​ Recommender systems.​

●​ Summarizing large documents.​

✅ Benefits:
●​ Gives a more flexible and realistic view.​

●​ Works well even if topics are mixed in one document.
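
A minimal LDA sketch with scikit-learn; the four-document corpus and the choice of 2 topics are assumptions, and real use needs far more text.

```python
# Probabilistic clustering sketch: each document gets a topic mixture.
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.decomposition import LatentDirichletAllocation

docs = ["the election and the parliament vote",
        "the team won the football match",
        "voters chose a new government",
        "the striker scored in the match"]

vec = CountVectorizer(stop_words="english")
X = vec.fit_transform(docs)

lda = LatentDirichletAllocation(n_components=2, random_state=0)
doc_topics = lda.fit_transform(X)   # rows: per-document topic probabilities

print(doc_topics.round(2))          # each row sums to 1 -> mixed membership
```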

Q3) Decision tree classifiers, Rule-based classifiers, Probabilistic-based classifiers, Proximity-based classifiers

🔹 1. Decision Tree Classifier


📘 What is it?
A decision tree is like a flowchart where each question splits the data into smaller
groups until a final decision is made.

📌 Example:
To classify emails as "spam" or "not spam":

●​ Is there a suspicious link?​


➝ Yes → Spam​
➝ No → Next question​

●​ Does it have too many capital letters?​


➝ Yes → Spam​
➝ No → Not Spam​

✅ Key Points:
●​ Easy to understand and draw.​

●​ Uses conditions (if-else) at each step.​

●​ Fast and useful for small to medium datasets.​

🔹 2. Rule-Based Classifier
📘 What is it?
This classifier uses IF-THEN rules to classify data. Rules are made using expert
knowledge or learned from data.

📌 Example:
●​ IF a review contains the word "excellent", THEN it is positive.​

●​ IF a review contains "bad" and "never again", THEN it is negative.​

✅ Key Points:
●​ Easy to write and understand.​

●​ Good when rules are clear.​

●​ Can be combined with other models.​

🔹 3. Probabilistic Classifier
📘 What is it?
It uses probability to classify items. It calculates the chance that the data belongs to a
certain class.

📌 Example:
Using Naive Bayes:

●​ Email = "Buy now and win!"​

●​ It calculates:​

○​ P(Spam | email content)​

○​ P(Not Spam | email content)​

●​ Chooses the class with the higher probability.​

✅ Key Points:
●​ Simple but powerful.​

●​ Works well with text (e.g., spam detection).​

●​ Assumes features are independent (in Naive Bayes).​

🔹 4. Proximity-Based Classifier
📘 What is it?
Also called distance-based. It classifies data based on how close (similar) it is to other
data points.

📌 Example:
Using K-Nearest Neighbors (K-NN):

●​ A new movie review comes in.​

●​ The algorithm checks the K most similar reviews.​

●​ If most of them are positive → new review is also positive.​

✅ Key Points:

●​ No training needed (“lazy learner”).​

●​ Easy to implement.​

●​ Slower on large data.​

🔹 5. Rule-Based (again, if asked differently)


Sometimes "Rule-Based" might be asked again as a broader category.

📘 Just repeat key points if needed:


●​ Works using simple IF-THEN rules.​

●​ Example: IF user clicks frequently on mobiles, THEN show more mobile ads.​

●​ Easy to explain.​

●​ Used in expert systems and recommender systems.

Q4) Markov random fields, Inverted indices and compression in web mining

🔹 1. Markov Random Fields (MRF)


📘 What is it?
Markov Random Fields are probabilistic models used to find relationships between
nearby elements like words in a document or pixels in an image.

📌 Example in Web Mining:


In text classification, MRFs help by considering not just a word, but also its
neighboring words.​
This improves context understanding, especially in search engines and document
ranking.

✅ Key Points:

●​ Works with probability and neighboring data.​

●​ Good for text mining and image analysis.​

●​ Captures the context better than simple models.​

●​ Useful in tasks like document classification and information retrieval.​

🔹 2. Inverted Indices
📘 What is it?
An inverted index is like a book index, but for search engines.​
It stores which words appear in which documents.

📌 Example:
If you search for the word "apple", the search engine looks in the inverted index to find
all documents containing "apple".

✅ Key Points:
●​ Used in search engines like Google.​

●​ Fast way to find documents with specific keywords.​

●​ Stores data as:​


"word" → [list of document IDs where the word appears]​

●​ Helps in fast and accurate search results.​
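
A minimal sketch of building an inverted index in Python; the three documents are toy data.

```python
# Inverted index sketch: word -> set of document IDs containing it.
from collections import defaultdict

docs = {1: "apple pie recipe", 2: "apple laptop review", 3: "pie chart tutorial"}

index = defaultdict(set)
for doc_id, text in docs.items():
    for word in text.split():
        index[word].add(doc_id)

print(sorted(index["apple"]))  # [1, 2] -> documents containing 'apple'
```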

🔹 3. Compression
📘 What is it?
Compression means reducing the size of data so it takes less space and loads faster.

📌 Example in Web Mining:


When storing huge web data like logs or documents, compression helps in saving
space and speeding up processing.

✅ Key Points:
●​ Saves storage space and bandwidth.​

●​ Makes data transfer and mining faster.​

●​ Common techniques: Huffman coding, Run-length encoding.​

●​ Helps in processing big web data efficiently.
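
A minimal sketch of run-length encoding, one of the techniques named above; production systems typically use stronger schemes such as Huffman coding, but the idea of shrinking repeated data is the same.

```python
# Run-length encoding sketch: store each character with its run length.
def rle_encode(s: str) -> str:
    out, i = [], 0
    while i < len(s):
        j = i
        while j < len(s) and s[j] == s[i]:
            j += 1                       # extend the run of identical chars
        out.append(f"{s[i]}{j - i}")     # e.g. 'aaaa' -> 'a4'
        i = j
    return "".join(out)

print(rle_encode("aaaabbbcc"))  # 'a4b3c2' -> 9 characters stored as 6
```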

Q5) Meta search - Using similarity scores, Rank positions

A Meta Search Engine does not search the web directly.​


Instead, it sends your query to multiple search engines (like Google, Bing, Yahoo),
collects their results, and then combines them into a single list.

📌 Example: Dogpile, DuckDuckGo (partially), StartPage.


🔹 Why Use Meta Search?
●​ To get better and more diverse results.​

●​ Different search engines may rank pages differently.​

●​ Combining them gives a more complete answer.​

🔹 How Results Are Combined?


Meta search engines use two key methods to merge results:

1️⃣ Using Similarity Scores

●​ Each search engine returns a similarity score (relevance score) for every
webpage.​

●​ This score shows how closely the page matches the search query.​

●​ Meta search adds or averages the scores from different engines.​

●​ Pages with higher total scores are ranked higher in the final result.​

📌 Example:​
If a page gets 0.8 from Google and 0.7 from Bing → average score = 0.75 → higher
rank.

2️⃣ Using Rank Positions

●​ Each engine ranks pages (1st, 2nd, 3rd, etc.).​

●​ Meta search looks at the rank of each result from different engines.​

●​ It assigns points based on position. For example:​

○​ Rank 1 = 10 points​

○​ Rank 2 = 9 points​

○​ Rank 3 = 8 points, etc.​

●​ The total score of a page is calculated by adding its points from all engines.​

●​ Then results are re-ranked based on total points.​

📌 Example:​
If a webpage is ranked 1st by Bing (10 points) and 3rd by Google (8 points), it gets 18
points → high final rank.
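
A minimal sketch of both fusion methods; the engine names, page IDs, scores, and rankings are toy values for illustration.

```python
# Meta search fusion sketch: score averaging and rank-to-points conversion.
from collections import defaultdict

# Method 1: average the similarity scores each engine gives a page.
scores = {"google": {"page_a": 0.8, "page_b": 0.6},
          "bing":   {"page_a": 0.7, "page_b": 0.5}}
avg = {page: sum(eng[page] for eng in scores.values()) / len(scores)
       for page in scores["google"]}
print(avg)  # {'page_a': 0.75, 'page_b': 0.55} -> page_a ranked first

# Method 2: convert rank positions into points (rank 1 = 10, rank 2 = 9, ...).
ranks = {"google": ["page_a", "page_b"], "bing": ["page_a", "page_b"]}
points = defaultdict(int)
for ranking in ranks.values():
    for position, page in enumerate(ranking):
        points[page] += 10 - position
print(dict(points))  # {'page_a': 20, 'page_b': 18} -> page_a wins again
```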

🔚 Conclusion:
Meta search improves search quality by combining results from many search engines.​
It uses similarity scores (how relevant a page is) and rank positions (page order in
results) to give better and more complete answers to users.

Q6) Explain Techniques to combat spam in web spamming

Web spamming means cheating search engines to get higher rankings by using fake
or unfair techniques, like keyword stuffing or link farms.​
Spammers try to bring low-quality pages to the top of search results.

🔧 Techniques to Combat Web Spam:


1️⃣ Content-Based Filtering

●​ This checks the text and content of a webpage.​

●​ It removes pages with:​

○​ Too many repeated keywords (keyword stuffing).​

○​ Hidden or irrelevant content.​

●​ Uses machine learning to find common patterns in spam content.​

📌 Example: A page repeating "cheap shoes" 100 times is marked as spam.


2️⃣ Link-Based Techniques

●​ Spammers create fake links (link farms) to increase popularity.​

●​ These methods help detect such tricks:​

a) PageRank Algorithm

●​ It checks the quality of links.​



●​ If a page is linked by many trusted pages → it's considered genuine.​

●​ Low-quality link networks are ignored.​

b) TrustRank Algorithm

●​ Starts with a small set of trusted pages.​

●​ Then finds pages linked from them.​

●​ Pages far from trusted pages may be spam.​

3️⃣ Machine Learning Classifiers

●​ Algorithms like Naive Bayes, Decision Trees, and SVM are trained to detect
spam.​

●​ They use features like:​

○​ Word frequency​

○​ Number of links​

○​ Length of content​

📌 Used by search engines to automatically classify pages as spam or not.


4️⃣ Blacklist Filtering

●​ Known spam domains or websites are blacklisted.​

●​ If a URL matches the list, it is blocked from search results.​

5️⃣ User Behavior Analysis

●​ Checks how users interact with the website.​



●​ Signs of spam:​

○​ Users leave quickly (high bounce rate).​

○​ Many clicks but no engagement.​

●​ Real sites usually keep users engaged longer.​

6️⃣ CAPTCHA & Human Verification

●​ Used in comment sections, sign-ups, and forums to block automated bots.​

●​ Ensures the action is done by a human, not a spam program.

Q7) Opinion Lexicon expansion- Dictionary-based, corpus-based

An opinion lexicon is a list of words with their sentiments (positive, negative, or neutral).​
Example:

●​ “Good” → Positive​

●​ “Terrible” → Negative​

Sometimes, the basic list doesn’t have all opinion words, so we expand it using two
methods:

1️⃣ Dictionary-Based Method


This method starts with a small set of known opinion words (called seed words), and
then finds related words using a dictionary like WordNet or thesaurus.

➤ Steps:

●​ Start with seed words: "good", "bad"​

●​ Use a dictionary (WordNet) to find:​



○​ Synonyms of “good” → great, excellent, nice → add as positive​

○​ Antonyms of “good” → bad, terrible → add as negative​

✅ Features:
●​ Easy and fast to implement​

●​ Works well for common words​

●​ Doesn’t need labeled data​

📌 Example:
“happy” is positive​
→ Synonym = “joyful” → also positive​
→ Antonym = “sad” → negative
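
A minimal sketch of this dictionary-based expansion using NLTK's WordNet; it assumes the wordnet corpus has been downloaded (nltk.download("wordnet")).

```python
# Dictionary-based lexicon expansion: synonyms keep the polarity of the
# seed word, antonyms get the opposite polarity.
from nltk.corpus import wordnet as wn

positive, negative = {"good"}, set()
for syn in wn.synsets("good", pos=wn.ADJ):
    for lemma in syn.lemmas():
        positive.add(lemma.name())        # synonym -> positive
        for ant in lemma.antonyms():
            negative.add(ant.name())      # antonym -> negative

print(sorted(positive)[:5], sorted(negative)[:5])
```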

2️⃣ Corpus-Based Method


This method uses a large set of real text documents (corpus), like reviews or tweets,
to learn new opinion words from how they are used with seed words.

➤ Steps:

●​ Take a large text dataset (corpus)​

●​ Look at how unknown words are used with known opinion words​

●​ Use patterns like:​

○​ “This phone is ___ and amazing” → if a new word appears near “amazing”, it may also be positive​

●​ Use statistical methods or co-occurrence patterns​

✅ Features:
●​ More accurate for domain-specific terms (like tech or movies)​

●​ Can find context-based sentiments​

●​ Needs more data and computing​

📌 Example:
In mobile reviews:​
"Stylish" often appears near "great" → assume "stylish" is positive

✅ Conclusion:

| Method           | Uses                                  | Strength                     |
|------------------|---------------------------------------|------------------------------|
| Dictionary-Based | Word relationships (synonym/antonym)  | Simple & fast                |
| Corpus-Based     | Real-world usage in text              | More accurate & domain-aware |

Both methods help in growing the opinion word list, which improves sentiment
analysis accuracy.

Q8) Opinion spam detection - Supervised learning, Abnormal Behaviors, Group spam detection

Opinion spam refers to fake or misleading reviews written to boost or harm a product
or service.

1️⃣ Supervised Learning

Supervised learning uses labeled reviews (real or fake) to train a machine learning
model that can detect spam reviews.

➤ How it works:

●​ Step 1: Collect a dataset of reviews with labels (fake or genuine).​

●​ Step 2: Extract features like:​



○​ Number of positive/negative words​

○​ Review length​

○​ Reviewer history​

●​ Step 3: Train algorithms like:​

○​ Naive Bayes​

○​ Support Vector Machines (SVM)​

○​ Random Forest​

●​ Step 4: Use the model to classify new reviews as real or spam.​

✅ Advantage:
●​ Can achieve high accuracy if trained on a good dataset.​

📌 Example:
If a review says: “Amazing! Great! Super! Love it!” and it’s very short, it might be fake →
Model will flag it.

2️⃣ Abnormal Behaviors

This technique does not need labeled data. It detects fake reviews by finding strange
patterns in user behavior.

➤ Signals of Abnormal Behavior:

●​ Posting many reviews in a short time​

●​ Always giving 5-star or 1-star reviews​

●​ Reviewer never verified a purchase​

●​ All reviews written in similar language​

✅ Advantage:

●​ Good for detecting suspicious users​

●​ Works even without labeled data​

📌 Example:
A user posts 10 five-star reviews in one hour → system flags them as abnormal.
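
A minimal sketch of this check: flag any user who posts more than a threshold number of reviews within a one-hour window; the threshold value and the toy timestamps are illustrative assumptions.

```python
# Abnormal-behavior sketch: count reviews per user in sliding 1-hour windows.
from collections import defaultdict
from datetime import datetime, timedelta

reviews = [("u1", datetime(2024, 1, 1, 10, m)) for m in range(0, 50, 5)] \
        + [("u2", datetime(2024, 1, 1, 10, 0))]   # u1: 10 reviews, u2: 1

by_user = defaultdict(list)
for user, ts in reviews:
    by_user[user].append(ts)

THRESHOLD = 5  # more than 5 reviews inside one hour looks suspicious
for user, stamps in by_user.items():
    stamps.sort()
    for start in stamps:
        window = [t for t in stamps if start <= t < start + timedelta(hours=1)]
        if len(window) > THRESHOLD:
            print(f"{user}: {len(window)} reviews in one hour -> flag")
            break
```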

3️⃣ Group Spam Detection

Sometimes, spammers work in groups to make fake reviews look real. This method
finds coordinated behavior.

➤ Signs of Group Spamming:

●​ Multiple users review the same product at the same time​

●​ All give the same rating and write similar reviews​

●​ Group members often review the same set of products​

✅ Advantage:
●​ Detects organized spam attacks​

●​ Finds hidden patterns among users​

📌 Example:
5 users give 5-star reviews to the same phone on the same day using similar words →
suspected spam group.
