AIML Sem 8
Paper 1
Q1)
Stemming means cutting off the ends of words to remove prefixes or suffixes and get
the base/root form of a word.
🧹 Example:
Words like "connected", "connecting", and "connection" are all reduced to the root "connect".
2. Decreases vocabulary size: Makes the dataset smaller and easier to analyze.
3. Improves model performance: Makes machine learning models faster and
more accurate.
4. Makes comparison easy: All similar words are treated as one.
5. Improves search results: In search engines, stemming helps match different
forms of a word. For example, searching for "connect" will also find "connected",
"connecting", etc.
6. Supports text mining and NLP tasks: Stemming helps in tasks like sentiment
analysis, classification, and information retrieval by reducing noise from word
variations.
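The idea above can be sketched with a tiny suffix-stripping stemmer (a toy illustration, not the full Porter algorithm; the suffix list is an assumption):

```python
def simple_stem(word):
    """Very simplified stemmer: strips a few common English suffixes.
    (A toy illustration, not the full Porter algorithm.)"""
    for suffix in ("ingly", "edly", "ing", "ed", "ion", "s"):
        # keep at least a 3-letter stem so short words survive
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word

words = ["connected", "connecting", "connects"]
print([simple_stem(w) for w in words])  # all reduce to 'connect'
```

A real stemmer applies many ordered rewrite rules instead of a flat suffix list, but the reduce-to-root effect is the same.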
Maximum Likelihood Estimation (MLE) is a method used to find the best values for
the parameters of a model by maximizing the likelihood of observing the given data.
MLE picks the model that gives the highest probability of producing the data we
actually have.
How it works:
1. Choose a model with unknown parameters.
2. Use your data to build a likelihood function — a formula showing how likely the
data is for different parameter values.
3. Find the parameter values that make this likelihood as large as possible.
Simple Example:
Suppose you flip a coin 10 times and get 7 heads and 3 tails.
1. Let p be the (unknown) probability of getting heads.
2. The likelihood of getting this result (7 heads, 3 tails) depends on the value of p.
3. Try different values of p (like 0.1, 0.5, 0.7, etc.) and calculate how likely this result
is.
4. The value of p that gives the highest likelihood is the MLE.
🎯 In our case:
If we try p = 0.7, it gives the highest chance of getting 7 heads and 3 tails.
So, MLE = 0.7 → our best estimate of the probability of heads is 0.7.
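The trial-and-error search above can be sketched in a few lines (grid of candidate p values, as in the example):

```python
def likelihood(p, heads=7, tails=3):
    # Probability of observing this exact sequence of 7 heads and 3 tails
    return (p ** heads) * ((1 - p) ** tails)

candidates = [0.1, 0.3, 0.5, 0.7, 0.9]
best_p = max(candidates, key=likelihood)
print(best_p)  # 0.7 — matches heads/flips = 7/10
```

In practice MLE is solved with calculus or an optimizer rather than a grid, but the answer for a coin is always heads/flips.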
Conclusion:
MLE finds the value of a parameter (like probability) that makes the observed data
most likely.
It is used in machine learning and statistics to build the best models from data.
A Bayesian Network is a type of diagram (graph) that shows how different variables
are related using probability.
🧩 Structure:
● It is a Directed Acyclic Graph (DAG)
(arrows go one way, and there are no loops).
● Each variable has a Conditional Probability Table (CPT) that tells the chance
of that variable, given the values of its parent(s).
✅ Example:
Imagine 3 variables:
● Rain
● Sprinkler
● Wet Grass
This shows:
● Rain → Wet Grass and Sprinkler → Wet Grass: both rain and the sprinkler can make the grass wet.
● The network tells us how likely the grass is wet based on those causes.
Conclusion:
Bayesian Networks use a graph to visually and mathematically show how variables
influence each other using probabilities.
They are useful for prediction, decision making, and handling uncertainty in data.
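A minimal sketch of the Rain → Wet Grass ← Sprinkler network above; all probability values here are made up for illustration:

```python
# Tiny Bayesian network: Rain -> WetGrass <- Sprinkler
# All probability values below are invented for illustration.
P_rain = 0.2
P_sprinkler = 0.3

# CPT: P(WetGrass=True | Rain, Sprinkler)
cpt_wet = {
    (True, True): 0.99,
    (True, False): 0.90,
    (False, True): 0.80,
    (False, False): 0.05,
}

def p_wet_grass():
    """Marginal P(WetGrass=True), summing over all parent combinations."""
    total = 0.0
    for rain in (True, False):
        for spr in (True, False):
            p_parents = (P_rain if rain else 1 - P_rain) * \
                        (P_sprinkler if spr else 1 - P_sprinkler)
            total += p_parents * cpt_wet[(rain, spr)]
    return total

print(round(p_wet_grass(), 4))
```

The CPT dictionary plays the role of the conditional probability table: one row per combination of parent values.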
Joint Probability Distribution
Definition:
It shows the probability of all variables happening together.
Example:
For variables Rain and Traffic:
Rain | Traffic | Probability
Yes | No | 0.2
No | Yes | 0.1
No | No | 0.4
Inference
Definition:
Inference means finding unknown probabilities based on known evidence.
Example:
If you know it rained, what’s the chance traffic is bad?
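That inference question can be answered directly from the joint table above; the missing Rain=Yes, Traffic=Yes entry is assumed to be 0.3 so the four probabilities sum to 1:

```python
# Joint distribution over (Rain, Traffic).
# The Yes/Yes entry is an assumption (0.3) so all entries sum to 1.
joint = {
    ("Yes", "Yes"): 0.3,
    ("Yes", "No"): 0.2,
    ("No", "Yes"): 0.1,
    ("No", "No"): 0.4,
}

def p_traffic_given_rain(traffic, rain):
    """P(Traffic | Rain) = P(Rain, Traffic) / P(Rain)."""
    p_rain = sum(p for (r, t), p in joint.items() if r == rain)
    return joint[(rain, traffic)] / p_rain

print(p_traffic_given_rain("Yes", "Yes"))  # 0.3 / 0.5 = 0.6
```

So under these numbers, knowing it rained raises the chance of bad traffic to 60%.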
Conditional Independence
Definition:
Two variables are conditionally independent if knowing a third variable makes
them unrelated.
Example:
Let's say Rain → Wet Grass → Slippery Path (rain wets the grass, and wet grass
makes the path slippery).
Here, Rain and Slippery Path are conditionally independent if we know the grass is
wet.
👉 That means:
If we already know Wet Grass = Yes, then also knowing it rained tells us nothing
extra about whether the path is slippery.
Sentiment Analysis
👉 Sentiment Analysis is a method to find out whether a piece of text expresses a positive,
negative, or neutral feeling.
Example: "I love this phone" → Positive
Opinion Mining
👉 Opinion Mining is a method to find out what people think about something, like a
product or service, including which part they like or dislike.
Example: "The phone's battery is bad" → Opinion on battery is negative
Used For: Sentiment Analysis → general emotion tracking; Opinion Mining → business and market research.
e) What is social media mining? Explain various types of social media graphs.
Social Media Mining is the process of collecting and analyzing data from social
media platforms (like Facebook, Twitter, Instagram) to find useful patterns, opinions,
or trends.
Example:
If a company checks thousands of tweets to see what people are saying about their new
product, and finds that most people like it, that’s social media mining.
1. Undirected Graph
● Definition: A graph where connections have no direction — if A is connected to B,
then B is connected to A (e.g., Facebook friendships).
2. Directed Graph
● Definition: A graph where connections have a direction — following someone on
Twitter does not mean they follow you back.
3. Weighted Graph
● Definition: A graph where the connections have weights that show the strength
of the relationship between users.
● Use Case: Helps identify strong and weak connections between users.
4. Multigraph
● Definition: A graph where two users can be connected by more than one edge —
for example, one edge for likes, one for comments, one for shares.
● Use Case: Helps understand different ways users engage with content.
5. Dynamic Graph
● Definition: A graph that changes over time. Nodes and edges can be added or
removed as relationships or interactions evolve.
● Example: A social network where friendships form and break over time.
● Use Case: Used for studying trends, how communities grow, and how content
spreads.
6. Ego Network
● Definition: A graph that focuses on a single user (ego) and shows all their
direct connections (alters).
● Example: A LinkedIn user's connections—this shows all the people they are
directly connected to.
Q2)
a) Write a short note on web usage mining.
Web Usage Mining is the process of analyzing user behavior on websites by looking
at the data generated when people use a website.
It helps website owners understand how users interact with their site, such as which
pages they visit, how long they stay, and where they leave.
Main sources of data:
1. Web Server Logs — Files stored on a server that record every page a user visits on a
website. Example: time of visit, IP address, pages clicked.
2. Cookies — Small files stored in the user's browser to track repeated visits and
preferences.
3. Browser History — The list of websites and pages visited by the user, which can help
understand their interests.
4. Clickstream Data — The sequence of clicks or paths the user follows while using the
website.
2. ✅ Personalization:
Shows users content or products based on their past behavior (like YouTube or
Amazon recommendations).
📌 Example:
If a lot of users leave the site without buying, web usage mining might show they all left
on the payment page → the business can then fix that page to increase sales.
📝 Conclusion:
Web Usage Mining is a powerful tool to understand users, improve website
performance, and make smarter business decisions using data from user activity.
🔷 1. K-Means Clustering
✅ Concept (Simple):
K-Means groups data into K clusters based on how close they are to the center (called
centroid) of each group.
It works by assigning items to the nearest centroid, then updating centroids based
on the average of all items in the group.
✅ Steps of K-Means:
1. Choose the number of clusters (K).
2. Assign each item to the nearest centroid.
3. Update each centroid to the average of the items in its cluster.
4. Repeat steps 2–3 until the centroids stop changing.
✅ Distance Used:
Mostly Euclidean distance (straight-line distance).
✅ Example (Simple):
If we have students with height and weight, K-Means will put students with similar
height and weight into the same cluster.
✅ Text Applications:
Words or phrases are changed into vectors using TF-IDF or Word2Vec.
Then, K-Means groups similar meaning words together.
📘 Used for:
● Topic modeling
● Word grouping
● Customer segmentation
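The steps above can be sketched as a minimal pure-Python K-Means on 2-D points, using the student height/weight framing from the example (a teaching sketch, not production code):

```python
import random

def kmeans(points, k, iters=10, seed=0):
    """Minimal K-Means on 2-D points (teaching sketch, not production code)."""
    rng = random.Random(seed)
    centroids = rng.sample(points, k)          # pick K starting centroids
    for _ in range(iters):
        # 1) assign each point to its nearest centroid (squared Euclidean)
        clusters = [[] for _ in range(k)]
        for x, y in points:
            d = [(x - cx) ** 2 + (y - cy) ** 2 for cx, cy in centroids]
            clusters[d.index(min(d))].append((x, y))
        # 2) move each centroid to the mean of its cluster
        for i, c in enumerate(clusters):
            if c:
                centroids[i] = (sum(p[0] for p in c) / len(c),
                                sum(p[1] for p in c) / len(c))
    return centroids, clusters

# students as (height_cm, weight_kg) — made-up sample data
pts = [(150, 45), (152, 48), (151, 46), (180, 80), (182, 85), (178, 78)]
cents, cls = kmeans(pts, k=2)
print(sorted(len(c) for c in cls))  # two tight groups -> sizes [3, 3]
```

For text, the tuples would be TF-IDF or Word2Vec vectors instead of (height, weight), but the loop is identical.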
🔷 2. Hierarchical Clustering
✅ Concept (Simple):
This algorithm builds a tree of clusters. It can work in two ways:
● Bottom-Up (Agglomerative): Start with each item alone, and merge similar
ones step-by-step.
● Top-Down (Divisive): Start with all items in one group and keep splitting them.
✅ Distance Used:
To decide how close two clusters are, we use linkage methods such as:
● Single linkage — distance between the closest points of two clusters
● Complete linkage — distance between the farthest points
● Average linkage — average distance between all pairs of points
✅ Example (Simple):
If we have news articles, we start with each article as its own cluster and merge
similar ones into groups like Sports, Politics, and Technology.
📘 Used for:
● Organizing large text collections
● Document clustering
● Topic modeling
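The bottom-up (agglomerative) idea can be sketched on simple 1-D values, repeatedly merging the two closest clusters until the requested number remains (single linkage assumed):

```python
def agglomerative(values, k):
    """Bottom-up clustering sketch on 1-D values:
    merge the two closest clusters (single linkage) until k remain."""
    clusters = [[v] for v in values]           # start: every value alone
    while len(clusters) > k:
        best = None
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                # single linkage: distance between the closest members
                d = min(abs(a - b) for a in clusters[i] for b in clusters[j])
                if best is None or d < best[0]:
                    best = (d, i, j)
        _, i, j = best
        clusters[i] = clusters[i] + clusters[j]  # merge the closest pair
        del clusters[j]
    return clusters

print(agglomerative([1, 2, 9, 10, 25], k=3))  # [[1, 2], [9, 10], [25]]
```

Recording the order of merges (instead of stopping at k) would give the full dendrogram.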
✅ Final Comparison:
Feature | K-Means | Hierarchical Clustering
Number of clusters | Must choose K in advance | No need to fix in advance; cut the tree at any level
Speed | Fast, good for large data | Slower on large data
Output | Flat groups of items | Tree of clusters (dendrogram)
📝 Conclusion:
Both algorithms are useful for finding hidden patterns in data.
Q3)
a) Explain classical recommendation algorithms based on social media.
● Friends
● Posts
● Videos
● Pages or Products
These suggestions are based on your activity, likes, interests, and connections.
🔷 1. Collaborative Filtering (CF)
● ✅ User-Based CF: Finds users like you and recommends what they like.
🧠 Example: If you and another person like the same posts, you might also like the
other posts they enjoy.
● ✅ Item-Based CF: Suggests items similar to what you liked.
🧠 Example: If you liked a fitness video, the app suggests other fitness videos.
● ✅ Matrix Factorization: Finds hidden patterns in user preferences.
🧠 Example: Helps platforms recommend posts even if there's little direct data
about you.
🔷 2. Content-Based Filtering
Definition: Recommends content similar to what you've liked before, based on
keywords, tags, etc.
🧠 Example: If you liked travel posts, you will see more posts with hashtags like
#travel, #beach.
🔹 3. Hybrid Recommendation
Definition: Combines collaborative and content-based suggestions.
🧠 Example: Netflix uses your watch history + what others like to recommend movies.
🔹 4. Behavior-Based Recommendation
Definition: Uses your actions (watch time, likes, comments).
🧠 Example: TikTok shows videos based on what you watch and how long you watch.
🔹 7. Popularity-Based Recommendation
Definition: Recommends trending or popular posts.
🧠 Example: Instagram shows viral Reels that are liked by many people.
🔹 8. Demographic-Based Recommendation
Definition: Recommends based on age, gender, or location.
🧠 Example: Teenagers may get different music recommendations than adults.
🔹 9. Knowledge-Based Recommendation
Definition: Suggests based on user preferences or needs, even without past activity.
🧠 Example: A user selects "I like action movies" → app suggests action films directly.
These algorithms help make social media smarter by learning what users like and
suggesting the most relevant content, improving user engagement and satisfaction.
b) Explain the concept of language modeling, N gram models and its
applications.
Language modeling is the process of predicting the next word in a sentence or checking
how likely a sentence is in a language.
It learns from large amounts of text to predict the next word or fill in missing words.
1. Statistical Language Models
These use math and counting (like N-grams) to find patterns in text.
Example: Bigram model predicts "you" after "thank" if "thank you" appears often.
2. Neural Language Models
These use machine learning and deep learning to understand more complex
language patterns.
Example: LSTM or Transformer models (like GPT) that can write long, meaningful
sentences.
Language models power tools such as:
○ Translators
○ Voice assistants
○ Autocorrect tools
○ Search engines
🟩 N-Gram Modeling predicts the next word by looking at the previous (N-1) words.
🔹 How it Works:
1. Collect a large amount of text (called a corpus).
2. Count how often each sequence of N words appears.
3. Use those counts to estimate the probability of the next word given the previous
(N-1) words.
🧠 Example (Bigram/2-gram):
If the bigram “thank you” appears often in text, and you type “thank”, the model will
suggest “you”.
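The bigram idea can be sketched by counting word pairs in a tiny made-up corpus:

```python
from collections import Counter

corpus = "thank you for your help thank you so much thank you".split()
# count bigrams: pairs of consecutive words
bigrams = Counter(zip(corpus, corpus[1:]))

def predict_next(word):
    """Most frequent word that follows `word` in the corpus."""
    followers = {b: c for (a, b), c in bigrams.items() if a == word}
    return max(followers, key=followers.get) if followers else None

print(predict_next("thank"))  # 'you' — "thank you" appears 3 times
```

A trigram model would count triples the same way and condition on the previous two words.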
7. Text Generation: Generates new text that resembles the style and content of a
training corpus.
8. Information Retrieval: Improves search results by considering the context of
search queries.
9. Plagiarism Detection: Identifies similarities between documents by comparing
their N-gram distributions.
10. Chatbots: Helps chatbots generate more natural and contextually relevant
responses.
N-gram models are simple but powerful tools in NLP that help machines read, write, and
understand text like humans.
Q4)
a) Explain various types of web spamming techniques?
Web spamming means using unfair tricks to fool search engines and get a higher
ranking for a website. These tricks make the website look important, but often give a
bad experience to users.
📌 1. Content Spamming
What it is: Adding useless, copied, or repeated content just to match keywords.
Example: “Buy cheap phone. Cheap phone deals. Phone online cheap…” — same
sentence repeated with keywords.
Why it's bad: Confuses users and doesn’t provide real value.
📌 2. Link Spamming
What it is: Creating many fake or unrelated links to your website to trick search
engines.
Example: Posting your site link again and again in blog comments or forums that are
not related.
Why it's bad: Misleads search engines with false popularity.
📌 3. Keyword Stuffing
What it is: Repeating the same keyword many times unnaturally.
Example: “Shoes online. Buy shoes. Cheap shoes. Best shoes. Shoes sale…”
Why it's bad: Makes the page look spammy and hard to read.
📌 4. Cloaking
What it is: Showing different content to users and search engines.
Example: You see a news article, but search engine sees only keyword-loaded junk.
Why it's bad: It hides the real page from search engines.
📌 6. Doorway Pages
What it is: Special pages made to trap search engines and redirect users to another
page.
Example: A user clicks a result for “best laptops” and is taken to an unrelated ad page.
Why it's bad: Misleads users by not giving what was promised.
📌 7. Scraping Content
What it is: Copying content from other websites without permission.
Example: A blog that steals articles from other blogs to get more traffic.
Why it's bad: It’s unethical and gives no original value.
📌 8. Clickbait Titles
What it is: Using catchy or fake titles just to make people click.
Example: “You won’t believe what happened next!” but the article is boring or
unrelated.
Why it's bad: Tricks users and wastes their time.
These spamming methods are used to trick search engines but hurt real users. Search
engines like Google now use smart algorithms to detect and punish such spammy
behavior.
b) Explain various behavior analytics techniques used for social media
mining.
Behavior analytics means studying what users do on social media — like what they
click, like, share, or comment — to understand their interests, predict trends, and
improve user experience.
📌 1. Sentiment Analysis
What it does: Checks if a post or comment is positive, negative, or neutral.
Example: If someone writes, “I love this movie,” it detects the positive sentiment.
Used for: Product reviews, customer feedback, public opinion.
Techniques Used: sentiment word lists (lexicons) and machine learning classifiers
such as Naïve Bayes and SVM.
📌 2. Clickstream Analysis
What it does: Tracks where users click, what they view, and how long they stay.
Example: Knowing that most people click “like” on food posts.
Used for: Improving website/app layout, recommendations.
📌 3. Social Network Analysis
What it does: Studies how users are connected and who influences whom.
Techniques Used:
● Graph Theory: Represents social networks as graphs where nodes = users and
edges = relationships.
● Centrality Measures:
o Degree Centrality: Number of connections a user has.
o Betweenness Centrality: How often a user acts as a bridge in the
network.
o Closeness Centrality: How fast a user can reach others in the network.
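Degree centrality, the simplest of these measures, can be sketched on a toy adjacency list (the graph below is made up):

```python
# Toy social graph as an adjacency list: user -> set of friends
graph = {
    "A": {"B", "C", "D"},
    "B": {"A"},
    "C": {"A", "D"},
    "D": {"A", "C"},
}

# Degree centrality: number of direct connections each user has
degree = {user: len(friends) for user, friends in graph.items()}
most_central = max(degree, key=degree.get)
print(most_central, degree[most_central])  # A 3 — A is the hub
```

Betweenness and closeness need shortest-path computations over the same structure, so real analyses usually reach for a graph library.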
📌 4. Anomaly Detection
What it does: Finds unusual behavior, like spam or fake accounts.
Example: A new account posting 100 links in 5 minutes.
Used for: Detecting bots, fraud, or fake followers.
📌 5. Topic Modeling
What it does: Finds main topics or themes in lots of posts.
Example: Grouping posts into topics like "sports," "politics," or "travel."
Used for: Trend tracking, organizing content.
📌 6. Engagement Analysis
What it does: Measures how people interact with posts — likes, shares, comments.
Example: A post with 1,000 likes and 200 comments is highly engaging.
Used for: Knowing which content works best.
📌 7. Emotion Recognition
What it does: Detects specific emotions like happiness, anger, sadness.
Example: A comment like “I’m so angry with this service” shows anger.
Used for: Mental health monitoring, emotional analysis.
Techniques Used: emotion word lists (lexicons) and text classifiers trained on
emotion-labeled data.
📌 9. Churn Prediction
What it does: Predicts if a user is about to stop using the app or platform.
Example: A user hasn't liked or posted anything in 10 days.
Used for: Sending re-engagement messages or offers.
Q5)
a) Discuss the techniques used for information extraction from text.
Information extraction means pulling out useful facts like names, dates, locations, and
relationships from unstructured text (plain sentences). It is used in AI, search engines,
chatbots, and data analysis.
1. Named Entity Recognition (NER)
● What it does? Finds and labels names of people, places, organizations, dates, etc.
in the text.
2. Relation Extraction
● What it does? Identifies relationships between words/entities in a sentence.
● Example:
○ Text: "Barack Obama was born in Hawaii."
○ Relation: (Person → Birthplace → Location)
○ Output: (Barack Obama, born in, Hawaii)
● Used in: Knowledge graphs, automatic question answering.
✨ 4. Coreference Resolution
What it does: Finds which words refer to the same thing.
Example:
Text: "Priya bought a car. She loves it."
Output: “She” = Priya, “it” = car
Used in: Chatbots, summarization, question answering.
✨ 7. Event Extraction
What it does: Detects events or actions in a sentence.
Example:
Text: "India won the World Cup in 2011."
Output: (India, won, World Cup, 2011)
Used in: News summarization, historical data extraction.
✨ 8. Template Filling
What it does: Extracts specific data to fill predefined fields.
Example:
Text: "Samsung launched Galaxy S23 in 2024."
Output (Template):
● Company: Samsung
● Product: Galaxy S23
● Launch Year: 2024
Used in: Reports, product catalogs.
b) Explain the working of sentiment analysis systems and its application for
business intelligence
1. Data Collection:
Gather text such as reviews, tweets, or comments.
2. Preprocessing:
Clean the text — remove emojis, stop words, punctuations.
Example: "I loved the movie!" → "loved movie"
3. Tokenization:
Break text into words or phrases.
Example: "bad product" → ["bad", "product"]
4. Classification:
A machine learning model predicts the sentiment, for example:
○ Naïve Bayes
5. Output:
The text is labeled as one of:
○ Positive
○ Negative
○ Neutral
Q6)
a) Explain in detail rule based and probabilistic classifiers for text
classification.
1. Rule-Based Classifiers
🔸 What is it?
Rule-based classifiers use "if-then" rules to classify text. These rules are manually
created or automatically generated based on patterns in the text.
🔸 How it works?
● The system checks for specific keywords, phrases, or patterns in the text.
✅ Example:
● Rule:
If a sentence contains words like “free”, “win”, “prize” → classify as spam.
Used in:
● Spam filters
● Medical text classification
● Simple chatbot intent detection
🔸 Pros:
● Easy to understand
● Gives exact reasons for classification
🔸 Cons:
● Hard to maintain many rules
● Doesn’t handle unseen or complex cases well
Decision Tree (a rule-based classifier):
Easy Definition:
It asks a series of "yes/no" questions to decide the final class.
How it works:
● It builds a tree where each node checks for a feature (like a word).
Example:
If text contains “buy” → yes
If text contains “free” → yes → spam
Else → not spam
Pattern-Matching Rules:
Easy Definition:
It uses manually written rules or regular expressions to match patterns.
How it works:
● Rules like: If the text has “refund” and “delay” → label as complaint
Example:
“If text has 'cancel' and 'booking' → class = cancellation request”
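The if-then rules from this section can be sketched in a few lines (rule order and keyword lists are illustrative):

```python
def rule_classify(text):
    """Tiny if-then rule classifier using the example rules above
    (keywords and rule order are illustrative)."""
    t = text.lower()
    if "cancel" in t and "booking" in t:
        return "cancellation request"
    if "refund" in t and "delay" in t:
        return "complaint"
    if any(w in t for w in ("free", "win", "prize")):
        return "spam"
    return "other"

print(rule_classify("Win a FREE prize now!"))      # spam
print(rule_classify("Please cancel my booking."))  # cancellation request
```

Note how rule order matters: the first rule that fires decides the class, which is exactly why large rule sets become hard to maintain.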
2. Probabilistic Classifiers
🔸 What is it?
Probabilistic classifiers use math and probability to predict the most likely category for
a text. The most common example is the Naive Bayes Classifier.
🔸 How it works?
● It learns from a training dataset with labeled examples.
● It calculates the probability of each class given the words in the text.
Used in:
● Spam filtering
● Sentiment analysis
● News and document categorization
🔸 Pros:
● Works well even with small data
● Fast and simple to implement
🔸 Cons:
● Assumes words are independent (not always true)
● Can struggle with sarcasm or complex language
Logistic Regression:
Easy Definition:
It calculates a score using a formula and converts it to a probability between 0 and 1.
How it works:
● Each word has a weight; the weighted sum of the words in the text is passed
through a sigmoid function to get a probability between 0 and 1.
Example:
Text: “Amazing product”
Weights: “amazing” = +2 → High score = positive review
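That score-to-probability step can be sketched with hand-set word weights (the weights below are made-up numbers) and the sigmoid function:

```python
import math

# Hand-set word weights — made-up numbers for illustration;
# a trained model would learn these from labeled data.
weights = {"amazing": 2.0, "great": 1.5, "terrible": -2.5, "bad": -1.5}

def positive_probability(text):
    score = sum(weights.get(w, 0.0) for w in text.lower().split())
    return 1 / (1 + math.exp(-score))   # sigmoid: maps score to (0, 1)

p = positive_probability("Amazing product")
print(round(p, 2))  # 0.88 — score +2.0 maps to a high positive probability
```

A score of 0 maps to exactly 0.5, i.e., the model is undecided when no weighted words appear.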
b) Explain with block diagram working of web search engines and
significance of semantic indexing.
Search engines are programs that allow users to search and retrieve information
from the vast amount of content available on the internet. They use algorithms to
index and rank web pages based on relevance to a user’s query, providing a list of
results for users to explore. Popular search engines include Google, Bing, and Yahoo.
1. Crawling:
● Software bots (crawlers or spiders) follow links and collect information from web
pages such as text, titles, images, and keywords.
2. Indexing:
● The collected data is stored in a searchable index.
3. Ranking:
● For each query, pages are ranked based on factors like:
● Keyword relevance
● Page speed
● Mobile-friendliness
● Number of backlinks
● Freshness of content
🔸 Web Crawler — Software bots (like spiders) that visit and collect data from web pages
🔸 Database — Stores all the indexed data (title, content, links, etc.)
🔸 Ranking Engine — Calculates scores and decides which pages appear first
🔸 Search Interface — The user interface (like the Google search bar) used to enter queries
🔸 Query Processor — Analyzes the user's input and fetches relevant results from the index
Latent Semantic Indexing (LSI) helps search engines understand the meaning of a
page from related words, not just exact keyword matches.
For example, if the page is about "cars", LSI can understand related words like
"vehicle", "engine", "automobile", etc.
Benefits of LSI:
1. LSI helps search engines understand the real topic of a page, so users get more
accurate and helpful search results.
2. Websites don't have to repeat the same keyword again and again. Using related
words is enough, which makes the content more natural.
3. LSI keywords make the content richer and more relevant, which helps the page
rank higher in search results.
4. Using LSI terms adds variety and depth to the content, making it more useful and
interesting for readers.
5. LSI helps search engines know what the user is really looking for, even if the exact
words are not typed.
Paper 2
Q1)
a) What is opinion mining? List the challenges of opinion mining
Opinion Mining, also called Sentiment Analysis, is a method used to find out what
people feel (positive, negative, or neutral) when they write something, like a review,
comment, or tweet.
📌 Example:
If someone writes, “I love this phone!”, opinion mining will detect that the feeling is
positive.
1. Sarcasm:
🧠 Example:
"This phone is just amazing… it only hangs 10 times a day." (This sounds positive, but
it's actually sarcastic and negative.)
2. Negation:
🧠 Example:
"This phone is not good." → The system must understand this is negative, not
positive.
3. Ambiguity:
🧠 Example:
"This movie was dark." → "Dark" could describe the lighting (neutral) or the theme, so
the sentiment is unclear.
4. Multiple meanings of words:
🧠 Example:
"She is so cool." → Positive
"It's cool outside." → Just about temperature, neutral
5. Spelling mistakes:
🧠 Example:
"This phone is awesum!" → Should be "awesome" (positive), but the system might not
understand it.
6. Emojis and slang:
🧠 Example:
"I 💖 this!" or "This phone is lit!" → These mean positive, but the system must learn
emojis and slang.
7. Domain dependence:
🧠 Example:
"Long battery life"
● In phones: positive
● The same word can flip polarity elsewhere, e.g., "long loading time" is negative.
8. Fake or promotional reviews:
🧠 Example:
"This is the best product ever!!! Buy now!!!" → Sounds too fake or promotional, might
not be real.
9. Mixed opinions:
🧠 Example:
"I like the screen, but the battery is terrible."
→ Part is positive, part is negative – hard to label it as just one.
b) What are the types of spamming techniques? Explain any two techniques in
detail.
A Hidden Markov Model (HMM) predicts a sequence of hidden states from observable
data, moving one step at a time (each state depends only on the previous state).
📌 Example:
Guessing whether someone is happy or sad (hidden state) based on their facial
expressions (observable data), step by step.
CRF is a machine learning model used for predicting sequences, where it looks at the
whole sentence or sequence at once to decide the best set of labels.
It considers the relationship between neighboring words and features together.
Sentence:
John lives in New York.
CRF output: John → Person, New York → Location
💡 Explanation:
CRF looks at the whole sentence to decide that "New" and "York" together form a
location, not separately. It learns the pattern and relationships between words.
2. Purpose — HMM: Models both observations and hidden states. CRF: Models the relationship between input and output labels.
3. Dependencies — HMM: Assumes each state depends only on the previous state (Markov property). CRF: Considers the whole sequence context for predictions.
4. Feature Usage — HMM: Limited feature usage. CRF: Can use multiple features for better accuracy.
6. Flexibility — HMM: Less flexible due to independence assumptions. CRF: More flexible and captures complex dependencies.
7. Performance — HMM: Works well for simple sequences. CRF: Works better for complex sequences like NLP tasks.
9. Learning Method — HMM: Uses Maximum Likelihood Estimation (MLE). CRF: Uses conditional probability for learning.
10. Accuracy — HMM: May give lower accuracy due to independence assumptions. CRF: Higher accuracy as it considers the entire sequence.
Same as Explain the concept of language modeling, N gram models and its applications.
Social Media Mining is the process of analyzing and extracting useful information
from social media platforms (like Facebook, Twitter, Instagram) to understand patterns,
behaviors, and trends
1. Huge Volume of Data
● Definition: Social media generates a huge amount of data very quickly (e.g.,
millions of posts every second). This makes it hard to store, manage, and
analyze.
2. Privacy and Security
● Definition: It is hard to analyze user data while making sure personal information
is safe.
3. Noisy and Unstructured Data
● Definition: Social media data is often messy with things like slang, typos,
emojis, and irrelevant content. It's also full of fake accounts and spam.
4. Rapidly Changing Trends
● Definition: Social media trends change fast. It's hard to keep up because
topics, hashtags, and popular content keep changing, requiring real-time
analysis.
7. Scalability Issues
8. Influencer and Community Detection
● Definition: It's difficult to find the key influencers in large networks and to
understand how communities form and change over time.
9. Legal and Ethical Issues
● Definition: Using social media data for research or business has to follow laws
and be ethical, including getting user consent and ensuring data is used
responsibly.
Q2)
NER is a technique in Natural Language Processing (NLP) that helps find and label
names of people, places, organizations, dates, etc. in text.
🔍 Example:
👉 Input sentence:
"Apple Inc. is located in Cupertino."
NER Output:
● Apple Inc. → Organization
● Cupertino → Location
Input Text
↓
Text Preprocessing
(Tokenization, Stopword Removal)
↓
Feature Extraction
(Word Embeddings, POS Tags)
↓
Named Entity Recognition
(NER Model - ML or Deep Learning)
↓
Classified Entities
(Person, Organization, Location, Date, etc.)
✔️ Text Preprocessing — Clean and prepare the text.
✔️ Feature Extraction — Pull useful information from words.
Applications of NER
9. E-commerce
Helps identify product names, brands, and prices in reviews and product listings.
10. Email Filtering
Tags important names, dates, or companies in business emails for better
organization.
Approaches to NER:
1. Dictionary-Based:
✔ Uses a dictionary of names/terms to match with the text.
❗ Not used much — needs regular updates.
2. Machine-Learning-Based:
✔ Learns from data.
❗ Needs lots of labeled data and context understanding.
1. K-Means Clustering
Concept:
K-Means groups data into K clusters based on centroids (center points).
Each data point (e.g., a word, phrase, or user) is assigned to the nearest centroid. The
centroids are updated by averaging the data points in each cluster.
Steps:
1. Choose the number of clusters (K).
2. Pick K initial centroids.
3. Assign each data point to the nearest centroid.
4. Update the centroid based on the average of points in the cluster.
5. Repeat steps 3–4 until the centroids stop moving.
Distance Used:
Usually Euclidean distance (straight-line distance).
Example:
Imagine students grouped by height and weight. K-Means will cluster students with
similar height and weight together.
● Words are first turned into vectors (e.g., with TF-IDF or Word2Vec).
● K-Means can then group similar words (e.g., {apple, banana, mango} in one
group and {car, bus, bike} in another).
2. Hierarchical Clustering
Concept:
It builds a tree-like structure (called a dendrogram) to show how data points (like
documents) are related.
● Agglomerative (Bottom-Up): Start with each item as its own group → merge
the most similar groups until one big group remains.
● Divisive (Top-Down): Start with one big group → keep splitting based on
similarity.
Distance Used:
Linkage methods — single (closest points), complete (farthest points), or average
linkage — measure how close two clusters are.
Example:
Think of grouping news articles. Start with each article separately, then merge similar
ones into clusters like Sports, Politics, Technology, etc.
Application (Documents):
● Documents are represented as word probabilities (e.g., using LDA), and then
grouped based on how similar their word usage is.
Conclusion:
● K-Means is fast and works well when you know the number of clusters.
● Hierarchical Clustering is better when you want to see the relationship between
clusters in a tree form.
Both are useful in text mining, document grouping, and language
applications.
Q3)
a) What is Latent Semantic Indexing? What are the benefits of Latent
Semantic Indexing?
Same as the semantic indexing part of "Explain with block diagram working of web
search engines and significance of semantic indexing."
○ Check if the emotions or opinions in the review match with other real
reviews.
○ For example, if most people say a product is bad and one review says it’s
amazing, it might be fake.
○ Machine learning classifiers can also be trained to detect fake reviews, such as:
■ Naïve Bayes
■ Decision Trees
■ Neural Networks
Where Is It Used?
Benefits:
Q4)
K-Nearest Neighbors (K-NN) classifies a new data point by looking at the K closest
labeled points and taking a majority vote.
● If most of them are red, it labels the new dot as red (apple).
✅ Steps:
1. Choose the number of neighbors (K).
2. Measure the distance between the new point and all existing data points
(usually using Euclidean distance).
3. Sort the points by distance.
4. Pick the K nearest neighbors.
5. Assign the new point to the most common label among them.
✅ Example:
Suppose we have data about fruits: apples are red and round, bananas are yellow and
long. A new fruit that is red and round will be labeled an apple, because its nearest
neighbors are apples.
✅ Applications of K-NN:
● Spam Email Detection (spam or not spam)
✅ Advantages:
● Easy to understand and implement
✅ Limitations:
● Slow for large datasets (because it checks distance from every point)
● Choice of K is important – too small can be noisy, too large can mix categories
✅ Key Points:
● K-NN is a lazy learner – it doesn’t build a model in advance.
● Best used when you have labeled data and need to make simple predictions.
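The steps above can be sketched in pure Python; the fruit features (redness, roundness on a 0–10 scale) are an invented encoding for illustration:

```python
from collections import Counter

def knn_predict(train, new_point, k=3):
    """train: list of ((x, y), label). Majority vote among the k nearest points."""
    # sort all training points by squared Euclidean distance to the new point
    dists = sorted(train,
                   key=lambda item: (item[0][0] - new_point[0]) ** 2 +
                                    (item[0][1] - new_point[1]) ** 2)
    votes = Counter(label for _, label in dists[:k])
    return votes.most_common(1)[0][0]

# fruits described by (redness, roundness) on a 0-10 scale — made-up encoding
train = [((9, 9), "apple"), ((8, 8), "apple"), ((9, 8), "apple"),
         ((2, 3), "banana"), ((1, 2), "banana"), ((2, 2), "banana")]
print(knn_predict(train, (8, 9)))  # apple — its 3 nearest neighbors are apples
```

The sort over every training point is exactly why K-NN is slow on large datasets, as noted above.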
b) Explain different data sources and the web usage mining process in detail.
Web Usage Mining is the process of analyzing how users behave on a website.
It helps website owners understand:
🔹 1. Data Preprocessing
(Clean and prepare the data)
● Identify who the user is, when they came, and what they did.
✅ Example: Removing repeated page visits or missing values.
🔹 2. Pattern Discovery
● Make patterns/models to show which pages users visit and in what order.
✅ Example: Many users go from "Home" → "Product" → "Cart".
🔹 3. Pattern Analysis
● We check how long users stayed, what they clicked, and if they came back.
✅ Example: A user stays for 10 minutes and views 5 pages.
🔹 4. User Segmentation (Clustering)
● Groups similar users together — helpful in giving personalized content or ads.
✅ Example: Group A – people who buy often, Group B – people who just browse.
🔹 5. Association Rule Mining
● Finds pages or items that go together — shows what users like to do.
✅ Example: Users who visit "Mobile Phones" also visit "Accessories".
🔹 6. Path Analysis
● Studies the routes users take through the site — helps improve navigation.
✅ Example: Most users follow this path: Homepage → Search → Product →
Checkout.
This step predicts user behavior based on past activity. It helps in identifying user
interests and future actions.
● Decision Trees: Classifies users into categories (e.g., buyers vs. non-buyers).
● Neural Networks 🧠: Identifies complex patterns in user behavior.
Q5)
a) Explain feature selection techniques for text document classification.
Feature selection means picking only the most useful words (features) from text for
classification. It:
● Reduces the number of features
● Speeds up training
● Increases accuracy
1. Term Frequency (TF)
● More frequent = more important (sometimes).
✅ Example: In a sports article, the word "goal" may appear often → useful word.
2. Document Frequency (DF)
● If a word appears in almost all documents (like "the", "is"), it may not be helpful.
✅ Example: "Laptop" appears in tech articles only → good feature.
3. TF-IDF
● Gives a high score to words that are frequent in one document but rare in
others.
✅ Example: In a document about "Diabetes," words like "insulin" get high
TF-IDF.
4. Chi-Square Test
● Measures how much a word and a category are dependent.
✅ Example: Word "complaint" appears mostly in "negative reviews" → selected.
5. Information Gain (IG)
● Measures how much a word helps separate the categories. High IG = good feature.
✅ Example: The word "free" may help separate spam emails from normal ones.
6. Mutual Information
● Measures how much knowing a word helps in knowing the category.
✅ Example: Word "refund" → very useful in identifying complaints.
✅ Summary Table
Technique | What It Does | Example Word Use
TF-IDF | High for rare but important words | "Insulin" in medical text
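TF-IDF, the technique in the table, can be sketched on a tiny made-up corpus (basic tf × idf, without the smoothing real libraries add):

```python
import math

# tiny made-up corpus: one medical doc, one sports doc, one medical doc
docs = [
    "insulin controls blood sugar",
    "the match ended with a late goal",
    "the patient needs insulin daily",
]

def tf_idf(word, doc, corpus):
    words = doc.split()
    tf = words.count(word) / len(words)                 # frequent in this doc
    df = sum(1 for d in corpus if word in d.split())    # docs containing word
    idf = math.log(len(corpus) / df)                    # rare overall -> higher
    return tf * idf

# "insulin" appears only in medical docs -> relatively high score;
# "the" appears in most docs -> low score
print(tf_idf("insulin", docs[0], docs) > tf_idf("the", docs[2], docs))  # True
```

Libraries such as scikit-learn add smoothing and normalization on top of this basic formula.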
b) What are the different types of social media graphs? Explain
recommendation using social context in detail.
Types of social media graphs: same as "What is social media mining? Explain various
types of social media graphs."
💡 What is It?
Recommendation using social context means giving suggestions to users (like movies,
products, or friends) based on their social connections — like friends, followers, likes,
or group behavior.
Instead of just looking at user preferences, it also uses who the user knows and
interacts with.
📌 Example:
If your friend liked a movie, there’s a high chance you might like it too.
So the system will recommend that movie to you — using your social connection.
✅ Benefits:
● More personalized suggestions
✅ Real-Life Examples:
● Netflix: Shows “Trending among your friends”
Final Line:
Recommendation using social context makes suggestions smarter and more personal
by using not just your choices, but also the influence of your friends and social group.
Q6)
a) Explain the working of web search engines.
Same as Explain with block diagram working of web search engine
It means finding whether a given text (like a tweet or review) shows a positive,
negative, or neutral feeling.
Supervised techniques use labeled data — where each text already has a known
sentiment — to train a model that can predict the sentiment of new text.
Features are extracted from the text using methods like:
○ Bag of Words
○ TF-IDF
🔹 1. Naïve Bayes
● Assumes that all words are independent of each other (which is not always true,
but it works well for text).
● Counts how often each word appears in positive and negative reviews.
✅ Advantages:
● Fast and simple.
📌 Example:
If the words “awesome”, “great”, “love” appear mostly in positive texts, then a review
with those words will likely be predicted as positive.
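The word-counting idea can be sketched as a tiny Naïve Bayes classifier with add-one smoothing (the training texts and vocabulary size are made up):

```python
from collections import Counter
import math

# tiny made-up labeled training set
train = [("awesome great love it", "pos"), ("great phone love", "pos"),
         ("terrible bad hate it", "neg"), ("bad awful", "neg")]

counts = {"pos": Counter(), "neg": Counter()}
totals = {"pos": 0, "neg": 0}
for text, label in train:
    for w in text.split():
        counts[label][w] += 1
        totals[label] += 1

def predict(text, vocab_size=20):
    best, best_lp = None, -math.inf
    for label in ("pos", "neg"):
        lp = math.log(0.5)              # equal class priors assumed
        for w in text.split():
            # Laplace (add-one) smoothing so unseen words don't zero out
            p = (counts[label][w] + 1) / (totals[label] + vocab_size)
            lp += math.log(p)
        if lp > best_lp:
            best, best_lp = label, lp
    return best

print(predict("great awesome phone"))  # pos
```

Log-probabilities are summed instead of multiplying raw probabilities to avoid numerical underflow on long texts.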
🔹 2. Support Vector Machine (SVM)
● Finds the best boundary (hyperplane) that separates the classes.
● Good for large feature spaces, like text where there are thousands of words.
✅ Advantages:
● Very accurate.
📌 Example:
A tweet is converted into numbers (features), and SVM draws a line that separates
positive and negative tweets based on these features.
🔹 3. Logistic Regression
● A type of regression that predicts probability of output class (like positive or
negative).
● It uses a sigmoid function to convert the result into a value between 0 and 1.
● Often used when you want to predict binary outcomes (yes/no, true/false,
positive/negative).
✅ Advantages:
● Simple to implement.
📌 Example:
If a review contains many positive words, the model might output 0.85, meaning there’s
an 85% chance the review is positive.
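A minimal sketch of the sigmoid step (the weighted sum `z = 1.7` is a made-up value standing in for a review's features):

```python
import math

def sigmoid(z):
    """Squash any real number into the range (0, 1)."""
    return 1 / (1 + math.exp(-z))

# Hypothetical weighted sum of a review's features:
# more positive words -> larger z -> probability closer to 1
z = 1.7
probability = sigmoid(z)
print(round(probability, 2))  # 0.85, i.e. an 85% chance the review is positive
```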
🔹 4. Decision Tree
● Uses a tree-like structure to make decisions.
✅ Advantages:
● Easy to understand and visualize.
📌 Disadvantage:
● May overfit (memorize the training data and perform poorly on new data).
📌 Example:
If the text contains the word "excellent", go one way; if it contains "terrible", go another
way.
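The example above maps directly to nested if/else branches (a deliberately tiny, hypothetical tree):

```python
# A two-question "tree" matching the example: one branch per keyword
def classify_review(text):
    if "excellent" in text:
        return "positive"
    elif "terrible" in text:
        return "negative"
    else:
        return "neutral"

print(classify_review("the food was excellent"))  # positive
print(classify_review("a terrible mess"))         # negative
```

Real decision trees learn such questions from data rather than hard-coding them, but the branching structure is the same.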
🔹 5. Neural Networks
● Made of layers of neurons (input, hidden, and output layers).
● Popular networks include RNN (Recurrent Neural Networks) and LSTM (Long
Short-Term Memory) for text tasks.
✅ Advantages:
● Very powerful and accurate.
📌 Disadvantages:
● Needs large amounts of training data and computing power.
● Harder to interpret than simpler models.
📌 Example:
Given a review like “I hated the movie, but the ending was great”, a neural network can
understand the mixed tone and possibly classify it as neutral or slightly positive.
Text preprocessing prepares raw text for analysis in Natural Language Processing
(NLP). Below are four key steps:
3. Stop Word Removal (Removing common words that add little meaning)
● Definition: Removing words like "is", "the", "and" to keep important words only.
● Example:
○ Input: "I am learning machine learning"
○ Output: "learning machine learning"
These steps clean and structure text data, making it ready for machine learning and
NLP tasks like chatbots, search engines, and sentiment analysis!
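The cleaning steps above can be sketched in plain Python (the stop-word list here is a tiny toy subset):

```python
# Minimal preprocessing sketch: lowercasing, tokenization, stop-word removal
STOP_WORDS = {"i", "am", "is", "the", "and", "a", "an"}

def preprocess(text):
    tokens = text.lower().split()                        # lowercase + tokenize
    return [t for t in tokens if t not in STOP_WORDS]    # drop stop words

print(preprocess("I am learning machine learning"))
# ['learning', 'machine', 'learning']
```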
Clustering means grouping similar things together. In text mining, we group similar
words, phrases, or documents. This helps organize large text data without needing
labels.
💡 Key Idea:
If words or phrases often appear in the same context, they are similar and grouped into
the same cluster.
✅ Example:
● Words like "car", "bike", "truck" can be in one vehicle cluster.
✅ Used in:
● Creating word clouds
● Topic modeling
✅ Benefits:
● Helps understand hidden topics in large text.
💡 Key Idea:
Each document is a mixture of topics, and each topic is a mixture of words.
✅ Example:
● A news article might be 70% Politics and 30% Sports.
● A tech blog post might be 60% Technology and 40% Business.
✅ Algorithm Used:
● LDA (Latent Dirichlet Allocation) is the most famous algorithm for probabilistic
clustering.
✅ Used in:
● Recommender systems.
✅ Benefits:
● Gives a more flexible and realistic view.
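The "mixture of topics, mixture of words" idea can be illustrated in plain Python; all proportions below are made-up toy values, not real LDA output:

```python
# Each topic is a distribution over words (toy values)
topics = {
    "politics": {"election": 0.5, "vote": 0.3, "law": 0.2},
    "sports":   {"match": 0.6, "goal": 0.3, "team": 0.1},
}
# The document is a mixture of topics: 70% politics, 30% sports
doc_mix = {"politics": 0.7, "sports": 0.3}

# Probability of each word in the document under the mixture
word_probs = {}
for topic, weight in doc_mix.items():
    for word, p in topics[topic].items():
        word_probs[word] = word_probs.get(word, 0) + weight * p

print(round(word_probs["election"], 2))        # 0.35 (= 0.7 * 0.5)
print(round(sum(word_probs.values()), 2))      # 1.0 — still a valid distribution
```

LDA works in the opposite direction: given only the documents, it estimates these topic and mixture distributions.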
🔹 1. Decision Tree Classifier
📘 What is it?
It uses a tree of questions: each branch tests a feature, and each leaf gives a class.
📌 Example:
To classify emails as "spam" or "not spam", the tree may first check whether the word "free" appears, then branch on other features.
✅ Key Points:
● Easy to understand and draw.
🔹 2. Rule-Based Classifier
📘 What is it?
This classifier uses IF-THEN rules to classify data. Rules are made using expert
knowledge or learned from data.
📌 Example:
● IF a review contains the word "excellent", THEN it is positive.
✅ Key Points:
● Easy to write and understand.
🔹 3. Probabilistic Classifier
📘 What is it?
It uses probability to classify items. It calculates the chance that the data belongs to a
certain class.
📌 Example:
Using Naive Bayes:
● It calculates the probability of each class given the words in the text, and picks the class with the highest probability.
✅ Key Points:
● Simple but powerful.
🔹 4. Proximity-Based Classifier
📘 What is it?
Also called distance-based. It classifies data based on how close (similar) it is to other
data points.
📌 Example:
Using K-Nearest Neighbors (K-NN):
✅ Key Points:
● Easy to implement.
● Example: IF user clicks frequently on mobiles, THEN show more mobile ads.
● Easy to explain.
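A minimal K-NN sketch in plain Python, assuming each text has already been turned into a small feature vector (the vectors and labels below are made up):

```python
from collections import Counter

def knn_predict(train, new_point, k=3):
    """Classify new_point by majority vote of its k nearest neighbours."""
    # Sort training points by squared Euclidean distance to new_point
    by_distance = sorted(
        train,
        key=lambda item: sum((a - b) ** 2 for a, b in zip(item[0], new_point)),
    )
    labels = [label for _, label in by_distance[:k]]
    return Counter(labels).most_common(1)[0][0]

# Hypothetical 2-D features, e.g. (positive-word count, negative-word count)
train = [((3, 0), "pos"), ((2, 1), "pos"), ((0, 3), "neg"), ((1, 2), "neg")]
print(knn_predict(train, (2, 0), k=3))  # nearest neighbours are mostly "pos"
```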
Q4) Markov random fields, Inverted indices and compression in web mining
🔹 1. Markov Random Fields
📘 What is it?
A Markov Random Field (MRF) is an undirected graphical model: each variable depends only on its neighbours in the graph, and (unlike a Bayesian network) the edges have no direction.
✅ Key Points:
● Suited to settings where influence flows both ways between neighbouring items.
🔹 2. Inverted Indices
📘 What is it?
An inverted index is like a book index, but for search engines.
It stores which words appear in which documents.
📌 Example:
If you search for the word "apple", the search engine looks in the inverted index to find
all documents containing "apple".
✅ Key Points:
● Used in search engines like Google.
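A toy inverted index can be built with a dictionary in a few lines (the three documents are made up):

```python
from collections import defaultdict

# Three made-up documents, keyed by document ID
docs = {
    1: "apple pie recipe",
    2: "apple iphone review",
    3: "banana bread recipe",
}

# Inverted index: word -> set of document IDs containing that word
index = defaultdict(set)
for doc_id, text in docs.items():
    for word in text.split():
        index[word].add(doc_id)

print(sorted(index["apple"]))   # [1, 2]
print(sorted(index["recipe"]))  # [1, 3]
```

A query for "apple" is then a single dictionary lookup instead of a scan of every document.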
🔹 3. Compression
📘 What is it?
Compression means reducing the size of data so it takes less space and loads faster.
✅ Key Points:
● Saves storage space and bandwidth.
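A quick sketch with Python's standard `zlib` module shows the idea — repetitive text shrinks a lot and decompresses back exactly:

```python
import zlib

# Repetitive text compresses well; real index data behaves similarly
text = ("web mining compresses indexes to save space " * 50).encode("utf-8")

compressed = zlib.compress(text)

print(len(text), "->", len(compressed))      # many fewer bytes after compression
print(zlib.decompress(compressed) == text)   # True — compression is lossless
```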
🔹 1. Using Similarity Scores
● Each search engine returns a similarity score (relevance score) for every webpage.
● This score shows how closely the page matches the search query.
● Pages with higher total scores are ranked higher in the final result.
📌 Example:
If a page gets 0.8 from Google and 0.7 from Bing → average score = 0.75 → higher
rank.
🔹 2. Using Rank Positions
● Meta search looks at the rank of each result from different engines.
○ Rank 1 = 10 points
○ Rank 2 = 9 points
● The total score of a page is calculated by adding its points from all engines.
📌 Example:
If a webpage is ranked 1st by Bing (10 points) and 3rd by Google (8 points), it gets 18
points → high final rank.
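The rank-points scheme above can be sketched in a few lines (the engine result lists and page names are made up):

```python
# Rank-position scoring as described above: rank 1 = 10 points,
# rank 2 = 9 points, and so on
results = {
    "Google": ["pageA", "pageB", "pageC"],
    "Bing":   ["pageA", "pageD", "pageB"],
}

scores = {}
for engine, ranking in results.items():
    for position, page in enumerate(ranking):   # position 0 means rank 1
        scores[page] = scores.get(page, 0) + (10 - position)

final = sorted(scores, key=scores.get, reverse=True)
print(final[0], scores["pageA"])  # pageA 20 — ranked 1st by both engines
```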
🔚 Conclusion:
Meta search improves search quality by combining results from many search engines.
It uses similarity scores (how relevant a page is) and rank positions (page order in
results) to give better and more complete answers to users.
Web spamming means cheating search engines to get higher rankings by using fake
or unfair techniques, like keyword stuffing or link farms.
Spammers try to bring low-quality pages to the top of search results.
a) PageRank Algorithm
b) TrustRank Algorithm
● Algorithms like Naive Bayes, Decision Trees, and SVM are trained to detect
spam.
○ Word frequency
○ Number of links
○ Length of content
● Signs of spam include keyword stuffing, hidden text, and unnatural link patterns.
● “Good” → Positive
● “Terrible” → Negative
Sometimes, the basic list doesn’t have all opinion words, so we expand it using two
methods:
🔹 1. Dictionary-Based Approach
➤ Steps:
1. Start with a small seed list of known opinion words.
2. Look up each word in a dictionary (like WordNet) to find synonyms and antonyms.
3. Add synonyms with the same polarity and antonyms with the opposite polarity.
✅ Features:
● Easy and fast to implement
📌 Example:
“happy” is positive
→ Synonym = “joyful” → also positive
→ Antonym = “sad” → negative
🔹 2. Corpus-Based Approach
➤ Steps:
● Look at how unknown words appear alongside known opinion words in a large collection of text (a corpus).
✅ Features:
● More accurate for domain-specific terms (like tech or movies)
📌 Example:
In mobile reviews:
"Stylish" often appears near "great" → assume "stylish" is positive
✅ Conclusion:
Method | Uses | Strength
Dictionary-based | Synonyms/antonyms from a dictionary | Easy and fast
Corpus-based | Co-occurrence with known opinion words | Accurate for domain-specific terms
Both methods help in growing the opinion word list, which improves sentiment analysis accuracy.
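The synonym/antonym expansion can be sketched in plain Python; the seed lexicon and the dictionary entries below are toy data, not a real thesaurus:

```python
# Seed list of known opinion words
lexicon = {"happy": "positive", "terrible": "negative"}

# Toy dictionary entries (a real system would query WordNet or similar)
synonyms = {"happy": ["joyful", "glad"], "terrible": ["awful"]}
antonyms = {"happy": ["sad"], "terrible": ["wonderful"]}

flip = {"positive": "negative", "negative": "positive"}

for word, polarity in list(lexicon.items()):
    for syn in synonyms.get(word, []):
        lexicon.setdefault(syn, polarity)         # synonym keeps the polarity
    for ant in antonyms.get(word, []):
        lexicon.setdefault(ant, flip[polarity])   # antonym gets the opposite

print(lexicon["joyful"], lexicon["sad"])  # positive negative
```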
Opinion spam refers to fake or misleading reviews written to boost or harm a product
or service.
🔹 1. Supervised Learning Techniques
Supervised learning uses labeled reviews (marked real or fake) to train a machine learning model that can detect spam reviews.
➤ How it works:
1. Extract features from each review, such as:
○ Review length
○ Reviewer history
2. Train a classifier on the labeled reviews, such as:
○ Naive Bayes
○ Random Forest
✅ Advantage:
● Can achieve high accuracy if trained on a good dataset.
📌 Example:
If a review says: “Amazing! Great! Super! Love it!” and it’s very short, it might be fake →
Model will flag it.
🔹 2. Unsupervised (Behavior-Based) Techniques
This technique does not need labeled data. It detects fake reviews by finding unusual patterns in user behavior.
✅ Advantage:
● Works even when no labeled spam data is available.
📌 Example:
A user posts 10 five-star reviews in one hour → system flags them as abnormal.
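The example above can be sketched as a simple behaviour check, assuming all we have is (user, timestamp) pairs (the timestamps are toy data):

```python
from collections import defaultdict

# Toy review log: user1 posts a burst of 10 reviews in 10 minutes;
# user2 posts 2 reviews two hours apart (timestamps in seconds)
reviews = [("user1", t) for t in range(0, 600, 60)]
reviews += [("user2", 0), ("user2", 7200)]

times = defaultdict(list)
for user, ts in reviews:
    times[user].append(ts)

flagged = set()
for user, ts_list in times.items():
    ts_list.sort()
    # Flag the user if 10+ reviews fall inside any one-hour (3600 s) window
    for start in ts_list:
        in_window = [t for t in ts_list if start <= t < start + 3600]
        if len(in_window) >= 10:
            flagged.add(user)

print(flagged)  # {'user1'}
```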
🔹 3. Group Spam Detection
Sometimes, spammers work in groups to make fake reviews look real. This method finds coordinated behavior.
✅ Advantage:
● Detects organized spam attacks
📌 Example:
5 users give 5-star reviews to the same phone on the same day using similar words →
suspected spam group.
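One simple way to spot such coordination is word overlap (Jaccard similarity) between reviews of the same product on the same day; all reviews below are made up:

```python
from itertools import combinations

# Toy reviews: (reviewer, product, day, text). The spam group reuses wording.
reviews = [
    ("u1", "phoneX", 1, "great phone amazing camera best buy"),
    ("u2", "phoneX", 1, "great phone amazing camera must buy"),
    ("u3", "phoneX", 1, "amazing camera great phone best value"),
    ("u4", "phoneX", 5, "battery life is disappointing for me"),
]

def jaccard(a, b):
    """Fraction of shared words between two texts."""
    a, b = set(a.split()), set(b.split())
    return len(a & b) / len(a | b)

# Flag reviewer pairs targeting the same product on the same day
# with very similar wording — a sign of a coordinated spam group
suspects = set()
for r1, r2 in combinations(reviews, 2):
    same_target = r1[1] == r2[1] and r1[2] == r2[2]
    if same_target and jaccard(r1[3], r2[3]) > 0.5:
        suspects.update({r1[0], r2[0]})

print(sorted(suspects))  # ['u1', 'u2', 'u3']
```

The honest reviewer u4 posts on a different day with different wording, so only the coordinated trio is flagged.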