Unit II
Module: I & II
Web Mining:
Applications of Web Mining:
• Web mining is the process of discovering patterns, structures,
and relationships in web data. It involves using data mining
techniques to analyze web data and extract valuable insights.
• The applications of web mining are wide-ranging
and include:
Personalized marketing
E-commerce
Search engine optimization
Fraud detection
Sentiment analysis
Web content analysis
Customer service
Healthcare
Process of Web Mining:
• Web mining typically proceeds in four steps: resource finding (collecting web documents), information selection and pre-processing, generalization (discovering patterns), and analysis (validating and interpreting the discovered patterns).
Data Mining vs. Web Mining:
• Definition: Data Mining is the process that attempts to discover patterns and hidden knowledge in large data sets in any system; Web Mining is the application of data mining techniques to automatically discover and extract information from web documents.
• Application: Data Mining is very useful for web page analysis; Web Mining is very useful for a particular website and e-services.
• Target Users: Data Mining serves data scientists and data engineers; Web Mining serves data scientists along with data analysts.
• Access: Data Mining accesses data privately; Web Mining accesses data publicly.
• Tools: Data Mining uses tools such as machine learning algorithms; special tools for Web Mining include Scrapy, PageRank, and Apache log analyzers.
• Skills: Data Mining requires data cleansing, machine learning algorithms, statistics, and probability; Web Mining requires application-level knowledge and data engineering with mathematical modules such as statistics and probability.
Basic Concepts of Information Retrieval:
• Information retrieval (IR) is the study of helping users to find
information that matches their information needs.
• Technically, IR studies the acquisition, organization, storage, retrieval,
and distribution of information.
General architecture of an IR system:
• The user with information need issues a query (user query) to the retrieval
system through the query operations module.
• The retrieval module uses the document index to retrieve those documents
that contain some query terms (such documents are likely to be relevant to
the query), compute relevance scores for them, and then rank the retrieved
documents according to the scores.
• The ranked documents are then presented to the user.
• The document collection is also called the text database, which is
indexed by the indexer for efficient retrieval.
Natural language questions are the most complex, and also the ideal, case: the
user expresses his/her information need as a natural language
question, and the system then finds the answer.
However, such queries are still hard to handle due to the
difficulty of natural language understanding. Nevertheless, this
is an active research area, called question answering.
Definition questions are usually easier to answer because there
are strong linguistic patterns indicating definition sentences.
The query operations module can range from very simple
to very complex. In the simplest case, it simply passes the query
to the retrieval engine after some basic pre-processing, e.g.,
removal of stopwords (words that occur very frequently in text
but carry little meaning, e.g., “the”, “a”, “in”, etc.).
It may also accept user feedback and use it to expand and
refine the original queries.
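As a simple illustration, here is a minimal sketch of this pre-processing step (the stopword list is a small assumed sample, not any standard list):

```python
# Minimal sketch of the query operations module's simplest task:
# tokenize the query and remove stopwords.

STOPWORDS = {"the", "a", "an", "in", "of", "to", "is"}  # assumed sample list

def preprocess_query(query: str) -> list[str]:
    """Lowercase, tokenize on whitespace, and drop stopwords."""
    tokens = query.lower().split()
    return [t for t in tokens if t not in STOPWORDS]

print(preprocess_query("The history of information retrieval"))
# -> ['history', 'information', 'retrieval']
```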
The indexer is the module that indexes the original raw
documents in some data structures to enable efficient retrieval.
The result is the document index.
The retrieval system computes a relevance score for each
indexed document with respect to the query. The documents are then
ranked according to their relevance scores and presented to the user.
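To make the indexing and retrieval flow concrete, here is a toy sketch with an inverted index and a simple count-based relevance score (the data structures and scoring are simplified assumptions, not a production design):

```python
from collections import defaultdict

# Toy document collection (the "text database").
docs = {
    1: "web mining discovers patterns in web data",
    2: "data mining finds hidden knowledge in large data sets",
    3: "information retrieval ranks documents for a user query",
}

# Indexer: build an inverted index mapping term -> {doc_id: term frequency}.
index: dict[str, dict[int, int]] = defaultdict(lambda: defaultdict(int))
for doc_id, text in docs.items():
    for term in text.split():
        index[term][doc_id] += 1

def retrieve(query: str):
    """Score each document by the number of matching query term occurrences,
    then rank the documents by score (highest first)."""
    scores: dict[int, int] = defaultdict(int)
    for term in query.lower().split():
        for doc_id, tf in index.get(term, {}).items():
            scores[doc_id] += tf
    return sorted(scores.items(), key=lambda x: x[1], reverse=True)

print(retrieve("web data mining"))  # doc 1 and doc 2 rank highest
```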
IR Models:
An IR model governs how a document and a query are
represented and how the relevance of a document to a user
query is defined.
There are four main IR models:
1. Boolean model,
2. Vector space model,
3. Language model,
4. Probabilistic model.
The most commonly used models in IR systems and on the
Web are the first three models.
Although these three models represent documents and queries
differently, they use the same framework.
They all treat each document or query as a “bag” of words or
terms; that is, a document is described by a set of distinctive
terms, and each term is associated with a weight. These weights
are used to compute the similarity between each document stored
in the system and the user query.
Given a collection of documents D, let V = {t_1, t_2, ..., t_|V|} be
the set of distinctive terms in the collection, where each t_i is a
term.
The set V is usually called the vocabulary of the collection,
and |V| is its size,
i.e., the number of terms in V.
A weight w_ij > 0 is associated with each term t_i of a
document d_j ∈ D.
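A common concrete choice for the weights w_ij is the TF-IDF scheme. Below is a small sketch computing such weights over a toy collection (this is one standard variant; normalization details differ between systems):

```python
import math
from collections import Counter

docs = [
    "web mining discovers patterns in web data",
    "data mining finds hidden knowledge",
    "information retrieval ranks documents",
]
N = len(docs)

# Vocabulary V: the set of distinct terms in the collection.
tokenized = [d.split() for d in docs]
V = sorted({t for doc in tokenized for t in doc})

# Document frequency df_i: the number of documents containing term t_i.
df = {t: sum(1 for doc in tokenized if t in doc) for t in V}

def tfidf(doc_tokens: list[str]) -> dict[str, float]:
    """w_ij = tf_ij * log(N / df_i), using raw term counts for tf."""
    tf = Counter(doc_tokens)
    return {t: tf[t] * math.log(N / df[t]) for t in tf}

for j, doc in enumerate(tokenized):
    print(j, tfidf(doc))
```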
Queries:
A query q is represented in exactly the same way as a
document in the document collection.
The term weight w_iq of each term t_i in q can also be
computed in the same way as in a normal document, or slightly
differently.
For example, Salton and Buckley suggested the following:

w_iq = (0.5 + (0.5 × f_iq) / max{f_1q, f_2q, ..., f_|V|q}) × ln(N / df_i),

where f_iq is the raw frequency count of term t_i in q, N is the total
number of documents in the collection, and df_i is the number of
documents that contain t_i.
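A small sketch of this query weighting (the collection statistics N and df below are hypothetical values for illustration):

```python
import math
from collections import Counter

def query_weights(query_terms, N, df):
    """Salton-Buckley query term weighting:
    w_iq = (0.5 + 0.5 * f_iq / max_f) * log(N / df_i)."""
    f = Counter(query_terms)
    max_f = max(f.values())
    return {
        t: (0.5 + 0.5 * f[t] / max_f) * math.log(N / df[t])
        for t in f if t in df
    }

# Hypothetical collection statistics for illustration.
df = {"web": 50, "mining": 20, "data": 80}
print(query_weights(["web", "mining", "mining"], N=100, df=df))
```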
Document Retrieval and Relevance Ranking:
In practice, the retrieval system rarely makes a binary decision on
whether a document is relevant to a query. Instead, the documents are
ranked according to their degrees of relevance to the query.
One way to compute the degree of relevance is to calculate
the similarity of the query q to each document d_j in the
document collection D.
There are many similarity measures.
The most well-known one is the cosine similarity, which is
the cosine of the angle between the query vector q and the
document vector d_j:

cosine(q, d_j) = (q · d_j) / (‖q‖ × ‖d_j‖)
Relevance Feedback (Rocchio method): Let the original query vector be q, the set of relevant documents selected
by the user be D_r, and the set of irrelevant documents be D_ir. The
expanded query q_e is computed as:

q_e = α·q + (β / |D_r|) Σ_{d_r ∈ D_r} d_r − (γ / |D_ir|) Σ_{d_ir ∈ D_ir} d_ir,

where α, β, and γ are parameters controlling the contribution of each component.
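A minimal numpy sketch of this expansion (the α, β, γ defaults below are common choices, not values fixed by the method, and the toy vectors are hypothetical):

```python
import numpy as np

def rocchio_expand(q, relevant, irrelevant, alpha=1.0, beta=0.75, gamma=0.15):
    """q_e = alpha*q + beta*mean(relevant) - gamma*mean(irrelevant).
    All inputs are term-weight vectors over the same vocabulary."""
    q_e = alpha * np.asarray(q, dtype=float)
    if len(relevant):
        q_e += beta * np.mean(relevant, axis=0)
    if len(irrelevant):
        q_e -= gamma * np.mean(irrelevant, axis=0)
    return np.maximum(q_e, 0.0)  # negative weights are usually clipped to 0

q = [1.0, 0.0, 1.0]
D_r = [[0.9, 0.8, 0.1], [0.7, 0.6, 0.0]]   # user-marked relevant documents
D_ir = [[0.0, 0.9, 0.9]]                   # user-marked irrelevant documents
print(rocchio_expand(q, D_r, D_ir))
```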
Such feedback methods are simple and efficient to compute, and usually
produce good results. Two other families of techniques are:
1. Machine Learning Methods
2. Pseudo-Relevance Feedback
1. Machine Learning Methods:
Since we have a set of relevant and irrelevant documents, we can
construct a classification model from them. Then the relevance feedback
problem becomes a learning problem.
Any supervised learning method may be used, e.g., naïve Bayesian
classification and SVM.
Building a Rocchio classifier is done by constructing a prototype
vector c_i for each class i, which is either relevant or irrelevant in this
case.
Each test document d_t is then compared with every prototype c_i based
on the cosine similarity, and is assigned to the class of the most similar prototype.
Fig.: Training and testing of a Rocchio classifier
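Below is a minimal sketch of training and testing such a classifier (the α and β values are common defaults from the literature, and the toy vectors are hypothetical):

```python
import numpy as np

def normalize(v):
    n = np.linalg.norm(v)
    return v / n if n > 0 else v

def prototype(pos, neg, alpha=16.0, beta=4.0):
    """Rocchio prototype for a class: the centroid of its (normalized)
    members minus a weighted centroid of the other class's members."""
    pos = np.array([normalize(d) for d in pos])
    neg = np.array([normalize(d) for d in neg])
    return alpha * pos.mean(axis=0) - beta * neg.mean(axis=0)

def classify(d_t, prototypes):
    """Assign the test document d_t to the class whose prototype has the
    highest cosine similarity with d_t."""
    d_t = normalize(np.asarray(d_t, dtype=float))
    sims = {c: float(np.dot(d_t, normalize(p))) for c, p in prototypes.items()}
    return max(sims, key=sims.get)

rel = [np.array([0.9, 0.8, 0.1]), np.array([0.8, 0.7, 0.0])]
irr = [np.array([0.1, 0.2, 0.9])]
protos = {"relevant": prototype(rel, irr), "irrelevant": prototype(irr, rel)}
print(classify([0.7, 0.6, 0.2], protos))  # -> 'relevant'
```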
Apart from the above classic methods, the following learning techniques
are also applicable:
1. Learning from Labeled and Unlabeled Examples (LU Learning) -
small labeled training set.
2. Learning from Positive and Unlabeled Examples (PU Learning) -
implicit feedback.
3. Using Ranking SVM and Language Models - in the implicit feedback
setting, a technique called ranking SVM can be applied.
2. Pseudo-Relevance Feedback (Blind Feedback):
Pseudo-relevance feedback assumes that the top-ranked documents
retrieved for the original query are relevant, and uses them to expand
and refine the query without requiring any user input.
Evaluation Metrics:
A. Order-Unaware Metrics:
1. Precision@k:
This metric quantifies how many items in the top-K results were
relevant.
Mathematically, this is given by:

Precision@k = (number of relevant items in the top k results) / k
A limitation of precision@k is that it doesn’t consider the position of
the relevant items.
Consider two models A and B that have the same number of relevant
results i.e. 3 out of 5.
For model A, the first three items were relevant, while for model B, the
last three items were relevant.
Precision@5 would be the same for both of these models even though
model A is better.
2. Recall@k:
This metric gives how many actual relevant results were shown out of all
actual relevant results for the query.
Mathematically, this is given by:

Recall@k = (number of relevant items in the top k results) / (total number of relevant items for the query)
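The sketch below computes both metrics for the two hypothetical models A and B described above (a 1 marks a relevant result):

```python
def precision_at_k(results, k):
    """Fraction of the top-k results that are relevant."""
    return sum(results[:k]) / k

def recall_at_k(results, k, total_relevant):
    """Fraction of all relevant items that appear in the top-k results."""
    return sum(results[:k]) / total_relevant

model_a = [1, 1, 1, 0, 0]  # relevant items ranked first
model_b = [0, 0, 1, 1, 1]  # relevant items ranked last

# Precision@5 cannot tell the models apart, even though A ranks better.
print(precision_at_k(model_a, 5), precision_at_k(model_b, 5))  # 0.6 0.6
print(recall_at_k(model_a, 5, total_relevant=3))               # 1.0
```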
B. Order Aware Metrics:
While precision, recall, and F1 give us a single-value metric, they do not
consider the order in which the search results are returned.
To address that limitation, order-aware metrics such as Mean Reciprocal
Rank (MRR), Mean Average Precision (MAP), and Normalized Discounted
Cumulative Gain (NDCG) have been devised.
Text and Web Page Pre-Processing:
Stemming:
A stemmer reduces inflected words to a common base form, which can
lose information. For example, the word writing may act as a noun in one
sentence while writes acts as a verb in another.
If your ML model stems both writing and writes to the base form write, the
difference in their respective parts of speech is overlooked, causing some
information to be lost in the process of analysing the text.
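A quick illustration of this behavior with the Porter stemmer (assuming the NLTK package is available; other stemmers behave similarly):

```python
from nltk.stem import PorterStemmer

stemmer = PorterStemmer()
# Both surface forms collapse to the same stem, so the noun/verb
# distinction between them is lost after stemming.
print(stemmer.stem("writing"))  # -> 'write'
print(stemmer.stem("writes"))   # -> 'write'
```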
Hyphens:
Hyphens are usually broken (i.e., removed or replaced with a space) to
deal with inconsistency of usage. For example, some people write
“state-of-the-art”, but others write “state of the art”.
Punctuation Marks:
Punctuation can be dealt with similarly as hyphens.
Case of Letters:
All the letters are usually converted to either the upper or lower case.
Web Page Pre-Processing:
Some important tasks of Web page pre-processing are:
1. Identifying different text fields
2. Identifying anchor text
3. Removing HTML tags
4. Identifying main content blocks
1. Identifying different text fields:
In HTML, there are different text fields, e.g., title, metadata, and body.
Identifying them allows the retrieval system to treat terms in different
fields differently.
In the body text, those emphasized terms (e.g., under header tags <h1>,
<h2>, …, bold tag <b>, etc.) are also given higher weights.
2. Identifying anchor text:
Anchor text associated with a hyperlink is treated specially in search
engines because the anchor text often represents a more accurate
description of the information contained in the page pointed to by its
link.
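A sketch of extracting these text fields and anchor text (assuming the BeautifulSoup (bs4) package; the sample HTML is made up for illustration):

```python
from bs4 import BeautifulSoup

html = """<html><head><title>Web Mining Notes</title></head>
<body><h1>IR Models</h1><p>Plain body text with <b>emphasized</b> terms.</p>
<a href="https://example.org/ir">introduction to IR</a></body></html>"""

soup = BeautifulSoup(html, "html.parser")

# Different text fields can be weighted differently by the retrieval system.
title = soup.title.get_text() if soup.title else ""
headers = [h.get_text() for h in soup.find_all(["h1", "h2", "h3"])]
emphasized = [b.get_text() for b in soup.find_all("b")]
# Anchor text describes the *linked* page, so store it with the URL.
anchors = [(a.get_text(), a.get("href")) for a in soup.find_all("a")]

print(title, headers, emphasized, anchors)
```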
Duplicate Detection:
Web pages are often duplicated or near-duplicated. A document can be
represented by the set of its n-grams (shingles). Given the shingle sets of
two documents, S_n(d1) and S_n(d2), the Jaccard coefficient can be used to compute the
similarity of the two documents:

sim(d1, d2) = |S_n(d1) ∩ S_n(d2)| / |S_n(d1) ∪ S_n(d2)|
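A small sketch of this shingle-based similarity (the choice n = 3 is arbitrary here):

```python
def shingles(text: str, n: int = 3) -> set[tuple[str, ...]]:
    """S_n(d): the set of word n-grams (shingles) of a document."""
    words = text.lower().split()
    return {tuple(words[i:i + n]) for i in range(len(words) - n + 1)}

def jaccard(s1: set, s2: set) -> float:
    """|S_n(d1) ∩ S_n(d2)| / |S_n(d1) ∪ S_n(d2)|."""
    return len(s1 & s2) / len(s1 | s2) if (s1 | s2) else 0.0

d1 = "web mining is the application of data mining to web data"
d2 = "web mining is the application of data mining to the web"
print(jaccard(shingles(d1), shingles(d2)))  # higher values indicate near-duplicates
```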