
19EAI441: Web Mining

Module: I &II

Information Retrieval and


Web Search
Syllabus

Module: I Information Retrieval and Web Search -

Basic Concepts of Information Retrieval, IR Models - Boolean Model, Vector

Space Model, Statistical Language Model, Relevance Feedback, Evaluation

Measures. Text and Web Page Pre-Processing-Stop word removal, Stemming,

other Pre-Processing Tasks for Text, Web Page Pre-Processing, Duplicate

detection.
Web Mining:

• Web Mining is the process of applying Data Mining techniques to
automatically discover and extract information from Web
documents and services.
• The main purpose of web mining is discovering useful
information from the World-Wide Web and its usage patterns.
Applications of Web Mining:
• Web mining is the process of discovering patterns, structures,
and relationships in web data. It involves using data mining
techniques to analyze web data and extract valuable insights.
• The applications of web mining are wide-ranging
and include:
 Personalized marketing
 E-commerce
 Search engine optimization
 Fraud detection
 Sentiment analysis
 Web content analysis
 Customer service
 Healthcare
Process of Web Mining:

 Web mining can be broadly divided into three different


types of techniques of mining:
1. Web Content Mining,
2. Web Structure Mining,
3. Web Usage Mining.
Categories of Web Mining:
Comparison Between Data Mining and Web Mining:

Definition:
- Data Mining is the process that attempts to discover patterns and hidden
knowledge in large data sets in any system.
- Web Mining is the process of applying data mining techniques to automatically
discover and extract information from web documents.

Application:
- Data Mining is very useful for web page analysis.
- Web Mining is very useful for a particular website and e-services.

Target Users:
- Data Mining: data scientists and data engineers.
- Web Mining: data scientists along with data analysts.

Access:
- Data Mining accesses data privately.
- Web Mining accesses data publicly.

Structure:
- Data Mining gets information from an explicit structure.
- Web Mining gets information from structured, unstructured and
semi-structured web pages.

Problem Type:
- Data Mining: clustering, classification, regression, prediction, optimization
and control.
- Web Mining: web content mining, web structure mining, web usage mining.

Tools:
- Data Mining includes tools like machine learning algorithms.
- Web Mining uses special tools such as Scrapy, PageRank and Apache log analyzers.

Skills:
- Data Mining includes approaches for data cleansing, machine learning
algorithms, statistics and probability.
- Web Mining includes application-level knowledge and data engineering, with
mathematical modules like statistics and probability.
Basic Concepts of Information Retrieval:
• Information retrieval (IR) is the study of helping users to find
information that matches their information needs.
• Technically, IR studies the acquisition, organization, storage, retrieval,
and distribution of information.
General architecture of an IR system:
• The user with information need issues a query (user query) to the retrieval
system through the query operations module.
• The retrieval module uses the document index to retrieve those documents
that contain some query terms (such documents are likely to be relevant to
the query), compute relevance scores for them, and then rank the retrieved
documents according to the scores.
• The ranked documents are then presented to the user.
• The document collection is also called the text database, which is
indexed by the indexer for efficient retrieval.

Fig. A general IR system architecture


A user query represents the user’s information needs, which is in
one of the following forms:
1. Keyword queries
2. Boolean queries
3. Phrase queries
4. Proximity queries
5. Full document queries
6. Natural language questions
1. Keyword queries: The user expresses his/her information needs with a
list of (at least one) keywords (or terms) aiming to find documents that contain
some (at least one) or all the query terms. The terms in the list are assumed to
be connected with a “soft” version of the logical AND.
 For example, if one is interested in finding information about Web mining,
one may issue the query ‘Web mining’ to an IR or search engine system.
‘Web mining’ is treated as ‘Web AND mining’.
 The retrieval system then finds those likely relevant documents and ranks
them suitably to present to the user.
2. Boolean Queries:
 The user can use Boolean operators, AND, OR, and NOT to
construct complex queries. Thus, such queries consist of terms
and Boolean operators.
 For example: ‘data OR Web’ is a Boolean query, which
requests documents that contain the word ‘data’ or ‘Web’.
 A page is returned for a Boolean query if the query is logically
true in the page (i.e., exact match). Although one can write
complex Boolean queries using the three operators, users
seldom write such queries.
 Search engines usually support a restricted version of Boolean
queries.
3. Phrase Queries:
 Such a query consists of a sequence of words that makes up a
phrase.
 Each returned document must contain at least one instance of
the phrase.
 In a search engine, a phrase query is normally enclosed with
double quotes.
 For Example: one can issue the following phrase query
(including the double quotes), “Web mining techniques and
applications” to find documents that contain the exact phrase.
4. Proximity Queries:

 The proximity query is a relaxed version of the phrase query


and can be a combination of terms and phrases. Proximity
queries seek the query terms within close proximity to each
other. The closeness is used as a factor in ranking the returned
documents or pages.
 For Example: A document that contains all query terms close
together is considered more relevant than a page in which the
query terms are far apart. Some systems allow the user to
specify the maximum allowed distance between the query
terms.
5. Full Document Queries:

 When the query is a full document, the user wants to find


other documents that are similar to the query document.
 Some search engines (e.g., Google) allow the user to issue
such a query by providing the URL of a query page.
 Additionally, in the returned results of a search engine, each
snippet may have a link called “more like this” or “similar
pages.”
 When the user clicks on the link, a set of pages similar to the
page in the snippet is returned.
6. Natural Language Questions:

 This is the most complex case, and also the ideal case. The
user expresses his/her information need as a natural language
question. The system then finds the answer.
 However, such queries are still hard to handle due to the
difficulty of natural language understanding. Nevertheless, this
is an active research area, called question answering.
 Definition questions are usually easier to answer because there
are strong linguistic patterns indicating definition sentences.
 The query operations module can range from very simple
to very complex. In the simplest case, it does nothing but just
pass the query to the retrieval engine after some simple pre-
processing, e.g., removal of stopwords (words that occur very
frequently in text but have little meaning, e.g., “the”, “a”, “in”,
etc.).
 It may also accept user feedback and use it to expand and
refine the original queries.
 The indexer is the module that indexes the original raw
documents in some data structures to enable efficient retrieval.
The result is the document index.
 The retrieval system computes a relevance score for each
indexed document to the query. According to their relevance
scores, the documents are ranked and presented to the user.
IR Models:
 An IR model governs how a document and a query are
represented and how the relevance of a document to a user
query is defined.
 There are four main IR models:
1. Boolean model,
2. Vector space model,
3. Language model,
4. Probabilistic model.
 The most commonly used models in IR systems and on the
Web are the first three models.
 Although these three models represent documents and queries
differently, they use the same framework.
 They all treat each document or query as a “bag” of words or
terms.
 That is, a document is described by a set of distinctive terms.
 Each term is associated with a weight. (similarity between each
document stored in the system and user query).
 Given a collection of documents D, let V = {t1, t2, ..., t|V|} be
the set of distinctive terms in the collection, Where, ti is a
term.
 The set V is usually called the vocabulary of the collection,
and |V| is its size,
i.e., the number of terms in V.
 A weight wij > 0 is associated with each term ti of a
document dj ∈ D.

 For a term that does not appear in document dj, wij = 0.

 Each document dj is thus represented with a term vector,

dj = (w1j, w2j, ..., w|V|j).
1. Boolean Model:
 The Boolean model is one of the earliest and simplest
information retrieval models.
 It uses the notion of exact matching to match documents to
the user query.
 Both the query and the retrieval are based on Boolean
algebra.
Document Representation:
 In the Boolean model, documents and queries are
represented as sets of terms.
 That is, each term is only considered present or absent in a
document. Using the vector representation of the document
above, the weight wij (∈ {0, 1}) of term ti in document dj is 1
if ti appears in document dj, and 0 otherwise, i.e.,

wij = 1 if ti appears in dj, and wij = 0 otherwise.
Boolean Queries:
 Query terms are combined logically using the Boolean
operators AND, OR, and NOT.
 Thus, a Boolean query has a precise semantics.
 For instance, the query, ((x AND y) AND (NOT z)) says that
a retrieved document must contain both the terms x and y but
not z.
 As another example, the query expression (x OR y) means
that at least one of these terms must be in each retrieved
document.
Document Retrieval:
 Given a Boolean query, the system retrieves every document
that makes the query logically true.
 Thus, the retrieval is based on the binary decision criterion,
i.e., a document is either relevant or irrelevant. This is called
exact match.
 For example, the following query can be issued to Google,
‘mining – data + “equipment price”’,
where, + (inclusion) and – (exclusion) are similar to Boolean
operators AND and NOT respectively.
 The operator OR may be supported as well.
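As a minimal sketch of Boolean (exact-match) retrieval, the set-of-terms representation above can be implemented directly. The toy corpus and function names here are illustrative, not from the text:

```python
# A tiny toy corpus; each document becomes a set of terms (present/absent only).
docs = {
    1: "web mining extracts knowledge from web data",
    2: "data mining finds patterns in large data sets",
    3: "equipment price lists for mining equipment",
}
doc_terms = {d: set(text.split()) for d, text in docs.items()}

def boolean_and(terms, doc_terms):
    """Return ids of documents containing ALL query terms (exact match)."""
    return {d for d, ts in doc_terms.items() if all(t in ts for t in terms)}

def boolean_not(term, doc_terms):
    """Return ids of documents NOT containing the term."""
    return {d for d, ts in doc_terms.items() if term not in ts}

# 'mining AND data' matches docs 1 and 2; 'mining AND (NOT data)' matches doc 3.
and_hits = boolean_and(["mining", "data"], doc_terms)
not_hits = boolean_and(["mining"], doc_terms) & boolean_not("data", doc_terms)
```

Note there is no ranking here: a document either satisfies the query or it does not, which is exactly the limitation the vector space model addresses next.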
Advantages of the Boolean Model:

 The simplest model, which is based on sets.


 Easy to understand and implement.
 It only retrieves exact matches.
 It gives the user, a sense of control over the system.
Disadvantages of the Boolean Model:

 No ranking for retrieved documents.


 The model’s similarity function is Boolean. Hence, there
would be no partial matches.
 In this model, the Boolean operator usage has much more
influence than a critical word.
 The query language is expressive, but it is complicated too.
2. Vector space model:
 A document in the vector space model is represented
as a weight vector, in which each component weight is
computed based on some variation of TF or TF-IDF
scheme.

 The weight wij of term ti in document dj is no longer in {0,
1} as in the Boolean model, but can be any number.
Term Frequency (TF) Scheme:
 In this method, the weight of a term ti in document dj is the
number of times that ti appears in document dj, denoted by
fij.

 Normalization may also be applied to this raw count.
TF-IDF Scheme: (Term frequency - Inverse Document Frequency)
 This is the most well known weighting scheme,
There are several variations of this scheme. Here we
only give the most basic one.
 Then, the normalized term frequency (denoted by tfij) of ti
in dj is given by,

tfij = fij / max{f1j, f2j, …, f|V|j},

 where the maximum is computed over all terms that appear


in document dj.

 If term ti does not appear in dj then tfij = 0. Recall that |V| is


the vocabulary size of the collection.
 The inverse document frequency (denoted by idfi) of term
ti is given by:

idfi = log(N / dfi),

where N is the total number of documents in the collection and dfi
is the number of documents in which term ti appears.
 The final TF-IDF term weight is given by:

wij = tfij × idfi.
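The basic TF-IDF scheme above can be sketched as follows; the toy corpus is illustrative, and idf is computed as log(N/df) per the scheme described:

```python
import math

# Toy corpus: each document is a list of terms.
docs = [
    "web mining web data".split(),
    "data mining".split(),
    "information retrieval".split(),
]
N = len(docs)

def tf_idf(doc, docs):
    """Compute w_ij = tf_ij * idf_i for every term in one document."""
    counts = {t: doc.count(t) for t in set(doc)}
    max_f = max(counts.values())                 # max frequency in this document
    weights = {}
    for t, f in counts.items():
        df = sum(1 for d in docs if t in d)      # document frequency of t
        weights[t] = (f / max_f) * math.log(N / df)
    return weights

w = tf_idf(docs[0], docs)
# 'web' occurs twice and only in doc 0, so it gets the highest weight;
# 'mining' and 'data' each occur once and in two documents, so they tie.
```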

Queries:
 A query q is represented in exactly the same way as a
document in the document collection.
 The term weight wiq of each term ti in q can also be
computed in the same way as in a normal document, or slightly
differently.
 For example, Salton and Buckley suggested the following:

wiq = (0.5 + (0.5 × fiq) / max{f1q, f2q, …, f|V|q}) × idfi.
Document Retrieval and Relevance Ranking:
 Unlike in the Boolean model, documents are not simply accepted or
rejected; instead, they are ranked according to their degrees of
relevance to the query.
 One way to compute the degree of relevance is to calculate
the similarity of the query q to each document dj in the
document collection D.
 There are many similarity measures.
 The most well known one is the cosine similarity, which is
the cosine of the angle between the query vector q and the
document vector dj,

cosine(q, dj) = (q · dj) / (||q|| × ||dj||).

 Cosine similarity is also widely used in text/document


clustering.
 The dot product of the two vectors is another similarity
measure,

sim(q, dj) = q · dj.

 Ranking of the documents is done using their similarity


values. The top ranked documents are regarded as more
relevant to the query.
 Another way to assess the degree of relevance is to directly
compute a relevance score for each document to the query.
 The Okapi method and its variations are popular
techniques in this setting.
 It has been shown that Okapi variations are more effective
than cosine for short query retrieval.
 Since it is easier to present the formula directly using the “bag” of words
notation of documents than vectors, document dj will be denoted by dj
and query q will be denoted by q.
 Additional notations are as follows:
 ti is a term,
 fij is the raw frequency count of term ti in document dj,
 fiq is the raw frequency count of term ti in query q,
 N is the total number of documents in the collection,
 dfi is the number of documents that contain the term ti,
 dlj is the document length (in bytes) of dj,
 avdl is the average document length of the collection.
 The Okapi relevance score of a document dj for a query q is:

okapi(dj, q) = Σ ti∈q,dj ln((N − dfi + 0.5) / (dfi + 0.5))
× ((k1 + 1)fij) / (k1(1 − b + b·dlj/avdl) + fij)
× ((k2 + 1)fiq) / (k2 + fiq),

where k1 (between 1.0 and 2.0), b (usually 0.75) and k2 (between 1 and
1000) are parameters.
 Yet another score function is the pivoted normalization
weighting (pnw) score function, given by:

pnw(dj, q) = Σ ti∈q,dj ((1 + ln(1 + ln(fij))) / ((1 − s) + s·dlj/avdl)) × fiq × ln((N + 1) / dfi),

where s is a parameter (usually set to 0.2).


 Note that these are empirical functions based on intuitions and
experimental evaluations.
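A sketch of Okapi-style scoring with typical parameter values (k1 = 1.2, b = 0.75, k2 = 100); the toy corpus is illustrative:

```python
import math

def okapi(query, doc, docs, k1=1.2, b=0.75, k2=100):
    """Okapi score of one document for a query (docs are term lists)."""
    N = len(docs)
    dl = len(doc)
    avdl = sum(len(d) for d in docs) / N
    score = 0.0
    for t in set(query):
        if t not in doc:
            continue                                # sum runs over ti in q and dj
        f_ij = doc.count(t)
        f_iq = query.count(t)
        df = sum(1 for d in docs if t in d)
        idf = math.log((N - df + 0.5) / (df + 0.5))
        tf_doc = ((k1 + 1) * f_ij) / (k1 * ((1 - b) + b * dl / avdl) + f_ij)
        tf_q = ((k2 + 1) * f_iq) / (k2 + f_iq)
        score += idf * tf_doc * tf_q
    return score

docs = ["web mining".split(), "data mining".split(), "information retrieval".split()]
s1 = okapi(["web"], docs[0], docs)   # docs[0] contains 'web'
s2 = okapi(["web"], docs[1], docs)   # docs[1] does not, so s2 is 0
```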
3. Statistical language models (OR) Language
model:
 This model based on probability and have foundations in
statistical theory.
 Information retrieval using language models was first
proposed by Ponte and Croft.
 Let the query q be a sequence of terms, q = q1q2…qm and
the document collection D be a set of documents, D = {d1,
d2, …, dN}.
 In the language modeling approach, we consider the
probability of a query q as being “generated” by a
probabilistic model based on a document dj, i.e., Pr(q|dj).
 To rank documents in retrieval, we are interested in
estimating the posterior probability Pr(dj|q).
 Using the Bayes rule, we have

Pr(dj|q) = Pr(q|dj) Pr(dj) / Pr(q).
 For ranking, Pr(q) is not needed as it is the same for every


document.
 Pr(dj) is usually considered uniform and thus will not affect
ranking. We only need to compute Pr(q|dj).
 Based on the multinomial distribution and the unigram
model, we have

Pr(q|dj) = Pr(q1q2…qm|dj) = Π |V| i=1 Pr(ti|dj)^fiq,

where fiq is the number of times that term ti occurs in q, and
Σ |V| i=1 Pr(ti|dj) = 1.
 The retrieval problem is reduced to estimating Pr(ti|dj),
which can be the relative frequency,

Pr(ti|dj) = fij / |dj|.

 Recall that fij is the number of times that term ti occurs in
document dj.
 |dj| denotes the total number of words in dj.
 However, one problem with this estimation is that a term that
does not appear in dj has the probability of 0, which
underestimates the probability of the unseen term in the
document.
 A non-zero probability is typically assigned to each unseen
term in the document, which is called smoothing.
 The name smoothing comes from the fact that these
techniques tend to make distributions more uniform, by
adjusting low probabilities such as zero probabilities
upward, and high probabilities downward.
 Traditional additive smoothing is,

Pr(ti|dj) = (λ + fij) / (λ|V| + |dj|).

 When λ = 1, it is the Laplace smoothing, and
when 0 < λ < 1, it is the Lidstone smoothing.
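Query-likelihood scoring with additive (Lidstone) smoothing can be sketched as follows; documents and λ = 0.5 are illustrative choices:

```python
import math

def lm_score(query, doc, vocab_size, lam=0.5):
    """log Pr(q|d) under the unigram model with additive smoothing:
    Pr(t|d) = (lam + f) / (lam * |V| + |d|)."""
    dl = len(doc)
    log_p = 0.0
    for t in query:
        f = doc.count(t)
        log_p += math.log((lam + f) / (lam * vocab_size + dl))
    return log_p   # higher (less negative) means better match

doc1 = "web mining discovers patterns from web data".split()
doc2 = "cooking recipes for pasta".split()
vocab = set(doc1) | set(doc2)
q = ["web", "mining"]
# doc1 contains both query terms, so it scores higher than doc2;
# thanks to smoothing, doc2 still gets a non-zero probability.
```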
Relevance Feedback:
 It is a process where the user identifies some relevant and irrelevant
documents in the initial list of retrieved documents, and the system then
creates an expanded query by extracting some additional terms from the
sample relevant and irrelevant documents for a second round of retrieval.
 The relevance feedback process may be repeated until the user is satisfied
with the retrieved result.
The Rocchio Method:
 Uses the user-identified relevant and irrelevant documents to expand the
original query.

 Let the original query vector be q, the set of relevant documents selected
by the user be Dr, and the set of irrelevant documents be Dir.
 The expanded query qe is computed as follows,

qe = α·q + (β/|Dr|) Σ dr∈Dr dr − (γ/|Dir|) Σ dir∈Dir dir,

where α, β and γ are parameters.


 The original query q is still needed because it directly reflects the user’s
information need.
 Relevant documents are considered more important than irrelevant
documents.
 The subtraction is used to reduce the influence of those terms that are
not discriminative (i.e., they appear in both relevant and irrelevant
documents), and those terms that appear in irrelevant documents only.
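The Rocchio update can be sketched as below; the parameter values (α = 1, β = 0.75, γ = 0.15) and the toy vectors are illustrative assumptions, not prescribed by the text:

```python
def rocchio(q, relevant, irrelevant, alpha=1.0, beta=0.75, gamma=0.15):
    """Expand query vector q (term->weight dict) from user feedback."""
    expanded = {t: alpha * w for t, w in q.items()}
    for d in relevant:                      # add centroid of relevant docs
        for t, w in d.items():
            expanded[t] = expanded.get(t, 0.0) + beta * w / len(relevant)
    for d in irrelevant:                    # subtract centroid of irrelevant docs
        for t, w in d.items():
            expanded[t] = expanded.get(t, 0.0) - gamma * w / len(irrelevant)
    # Terms whose weight drops to zero or below are usually discarded.
    return {t: w for t, w in expanded.items() if w > 0}

q   = {"web": 1.0, "mining": 1.0}
rel = [{"web": 1.0, "mining": 1.0, "crawler": 1.0}]
irr = [{"web": 1.0, "spam": 1.0}]
qe = rocchio(q, rel, irr)
# 'crawler' is pulled into the expanded query; 'spam' ends up negative
# and is dropped.
```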

 These methods are simple and efficient to compute, and usually
produce good results. Two further approaches to relevance feedback are:
1. Machine Learning Methods
2. Pseudo-Relevance Feedback
1. Machine Learning Methods:
 Since we have a set of relevant and irrelevant documents, we can
construct a classification model from them. Then the relevance feedback
problem becomes a learning problem.
 Any supervised learning method may be used, e.g., naïve Bayesian
classification and SVM.
 Building a Rocchio classifier is done by constructing a prototype
vector ci for each class i, which is either relevant or irrelevant in this
case

 In classification, cosine similarity is applied.

 That is, each test document dt is compared with every prototype ci based
on the cosine similarity, and dt is assigned to the class of the most
similar prototype.
Fig.: Training and testing of a Rocchio classifier

Apart from the above classic methods, the following learning techniques
are also applicable:
1. Learning from Labeled and Unlabeled Examples (LU Learning) -
small labeled training set.
2. Learning from Positive and Unlabeled Examples (PU Learning) -
implicit feedback.
3. Using Ranking SVM and Language Models - implicit feedback
setting, a technique called ranking SVM.
2. Pseudo-Relevance Feedback: (Blind Feedback)

 Pseudo-relevance feedback is another technique used to improve retrieval


effectiveness.
 It provides a method for automatic local analysis.
 The approach simply assumes that the top-ranked documents are
likely to be relevant.
 Advantage:

Does not require assessors as in an explicit relevance feedback system.


Evaluation Measures:
 Most software products we encounter today have some form of search
functionality integrated into them.
 We search for content on Google, videos on YouTube, products on
Amazon, messages on Slack, emails on Gmail, people on Facebook,
and so on.
 We can search for items by writing our queries in a search box and
the ranking model in their system gives us back the top-N most
relevant results.

How do we evaluate how good the top-N results are?


 We answer the above question by explaining the common offline metrics used
in learning-to-rank problems.
 These metrics are useful not only for evaluating search results but also
for problems like keyword extraction and item recommendation.

Problem Setup 1: Binary Relevance


 Let’s take a simple toy example to understand the details and trade-
offs of various evaluation metrics.
 We have a ranking model that gives us back 5-most relevant results for
a certain query.
 The first, third, and fifth results were relevant as per our ground-truth
annotation.

A. Order-Unaware Metrics:
1. Precision@k:
 This metric quantifies how many items in the top-K results were
relevant.
Mathematically, this is given by:

Precision@k = (# of relevant items in top k) / k.
 A limitation of precision@k is that it doesn’t consider the position of
the relevant items.
 Consider two models A and B that have the same number of relevant
results i.e. 3 out of 5.
 For model A, the first three items were relevant, while for model B, the
last three items were relevant.
 Precision@5 would be the same for both of these models even though
model A is better.
2. Recall@k:
 This metric gives how many actual relevant results were shown out of all
actual relevant results for the query.
Mathematically, this is given by:

Recall@k = (# of relevant items in top k) / (total # of relevant items).
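Both order-unaware metrics for the binary-relevance toy setup above can be sketched as follows (relevance is a list of 0/1 flags for the returned results):

```python
def precision_at_k(rels, k):
    """Fraction of the top-k results that are relevant."""
    return sum(rels[:k]) / k

def recall_at_k(rels, k, total_relevant):
    """Fraction of all relevant items that appear in the top k."""
    return sum(rels[:k]) / total_relevant

# First, third, and fifth of 5 results relevant, as in the toy example.
rels = [1, 0, 1, 0, 1]
p5 = precision_at_k(rels, 5)     # 3/5 = 0.6
r5 = recall_at_k(rels, 5, 3)     # 3/3 = 1.0

# Position-blindness: model A (relevant first) and model B (relevant last)
# get the same Precision@5 even though A is clearly better.
same = precision_at_k([1, 1, 1, 0, 0], 5) == precision_at_k([0, 0, 1, 1, 1], 5)
```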
B. Order Aware Metrics:
 While precision, recall, and F1 give us a single-value metric, they don’t
consider the order in which the returned search results are sent.
 To solve that limitation, people have devised order-aware metrics given
below:
Text and Web Page Pre-Processing:

What is text pre-processing?


 Text pre-processing is the process of transforming unstructured text to
structured text to prepare it for analysis.
 When you pre-process text before feeding it to algorithms, you increase
the accuracy and efficiency of said algorithms by removing noise and
other inconsistencies in the text that can make it hard for the computer to
understand.
 Making the text easier to understand also helps to reduce the time and
resources required for the computer to pre-process data.
Processes involved in text pre-processing:
1. Stop-word removal
2. Stemming
1. Stop-word removal:
 Stop-words are extremely common words that carry little meaning on their
own and don't add any additional value to the data.
 Words like a, about, an, are, as, at, be, by, for, from, how, in, is, of,
on, or, that, the, these, this, to, was, what, when, where, who, will,
with are called stop-words.
 Stop-word removal also helps to increase the efficiency of your
model.
 Since it reduces the size of our dataset, it makes it more
manageable and increases the accuracy of NLP tasks.
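Stop-word removal can be sketched with a small hand-picked list; real systems use much longer lists (e.g. the one shipped with NLTK):

```python
# A tiny illustrative stop-word list (subset of the words named above).
STOP_WORDS = {"a", "an", "the", "is", "of", "on", "in", "to", "for", "and"}

def remove_stop_words(text):
    """Lowercase, tokenize on whitespace, and drop stop-words."""
    return [t for t in text.lower().split() if t not in STOP_WORDS]

tokens = remove_stop_words("The process of mining the Web is useful")
# Only the content-bearing words survive.
```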
2. Stemming:
 Stemming reduces words to a common base form: coding, coder, and coded
all share the base word code.
 ML models more often than not need to recognize that these words are all
derived from one base word.
 They can then work with your text without the tenses, prefixes, and suffixes
that we as humans would normally need to make sense of it.
 Stemming your texts not only reduces the number of words the model has
to work with but, by extension, also improves the efficiency of the model.
 Although the efficiency of a model is increased with this technique, it
also removes important information from your text and could cause
some words to be wrongly categorised by the model.
 An example of this would be the difference
between writing and writes in the sentences below:

 In the first sentence the word writing represents a noun, while writes in
the second sentence represents a verb.
 If your ML models stems both writing and writes to the base write the
difference in their respective parts of speech is overlooked causing some
information to be lost in the process of analysing the text.
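A toy suffix-stripping stemmer illustrates both the idea and its lossiness; production systems use the Porter or Snowball stemmers (e.g. `nltk.stem.PorterStemmer`), and this crude rule set is purely illustrative:

```python
def crude_stem(word):
    """Strip one common suffix, keeping a stem of at least 3 letters."""
    for suffix in ("ing", "ers", "er", "ed", "es", "s"):
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word

words = ["coding", "coder", "coded", "codes"]
stems = [crude_stem(w) for w in words]
# All four collapse to the single stem 'cod' -- close to, but not exactly,
# the base word 'code', which is one way stemming loses information.
```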

Other Pre-Processing Tasks for Text:


1. Digits
2. Hyphens
3. Punctuation Marks
4. Case of Letters
Digits:
 Numbers and terms that contain digits are removed in traditional IR
systems, except for some specific types. For example: dates, times.

Hyphens:
 Breaking hyphens is usually applied to deal with inconsistency of
usage. For example, some people use “state-of-the-art”, but others use
“state of the art”.

Punctuation Marks:
 Punctuation can be dealt with similarly as hyphens.

Case of Letters:
 All the letters are usually converted to either the upper or lower case.
Web Page Pre-Processing:
Some important of Web Page Pre-Processing,
1. Identifying different text fields
2. Identifying anchor text
3. Removing HTML tags
4. Identifying main content blocks
1. Identifying different text fields:
 In HTML, there are different text fields, e.g., title, metadata, and body.
 Identifying them allows the retrieval system to treat terms in different
fields differently.
 In the body text, those emphasized terms (e.g., under header tags <h1>,
<h2>, …, bold tag <b>, etc.) are also given higher weights.
2. Identifying anchor text:
 Anchor text associated with a hyperlink is treated specially in search
engines because the anchor text often represents a more accurate
description of the information contained in the page pointed to by its
link.

3. Removing HTML tags:


 The removal of HTML tags can be dealt with similarly to punctuation.
 One issue needs careful consideration, which affects proximity queries
and phrase queries.
 For example, after naive tag removal, “cite this article” at the bottom of the
left column of a page would join “Main Page” on the right, but they should
not be joined.
4. Identifying main content blocks:
 A typical Web page, especially a commercial page, contains a large amount
of information that is not part of the main content of the page.
For example, it may contain banner ads, navigation bars, copyright notices,
etc., which can lead to poor results for search and mining.
We briefly discuss two techniques for finding such blocks in Web pages.
1. Partitioning based on visual cues:
 This method uses visual information to help find main content blocks in a
page. Visual or rendering information of each HTML element in a page can
be obtained from the Web browser.
2. Tree matching: This method is based on the observation that in most
commercial Web sites pages are generated by using some fixed templates.
Duplicate Detection:
 There are different types of duplication of pages and contents on the
Web.
 Copying a page is usually called duplication or replication, and
copying an entire site is called mirroring.
 Duplicate pages and mirror sites are often used to improve efficiency of
browsing and file downloading worldwide due to limited bandwidth across
different geographic regions and poor or unpredictable network
performances.
 Several methods can be used to find duplicate information. The
simplest method is to hash the whole document, e.g., using the MD5
algorithm, or computing an aggregated number (e.g., checksum).
 One efficient duplicate detection technique is based on n-grams (also
called shingles).
 An n-gram is simply a consecutive sequence of words of a fixed
window size n.
 For example, the sentence, “John went to school with his brother,” can be
represented with five 3-gram phrases “John went to”, “went to school”, “to
school with”, “school with his”, and “with his brother”.

 Let Sn(d) be the set of distinctive n-grams (or shingles) contained in
document d. Each n-gram may be coded with a number or an MD5 hash.

 Given the n-gram representations of the two documents d1 and d2,
Sn(d1) and Sn(d2), the Jaccard coefficient can be used to compute the
similarity of the two documents,

sim(d1, d2) = |Sn(d1) ∩ Sn(d2)| / |Sn(d1) ∪ Sn(d2)|.
 A threshold is used to determine whether d1 and d2 are likely to be
duplicates of each other.
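The shingling approach can be sketched as below, reusing the example sentence from the text; the near-duplicate second document is an illustrative addition:

```python
def shingles(text, n=3):
    """Set of word n-grams (shingles) of a fixed window size n."""
    words = text.split()
    return {tuple(words[i:i + n]) for i in range(len(words) - n + 1)}

def jaccard(s1, s2):
    """Jaccard coefficient of two shingle sets."""
    return len(s1 & s2) / len(s1 | s2) if s1 | s2 else 0.0

d1 = "John went to school with his brother"
d2 = "John went to school with his sister"   # near-duplicate
sim = jaccard(shingles(d1), shingles(d2))
# d1 yields 5 3-grams; the two documents share 4 of the 6 distinct 3-grams,
# so sim = 4/6. A threshold (e.g. 0.5) would flag these as likely duplicates.
```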


UNIT II

CO2: Composing queries and integrating


spanning techniques.
Inverted Index :
• An inverted index is an index data
structure storing a mapping from content,
such as words or numbers, to its locations
in a document or a set of documents. In
simple words, it is a hashmap-like data
structure that directs you from a word to a
document or a web page.
Inverted indexing for text retrieval:
• The inverted index is a data structure that
allows efficient, full-text searches in the
database. It is a very important part of
information retrieval systems and search
engines that stores a mapping of words (or
any type of search terms) to their locations
in the database table or document.
Purpose of the inverted index:
• The purpose of an inverted index is to
allow fast full-text searches, at a cost of
increased processing when a document is
added to the database. The inverted file
may be the database file itself, rather than
its index.
Difference between hash index and
inverted index
• The hash index is just a mapping from an
index key to the exact location of the given
row in memory (primarily used for memory
optimized tables in relational databases)
whereas an inverted index is actually the
mapping from a word to the documents in
which it is contained.
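Building and querying an inverted index can be sketched as below; the corpus is an illustrative toy, and an AND query is answered by intersecting posting sets:

```python
from collections import defaultdict

def build_inverted_index(docs):
    """Map each term to the set of ids of documents containing it."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for term in text.lower().split():
            index[term].add(doc_id)
    return index

docs = {
    1: "web mining and data mining",
    2: "inverted index for text retrieval",
    3: "web search uses an inverted index",
}
index = build_inverted_index(docs)

# AND query 'inverted index': intersect the posting sets of the two terms,
# without scanning the document texts themselves.
hits = index["inverted"] & index["index"]
```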
Types of indexing
• There are three types of indexing
namely Ordered, Single-level, and multi-
level. Single Level Indexing is divided into
three types namely Primary(index table is
created using primary keys),
Secondary(index table is created using
candidate keys), and Clustered(index table
is created using non-key values).
Advantages of inverted index:
• One of the main advantages of using
inverted indexes for text retrieval is that
they allow fast and efficient query
processing. By storing the term-document
associations in a compact and sorted way,
inverted indexes can quickly retrieve the
documents that match a query, without
scanning the entire collection.
• Does Google use inverted index?
• Searching through individual pages for
keywords and topics would be a very slow
process for search engines to identify
relevant information. Instead, search
engines (including Google) use an
inverted index, also known as a reverse
index.
