Unit I

19EAI441: Web Mining

Module: I

Information Retrieval and Web Search
Syllabus
Module I: Information Retrieval and Web Search

Basic Concepts of Information Retrieval; IR Models: Boolean Model, Vector
Space Model, Statistical Language Model; Relevance Feedback; Evaluation
Measures. Text and Web Page Pre-Processing: Stop Word Removal, Stemming,
Other Pre-Processing Tasks for Text, Web Page Pre-Processing, Duplicate
Detection.
Web Mining:

• Web mining is the application of data mining techniques to
automatically discover and extract information from Web
documents and services.

• The main purpose of web mining is to discover useful
information from the World Wide Web and its usage
patterns.
Applications of Web Mining:
• Web mining is the process of discovering patterns, structures,
and relationships in web data. It involves using data mining
techniques to analyze web data and extract valuable insights.
• The applications of web mining are wide-ranging
and include:
  ✓ Personalized marketing
  ✓ E-commerce
  ✓ Search engine optimization
  ✓ Fraud detection
  ✓ Sentiment analysis
  ✓ Web content analysis
  ✓ Customer service
  ✓ Healthcare
Process of Web Mining:

• Web mining can be broadly divided into three types of
mining techniques:
1. Web Content Mining,
2. Web Structure Mining,
3. Web Usage Mining.
Categories of Web Mining:
Comparison Between Data mining and Web mining:
| Points | Data Mining | Web Mining |
|---|---|---|
| Definition | Data mining is the process that attempts to discover patterns and hidden knowledge in large data sets in any system. | Web mining is the application of data mining techniques to automatically discover and extract information from web documents. |
| Application | Data mining is very useful for web page analysis. | Web mining is very useful for a particular website and e-service. |
| Target Users | Data scientists and data engineers. | Data scientists along with data analysts. |
| Access | Data mining accesses data privately. | Web mining accesses data publicly. |
| Structure | Data mining gets its information from explicitly structured data. | Web mining gets its information from structured, unstructured and semi-structured web pages. |
| Problem Type | Clustering, classification, regression, prediction, optimization and control. | Web content mining, Web structure mining. |
| Tools | It includes tools like machine learning algorithms. | Special tools for web mining are Scrapy, PageRank and Apache logs. |
| Skills | It includes approaches for data cleansing, machine learning algorithms, statistics and probability. | It includes application-level knowledge and data engineering, with mathematical modules like statistics and probability. |
Basic Concepts of Information Retrieval:
• Information retrieval (IR) is the study of helping users to find information
that matches their information needs.
• Technically, IR studies the acquisition, organization, storage, retrieval,
and distribution of information.
• Historically, IR is about document retrieval, emphasizing the document as
the basic unit.
General architecture of an IR system:
• A user with an information need issues a query (the user query) to the
retrieval system through the query operations module.
• The retrieval module uses the document index to retrieve those documents
that contain some query terms (such documents are likely to be relevant to
the query), computes relevance scores for them, and then ranks the retrieved
documents according to the scores.
• The ranked documents are then presented to the user.
• The document collection, also called the text database, is indexed by the
indexer for efficient retrieval.

Fig. A general IR system architecture
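
The flow described above (indexer, query operations, retrieval, ranking) can be sketched in a few lines of Python. This is only a minimal illustration under stated assumptions: the function names (tokenize, build_index, retrieve), the three sample documents, and the scoring rule (summed term frequency) are all hypothetical choices made for the example, not part of the architecture itself.

```python
import re
from collections import defaultdict


def tokenize(text):
    """Lowercase the text and split it into alphanumeric terms."""
    return re.findall(r"[a-z0-9]+", text.lower())


def build_index(docs):
    """Indexer: map each term to {doc_id: term frequency}."""
    index = defaultdict(dict)
    for doc_id, text in docs.items():
        for term in tokenize(text):
            index[term][doc_id] = index[term].get(doc_id, 0) + 1
    return index


def retrieve(query, index):
    """Retrieval module: score documents that contain at least one
    query term, then rank them by descending score."""
    scores = defaultdict(int)
    for term in tokenize(query):  # query operations: plain tokenization
        for doc_id, tf in index.get(term, {}).items():
            scores[doc_id] += tf  # assumed score: summed term frequency
    # Ties are broken by document id so the ranking is deterministic.
    return sorted(scores.items(), key=lambda item: (-item[1], item[0]))


# Hypothetical document collection (the "text database").
docs = {
    "d1": "Web mining applies data mining techniques to Web data.",
    "d2": "Information retrieval helps users find relevant documents.",
    "d3": "Web search engines rank documents for a user query.",
}

index = build_index(docs)
ranked = retrieve("web mining", index)
print(ranked)
```

For the query 'web mining', d1 ranks first because both query terms occur in it twice, d3 follows with a single match on 'web', and d2 is not retrieved because it contains no query term. Practical systems replace the summed term frequency with one of the IR models listed in the syllabus.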


A user query represents the user’s information needs and takes
one of the following forms:
1. Keyword queries
2. Boolean queries
3. Phrase queries
4. Proximity queries
5. Full document queries
6. Natural language questions
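
As a rough illustration of how the first three query forms differ, the sketch below runs a keyword query, a Boolean AND query, and a phrase query over a tiny collection. The function names, the sample documents, and the keyword ranking rule (number of distinct query terms matched) are assumptions made for this example; proximity queries, full document queries, and natural language questions are not covered.

```python
import re


def tokenize(text):
    """Lowercase the text and split it into alphanumeric terms."""
    return re.findall(r"[a-z0-9]+", text.lower())


def keyword_query(query, docs):
    """Keyword query: keep documents with at least one query term and
    rank them by how many distinct query terms they contain."""
    terms = set(tokenize(query))
    hits = []
    for doc_id, text in docs.items():
        matched = len(terms & set(tokenize(text)))
        if matched > 0:
            hits.append((doc_id, matched))
    return sorted(hits, key=lambda hit: (-hit[1], hit[0]))


def boolean_and_query(terms, docs):
    """Boolean query (AND only): documents containing every term."""
    wanted = {term.lower() for term in terms}
    return sorted(
        doc_id for doc_id, text in docs.items()
        if wanted <= set(tokenize(text))
    )


def phrase_query(phrase, docs):
    """Phrase query: documents where the terms occur consecutively
    and in the given order."""
    wanted = tokenize(phrase)
    n = len(wanted)
    result = []
    for doc_id, text in docs.items():
        tokens = tokenize(text)
        if any(tokens[i:i + n] == wanted for i in range(len(tokens) - n + 1)):
            result.append(doc_id)
    return sorted(result)


# Hypothetical sample documents.
docs = {
    "d1": "Web mining discovers patterns in Web data.",
    "d2": "Mining the Web for usage patterns.",
    "d3": "Data mining finds hidden knowledge.",
}

keyword_hits = keyword_query("Web mining", docs)
boolean_hits = boolean_and_query(["web", "mining"], docs)
phrase_hits = phrase_query("web mining", docs)
print(keyword_hits, boolean_hits, phrase_hits)
```

The keyword query returns all three documents, with those matching both terms ranked ahead of d3; the Boolean AND query drops d3; the phrase query keeps only d1, the one document in which 'web' is immediately followed by 'mining'.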

1. Keyword queries: The user expresses their information needs with a
list of one or more keywords (or terms), aiming to find documents that
contain some (at least one) or all of the query terms. The terms in the list
are assumed to be connected with a “soft” version of the logical AND.
• For example, if one is interested in finding information about Web mining,
one may issue the query ‘Web mining’ to an IR or search engine system.
‘Web mining’ is treated as ‘Web AND mining’.
• The retrieval system then finds those likely relevant documents and ranks
them suitably to present to the user. Note that a retrieved document does