DMDW-Unit V

Uploaded by

Devika G
Copyright
© All Rights Reserved

DATA MINING AND WAREHOUSING- 18UITE64

UNIT V WEB MINING


5. Introduction
Web mining is the application of data mining techniques to discover patterns from
the World Wide Web. As the name suggests, the information is gathered by mining the
web. It uses automated tools to discover and extract data from servers and web
documents, and it gives organizations access to both structured and unstructured
information from browser activity, server logs, website and link structure,
page content and other sources.

5.1 Web Mining


Web mining is the use of data mining techniques to automatically discover and
extract information from Web documents and services. Its main purpose is to discover
useful information from the World Wide Web and its usage patterns.
Applications of Web Mining:
1. Web mining helps to improve the power of web search engines by classifying
web documents and identifying web pages.
2. It is used for Web searching (e.g., Google, Yahoo) and vertical searching (e.g.,
FatLens, Become).
3. Web mining is used to predict user behavior.
4. Web mining is very useful for improving a particular website or e-service, e.g., landing
page optimization.

1 CS Department MTNC

There are three types of Web mining:

o Web content mining (text, images, records, etc.)
o Web structure mining (hyperlinks, tags, etc.)
o Web usage mining (HTTP logs, application server logs, etc.)


5.2 Web Content Mining


Web content mining focuses mainly on the structure within a document (intra-document), while Web
structure mining tries to discover the link structure of the hyperlinks at the inter-document level.
Based on the topology of the hyperlinks, Web structure mining categorizes Web pages and
generates information such as the similarity and relationship between different Web sites.
• Web content mining can be used to mine useful data, information and knowledge
from web page content. Web content can encompass a very broad range of data.
• Web structure mining helps to find useful knowledge or information patterns in the
structure of hyperlinks.
• Because web data is heterogeneous and lacks structure, automated discovery of new
knowledge patterns can be challenging.
• Web content mining scans and mines the text, images and groups of web
pages according to the content of the input (query), and search engines display the results as a list.

For example:
If a user searches for a particular book, the search engine provides a list of
suggestions.


5.3 Web structure mining


Web structure mining is the process of discovering structure information from the web.
The structure of the web graph consists of web pages as nodes and hyperlinks as edges connecting
related pages. Structure mining basically produces a structured summary of a particular website. It
identifies relationships between web pages linked by information or by a direct link connection.
The goal of Web structure mining is to generate a structural summary of the Web site and
Web page.
Example: Web structure mining can be very useful to companies to determine the connection
between two commercial websites.
Uses:

• The resulting model can be used to classify web pages.

• Helpful to create information such as the similarity and relationship between different
websites.

• Useful for discovering website type.

Web Structure


A. Algorithms for Web Structure Mining


i) PageRank algorithm (Google founders)
• The Google search engine ranks documents as a function of both the query terms and the
hyperlink structure of the web.
• It looks at the number of links to a website and the importance of the referring links.
• It is computed before the user enters the query.

A page will have a high PageRank if:

• There are many pages pointing to it.

• There are some pages pointing to it which have high PageRank.

In other words:

• Pages well cited from around the web are worth looking at.

• Pages that have only one citation, but from a highly ranked web page, are also worth looking at.

Damping Factor
• The PageRank theory holds that even an imaginary surfer who is randomly
clicking on links will eventually stop clicking. The probability, at any step, that the person will
continue is the damping factor d.


The damping factor d is subtracted from 1, and this term is then added to the
product of the damping factor and the sum of the incoming PageRank scores. So any page's
PageRank is derived in large part from the PageRanks of other pages. The damping factor
adjusts the derived value downward.

Computing PageRank
The PageRank of a page u is computed as follows:

$$PR(u) = (1 - d) + d \sum_{v \in B(u)} \frac{PR(v)}{\mathrm{OutDegree}(v)}$$

where B(u) is the set of pages that link to u, OutDegree(v) represents the number of links going
out of page v, and d is the damping factor, which can be a real number between 0 and 1. The value
of d is generally taken as 0.85.
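
The computation described above can be sketched as a simple iteration: each page's score is (1 - d) plus d times the sum, over the pages linking to it, of their PageRank divided by their out-degree. This is a minimal illustration and not the notes' own implementation. The function name `pagerank`, the three-page graph `toy_links`, the starting score of 1.0 and the fixed iteration count are all assumptions made for the example.

```python
def pagerank(links, d=0.85, iterations=50):
    """Repeatedly apply PR(u) = (1 - d) + d * sum(PR(v) / OutDegree(v))
    over every page v that links to u."""
    # Collect every page that appears as a source or as a link target.
    pages = set(links)
    for targets in links.values():
        pages.update(targets)

    rank = {page: 1.0 for page in pages}  # assumed starting score
    for _ in range(iterations):
        new_rank = {}
        for u in pages:
            # Share of PageRank flowing into u from each page v that links to it.
            incoming = sum(
                rank[v] / len(links[v])
                for v in links
                if u in links[v]
            )
            new_rank[u] = (1 - d) + d * incoming
        rank = new_rank
    return rank


# Hypothetical three-page web: A links to B and C, B links to C, C links to A.
toy_links = {"A": ["B", "C"], "B": ["C"], "C": ["A"]}
ranks = pagerank(toy_links)
for page in sorted(ranks, key=ranks.get, reverse=True):
    print(page, round(ranks[page], 3))
```

In this toy graph, page C receives links from both A and B, so it ends up with the highest score, while B, which is cited only by A and shares A's vote with C, ends up lowest. Pages with no outgoing links are not treated specially in this sketch.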


ii) HITS algorithm (Hyperlink-Induced Topic Search)

• The user receives two lists of pages for a query (authority pages and hub pages).

• Computations are done after the user enters the query.
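
The notes do not spell out how HITS computes its two lists, so the following is only a sketch of the commonly described iteration: a page's authority score is the sum of the hub scores of the pages linking to it, and a page's hub score is the sum of the authority scores of the pages it links to, with both normalized after every round. The function name `hits`, the iteration count and the small query graph `query_links` are assumptions made for the example.

```python
import math


def hits(links, iterations=20):
    """Return (hub, authority) score dictionaries for a small link graph."""
    pages = set(links)
    for targets in links.values():
        pages.update(targets)

    hub = {p: 1.0 for p in pages}
    auth = {p: 1.0 for p in pages}
    for _ in range(iterations):
        # Authority update: add up the hub scores of pages pointing to p.
        auth = {p: sum(hub[q] for q in links if p in links[q]) for p in pages}
        norm = math.sqrt(sum(v * v for v in auth.values())) or 1.0
        auth = {p: v / norm for p, v in auth.items()}

        # Hub update: add up the authority scores of the pages p points to.
        hub = {p: sum(auth[t] for t in links.get(p, [])) for p in pages}
        norm = math.sqrt(sum(v * v for v in hub.values())) or 1.0
        hub = {p: v / norm for p, v in hub.items()}
    return hub, auth


# Hypothetical pages retrieved for one query.
query_links = {
    "portal": ["paper1", "paper2"],
    "blog": ["paper1"],
    "paper1": [],
    "paper2": [],
}
hub_scores, authority_scores = hits(query_links)
best_hub = max(hub_scores, key=hub_scores.get)
best_authority = max(authority_scores, key=authority_scores.get)
print("best hub:", best_hub, "| best authority:", best_authority)
```

Because the scores depend on the pages retrieved for the query, this computation runs after the query is entered, unlike PageRank.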

B. Social Networks
• A social network is modeled as a directed graph with weights assigned to its edges.
• Nodes represent documents, and the edges represent citations from one document to other
documents.
• Prestige can be associated with the number of input edges to a node (in-degree).
• Prestige has a recursive nature: it depends on the authority (or, again, the prestige) of the citations.
i) Adjacency matrix

$$A(i, j) = \begin{cases} 1 & \text{if document } i \text{ cites document } j \\ 0 & \text{otherwise} \end{cases}$$

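ii) Prestige score
The prestige of document j is the sum of the prestige scores of the documents that cite it:

$$p(j) = \sum_{i} A(i, j)\, p(i)$$

where A(i, j) is 1 if document i cites document j and 0 otherwise.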
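
As a small illustration of the adjacency matrix and of in-degree prestige, the sketch below assumes the usual convention that entry (i, j) is 1 when document i cites document j and 0 otherwise. The helper names, the document labels and the citation list are hypothetical. Starting from a prestige of 1 for every document, one application of the recursive rule gives each document's in-degree.

```python
def adjacency_matrix(docs, citations):
    """Build A where A[i][j] = 1 if docs[i] cites docs[j], else 0."""
    index = {doc: i for i, doc in enumerate(docs)}
    matrix = [[0] * len(docs) for _ in docs]
    for citing, cited in citations:
        matrix[index[citing]][index[cited]] = 1
    return matrix


def prestige_step(matrix, prestige):
    """One recursive update: a document's new prestige is the sum of the
    current prestige of every document that cites it."""
    n = len(matrix)
    return [
        sum(matrix[i][j] * prestige[i] for i in range(n))
        for j in range(n)
    ]


# Hypothetical citation data: d1 and d2 cite d3, and d3 cites d1.
docs = ["d1", "d2", "d3"]
citations = [("d1", "d3"), ("d2", "d3"), ("d3", "d1")]

A = adjacency_matrix(docs, citations)
in_degree = prestige_step(A, [1, 1, 1])  # all-ones prestige yields the in-degree
print(A)
print(dict(zip(docs, in_degree)))
```

Applying `prestige_step` repeatedly (with normalization) is one way to follow the recursive definition further; that extension is left out here.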

A transverse link is a link between pages with different domain names.
An intrinsic link is a link between pages with the same domain name.
iii) Ranking pages with Index node and Reference node
• Index node: a node whose outdegree is significantly larger than the average outdegree of
the graph.
• Reference node: a node whose indegree is significantly larger than the average
indegree of the graph.
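
The two definitions above can be sketched as follows. The notes do not say how much larger counts as "significantly larger", so the `factor` threshold of 2.0 is purely an assumption, as are the function name and the small site graph.

```python
def classify_nodes(links, factor=2.0):
    """Return (index_nodes, reference_nodes) for a directed link graph.

    A node is flagged when its degree exceeds `factor` times the graph
    average; the factor is an assumed stand-in for "significantly larger".
    """
    nodes = set(links)
    for targets in links.values():
        nodes.update(targets)

    out_degree = {n: len(set(links.get(n, []))) for n in nodes}
    in_degree = {n: 0 for n in nodes}
    for targets in links.values():
        for target in set(targets):
            in_degree[target] += 1

    avg_out = sum(out_degree.values()) / len(nodes)
    avg_in = sum(in_degree.values()) / len(nodes)

    index_nodes = {n for n in nodes if out_degree[n] > factor * avg_out}
    reference_nodes = {n for n in nodes if in_degree[n] > factor * avg_in}
    return index_nodes, reference_nodes


# Hypothetical site: a sitemap links to four pages, which all link to the FAQ.
site_links = {
    "sitemap": ["a", "b", "c", "d"],
    "a": ["faq"],
    "b": ["faq"],
    "c": ["faq"],
    "d": ["faq"],
}
index_nodes, reference_nodes = classify_nodes(site_links)
print("index nodes:", index_nodes)
print("reference nodes:", reference_nodes)
```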
iv) Clustering and Determining similar pages
• Bibliographic coupling: for a pair of nodes p and q, the bibliographic coupling is equal to
the number of nodes that have links from both p and q.
• Co-citation: for a pair of nodes p and q, the co-citation is the number of nodes that have
links to both p and q.
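
Both similarity measures reduce to counting shared neighbours in the link graph. The sketch below uses the standard definitions: bibliographic coupling counts the nodes that both p and q link to, and co-citation counts the nodes that link to both p and q. The function names and the example graph are hypothetical.

```python
def bibliographic_coupling(links, p, q):
    """Number of nodes that receive a link from both p and q."""
    return len(set(links.get(p, [])) & set(links.get(q, [])))


def co_citation(links, p, q):
    """Number of nodes that link to both p and q."""
    return sum(
        1 for targets in links.values()
        if p in targets and q in targets
    )


# Hypothetical link graph: each key links to the pages in its list.
graph = {
    "p": ["x", "y", "z"],
    "q": ["y", "z"],
    "r": ["p", "q"],
    "s": ["p", "q"],
    "t": ["p"],
}
coupling = bibliographic_coupling(graph, "p", "q")  # p and q both link to y and z
cocited = co_citation(graph, "p", "q")              # r and s link to both p and q
print(coupling, cocited)
```

A high value of either measure suggests that p and q deal with related content, which is why the counts can serve as similarity scores for clustering pages.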
5.4 Web Usage Mining
Web usage mining is used for mining web log records (access information of
web pages) and helps to discover the user access patterns of web pages.
• The web server registers a web log entry for every web page access.
• Analysis of similarities in web log records can be useful to identify potential
customers for e-commerce companies.
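
To make the idea of mining web log records concrete, the sketch below parses a few log lines and derives two simple access patterns: how often each page is requested, and which pages each client visits in order. It assumes the entries follow the Common Log Format. The sample lines, the regular expression and the function names are illustrative assumptions, not part of the notes.

```python
import re
from collections import Counter, defaultdict

# Assumed layout: host ident user [time] "METHOD path protocol" status size
LOG_PATTERN = re.compile(
    r'(?P<host>\S+) \S+ \S+ \[(?P<time>[^\]]+)\] '
    r'"(?P<method>\S+) (?P<path>\S+) [^"]*" (?P<status>\d{3}) \S+'
)


def parse_log(lines):
    """Turn raw log lines into dictionaries, skipping lines that do not match."""
    entries = []
    for line in lines:
        match = LOG_PATTERN.match(line)
        if match:
            entries.append(match.groupdict())
    return entries


def access_patterns(entries):
    """Count hits per page and list the pages requested by each host in order."""
    page_hits = Counter(entry["path"] for entry in entries)
    paths_by_host = defaultdict(list)
    for entry in entries:
        paths_by_host[entry["host"]].append(entry["path"])
    return page_hits, dict(paths_by_host)


# Hypothetical web log records.
sample_log = [
    '10.0.0.1 - - [01/Jan/2024:10:00:00 +0000] "GET /home HTTP/1.1" 200 512',
    '10.0.0.1 - - [01/Jan/2024:10:00:05 +0000] "GET /cart HTTP/1.1" 200 734',
    '10.0.0.2 - - [01/Jan/2024:10:01:00 +0000] "GET /home HTTP/1.1" 200 512',
    'malformed line',
]
entries = parse_log(sample_log)
page_hits, paths_by_host = access_patterns(entries)
print(page_hits.most_common(1))
print(paths_by_host)
```

Visitors whose page sequences include pages such as a shopping cart could then be singled out as potential customers, in the spirit of the second bullet above.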


Two Approaches:

i) General Access Pattern Tracking

ii) Customized Usage Tracking


5.5 Text Mining
Text mining is a component of data mining that deals specifically with unstructured text
data. It uses natural language processing (NLP) techniques to extract useful
information and insights from large amounts of unstructured text. Text mining can be used as
a preprocessing step for data mining or as a standalone process for specific tasks.
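
As an illustration of text mining used as a preprocessing step, the sketch below turns unstructured text into term-frequency counts that a later data mining step could consume. It is a deliberately simple stand-in for a real NLP pipeline. The stop-word list, the tokenization rule and the sample documents are assumptions made for the example.

```python
import re
from collections import Counter

# A tiny, assumed stop-word list; real systems use much larger ones.
STOP_WORDS = {"the", "a", "an", "of", "and", "is", "to", "in", "from"}


def tokenize(text):
    """Lowercase the text, keep alphabetic tokens and drop stop words."""
    return [
        token for token in re.findall(r"[a-z]+", text.lower())
        if token not in STOP_WORDS
    ]


def term_frequencies(documents):
    """Return one term-count table per document."""
    return [Counter(tokenize(doc)) for doc in documents]


# Hypothetical documents.
documents = [
    "Web mining extracts patterns from the Web.",
    "Text mining is a component of data mining.",
]
tables = term_frequencies(documents)
for table in tables:
    print(table.most_common(2))
```

The resulting count tables are structured data, so techniques such as clustering or classification can be applied to them directly.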


5.6 Hierarchy of Categories
When a user enters a query into a search engine, the system often brings back many
different pages. It is then necessary to organize the documents into meaningful groups. There are
many different ways in which we can show how a set of documents are related to one another. One
way is to group together all documents written by the same author, or all documents written in the
same year, or published by the same publisher. We can group them according to subject matter as
well.


A problem with assigning documents to single categories within a hierarchy
(as seen in, for example, Yahoo) is that most documents discuss several different topics
simultaneously. A better approach is to describe documents by a set of categories as well as attributes (such
as source, date, genre, and author), and to provide good interfaces for manipulating these labels.
For this purpose, Feldman et al. proposed an elegant data structure called a concept
hierarchy. A concept hierarchy is a directed acyclic graph of concepts, where each concept is
identified by a unique name. An arc from concept A to concept B denotes that A is a more general concept
than B. We can tag the text with concepts: each text document is tagged by a set of concepts that
correspond to its content.
Tagging a document with a concept implicitly entails tagging it with all the
ancestors of that concept in the hierarchy. It is, therefore, desirable that a document be tagged with
the lowest concepts possible. The method to automatically tag a document to the hierarchy is a
top-down approach. An evaluation function determines whether a document currently tagged to a
node can also be tagged to any of its child nodes. If so, the tag moves down the hierarchy until it
cannot be moved any further.
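
The top-down tagging procedure can be sketched as below. The notes do not define the evaluation function, so `keyword_overlap` is a hypothetical stand-in that checks whether a document shares a word with a concept's keyword set. The hierarchy, the keyword sets and the function names are likewise assumptions. The sketch starts every document at the root and pushes the tag down while some child still fits; `ancestors` then recovers the concepts that the final tags imply.

```python
def ancestors(parents, concept):
    """Return every concept reachable by following parent arcs upward."""
    found = set()
    stack = list(parents.get(concept, []))
    while stack:
        current = stack.pop()
        if current not in found:
            found.add(current)
            stack.extend(parents.get(current, []))
    return found


def tag_top_down(children, root, document_words, fits):
    """Move a tag from the root toward the lowest concepts that still fit."""
    tags = set()
    frontier = [root]
    while frontier:
        node = frontier.pop()
        matching = [c for c in children.get(node, []) if fits(document_words, c)]
        if matching:
            frontier.extend(matching)  # the tag moves down to the fitting children
        else:
            tags.add(node)             # the tag cannot be moved any further
    return tags


# Hypothetical concept hierarchy (arcs point from general to specific).
children = {
    "science": ["computing", "biology"],
    "computing": ["data mining", "networks"],
}
parents = {}
for parent, kids in children.items():
    for kid in kids:
        parents.setdefault(kid, []).append(parent)

# Hypothetical keyword sets used by the stand-in evaluation function.
CONCEPT_KEYWORDS = {
    "computing": {"algorithm", "data"},
    "biology": {"cell"},
    "data mining": {"clustering", "pattern"},
    "networks": {"router"},
}


def keyword_overlap(document_words, concept):
    """Assumed evaluation function: the document mentions a concept keyword."""
    return bool(document_words & CONCEPT_KEYWORDS[concept])


document_words = set("a clustering algorithm finds each pattern in the data".split())
tags = tag_top_down(children, "science", document_words, keyword_overlap)
implied = set().union(*(ancestors(parents, tag) for tag in tags))
print("tags:", tags, "| implied ancestors:", implied)
```

Storing only the lowest tags keeps the annotation compact, since the more general concepts can always be recovered by walking up the hierarchy.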
The outcome of this process is a hierarchy of documents in which, at each node,
there is a set of documents having a common concept associated with the node. The hierarchy of
documents resulting from the tagging process is useful for many text mining processes. It is assumed
that the hierarchy of concepts is known a priori. We can even obtain such a hierarchy of documents
without a concept hierarchy, by using any hierarchical clustering algorithm, which produces such a
hierarchy.
Popescul et al. posed a related problem of tagging keywords to a set of
documents arranged in a hierarchy. The method has two phases. It starts with a bag of
keywords at the leaf level and moves up the hierarchy. The set of keywords for a non-leaf node is
obtained by combining the keywords of all its child nodes. After finding the set of keywords
for the root node, the process continues with a top-down pass. If a keyword at any node is
equally probable for all of its child nodes, it stays at that node. Otherwise, if the keyword is more probable for a child
node, it is moved down to the most probable set of child nodes.


5.7 Text Clustering


PART B

1. Classify the different types of Data mining
2. Explain Web content mining
3. Explain the key points of Web usage mining
4. Explain the key points of Text Mining
5. Write about Hierarchy of Categories
6. Elaborate the concept of Text Clustering

PART C

1. Brief about Web mining
2. Categorize Web mining with diagrammatic representation
3. Discuss about Web structure mining
