DMDW-Unit V
1 CS Department MTNC
DATA MINING AND WAREHOUSING- 18UITE64
For example:
If a user wants to search for a particular book, the search engine provides a list of
suggestions.
• Helps to derive information such as the similarity and relationships between different
websites.
Web Structure
• Some of the pages pointing to it themselves have a high PageRank. In other words:
• Pages that are well cited from around the web are worth looking at.
• Pages that have only one citation, but from a highly rated web page, are also worth looking at.
Damping Factor
• The PageRank theory holds that even an imaginary surfer who is randomly
clicking on links will eventually stop clicking. The probability, at any step, that the surfer will
continue is the damping factor d.
Damping Factor d
The damping factor is subtracted from 1, and this term is then added to the
product of the damping factor and the sum of the incoming PageRank scores. So any page's
PageRank is derived in large part from the PageRanks of other pages. The damping factor
adjusts the derived value downward.
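Example:
$$PR(u) = (1 - d) + d \sum_{v \in B(u)} \frac{PR(v)}{\mathrm{OutDegree}(v)}$$
where B(u) is the set of pages that link to page u, OutDegree(v) represents the number of
links going out of the page v, and the parameter d is the damping factor, which can be a real
number between 0 and 1. The value of d is generally taken as 0.8 (0.85 is also widely used).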
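The rule described above (subtract d from 1, then add d times the sum of the incoming PageRank scores, each divided by the out-degree of the linking page) can be sketched as a simple iteration. This is a minimal illustration, not a production implementation: the function name `pagerank`, the three-page graph, the starting rank of 1.0 and the fixed count of 50 iterations are all assumptions made for the example.

```python
def pagerank(links, d=0.8, iterations=50):
    """Iterate PR(u) = (1 - d) + d * sum(PR(v) / OutDegree(v)) over pages v linking to u.

    `links` maps each page to the list of pages it links to (hypothetical input format).
    """
    # Collect every page that appears as a source or as a target.
    pages = set(links)
    for targets in links.values():
        pages.update(targets)

    # Assumed starting value: every page begins with rank 1.0.
    rank = {page: 1.0 for page in pages}

    for _ in range(iterations):
        new_rank = {}
        for page in pages:
            # Sum the contributions of all pages v that link to this page.
            incoming = sum(
                rank[v] / len(links[v])
                for v in links
                if page in links[v]
            )
            new_rank[page] = (1 - d) + d * incoming
        rank = new_rank
    return rank


# Made-up example graph: A links to B and C, B links to C, C links to A.
graph = {"A": ["B", "C"], "B": ["C"], "C": ["A"]}
ranks = pagerank(graph)
for page in sorted(ranks, key=ranks.get, reverse=True):
    print(page, round(ranks[page], 3))
```

In this small graph, page C receives links from both A and B, so it ends up with the highest score, while B, which is cited only by A, ends up lowest.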
• The user receives two lists of pages for a query (authority pages and link, or hub, pages)
A. Social Networks
• Directed graph with weights assigned to its edges
• Nodes represent documents and the edges – citations from one document to other
documents.
• Prestige can be associated with the number of input edges to a node (in-degree).
• Prestige has a recursive nature: it depends on the authority (or, again, the prestige) of the citing documents.
i) Adjacency matrix
$$A_{ij} = \begin{cases} 1 & \text{if document } i \text{ cites document } j \\ 0 & \text{otherwise} \end{cases}$$
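As a small illustration, the citation graph can be stored as an adjacency matrix and the in-degree prestige read off its columns. The sketch assumes the usual convention that entry (i, j) is 1 when document i cites document j and 0 otherwise; the document names and the citation list are invented for the example.

```python
# Hypothetical documents and (citing, cited) pairs.
docs = ["d1", "d2", "d3"]
citations = [("d1", "d2"), ("d3", "d2"), ("d2", "d1")]

index = {doc: i for i, doc in enumerate(docs)}
n = len(docs)

# A[i][j] = 1 if document i cites document j, 0 otherwise.
A = [[0] * n for _ in range(n)]
for citing, cited in citations:
    A[index[citing]][index[cited]] = 1

# In-degree prestige of a document = number of 1s in its column.
in_degree = {doc: sum(A[i][index[doc]] for i in range(n)) for doc in docs}

for row in A:
    print(row)
print(in_degree)
```

Summing a column counts the input edges of that node, which is the simple (non-recursive) prestige measure mentioned above.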
A transverse link is a link between pages with different domain names.
An intrinsic link is a link between pages with the same domain name.
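A link can be classified by comparing the domain parts of the two URLs. The sketch below uses Python's standard `urllib.parse.urlparse`. The helper name `link_type` and the example URLs are hypothetical, and comparing the full host name (the `netloc`) is a simplifying assumption, since it treats `www.example.com` and `example.com` as different domains.

```python
from urllib.parse import urlparse


def link_type(source_url, target_url):
    """Return 'intrinsic' if both URLs share a domain name, else 'transverse'."""
    # Assumption: the host part of the URL stands in for the "domain name".
    source_domain = urlparse(source_url).netloc.lower()
    target_domain = urlparse(target_url).netloc.lower()
    return "intrinsic" if source_domain == target_domain else "transverse"


print(link_type("http://example.com/a.html", "http://example.com/b.html"))
print(link_type("http://example.com/a.html", "http://example.org/index.html"))
```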
iii) Ranking pages with Index node and Reference node
Index node: It is a node whose outdegree is significantly larger than the average outdegree of
the graph.
Reference node: It is a node whose indegree is significantly larger than the average
indegree of the graph.
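The two definitions can be sketched by comparing each node's degree with the graph average. The notes do not say how much larger counts as "significantly larger", so the multiplier `factor=2.0` is purely an assumption, as are the function name and the example graph.

```python
def index_and_reference_nodes(links, factor=2.0):
    """Find index nodes (high outdegree) and reference nodes (high indegree).

    `links` maps each node to the nodes it links to. `factor` is an assumed
    threshold: a degree counts as "significantly larger" when it exceeds
    factor * average degree.
    """
    nodes = set(links)
    for targets in links.values():
        nodes.update(targets)

    out_degree = {node: len(set(links.get(node, []))) for node in nodes}
    in_degree = {node: 0 for node in nodes}
    for targets in links.values():
        for target in set(targets):
            in_degree[target] += 1

    avg_out = sum(out_degree.values()) / len(nodes)
    avg_in = sum(in_degree.values()) / len(nodes)

    index_nodes = {node for node in nodes if out_degree[node] > factor * avg_out}
    reference_nodes = {node for node in nodes if in_degree[node] > factor * avg_in}
    return index_nodes, reference_nodes


# Made-up graph: "hub" links out to many pages, and "d" is linked to by many.
graph = {"hub": ["a", "b", "c", "d"], "a": ["d"], "b": ["d"], "c": ["d"]}
index_nodes, reference_nodes = index_and_reference_nodes(graph)
print(index_nodes, reference_nodes)
```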
iv) Clustering and Determining similar pages
Bibliographic Coupling: For a pair of nodes p and q, the bibliographic coupling is equal to
the number of nodes that have links from both p and q.
Co-citation: For a pair of nodes p and q, the co-citation is the number of nodes that have
links to both p and q.
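Both measures reduce to counting shared neighbours in the link graph. The sketch below follows the standard definitions: bibliographic coupling counts the nodes that both p and q link to, and co-citation counts the nodes that link to both p and q. The function names and the small graph are illustrative assumptions.

```python
def bibliographic_coupling(links, p, q):
    """Number of nodes that receive links from both p and q."""
    return len(set(links.get(p, [])) & set(links.get(q, [])))


def co_citation(links, p, q):
    """Number of nodes that link to both p and q."""
    return sum(1 for targets in links.values() if p in targets and q in targets)


# Made-up graph: p and q both cite y; r and s both cite p and q.
graph = {
    "p": ["x", "y"],
    "q": ["y", "z"],
    "r": ["p", "q"],
    "s": ["p", "q"],
    "t": ["p"],
}
print(bibliographic_coupling(graph, "p", "q"))
print(co_citation(graph, "p", "q"))
```

Pages with a high value on either measure are candidates for the same cluster, since they either cite the same sources or are cited together.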
5.4 Web Usage Mining
Web usage mining mines the web log records (access information of
web pages) and helps to discover the access patterns of users across web pages.
The web server registers a web log entry for every web page access.
Analysis of similarities in web log records can be useful for identifying potential
customers for e-commerce companies.
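One simple way to look for similar access patterns is to compare the sets of pages that different visitors requested. The sketch below assumes the log has already been parsed into (visitor, page) pairs. The addresses, page paths and the choice of Jaccard similarity are all assumptions for illustration, not a method prescribed by these notes.

```python
from collections import defaultdict

# Hypothetical, already-parsed web log entries: (visitor address, requested page).
log = [
    ("10.0.0.1", "/home"), ("10.0.0.1", "/laptops"), ("10.0.0.1", "/cart"),
    ("10.0.0.2", "/home"), ("10.0.0.2", "/laptops"),
    ("10.0.0.3", "/about"),
]


def pages_per_visitor(entries):
    """Group the requested pages by visitor."""
    visited = defaultdict(set)
    for visitor, page in entries:
        visited[visitor].add(page)
    return dict(visited)


def jaccard(a, b):
    """Share of pages two visitors have in common (0.0 to 1.0)."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


visited = pages_per_visitor(log)
similarity = jaccard(visited["10.0.0.1"], visited["10.0.0.2"])
print(round(similarity, 2))
```

A visitor whose pattern closely matches that of a known buyer (here, the one who reached `/cart`) could be flagged as a potential customer.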
Two Approaches:
5.5 Text Mining
Text mining is a component of data mining that deals specifically with unstructured text
data. It uses natural language processing (NLP) techniques to extract useful
information and insights from large amounts of unstructured text. Text mining can be used as
a preprocessing step for data mining or as a standalone process for specific tasks.
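As an example of text mining used as a preprocessing step, unstructured text can be turned into term counts that a later data mining step can work with. This is only a sketch: the tiny stop-word list, the letters-only tokenizer and the function name `term_frequencies` are simplifying assumptions.

```python
import re
from collections import Counter

# Assumed, deliberately tiny stop-word list for the example.
STOP_WORDS = {"the", "a", "of", "and", "is", "to", "in"}


def term_frequencies(text):
    """Lowercase the text, split it into words, drop stop words, count the rest."""
    tokens = re.findall(r"[a-z]+", text.lower())
    return Counter(token for token in tokens if token not in STOP_WORDS)


tf = term_frequencies("Text mining is the mining of unstructured text.")
print(tf.most_common())
```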
5.6 Hierarchy of Categories
When a user enters a query into a search engine, the system often brings back many
different pages. It is then necessary to organize the documents into meaningful groups. There are
many different ways in which we can show how a set of documents are related to one another. One
way is to group together all documents written by the same author, or all documents written in the
same year, or published by the same publisher. We can group them according to subject matter as
well.
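Grouping by a shared attribute, as described above, can be sketched in a few lines. The document records and the helper name `group_by` are invented for illustration; real search results would carry whatever metadata the system actually stores.

```python
from collections import defaultdict

# Made-up search results with hypothetical metadata fields.
documents = [
    {"title": "Web Mining Basics", "author": "Rao", "year": 2019},
    {"title": "Text Clustering Notes", "author": "Rao", "year": 2021},
    {"title": "Data Warehouse Design", "author": "Iyer", "year": 2019},
]


def group_by(docs, key):
    """Group document titles by the value of one metadata field."""
    groups = defaultdict(list)
    for doc in docs:
        groups[doc[key]].append(doc["title"])
    return dict(groups)


by_author = group_by(documents, "author")
by_year = group_by(documents, "year")
print(by_author)
print(by_year)
```

The same set of documents falls into different groups depending on the attribute chosen.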
5.7 Text Clustering
PART B
PART C