Unit7 Advance Topics Unit 8 Search Engines

Uploaded by

Nishan shah Thakuri


Chapter 7: Advanced Applications

A. Web Mining

Web mining is the application of data mining techniques to extract knowledge from Web data, i.e.
Web Content, Web Structure and Web Usage data.

Web Mining Taxonomy

Web Mining can be broadly divided into three distinct categories, according to the kinds of data
to be mined.
a. Web Content Mining:
- Web Content Mining is the process of extracting useful information from the contents of
Web documents.
- Content data corresponds to the collection of facts a Web page was designed to convey to
the users.
- It may consist of text, images, audio, video, or structured records such as lists and tables.
- Text mining and its application to Web content has been the most widely researched area.
Issues addressed in text mining include topic discovery, extraction of association patterns,
clustering of Web documents, and classification of Web pages.

b. Web Structure Mining:

- The structure of a typical Web graph consists of Web pages as nodes, and hyperlinks as
edges connecting related pages.
- Web Structure Mining is the process of discovering structure information from the Web.
This can be further divided into two kinds based on the kind of structure information used:
  - Hyperlinks: A hyperlink is a structural unit that connects a location in a Web
    page to a different location, either within the same Web page or on a different Web
    page. A hyperlink that connects to a different part of the same page is called an
    Intra-Document Hyperlink, and a hyperlink that connects two different pages is
    called an Inter-Document Hyperlink.
  - Document Structure: In addition, the content within a Web page can be
    organized in a tree-structured format, based on the various HTML and XML tags
    within the page. Mining efforts here have focused on automatically extracting
    document object model (DOM) structures out of documents.
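
The distinction between intra-document and inter-document hyperlinks can be sketched in a few lines of Python. This is only an illustration using the standard library: the class name LinkExtractor, the function classify_links, the sample page and the example.com URLs are all assumptions made here, not part of the notes.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin, urldefrag


class LinkExtractor(HTMLParser):
    """Collects the href value of every <a> tag in a document."""

    def __init__(self):
        super().__init__()
        self.hrefs = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.hrefs.append(value)


def classify_links(page_url, html):
    """Split a page's hyperlinks into intra- and inter-document links."""
    parser = LinkExtractor()
    parser.feed(html)
    base, _ = urldefrag(page_url)  # page URL without any #fragment
    intra, inter = [], []
    for href in parser.hrefs:
        # Resolve relative links, then drop the fragment before comparing.
        target, _fragment = urldefrag(urljoin(page_url, href))
        (intra if target == base else inter).append(href)
    return intra, inter


# Hypothetical page used only for illustration.
sample_html = """
<html><body>
  <a href="#summary">Jump to summary</a>
  <a href="page2.html">Next page</a>
  <a href="http://example.org/other.html#top">Elsewhere</a>
  <h2 id="summary">Summary</h2>
</body></html>
"""
intra, inter = classify_links("http://example.com/page1.html", sample_html)
print("Intra-document:", intra)
print("Inter-document:", inter)
```

The inter-document links found this way are the edges of the Web graph described above, with the pages as nodes.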

c. Web Usage Mining:

- Web Usage Mining is the application of data mining techniques to discover interesting
usage patterns from Web data, in order to understand and better serve the needs of Web-
based applications.

- Usage data captures the identity or origin of Web users along with their browsing behavior
at a Web site.
- Web usage mining itself can be classified further depending on the kind of usage data
considered:
  - Web Server Data: User logs are collected by the Web server. Typical data
    includes IP address, page reference and access time.
  - Application Server Data: Commercial application servers such as WebLogic and
    StoryServer have significant features to enable E-commerce applications to be
    built on top of them with little effort. A key feature is the ability to track
    various kinds of business events and log them in application server logs.
  - Application Level Data: New kinds of events can be defined in an application,
    and logging can be turned on for them, generating histories of these specially
    defined events.
- It must be noted, however, that many end applications require a combination of one or
more of the techniques applied in the above categories.
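
As a minimal sketch of working with Web server data, the snippet below pulls the IP address, page reference and access time out of log lines and derives two simple usage summaries. It assumes the server writes the Common Log Format; the sample lines, the pattern name LOG_PATTERN and the function parse_line are invented here for illustration.

```python
import re
from collections import Counter
from datetime import datetime

# Assumed layout (Common Log Format):
# host ident user [time] "method page protocol" status bytes
LOG_PATTERN = re.compile(
    r'(?P<ip>\S+) \S+ \S+ \[(?P<time>[^\]]+)\] '
    r'"(?P<method>\S+) (?P<page>\S+) [^"]*" (?P<status>\d{3}) \S+'
)


def parse_line(line):
    """Return IP address, page reference and access time, or None if no match."""
    match = LOG_PATTERN.match(line)
    if match is None:
        return None
    return {
        "ip": match.group("ip"),
        "page": match.group("page"),
        "time": datetime.strptime(match.group("time"), "%d/%b/%Y:%H:%M:%S %z"),
    }


# Hypothetical log entries used only for illustration.
sample_log = [
    '192.0.2.1 - - [10/Oct/2023:13:55:36 +0000] "GET /index.html HTTP/1.1" 200 2326',
    '192.0.2.1 - - [10/Oct/2023:13:56:01 +0000] "GET /products.html HTTP/1.1" 200 1045',
    '192.0.2.7 - - [10/Oct/2023:14:02:11 +0000] "GET /index.html HTTP/1.1" 200 2326',
]

records = [r for r in map(parse_line, sample_log) if r is not None]

# Usage pattern 1: how often each page was requested.
page_counts = Counter(r["page"] for r in records)

# Usage pattern 2: the sequence of pages requested from each IP address.
pages_by_ip = {}
for r in records:
    pages_by_ip.setdefault(r["ip"], []).append(r["page"])

print(page_counts)
print(pages_by_ip)
```

Real usage mining would add steps such as session identification and removal of robot traffic before looking for patterns.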

Challenges:
i. Too huge for effective data warehousing and data mining.
ii. Too complex and heterogeneous.
iii. Growing and changing rapidly.
iv. Broad diversity of user communities.
v. Only a small portion of the information on the Web is truly relevant or useful.

The Page Rank Algorithm

The original Page Rank algorithm was described by Lawrence Page and Sergey Brin in
several publications. It is given by

PR(A) = (1-d) + d (PR(T1)/C(T1) + ... + PR(Tn)/C(Tn))

where
PR(A) is the Page Rank of page A,
PR(Ti) is the Page Rank of pages Ti which link to page A,
C(Ti) is the number of outbound links on page Ti and
d is a damping factor which can be set between 0 and 1.

- Page Rank does not rank web sites as a whole, but is determined for each page
individually. Further, the Page Rank of page A is recursively defined by the Page Ranks of
those pages which link to page A.
- The Page Rank of pages Ti which link to page A does not influence the PageRank of page
A uniformly. Within the Page Rank algorithm, the Page Rank of a page T is always
weighted by the number of outbound links C(T) on page T. This means that the more
outbound links a page T has, the less will page A benefit from a link to it on page T.
- The weighted Page Rank of pages Ti is then added up. The outcome of this is that an
additional inbound link for page A will always increase page A's Page Rank.
- Finally, the sum of the weighted Page Ranks of all pages Ti is multiplied by a damping
factor d which can be set between 0 and 1. Thereby, the extent of the PageRank benefit that a
page receives from another page linking to it is reduced.

A Different Notation of the PageRank Algorithm

Lawrence Page and Sergey Brin have published two different versions of their Page Rank
algorithm in different papers. In the second version of the algorithm, the Page Rank of page A
is given as

PR(A) = (1-d) / N + d (PR(T1)/C(T1) + ... + PR(Tn)/C(Tn))

Where N is the total number of all pages on the web. The second version of the algorithm,
indeed, does not differ fundamentally from the first one.
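
One way to see why the two versions do not differ fundamentally is to divide the first equation by N. The labels PR_1 and PR_2 for the first and second version are notation introduced here for the comparison.

```latex
\begin{align*}
PR_1(A) &= (1-d) + d \sum_{i=1}^{n} \frac{PR_1(T_i)}{C(T_i)} \\
\frac{PR_1(A)}{N} &= \frac{1-d}{N} + d \sum_{i=1}^{n} \frac{PR_1(T_i)/N}{C(T_i)}
\end{align*}
```

The second line has exactly the form of the second version, so PR_2(A) = PR_1(A) / N satisfies it for every page. Assuming the equations have a unique solution (which holds for d < 1), the two versions therefore differ only by the constant factor N and rank the pages in the same order.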

The Characteristics of Page Rank

The characteristics of Page Rank shall be illustrated by a small example.

We regard a small web consisting of three pages A, B and C, whereby page A links to the
pages B and C, page B links to page C and page C links to page A. According to Page and
Brin, the damping factor d is usually set to 0.85, but to keep the calculation simple we set it to
0.5. The exact value of the damping factor d admittedly has effects on Page Rank, but it does
not influence the fundamental principles of Page Rank. So, we get the following equations for
the Page Rank calculation:
PR(A) = 0.5 + 0.5 PR(C)
PR(B) = 0.5 + 0.5 (PR(A) / 2)
PR(C) = 0.5 + 0.5 (PR(A) / 2 + PR(B))

These equations can easily be solved. We get the following Page Rank values for the single
pages:

PR(A) = 14/13 = 1.07692308
PR(B) = 10/13 = 0.76923077
PR(C) = 15/13 = 1.15384615

The sum of all pages' Page Ranks is 3 and thus equals the total number of web pages. This is
not a result specific to our simple example. For the three-page example it is easy to solve the
corresponding system of equations to determine the Page Rank values. In practice, the web
consists of billions of documents and it is not possible to find a solution by inspection.
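
The three equations of the example can also be solved exactly by machine. The sketch below rearranges them into a linear system and applies Gauss-Jordan elimination with exact fractions; the helper name solve_linear_system is an assumption made for this illustration, and the approach is only practical for very small systems such as this one.

```python
from fractions import Fraction


def solve_linear_system(matrix, rhs):
    """Solve matrix * x = rhs exactly by Gauss-Jordan elimination."""
    n = len(rhs)
    # Augmented matrix [matrix | rhs], held as exact fractions.
    aug = [[Fraction(v) for v in row] + [Fraction(b)]
           for row, b in zip(matrix, rhs)]
    for col in range(n):
        # Choose a row with a non-zero entry in this column as the pivot.
        pivot = next(r for r in range(col, n) if aug[r][col] != 0)
        aug[col], aug[pivot] = aug[pivot], aug[col]
        pivot_value = aug[col][col]
        aug[col] = [v / pivot_value for v in aug[col]]
        # Eliminate this column from every other row.
        for r in range(n):
            if r != col:
                factor = aug[r][col]
                aug[r] = [a - factor * b for a, b in zip(aug[r], aug[col])]
    return [row[-1] for row in aug]


half, quarter = Fraction(1, 2), Fraction(1, 4)

# Unknowns in the order PR(A), PR(B), PR(C), with d = 0.5.
coefficients = [
    [1, 0, -half],         # PR(A) - 0.5 PR(C)               = 0.5
    [-quarter, 1, 0],      # PR(B) - 0.25 PR(A)              = 0.5
    [-quarter, -half, 1],  # PR(C) - 0.25 PR(A) - 0.5 PR(B)  = 0.5
]
constants = [half, half, half]

pr_a, pr_b, pr_c = solve_linear_system(coefficients, constants)
print(pr_a, pr_b, pr_c)      # 14/13 10/13 15/13
print(pr_a + pr_b + pr_c)    # 3
```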

The Iterative Computation of Page Rank

Because of the size of the actual web, the Google search engine uses an approximate, iterative
computation of Page Rank values. Each page is assigned an initial starting value, and the Page
Ranks of all pages are then calculated in several computation cycles (iterations) based on the
equations determined by the Page Rank algorithm. The iterative calculation is again illustrated
by our three-page example, in which each page is assigned a starting Page Rank value of 1.

Iteration   PR(A)        PR(B)        PR(C)
0           1            1            1
1           1            0.75         1.125
2           1.0625       0.765625     1.1484375
3           1.07421875   0.76855469   1.15283203
4           1.07641602   0.76910400   1.15365601
5           1.07682800   0.76920700   1.15381050
6           1.07690525   0.76922631   1.15383947
7           1.07691973   0.76922993   1.15384490
8           1.07692245   0.76923061   1.15384592
9           1.07692296   0.76923074   1.15384611
10          1.07692305   0.76923076   1.15384615
11          1.07692307   0.76923077   1.15384615
12          1.07692308   0.76923077   1.15384615

We see that we get a good approximation of the real Page Rank values after only a few iterations.
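
The iteration can be sketched as follows for the three-page example with d = 0.5. One assumption is made explicit here: within each iteration the pages are updated in the order A, B, C and each new value is used immediately, since this is the scheme that matches the iteration values given for the example. The function name iterate_pagerank and the dictionary layout are illustrative choices, not the actual Google implementation.

```python
def iterate_pagerank(links, d=0.5, iterations=12):
    """Iterate PR(A) = (1-d) + d * sum(PR(T)/C(T)) over all pages.

    `links` maps each page to the list of pages it links to.
    Values are updated in place, so later pages in an iteration
    already see the new values of earlier pages.
    """
    pages = list(links)
    # For every page, the pages that link to it.
    inbound = {p: [q for q in pages if p in links[q]] for p in pages}
    pr = {p: 1.0 for p in pages}          # starting value 1 for each page
    history = [dict(pr)]
    for _ in range(iterations):
        for p in pages:
            pr[p] = (1 - d) + d * sum(pr[q] / len(links[q]) for q in inbound[p])
        history.append(dict(pr))
    return history


# A links to B and C, B links to C, C links to A.
links = {"A": ["B", "C"], "B": ["C"], "C": ["A"]}
history = iterate_pagerank(links)

for i, row in enumerate(history):
    print(i, *(f"{row[p]:.8f}" for p in "ABC"))
```

Updating all pages from the previous iteration's values instead would converge to the same result, only with different intermediate numbers.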
B. Time Series Data Mining

- Time series data consists of sequences of values or events obtained by repeated measurement
over time, in most cases at equal time intervals.
- It is used in applications such as stock prediction, economic analysis, etc.
- In general, there are two goals in time series analysis:
i. Modeling Time Series: Gain insight into the underlying mechanism that generates the time series.
ii. Forecasting Time Series: Predict the future values of the time series variables.

Major components for trend analysis in time series data

i. Trend or Long-term Movements: Indicate the general direction in which a
time series is moving over a long interval of time, shown by a trend curve or
trend line.
ii. Cyclic Movements or Cyclic Variations: Long-term oscillations about a trend
curve or line, which may or may not be periodic.
iii. Seasonal Movements or Variations: These are systematic or calendar related.
E.g., the sudden rise in sales of sweets during Tihar.
iv. Irregular or Random Movements: Sporadic changes due to random or chance events.
E.g., a price rise during a supply crisis.

Approaches for time series data analysis:

- Regression analysis is commonly used to find trends in time series data.
- A seasonal index is used to adjust for the relative values of a variable across the
periods of the time series.
- Autocorrelation analysis is applied between the ith element of the series and the (i-k)th
element to detect seasonal patterns, where k is referred to as the lag.
- Calculating the moving average of order n is a common method for determining trend.
E.g.:
Original data: 3 7 2 0 4 5 9 7 2
Moving average of order 3: (3 + 7 + 2)/3 = 4, then 3, 2, 3, 6, 7, 6
Weighted (1, 4, 1) moving average: (1*3 + 4*7 + 1*2)/(1 + 4 + 1) = 5.5, then 2.5, 1, 3.5, 5.5, 8, 6.5
- The freehand method is used to draw an approximate curve or line to fit a set of data based on
the user's judgment.
- The least squares method is used to fit the best curve, i.e. the one that minimizes the sum
of squared deviations from the data points.
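
The moving average example can be reproduced with a short Python sketch. The function names moving_average and weighted_moving_average are chosen here for illustration; the data and the weights (1, 4, 1) are taken from the example.

```python
def moving_average(series, order):
    """Plain moving average: mean of each window of `order` consecutive values."""
    return [
        sum(series[i:i + order]) / order
        for i in range(len(series) - order + 1)
    ]


def weighted_moving_average(series, weights):
    """Weighted moving average: each window is weighted, then divided by the weight sum."""
    n = len(weights)
    total = sum(weights)
    return [
        sum(w * x for w, x in zip(weights, series[i:i + n])) / total
        for i in range(len(series) - n + 1)
    ]


data = [3, 7, 2, 0, 4, 5, 9, 7, 2]

print(moving_average(data, 3))
# [4.0, 3.0, 2.0, 3.0, 6.0, 7.0, 6.0]
print(weighted_moving_average(data, [1, 4, 1]))
# [5.5, 2.5, 1.0, 3.5, 5.5, 8.0, 6.5]
```

A series of 9 values gives only 7 averages of order 3, because each result needs a full window. The weighted version gives the centre value of each window more influence, so it smooths less than the plain average.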

C. Object/ Image/ Multimedia Mining:

- A multimedia database system stores and manages a large collection of multimedia data such
as audio, video, images, graphics, speech, text, etc.
- Image/multimedia mining deals with the extraction of implicit knowledge, data relationships or
other patterns not explicitly stored in images/multimedia.
- The fundamental challenge in image mining is to determine how the low-level pixel
representation contained in an image or image sequence can be effectively and
efficiently processed to identify high-level spatial objects and relationships.
- Typical image/multimedia processing involves preprocessing, transformation and feature
extraction, mining, evaluation, and interpretation of the knowledge.
- Different data mining techniques can be used, such as association rules and clustering.
