
Unit7 Advance Topics Unit 8 Search Engines


Chapter- 7 Advanced Application

A. Web Mining

Web mining is the application of data mining techniques to extract knowledge from Web data, i.e.
Web Content, Web Structure and Web Usage data.

Web Mining Taxonomy


Web Mining can be broadly divided into three distinct categories, according to the kinds of data
to be mined.
a. Web Content Mining:
- Web Content Mining is the process of extracting useful information from the contents of
Web documents.
- Content data corresponds to the collection of facts a Web page was designed to convey to
the users.
- It may consist of text, images, audio, video, or structured records such as lists and tables.
- Web content mining has been the most widely researched of the three. Issues addressed in
text mining include topic discovery, extraction of association patterns, clustering of Web
documents and classification of Web pages.

b. Web Structure Mining:


- The structure of a typical Web graph consists of Web pages as nodes, and hyperlinks as
edges connecting related pages.
- Web Structure Mining is the process of discovering structure information from the Web.
This can be further divided into two kinds based on the kind of structure information used.
• Hyperlinks: A hyperlink is a structural unit that connects a location in a Web
page to a different location, either within the same Web page or on a different Web
page. A hyperlink that connects to a different part of the same page is called an
Intra-Document Hyperlink, and a hyperlink that connects two different pages is
called an Inter-Document Hyperlink.
• Document Structure: In addition, the content within a Web page can also be
organized in a tree-structured format, based on the various HTML and XML tags
within the page. Mining efforts here have focused on automatically extracting
document object model structures out of documents.
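
The distinction between intra-document and inter-document hyperlinks can be sketched in a few lines of Python. This is only an illustration using the standard library; the class name LinkClassifier, the sample HTML and the example.com URL are assumptions made for the example, not part of the original notes.

```python
from html.parser import HTMLParser
from urllib.parse import urldefrag, urljoin


class LinkClassifier(HTMLParser):
    """Collects <a href> targets and splits them into intra- and inter-document links."""

    def __init__(self, page_url):
        super().__init__()
        # URL of the page being parsed, without any #fragment.
        self.page_url = urldefrag(page_url).url
        self.intra = []  # links to another location in the same page
        self.inter = []  # links to a different page

    def handle_starttag(self, tag, attrs):
        if tag != "a":
            return
        href = dict(attrs).get("href")
        if not href:
            return
        # Resolve relative links, then drop the fragment so documents can be compared.
        target = urldefrag(urljoin(self.page_url, href)).url
        if target == self.page_url:
            self.intra.append(href)
        else:
            self.inter.append(href)


# Hypothetical page content and URL, used only for illustration.
html = '<a href="#top">Top</a> <a href="/about.html">About</a> <a href="index.html#news">News</a>'
parser = LinkClassifier("http://example.com/index.html")
parser.feed(html)
print(parser.intra)  # ['#top', 'index.html#news']
print(parser.inter)  # ['/about.html']
```

A link counts as intra-document here when it resolves to the same URL once the fragment is removed; a real crawler would probably also need to normalise URLs (trailing slashes, default ports) before comparing.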

c. Web Usage Mining:


- Web Usage Mining is the application of data mining techniques to discover interesting
usage patterns from Web data, in order to understand and better serve the needs of Web-
based applications.

- Usage data captures the identity or origin of Web users along with their browsing behavior
at a Web site.
- Web usage mining itself can be classified further depending on the kind of usage data
considered:
• Web Server Data: User logs are collected by the Web server. Typical data
includes IP address, page reference and access time.
• Application Server Data: Commercial application servers such as WebLogic and
StoryServer have significant features that enable E-commerce applications to be
built on top of them with little effort. A key feature is the ability to track
various kinds of business events and log them in application server logs.
• Application Level Data: New kinds of events can be defined in an application,
and logging can be turned on for them, generating histories of these specially
defined events.
It must be noted, however, that many end applications require a combination of one
or more of the techniques applied in the above categories.
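
As a rough illustration of the Web server data mentioned above, the sketch below pulls the IP address, page reference and access time out of a single log line. It assumes the server writes the widely used Common Log Format; the notes do not say which format is meant, and the function name parse_log_line and the sample line are made up for the example.

```python
import re
from datetime import datetime

# Assumed layout (Common Log Format):
# host ident user [time] "METHOD page PROTOCOL" status bytes
LOG_PATTERN = re.compile(
    r'(?P<ip>\S+) \S+ \S+ \[(?P<time>[^\]]+)\] '
    r'"(?P<method>\S+) (?P<page>\S+) [^"]*" (?P<status>\d{3}) \S+'
)


def parse_log_line(line):
    """Return the IP address, page reference and access time, or None if the line does not match."""
    match = LOG_PATTERN.match(line)
    if match is None:
        return None
    return {
        "ip": match.group("ip"),
        "page": match.group("page"),
        "time": datetime.strptime(match.group("time"), "%d/%b/%Y:%H:%M:%S %z"),
    }


# Hypothetical log line, used only for illustration.
sample = '192.0.2.1 - - [10/Oct/2000:13:55:36 -0700] "GET /index.html HTTP/1.0" 200 2326'
record = parse_log_line(sample)
print(record["ip"], record["page"], record["time"].isoformat())
```

Records of this kind are the raw material for usage mining; grouping them by IP address and ordering them by time is a common first step toward reconstructing browsing sessions.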

Challenges:
i. The Web is too huge for effective data warehousing and data mining.
ii. Web data is too complex and heterogeneous.
iii. The Web is growing and changing rapidly.
iv. It serves a broad diversity of user communities.
v. Only a small portion of the information on the Web is truly relevant or useful.

The Page Rank Algorithm

The original Page Rank algorithm was described by Lawrence Page and Sergey Brin in
several publications. It is given by

PR(A) = (1-d) + d (PR(T1)/C(T1) + ... + PR(Tn)/C(Tn))


where
PR(A) is the Page Rank of page A,
PR(Ti) is the Page Rank of pages Ti which link to page A,
C(Ti) is the number of outbound links on page Ti and
d is a damping factor which can be set between 0 and 1.

- Page Rank does not rank web sites as a whole, but is determined for each page
individually. Further, the Page Rank of page A is recursively defined by the Page Ranks of
those pages which link to page A.
- The Page Ranks of the pages Ti which link to page A do not influence the Page Rank of
page A uniformly. Within the Page Rank algorithm, the Page Rank of a page T is always
weighted by the number of outbound links C(T) on page T. This means that the more
outbound links a page T has, the less page A benefits from a link to it on page T.
- The weighted Page Ranks of the pages Ti are then added up. The outcome of this is that an
additional inbound link for page A will always increase page A's Page Rank.
- Finally, the sum of the weighted Page Ranks of all pages Ti is multiplied by a damping
factor d which can be set between 0 and 1. Thereby, the extent of the Page Rank benefit
that a page receives from another page linking to it is reduced.

A Different Notation of the PageRank Algorithm

Lawrence Page and Sergey Brin have published two different versions of their Page Rank
algorithm in different papers. In the second version of the algorithm, the Page Rank of page A
is given as

PR(A) = (1-d) / N + d (PR(T1)/C(T1) + ... + PR(Tn)/C(Tn))

where N is the total number of all pages on the web. The second version of the algorithm
does not differ fundamentally from the first one: dividing every Page Rank value of the
first version by N yields a solution of the second.
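
Both versions of the formula can be evaluated for a single page with a small helper. The sketch below is illustrative only: the function name page_rank, its parameters and the input values are assumptions chosen for the example, and the Page Ranks of the linking pages are simply taken as given.

```python
def page_rank(inbound, d=0.85, n_pages=None):
    """Evaluate the right-hand side of the Page Rank formula for one page.

    inbound: list of (PR(Ti), C(Ti)) pairs, one per page Ti linking to the page.
    n_pages: if given, use the second version, (1 - d) / N; otherwise the first, (1 - d).
    """
    base = (1 - d) if n_pages is None else (1 - d) / n_pages
    return base + d * sum(pr / out_links for pr, out_links in inbound)


# Hypothetical inputs: T1 has Page Rank 1.0 and 2 outbound links,
# T2 has Page Rank 1.0 and 1 outbound link.
inbound = [(1.0, 2), (1.0, 1)]
print(page_rank(inbound, d=0.5))             # first version: 0.5 + 0.5 * (0.5 + 1.0) = 1.25
print(page_rank(inbound, d=0.5, n_pages=3))  # second version, with N = 3
```

Because each linking page's Page Rank is divided by its number of outbound links, the link from T1 contributes only half as much as the link from T2, although both pages have the same Page Rank.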

The Characteristics of Page Rank

The characteristics of Page Rank shall be illustrated by a small example.

Consider a small web consisting of three pages A, B and C, in which page A links to the
pages B and C, page B links to page C and page C links to page A. According to Page and
Brin, the damping factor d is usually set to 0.85, but to keep the calculation simple we set it to
0.5. The exact value of the damping factor d does affect the Page Rank values, but it does
not influence the fundamental principles of Page Rank. So, we get the following equations for
the Page Rank calculation:
PR(A) = 0.5 + 0.5 PR(C)

PR(B) = 0.5 + 0.5 (PR(A) / 2)
PR(C) = 0.5 + 0.5 (PR(A) / 2 + PR(B))

These equations can easily be solved. We get the following Page Rank values for the
individual pages:

PR(A) = 14/13 = 1.07692308


PR(B) = 10/13 = 0.76923077
PR(C) = 15/13 = 1.15384615
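
The values above can be checked by solving the three equations exactly. The sketch below rearranges them into a linear system and solves it with exact fractions; the helper solve is a generic Gauss-Jordan elimination written for this example, not something taken from the Page Rank papers.

```python
from fractions import Fraction as F


def solve(matrix, rhs):
    """Solve a small linear system exactly by Gauss-Jordan elimination."""
    n = len(rhs)
    rows = [list(row) + [b] for row, b in zip(matrix, rhs)]  # augmented matrix
    for col in range(n):
        # Find a row with a non-zero pivot and move it into place.
        pivot = next(r for r in range(col, n) if rows[r][col] != 0)
        rows[col], rows[pivot] = rows[pivot], rows[col]
        pivot_value = rows[col][col]
        rows[col] = [x / pivot_value for x in rows[col]]
        # Eliminate this column from every other row.
        for r in range(n):
            if r != col:
                factor = rows[r][col]
                rows[r] = [x - factor * y for x, y in zip(rows[r], rows[col])]
    return [row[-1] for row in rows]


half, quarter = F(1, 2), F(1, 4)
coefficients = [
    [F(1), F(0), -half],      # PR(A) - 0.5 PR(C) = 0.5
    [-quarter, F(1), F(0)],   # PR(B) - 0.25 PR(A) = 0.5
    [-quarter, -half, F(1)],  # PR(C) - 0.25 PR(A) - 0.5 PR(B) = 0.5
]
pr_a, pr_b, pr_c = solve(coefficients, [half, half, half])
print(pr_a, pr_b, pr_c)    # 14/13 10/13 15/13
print(pr_a + pr_b + pr_c)  # 3
```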

The sum of all pages' Page Ranks is 3 and thus equals the total number of web pages. This is
not specific to our simple example: summing the first version of the formula over all pages
shows that, as long as every page has at least one outbound link, the Page Ranks always add
up to the number of pages. For our simple three-page example it is easy to solve the
corresponding system of equations to determine the Page Rank values. In practice, the web
consists of billions of documents and it is not possible to find a solution by inspection.

The Iterative Computation of Page Rank

Because of the size of the actual web, the Google search engine uses an approximate, iterative
computation of Page Rank values. Each page is assigned an initial starting value and the Page
Ranks of all pages are then calculated in several computation cycles based on the equations
determined by the Page Rank algorithm. The iterative calculation shall again be illustrated by our
three-page example, in which each page is assigned a starting Page Rank value of 1.
Iteration   PR(A)        PR(B)        PR(C)
0           1            1            1
1           1            0.75         1.125
2           1.0625       0.765625     1.1484375
3           1.07421875   0.76855469   1.15283203
4           1.07641602   0.76910400   1.15365601
5           1.07682800   0.76920700   1.15381050
6           1.07690525   0.76922631   1.15383947
7           1.07691973   0.76922993   1.15384490
8           1.07692245   0.76923061   1.15384592
9           1.07692296   0.76923074   1.15384611
10          1.07692305   0.76923076   1.15384615
11          1.07692307   0.76923077   1.15384615
12          1.07692308   0.76923077   1.15384615

We see that we get a good approximation of the real Page Rank values after only a few iterations.
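
The iteration can be reproduced with a short script. The sketch below is one possible reading of the procedure: the dictionary links is a hypothetical encoding of the three-page example, and each page's new value is used immediately for the remaining pages of the same iteration, since that is the update order that matches the values in the table.

```python
# Hypothetical encoding of the three-page example: page -> pages it links to.
links = {"A": ["B", "C"], "B": ["C"], "C": ["A"]}


def iterate_page_rank(links, d=0.5, iterations=12, start=1.0):
    """Return the Page Rank values after each iteration, starting values included."""
    ranks = {page: start for page in links}
    history = [dict(ranks)]
    for _ in range(iterations):
        for page in links:  # pages are updated in place, in dictionary order
            inbound = sum(
                ranks[src] / len(targets)
                for src, targets in links.items()
                if page in targets
            )
            ranks[page] = (1 - d) + d * inbound
        history.append(dict(ranks))
    return history


history = iterate_page_rank(links)
for i, row in enumerate(history):
    print(i, *(f"{row[p]:.8f}" for p in "ABC"))
```

Updating all pages from the previous iteration's values instead would converge to the same result, but the intermediate rows would differ from the table.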
B. Time Series Data Mining

- Time series data consists of sequences of values or events obtained through repeated
measurements over time, usually at equal time intervals.
- It is used in applications such as stock prediction, economic analysis etc.
- In general, there are two goals in time series analysis.
i. Modeling Time Series: Gaining insight into the underlying mechanism that generates the time series.
ii. Forecasting Time Series: Predicting the future values of the time series variables.

Major components for trend analysis in time series data


i. Trend or Long-term Movements: Indicate the general direction in which a
time series is moving over a long interval of time, shown by a trend curve or
trend line.
ii. Cyclic Movements or Cyclic Variations: Long-term oscillations about a trend
curve or line, which may or may not be periodic.
iii. Seasonal Movements or Variations: Systematic, calendar-related patterns.
E.g. the sudden rise in sales of sweets during Tihar.
iv. Irregular or Random Movements: Sporadic changes in the series due to random or
chance events. E.g. a price rise during a supply crisis.

Approaches for time series data analysis:

- Regression analysis is commonly used to find trends in time series data.
- A seasonal index, a set of numbers showing the relative values of a variable across the
periods of a year, is used to adjust the data for seasonal variation.
- Autocorrelation analysis is applied between the ith element of the series and the (i-k)th
element to detect seasonal patterns, where k is referred to as the lag.
- Calculating the moving average of order n is a common method for determining the trend.
E.g.:
Original data: 3 7 2 0 4 5 9 7 2
Moving average of order 3: 4 3 2 3 6 7 6, where the first value is (3 + 7 + 2)/3 = 4
Weighted (1, 4, 1) moving average: 5.5 2.5 1 3.5 5.5 8 6.5, where the first value is
(1*3 + 4*7 + 1*2)/(1 + 4 + 1) = 5.5
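- The freehand method is used to draw an approximate curve or line to fit a set of data based on
the user's judgment.
- The least squares method is used to fit the best curve, i.e. the one that minimizes the sum of
squared deviations between the curve and the data points.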
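
The moving average example from the list above can be reproduced with the following sketch. The function names moving_average and weighted_moving_average are chosen for this illustration; only the data and the weights (1, 4, 1) come from the notes.

```python
def moving_average(data, order):
    """Simple moving average: mean of each window of `order` consecutive values."""
    return [sum(data[i:i + order]) / order for i in range(len(data) - order + 1)]


def weighted_moving_average(data, weights):
    """Weighted moving average: each window is weighted position by position."""
    n = len(weights)
    total = sum(weights)
    return [
        sum(w * x for w, x in zip(weights, data[i:i + n])) / total
        for i in range(len(data) - n + 1)
    ]


data = [3, 7, 2, 0, 4, 5, 9, 7, 2]
print(moving_average(data, 3))                   # [4.0, 3.0, 2.0, 3.0, 6.0, 7.0, 6.0]
print(weighted_moving_average(data, [1, 4, 1]))  # [5.5, 2.5, 1.0, 3.5, 5.5, 8.0, 6.5]
```

A series of 9 values yields 7 averages for a window of length 3, because no average is produced for positions where the window would run past either end of the data.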

C. Object/ Image/ Multimedia Mining:

- A multimedia database system stores and manages a large collection of multimedia data such
as audio, video, images, graphics, speech, text etc.
- Image/multimedia mining deals with the extraction of implicit knowledge, data relationships or
other patterns not explicitly stored in images/multimedia.
- The fundamental challenge in image mining is to determine how the low-level pixel
representation contained in an image or image sequence can be effectively and
efficiently processed to identify high-level spatial objects and relationships.
- Typical image/multimedia processing involves preprocessing, transformations and feature
extraction, mining, and evaluation and interpretation of the knowledge.
- Different data mining techniques can be used, such as association rules and clustering.
