Unit7 Advance Topics Unit 8 Search Engines
A. Web Mining
Web mining is the application of data mining techniques to extract knowledge from Web data, i.e.
Web Content, Web Structure and Web Usage data.
- Usage data captures the identity or origin of Web users along with their browsing behavior
at a Web site.
- Web usage mining itself can be classified further depending on the kind of usage data
considered:
Web Server Data: User logs are collected by the Web server. Typical data
includes IP address, page reference and access time.
Application Server Data: Commercial application servers such as WebLogic and
StoryServer have significant features that enable E-commerce applications to be
built on top of them with little effort. A key feature is the ability to track
various kinds of business events and log them in application server logs.
Application Level Data: New kinds of events can be defined in an application,
and logging can be turned on for them, generating histories of these specially
defined events. Note, however, that many end applications require a
combination of one or more of the techniques applied in the above
categories.
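As an illustration of the Web server data described above, the following is a minimal sketch that extracts the IP address, page reference and access time from one log line. It assumes the server writes the Common Log Format, which the notes do not specify; the sample line and the names LOG_PATTERN and parse_log_line are hypothetical.

```python
import re
from datetime import datetime

# Assumed layout (Common Log Format):
#   host ident authuser [date] "method page protocol" status bytes
LOG_PATTERN = re.compile(
    r'(?P<ip>\S+) \S+ \S+ \[(?P<time>[^\]]+)\] '
    r'"(?P<method>\S+) (?P<page>\S+)[^"]*" (?P<status>\d{3}) (?P<size>\S+)'
)

def parse_log_line(line):
    """Return IP address, page reference and access time, or None if the line does not match."""
    match = LOG_PATTERN.match(line)
    if match is None:
        return None
    return {
        "ip": match.group("ip"),
        "page": match.group("page"),
        "time": datetime.strptime(match.group("time"), "%d/%b/%Y:%H:%M:%S %z"),
    }

# Hypothetical log line, used only for illustration.
sample = '192.0.2.10 - - [10/Oct/2000:13:55:36 -0700] "GET /index.html HTTP/1.0" 200 2326'
record = parse_log_line(sample)
print(record["ip"], record["page"], record["time"].isoformat())
```

Records of this kind, grouped by IP address and ordered by access time, are a typical starting point for analysing browsing behavior.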
Challenges:
i. Too large for effective data warehousing and data mining.
ii. Too complex and heterogeneous.
iii. Growing and changing rapidly.
iv. Broad diversity of user communities.
v. Only a small portion of the information on the web is truly relevant or useful.
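The original Page Rank algorithm was described by Lawrence Page and Sergey Brin in
several publications. It is given by
PR(A) = (1 - d) + d (PR(T1)/C(T1) + ... + PR(Tn)/C(Tn))
where PR(A) is the Page Rank of page A, PR(Ti) is the Page Rank of the pages Ti which link
to page A, C(Ti) is the number of outbound links on page Ti, and d is a damping factor
which can be set between 0 and 1.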
- Page Rank does not rank web sites as a whole, but is determined for each page
individually. Further, the Page Rank of page A is recursively defined by the Page Ranks of
those pages which link to page A.
- The Page Rank of pages Ti which link to page A does not influence the PageRank of page
A uniformly. Within the Page Rank algorithm, the Page Rank of a page T is always
weighted by the number of outbound links C(T) on page T. This means that the more
outbound links a page T has, the less will page A benefit from a link to it on page T.
- The weighted Page Rank of pages Ti is then added up. The outcome of this is that an
additional inbound link for page A will always increase page A's Page Rank.
- Finally, the sum of the weighted Page Ranks of all pages Ti is multiplied by a damping
factor d which can be set between 0 and 1. Thereby, the extent of PageRank benefit for a
page by another page linking to it is reduced.
Lawrence Page and Sergey Brin have published two different versions of their Page Rank
algorithm in different papers. In the second version of the algorithm, the Page Rank of page A
is given as
PR(A) = (1 - d) / N + d (PR(T1)/C(T1) + ... + PR(Tn)/C(Tn))
where N is the total number of all pages on the web. The second version of the algorithm
does not differ fundamentally from the first one.
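One way to see why the two versions do not differ fundamentally is the short derivation below. It is a sketch only: the subscripts PR_1 and PR_2 (first and second version) are notation introduced here for illustration, and it assumes each system of equations has a unique solution.

```latex
% Second version, for a page A with inbound pages T_1, ..., T_n:
PR_2(A) = \frac{1 - d}{N} + d \sum_{i=1}^{n} \frac{PR_2(T_i)}{C(T_i)}

% Multiply both sides by N:
N \, PR_2(A) = (1 - d) + d \sum_{i=1}^{n} \frac{N \, PR_2(T_i)}{C(T_i)}

% Hence PR_1(A) := N \, PR_2(A) satisfies the first version:
PR_1(A) = (1 - d) + d \sum_{i=1}^{n} \frac{PR_1(T_i)}{C(T_i)}
```

Under these assumptions the two versions differ only by the constant factor N, so they rank pages in the same order.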
Consider a small web consisting of three pages A, B and C, whereby page A links to the
pages B and C, page B links to page C and page C links to page A. According to Page and
Brin, the damping factor d is usually set to 0.85, but to keep the calculation simple we set it to
0.5. The exact value of the damping factor d does affect the Page Rank values, but it does
not influence the fundamental principles of Page Rank. So, we get the following equations for
the Page Rank calculation:
PR(A) = 0.5 + 0.5 PR(C)
PR(B) = 0.5 + 0.5 (PR(A) / 2)
PR(C) = 0.5 + 0.5 (PR(A) / 2 + PR(B))
These equations can easily be solved. We get the following Page Rank values for the single
pages:
PR(A) = 14/13 = 1.07692308
PR(B) = 10/13 = 0.76923077
PR(C) = 15/13 = 1.15384615
The sum of all pages' Page Ranks is 3 and thus equals the total number of web
pages. This result is not specific to our simple example. For our simple three-page
example it is easy to solve the corresponding system of equations to determine Page Rank
values. In practice, the web consists of billions of documents and it is not possible to find a
solution by inspection.
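The three equations of the example form a small linear system, so they can also be solved mechanically. The following is a minimal sketch using exact fractions and Gauss-Jordan elimination; the helper name solve_linear_system and the rearranged coefficient matrix are introduced here for illustration and are not part of the original notes.

```python
from fractions import Fraction

def solve_linear_system(matrix, rhs):
    """Solve matrix * x = rhs by Gauss-Jordan elimination with exact fractions."""
    n = len(rhs)
    # Augmented matrix [matrix | rhs] with rational entries.
    aug = [[Fraction(v) for v in row] + [Fraction(b)] for row, b in zip(matrix, rhs)]
    for col in range(n):
        # Find a row with a non-zero pivot in this column and move it into place.
        pivot = next(r for r in range(col, n) if aug[r][col] != 0)
        aug[col], aug[pivot] = aug[pivot], aug[col]
        # Scale the pivot row so that the pivot becomes 1.
        p = aug[col][col]
        aug[col] = [v / p for v in aug[col]]
        # Eliminate this column from every other row.
        for r in range(n):
            if r != col and aug[r][col] != 0:
                factor = aug[r][col]
                aug[r] = [a - factor * b for a, b in zip(aug[r], aug[col])]
    return [row[-1] for row in aug]

d = Fraction(1, 2)  # damping factor used in the example
# Example equations rearranged so that all unknowns are on the left:
#    PR(A)                         - d * PR(C) = 1 - d
#   -d/2 * PR(A) + PR(B)                       = 1 - d
#   -d/2 * PR(A) - d * PR(B)       + PR(C)     = 1 - d
coefficients = [
    [1, 0, -d],
    [-d / 2, 1, 0],
    [-d / 2, -d, 1],
]
pr_a, pr_b, pr_c = solve_linear_system(coefficients, [1 - d] * 3)
print(pr_a, pr_b, pr_c)           # 14/13 10/13 15/13
print(float(pr_a + pr_b + pr_c))  # 3.0
```

Direct elimination like this is only practical for very small link graphs.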
Because of the size of the actual web, the Google search engine uses an approximate, iterative
computation of Page Rank values. Each page is assigned an initial starting value and the Page
Ranks of all pages are then calculated in several computation cycles based on the equations
determined by the Page Rank algorithm. The iterative calculation is illustrated again with our
three-page example, whereby each page is assigned a starting Page Rank value of 1.
Iteration   PR(A)        PR(B)        PR(C)
0           1            1            1
1           1            0.75         1.125
2           1.0625       0.765625     1.1484375
3           1.07421875   0.76855469   1.15283203
4           1.07641602   0.76910400   1.15365601
5           1.07682800   0.76920700   1.15381050
6           1.07690525   0.76922631   1.15383947
7           1.07691973   0.76922993   1.15384490
8           1.07692245   0.76923061   1.15384592
9           1.07692296   0.76923074   1.15384611
10          1.07692305   0.76923076   1.15384615
11          1.07692307   0.76923077   1.15384615
12          1.07692308   0.76923077   1.15384615
We see that we get a good approximation of the real Page Rank values after only a few iterations.
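The iteration for the three-page example can be sketched in a few lines. This is a minimal illustration, not Google's actual implementation: the function name iterate_page_rank and the dictionary layout are assumptions, every page must appear as a key, and each update reuses values already refreshed in the same pass, which is how the first rows of the table above are obtained.

```python
def iterate_page_rank(links, d=0.5, iterations=12, start=1.0):
    """Iterate PR(p) = (1 - d) + d * sum(PR(t) / C(t)) over all pages t linking to p."""
    pages = list(links)
    # C(t): number of outbound links on each page.
    out_degree = {page: len(targets) for page, targets in links.items()}
    # For each page, the pages that link to it.
    inbound = {page: [src for src in pages if page in links[src]] for page in pages}
    rank = {page: start for page in pages}
    history = [dict(rank)]
    for _ in range(iterations):
        for page in pages:
            # In-place update: ranks refreshed earlier in this pass are reused.
            rank[page] = (1 - d) + d * sum(
                rank[src] / out_degree[src] for src in inbound[page]
            )
        history.append(dict(rank))
    return history

# Link structure of the example: A -> B, C; B -> C; C -> A.
links = {"A": ["B", "C"], "B": ["C"], "C": ["A"]}
history = iterate_page_rank(links)
for step, rank in enumerate(history):
    print(step, *(f"{rank[p]:.8f}" for p in "ABC"))
```

With d = 0.5 the values settle after about a dozen passes; with the usual d = 0.85 more passes would be expected.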
B. Time Series Data Mining
- A time series consists of sequences of values or events obtained from repeated measurements
over time, in most cases at equal time intervals.
- Used in applications such as stock prediction, economic analysis etc.
- In general, there are two goals in time series analysis.
i. Modeling Time Series: Gain insight into the underlying mechanism that generates the time series.
ii. Forecasting Time Series: Predict the future values of the time series variables.
- Regression analysis is commonly used to find trends in time series data.
- A seasonal index is used in the analysis to adjust the relative values of a variable during the
time series.
- Autocorrelation analysis is applied between the ith element of the series and the (i-k)th
element to detect seasonal patterns, where k is referred to as the lag.
- Calculating the moving average of order n is a common method for determining trend.
Eg:
Original Data: 3 7 2 0 4 5 9 7 2
Moving average of order 3: 4 3 2 3 6 7 6, where the first value is (3 + 7 + 2)/3 = 4
Weighted (1, 4, 1) average: 5.5 2.5 1 3.5 5.5 8 6.5, where the first value is
(1*3 + 4*7 + 1*2)/(1 + 4 + 1) = 5.5
- The freehand method is used to draw an approximate curve or line to fit a set of data based on
the user's judgment.
- The least squares method is used to fit the best curve.
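The moving-average example given earlier in this list can be reproduced with a short sketch. The function names moving_average and weighted_moving_average are chosen here for illustration; the data and the weights (1, 4, 1) are those of the example.

```python
def moving_average(values, order):
    """Simple moving average: mean of each window of `order` consecutive values."""
    return [
        sum(values[i:i + order]) / order
        for i in range(len(values) - order + 1)
    ]

def weighted_moving_average(values, weights):
    """Weighted moving average: each window is weighted, then divided by the weight sum."""
    n = len(weights)
    total = sum(weights)
    return [
        sum(w * v for w, v in zip(weights, values[i:i + n])) / total
        for i in range(len(values) - n + 1)
    ]

data = [3, 7, 2, 0, 4, 5, 9, 7, 2]
print(moving_average(data, 3))                   # [4.0, 3.0, 2.0, 3.0, 6.0, 7.0, 6.0]
print(weighted_moving_average(data, (1, 4, 1)))  # [5.5, 2.5, 1.0, 3.5, 5.5, 8.0, 6.5]
```

A series of length 9 yields 7 averaged values for a window of 3, because the first and last points have no complete window centred on them.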
- A multimedia database system stores and manages a large collection of multimedia data such
as audio, video, images, graphics, speech, text etc.
- Image/multimedia mining deals with the extraction of implicit knowledge, data relationships or
other patterns not explicitly stored in images/multimedia.
- The fundamental challenge in image mining is to determine how the low-level pixel
representation contained in an image or image sequence can be effectively and
efficiently processed to identify high-level spatial objects and relationships.
- Typical image/multimedia processing involves preprocessing, transformations and feature
extraction, mining, evaluation and interpretation of the knowledge.
- Different data mining techniques can be used, such as association rules and clustering.