SEMINAR
ON
WEB DATA MANAGEMENT
BY
BASSEY ELIZABETH U.
HD 2019/107051/1/CS
DEPARTMENT OF COMPUTER SCIENCE
SCHOOL OF SCIENCE AND INDUSTRIAL TECHNOLOGY
(S.S.I.T.)
ABIA STATE POLYTECHNIC, ABA.
IN PARTIAL FULFILLMENT OF THE REQUIREMENTS FOR THE AWARD OF
HIGHER NATIONAL DIPLOMA (HND) IN COMPUTER SCIENCE
SUPERVISED BY: ENGR MRS C.M. NDUKWE
MARCH, 2022
ABSTRACT
The enormous growth in the number of documents circulated over the web necessitates improved web data management systems. In order to evaluate the performance of these systems, various simulation approaches must be used. More specifically, we survey the most recent simulation approaches for web data representation and storage, as well as for web trace evaluation.
1.0 INTRODUCTION
The World Wide Web is growing so fast that the need for an effective web data management system has become mandatory. This rapid growth is expected to persist as the number of web users continues to increase and as new applications such as electronic commerce become widely used. Currently the web circulates more than seven billion documents, and this enormous size has transformed communication and business models, so that speed and accuracy have become essential. The emergence of the web has changed our daily practice by providing information exchange and business transactions. Therefore, supportive approaches to data, information and knowledge exchange become the key. Many research efforts have used various simulation approaches.
1.1 WEB DATA PRESENTATION
Due to the explosive growth of the web, it is essential to represent it appropriately. One solution is to simulate the web as a directed graph. Graphs used for web representation provide an adequate structure, considering both the pages and their links as elements of a graph.
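As a simple illustration of this idea, the sketch below (a minimal example; the page names and link structure are invented for illustration) represents a small portion of the web as a directed graph using an adjacency list, and reports the in-links and out-links of each page.

from collections import defaultdict

# Hypothetical pages and links, used only to illustrate the directed-graph model.
web_graph = {
    "/index.html":  ["/news.html", "/about.html"],
    "/news.html":   ["/index.html", "/sports.html"],
    "/about.html":  ["/index.html"],
    "/sports.html": [],
}

# Out-degree: number of hyperlinks leaving a page.
out_degree = {page: len(links) for page, links in web_graph.items()}

# In-degree: number of hyperlinks pointing to a page.
in_degree = defaultdict(int)
for links in web_graph.values():
    for target in links:
        in_degree[target] += 1

for page in web_graph:
    print(page, "out:", out_degree[page], "in:", in_degree[page])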
I. WEB DATA ACCESSING
These simulation efforts include a collection of empirical or analytical techniques used to reveal new trends and patterns in web data accessing records.
II. WEB DATA STORAGE
Since web storage has a major effect on the performance of web applications, new implementations such as web data caching and web databases have emerged. These implementations can be considered among the most beneficial approaches to accommodating the growing number of documents and services, while also providing a remarkable improvement in quality of service (QoS), a term introduced to describe technical characteristics such as performance, stability, reliability and speed.
2.0 LITERATURE REVIEW
According to Abiteboul & Manolescu (2012), the semantic web is an extension of the World Wide Web. It represents data in such a way that it can be accessed in different forms by a variety of applications. The semantic web makes data globally available, and it can be thought of as a global web of data.
2.1 WEB DOCUMENT STRUCTURE
The amount of publicly available information on the web is rapidly increasing (together with the number of users that request this information). Various types of data (such as text, images, video, sound and animation) participate in web documents. This information can be designed using a markup language (such as HTML or XML), retrieved through protocols (such as HTTP or HTTPS) and presented using a browser (such as Internet Explorer or Netscape Communicator).
Static: The content of a static document is created manually and does not depend on users' requests. It is not recommended in applications which require frequent content changes. Hand-coded HTML web pages produced by simple plain-text editors (as well as HTML documents created by more sophisticated authoring tools) are examples of static web documents, and they define the first web generation.
Dynamic: Dynamic content includes web pages built as a result of a specific user's request (i.e. they could differ between user accesses). However, once a dynamically created page is sent to the client, it does not change. This approach enables authors to develop web applications that access databases, using programming languages and interfaces (CGI, PHP, ASP, etc.) in order to present the requested document. In this way we can serve documents with the same structure but up-to-date content. However, dynamic content increases the server load as well as the response time.
Active: Active documents can change their content and display in response to user requests (without referring back to the server). More specifically, active pages include code that is executed at the client side, usually implemented using a language such as JavaScript. This active content does not require server resources, but it runs quite slowly since the browser has to interpret every line of its code. Both the dynamic and the active generations, where the content is machine generated, share a common feature: they both design and present information in a human-oriented manner. This refers to the fact that web pages are handled directly by humans, who either read the static content or produce the active content.
2.2 THE WEB AS A GRAPH
Rigaux & Rousset (2010) state that the web structure includes pages which have both web content and hypertext links (that connect one page to another). An effective method of studying the web is to consider it as a directed graph, which simulates both its content and its content interconnection. In particular, in the web graph each node corresponds to a web page and each edge corresponds to a link between pages. The actual web graph is huge and appears to grow exponentially over time. More specifically, in June 2010 it was estimated to consist of about 2.1 billion nodes and 1.5 billion edges. Furthermore, approximately 7.3 million pages are added every day and many others are modified or removed, so the web graph might currently contain more than seven billion nodes and about fifty billion edges in all. In studying the web graph, two important elements should be considered: its giant size and its rapid evolution, which make it impossible to work with the entire graph directly.
LOCAL APPROACHES
In this case, we can detect structures with an unusually high density of links among a small set of pages, which is an indication that they may be topically related. Local structures are of great importance for "cyber-community" detection, and thus for improving search techniques. A characteristic pattern in such communities contains a collection of authorities on a topic of common interest.
GLOBAL APPROACHES
At the global level, a recent study defines a bow-tie structure of the web. In particular, the study reports an experiment on a 200 million node graph with 1.5 billion links retrieved from a crawl of the web.
TRACE-BASED APPROACH
The most popular way to characterize the workload of web data is to analyze past web server log files. A detailed workload characterization study which uses past logs has been presented for World Wide Web servers. Most of these tools can be downloaded free from the web, and it is common to analyze web server logs for reporting traffic patterns. In addition, many tools have been developed for characterizing web data workloads.
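As a rough sketch of this kind of trace-based analysis, the fragment below counts requests per page and per hour from a server access log. The log format is an assumption, loosely following the common NCSA/Apache style; a real tool would handle many more fields and edge cases.

import re
from collections import Counter

# Assumed NCSA-style log line, e.g.:
# 127.0.0.1 - - [10/Oct/2021:13:55:36 +0000] "GET /index.html HTTP/1.0" 200 2326
LOG_PATTERN = re.compile(r'\[(\d+/\w+/\d+):(\d+):\d+:\d+ [^\]]+\] "(?:GET|POST) (\S+)')

def traffic_report(log_path):
    """Report the most requested pages and the hourly distribution of accesses."""
    pages, hours = Counter(), Counter()
    with open(log_path) as log:
        for line in log:
            match = LOG_PATTERN.search(line)
            if match:
                _, hour, page = match.groups()
                pages[page] += 1
                hours[hour] += 1
    print("Top pages:", pages.most_common(5))
    print("Requests per hour:", dict(hours))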
ANALYTICAL APPROACH
Another idea for web data workload characterization is to use traces that do not currently exist.
This kind of workload is called a synthetic workload, and it is defined using mathematical models, which are usually based on statistical methods for the workload characteristics. The main advantage of the analytical approach is that it offers great flexibility. Several workload generation tools have been developed to study web proxies.
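A minimal sketch of a synthetic workload generator is given below. It assumes a Zipf-like popularity distribution over a fixed set of pages, one statistical model commonly used for this purpose; the page names and parameter values are illustrative only.

import random

def synthetic_requests(num_pages=100, num_requests=10_000, alpha=0.8):
    """Generate a synthetic request stream with Zipf-like page popularity."""
    # Weight of page i is proportional to 1 / (i+1)^alpha, so a few pages
    # receive most of the requests while the rest form a long tail.
    weights = [1.0 / (rank + 1) ** alpha for rank in range(num_pages)]
    pages = [f"/page{rank}.html" for rank in range(num_pages)]
    return random.choices(pages, weights=weights, k=num_requests)

trace = synthetic_requests()
print(trace[:10])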
Finally, another approach for synthesizing web workloads is to process the current requests of a live network; this produces experiments that are not reproducible. The advantage of using current requests is the highly realistic load, although the hardware and software may have difficulties handling this load.
2.3 CAPTURING WEB USER PATTERNS
The incredible growth in the size and use of the web has created difficulties in both the design of web sites (to meet a great variety of user requirements) and their browsing (through vast web structures of pages and links). Most web sites are set up with little knowledge of their visitors' navigational patterns, yet knowledge of these patterns can prove valuable both to web site designers and to web site visitors. For example, constructing dynamic interfaces based on visitors' behaviour, preferences or profiles has already been very attractive to several applications such as e-commerce, advertising and e-business. When web users interact with a site, data recording their behaviour is stored in web server logs (in the case of a medium-sized site). A relatively recent research discipline called web usage mining applies data mining techniques to these data in order to capture interesting usage patterns. So far there have been two main approaches to mining users' navigation patterns from log records.
i. DIRECT METHOD
In this case, techniques have been developed which can be invoked directly on the raw web server log data. The most common approach to extracting information about the usage of web sites is statistical analysis; several open-source packages report the most frequent entry and exit points of navigation, the average views of a page, and the hourly distribution of accesses. This type of knowledge can be taken into consideration during system improvement or site modification tasks. For example, decisions about caching policies can be based on detected traffic behaviour, while knowing how users navigate during their sessions is important for site designers who want to improve their content.
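The fragment below sketches how such statistics could be derived directly from raw log records: it groups requests into per-user sessions (using a 30-minute inactivity timeout, which is an assumption) and counts the most frequent entry and exit pages.

from collections import Counter

SESSION_TIMEOUT = 30 * 60  # seconds of inactivity that close a session (assumed)

def entry_exit_points(records):
    """records: list of (user_id, timestamp, page) tuples, already sorted by time."""
    entries, exits = Counter(), Counter()
    last_seen = {}  # user_id -> (last timestamp, last page)
    for user, ts, page in records:
        prev = last_seen.get(user)
        if prev is None or ts - prev[0] > SESSION_TIMEOUT:
            entries[page] += 1       # a new session starts on this page
            if prev is not None:
                exits[prev[1]] += 1  # the previous session ended on its last page
        last_seen[user] = (ts, page)
    for _, last_page in last_seen.values():
        exits[last_page] += 1        # close any still-open sessions
    return entries, exits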
ii. INDIRECT METHOD
In this case, the collected raw web data are transformed into data abstractions (during a pre-processing phase) appropriate for the pattern discovery procedure. According to the type of data used, the sources for capturing interesting user patterns are classified into content, structure, usage and user profile data. Such data can be collected from different sources (e.g. server log files). Server-level collected data keep information about the multiple users who access a single site. However, the collected data might not be reliable, since cached page requests are not logged in the file. Another problem is the identification of individual users, since in most cases web access is not authenticated. On the other hand, client-level collected data reflect the accesses of multiple web sites by a single user and overcome difficulties related to page caching and to user and session identification. In the web usage mining process, association rules discover sets of pages accessed together (without these pages being necessarily connected directly through hyperlinks). For example, at a cinema chain's web site it could be found that users who visited pages about comedies also accessed pages about thrillers.
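A very small sketch of this idea is given below: given sessions represented as sets of visited pages, it counts how often pairs of pages occur together and keeps the pairs whose support exceeds a threshold. The sessions and threshold are illustrative only; a real system would use a proper association-rule algorithm such as Apriori.

from itertools import combinations
from collections import Counter

def frequent_page_pairs(sessions, min_support=0.3):
    """sessions: list of sets of pages visited together in one session."""
    pair_counts = Counter()
    for pages in sessions:
        for pair in combinations(sorted(pages), 2):
            pair_counts[pair] += 1
    total = len(sessions)
    # Keep pairs appearing in at least min_support of all sessions.
    return {pair: count / total
            for pair, count in pair_counts.items()
            if count / total >= min_support}

sessions = [{"/comedy.html", "/thriller.html", "/tickets.html"},
            {"/comedy.html", "/thriller.html"},
            {"/drama.html", "/tickets.html"}]
print(frequent_page_pairs(sessions))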
SIMULATION OF WEB DATA CACHING
Web data caching techniques are used to store data so that they can be retrieved at a low communication cost.
3.0 CACHING METRIC DEFINITIONS
Hit rate is defined as the ratio of documents obtained through the caching mechanism to the total documents requested. A high hit rate reflects an effective cache policy. Byte hit rate is defined as the ratio of the number of bytes served from the cache to the total number of bytes requested; this metric tries to quantify the saved bandwidth, that is, the decrease in the number of bytes retrieved from the origin servers. User response time, the time a user waits for the system to retrieve a requested document, is directly related to the byte hit rate.
SYSTEM UTILIZATION: It is defined as the fraction of time that the system is busy. Latency is defined as the interval between the time a request is issued and the time at which the document appears in the user's browser. Finally, when the size of the objects being cached exceeds the available space, the proxy will need to replace an object from the cache.
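The small helper below illustrates how the hit rate and byte hit rate defined above could be computed from a recorded trace of cache lookups. The trace format, a list of (was_hit, object_size_in_bytes) pairs, is an assumption made for the example.

def cache_metrics(trace):
    """trace: list of (was_hit, size_in_bytes) pairs, one per requested object."""
    total_requests = len(trace)
    total_bytes = sum(size for _, size in trace)
    hits = sum(1 for was_hit, _ in trace if was_hit)
    hit_bytes = sum(size for was_hit, size in trace if was_hit)
    hit_rate = hits / total_requests if total_requests else 0.0
    byte_hit_rate = hit_bytes / total_bytes if total_bytes else 0.0
    return hit_rate, byte_hit_rate

print(cache_metrics([(True, 1000), (False, 5000), (True, 2000)]))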
3.1 WEB DATA CACHING
The explosive growth of World Wide Web data in recent years has resulted in major network traffic and congestion. As a result, the web has become a victim of its own success. These demands for increased performance have driven the innovation of new approaches such as web caching. It is recognized that deploying web caching can make the World Wide Web less expensive and better performing; in particular, it can reduce bandwidth consumption (fewer server requests and responses need to go over the network). Web caching has many similarities with memory system caching: a cache stores frequently used information in a suitable location so that it can be accessed quickly and easily for future use.
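As a concrete, if simplified, illustration of this idea, the sketch below implements a tiny proxy-style cache with a least-recently-used (LRU) replacement policy. LRU is only one of many possible replacement policies and is chosen here purely for illustration.

from collections import OrderedDict

class LRUCache:
    """A minimal LRU cache keyed by URL, limited by total object size in bytes."""

    def __init__(self, capacity_bytes):
        self.capacity = capacity_bytes
        self.used = 0
        self.objects = OrderedDict()  # url -> size, ordered oldest to newest

    def get(self, url):
        if url not in self.objects:
            return False               # miss: the document must be fetched
        self.objects.move_to_end(url)  # hit: mark as most recently used
        return True

    def put(self, url, size):
        if url in self.objects:
            self.used -= self.objects.pop(url)
        # Evict least recently used objects until the new one fits.
        while self.objects and self.used + size > self.capacity:
            _, evicted_size = self.objects.popitem(last=False)
            self.used -= evicted_size
        if size <= self.capacity:
            self.objects[url] = size
            self.used += size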
3.2 SIMULATION CACHING APPROACHES
It is useful to evaluate the performance of proxy caches both for web data managers (selecting the most suitable system for a particular situation) and for developers (working on alternative caching mechanisms). Simulating the web data will help in effective data management on the web. In this context, new simulation approaches are needed for describing the web. An encouraging development for simulating the web presents techniques for constructing appropriate models of the World Wide Web. More specifically, these techniques use the Scalable Simulation Framework (SSF), which is being developed by Cooperating Systems Corporation. SSF provides an interface for constructing process-oriented, event-oriented and hybrid simulations, and it also provides mechanisms for constructing models that can scale to millions of web objects. Therefore, this framework, in conjunction with scaled parallel simulations, makes it possible to analyze the behaviour of complicated web models.
3.3 SIMULATIONS USING CAPTURED LOGS
This kind of simulation is the most popular and is directly related to web performance. Many research efforts have used trace-driven simulation to evaluate the effect of various replacement, threshold and partitioning policies on the performance of a web server. The workload traces for the simulations come from web server access logs. They include access information, configuration errors and resource consumption. In this approach, the logs are the basic components, and they should be recorded and processed carefully. In general, trace-driven simulation involves trace collection, trace reduction and trace processing. Trace collection is the process of determining the sequence of web data requests made by some workload. Because the trace can be very large, trace reduction techniques are often used. The reduced trace is then processed to simulate the behaviour of a system, producing useful metrics such as hit rate, byte hit rate, etc.
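Putting the pieces together, the sketch below shows how a reduced access trace could drive the simple LRU cache sketched in Section 3.1 to produce the hit rate and byte hit rate discussed above. The trace is hard-coded for illustration (a real study would read a processed server log), and the example reuses the LRUCache class and cache_metrics helper from the earlier sketches.

def simulate(trace, cache):
    """trace: list of (url, size_in_bytes) requests; cache: object with get/put."""
    results = []
    for url, size in trace:
        hit = cache.get(url)
        if not hit:
            cache.put(url, size)      # fetch from the origin server, then cache it
        results.append((hit, size))
    return cache_metrics(results)     # hit rate and byte hit rate

trace = [("/a.html", 2000), ("/b.html", 3000), ("/a.html", 2000), ("/c.html", 4000)]
print(simulate(trace, LRUCache(capacity_bytes=6000)))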
3.4 SIMULATION USING SYNTHETIC WORKLOADS
According to Senellart & Williamson (2011), synthetic traces are usually used to generate workloads that do not currently exist. The authors propose a new cache algorithm (RBC) which uses synthetic traces for the simulation of caching media traffic. The selected workload has a predefined distribution; in particular, the objects range from 3 to 6 MB (for audio/video objects).
3.5 SIMULATION USING CURRENT REQUESTS
This kind of simulation utilizes the current requests of a live network. The advantage is that the cache is tested under a realistic load, although the experiments are not reproducible (especially when connected to a live network system). Finally, many research efforts have used a combination of these approaches, often called hybrids. Following these approaches, research efforts are turning to evaluating web data management systems using both captured logs and synthetic workloads.
4.0 CONCLUSION
This paper presents a study of simulation in the web data management process. The extremely large volume of web documents has increased the need for advanced management software implementations that offer an improvement in the quality of web services. A range of appropriate evaluation methodologies for web data management has been developed during the last years. Firstly, these approaches focus on simulating the structure of the web; web graphs are the most common implementation for web data representation. Secondly, it is essential to simulate the web data workloads; this can be implemented using mining techniques, which carefully study the structure of web data and find new trends and patterns that fit well with a statistical model. Finally, various systems have been developed for simulating web caching approaches; these approaches are used for effective storage. All the previous simulation approaches, in conjunction with the emergence of search engines, try to improve both the management of web data (on the server side) and the overall web performance (on the user side).
REFERENCES
Abiteboul, S. & Manolescu, I. (2012). Proc. of the 9th International World Wide Web Conference (WWW9), Computer Networks, vol. 33, no. 1-6, pp. 309-280.
Abrams, M. et al. (2013). Caching proxies: limitations and potentials. Proc. of the 4th International WWW Conference, pp. 119-153.
Arlitt, M. & Williamson, C. (2010). Internet web servers: workload characterization and performance implications. IEEE/ACM Transactions on Networking, vol. 5, no. 5, pp. 631-645.
Rigaux, P. & Rousset, M.C. (2010). World Wide Web caching: trends and techniques. IEEE Communications Magazine, vol. 38, no. 5, pp. 178-185.
Senellart, P. & Williamson, C. (2011). A synthetic workload generation tool for simulation evaluation of web proxy caches. Computer Networks, vol. 38, no. 6, pp. 779-794.