IEEE Conference Christy
Abstract: Today web mining is a challenging task in any organization. Every organization generates vast amounts of data from various sources. Web mining is the process of extracting useful knowledge from web resources. Log files are maintained by the web server. The challenging task for E-commerce companies is to understand their customers' behavior in order to improve the business by analyzing web log files. An E-commerce website can generate tens of petabytes of data in its web log files. This paper discusses the importance of log files in the E-commerce world. The analysis of log files is used for learning user behavior in an E-commerce system. The analysis of such large web log files needs parallel processing and a reliable data storage system. The Hadoop framework provides reliable storage through the Hadoop Distributed File System and a parallel processing system for large datasets using the MapReduce programming model. These mechanisms help to process log data in parallel and compute results efficiently. This approach reduces the response time as well as the load on the end system. This work proposes a predictive prefetching system based on preprocessing of web logs using Hadoop MapReduce, which provides accurate results with minimum response time for E-commerce business activities.

Keywords: E-commerce, Preprocessing, Hadoop, MapReduce, Web log, prediction process
I. INTRODUCTION

Web mining is the application of data mining techniques to extract useful knowledge from web data, which includes web documents, hyperlinks between documents, usage logs of web sites, etc. Web usage mining is the process of applying data mining techniques to discover usage patterns from web data; it is one of the techniques used to personalize web pages. Web usage data is gathered at different levels, such as the server level, client level, and proxy level, and from different resources through the interaction of the web browser and web server using the HTTP protocol [1].

In the current scenario, however, the number of online customers increases day by day, and each click on a web page creates on the order of a hundred bytes of data in a typical website log file. When a web user submits a request to a web server, the user's activities are recorded on the server side at the same time; these web access records are called log files. The request information sent by the user to the web server via the protocol is recorded in the log file. The log file [2] contains entries such as the IP address of the computer making the request, the visitor data, the line of the hit, the request method, the location and name of the requested file, the HTTP status code, and the size of the requested file.
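As an illustration of such an entry, the following minimal sketch parses one access-log line, assuming the Apache Common Log Format; the sample line and field positions are illustrative only, since real servers may log additional fields.

import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Minimal sketch: parse one access-log entry in the Apache Common Log Format.
// host ident authuser [date] "method resource protocol" status bytes
public class LogLineParser {
    private static final Pattern CLF = Pattern.compile(
        "^(\\S+) (\\S+) (\\S+) \\[([^\\]]+)\\] \"(\\S+) (\\S+) (\\S+)\" (\\d{3}) (\\S+)$");

    public static void main(String[] args) {
        String line = "192.168.1.10 - - [10/Oct/2023:13:55:36 +0000] "
                    + "\"GET /recipes/42 HTTP/1.1\" 200 2326";
        Matcher m = CLF.matcher(line);
        if (m.matches()) {
            System.out.println("ip     = " + m.group(1)); // requesting host
            System.out.println("method = " + m.group(5)); // request method
            System.out.println("file   = " + m.group(6)); // requested file
            System.out.println("status = " + m.group(8)); // HTTP status code
            System.out.println("size   = " + m.group(9)); // size of requested file
        }
    }
}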
Log files can be classified into categories depending on the location of their storage, namely web server logs and application server logs. A web server [3] maintains two types of log files: the access log and the error log. The access log records all requests that were made to the server; the error log records all requests that failed and the reason for the failure as recorded by the application. A log file has many parameters that are very useful for recognizing user browsing patterns [4, 5, 6]. Mining the web log file helps the server and the E-commerce business predict the behavior of their online customers.

Every day the number of online customers increases, and with it the size of the web access log [7]. Large websites handling millions of simultaneous visitors can generate hundreds of petabytes of logs per day. Existing data mining techniques store web log files in a traditional DBMS and analyze them there, but an RDBMS cannot store and manage petabytes of heterogeneous data. So, to analyze such big web log files efficiently and effectively, we need to develop faster, more efficient parallel and scalable data mining algorithms, along with a cluster of storage devices to store petabytes of web log data and a parallel computing model for analyzing this huge amount of data. The Hadoop framework provides reliable clusters of storage to keep large web log data in a distributed manner, and parallel processing features to process large web log files efficiently and effectively [8, 9]. The web logs preprocessed in the Hadoop MapReduce environment are further processed to predict the user's next request without disturbing the user, in order to increase user interest and reduce the response time of the E-commerce system.
II. RELATED WORK

In [10] the authors proposed a new approach for preprocessing of web log data, and association rules are employed to extract the useful patterns. Log files are the best source for predicting user behavior and for analyzing usage patterns through two phases, the pattern discovery phase and the analysis phase. In [11] the authors proposed data mining techniques, with preprocessing as the first phase, to discover user access patterns from web logs. They discussed field extraction and data cleaning algorithms and showed that web log mining can be used for various applications such as web personalization, site recommendation, and site improvement. In [12] the authors analyzed some important aspects such as data exploration and the activity and preferences of users. In [13] the authors discussed discovering the frequent usage by the client, and their experimental study finds some interesting patterns through the association rule mining algorithm and the FP-growth algorithm. They showed that association rule mining has some limitations and is suitable only for small data sets, while FP-growth has fewer limitations and is suitable for large data sets without any user interaction. In [14] the authors discussed the importance of data preprocessing methods and user session identification methods for any transaction files. In [15] the authors proposed some filtering methods based on statistical attributes to discover rules or patterns. They showed that the tools do not indicate which web usage mining algorithms are used, but they provide effective graphical visualizations of the results. In [16] the k-means clustering algorithm is applied first and then the association rule mining technique is applied on the clustered data for pattern discovery; the authors found drawbacks, including the generation of irrelevant rules. In [17] the authors proposed a method to find the same user session. They grouped similar data based on two similarities, user similarity and session similarity, which may be useful for grouping the same web users. In [18] a web log analyzer tool is used for analyzing usage patterns from web logs. The authors showed that such tools help the web administrator improve website performance through improvements in content, structure, presentation, and delivery.
The paper structure is as follows. In Section 3 the proposed architecture is discussed, and Section 4 shows the experimental results analysis and concludes the paper.
III. PROPOSED MODEL

The proposed system, as depicted in Figure 1, is composed of three stages: the first is log preprocessing, the second is analysis, and predictive prefetching is the last stage of the proposed architecture. This paper discusses a method that uses optimal time to search the server. The proposed model of this section is organized as follows: in Section 3.1 the importance of the preprocessing process is discussed, and in Section 3.2 an algorithm is proposed for the optimal search process.

Consider the sample logs from the recipe log file in Table 1. Each entry contains an object id, name, ingredients, the URL for the page, an image, the time spent, date, source, recipe yield, date published, cook time, prep time, description, status code, and IP address.

Table 1. Sample recipe logs
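To make this schema concrete, the sketch below models one recipe log entry as a plain Java record. The field names follow the list above; the types are assumptions, since the exact layout of Table 1 is not reproduced here.

import java.time.Duration;
import java.time.LocalDate;
import java.util.List;

// Sketch of one entry from the recipe log file; field types are assumed,
// since only the field names are given in the text above.
public record RecipeLogEntry(
        long objectId,
        String name,
        List<String> ingredients,
        String url,            // URL for this page
        String image,
        Duration timeSpent,    // time the visitor spent on the page
        LocalDate date,
        String source,
        String recipeYield,
        LocalDate datePublished,
        Duration cookTime,
        Duration prepTime,
        String description,
        int statusCode,        // HTTP status code
        String ipAddress) {}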
3.1. Preprocessing Phase:

This phase is important for removing unwanted log entries from the input log files. Using web logs we can predict the user's next request without disturbing the user [19]. But not all details in web logs are appropriate for the purpose of mining navigation patterns, so the log needs cleaning before it can be used for prediction. Consider the cleaning algorithm (CLE_ALG). The main purpose of log preprocessing is to reduce the quantity of the data set from its original quantity and to decrease the prediction process time.
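As a sketch of such a cleaning step (a common heuristic, not necessarily the exact CLE_ALG rules), the filter below keeps only successful page requests, dropping embedded resources such as images and stylesheets; it reuses the RecipeLogEntry record sketched above.

import java.util.List;
import java.util.stream.Collectors;

// Hedged sketch of a cleaning step in the spirit of CLE_ALG: keep only
// successful page requests, dropping embedded resources and failed hits.
public class LogCleaner {
    static boolean isRelevant(RecipeLogEntry e) {
        boolean ok = e.statusCode() >= 200 && e.statusCode() < 300;        // drop failed requests
        boolean page = !e.url().matches(".*\\.(gif|jpg|jpeg|png|css|js)$"); // drop embedded resources
        return ok && page;
    }

    static List<RecipeLogEntry> clean(List<RecipeLogEntry> raw) {
        return raw.stream().filter(LogCleaner::isRelevant).collect(Collectors.toList());
    }
}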
In this environment, the input file is split into four splits, and each split is stored on a different node (node 1, node 2, node 3, and node 4). The same file is also replicated across different nodes, so the failure of any node never leads to data loss; the data can be served from any other node. The functionality of the Mapper is to slice the input problem into smaller sub-problems and distribute these to worker nodes. In the Reducer step, the master node takes the answers to the sub-problems and combines them into the answer to the original problem.
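This fault tolerance comes from HDFS block replication. The following is a minimal sketch of loading a web log into HDFS with an explicit replication factor through the Hadoop client API; the paths are illustrative, and the factor 3 is the usual HDFS default, assumed here.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

// Sketch: copy a local web log into HDFS and ask for 3 replicas per block,
// so the loss of any single node does not lose data. Paths are illustrative.
public class StoreLog {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        conf.set("dfs.replication", "3"); // replicas per block (assumed default)
        FileSystem fs = FileSystem.get(conf);
        fs.copyFromLocalFile(new Path("/var/log/recipe.log"),
                             new Path("/logs/recipe.log"));
        fs.setReplication(new Path("/logs/recipe.log"), (short) 3);
    }
}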
The MapReduce algorithmic steps for counting the cook-time frequency from the recipe log files are shown below (AHMR).
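As a word-count-style sketch of this step, the job below counts how often each cook-time value occurs, assuming the cleaned log lines are tab-separated and that the cook time is the field at index 10 (an illustrative position).

import java.io.IOException;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

// Sketch in the spirit of AHMR: count cook-time frequencies with MapReduce.
public class CookTimeCount {
    public static class CookTimeMapper
            extends Mapper<LongWritable, Text, Text, IntWritable> {
        private static final IntWritable ONE = new IntWritable(1);
        @Override
        protected void map(LongWritable key, Text line, Context ctx)
                throws IOException, InterruptedException {
            String[] fields = line.toString().split("\t");
            if (fields.length > 10) {                 // field 10 = cook time (assumed)
                ctx.write(new Text(fields[10]), ONE); // emit (cookTime, 1)
            }
        }
    }

    public static class SumReducer
            extends Reducer<Text, IntWritable, Text, IntWritable> {
        @Override
        protected void reduce(Text cookTime, Iterable<IntWritable> counts, Context ctx)
                throws IOException, InterruptedException {
            int sum = 0;
            for (IntWritable c : counts) sum += c.get(); // total hits per cook time
            ctx.write(cookTime, new IntWritable(sum));
        }
    }

    public static void main(String[] args) throws Exception {
        Job job = Job.getInstance(new Configuration(), "cook-time frequency");
        job.setJarByClass(CookTimeCount.class);
        job.setMapperClass(CookTimeMapper.class);
        job.setCombinerClass(SumReducer.class); // partial sums on each node
        job.setReducerClass(SumReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);
        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}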
3.2. Prediction Process

Consider the algorithm for the prediction process (HM_PP).
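As a rough illustration of threshold-based prediction over such frequency counts (the actual HM_PP steps are not shown here), the sketch below marks pages whose request frequency crosses a support threshold as candidates for predictive prefetching; the threshold and data structures are assumptions.

import java.util.List;
import java.util.Map;
import java.util.stream.Collectors;

// Hedged sketch of a threshold-based prediction step: given per-page request
// frequencies (e.g., the reducer output above), mark pages that cross a
// support threshold as candidates for predictive prefetching. This only
// illustrates the idea, not the paper's HM_PP algorithm.
public class Prefetcher {
    static List<String> candidates(Map<String, Integer> frequency, int threshold) {
        return frequency.entrySet().stream()
                .filter(e -> e.getValue() >= threshold)        // keep frequent requests
                .sorted((a, b) -> b.getValue() - a.getValue()) // most frequent first
                .map(Map.Entry::getKey)
                .collect(Collectors.toList());
    }

    public static void main(String[] args) {
        Map<String, Integer> freq = Map.of(
                "/recipes/42", 120, "/recipes/7", 85, "/about", 3);
        System.out.println(candidates(freq, 50)); // -> [/recipes/42, /recipes/7]
    }
}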
Threshold Values