
Big Data Analysis in E-commerce System Using Hadoop MapReduce

Dr. S. Suguna1, M. Vithya2, J.I. Christy Eunaicy3

1 Assistant Professor, Sri Meenakshi Govt. Arts College for Women (A), Madurai-2, Tamil Nadu
2 Research Scholar, Madurai Kamaraj University, Madurai-2, Tamil Nadu
3 Research Scholar and Head, Dept. of Information Technology, Madurai Sivakasi Nadars Pioneer Meenakshi Women's College, Madurai-2, Tamil Nadu
[email protected], [email protected], [email protected]

Abstract: Web mining is today a challenging task for organizations. Every organization generates vast amounts of data from various sources. Web mining is the process of extracting useful knowledge from web resources. Log files are maintained by the web server. The challenging task for e-commerce companies is to understand customer behavior, in order to improve the business, by analyzing web log files. An e-commerce website can generate tens of petabytes of data in its web log files. This paper discusses the importance of log files in the e-commerce world. The analysis of log files is used for learning user behavior in an e-commerce system. The analysis of such large web log files needs parallel processing and a reliable data storage system. The Hadoop framework provides reliable storage through the Hadoop Distributed File System and parallel processing for large datasets through the MapReduce programming model. These mechanisms help to process log data in a parallel manner and compute results efficiently. This approach reduces the response time as well as the load on the end system. This work proposes a predictive prefetching system based on preprocessing of web logs using Hadoop MapReduce, which provides accurate results in minimum response time for e-commerce business activities.

Keywords: E-commerce, Preprocessing, Hadoop, MapReduce, Web log, prediction process

I. INTRODUCTION

Web mining is the application of data mining techniques to extract useful knowledge from web data, which includes web documents, hyperlinks between documents, usage logs of web sites, etc. Web usage mining is the process of applying data mining techniques to discover usage patterns from web data. It is one of the techniques used to personalize web pages. Web usage data is gathered at different levels, such as the server level, client level and proxy level, and from different resources, through web browser and web server interaction using the HTTP protocol [1]. In the current scenario the number of online customers increases day by day, and each click on a web page creates on the order of a hundred bytes of data in a typical website log file. When a web user submits a request to the web server, the user's activities are recorded on the server side; these records of web accesses are called log files. The log files [2] contain entries such as the IP address of the computer making the request, the visitor data, the line of hit, the request method, the location and name of the requested file, the HTTP status code, and the size of the requested file.

Log files can be classified into categories depending on the location of their storage: web server logs and application server logs. A web server [3] maintains two types of log files: the access log and the error log. The access log records all requests that were made of the server. The error log records all requests that failed, together with the reason for the failure as recorded by the application. A log file has many parameters which are very useful for recognizing user browsing patterns [4, 5, 6]. Mining the web log file helps the server and the e-commerce business to predict the behavior of their online customers. Every day the number of online customers increases, and with it the size of the web access logs [7]. Large websites handling millions of simultaneous visitors can generate hundreds of petabytes of logs per day. Existing data mining techniques store web log files in a traditional DBMS and analyze them there, but an RDBMS cannot store and manage petabytes of heterogeneous data. So, to analyze such big web log files efficiently and effectively, we need faster, parallel and scalable data mining algorithms, along with a cluster of storage devices to hold petabytes of web log data and a parallel computing model for analyzing huge amounts of data. The Hadoop framework provides reliable clustered storage to keep large web log file data in a distributed manner, and parallel processing features to process that data efficiently and effectively [8, 9]. The web logs preprocessed in the Hadoop MapReduce environment are further processed to predict the user's next request without disturbing the user, in order to increase interest and reduce the response time of the e-commerce system.

II. RELATED WORK

In [10] the authors proposed a new approach for preprocessing of web log data, in which association rules are employed to extract useful patterns. Log files are the best source for predicting user behavior and analyzing usage patterns, through two phases: a pattern discovery phase and an analysis phase. In [11] the authors proposed data mining techniques comprising a first phase of preprocessing followed by the discovery of user access patterns from web logs. They discussed field extraction and data cleaning algorithms, and showed that web log mining can be used for various applications such as web personalization, site recommendation and site improvement. In [12] the authors analyzed some important aspects such as data exploration and the activity and preferences of users. In [13] the authors discussed discovering the frequent
usage by the client; their experimental study finds interesting patterns through the association rule mining algorithm and the FP-growth algorithm. They showed that association rule mining has some limitations and suits small datasets, while FP-growth has fewer limitations and suits large datasets without any user interaction. In [14] the authors discussed the importance of data preprocessing methods and user session identification methods for any transaction file. In [15] the authors proposed some filtering methods based on statistical attributes to discover rules or patterns. They showed that the tools do not indicate which web usage mining algorithms are used, but do provide effective graphical visualizations of the results. Paper [16] applied the k-means clustering algorithm first and then applied association rule mining to the clustered data for pattern discovery; they identified as a drawback the generation of irrelevant rules. In [17] the authors proposed a method to find same-user sessions, grouping similar data based on two similarities, user similarity and session similarity, which may be useful for grouping the same web users. In [18] a web log analyzer tool is used for analyzing usage patterns from web logs. They showed that such tools help the web administrator improve website performance through improvements to content, structure, presentation and delivery.

The paper structure is as follows. In section 3 the proposed architecture is discussed. Section 4 shows the experimental results analysis, and the final section concludes the paper.

III. PROPOSED MODEL

The proposed system, as depicted in figure 1, is composed of three stages: the first is log preprocessing, the second is analysis, and predictive prefetching is the last stage of the proposed architecture. This paper discusses a method that uses optimal time to search the server. The proposed model of this section is organized as follows: in section 3.1 the importance of the preprocessing process is discussed, and in section 3.2 an algorithm is proposed for the optimal search process.

Consider the sample logs from the recipe log file in table 1. Each entry contains an object id, name, ingredients, the URL of the page, image, time spent, date, source, recipe yield, date published, cook time, prep time, description, status code and IP address.

Table 1. Sample recipe logs

3.1.A. Preprocessing Phase

This phase is important for removing unwanted log entries from the input log files. Using web logs we can predict a user's next request without disturbing them [19]. But not all details in web logs are appropriate for the purpose of mining navigation patterns, so the log needs cleaning before it can be used for prediction. Consider the cleaning algorithm (CLE-ALG). The main purpose of log preprocessing is to reduce the quantity of the data set from its original size and decrease the prediction processing time.

Cleaning Algorithm (CLE-ALG)

Input: log file with unwanted details
Output: cleaned data

Begin
I. Delete logs generated for extremely long user sessions by search engines.
II. Delete logs having status codes other than 200 with the GET method.
III. Delete entries related to web robot requests and system requests.
IV. Include logs with .jpg and .html extensions in user requested pages.
End.
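The cleaning steps above can be sketched as a simple filter over parsed log entries. This is a minimal sketch, assuming a dictionary representation of an entry and a session-length threshold; the field names, the robot pattern and the 30-minute cutoff are illustrative assumptions, since the paper does not give the exact log format.

```python
import re

# Illustrative log-entry filter following the CLE-ALG steps; the field
# names, robot pattern and session cutoff are assumptions, not the
# paper's exact format.
ROBOT_PATTERN = re.compile(r"bot|crawler|spider", re.IGNORECASE)
ALLOWED_EXTENSIONS = (".jpg", ".html")
MAX_SESSION_SECONDS = 30 * 60  # assumed cutoff for "extremely long" sessions

def is_clean(entry):
    """Return True if a parsed log entry should be kept for mining."""
    # Step II: keep only successful (200) GET requests.
    if entry["status"] != 200 or entry["method"] != "GET":
        return False
    # Step III: drop web-robot and system requests.
    if ROBOT_PATTERN.search(entry["agent"]):
        return False
    # Step I: drop extremely long sessions (assumed search-engine activity).
    if entry["session_seconds"] > MAX_SESSION_SECONDS:
        return False
    # Step IV: keep only user-requested pages with .jpg or .html extension.
    return entry["url"].endswith(ALLOWED_EXTENSIONS)

entries = [
    {"status": 200, "method": "GET", "agent": "Mozilla/5.0",
     "session_seconds": 120, "url": "/recipe/42.html"},
    {"status": 404, "method": "GET", "agent": "Mozilla/5.0",
     "session_seconds": 60, "url": "/missing.html"},
    {"status": 200, "method": "GET", "agent": "Googlebot/2.1",
     "session_seconds": 10, "url": "/recipe/42.html"},
]
cleaned = [e for e in entries if is_clean(e)]
print(len(cleaned))  # only the first entry survives
```

Running the filter over the sample entries keeps only the successful, human-issued .html request; the failed request and the robot request are dropped.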
3.1.B. Analysis Phase

In this phase the cleaned logs are processed to generate logs with counts based on recipe id and recipe preparation time, using Hadoop and MapReduce. Log files collected [20, 21] from many different types of servers are fetched via Apache Flume and loaded into a Hadoop cluster. Jobs are scheduled to analyze the logs and generate aggregated summary metrics and visualizations using business intelligence tools. MapReduce processes the file blocks in a parallel manner. Figure 2 shows how the log files are distributed by MapReduce.

In this environment, the input file is split into four parts and each part is stored on a different node (node 1, node 2, node 3 and node 4). The same parts are replicated across different nodes, so the failure of any node never leads to data loss; the data can be retrieved from another node. The functionality of the Mapper is to slice the input problem into smaller sub-problems and distribute these to worker nodes. In the Reducer step, the master node takes the answers to the sub-problems and combines them into the answer to the original problem. The MapReduce algorithmic steps for counting the cook time frequency from recipe log files are shown below (AHMR).

The input to this function is a recipe log file. For each cook time in the recipe site, a line is added to the recipe log file. In the Mapper function, each block of the recipe log file is given as input to a map function, which in turn parses each line using a regular expression and emits the recipe item as a key along with the value 1: (cook time 1, 1), (cook time 2, 1), ..., (cook time n, 1). After mapping, the shuffle collects all the (key, value) pairs that have the same cook time from the different map functions and forms groups. After this process, group 1 entries will be (cook time 1, 1), (cook time 1, 1) and so on; group 2 entries will be (cook time 2, 1), (cook time 2, 1) and so on. Then the reducer function calculates the sum for each cook time group. The result of the reduce function is (cook time 1, sum), ..., (cook time n, sum), and for each IP address it calculates (obj.id 1, no. of references), ..., (obj.id m, no. of references). Figure 3 shows how the MapReduce framework works.

3.2. Prediction Process

Consider the algorithm for the prediction process (HM_PP).
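The cook-time counting flow described in the analysis phase above can be sketched in the style of Hadoop Streaming, where map and reduce are plain functions over (key, value) pairs. This is a minimal in-process sketch: the sample log lines, their comma-separated layout and the position of the cook-time field are invented assumptions, and the `shuffle` function stands in for Hadoop's sort/shuffle stage.

```python
from collections import defaultdict

# Hadoop-Streaming-style sketch of the cook-time frequency count; the
# sample log lines and field layout are assumptions for illustration.
def map_phase(lines):
    """Emit (cook_time, 1) for every recipe log line."""
    for line in lines:
        fields = line.split(",")
        cook_time = fields[2]  # assumed position of the cook-time field
        yield (cook_time, 1)

def shuffle(pairs):
    """Group values by key, as Hadoop's sort/shuffle stage would."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reduce_phase(groups):
    """Sum the 1s in each group to get the frequency per cook time."""
    return {key: sum(values) for key, values in groups.items()}

log_lines = [
    "obj1,Pasta,30min,192.168.2.10",
    "obj2,Soup,15min,192.168.2.11",
    "obj3,Stew,30min,192.168.2.10",
]
counts = reduce_phase(shuffle(map_phase(log_lines)))
print(counts)  # {'30min': 2, '15min': 1}
```

The same shape of pipeline, with the key switched from cook time to (IP address, object id), gives the per-IP reference counts mentioned above.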

IV. RESULTS AND DISCUSSION

To calculate the total number of recipes per cook time received by each recipe item, a single node Hadoop cluster is set up with the following configuration: Ubuntu 14.04 operating system, Hadoop version 2.6.0, a single node cluster at 192.168.2.1, and a dataset of Amazon recipe logs of 1 terabyte. Before executing the MapReduce code in the single node cluster environment, the recipe log file is loaded into the HDFS of the Hadoop framework. The MapReduce function is used to count the total number of recipes with the same cook time, and also the total number of references to each recipe id from each IP address. Figure 4 shows the contents of the output directory, named "number of cook time by recipe items", in HDFS. The output is stored in a file called part-r-00000. Figure 5 shows a chunk of the output file generated when the MapReduce code for calculating the number of recipes with the same cook time per recipe item is run. The Hadoop MapReduce algorithm executed in 52423 milliseconds in the MapReduce environment. The number of mapper tasks launched was 5 and the number of reducer tasks launched was 2. The time taken by the map tasks was 22 seconds and by the reducer tasks 32 seconds. The analysis of recipe items based on cook time is shown in figure 6. The analysis of hit counts based on recipe id and IP address is shown in figure 7. The performance analysis of the Hadoop environment is shown in figure 8.

If a requested document is in the cache, the request can be satisfied immediately, which is called a hit; otherwise the document has to be fetched from the original server (or proxy), which is termed a miss.
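The hit/miss behavior described above can be illustrated with a toy cache. This is a sketch under simplifying assumptions: the request sequence is invented, the cache is an unbounded set, and no particular replacement policy (such as the GDSF policy used later) is modeled.

```python
# Toy illustration of cache hits and misses; the request sequence is
# invented and no replacement or prefetching policy is modeled.
def serve_requests(requests, prefetched):
    """Count hits (served from cache) and misses (fetched from server)."""
    cache = set(prefetched)
    hits = misses = 0
    for doc in requests:
        if doc in cache:
            hits += 1       # satisfied immediately from the cache
        else:
            misses += 1     # fetched from the original server (or proxy)
            cache.add(doc)  # keep the fetched document in the cache
    return hits, misses

requests = ["a.html", "b.html", "a.html", "c.html", "b.html"]
hits, misses = serve_requests(requests, prefetched={"a.html"})
print(hits, misses)  # 3 2
```

With "a.html" prefetched, three of the five requests hit the cache and two miss, which is the raw material for the hit rate defined next.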

Hit Rate = (Number of requests that hit in the cache) / (Total number of requests)

Waste ratio refers to the percentage of undesired documents that are prefetched into the cache. Figure 9 shows the accuracy of our algorithm HM_PP using hit ratio analysis for the prefetching threshold values 0.25 and 0.5. The Greedy Dual Size-Frequency (GDSF) caching policy is utilized in our approach.

Figure 9. Analysis of Cache Hit Ratio

The analysis shows that the HM_PP algorithm leads to good response time and accuracy for processing huge amounts of web logs for predictive prefetching in the e-commerce world.
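The two ratios used in this analysis can be computed directly from request counters. The counter values below are invented for illustration; they are not measurements from the paper's experiments.

```python
# Hit rate and waste ratio from cache counters; the counts below are
# invented for illustration, not the paper's measured results.
def hit_rate(cache_hits, total_requests):
    """Fraction of requests satisfied directly from the cache."""
    return cache_hits / total_requests

def waste_ratio(unused_prefetched, total_prefetched):
    """Fraction of prefetched documents that were never requested."""
    return unused_prefetched / total_prefetched

print(hit_rate(750, 1000))   # 0.75
print(waste_ratio(50, 200))  # 0.25
```

A good prefetching threshold is the one that pushes the hit rate up without letting the waste ratio grow, since every wasted prefetch costs bandwidth and cache space.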
V. CONCLUSION

This paper describes a detailed view of processing big data, such as a recipe log file with one terabyte of logs, using the Hadoop framework. It shows how to process log files using MapReduce and how the Hadoop framework is used for parallel computation over log files. Data collected from various resources is loaded into HDFS to facilitate MapReduce and the Hadoop framework. We showed that processing big data in the Hadoop environment leads to minimum computation and response time, and that our HM_PP algorithm gives good accuracy in predicting user-preferred pages. So, one can access the e-commerce system, with the help of big data analytics tools, with less response time and good prediction accuracy. In future, log analysis can be carried out by correlation engines like RSA enVision and in an HA cloud environment. The above work can also be extended with semantic analysis for better prediction.

References

[1]. M. Santhanakumar and C. Christopher Columbus, "Web Usage Analysis of Web Pages Using Rapidminer", WSEAS Transactions on Computers, E-ISSN: 2224-2872, vol. 3, May 2015.
[2]. Shaily G. Langhnoja, Mehul P. Barot and Darshak B. Mehta, "Web Usage Mining Using Association Rule Mining on Clustered Data for Pattern Discovery", International Journal of Data Mining Techniques and Applications, vol. 2, issue 1, June 2013.
[3]. Web server logs, ://http. Sever side log.org.
[4]. Nanhay Singh, Achin Jain and Ram Shringar Raw, "Comparison Analysis of Web Usage Mining Using Pattern Recognition Techniques", International Journal of Data Mining & Knowledge Process (IJDKP), vol. 3, issue 4, July 2013.
[5]. J. Srivastava et al., "Web Usage Mining: Discovery and Applications of Usage Patterns from Web Data", ACM SIGKDD Explorations, vol. 1, issue 2, pp. 12-23, 2000.
[6]. S. Saravanan and B. Uma Maheswari, "Analyzing Large Web Log Files in a Hadoop Distributed Cluster Environment", International Journal of Computer Technology & Applications, vol. 5, pp. 1677-1681.
[7]. K. V. Shvachko, "The Hadoop Distributed File System Requirements", MSST '10: Proceedings of the 2010 IEEE 26th Symposium on Mass Storage Systems and Technologies (MSST).
[8]. Apache Hadoop, http://hadoop.apache.org.
[9]. Orzota Inc., "Beyond Web Application Log Analysis using Apache Hadoop", white paper.
[10]. Resul Das and Ibrahim Turkoglu, "Extraction of Interesting Patterns through Association Rule Mining for Improvement of Website Usability", International Journal of Electrical & Electronics Engineering, vol. 9, issue 2, 2010.
[11]. Amit Pratap Singh and Dr. R. C. Jain, "A Survey on Different Phases of Web Usage Mining for Anomaly User Behavior Investigation", International Journal of Emerging Trends & Technology in Computer Science, vol. 3, issue 3, May 2014.
[12]. Theint Aye, "Web Log Cleaning for Mining of Web Usage Patterns", IEEE, 2011.
[13]. Ms. Shashi Sahu and Leena Sahu, "A Survey on Frequent Web Mining with Improving Data Quality of Log Cleaner", International Journal of Advanced Research in Computer Engineering & Technology, vol. 4, issue 3, March 2015.
[14]. Rahul Mishra and Abha Choubey, "Comparative Analysis of Apriori Algorithm and Frequent Pattern Algorithm for Frequent Pattern Mining in Web Log Data", International Journal of Computer Science and Information Technologies, vol. 3, 2012.
[15]. Pani S. K., Panigraphy L., Sankar V. H., Bikram Keshari Ratha and Padhi A. K., "Web Usage Mining: A Survey on Pattern Extraction from Web Log", International Journal of Instrumentation Control & Automation, vol. 1, issue 1, 2011.
[16]. Suresh R. M. and Padmajavalli R., "An Overview of Data Preprocessing in Data and Web Usage Mining", IEEE, 2006.
[17]. Maryam Jafari and Shahram Jamali, "Discovering Users Access Patterns for Web Usage Mining from Web Log Files", Journal of Advances in Computer Research, vol. 4, issue 3, August 2013.
[18]. Preeti Sharma and Sanjay Kumar, "An Approach for Customer Behavior Analysis Using Web Mining", International Journal of Internet Computing, vol. 1, issue 2, 2011, ISSN 2231-6965.
[19]. G. Arumugam and S. Suguna, "Optimal Algorithms for Generation of User Session Sequences Using Server Side Web User Logs", IEEE Explorer, pp. 1-6, ISBN: 978-2-9532-4431-1, June 2009.
[20]. Sayalee Narkhede, Trupti Baraskar and Debajyoti Mukhopadhyay, "Analyzing Web Application Log Files to Find Hit Count through the Utilization of Hadoop MapReduce in Cloud Computing Environment", 2014 Conference on IT in Business, Industry and Government (CSIBIG).
[21]. S. Siddharth Adhikari, Devesh Saraf, Mahesh Revanwar and Nikhil Ankam, "Analysis of Log Data and Statistics Report Generation Using Hadoop", International Journal of Innovative Research in Computer and Communication Engineering, vol. 2, issue 4, April 2014.
