Clickstream Analysis
CLICKSTREAM DATA
USING HADOOP
Project Report submitted in partial fulfillment of the requirements for the
award of the degree of
Master of Computer Application
Submitted By
Kartik Gupta
(Roll No. 201100048)
Under the supervision of:
Mr. Pritpal Singh
Assistant Professor
June 2014
Certificate
I hereby certify that the work being presented in this report, Commercial Analytics
of Clickstream Data using Hadoop, in partial fulfillment of the requirements for the
award of the degree of Master of Computer Application, submitted to the School of
Mathematics and Computer Applications, Thapar University, Patiala, is an authentic
record of my own work carried out under the supervision of Mr. Abhinandan Garg
and Mr. Pritpal Singh.
Kartik Gupta
201100048
Date:
Certified that the above statement made by the student is correct to the best of our
knowledge and belief.
Faculty Coordinator
Industry Coordinator
Acknowledgement
I would like to express my deep sense of gratitude to my supervisor, Mr. Pritpal
Singh, Assistant Professor, School of Mathematics and Computer Applications, Thapar
University, Patiala, for his invaluable help and guidance during the course of this work. I
am highly indebted to him for constantly encouraging me with his critiques of my
work. I am grateful to him for the support and confidence that helped me greatly
in carrying out the project work in its present form, and it is an honour to
work under him.
I also take this opportunity to thank Dr. Rajiv Kumar (SDP Coordinator), Associate
Professor, School of Mathematics and Computer Applications, Thapar University,
Patiala, for providing us with excellent infrastructure and facilities, which have helped me
become a complete software engineer.
I would also like to thank my parents and friends for their inspiration and ever
encouraging moral support, which went a long way towards the successful completion of my
training.
Above all, I would like to thank the almighty God for his blessings and for sustaining me
with faith, hope and courage in the most difficult of times.
Kartik Gupta
Abstract
Hadoop plays an important role today in solving big data problems, because everyone
is surrounded by data; in fact, it would not be wrong to say that we live in a data
age. As data keeps increasing, it is becoming more and more challenging for IT
companies to maintain and analyze huge datasets. Nowadays many companies face
similar scaling challenges, and it is not feasible for each of them to
reinvent their own proprietary tools. Doug Cutting saw an opportunity and led the
charge to develop an open source version of the MapReduce system, known as Hadoop.
Hadoop is best known for MapReduce and its distributed file system (HDFS). Its
distributed file system runs on large clusters of commodity machines. Essentially, it
distributes a large dataset (terabytes or even petabytes) across thousands of machines
so that they can work on it in parallel and manage the data quickly. HDFS was
inspired by GFS: like GFS, it provides a fault-tolerant way to store data on commodity
hardware and delivers high aggregate performance to its clients.
MapReduce is a framework used for parallel implementation. MapReduce splits the
input into independent chunks and executes them in parallel over different
mappers. When using MapReduce, the developer need not worry about other factors
such as fault tolerance and load balancing; all of these are handled by MapReduce, so
the developer can concentrate on the programming itself. Without MapReduce it would
not be efficient to process such large spatial data. Spatial data may contain
knowledge about a lot of locations, so it is very important to process it efficiently.
Table of Contents
Certificate
Acknowledgement
Abstract
Table of Contents
List of Figures
Chapter 1 Introduction
1.1 Big Data
1.1.1 Uses of big data
1.2 Clickstream Data
1.2.1 Potential Uses of Clickstream Data
1.3 Point of Interest
1.4 Tableau (Analytic Tool)
1.5 MapReduce
1.5.1 Types of nodes in MapReduce
1.5.2 Execution flow in MapReduce
1.5.3 Input and output to MapReduce
Chapter 2 Hadoop
2.1 Hadoop Overview
2.2 HDFS
2.3 Hadoop Cluster
2.4 Data Replication
Chapter 3 Literature Review
3.1 Clickstream query in event detection
List of Figures
Figure 1.1 Subdividing the earth's surface up to one bit
Figure 1.2 Subdividing the earth's surface up to two bits
Figure 1.3 Subdividing the earth's surface up to three bits
Figure 1.4 Various nodes in MapReduce
Figure 1.5 Execution flow in MapReduce
Figure 1.6 Architecture of Hadoop
Figure 1.7 Replication in Hadoop
Figure 2.1 Compression and decompression of RDF
Figure 2.2 Locations across a region in the zip code method
Figure 2.3 Nearest locations found by the zip code method
Figure 2.4 Locations across a region in the polygon method
Figure 2.5 Nearest locations found by the polygon method
Figure 2.6 Locations inside a small polygon
Figure 2.7 Voronoi diagram
Figure 2.8 Nearest locations found through the Voronoi method
Figure 2.9 Procedure to make a k-d tree
Figure 2.10 k-d tree
Figure 3.1 Gaps in the zip code method
Figure 3.2 Gaps in the zip code method
Figure 3.3 Longitudes and latitudes of the Earth
Figure 4.1 Flowchart of the Mapper
Figure 4.2 Flowchart of the Reducer
Figure 5.1 Configuration of Hadoop
Figure 5.2 Types of input files
Chapter 1
Introduction
1.1 Big Data
Big data is a collection of large data sets. Due to their giant size, it is not efficient to
process them using traditional methods. The problems that have to be faced in
processing big data include capturing the data, storage, search, sharing, transfer, and
analysis. The trend towards giant data sets exists because of the extra information that
can be derived from the analysis of a single large set of correlated data, as compared to
separate smaller sets with the equivalent total amount of data. Big data helps in finding
relations between various fields, which may help in many ways, such as decision making,
understanding business trends, long-term planning, fighting crime, and getting
real-time roadway traffic conditions. But due to its correlated nature it becomes
difficult to query. Professionals are trying to derive results from this huge amount
of data.
This explosion of data is seen in every sector of the computing industry. Internet
companies such as Google, Yahoo and Facebook have to deal with large amounts of
data generated by users in the form of blog posts, photographs,
status messages, and audio/video files. There is also a huge quantity of data
indirectly generated by web sites in the form of access log files, click-through events,
and so on. Analysis of this data can reveal useful patterns about the behaviour of users.
Most of this data is generated continuously; the data sets are stored temporarily for a
fixed period and, once they have been used, they are discarded. According to Google,
the online data on the web today is 281 exabytes, up from 5 exabytes in 2002. There has
been a tremendous increase in user-generated data since 2005.
1.1.1 Uses of big data
People are sharing more and more data online, such as details of their personal lives and
their opinions on people and products.
Companies are trying very hard to use this data to infer user
preferences, to generate better options for users, and simply to get to know their users.
Big data is beneficial in situations where one wants minute-by-minute
or second-by-second information about something, because the accumulated data
contains all the records.
Big data helps in keeping information about real-time traffic conditions, which may
help in understanding them better and avoiding jams.
If minute-by-minute data about the weather is kept, it can help in anticipating
situations like floods or other calamities.
What is the most efficient path for a site visitor to research a product, and then
buy it?
What products do visitors tend to buy together, and what are they most likely
to buy in the future?
Clickstream analysis also helps companies improve their delivery of services and
advertisements based on user interest, thereby improving the quality of user interaction
and leading to higher customer loyalty. It is important to identify visitors' interests so
that their needs can be fulfilled accordingly and growth rates improved per visitor,
because not all visitors are equal. Some are more profitable to the organization than
others, for example because they are more likely to buy high-profit products. Thus, it
is necessary to make sure that the site caters to the desired users. Using these groups, it
is desirable to analyze banner efficiency for each user group, and to optimize it by
presenting the right banners to the right people. Analysis of customer interest for
product development purposes is also very interesting.
1.4 Tableau
Tableau has optimized direct connections to many high-performance databases,
cubes, Hadoop, and cloud data sources such as salesforce.com and Google Analytics.
We can work directly with the data to create reports and dashboards. Tableau connects
live or brings your data into its fast, in-memory analytical engine.
It is easy to get started: just connect to one of the more than 30 databases and formats
supported by Tableau, enter your credentials and begin, with a single click to select a
live connection or in-memory analytics.
If you have Hadoop data, it is probably huge, unstructured, nested, or all three. Hadoop's
distributed file system (HDFS) and the MapReduce algorithm support parallel
processing across massive data. This lets you work with data that traditional databases
find extremely difficult to process, including unstructured data and XML data.
Tableau connects to multiple flavours of Hadoop. Once connected, you can bring data
in-memory and do fast ad-hoc visualization, and see patterns and outliers in all the data
that is stored in your Hadoop cluster.
The bottom line is that Hadoop reporting is faster, easier and more efficient with
Tableau.
Expressive Visualization
A bar chart doesn't work for everything. Data visualization tools must be flexible
enough for you to tell any story, and getting to the best visual representation of your
data is rarely a journey along a straight line: as you navigate between different
perspectives, you can find the trends and outliers that matter most.
There are many ways to look at simple sales data, from a crosstab report to a daily
sales graph to sales by segment by month. Different views answer different questions.
Looking at data with a geographic element on a map brings in an entirely new
dimension: notice how in this example it becomes clear that sales are clustered in a
few metro areas.
Tableau offers the ability to create different views of your data and change them as
your needs evolve. Switching between views is as easy as a click, so you don't
have to wait on an IT change request to understand your data.
Interactive Visualization
The cycle of visual analysis is a process of getting data, representing it in one way,
noticing results, and asking follow-on questions. The follow-on questions might lead
to a need to drill down, drill up, filter, bring in new data, or create another view of
your data. Without interactivity, the analyst is left with unanswered questions. With
the right interactivity, the data visualization becomes a natural extension of the
analyst's thought process.
In this dashboard showing crime activity in the District of Columbia, three filters that
act on two different visualizations let you quickly understand trends by day of week
and location. You can filter by crime type, district, or date. You can also highlight
related data in multiple views at once to investigate patterns across time and location.
1.5 MapReduce
MapReduce is a parallel programming technique used for processing very
large amounts of data. Processing such a large amount of data can be done
efficiently only if it is done in a parallel way, with each machine handling a small part
of the data. MapReduce is a programming model that allows the user to concentrate
on writing code. He need not worry about the concerns of parallel
programming, such as how the data will be distributed to different machines or what
will happen on the failure of any machine; all of this is built into the MapReduce
framework.
In MapReduce, the input work is divided into various independent chunks. These
chunks are processed by the mappers in a totally parallel way. The mappers process
the chunks and produce output; this output is then sorted and fed as input to the
reducers. The framework itself handles scheduling of tasks, monitoring of tasks,
and re-executing tasks that have failed. MapReduce allows very high
aggregate bandwidth across the cluster because, typically, the nodes that store and the
nodes that compute are the same: the MapReduce framework and the distributed
file system run on the same set of nodes, which is what provides the high aggregate
bandwidth.
The input to this framework is provided in the form of key/value pairs, and it produces
its output also in the form of key/value pairs. All the coding is done in the form of
two functions, Map and Reduce, so developers must know how to express their
computation in terms of Map and Reduce functions. The output of the Map function is
sorted and fed to the Reduce function. MapReduce is very suitable when a large amount
of data is involved; for small amounts of data it might not be that useful. MapReduce
can be run on commodity hardware reliably and in a fault-tolerant way. Therefore it
can be said that MapReduce is a software framework which allows programs that
involve large amounts of data to be written with ease, and the developer need not
worry about anything else while the data is being processed in parallel.
Before MapReduce, large-scale data processing was difficult because one had to look
after the following factors oneself; now they are all handled by the framework:
I/O scheduling
Fault/crash tolerance
MapReduce provides all of these easily because they are built into the MapReduce
framework. Before MapReduce, one needed complete knowledge of all these factors
and had to write separate code for them, such as how to redistribute the data if a
node fails, or keeping a check on processors. Now, with MapReduce, all of that is
built in, and the user can concentrate purely on his own implementation.
1.5.1 Types of nodes in MapReduce
The JobTracker node manages MapReduce jobs. There is only one of these per
cluster. It receives jobs submitted by clients, schedules the map tasks and
reduce tasks on the appropriate TaskTrackers in a rack-aware manner, and monitors
for any failing tasks that need to be rescheduled on a different TaskTracker.
TaskTracker nodes are what achieve parallelism for map and reduce tasks; there are
many TaskTrackers per cluster. They perform the map and reduce operations.
There are also a NameNode and DataNodes, which are part of the Hadoop file system.
There is only one NameNode per cluster. It manages the filesystem namespace
and metadata, and more expensive, reliable hardware is used for this node. There are
many DataNodes per cluster; these manage blocks of data and serve them to
clients. They periodically report the list of blocks they store to the NameNode.
Inexpensive commodity hardware is used for these nodes.
1.5.2 Execution flow in MapReduce
The first step is that the MapReduce program that has been written tells the JobClient
to run a MapReduce job.
The JobClient sends a message to the JobTracker, which produces a unique ID for the job.
The JobClient copies job resources, such as the jar file containing the Java code that has
been written to implement the map and reduce tasks, to the shared file system, usually
HDFS.
As soon as the resources are in HDFS, the JobClient tells the JobTracker to start
the job.
The JobTracker does its own initialization for the job. It calculates how to split the
data so that it can send each split to a different mapper process to maximize
throughput. It retrieves these input splits from the distributed file system.
The TaskTrackers periodically send heartbeats to the JobTracker; now that the
JobTracker has work for them, it returns a map task or reduce task as the response to
the heartbeat.
The TaskTrackers need to obtain the code to execute, so they fetch it from the shared
file system. (A minimal driver illustrating the submission call that kicks off this flow is
sketched below.)
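As a rough, hedged illustration of the driver code that triggers this flow, the sketch
below uses the org.apache.hadoop.mapreduce Java API. The job name, the paths taken
from the command line, and the ClickMapper/ClickReducer classes (sketched in
Section 1.5.3 below) are assumptions for illustration, not the report's actual code.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class ClickstreamDriver {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        // Hypothetical job name; args[0] and args[1] are the HDFS input and output paths.
        Job job = Job.getInstance(conf, "clickstream analysis");
        job.setJarByClass(ClickstreamDriver.class);

        // The only application-specific pieces the developer supplies:
        job.setMapperClass(ClickMapper.class);     // sketched in Section 1.5.3 below
        job.setReducerClass(ClickReducer.class);   // sketched in Section 1.5.3 below
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);

        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));

        // This single call submits the job (unique ID, copying the jar to HDFS, split
        // calculation, heartbeat-driven task assignment) and waits for completion.
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}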
1.5.3 Input and output to MapReduce
map (k1, v1) -> list(k2, v2)
reduce (k2, list(v2)) -> list(v2)
The first line represents the map function: the map function takes its input in the form
of key/value pairs, represented as <k1, v1>. After processing those key/value pairs, it
produces a new set of key/value pairs. These key/value pairs can be of a different type,
and are therefore represented as <k2, v2>.
This set of key/value pairs, <k2, v2>, is fed to the reducer. The work of the reducer is
to reduce the input data, which it can do according to various factors. For example, it
can combine all the values which had the same key, and then it emits the list of values,
that is, list(v2).
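These signatures correspond directly to the generic type parameters of Hadoop's
Mapper and Reducer classes. Below is a minimal sketch, assuming one Apache-style
log line per input value, of the hypothetical ClickMapper/ClickReducer pair referenced
in the driver above: the mapper emits (page URL, 1) pairs and the reducer sums the
counts per URL. The field position of the URL is an assumption for illustration.

import java.io.IOException;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;

// map (k1, v1) -> list(k2, v2): k1 = byte offset, v1 = log line, k2 = page URL, v2 = 1
public class ClickMapper extends Mapper<LongWritable, Text, Text, IntWritable> {
    private static final IntWritable ONE = new IntWritable(1);
    private final Text url = new Text();

    @Override
    protected void map(LongWritable key, Text value, Context context)
            throws IOException, InterruptedException {
        String[] fields = value.toString().split("\\s+");
        // Assumed layout: in a standard Apache log line the request path is the 7th field
        if (fields.length > 6) {
            url.set(fields[6]);
            context.write(url, ONE);
        }
    }
}

// reduce (k2, list(v2)) -> list(v2): sum all the 1s emitted for each URL
class ClickReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
    @Override
    protected void reduce(Text key, Iterable<IntWritable> values, Context context)
            throws IOException, InterruptedException {
        int sum = 0;
        for (IntWritable v : values) {
            sum += v.get();
        }
        context.write(key, new IntWritable(sum));
    }
}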
Chapter 2
Hadoop
2.1 Hadoop Overview
Hadoop is the Apache Software Foundation's open source, Java-based
implementation of the Map/Reduce framework. Hadoop was created by Doug
Cutting. Hadoop originated from Apache Nutch, an open source web search
engine that was part of the Lucene project. Nutch was an ambitious project started in
2002, and it soon ran into problems, with its creators realizing that the architecture
they had developed would not scale to the billions of web pages on the Internet. But
in 2003, a paper was published that described Google's distributed file system, the
Google File System (GFS). Hadoop was born in February 2006, when they decided to
move NDFS and Nutch into a separate subproject under Lucene. In January 2008,
Hadoop became its own top-level project under Apache, and HDFS, the Hadoop
Distributed File System, was the name kept instead of NDFS. Hadoop provides various
tools which help in processing vast amounts of data using the Map/Reduce
framework and, additionally, implements the Hadoop Distributed File System
(HDFS).
2.2 Hadoop Distributed File System or HDFS
HDFS is a file system designed for storing very large files with streaming
data access patterns. HDFS runs on clusters of commodity hardware. HDFS was
designed keeping in mind the ideas behind Map/Reduce and Hadoop. This implies
that it is capable of handling datasets of much bigger size than conventional file
systems can (even petabytes). These datasets are divided into blocks and stored across a
cluster of machines which run the Map/Reduce or Hadoop jobs. This helps the
Hadoop framework to partition the work in such a way that data access is as local as
possible.
A very important feature of HDFS is its streaming access. Once the data is
generated and loaded onto HDFS, it is assumed that each analysis will read a large
proportion of the dataset, so the time taken to read the whole dataset is more
important than the latency in reading the first record. This has its advantages
and disadvantages: on one hand, HDFS can read big chunks of contiguous data
locations very fast, but on the other hand, a random seek turns out to be so slow that it
is highly advisable to avoid it. Hence, applications for which low-latency access to
data is critical will not perform well with HDFS.
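To make the streaming-access pattern concrete, here is a minimal sketch using the
Java FileSystem API to read a file from HDFS sequentially; the file path is hypothetical
and this is only one of several ways (shell commands, WebHDFS) to access HDFS.

import java.io.BufferedReader;
import java.io.InputStreamReader;
import java.nio.charset.StandardCharsets;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class HdfsStreamingRead {
    public static void main(String[] args) throws Exception {
        // Picks up the cluster settings from core-site.xml / hdfs-site.xml on the classpath
        Configuration conf = new Configuration();
        FileSystem fs = FileSystem.get(conf);

        // Hypothetical path to one day's clickstream log stored in HDFS
        Path logFile = new Path("/data/clickstream/2014-06-01.log");

        // Sequential (streaming) read: large contiguous reads are fast;
        // random seeks are the access pattern HDFS is not designed for.
        try (BufferedReader reader = new BufferedReader(
                new InputStreamReader(fs.open(logFile), StandardCharsets.UTF_8))) {
            long lines = 0;
            while (reader.readLine() != null) {
                lines++;
            }
            System.out.println("Read " + lines + " log lines from " + logFile);
        }
    }
}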
JobTracker
Master node controlling the distribution of a Hadoop (MapReduce) Job across free
nodes on the cluster. It is responsible for scheduling the jobs on the various
TaskTracker nodes. In case of a node-failure, the JobTracker starts the work
scheduled on the failed node on another free node. The simplicity of Map/Reduce
ensures that such restarts are easily achievable.
NameNode
Node controlling the HDFS. It is responsible for serving any component that
needs access to files on the HDFS, and it is the NameNode that is
responsible for ensuring that HDFS is fault tolerant. To achieve this fault
tolerance, each file is replicated over three different nodes, of which two
nodes are on the same rack and one node is on a different rack.
TaskTracker
Node actually running the Hadoop job. It requests work from the JobTracker and
reports back on updates to the work allocated to it. The TaskTracker daemon does
not run the task in its own process, but forks a separate process for each task instance.
This ensures that if the user code is malicious, it does not bring down the
TaskTracker.
DataNode
This node is part of the HDFS and holds the files that are put on the HDFS.
Usually these nodes also work as TaskTrackers; the JobTracker tries to allocate
work to the nodes that already hold the relevant data, so that data access remains local.
Chapter 3
Literature Review
3.1 Clickstream query in event detection
The rise of social media platforms in recent years has brought huge information
streams which require new approaches to analyze the respective data. At the time of
writing, more than 500 million posts are issued every day on social networking sites.
A large part of these information streams comes from private users who
describe everything on social media: how they are feeling, what is
happening around them, or what they are doing currently. The question is how to
leverage the potential of these real-time information streams.
Many people share information about what they are interested in: some are
interested in house fires or on-going baseball games, others in bomb
threats, parties, Broadway premieres, gatherings, traffic jams, conferences and
demonstrations in the area one monitors. Furthermore, independent of the event
type, some people like to pinpoint it on a map, so that others can know about it and
the information becomes more actionable. So, if there is an earthquake in the area one
monitors, one wants to know where it caused what kind of casualties or damage. One
believes that such a system can be useful in very different scenarios. In particular, one
sees the following customer groups and use cases:
Police forces, governmental organizations and fire departments can increase their
situational awareness picture of the area they are responsible for.
Here, the particular nature of Twitter and its adoption by a younger, trendy
crowd suggests applications along the lines of, e.g., a real-time New York City
party finder, to name just one possibility.
The data redistribution technique should be extended so as to find the nodes with high
performance.
A new routing schedule for the shuffle phase should be made so as to define the
scheduler task while the memory management overhead is reduced.
Filtering-join-aggregation
Map-Reduce-Merge
Map-Reduce-Merge adds a merge phase on top of MapReduce to efficiently merge
data that has already been partitioned and sorted (or hashed) by the map and
reduce modules.
MapReduce indexing
Work on MapReduce indexing strategies provided a detailed analysis of four
MapReduce indexing strategies of varying complexity, which were examined for the
deployment of large-scale indexing.
Pipelined MapReduce
Pipelined MapReduce allows data transfer via a pipeline between the
operations; it extends the batched MapReduce programming model,
reduces the completion time of tasks and improves the utilization rate.
A huge amount of data is represented by RDF, but that data is so big that even
applying compression to it is difficult. Therefore, in order to perform efficient
compression and decompression of such big data, MapReduce algorithms are used,
making use of a dictionary encoding technique. Dictionary encoding maintains the
structure of the data. The Semantic Web is an extension of the current World Wide Web
in which the semantics of information can be interpreted by machines. Information is
represented as a set of Resource Description Framework (RDF) statements, where each
statement is made of three different terms: a subject, a predicate, and an object. An
example statement is <http://www.vu.nl> <rdf:type> <dbpedia:University>; this example
states that the concept identified by the uniform resource identifier (URI)
http://www.vu.nl is a university. Figure 2.1 shows the flow diagram for compression
and decompression of RDF.
A merchant can improve his service in those areas from which he gets more visitors,
and target other areas as well. The method used behind this query is that it matches the
geocoded IP address of all places one by one, and the places that fall under the
same geocoded IP address are given as the closest locations. This is done by
matching the geocoded code with others in the database, so the places whose
geocoded code matches the given place are displayed. It is obvious that places which
fall under the same code belong to the same region and are near to each other.
For example, suppose a businessman wants to find the visitors viewing the products
on his website. The merchant wants to know where each visitor found the path to his
website, how much time the visitor spent on the website, and on which page the visitor
spent the most time. What this query does is match the geocoded code from the log
files of the website against the database, so that the merchant can know the area of the
visitor and promote accordingly. These figures show the current location of a person as
an encircled dot and the locations of various other places as simple dots. Figure 2.2
shows an encircled dot (the current location of the visitor) surrounded by various
other dots (different places), which can be compared to a real-life scenario. The
circles depict the different IP addresses of different areas.
This shows where the site got more users and how sales can be improved in the other
areas. A pie chart is placed over an area defined on the map, which has clear
boundaries. In each state there are a lot of different cities with different postal codes,
so Figure 2.3 shows the sales in a particular state broken down by postal code. Through
this kind of analytics any merchant can grow his business easily by finding such areas.
In the polygon method, only a polygon has to be created, for example a circle around
one point; then queries can be run based on all the points that fall inside the new
polygon.
This is the method adopted by most sites: they simply display all the locations that
fall within the specified boundary, and focus on the areas which are more profitable
than others.
Chapter 3
Problem Statement
3.1 Proposed objective
This project will find visitors' locations and the paths by which they arrive at the
website. Those locations are the points of interest. A location can be of
any category, such as a school, library or hotel. The main concern here is to derive a
more accurate and efficient solution to this problem. It is desired that one could query
unbounded data, meaning there is no limitation on the range of the data. There are
mainly two types of queries: one that solves the bounded-box problem, which involves
limited data, and the other the unbounded or open-box query, which covers a wide
range of data.
So, as this algorithm finds the nearest location, there will be no limitation such as being
able to find it only within a range of 10 km or anything like that. Anyone would be able
to find the nearest location over a whole country or more. So we provide here a
solution to that problem using MapReduce. In MapReduce, execution takes place in
parallel, so it also takes less time to process a large database. MapReduce is mainly
useful in those cases where a large amount of data is involved, so it suits our work well.
Many methods already exist to find the nearest location, but all of them
have a few or more flaws which make them not well suited to our application. These
various drawbacks are described below with respect to the existing methods.
The solutions examined in this report tackle a simpler weblog analysis task: using the
remote IP address and timestamp collected with each weblog entry to measure the
amount of traffic coming to the website by country of origin, on an hour-by-hour basis,
during an average day. The remote IP address is the first component of the standard
Apache weblog, and the time may be extracted from the timestamp, which is the second
component of most weblogs. Our solutions need to extract these items
and look the IP address up in a table mapping IP addresses to host countries (for
simplicity we look at only the first two octets of the IP address and look them up
in a table listing all the two-octet, or Class B, addresses that are used solely by a single
country).
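A hedged sketch of how the map side of such a solution might look is given below: it
pulls the remote IP address and the hour out of each Apache log line and emits a
(country, hour) key with a count of 1, which a summing reducer (like the one in
Section 1.5.3) can then aggregate into traffic per country per hour. The classBToCountry
lookup shown inline is a stand-in for the Class B address table described above; in
practice it would be loaded from a file, and the single entry shown is only an
illustrative assumption.

import java.io.IOException;
import java.util.HashMap;
import java.util.Map;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

public class CountryHourMapper extends Mapper<LongWritable, Text, Text, IntWritable> {
    private static final IntWritable ONE = new IntWritable(1);

    // Stand-in for the table of two-octet (Class B) prefixes used solely by one country.
    private final Map<String, String> classBToCountry = new HashMap<>();

    @Override
    protected void setup(Context context) {
        classBToCountry.put("4.30", "US");   // illustrative entry only
    }

    @Override
    protected void map(LongWritable key, Text value, Context context)
            throws IOException, InterruptedException {
        // Standard Apache log line: remote-IP - - [dd/Mon/yyyy:HH:mm:ss zone] "request" ...
        String[] fields = value.toString().split("\\s+");
        if (fields.length < 4) {
            return;                           // malformed line, skip it
        }
        String[] octets = fields[0].split("\\.");
        if (octets.length != 4) {
            return;
        }
        String country = classBToCountry.get(octets[0] + "." + octets[1]);
        if (country == null) {
            country = "OTHER";
        }
        // The timestamp field looks like "[01/Jun/2014:13:05:21"; the hour follows the first ':'
        String ts = fields[3];
        int colon = ts.indexOf(':');
        if (colon < 0 || colon + 3 > ts.length()) {
            return;
        }
        String hour = ts.substring(colon + 1, colon + 3);
        context.write(new Text(country + "\t" + hour), ONE);
    }
}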
The data used in these tests was generated by a Hadoop program. This program
produces realistic sequential Apache web logs for a specified month, day, year and
number of clicks per day from unstructured data. The remote hosts
are distributed geographically among the top 20 Internet-using countries (Table 1) and
temporally so that each region is most active during its local evening hours
(simulating a consumer or social web site), as shown in Table 2. The web site is
assumed to be in the Central US time zone, and each of the countries is assigned a
single offset from that for simplicity.
Note that although the log format used by the Apache web server was used in these
tests, the algorithms used in these solutions can easily be adapted to other formats.
Table 1: Distribution of remote hosts among the top Internet-using countries
(e.g., China 31%, US 13%), covering China, US, India, Brazil, Japan, Germany,
Russia, France, UK, Iran, Korea, Mexico, Nigeria, Turkey, Italy, Philippines,
Pakistan and Vietnam.
Table 2: Percentage of site traffic by hour of the day (00-23), peaking during each
region's local evening hours.
Existing approaches have not been combined with MapReduce yet and, moreover, they
do not solve the problem in an accurate way. So the proposed solution will be much
more efficient with respect to both time and accuracy.
Chapter 4
Implementation
4.1 Overview
The solution to this problem is provided using Hadoop. Hadoop (MapReduce)
executes in parallel, so our large data will all be processed in parallel. MapReduce
needs two functions to be written, namely Map and Reduce, so all the work is
carried out in these two functions only. The Map function takes its input in the form
of <key, value> pairs and produces new <key, value> pairs as output.
This output then goes to the reducer, which processes the data and gives the final output.
Clickstream Data
Clickstreams, also known as clickpaths, are the routes that visitors choose when
clicking or navigating through a website.
A clickstream is a list of all the pages viewed by a visitor, presented in the order the
pages are viewed; it can also be defined as the succession of mouse clicks that each
visitor makes. A clickstream will show you when and where a person came into a
website, all the pages they viewed, the time spent on each page, and when and where
they left.
Taken all together, as aggregated statistics, clickstream information will tell you, on
average, how long people spend on your site and how often they return. It will also
tell which pages are most frequently viewed.
An interactive clickstream is a graphic representation of a clickstream: a list of pages
seen in the order in which they were visited. The graphic allows you to click on the
pages and see what the visitor saw, hence the label interactive.
The most obvious reason for examining clickstreams is to extract specific
information about what people are doing on your site. Examining individual
clickstreams will give you the information you need to make content-related decisions
without guessing.
There is a wealth of information to be analyzed; you can examine visitor clickstreams
in conjunction with any of the information provided by a good stats program: visit
durations, search terms, ISPs, countries, browsers, etc. The process will give you
insight into what your visitors are thinking.
The raw data file appears in the File Browser, and you can see that it contains
information such as URL, timestamp, IP address, geocoded IP address, and user ID
(SWID).
The Acme log dataset contains about 4 million rows of data, which represents five
days of clickstream data. Often, organizations will process weeks, months, or even
years of data.
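As a hedged sketch of how a visitor's click path could be reconstructed from a file with
these fields: assuming a mapper keyed on the SWID (user ID) emits values of the form
"timestamp<TAB>url", the reducer below sorts each user's clicks by time and writes out
the ordered path. The tab-separated layout and lexicographically sortable timestamps
are assumptions for illustration, not a description of the actual scripts used.

import java.io.IOException;
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Reducer;

// Input key: SWID (visitor ID); input values: "timestamp<TAB>url" strings from the mapper.
// Output: one record per visitor containing the pages in the order they were viewed.
public class ClickPathReducer extends Reducer<Text, Text, Text, Text> {
    @Override
    protected void reduce(Text swid, Iterable<Text> values, Context context)
            throws IOException, InterruptedException {
        List<String> clicks = new ArrayList<>();
        for (Text v : values) {
            clicks.add(v.toString());
        }
        // Assumes timestamps sort lexicographically (e.g. an ISO-8601-style format)
        Collections.sort(clicks);

        StringBuilder path = new StringBuilder();
        for (String click : clicks) {
            String[] parts = click.split("\t", 2);
            if (parts.length == 2) {
                if (path.length() > 0) {
                    path.append(" -> ");
                }
                path.append(parts[1]);        // the URL, appended in visit order
            }
        }
        context.write(swid, new Text(path.toString()));
    }
}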
The data is in unstructured form, and it can be given structure by the MapReduce
method using Hive commands.
Now look at the users table using HCatalog. In HCatalog we browse the data of the
users who visit the site through different paths.
Figure 4.4 shows the information about the users who visit the site.
We can also use HCatalog to view the data in the products table, which maps product
categories to website URLs; after that we show which category each user has chosen
and how much time has been spent on which category.
Figure 4.8 shows the Hive script used to join the Acme log, CRM and CMS data.
You can view the data generated by the webloganalytics script in Hive as described in
the preceding steps.
Chapter 5
Results and Analysis
Figure 5.1 shows the pre-MapReduce processing. HDFS is used for the
implementation of MapReduce. So once Hadoop is installed, it is started by using
the command start-all.sh; after this command Hadoop is started on Linux. This
is confirmed by using the command jps, a tool which checks the processing status
of the JVM. This command shows whether all the nodes are up and working; here
it shows all the nodes to be working.
Figure 6.2 shows the first phase after the MapReduce job has started. At the start, both
mapping and reduction are at 0 percent. It can be seen that they run
concurrently: as data is mapped by the mapper, the reducer starts to
reduce it. That is, the unstructured big data which has been
converted by the mapper into the corresponding structured data is accepted by the
reducer as it becomes available.
The map view displays a global view of the data. Now let's take a look at a count of
IP addresses by state. First, drag the ip field into the SIZE box.
Figure 6.5 shows the country and state that visitors belong to, or from where they
access the website and search for something. In this map the visitor is from the USA
(New York) and the IP address is 4.30.96.133.
Chapter 6
Conclusion and Future Scope
6.1 Conclusion
The amount of data on social websites, e-commerce sites and many other sites is
growing rapidly. With the advancement of web technology and the availability of raw
data infrastructure, the demand for analysis of clickstream information on the web has
increased significantly. Clickstream analytics plays an important role in a
wide variety of applications such as decision support systems, profile-based
marketing, and learning about visitors and the paths by which they arrive. A lot of
frameworks and techniques exist to handle clickstream information. In this report a
method to tackle the behaviour and location problem of the visitor has been derived.
There are already existing methods to find the path, but they were somehow not
accurate and efficient enough for this problem. So a more refined method has been
presented which uses the MapReduce method to refine the raw data to find the
locations of the visitors.
The size of the data involved becomes very big, and it would be inefficient to access
that data sequentially. So MapReduce has been used to process the data in parallel. It
improves the solution in terms of time complexity to a great extent because the data
is being processed in parallel. A variety of clickstream data can be efficiently modelled
using MapReduce. MapReduce is a programming model that lets developers focus on
writing the code that processes their data without having to worry about the details of
parallel execution. A MapReduce job usually splits the input data set into independent
chunks which are processed by the map tasks in a completely parallel manner. The
framework sorts the outputs of the maps, which are then input to the reduce tasks.
Therefore MapReduce is very well suited to complex clickstream data.
This would be of great help to various organizations, for example a telecom or
e-commerce company that wants to find the behaviour and locations of visitors around
which it can start or improve its services, and similarly for other organizations.