Ieee
Abstract: The number of online purchases is increasing constantly. Companies have recognized the related opportunities and are increasingly using online channels. In order to acquire potential customers, companies often try to gain a better understanding of them through the use of web analytics. One of the most useful sources is web log files. These provide an abundance of important information about user behavior on a website, such as the path or access time. Mining this so-called clickstream data in the most comprehensive way has become an important task in order to predict the behavior of online customers, optimize webpages, and give personalized recommendations. As the number of customers constantly rises, the volume of generated log files also increases, both in terms of size and quantity. Thus, for certain companies, the currently used technologies are no longer sufficient. In this work, a comprehensive workflow is proposed that uses a clustering algorithm in a Hadoop ecosystem to investigate user interest patterns. The complete workflow is demonstrated on an application scenario of one of the largest business-to-business (B2B) electronic commerce websites in Germany. Furthermore, an experimental evaluation method is applied to verify the applicability and efficiency of the algorithm used, along with the associated framework.

Keywords: big data, clustering algorithm, clickstream data, Hadoop ecosystem

I. INTRODUCTION

In times of mobile devices, all-encompassing networking, and the continuous shift of business onto the Internet, the number of online transactions is steadily increasing. Studies such as the PWC 2016 report [1], E. Eichmann [2], and C. Annicelli et al. [3] highlight this change. In 2019, for instance, it is expected that every third person on Earth will make at least one purchase per year over the Internet, compared to around every fourth person in 2014 [3]. Companies have recognized the associated opportunities and are increasingly using online channels for their sales. The increased competition leads them to look for innovative solutions with which they might attract new customers. Since the main objective should be to stand out from the competition, it is important to recognize the interests and needs of potential customers in order to provide specific recommendations through an attractive webpage. In doing so, however, it is often a nontrivial undertaking to obtain detailed information about potential customers, especially when no purchases have been made yet.

In order to create an advantage over competitors, to acquire potential customers, and to position themselves in the market, firms often carry out web analyses. In particular, when it comes to investigating the behavior of an internet user, clickstream data has been established as a suitable source of information [4–6]. The data itself contains a large number of fields that can provide very detailed information about the user of a website. Thus, companies have used clickstream data for a variety of purposes so far. This supports them in designing their websites to be as attractive and intuitive as possible, and also enables them to provide direct recommendations of certain products to acquire customers [4]. However, the volume of this clickstream data continues to grow as a result of the increasing interest in online shopping and the spread of mobile devices [1]. Particularly when it comes to the efficient analysis and evaluation of such large amounts of data, current technologies frequently reach their limits.

In the long term, two aspects should be considered in this context: the specific data itself and the general method of analysis. Therefore, a specific methodology for evaluating these data is crucial, both in terms of the overarching procedure and the algorithm used. This would ensure that only data that promises a certain added value is analyzed. In addition, the use of big data technologies appears to be useful in this context of very large amounts of differently structured data, as already confirmed by different cases with similar requirements [7]. This leads to the main research question: “How can massive user-generated clickstream data from e-commerce pages be examined in order to identify user-specific interest patterns?” Starting from this, various questions could be derived, which were also investigated in the context of this study:

• RQ1: Which information from the recorded clickstream data is essential and should be mined?
• RQ2: What is the most suitable algorithm to reveal interest patterns of a user?
• RQ3: How can the algorithm be applied to big data technologies in order to examine massive amounts of data without losing performance and throughput?

The main objective of this work was to implement an analysis method that examines massive amounts of user-generated clickstream data, using big data technologies, to reveal interest patterns. The expected added value arises not only from possible recommendations, but also from certain optimizations. This includes long-term strategic orientations
Su et al. [15] proposed an approach that combines three indicators to analyze the browsing behavior of website users. Specifically, a leader clustering approach was used which examines clickstream data with respect to the visiting frequency, browsing sequence, and browsing duration on an e-commerce website. Contrary to some other contributions, this approach is not limited to a single category and recognizes other categories as well. Furthermore, it can be noted that a few of the investigated contributions were also referenced in that paper, such as [13, 16–18]. This may be related to the fact that very little research in this specific field has been carried out so far. However, this fact reveals that the main algorithm itself was influenced by other contributions. Hence, it has been assumed that this approach would be the most appropriate algorithm for further application. In order to give an overview of the investigated contributions and examine the suitability of this very comprehensive approach, a comparison between the collected investigations was carried out. For this purpose, the indicators as described in Su et al. [15] were identified and used as the basic comparative features. The results of this classification are depicted in Table I.

TABLE I. CLICKSTREAM ANALYSIS LITERATURE

Approach | Clickstream Indicators (Path, Frequency, Duration, Across Cat.) | Ref.
Bayesian Methods | x x x | [19]
Collaborative filtering | x x x | [18]
Fuzzy clustering | x x | [14]
Stochastic regression | x x x | [16]
Fuzzy leader clustering | x x x | [20]
Model-based biclustering | x x | [12]
Cross-sectional approach | x x | [21]
Graph Clustering | x x x | [22]
Decision Tree | x x x | [23]
Fuzzy Co-Clustering | x x | [13]
Logit modelling | x x x | [24]
Association rule mining | x x | [17]
K-Means clustering | x x | [25]
Leader Clustering | x x x x | [15]

C. The leader clustering algorithm

As one can easily notice, only a few contributions recognized all of the presented indicators. In most cases, not all were considered equally. The same applies to the across category indicator. For this reason, the leader clustering algorithm by Su et al. [15] was used.

Essentially, this algorithm allocates each user to a cluster based on the various indicators and the calculated similarity. The first entry of the dataset always takes the role of the leader of a new cluster. After this, each object, starting from the second entry in the dataset, may either be assigned to the most similar cluster (according to the chosen leader) or act as the new leader of a cluster. It should be noted that this assignment depends on two different values which need to be determined experimentally beforehand. The general threshold γ defines when a user becomes the leader of a new cluster. This is the case when the largest similarity between the user and the existing leaders does not exceed γ (see Fig. 3). If a user could be assigned to several clusters, the second value, the rough threshold β, determines which clusters the user is assigned to. In addition to these two values, the weighting of the individual indicators can also be set. The pseudocode of the leader clustering algorithm described above is shown in Fig. 3. More precise information and formulas concerning the calculation of similarity and the algorithm itself can be found in the contribution by Su et al. [15].

    INPUT
        S: Clickstream dataset
        γ: general threshold
        β: rough threshold
    OUTPUT
        leaders: array of all leaders
        clusters: array of all clusters
    STEP:
        Randomly choose a user from dataset as a leader of the first cluster
        FOR EACH user_i IN dataset DO
            FOR EACH l_j IN leaders DO
                calculate the similarity sim_ij between l_j and user_i
            Sort leaders by corresponding similarities
            max_sim = largest value of similarity
            IF max_sim ≤ γ:
                Create a new cluster led by user_i
            ELSE:
                FOR EACH l_j IN sorted leaders DO
                    IF (sim_ij / max_sim) ≥ β:
                        add user_i into the cluster led by l_j
        RETURN clusters, leaders

Figure 3. The leader clustering algorithm [15]

D. Workflow for implementing data analyses

Although the intended algorithm builds on multiple research papers, the majority of the mentioned contributions did not provide specific implementation details. These are very important, especially when currently established technologies reach their limits and fail while preprocessing and analyzing massive amounts of data. At this point, big data technologies are increasingly used, especially when large amounts of differently structured data must be analyzed very quickly, as shown in numerous use cases [7]. However, it should be noted that the implementation of these projects is often a nontrivial undertaking. Usually, as in the case of conventional data analysis, several steps are necessary in advance. These steps also include the collection, cleansing, and transformation of the data [26]. Depending on the respective phase, additional technological considerations are necessary, taking into account the demands and the associated characteristics of the data. This is also highlighted by
different real-world applications and descriptions in various contributions that realize such projects [27–39]. Presumably, this is due to the application specificity and the diversity of the individual technologies.

However, this does not apply to the basic processing platform. In most of the cases, Hadoop or a complete package, such as Hortonworks, Cloudera, or MapR, has been used. A consensus also exists regarding the superordinate process itself. The KDD process was applied implicitly in most of the cases. This process is generally based on the work of Fayyad et al. [40]. It can be used to gain new insights and knowledge through the analysis of different datasets [4, 40]. In this multi-step process, the analysis of the data is gradually approached by the previously described steps: selection, preprocessing, data mining, and interpretation [40]. The main analysis is achieved in most cases by data mining methods and the subsequent interpretation of the results.

IV. DESIGN AND DEVELOPMENT

As described in the second chapter of this work, the design and development process represents the core of the design science methodology. The result is often described as an artifact and may vary in its form [8]. Depending on the main objective, it can be presented, for instance, as a model, software, or a process. In the following, a general procedure is presented based on the previously described findings. Following this, a real-world application is demonstrated and subsequently evaluated in the next chapter.

A. The proposed approach

At this point, a suitable algorithm has been identified to derive user interest patterns from clickstream data [15], but not how it could be implemented, especially in terms of big data technologies. For this reason, an appropriate approach was developed using the previously discussed findings, as depicted in Fig. 4. It illustrates, at a high level, the basic procedure for determining user interest patterns using big data technologies. As already mentioned, the process is based on the individual steps of KDD and the necessary preconditions of the algorithm itself.

The first step involves the collection and storage of the user-generated log entries which act as the input clickstream data. Within the following steps, the data is transferred to the preprocessing and transformation layer. The import tool used depends on the source of the data itself as well as on the framework for the subsequent preparation of the data. Because preprocessing, cleansing, and transformation take place at this stage, the framework should, in turn, be selected according to the characteristics of the data. In doing so, consideration should be given, for instance, to the structure, quantity, and speed with which the data will be processed. As is the case for many big data projects, Hadoop could be used as a starting point.

For the analysis, according to the necessary data mining method, the leader clustering algorithm is applied. Depending on the usability of the results, this can be adjusted by changing the major constants, such as the weighting or threshold values. The knowledge about the interest patterns of the users themselves is derived through the interpretation of the data obtained by the analysis. After that, each user is left to decide whether to use the results for the optimization of the web pages, for the enhancement of decision support, or as a starting point for further research. The latter is the case, for instance, if further observations are carried out by additional methods.

Figure 4. The proposed workflow

However, when it comes to the technical implementation, it should be noted that certain requirements and conditions may change. In this case, the practitioner of this workflow must react as soon as possible, for instance, to structural changes of the data in the preprocessing stage.

B. Demonstration of the artefact

According to the chosen methodology, a demonstration of the developed model is now given. More specifically, the clickstream data of one of the largest European B2B trade companies was investigated in one batch using the proposed workflow and big data technologies. In doing so, the data of a complete financial year was used as a starting point. This includes a total of 1366 million entries of clickstream data distributed over 4672 log files. At the technical level, a cluster consisting of three servers, each with six 2.60 GHz cores, 128 GB of RAM, and 8 terabytes of hard drive capacity, was used to implement this project. Corresponding to the respective levels of the proposed model, appropriate big data technologies had to be identified. As a core, the Hadoop ecosystem has been used. The exact specifications, according to the individual steps in the workflow and including the underlying hardware, can be found in Table II. Furthermore, the applied workflow in conjunction with the technologies used is depicted in Fig. 5.

1) Storage, selection and ingestion

Before the actual implementation and application of the algorithm took place, the data had to be selected first and then transferred. According to the previous description, the log files of the web servers were selected as a suitable source of the clickstream data. The transmission of this data can be carried out in different ways. Specifically, after a preliminary exploration, four reliable tools were identified. Sqoop, Flume, Kafka, and Pentaho Data Integration (PDI), which works with the Secure File Transfer Protocol (SFTP), turned out to be suitable tools. The latter
was implemented to transfer the data from the weblog server to Hadoop HDFS. Due to the given semi-structured data and the nature of the batch processing, this seemed to be the best option.

TABLE II. APPLICATION FRAMEWORK BASED ON THE KDD STEPS

Level | Specifications
Hardware | One cluster with three servers, each with 6 cores at 2.60 GHz, 128 GB RAM, 8 TB HD
Platform | Hadoop ecosystem 2.6.0 and HDFS with Cloudera cdh5.4.2
Selection | Pentaho PDI 5.4.0
Preprocessing & Transformation | Regular expressions, MapReduce Jobs and Hive 1.1.0 on Spark
Analysis | Hive HQL, Pig Scripts, Impala
Interpretation | Hue GUI, Solr and Microsoft Excel

2) Preprocessing

After the data was successfully transferred, some preparatory measures had to be taken before the actual analysis could be carried out. In addition to the processing of the raw data itself, the relevant indicators for the algorithm had to be identified and extracted. First of all, an initial cleansing of the data was performed to remove duplicates and filter out invalid and incomplete entries. Due to the complexity of the task and the amount of data, this was realized through the use of MapReduce. For instance, in the case of incomplete entries, attention was paid to fields such as the session ID, Uniform Resource Locator (URL), or the time, which are important as input for the algorithm [15]. At the same time, the data was converted into the tab-separated value (TSV) format and divided into equal-sized blocks to allow post-examination using Hive and Pig. After the incorrect entries were removed, the relevant data was extracted. At this point, it had to be ensured that all remaining entries were successful entries from outside of the company network. More specifically, every Hypertext Transfer Protocol (HTTP) request was checked for a successful status code (200) and an external IP address. As a result, all irrelevant entries were removed, that is, those without the targeted status code, with an internal IP address, or with an irrelevant Uniform Resource Identifier (URI) file type extension, such as JavaScript (.js). In total, 94% of all clickstream data was removed.

3) Transformation

For the determination of the individual indicators, all remaining data had to be transformed into the desired format. The timestamp values were recorded in the format 'YYYY-MM-DD HH:MM:SS'. For the later calculation of the relative time duration, the timestamp of each record was transformed into its corresponding UNIX timestamp. For the identification of the respective products and categories, the associated assignment was loaded from the data warehouse and compared by means of a Hive user defined function (Hive UDF).

Different fields exist to identify users within the clickstream data, such as the IP address or the session ID. Because of the problems that arise when multiple users share a common IP address, the session ID was used, as recommended by [15]. Accordingly, during the sessionization, the set of clicks belonging to a certain session ID needs to be grouped and assigned to a specific user. This transformation step was also implemented using a Hive UDF.

C. Data analysis

Subsequently, the individual indicators had to be defined as input parameters for the data analysis. The duration was calculated by subtracting the page request times of two adjacent pages belonging to the same session ID. In doing so, it was found that the number of sessions decreases as the time interval increases.

This is not to be considered a negative finding. However, due to the heterogeneity of the sequence in terms of the viewed products and categories, it could be assumed that the initial user intention had changed within the sequence. This applies specifically to very long visiting paths. Based on this finding, a suitable temporal threshold value had to be determined up to which the called page is added to the
current sequence. After numerous experimental tests, the limit of 240 seconds per page was set. Therefore, the end of each sequence was marked by exceeding the set time limit. Thus, the sequence was implicitly determined by the temporal delineation. The frequency with which the user visited the individual product and category pages within his or her sequence was determined by analyzing the individual transformed clickstream entries.

Additionally, it should be mentioned that some sessions that could distort the results were omitted. Thus, only sequences with more than one entry and fewer than one thousand entries were considered. In the case of the latter, we assumed that these were spam bots.

Due to the experimental investigation of the duration and the dependency of the derived sequence, we slightly shifted the focus to the frequency. Hence, the weighting value of the frequency variable was set to 0.4, whereas the values of the path and time indicators were both set to 0.3.

Similar to the previously described values, the remaining thresholds were also determined by experimental tests. After first evaluations, we found that a general threshold with a value higher than 0.25 generates many very small clusters. This partly leads to very misleading results and interest patterns in some of the clusters due to their specific scope. For the rough threshold, which is responsible for the overlapping of clusters, we set the value to 1 and decreased it slowly to see the differences. We found that the number of overlapping clusters drastically increased when the value went below 0.95. This rapid change in overlaps is comprehensible since the analyzed dataset comes from a B2B e-commerce website. Hence, many cluster leaders share a similar navigation approach, especially if they are employed by the same company.

After all targeted data was cleaned and transformed, and the important constants were determined, the algorithm could be executed. For the implementation and execution, Impala and Hive were used.

D. Interpretation

Overall, 741 clusters were generated. A closer examination of the results showed a high fluctuation of user requests in almost all clusters. Even within a less extensive cluster, many different products and categories were queried, as shown in the taxonomy of Fig. 6.

the website search engine. In both cases, the user spends a long time finding what he or she is looking for. From this information, the categorization of products and the navigation structure could be reengineered to enhance the user experience. Furthermore, recommendations could be given, such as binders on webpages where printers are listed and vice versa.

V. EVALUATION

To observe and measure the applicability and efficiency of the developed artifact, it is necessary to verify and validate the proposed solution [9]. For this reason, an experimental evaluation [8] was carried out in two parts using different datasets with the already implemented technical framework of the real-world use case. On the one hand, the developed workflow from Fig. 4 itself was evaluated and, on the other hand, the usability of the results obtained from it. Initially, the results from the demonstration were compared to those of another record. In contrast to the previous dataset, the latter contained only data from the following month. This represents roughly one twelfth of the amount contained in the complete financial year. This illustrates that even smaller datasets can be examined exclusively, as is already viable with conventional technologies. However, this can lead to stability problems, caused, for instance, by statistical outliers. For this reason, only large amounts of data should be analyzed or aggregated. This can be observed on one of the 741 clusters in Fig. 7. The first record contains log entries of more than one year and the second only of one month.
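The leader clustering procedure of Fig. 3, together with the weights (0.4 for frequency, 0.3 each for path and time) and thresholds (γ = 0.25, β = 0.95) reported in the data analysis step, can be illustrated with a minimal Python sketch. It follows the textual description in taking the first entry as the first leader. The similarity function is only a stand-in: the actual formulas are given in Su et al. [15] and are not reproduced here, so the Jaccard and cosine measures, the record layout, and all names below are assumptions made for illustration. The sketch also runs on a single machine, whereas the use case itself was executed with Impala and Hive.

```python
import math
from typing import Any, Callable, Dict, List, Sequence, Tuple

# Hypothetical user record: {"path": set, "frequency": dict, "duration": dict}
User = Dict[str, Any]

# Indicator weights as reported in the data analysis step.
WEIGHTS = {"path": 0.3, "frequency": 0.4, "duration": 0.3}


def jaccard(a: set, b: set) -> float:
    """Placeholder path similarity: overlap of visited categories."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def cosine(x: Dict[str, float], y: Dict[str, float]) -> float:
    """Placeholder similarity for per-category counts or durations."""
    dot = sum(x.get(k, 0.0) * y.get(k, 0.0) for k in set(x) | set(y))
    nx = math.sqrt(sum(v * v for v in x.values()))
    ny = math.sqrt(sum(v * v for v in y.values()))
    return dot / (nx * ny) if nx and ny else 0.0


def similarity(u: User, v: User) -> float:
    """Weighted combination of the three indicators (assumed form, not [15])."""
    return (WEIGHTS["path"] * jaccard(u["path"], v["path"])
            + WEIGHTS["frequency"] * cosine(u["frequency"], v["frequency"])
            + WEIGHTS["duration"] * cosine(u["duration"], v["duration"]))


def leader_clustering(
    users: Sequence[User],
    sim: Callable[[User, User], float],
    gamma: float,
    beta: float,
) -> Tuple[List[User], List[List[User]]]:
    """Assign users to (possibly overlapping) clusters as in Fig. 3."""
    if gamma < 0:
        raise ValueError("gamma must be non-negative")
    if not users:
        return [], []
    leaders = [users[0]]        # the first entry leads the first cluster
    clusters = [[users[0]]]
    for user in users[1:]:
        sims = [sim(leader, user) for leader in leaders]
        max_sim = max(sims)
        if max_sim <= gamma:
            # Not similar enough to any leader: open a new cluster.
            leaders.append(user)
            clusters.append([user])
        else:
            # Join every cluster whose relative similarity reaches beta.
            for j, s in enumerate(sims):
                if s / max_sim >= beta:
                    clusters[j].append(user)
    return leaders, clusters


if __name__ == "__main__":
    a = {"path": {"lighting", "cables"},
         "frequency": {"lighting": 3, "cables": 1},
         "duration": {"lighting": 120.0, "cables": 30.0}}
    b = dict(a)  # same browsing behavior as a
    c = {"path": {"grills"}, "frequency": {"grills": 2},
         "duration": {"grills": 60.0}}
    leaders, clusters = leader_clustering([a, b, c], similarity, 0.25, 0.95)
    print(len(leaders), [len(members) for members in clusters])
```

Because the ratio test compares each similarity with the largest one, a user always joins its most similar cluster, and lowering β lets it join further clusters, which is consistent with the increase in overlaps observed below 0.95.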
TABLE III. DERIVED CATEGORY CONJUNCTIONS

First Category | Next Category | Clicks
Switching Program | Cable Accessories | 91
Cable Accessories | Switching Program | 90
Outdoor Wall Lights | Switching Program | 75
Cooking & Grill Technology | Switching Program | 70
Switching Program | Ceiling-Mounted Lights | 69
Switching Program | LED Bulbs | 68
Bulbs - Fluorescent Bases | Switching Program | 65
LED Bulbs | Switching Program | 62
Ceiling-Mounted Lights | Switching Program | 62
Switching Program | Outdoor Wall Lights | 62

due to the low amount of clickstream data, there is no clear preference or ranking, for example, of further product recommendations. This is not the case when much larger records are used, as found in the first dataset.

As initially assumed, the reliability and efficiency of the model were demonstrated by comparing two datasets of different length. It could also be shown that in such analyses, primarily very extensive datasets should be examined or at least considered when it comes to real-time evaluations. In the future, further evaluations using different use cases could be carried out to obtain knowledge about technical implementations. Thus, derivations of possible technology recommendations are also conceivable, in general as well as for the individual phases. Furthermore, general methods could be established in order to determine the threshold values and the weightings of the individual indicators.

VI. DISCUSSION

Within the scope of this work, a process model was developed and demonstrated, based on the known sequence of KDD (cf. Fig. 4). At its core, it allows the application of the leader clustering algorithm using big data technologies. Despite the developed model and the successful demonstration, it has to be observed that strict adherence is not an absolute guarantee for a successful implementation. Especially in terms of huge amounts of clickstream data, there are certain specifics which have to be considered. For instance, the complexity of the preprocessing and transformation depends on how the targeted clickstream data is structured, as seen in the steps of the demonstration. Furthermore, the right choice of technologies is crucial. As the pure implementation of the projects has shown, there are still significant differences and uncertainties, particularly regarding the choice of the right technology. In the future, further investigation could be carried out to find the point at which the use of big data technologies, and thus the application of the proposed solution, seems reasonable, as described in [41]. Additionally, a recommendation of specific technologies depending on the respective stage, of both the original model and the developed solution (Fig. 4), appears promising. Regarding this, it is also possible to evaluate technologies that allow, for example, an even higher degree of automation and real-time analysis, as planned by the described company (see Fig. 5). In the use case itself, an implementation was carried out as an example on the basis of batch processing. Particularly when it comes to possible recommendations that are given not only during a further visit or later by means of a personalized e-mail, such time-critical evaluations are essential. A starting point here would be, for instance, the use of Flume, which is a reliable tool for real-time ingestion. It should be mentioned, however, that additional investigations would be necessary to obtain such precise technological specifications. The sole examination of a single case study is not sufficient. In addition to these future developments in the technological sense, further optimizations of the proposed approach are imaginable. Thus, extending the algorithm appears to be sensible by considering additional data, such as from ratings, social networks, customer communications, or created wish lists. A diversification of such additional sources is also shown in the technical implementation of the demonstrated use case in Fig. 5. In this way, particularly extensive interest patterns can be derived which are composed of both the derived behavior and the personally expressed opinion of the user. Furthermore, in the sense of the extension, procedures also appear conceivable which present a generalized sequence to specify both the individual indicator weightings and the threshold values.

VII. CONCLUSIONS

In this work, a new workflow has been provided that illustrates, at a high level, the basic procedure for determining user interest patterns using big data technologies. The overall goal was to identify user-specific interest patterns from massive amounts of user-generated clickstream data. A structured literature review was carried out to find a suitable solution. Using this, it was found that many approaches currently exist to analyze clickstream data. The approach by Su et al. [15] has proved to be particularly extensive. For this reason, it was applied in the further course of this work. Since no common method of possible implementation and application, especially in the context of big data technologies, has been found, the widely used KDD process [40] was used. Therefore, a process has been developed that implements the leader clustering algorithm using big data technologies. This was demonstrated in a real-world case and evaluated using various datasets.

REFERENCES

[1] PWC, They say they want a revolution: Total Retail 2016. [Online] Available: http://www.pwc.com/gx/en/retail-consumer/publications/assets/total-retail-global-report.pdf. Accessed on: Jan. 19 2017.
[2] E. Eichmann, eCommerce Industry Outlook 2016. [Online] Available: http://www.criteo.com/de/resources/criteo-ecommerce-industry-outlook-2016/.
[3] C. Annicelli et al., Worldwide retail ecommerce sales: eMarketer's updated estimates and forecast through 2019. [Online] Available: http://www.emarketer.com/public_media/docs/eMarketer_eTailWest2016_Worldwide_ECommerce_Report.pdf.
[4] J. Lee, M. Podlaseck, E. Schonberg, and R. Hoch, “Visualization and Analysis of Clickstream Data of Online Stores for
Understanding Web Merchandising,” Data Mining and Knowledge Discovery, vol. 5, no. 1/2, pp. 59–84, 2001.
[5] R. Cooley, P.-N. Tan, and J. Srivastava, “Discovery of Interesting Usage Patterns from Web Data,” in Lecture notes in computer science Lecture notes in artificial intelligence, vol. 1836, Web usage analysis and user profiling: International WEBKDD'99 Workshop, San Diego, CA, USA, August 15, 1999; revised papers, B. Masand, Ed., Berlin: Springer, 2000, pp. 163–182.
[6] S. Senecal, P. J. Kalczynski, and J. Nantel, “Consumers' decision-making process and their online shopping behavior: A clickstream analysis,” Journal of Business Research, vol. 58, no. 11, pp. 1599–1608, 2005.
[7] NIST Big Data Public Working Group (NBD-PWG), NIST Big Data Interoperability Framework: Volume 3, Use Cases and General Requirements: National Institute of Standards and Technology, 2015.
[8] A. R. Hevner, S. T. March, J. Park, and S. Ram, “Design science in information systems research,” MIS quarterly, vol. 28, no. 1, pp. 75–105, 2004.
[9] K. Peffers, T. Tuunanen, M. Rothenberger, and S. Chatterjee, “A Design Science Research Methodology for Information Systems Research,” J. Manage. Inf. Syst., vol. 24, no. 3, pp. 45–77, 2007.
[10] A. Fink, Conducting research literature reviews: From the internet to paper, 4th ed. Los Angeles: SAGE, 2014.
[11] P. Mayring, Qualitative Inhaltsanalyse: Grundlagen und Techniken, 11th ed. Weinheim: Beltz, 2010.
[12] V. Melnykov, “Model-based biclustering of clickstream data,” Computational Statistics & Data Analysis, vol. 93, pp. 31–45, 2016.
[13] R. Rathipriya and K. Thangavel, “A Fuzzy Co-Clustering approach for Clickstream Data Pattern,” Global Journal of Computer Science and Technology, vol. 10, no. 6, http://computerresearch.org/index.php/computer/article/download/960/958, 2010.
[14] L. Zheng, S. Cui, D. Yue, and X. Zhao, “User interest modeling based on browsing behavior,” in 3rd International Conference on Advanced Computer Theory and Engineering (ICACTE), 2010: 20 - 22 Aug. 2010, Chengdu, China; proceedings, Piscataway, NJ: IEEE, 2010, V5-455-V5-458.
[15] Q. Su and L. Chen, “A method for discovering clusters of e-
[24] D. van den Poel and W. Buckinx, “Predicting online-purchasing behaviour,” European Journal of Operational Research, vol. 166, no. 2, pp. 557–575, 2005.
[25] W. W. Moe, “Buying, Searching, or Browsing: Differentiating Between Online Shoppers Using In-Store Navigational Clickstream,” Journal of Consumer Psychology, vol. 13, no. 1, pp. 29–39, 2003.
[26] NIST Big Data Public Working Group (NBD-PWG), NIST Big Data Interoperability Framework: Volume 1, Definitions: National Institute of Standards and Technology, 2015.
[27] T. Hansmann and P. Niemeyer, “Big Data - Characterizing an Emerging Research Field Using Topic Models,” in IEEE/WIC/ACM International Joint Conferences on Web Intelligence (WI) and Intelligent Agent Technologies (IAT), 2014: 11 - 14 Aug. 2014, Warsaw, Poland; proceedings; [including workshops]; held as part of the 2014 Web Intelligence Congress (WIC '14), Piscataway, NJ: IEEE, 2014, pp. 43–51.
[28] A. Kumaresan, “Framework for Building a Big Data Platform for Publishing Industry,” in Lecture Notes in Business Information Processing, vol. 224, Knowledge management in organizations: 10th international conference, KMO 2015, Maribor, Slovenia, August 24-28, 2015: proceedings, L. Uden, M. Heričko, and I.-H. Ting, Eds., Cham, Heidelberg, New York, Dordrecht, London: Springer, 2015, pp. 377–388.
[29] J. Zhan et al., Eds., Study of the key technologies of electric power big data and its application prospects in smart grid. 2014 IEEE PES Asia-Pacific Power and Energy Engineering Conference (APPEEC), 2014.
[30] H. Hu, Y. Wen, T.-S. Chua, and X. Li, “Toward Scalable Systems for Big Data Analytics: A Technology Tutorial,” IEEE Access, vol. 2, pp. 652–687, 2014.
[31] C. L. Philip Chen and C.-Y. Zhang, “Data-intensive applications, challenges, techniques and technologies: A survey on Big Data,” Information Sciences, vol. 275, pp. 314–347, 2014.
[32] I. A. T. Hashem et al., “The rise of “big data” on cloud computing: Review and open research issues,” Information Systems, vol. 47, pp. 98–115, 2015.
[33] M. Chen, S. Mao, and Y. Liu, “Big Data: A Survey,” Mobile Netw Appl, vol. 19, no. 2, pp. 171–209, 2014.
commerce interest patterns using click-stream data,” Electronic [34] D. Dev and R. Patgiri, “A Survey of Different Technologies and
Commerce Research and Applications, vol. 14, no. 1, pp. 1–13, Recent Challenges of Big Data,” in Smart Innovation, Systems and
2015. Technologies, Proceedings of 3rd International Conference on
[16] W. W. Moe and P. S. Fader, “Dynamic Conversion Behavior at E- Advanced Computing, Networking and Informatics, A. Nagar, D. P.
Commerce Sites,” Management Science, vol. 50, no. 3, pp. 326– Mohapatra, and N. Chaki, Eds., New Delhi: Springer India, 2016,
335, 2004. pp. 537–548.
[17] Y. S. Kim and B.-J. Yum, “Recommender system based on click [35] P. Pääkkönen and D. Pakkala, “Reference Architecture and
stream data using association rule mining,” Expert Systems with Classification of Technologies, Products and Services for Big Data
Applications, vol. 38, no. 10, pp. 13320–13327, 2011. Systems,” Big Data Research, vol. 2, no. 4, pp. 166–186, 2015.
[18] Y.-J. Park and K.-N. Chang, “Individual and group behavior-based [36] L. Rodríguez-Mazahua et al., “A general perspective of Big Data:
customer profile model for personalized product recommendation,” Applications, tools, challenges and trends,” J Supercomput, vol. 72,
Expert Systems with Applications, vol. 36, no. 2, pp. 1932–1939, no. 8, pp. 3073–3113, 2016.
2009. [37] M. D. Assunção, R. N. Calheiros, S. Bianchi, M. A. Netto, and R.
[19] C. Sismeiro and R. E. Bucklin, “Modeling Purchase Behavior at an Buyya, “Big Data computing and clouds: Trends and future
E-Commerce Web Site: A Task-Completion Approach,” Journal of directions,” Journal of Parallel and Distributed Computing, vol. 79-
Marketing Research, vol. 41, no. 3, pp. 306–323, 2004. 80, pp. 3–15, 2015.
[20] H. Yu et al., “A novel possibilistic fuzzy leader clustering [38] G. Bello-Orgaz, J. J. Jung, and D. Camacho, “Social big data:
algorithm,” HIS, vol. 8, no. 1, pp. 31–40, 2011. Recent achievements and new challenges,” Information Fusion, vol.
[21] L. Aguiar and B. Martens, Digital music consumption on the 28, pp. 45–59, 2016.
Internet: Evidence from Clickstream data. Luxembourg: [39] G. Pole and P. Gera, “A Recent Study of Emerging Tools and
Publications Office, 2013. Technologies Boosting Big Data Analytics,” in Advances in
[22] G. Silahtaroglu and H. Donertasli, “Analysis and prediction of Ε- Intelligent Systems and Computing, Innovations in Computer
customers' behavior by mining clickstream data,” in 2015 IEEE Science and Engineering, H. S. Saini, R. Sayal, and S. S. Rawat,
International Conference on Big Data (Big Data): IEEE, 2015, pp. Eds., Singapore: Springer Singapore, 2016, pp. 29–36.
1466–1472. [40] U. Fayyad, G. Piatetsky-Shapiro, and P. Smyth, “From data mining
[23] G. Wang, X. Zhang, S. Tang, H. Zheng, and B. Y. Zhao, to knowledge discovery in databases,” AI magazine, vol. 17, no. 3,
“Unsupervised Clickstream Clustering for User Behavior Analysis,” p. 37, 1996.
in CHI 2016: #chi4good ; proceedings ; The 34rd Annual CHI [41] M. Volk, S. Hart, S. Bosse, and K. Turowski, “How much is Big
Conference on Human Factors in Computing Systems, San Jose, Data? A Classification Framework for IT Projects and Technologies
CA, USA, May 07 - 12, 2016, New York, NY: ACM, 2015?, pp. Diego, CA, USA, August 11-14, 2016,” in 22nd Americas
225–236. Conference on Information 2016.