Sciencedirect: Performing Customer Behavior Analysis Using Big Data Analytics
Procedia Computer Science 79 (2016) 986 – 992
Abstract
Although many systems have implemented customer behavior analytics, it is still an emerging
and largely unexplored market with considerable potential for advancement. Big data is one of
the fastest-rising technology trends and can significantly change the way business
organizations analyze customer behavior and transform it into valuable insights. Decision
trees can be used efficiently for analyzing such data. This paper proposes a MapReduce
implementation of a well-known statistical classifier, the C4.5 decision tree algorithm. In
addition, the system aims to implement customer data visualization using Data-Driven
Documents (D3.js), which allows us to build well-customized graphics.
© 2016 The Authors. Published by Elsevier B.V. This is an open access article under the
CC BY-NC-ND license (http://creativecommons.org/licenses/by-nc-nd/4.0/).
Peer-review under responsibility of the Organizing Committee of ICCCV 2016.
1. Main text
Big data is a collection of unstructured data that has a very large volume, comes from a
variety of sources such as the web and business organizations in different formats, and
arrives at great velocity, which makes processing it with traditional database management
tools complex and tedious. It can be termed a growing torrent. The major demanding issues in
big data processing therefore include storage, search, distribution, transfer, analysis and
visualization. Earlier, the term 'Analytics' indicated the study of existing data to research
potential trends and to analyze the effects of certain decisions or events, which can be used
for business intelligence to gain valuable insights. Today's biggest challenge is how to
discover all the hidden information in the huge amount of data collected from a varied
collection of sources. This is where Big Data Analytics comes into the picture. One of its
applications is customer behavior analysis, which is referred to as customer analytics.
1877-0509 © 2016 Published by Elsevier B.V.
doi:10.1016/j.procs.2016.03.125
Anindita A. Khade / Procedia Computer Science 79 (2016) 986 – 992
Customer analytics helps to turn big data into big value by allowing organizations to predict
buyer behavior, thereby improving sales, market optimization, inventory planning, fraud
detection and many other applications. A wide range of approaches is available, but the one
that stands out is the use of decision trees for classification, which can be used efficiently
in consumer analytics.
Various decision tree algorithms have been developed over time, with improvements in
performance and in the ability to handle various types of data. One well-known decision tree
algorithm is C4.5 [3-4], an extension of the basic ID3 decision tree algorithm [5]. Customer
analytics is incomplete without visualization of the data. In addition to classifying data
using decision trees, it is also important to visualize the data so that organizations get a
visual view of it and can understand the variations in customer patterns.
2. Literature Survey
In the 1960s, there were two approaches for constructing Database Management Systems
(DBMSs). The first approach was based on the hierarchical data model, typified by IMS
(Information Management System) from IBM, developed in response to the enormous information
storage requirements generated by the Apollo space program. The second approach was based on
the network data model, which attempted to create a database standard and resolve some of the
difficulties of the hierarchical model, such as its inability to represent complex
relationships. However, these two models had some fundamental disadvantages: complex programs
had to be written to answer even simple queries, and there was minimal data independence.
Many experimental relational DBMSs were implemented thereafter, with the first commercial
products appearing in the 1970s and early 1980s. Relational DBMSs, used extensively in the
1980s and 1990s, were limited in meeting the more complex entity and data needs of companies
as their operations and applications became increasingly complex. In response to the
increasing complexity of database applications, two new data models emerged: the
Object-Relational Database Management System (ORDBMS) and the Object-Oriented Database
Management System (OODBMS), which subscribe to the relational and object data models
respectively. Together, the OODBMS and ORDBMS represent the third generation of Database
Management Systems.
Dawn Of Big Data Analytics:
Data turns into big data when its volume, velocity, or variety goes beyond the ability of IT
operational systems to gather, store, analyze, and process it. Most organizations are capable
of handling vast amounts of unstructured data using varied tools and equipment, but with the
rapidly growing volume and fast flood of data, they do not have the capability to mine it and
derive the necessary insights in a timely way.
Big Data is emerging from the realm of science projects at companies to help
telecommunication giants understand exactly which customers are happy with their service,
which processes caused dissatisfaction, and which customers are likely to change their
service. To obtain this information, billions of loosely structured bytes of data in
different locations need to be processed until the required data is found. This type of
analysis enables executive management to fix faulty processes and may allow them to reach out
to retain at-risk customers. Big data is becoming one of the most important technology trends,
with the potential to dramatically change the way organizations analyze customer behaviour
and transform it into valuable insights. [11]
Select records from your data tree and generate customer profiles that indicate common features
and behaviors. Use customer profiles to inform effective sales and marketing strategy.
Forecasting enables you to adapt to changes, trends and seasonal patterns. You can predict
monthly sales volume or anticipate the number of orders expected in any given month.
4) Mapping – Identify Geographical Zones
This technique detects relationship or affinity patterns across data and generates a set of rules.
It automatically selects the rules that are most useful to key business insights: What products
do customers purchase simultaneously and when? Which customers are not buying and why?
What new cross-selling opportunities exist?
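As a rough illustration of how such affinity rules can be scored, the sketch below computes two standard association-rule measures, support and confidence, over a toy set of transactions. These measures, the function names, and the sample data are assumptions made for illustration; the text above does not say which measures or algorithm the technique uses.

```python
def support(transactions, items):
    """Fraction of transactions that contain every item in `items`."""
    items = set(items)
    return sum(items <= set(t) for t in transactions) / len(transactions)


def confidence(transactions, antecedent, consequent):
    """Estimated probability of `consequent` given `antecedent` (0.0 if unseen)."""
    base = support(transactions, antecedent)
    if base == 0:
        return 0.0
    both = set(antecedent) | set(consequent)
    return support(transactions, both) / base


# Hypothetical purchase baskets, one set of products per customer visit.
baskets = [
    {"bread", "milk"},
    {"bread", "butter"},
    {"bread", "milk", "butter"},
    {"milk"},
]

print(support(baskets, {"bread", "milk"}))
print(confidence(baskets, {"bread"}, {"milk"}))
```

A rule such as "bread implies milk" would typically be kept only if both values exceed thresholds chosen by the analyst.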
Decision trees are one of the most popular methods for classification in various data mining
applications and assist the process of decision making. Classification helps with tasks such
as selecting the right products to recommend to particular customers and predicting potential
churn. The most commonly used decision tree algorithms include ID3, C4.5 and CART.
Flot: A JavaScript plotting library for jQuery, Flot is a browser-based application compatible
with most common browsers — including Internet Explorer, Chrome, Firefox, Safari and
Opera. Flot supports a variety of visualization options for data points, interactive charts, stacked
charts, panning and zooming, and other capabilities through a variety of plugins for specific
functionality. [12]
3) D3.js: A JavaScript library for creating data visualizations with an emphasis on web
standards. Using HTML, SVG and CSS, bring documents to life with a data-driven approach to
DOM manipulation, all with the full capabilities of modern browsers and no constraints of
proprietary frameworks. [12]
4) SAS Visual Analytics: SAS Visual Analytics is a tool for exploring data sets of all sizes
visually for more comprehensive analytics. With an intuitive platform and automatic
forecasting tools, SAS Visual Analytics allows even non-technical users to explore the deeper
relationships behind data and uncover hidden opportunities. [12]
3. Related Technologies
The Java code for the map and reduce functions in this implementation is written to override
the default map and reduce functions provided by the Hadoop framework. The programming logic
for each is based on the C4.5 algorithm.
2. Methodology
The flow of the system is as follows:
1) Load the customer dataset from HDFS as input for the algorithm.
2) Create an instance of the C4.5 class.
3) Using the MapReduce framework of Hadoop, invoke the Map function, which checks
whether each instance belongs to the current node. For all uncovered attributes it
outputs the attribute index, its value and the class label of the instance.
4) The Reduce function counts the number of occurrences of each combination of (index,
value, class label) and prints the count against it.
5) Calculate the entropy, information gain and gain ratio of the attributes.
6) Process the input dataset from HDFS according to the C4.5 decision tree algorithm
in the MapReduce framework.
7) Generate the decision rules and store them in HDFS.
8) Accept new test data from the web UI.
9) Access the rules and, based on them, decide the category of the new data.
10) Provide visualization of the dataset from HDFS on the web UI in the form of bar graphs,
pie charts etc. using D3.js.
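Steps 3 and 4 of the flow can be sketched in plain Python to show the shape of the map and reduce logic. This is an in-memory simulation under stated assumptions, not the paper's Java/Hadoop code: the function names, the `node_conditions` dictionary used to describe the current node, and the sample records are all hypothetical.

```python
from collections import defaultdict


def belongs_to_node(attributes, node_conditions):
    """True if the instance matches every (index -> value) test of the current node."""
    return all(attributes[i] == v for i, v in node_conditions.items())


def map_instance(attributes, label, node_conditions):
    """Map step: emit ((index, value, class label), 1) for each uncovered attribute."""
    if not belongs_to_node(attributes, node_conditions):
        return
    for index, value in enumerate(attributes):
        if index not in node_conditions:  # attributes already tested are "covered"
            yield (index, value, label), 1


def reduce_counts(pairs):
    """Reduce step: sum the emitted counts for each (index, value, class label) key."""
    counts = defaultdict(int)
    for key, count in pairs:
        counts[key] += count
    return dict(counts)


def count_combinations(dataset, node_conditions):
    """Run the map step over every record, then the reduce step over all emitted pairs."""
    pairs = []
    for *attributes, label in dataset:
        pairs.extend(map_instance(attributes, label, node_conditions))
    return reduce_counts(pairs)


# Hypothetical records: (age group, income, class label).
records = [
    ("young", "high", "no"),
    ("young", "low", "yes"),
    ("old", "high", "yes"),
]

print(count_combinations(records, {}))            # counts at the root node
print(count_combinations(records, {0: "young"}))  # counts under the node age == young
```

The resulting counts are what step 5 needs to compute entropy, information gain and gain ratio without rescanning the raw data.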
threshold and then divides attributes with values above the threshold and values equal to
or below the threshold. The C4.5 algorithm can also handle missing values easily, as
missing attribute values are not used in its gain calculations. [8]
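A minimal sketch of the threshold split described above, assuming records are dictionaries and a missing value is represented by `None` (both are illustrative choices, as is the function name). Records with a missing value are simply left out of the partition, mirroring the statement that missing values do not take part in gain calculations.

```python
def split_on_threshold(records, attribute, threshold):
    """Partition records into (value <= threshold, value > threshold), skipping missing values."""
    at_or_below, above = [], []
    for record in records:
        value = record.get(attribute)
        if value is None:  # missing value: excluded from the split
            continue
        if value > threshold:
            above.append(record)
        else:
            at_or_below.append(record)
    return at_or_below, above


# Hypothetical customer records with a continuous "age" attribute.
customers = [{"age": 25}, {"age": 40}, {"age": None}, {"age": 30}]
low, high = split_on_threshold(customers, "age", 30)
print(low, high)
```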
Let C denote the number of classes. In this case, there are two classes into which the
records are classified: yes and no. Let p(S, j) be the proportion of instances in S that
are assigned to the j-th class. The entropy of the set S is then calculated as:
$$\mathrm{Entropy}(S) = -\sum_{j=1}^{C} p(S, j) \log p(S, j)$$
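The entropy formula, together with the information gain and gain ratio mentioned in the methodology, can be sketched as follows. The base-2 logarithm is an assumption (the formula above does not state a base), and the gain and gain ratio definitions used here are the standard C4.5 ones rather than formulas taken from this text; function names and sample data are illustrative.

```python
import math
from collections import Counter


def entropy(labels):
    """Entropy(S) = -sum_j p(S, j) * log2 p(S, j) over the distinct labels in S."""
    total = len(labels)
    return -sum((n / total) * math.log2(n / total) for n in Counter(labels).values())


def information_gain(values, labels):
    """Entropy of the labels minus the weighted entropy after splitting on `values`."""
    total = len(labels)
    partitions = {}
    for value, label in zip(values, labels):
        partitions.setdefault(value, []).append(label)
    remainder = sum(len(part) / total * entropy(part) for part in partitions.values())
    return entropy(labels) - remainder


def gain_ratio(values, labels):
    """Information gain divided by the split information (entropy of the attribute values)."""
    split_info = entropy(values)
    if split_info == 0:  # attribute has a single value, so it cannot split the data
        return 0.0
    return information_gain(values, labels) / split_info


# Hypothetical two-class example with labels "yes" and "no".
labels = ["yes", "yes", "no", "no"]
print(entropy(labels))
print(gain_ratio(["a", "a", "b", "b"], labels))  # attribute separates the classes
print(gain_ratio(["a", "b", "a", "b"], labels))  # attribute carries no information
```

In a tree-building loop, the attribute with the highest gain ratio would be chosen as the next split.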
CSS. D3’s emphasis on web standards gives you the full capabilities of modern browsers
without tying yourself to a proprietary framework, combining powerful visualization
components and a data-driven approach to DOM manipulation.[14]
4. Conclusion
This paper describes a proposed system for a distributed implementation of the C4.5
algorithm using the MapReduce framework, along with customer data visualization. With the
growth of cloud computing and big data, traditional decision tree algorithms are no longer
adequate, and hence we introduced the MapReduce implementation of the C4.5 decision tree
algorithm. Visualization done using D3.js is fast and reusable because it uses traditional
HTML elements along with Scalable Vector Graphics (SVG). In future work, fast, real-time
database systems like Apache HBase or MongoDB can be incorporated into this system. In
addition, distributed refined algorithms like ForestTree implemented in Apache Mahout can
be used to increase efficiency and scalability.
5. References
1. Tom White, "Hadoop - The Definitive Guide", 3rd Edition, O'Reilly Media, Inc., Sebastopol, CA 95472, 2012.
2. Dirk deRoos, Paul C. Zikopoulos, Bruce Brown, Rafael Coss, Roman B. Melnyk, "Hadoop For Dummies", John Wiley
& Sons, Inc., Hoboken, New Jersey, 2014.
3. J.R. Quinlan, "C4.5: Programs for Machine Learning", Morgan Kaufmann, 1993.