
CHAPTER 1

INTRODUCTION

1.1 OVERVIEW OF THE PROJECT

The volume, variety, and velocity of data being produced across the retail
industry are growing exponentially, creating both challenges and opportunities for those
diligently analyzing this data to gain a competitive advantage. Although retailers have
been using data analytics to generate business intelligence for years, the composition of
today's data necessitates new approaches and tools. This is because the retail industry
has entered the big data era, with access to more information that can be used to create
compelling shopping experiences and forge tighter connections between customers,
brands, and retailers.

A trail of data follows products as they are manufactured, shipped, stocked,
advertised, purchased, consumed, and talked about by consumers – all of which can
help forward-thinking retailers increase sales and operations performance. This requires
an end-to-end retail analytics solution capable of analyzing large datasets populated by
retail systems and sensors, enterprise resource planning (ERP), inventory control, social
media, and other sources.

In an attempt to demystify retail data analytics, this project chronicles a real-world
implementation that is producing tangible benefits, such as allowing retailers to:

• Increase sales per visit with a deeper understanding of customers' purchase patterns.

• Learn about new sales opportunities by identifying unexpected trends from social media.

• Improve inventory management with greater visibility into the product pipeline.

A set of simple analytics experiments was performed to create capabilities and a
framework for conducting large-scale, distributed data analytics. These experiments
facilitated an understanding of the edge-to-cloud business analytics value proposition
and, at the same time, provided insight into the technical architecture and integration
needed for implementation.

1.2 COMPETITIVE ADVANTAGE

According to the McKinsey Global Institute, big data has the potential to increase
net retailer margins by 60 percent. Likewise, companies in the top third of their industry
in the use of data-driven decision making were, on average, five percent more productive
and six percent more profitable than their competitors, wrote Andrew McAfee and Erik
Brynjolfsson in a Harvard Business Review article.

In order to generate the insight needed to reap substantial business benefits, new and
innovative approaches and technologies are required. This is because big data in retail is
like a mountain, and retailers must uncover those tiny but game-changing golden
nuggets of insight and knowledge that can be used to create a competitive advantage.

1.3 BIG DATA TECHNOLOGIES

In order for retailers to realize the full potential of big data, they must find a new
approach for handling large amounts of data. Traditional tools and infrastructure are
struggling to keep up with larger and more varied data sets coming in at high velocity.
New technologies are emerging to make big data analytics scalable and cost-effective,
such as a distributed grid of computing resources. Processing is pushed out to the nodes
where the data resides, which is in contrast to long-established approaches that retrieve
data for processing from a central point.

Hadoop is a popular, open-source software framework that enables distributed
processing of large data sets on clusters of computers. The framework scales easily on
servers based on Intel Xeon processors, as demonstrated by Cloudera, a provider of
Apache Hadoop-based software, support services, and training. Although Hadoop has
captured a lot of attention, other tools and technologies are also available for working
on different types of big data analytics problems.

1.4 REAL-WORLD USE CASES

To demonstrate big data use cases, the project team worked with Living Naturally,
which serves 3,000 natural food stores in the United States. Living Naturally develops
and markets a suite of online and mobile programs designed to enhance the productivity
and marketing capabilities of retailers and suppliers. Since 1999, Living Naturally has
worked with thousands of retail customers and 20,000 major industry brands.

In a period of three months, a team of four analysts worked with Living
Naturally on a retail analytics project spanning four key phases: problem definition, data
preparation, algorithm definition, and big data solution implementation. During the
project, the following use cases were investigated in detail:

Product Pipeline Tracking: When inventory levels are out of sync with demand,
make recommendations to retail buyers to remedy the situation and maximize return for
the store.

Market Basket Analysis: When an item goes on sale, let retailers know about
adjacent products that would benefit from a sales increase as well.

Social Media Analysis: Before products go viral on social media, suggest that
retail buyers increase order size to be more responsive to shifting consumer demand and
avoid out-of-stocks.

CHAPTER 2

LITERATURE REVIEW

2.1 Background of the study

It includes the investigation and possible changes to the existing system.
Analysis is used to gain an understanding of the existing system and what is required of
it. At the conclusion of the system analysis, there is a system description and a set of
requirements for a new system.

A detailed study of the process must be made using various techniques such as
interviews and questionnaires. The data collected from these sources must be scrutinized
to arrive at a conclusion. The conclusion is an understanding of how the system
functions. This system is called the existing system. The existing system is then
subjected to close study and problem areas are identified. The designer now functions as
a problem solver and tries to sort out the difficulties that the enterprise faces. The
solutions are given as proposals. Each proposal is then weighed against the existing
system analytically and the best one is selected.

The proposal is presented to the user for endorsement. The proposal is reviewed
on user request and suitable changes are made. This is a loop that ends as soon as the
user is satisfied with the proposal. Preliminary study is a problem-solving activity that
requires intensive communication between the system users and system developers. It
involves various feasibility studies. From these studies a rough picture of the system
activities can be obtained, which can be used to take decisions regarding the strategies
to be followed for effective system development.

The various tasks to be carried out in system analysis involve: examining the
documents and the relevant aspects of the existing system, its failures and problems;
analysing the findings and recording the results; defining and documenting in outline the
proposed system; testing the proposed design against the known facts; producing a
detailed report to support the proposals; and estimating the resources required to design
and implement the proposed system.

The objective of this system study is to determine whether there is any need for
the new system. All the levels of feasibility measures have to be performed, thereby
establishing the performance expected of the new system.

2.1.1 Problem Definition

Problem definition deals with observations, site visits and discussions to
identify, analyze and document project requirements, and to carry out feasibility studies
and technical assessments to determine the best approaches for full system
development.

Addition of new features is very difficult and creates more overhead. The
present system is not easily customizable. The existing system records customer
details, employee details, and new service/event details manually. A change in one
module or any part of the system widely affects other parts or sections, and it
is difficult for the administrator to coordinate all the changes. The present system
thus becomes more error prone when updates are considered. Many security
problems are present in the existing system. Keeping the problem definition in mind,
the proposed system is designed to be easily customizable, user friendly, and easy to
update with new features in the future.

2.1.2 Requirement Analysis

Requirements analysis in systems engineering and software engineering
encompasses those tasks that go into determining the needs or conditions to meet for a
new or altered product or project, taking account of the possibly conflicting
requirements of the various stakeholders, and analyzing, documenting, validating and
managing software or system requirements. Requirements analysis is critical to the
success of a systems or software project. Requirement analysis for the proposed retail
analytics system produced the results described in the following sections.

2.2 EXISTING SYSTEM

Increasingly connected consumers and retail channels have served to make a
wide variety of new data types and sources available to today's retailer. Traditional
structured types of data, such as the transactional and operational data found in a data
warehouse, are now only a small part of the overall data landscape for retailers. Savvy
retailers now collect large volumes of unstructured data such as web logs and
clickstream data, location data, social network interactions, and data from a variety of
sensors. The opportunities offered by this data are significant, and span all functional
areas in retail. According to a 2011 study by the McKinsey Global Institute, retailers fully
exploiting big data technology stand to dramatically improve margins and productivity.
In order for retailers to take advantage of new types of data, they need new approaches
to storage and analysis.

2.2.1 The Data Deluge, And Other Barriers

Modern retailers are increasingly pursuing omni-channel retail strategies that
seek to create a seamless customer experience across a variety of physical and digital
interaction channels, including in-store, telephone, web, mobile and social media.
The emergence of omni-channel retailing, the need for a 360-degree view of the
customer, and advances in retail technology are leading today's retailers to collect a
wide variety of new types of data.

These new data sources, including clickstream and web logs, social media
interactions, search queries, in-store sensors and video, marketing assets and
interactions, and a variety of public and purchased data sets provided by third parties,
have put tremendous pressure on traditional retail data systems. Retailers face challenges
collecting big data. These are often characterized by "the three Vs":

• Volume, or the sheer amount of data being amassed by today's enterprises.

• Velocity, or the speed at which this data is created, collected and processed.

• Variety, or the fact that the fastest growing types of data have little or no inherent
structure, or a structure that changes frequently.

These three Vs have individually and collectively outstripped the capabilities of
traditional storage and analytics solutions that were created in a world of more
predictable data. Compounding the challenges represented by the Vs is the fact that
collected data can have little or no value as individual or small groups of records. But
when explored in the aggregate, in very large quantities or over a longer historical time
frame, the combined data reveals patterns that feed advanced analytical applications. In
addition, retailers seeking to harness the power of big data face a distinct set of
challenges.

These include:

• Data Integration. It can be challenging to integrate unstructured and
transactional data while also accommodating privacy concerns and regulations.

• Skill Building. Many of the most valuable big data insights in retail come from
advanced techniques like machine learning and predictive analytics. However, to
use these techniques analysts and data scientists must build new skill sets.

• Application of Insight. Incorporating big data analytics into retail will require a
shift in industry practices from those based on weekly or monthly historical
reporting to new techniques based on real-time decision-making and predictive
analytics.

Fortunately, open technology platforms like Hadoop allow retailers to overcome
many of the general and industry-specific challenges just described. Retailers who do so
benefit from the rapid innovation taking place within the Hadoop community and the
broader big data ecosystem.

2.3 PROPOSED SYSTEM

To overcome the drawbacks of the existing system, retailers from across the industry
have turned to Apache Hadoop to meet these needs and to collect, manage and analyze a
wide variety of data. In doing so, they are gaining new insights into the customer,
offering the right product at the right price, and improving operations and the supply chain.

Apache Hadoop is an open-source data processing platform created at web-scale
Internet companies confronted with the challenge of storing and processing massive
amounts of structured and unstructured data. By combining the affordability of low-cost
commodity servers and an open source approach, Hadoop provides cost-effective and
efficient data storage and processing that scales to meet the needs of the very largest
retail organizations. Hadoop enables a shared "data lake" allowing retail organizations
to:

• Collect Everything. A data lake can store any type of data, including an
enterprise's structured data, publicly available data, and semi-structured and
unstructured social media content. Data generated by the Internet of Things, such
as in-store sensors that capture location data on how products and people move
about the store, is also easily included.

• Dive in Anywhere. A data lake enables users across multiple business units to
refine, explore and enrich data on their own terms, performing data exploration and
discovery to support business decisions.

• Flexibly Access Data. A data lake supports multiple data access patterns across
a shared infrastructure, enabling batch, interactive, real-time, in-memory and
other types of processing on underlying data sets.
A Hadoop-based data lake, in conjunction with existing data management
investments, can provide retail enterprises an opportunity for Big Data analytics while
at the same time increasing storage and processing efficiency, which reduces costs. Big
data technology and approaches have broad application within the retail domain.
McKinsey Global Institute identified 16 big data use cases, or “levers,” across
marketing, merchandising, operations, supply chain and new business models.

Function              Big data lever

Marketing             Cross-selling
                      Location-based marketing
                      In-store behavior analysis
                      Customer micro-segmentation
                      Sentiment analysis
                      Enhancing the multichannel consumer experience

Merchandising         Assortment optimization
                      Pricing optimization
                      Placement and design optimization

Operations            Performance transparency
                      Labor inputs optimization

Supply chain          Inventory management
                      Distribution and logistics optimization
                      Informing supplier negotiations

New business models   Price comparison services
                      Web-based markets

Table 1: Big Data Levers in Retail Analytics

2.5 FACT FINDING TECHNIQUES

Requirements analysis encompasses all of the tasks that go into the investigation,
scoping and definition of a new or altered system. The first activity in the analysis phase is
the preliminary investigation. During the preliminary investigation, data collection is very
important, and fact finding techniques can be used for it.

The following fact finding techniques can be used for collecting the data:

• Observation - This is a skill which the analysts have to develop. The analysts
have to identify the right information, choose the right person and look at the
right place to achieve their objective. They should have a clear vision of how each
department works and of the work flow between them, and for this they should be
good observers.

• Interview - This method is used to collect information from groups or
individuals. In this method the analyst sits face to face with the people and
records their responses. The questions to be asked are prepared in advance, and
the analyst should be ready to handle any type of response. The information
collected is quite accurate and reliable as doubts are cross-checked and cleared at
the customer site itself. This method also helps bridge areas of misunderstanding
and helps in discussing future problems.

Requirements analysis is an important part of the system design process, whereby
requirements engineers and business analysts, along with systems engineers or software
developers, identify the needs or requirements of a client. Once the client's requirements
have been identified and the facts collected, the system designers are then in a position to
design a solution.

2.6 REVIEW OF THE LITERATURE
Text summarization is an old challenge in text mining, but it is in dire need of
researchers' attention in the areas of computational intelligence, machine learning and
natural language processing. A set of features is extracted from each sentence that
helps to identify its importance in the document. Reading the full text every time is time
consuming. A clustering approach is useful to decide which type of data is present in a
document. The concept of k-means clustering is used for natural language processing of
text for word matching, and in order to extract meaningful information from a large set of
offline documents, data mining document clustering algorithms are adopted. In automated
text summarization, two main ideas have emerged to deal with this task: the first is how
a summarizer has to treat a huge quantity of data, and the second is how it may be
possible to produce a human-quality summary.

Depending on the nature of text representation in the summary, a summary can be
categorized as an abstract or an extract. An extract is a summary consisting of the
overall idea of the document. An abstract is a summary which represents the subject
matter of the document. In general, the task of document summarization covers generic
summarization and query-oriented summarization. The query-oriented method generates a
summary of documents, and the generic method summarizes the overall sense of the
document without any additional information.

Traditional document clustering algorithms use the full text in documents to
generate feature vectors. Such methods often produce unsatisfactory results because
there is much noisy information in documents. The varying-length problem of the
documents is also a significant negative factor affecting performance. This
technique retrieves important sentences which emphasize high information richness.
The sentences with the maximum generated scores are clustered to generate the summary of
the document. Thus k-means clustering is used to group the maximum-scoring sentences of
the document and find the relations needed to extract clusters with the most relevant sets
in the document. This helps in producing the summary of the document. The main purpose of
the k-means clustering algorithm is to generate a predefined length of summary having
maximally informative sentences. The approach for automatic text summarization is
presented by extraction of sentences from the Reuters-21578 corpus, which includes
newspaper articles.
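To make the clustering-based extraction idea above concrete, the sketch below is a minimal, hypothetical Java illustration rather than the implementation from the cited literature: it scores each sentence by the average corpus frequency of its words, groups the scores with a simple one-dimensional k-means, and keeps the highest-scoring sentence from each cluster as the extract.

import java.util.*;

/** Simplified extractive summarizer: word-frequency sentence scores + 1-D k-means. */
public class KMeansSummarizer {
    public static List<String> summarize(String text, int k) {
        String[] sentences = text.split("(?<=[.!?])\\s+");
        // Corpus-wide word frequencies.
        Map<String, Integer> freq = new HashMap<>();
        for (String s : sentences)
            for (String w : s.toLowerCase().split("\\W+"))
                if (!w.isEmpty()) freq.merge(w, 1, Integer::sum);
        // Score each sentence as the average frequency of its words.
        double[] score = new double[sentences.length];
        for (int i = 0; i < sentences.length; i++) {
            double sum = 0; int n = 0;
            for (String w : sentences[i].toLowerCase().split("\\W+"))
                if (!w.isEmpty()) { sum += freq.get(w); n++; }
            score[i] = n == 0 ? 0 : sum / n;
        }
        // One-dimensional k-means over the sentence scores.
        double[] centroid = new double[k];
        for (int j = 0; j < k; j++) centroid[j] = score[j % score.length];
        int[] assign = new int[score.length];
        for (int iter = 0; iter < 20; iter++) {
            for (int i = 0; i < score.length; i++) {
                int best = 0;
                for (int j = 1; j < k; j++)
                    if (Math.abs(score[i] - centroid[j]) < Math.abs(score[i] - centroid[best])) best = j;
                assign[i] = best;
            }
            for (int j = 0; j < k; j++) {
                double sum = 0; int n = 0;
                for (int i = 0; i < score.length; i++) if (assign[i] == j) { sum += score[i]; n++; }
                if (n > 0) centroid[j] = sum / n;
            }
        }
        // Keep the top-scoring sentence from each non-empty cluster.
        List<String> summary = new ArrayList<>();
        for (int j = 0; j < k; j++) {
            int best = -1;
            for (int i = 0; i < score.length; i++)
                if (assign[i] == j && (best == -1 || score[i] > score[best])) best = i;
            if (best >= 0) summary.add(sentences[best]);
        }
        return summary;
    }
}

The number of clusters k controls the predefined summary length mentioned above; in practice the cited work uses richer sentence features, so this sketch only illustrates the overall flow.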

CHAPTER 3

RESEARCH METHODOLOGY

3.1 APACHE HADOOP

Apache Hadoop is an open source software framework used for distributed
storage and processing of big data sets using the MapReduce programming model. It
consists of computer clusters built from commodity hardware. All the modules in Hadoop
are designed with a fundamental assumption that hardware failures are common
occurrences and should be automatically handled by the framework.

The core of Apache Hadoop consists of a storage part, known as the Hadoop
Distributed File System (HDFS), and a processing part, which is the MapReduce
programming model. Hadoop splits files into large blocks and distributes them across
nodes in a cluster. It then transfers packaged code to the nodes to process the data in
parallel. This approach takes advantage of data locality, where nodes manipulate the data
they have local access to. This allows the dataset to be processed faster and more
efficiently than it would be in a more conventional supercomputer architecture that relies
on a parallel file system where computation and data are distributed via high-speed
networking.

The base Apache Hadoop framework is composed of the following modules:

• Hadoop Common – contains libraries and utilities needed by other Hadoop modules;

• Hadoop Distributed File System (HDFS) – a distributed file system that stores data on
commodity machines, providing very high aggregate bandwidth across the cluster;

• Hadoop YARN – a resource-management platform responsible for managing computing
resources in clusters and using them for scheduling of users' applications; and

• Hadoop MapReduce – an implementation of the MapReduce programming model for
large-scale data processing.

The term Hadoop has come to refer not just to the base modules and sub-modules
above, but also to the ecosystem, or collection of additional software packages
that can be installed on top of or alongside Hadoop, such as Apache Pig, Apache Hive,
Apache HBase, Apache Phoenix, Apache Spark, Apache ZooKeeper, Cloudera Impala,
Apache Flume, Apache Sqoop, Apache Oozie, and Apache Storm. Apache Hadoop's
MapReduce and HDFS components were inspired by Google papers on MapReduce and
the Google File System.

The Hadoop framework itself is mostly written in the Java programming
language, with some native code in C and command line utilities written as shell scripts.
Though MapReduce Java code is common, any programming language can be used
with "Hadoop Streaming" to implement the "map" and "reduce" parts of the user's
program. Other projects in the Hadoop ecosystem expose richer user interfaces.
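As an illustration of the MapReduce model in Java, the hedged sketch below sums sales amounts per product from comma-separated transaction records; the input layout (product,amount), paths and class names are assumptions for the example, not part of this project's actual code.

import java.io.IOException;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.DoubleWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class SalesByProduct {

    // Map phase: each input line "product,amount" is emitted as (product, amount).
    public static class SalesMapper
            extends Mapper<LongWritable, Text, Text, DoubleWritable> {
        @Override
        protected void map(LongWritable key, Text value, Context context)
                throws IOException, InterruptedException {
            String[] fields = value.toString().split(",");
            if (fields.length >= 2) {
                context.write(new Text(fields[0].trim()),
                              new DoubleWritable(Double.parseDouble(fields[1].trim())));
            }
        }
    }

    // Reduce phase: all values for the same product are summed into a total.
    public static class SalesReducer
            extends Reducer<Text, DoubleWritable, Text, DoubleWritable> {
        @Override
        protected void reduce(Text key, Iterable<DoubleWritable> values, Context context)
                throws IOException, InterruptedException {
            double total = 0;
            for (DoubleWritable v : values) {
                total += v.get();
            }
            context.write(key, new DoubleWritable(total));
        }
    }

    public static void main(String[] args) throws Exception {
        Job job = Job.getInstance(new Configuration(), "sales by product");
        job.setJarByClass(SalesByProduct.class);
        job.setMapperClass(SalesMapper.class);
        job.setCombinerClass(SalesReducer.class);   // local pre-aggregation on each node
        job.setReducerClass(SalesReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(DoubleWritable.class);
        FileInputFormat.addInputPath(job, new Path(args[0]));   // e.g. /data/sales
        FileOutputFormat.setOutputPath(job, new Path(args[1])); // e.g. /data/sales-by-product
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}

Because the map tasks run on the nodes holding the input blocks, this kind of job benefits directly from the data locality described above.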

3.1.1 Architecture

Hadoop consists of the Hadoop Common package, which provides file system
and OS level abstractions, a MapReduce engine (either MapReduce/MR1 or
YARN/MR2) and the Hadoop Distributed File System (HDFS). The Hadoop Common
package contains the necessary Java ARchive (JAR) files and scripts needed to start
Hadoop.

For effective scheduling of work, every Hadoop-compatible file system should
provide location awareness: the name of the rack (more precisely, of the network
switch) where a worker node is.

Hadoop applications can use this information to execute code on the node where
the data is, and, failing that, on the same rack/switch to reduce backbone traffic. HDFS
uses this method when replicating data for data redundancy across multiple racks. This
approach reduces the impact of a rack power outage or switch failure; if one of these
hardware failures occurs, the data will remain available.

3.1.2 File Systems

Hadoop Distributed File System

The Hadoop distributed file system (HDFS) is a distributed, scalable, and
portable file system written in Java for the Hadoop framework. Some consider HDFS to
instead be a data store due to its lack of POSIX compliance and inability to be mounted,
but it does provide shell commands and Java API methods that are similar to other file
systems. A Hadoop cluster nominally has a single namenode plus a cluster of datanodes,
although redundancy options are available for the namenode due to its criticality. Each
datanode serves up blocks of data over the network using a block protocol specific to
HDFS. The file system uses TCP/IP sockets for communication. Clients use remote
procedure calls (RPC) to communicate with each other.

HDFS stores large files (typically in the range of gigabytes to terabytes) across
multiple machines. It achieves reliability by replicating the data across multiple hosts,
and hence theoretically does not require RAID storage on hosts (but to increase I/O
performance some RAID configurations are still useful). With the default replication
value, 3, data is stored on three nodes: two on the same rack, and one on a different
rack. Data nodes can talk to each other to rebalance data, to move copies around, and to
keep the replication of data high. HDFS is not fully POSIX compliant, because the
requirements for a POSIX file-system differ from the target goals for a Hadoop
application. The trade-off of not having a fully POSIX-compliant file-system is
increased performance for data throughput and support for non-POSIX operations such
as Append.

Figure 3.1: Architecture of Hadoop

HDFS added high-availability capabilities, as announced for release 2.0 in
May 2012, letting the main metadata server (the NameNode) fail over manually to a
backup. The project has also started developing automatic fail-over.

The HDFS file system includes a so-called secondary namenode, a misleading
name that some might incorrectly interpret as a backup namenode for when the primary
namenode goes offline. In fact, the secondary namenode regularly connects with the
primary namenode and builds snapshots of the primary namenode's directory
information, which the system then saves to local or remote directories. These
checkpointed images can be used to restart a failed primary namenode without having to
replay the entire journal of file-system actions, after which the edit log is applied to
create an up-to-date directory structure. Because the namenode is the single point for
storage and management of metadata, it can become a bottleneck for supporting a huge
number of files, especially a large number of small files.

HDFS Federation, a new addition, aims to tackle this problem to a certain extent
by allowing multiple namespaces served by separate namenodes. Moreover, there are
some issues in HDFS, namely, small file issue, scalability problem, Single Point of
Failure (SPoF), and bottleneck in huge metadata request. An advantage of using HDFS
is data awareness between the job tracker and task tracker. The job tracker schedules
map or reduce jobs to task trackers with an awareness of the data location.

For example: if node A contains data (x,y,z) and node B contains data (a,b,c),
the job tracker schedules node B to perform map or reduce tasks on (a,b,c) and node A
would be scheduled to perform map or reduce tasks on (x,y,z). This reduces the amount
of traffic that goes over the network and prevents unnecessary data transfer. When
Hadoop is used with other file systems, this advantage is not always available. This can
have a significant impact on job-completion times, which has been demonstrated when
running data-intensive jobs.

HDFS was designed for mostly immutable files and may not be suitable for
systems requiring concurrent write operations. HDFS can be mounted directly with a
Filesystem in Userspace (FUSE) virtual file system on Linux and some other Unix
systems.

File access can be achieved through the native Java application programming
interface (API), the Thrift API to generate a client in the language of the users' choosing
(C++, Java, Python, PHP, Ruby, Erlang, Perl, Haskell, C#, Cocoa, Smalltalk, and
OCaml), the command-line interface, browsing through the HDFS-UI web application
(webapp) over HTTP, or via 3rd-party network client libraries.
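To make the native Java API concrete, the short sketch below writes and then reads a small file in HDFS; the namenode URI and path are placeholders for this example, not values used by the project.

import java.io.BufferedReader;
import java.io.InputStreamReader;
import java.net.URI;
import java.nio.charset.StandardCharsets;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class HdfsExample {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        // Placeholder namenode address; a real cluster defines fs.defaultFS in core-site.xml.
        FileSystem fs = FileSystem.get(URI.create("hdfs://namenode:8020"), conf);

        Path file = new Path("/retail/demo/hello.txt");

        // Write a small file; HDFS replicates its blocks (default replication factor 3).
        try (FSDataOutputStream out = fs.create(file, true)) {
            out.write("hello from the retail analytics project\n".getBytes(StandardCharsets.UTF_8));
        }

        // Read the file back through the same API.
        try (BufferedReader in = new BufferedReader(
                new InputStreamReader(fs.open(file), StandardCharsets.UTF_8))) {
            String line;
            while ((line = in.readLine()) != null) {
                System.out.println(line);
            }
        }
        fs.close();
    }
}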

HDFS is designed for portability across various hardware platforms and


compatibility with a variety of underlying operating systems. The HDFS design
introduces portability limitations that result in some performance bottlenecks, since the
Java implementation can't use features that are exclusive to the platform on which
HDFS is running. Due to its widespread integration into enterprise-level infrastructures,
monitoring HDFS performance at scale has become an increasingly important issue.

Monitoring end-to-end performance requires tracking metrics from datanodes,
namenodes, and the underlying operating system. There are currently several
monitoring platforms to track HDFS performance, including HortonWorks, Cloudera,
and Datadog.

3.1.3 JobTracker and TaskTracker: The MapReduce Engine

Above the file systems comes the MapReduce Engine, which consists of one
JobTracker, to which client applications submit MapReduce jobs. The JobTracker
pushes work out to available TaskTracker nodes in the cluster, striving to keep the work
as close to the data as possible. With a rack-aware file system, the JobTracker knows
which node contains the data, and which other machines are nearby. If the work cannot
be hosted on the actual node where the data resides, priority is given to nodes in the
same rack.

This reduces network traffic on the main backbone network. If a TaskTracker fails
or times out, that part of the job is rescheduled. The TaskTracker on each node spawns a
separate Java Virtual Machine (JVM) process to prevent the TaskTracker itself from
failing if the running job crashes its JVM. A heartbeat is sent from the TaskTracker to the
JobTracker every few minutes to check its status. The JobTracker and TaskTracker
status and information are exposed by Jetty and can be viewed from a web browser.
Known limitations of this approach are:

• The allocation of work to TaskTrackers is very simple. Every TaskTracker has a
number of available slots (such as "4 slots"). Every active map or reduce task
takes up one slot. The JobTracker allocates work to the tracker nearest to the
data with an available slot. There is no consideration of the current system load
of the allocated machine, and hence of its actual availability.

• If one TaskTracker is very slow, it can delay the entire MapReduce job,
especially towards the end of a job, where everything can end up waiting for the
slowest task. With speculative execution enabled, however, a single task can be
executed on multiple slave nodes.

3.1.4 Scheduling

By default Hadoop uses FIFO scheduling, and optionally five scheduling priorities,
to schedule jobs from a work queue. In version 0.19 the job scheduler was refactored
out of the JobTracker, while adding the ability to use an alternate scheduler (such as the
Fair scheduler or the Capacity scheduler, described next).

3.1.5 Fair Scheduler

The fair scheduler was developed by Facebook. The goal of the fair scheduler is
to provide fast response times for small jobs and QoS (quality of service) for production
jobs. The fair scheduler has three basic concepts:

• Jobs are grouped into pools.

• Each pool is assigned a guaranteed minimum share.

• Excess capacity is split between jobs.

By default, jobs that are uncategorized go into a default pool. Pools have to
specify the minimum number of map slots, reduce slots, and a limit on the number of
running jobs.

3.1.6 Capacity Scheduler

The capacity scheduler was developed by Yahoo. The capacity scheduler
supports several features that are similar to those of the fair scheduler:

• Queues are allocated a fraction of the total resource capacity.

• Free resources are allocated to queues beyond their total capacity.

• Within a queue, a job with a high level of priority has access to the queue's
resources.

There is no preemption once a job is running.

3.2 APACHE HIVE

Apache Hive is a data warehouse software project built on top of Apache Hadoop
for providing data summarization, query, and analysis. Hive gives an SQL-like interface
to query data stored in various databases and file systems that integrate with Hadoop.
Traditional SQL queries must otherwise be implemented in the MapReduce Java API to
execute SQL applications and queries over distributed data.

Hive provides the necessary SQL abstraction to integrate SQL-like queries
(HiveQL) into the underlying Java without the need to implement queries in the low-level
Java API. Since most data warehousing applications work with SQL-based
querying languages, Hive aids the portability of SQL-based applications to Hadoop. While
initially developed by Facebook, Apache Hive is used and developed by other
companies such as Netflix and the Financial Industry Regulatory Authority (FINRA).
Amazon maintains a software fork of Apache Hive included in Amazon Elastic
MapReduce on Amazon Web Services.
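As a hedged sketch of how an application can submit HiveQL over JDBC, the example below connects to HiveServer2 and aggregates sales per product; the host, database, table and column names are placeholders, not the project's actual schema.

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.Statement;

public class HiveQueryExample {
    public static void main(String[] args) throws Exception {
        // HiveServer2 JDBC driver; hive-jdbc must be on the classpath.
        Class.forName("org.apache.hive.jdbc.HiveDriver");

        // Placeholder connection string: host, port 10000 and the "retail" database are assumptions.
        try (Connection con = DriverManager.getConnection(
                     "jdbc:hive2://hiveserver:10000/retail", "analyst", "");
             Statement stmt = con.createStatement()) {

            // Hive compiles this HiveQL into MapReduce (or other) jobs behind the scenes.
            ResultSet rs = stmt.executeQuery(
                "SELECT product_id, SUM(amount) AS total_sales " +
                "FROM sales GROUP BY product_id ORDER BY total_sales DESC LIMIT 10");

            while (rs.next()) {
                System.out.println(rs.getString("product_id") + "\t" + rs.getDouble("total_sales"));
            }
        }
    }
}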

3.2.1 Architecture

Major components of the Hive architecture are:

Figure 3.2: Architecture of Hive


• Metastore: Stores metadata for each of the tables, such as their schema and
location. It also includes the partition metadata, which helps the driver to track
the progress of various data sets distributed over the cluster. The data is stored in
a traditional RDBMS format. The metadata helps the driver to keep track of
the data, and it is highly crucial. Hence, a backup server regularly replicates the
data, which can be retrieved in case of data loss.

• Driver: Acts like a controller which receives the HiveQL statements. It starts the
execution of the statement by creating sessions, and monitors the life cycle and
progress of the execution. It stores the necessary metadata generated during the
execution of a HiveQL statement. The driver also acts as a collection point of
data or query results obtained after the Reduce operation.

• Compiler: Performs compilation of the HiveQL query, which converts the query
to an execution plan. This plan contains the tasks and steps needed to be
performed by Hadoop MapReduce to get the output as translated by the
query. The compiler converts the query to an abstract syntax tree (AST). After
checking for compatibility and compile-time errors, it converts the AST to a
directed acyclic graph (DAG). The DAG divides operators into MapReduce stages
and tasks based on the input query and data.

• Optimizer: Performs various transformations on the execution plan to get an
optimized DAG. Transformations can be aggregated together, such as
converting a pipeline of joins into a single join, for better performance. It can also
split the tasks, such as applying a transformation on data before a reduce
operation, to provide better performance and scalability. However, the logic of the
transformations used for optimization can be modified or pipelined using
another optimizer.

• Executor: After compilation and optimization, the executor executes the tasks
according to the DAG. It interacts with the job tracker of Hadoop to schedule
tasks to be run. It takes care of pipelining the tasks by making sure that a task
with a dependency gets executed only if all of its prerequisites have run.

• CLI, UI, and Thrift Server: The Command Line Interface and UI (User Interface)
allow an external user to interact with Hive by submitting queries and instructions
and by monitoring the process status. The Thrift server allows external clients to
interact with Hive just as JDBC/ODBC servers do.

3.2.2 APACHE SQOOP

Apache Sqoop is a tool designed for efficiently transferring bulk data between
Apache Hadoop and external datastores such as relational databases and enterprise data
warehouses. Sqoop is used to import data from external datastores into the Hadoop
Distributed File System or related Hadoop eco-systems like Hive and HBase. Similarly,
Sqoop can also be used to extract data from Hadoop or its eco-systems and export it to
external datastores such as relational databases and enterprise data warehouses. Sqoop
works with relational databases such as Teradata, Netezza, Oracle, MySQL, Postgres,
etc.

For Hadoop developers, the interesting work starts after data is loaded into
HDFS. Developers work with the data in order to find the insights concealed
in that Big Data. For this, the data residing in relational database management
systems needs to be transferred to HDFS, worked with, and possibly transferred
back to relational database management systems. In reality, developers find the
transfer of data between relational database systems and HDFS uninteresting and
tedious, but it is still needed. Developers can always write custom scripts to transfer
data in and out of Hadoop, but Apache Sqoop provides an alternative. Sqoop automates
most of the process and depends on the database to describe the schema of the data to be
imported. Sqoop uses the MapReduce framework to import and export the data, which
provides a parallel mechanism as well as fault tolerance. Sqoop makes developers' lives
easier by providing a command line interface. Developers just need to provide basic
information like the source, destination and database authentication details in the sqoop
command. Sqoop takes care of the remaining part.

Sqoop provides many salient features, such as:

• Full load
• Incremental load
• Parallel import/export
• Import of the results of an SQL query
• Compression
• Connectors for all major RDBMS databases
• Kerberos security integration
• Loading data directly into Hive/HBase
• Support for Accumulo

Sqoop is robust, and has great community support and contributions. Sqoop is
widely used in most Big Data companies to transfer data between relational
databases and Hadoop. Relational database systems are widely used to interact with
traditional business applications, so relational database systems have become one of the
sources that generate Big Data.

As we are dealing with Big Data, Hadoop stores and processes the Big Data
using different processing frameworks like MapReduce, Hive, HBase, Cassandra, Pig,
etc. and storage frameworks like HDFS to achieve the benefits of distributed computing
and distributed storage. In order to store and analyze Big Data from relational databases,
data needs to be transferred between database systems and the Hadoop Distributed File
System (HDFS). Here, Sqoop comes into the picture. Sqoop acts as an intermediate layer
between Hadoop and relational database systems. You can import data and export data
between relational database systems and Hadoop and its eco-systems directly using
Sqoop.

Sqoop provides a command line interface to end users. Sqoop can also be
accessed using Java APIs. A Sqoop command submitted by the end user is parsed by
Sqoop, which launches a map-only Hadoop job to import or export data, because the
Reduce phase is required only when aggregations are needed. Sqoop just imports and
exports the data; it does not do any aggregations.

Sqoop parses the arguments provided in the command line and prepares the map
job. The map job launches multiple mappers depending on the number defined by the
user in the command line. For a Sqoop import, each mapper task is assigned a part of the
data to be imported based on the key defined in the command line. Sqoop distributes the
input data among the mappers equally to get high performance. Each mapper then creates
a connection with the database using JDBC, fetches the part of the data assigned by
Sqoop, and writes it into HDFS, Hive or HBase based on the options provided in the
command line.
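The hedged sketch below is one way such an import might be invoked programmatically through the Sqoop 1.x Java entry point mentioned above; the connection string, credentials, table, target directory and mapper count are placeholders, and the same options can equally be passed to the sqoop command line tool.

import org.apache.hadoop.conf.Configuration;
import org.apache.sqoop.Sqoop;

public class SqoopImportExample {
    public static void main(String[] args) throws Exception {
        // Roughly equivalent to: sqoop import --connect ... --table orders --target-dir ... -m 4
        String[] importArgs = new String[] {
            "import",
            "--connect", "jdbc:mysql://dbhost/retail",   // placeholder source database
            "--username", "analyst",
            "--password", "secret",                      // a password file is preferable in practice
            "--table", "orders",                         // table to import
            "--target-dir", "/data/retail/orders",       // HDFS destination
            "--num-mappers", "4"                         // degree of parallelism (map tasks)
        };

        Configuration conf = new Configuration();
        int exitCode = Sqoop.runTool(importArgs, conf);  // launches the map-only import job
        System.exit(exitCode);
    }
}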

Figure 3.3: Architecture of Sqoop

Sqoop supports incremental loads of a single table or a free-form SQL query, as
well as saved jobs which can be run multiple times to import updates made to a database
since the last import. Imports can also be used to populate tables in Hive or
HBase. Exports can be used to put data from Hadoop into a relational database. Sqoop
got its name from "SQL-to-Hadoop". Sqoop became a top-level Apache project in March
2012.

Informatica provides a Sqoop-based connector from version 10.1. Pentaho
provides open source Sqoop-based connector steps, Sqoop Import and Sqoop Export, in
their ETL suite Pentaho Data Integration since version 4.5 of the software. Microsoft
uses a Sqoop-based connector to help transfer data from Microsoft SQL Server databases
to Hadoop.

3.3 POWERBI

Power BI is not a new name in the BI market; components of Power BI have been
in the market through different time periods. Some components, such as Power BI
Desktop, are so new that they reached general availability on the 24th of July. On the
other hand, Power Pivot was first released in 2010. The Microsoft team worked over a
long period of time to build a big umbrella called Power BI. This big umbrella is not just
a visualization tool such as Tableau, it is not just a self-service data analysis tool such as
PivotTable and PivotChart in Excel, and it is not just a cloud based tool for data analysis.
Power BI is a combination of all of those, and it is much more. With Power BI you can
connect to many data sources (a wide range of data sources is supported, and more data
sources are added to the list every month). You can mash up the data as you want with a
very powerful data mash-up engine. You can model the data, build your star schema, or
add measures and calculated columns with an in-memory, super fast engine. You can
visualize data with a great range of data visualization elements and customize them to tell
the story behind the data.

You can publish your dashboards and visualizations to the cloud and share them
with whomever you want. You can work with on-premises as well as Azure/cloud based
data sources. There are many more things that you can do with Power BI which you
cannot do easily with other products.

Power BI is a cloud based data analysis service which can be used for reporting and
data analysis from a wide range of data sources. Power BI is simple and user friendly
enough that business analysts and power users can work with it and benefit from it.
On the other hand, Power BI is powerful and mature enough that it can be used in
enterprise systems by BI developers for complex data mash-up and modelling scenarios.

Power BI is made up of six main components; these components were released to the
market separately, and they can even be used individually. The components of Power BI
are:

• Power Query: Data mash-up and transformation tool.

• Power Pivot: In-memory tabular data modelling tool.

• Power View: Data visualization tool.

• Power Map: 3D geo-spatial data visualization tool.

• Power Q&A: Natural language question and answering engine.

• Power BI Desktop: A powerful companion development tool for Power BI.

There are many other parts of Power BI as well, such as:

• The PowerBI.com website, through which Power BI data analysis can be shared
and hosted as a cloud service.

• Power BI Mobile Apps: Power BI is supported on Android, Apple, and Windows
phones.

Some of the above components are mature and have been tested for a very long time.
Some of them, however, are new and under frequent, regular updates. Power BI provides
easy-to-follow graphical user interfaces, so a business user can simply use Power Query or
Power BI Desktop to mash up data without writing even a single line of code. On the
other hand, it is so powerful with the Power Query formula language (M) and Data
Analysis Expressions (DAX) that developers can write complex code for data mash-up
and calculated measures to respond to challenging requirements.

So if you have heard somewhere that Power BI is a basic self-service data analysis
tool for business analysts and cannot be used for large enterprise systems, that is not the
case. Power BI technology has been used in many large enterprise-scale systems and
applications, and in many case studies all around the world.

Power BI components can be used individually or in combination. Power
Query has an add-in for Excel 2010 and Excel 2013, and it is embedded in Excel 2016.
The Power Query add-in is available for free for everyone to download and use
alongside an existing Excel installation (as long as it is Excel 2010 or a higher version).
Power Pivot has been available as an add-in since Excel 2010, and from Excel 2013
Power Pivot is embedded in Excel; this add-in is again free to use. Power View is an
add-in for Excel 2013, and it is also free to use. Power Map is an add-in for Excel 2013,
and it is embedded in Excel 2016 as 3D Maps. Power Q&A does not require any
installation or add-in; it is just an engine for question and answering that works on top of
models built in Power BI with the other components.

The components above can be used in combination. You can mash up data with
Power Query and load the result set into a Power Pivot model. You can use the model
you have built in Power Pivot for data visualization in Power View or Power Map.
Fortunately, there is a great development tool that combines the three main components
of Power BI: Power BI Desktop gives you a combined editor for Power Query, Power
Pivot, and Power View. Power BI Desktop is available as a stand-alone product that can
be downloaded separately. With Power BI Desktop you will have all parts of the
solution in one holistic view.

3.3.1 Power Query

Power Query is a data transformation and mash-up engine. Power Query can be
downloaded as an add-in for Excel or used as part of Power BI Desktop. With Power
Query you can extract data from many different data sources. You can read data from
databases such as SQL Server, Oracle, MySQL, DB2, and many other databases. You
can fetch data from files such as CSV, text, and Excel.

You can even loop through a folder. You can use Microsoft Exchange, Outlook,
Azure and so on as a source. You can connect to Facebook as a source, and to many other
applications. You can use online search or use a web address as the source to fetch the
data from that web page. Power Query gives you a graphical user interface to transform
data as you need; adding columns, changing types, transformations for date and time,
text, and many other operations are available. Power Query can load the result set into
Excel or into a Power Pivot model.

Power Query also uses a powerful formula language behind the scenes called M. M
is much more powerful than the GUI built on top of it; there is much functionality in M
that cannot be accessed through the graphical user interface. Power Query and M allow
complex transformations to be applied to the data. The screenshot below shows a view
of the Power Query editor and some of its transformations.

Figure 3.4: Power Query

3.3.2 Power Pivot

Power Pivot is a data modelling engine which works on the xVelocity in-memory
tabular engine. The in-memory engine gives Power Pivot super fast response
times, and the modelling engine provides a great place to build your star
schema, calculated measures and columns, and relationships between entities.
Power Pivot uses the Data Analysis Expressions (DAX) language for building measures
and calculated columns. DAX is a powerful functional language, and there are heaps of
functions for it in the library. The screenshot below shows the relationship diagram of
Power Pivot.

Figure 3.5: Power Pivot

3.3.3 Power View

The main data visualization component of Power BI is Power View. Power
View is an interactive data visualization tool that can connect to data sources and fetch
the metadata to be used for data analysis.

Power View has many charts for visualization in its list. Power View gives you the
ability to filter data for each data visualization element or for the entire report. You can
use slicers for better slicing and dicing of the data. Power View reports are interactive;
the user can highlight part of the data, and the different elements in Power View talk
with each other.

Figure 3.6: Power View

3.3.4 Power Map

Power Map is for visualizing geo-spatial information in 3D mode. When the
visualization renders in 3D mode it gives you another dimension in the visualization. You
can visualize a measure as the height of a column in 3D, and another measure as a heat
map view. You can highlight data based on geographical location such as country, city,
state, and street address.

Power Map works with Bing Maps to get the best visualization based on
geographical information, either latitude and longitude or country, state, city, and street
address. Power Map is an add-in for Excel 2013, and it is embedded in Excel 2016.

3.3.5 Power BI Desktop

Power BI Desktop is the newest component in the Power BI suite. Power BI Desktop
is a holistic development tool for Power Query, Power Pivot and Power View. With
Power BI Desktop you have everything under the same solution, and it is easier to develop
BI and data analysis experiences with it. Power BI Desktop is updated frequently and
regularly. This product was in preview mode for a period of time under the name
Power BI Designer. There are more great features in Power BI Desktop than can fit in a
small paragraph here. The screenshot below shows a view of this tool.

Figure 3.7: Power BI Desktop

3.3.6 Power BI Website

A Power BI solution can be published to the Power BI website. On the Power BI
website the data source can be scheduled to refresh (depending on whether the source
supports scheduled data refresh). Dashboards can be created for the report, and they can
be shared with others. The Power BI website even gives you the ability to slice and dice
the data online without requiring any other tools, just a simple web browser. You can
build reports and visualizations directly on the Power BI site as well. The screenshot
below shows a view of the Power BI site and dashboards built there.

3.3.7 Power Q&A

Power Q&A is a natural language engine for questions and answers about your data
model. Once you have built your data model and deployed it to the Power BI website,
you or your users can ask questions and get answers easily. There are some tips and
tricks about how to build your data model so that it can answer questions in the best
way. Power Q&A works with Power View for the data visualizations.

So users can simply ask questions such as "number of customers by country",
and Power Q&A will answer their question in a map view with numbers as bubbles.

3.3.8 Power BI Mobile Apps

There are mobile apps for the three main mobile OS providers: Android, Apple, and
Windows Phone. These apps give you an interactive view of the dashboards and reports
on the Power BI site, and you can even share them from the mobile app. You can
highlight part of a report, write a note on it and share it with others.

Figure 3.8: Power Q&A

Figure 3.9: Power BI Mobile Apps

3.3.9 Power BI Pricing

Power BI provides many of these services for free. You can create an account
on the PowerBI.com website for free. Many components of Power BI can be used
individually for free as well: you can download and install Power BI Desktop, the Power
Query add-in, the Power Pivot add-in, the Power View add-in, and the Power Map add-in,
all for free. There are, however, some features reserved for the paid offering, Power BI
Pro, which gives you additional capabilities.

CHAPTER 4

EXPERIMENTAL RESULTS

4.1 OVERVIEW

Retail analytics is developed to help companies stay ahead of shopper
trends by applying customer analytics in retail to uncover smart business decision-making.
To quickly grasp how the company is performing, a retail dashboard helps
with strategic oversight and decisions. Data analytics can help in optimizing price by
monitoring competitive pricing and discount strategies. To ensure you have the right
product mix, use data analytics to monitor the breadth and depth of competitors'
catalogues, identify their newly launched products, and interpret and act on
meaningful data insights, including in-store and online shopper patterns.

4.1.1 Problem Definition

Before starting a big data project, it is important to have clarity of purpose,
which can be accomplished with the following steps.

4.1.2 Identify the Problem

Retailers need to have a clear idea of the problems they want to solve and
understand the value of solving them. For instance, a clothing store may find that four of
ten customers looking for blue jeans cannot find their size on the shelf, which results in
lost sales. Using big data, the retailer's goal could be to reduce the frequency of out-of-
stocks to less than one in ten customers; since roughly six in ten shoppers could previously
buy versus nine in ten afterwards, this increases the sales potential for jeans by about 50
percent.

Big data can be used to solve many problems, such as reducing spoilage, increasing
margin, increasing transaction size or sales volume, improving new product launch
success, or increasing customer dwell time in stores. For example, a popular medical
doctor with a health-oriented television show recommended raspberry ketone pills as a
weight reducer to his very large following on Twitter.

This caused a run on the product at health stores and ultimately led to empty
shelves, which took a long time to restock since there was little inventory in the supply
chain.

4.1.3 Identify All the Data Sources

Retailers collect a variety of data from point-of-sale (POS) systems, including the
typical sales by product and sales by customer via loyalty cards. However, there could
be less obvious sources of useful data that retailers can find by "walking the store", a
process intended to provide a better understanding of a product's end-to-end life cycle.
This exercise allows retailers to think about problems with a fresh set of eyes, avoid
thinking of their data in siloed terms, and consider new opportunities that present
themselves when big data is used to find correlations between existing and new data
sources, such as:

Video: Surveillance cameras and anonymous video analytics on digital signs.

Social Media: Twitter, Facebook, blogs and product reviews.

Supply Chain: Orders, shipments, invoices, inventory, sales receipts, and payments.

Advertising: Coupons in flyers and newspapers, and advertisements on TV and in-store
signs.

Environment: Weather, community events, seasons, and holidays.

Product Movement: RFID tags and GPS.

4.1.4 Conduct 'Cause and Effect' Analysis

After listing all the possible data sources, the next step is to explore the data
through queries that help determine which combinations of data could be used to solve
specific retail problems. In this project, that meant using social media data to inform
health stores up to three weeks before a product is likely to go viral, providing enough
time for retail buyers to increase their orders.

The team queried social media feeds and compared them to historic store
inventory levels, and found it was possible to give retail buyers an early warning of
potentially rising product demand. In this case, it was important to find a reliable
leading indicator that could be used to predict how much product should be ordered and
when.
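As a hedged sketch of how such a leading indicator might be tested, the example below computes the Pearson correlation between weekly social media mention counts and sales shifted by a lag of several weeks; the sample figures are invented purely for illustration, not project data.

public class LeadingIndicator {

    // Pearson correlation between two equal-length series.
    static double pearson(double[] x, double[] y) {
        int n = x.length;
        double sx = 0, sy = 0, sxx = 0, syy = 0, sxy = 0;
        for (int i = 0; i < n; i++) {
            sx += x[i]; sy += y[i];
            sxx += x[i] * x[i]; syy += y[i] * y[i];
            sxy += x[i] * y[i];
        }
        double cov = sxy - sx * sy / n;
        double vx = sxx - sx * sx / n;
        double vy = syy - sy * sy / n;
        return cov / Math.sqrt(vx * vy);
    }

    // Correlate mentions in week t with sales in week t + lag.
    static double laggedCorrelation(double[] mentions, double[] sales, int lag) {
        int n = mentions.length - lag;
        double[] x = new double[n];
        double[] y = new double[n];
        for (int i = 0; i < n; i++) {
            x[i] = mentions[i];
            y[i] = sales[i + lag];
        }
        return pearson(x, y);
    }

    public static void main(String[] args) {
        // Invented weekly figures, purely for illustration.
        double[] mentions = {12, 15, 40, 95, 300, 280, 120, 60, 30, 20};
        double[] sales    = {500, 510, 520, 560, 700, 1400, 1900, 1500, 900, 650};

        for (int lag = 0; lag <= 3; lag++) {
            System.out.printf("lag %d weeks: correlation = %.2f%n",
                              lag, laggedCorrelation(mentions, sales, lag));
        }
    }
}

The lag with the strongest correlation suggests how far in advance the social media signal leads demand, and therefore how much warning buyers can realistically be given.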

4.1.5 Determine Metrics That Need Improvement

After clearly defining a problem, it is critical to determine metrics that can
accurately measure the effectiveness of the future big data solution. The recommendation
is to make sure the metrics can be translated into a monetary value (e.g., income from
increased sales, savings from reduced spoilage) and incorporated into a return on
investment (ROI) calculation.

The team focused on metrics related to reducing shrinkage and spoilage, and
increasing profitability by suggesting which products a store should pick for promotion.

4.1.6 Verify the Solution Is Workable

A workable big data solution hinges on presenting the findings in a clear and
acceptable way to employees. As described previously, the solution provided retail
buyers with timely product ordering recommendations displayed with an intuitive UI
running on devices used to enter orders.

4.2 CONSIDER BUSINESS PROCESS RE-ENGINEERING

By definition, the outcomes from a big data project can change the way people
do their jobs. It is important to consider the costs and time of the business process re-
engineering that may be needed to fully implement the big data solution.

4.2.1 Calculate An ROI

Retailers should calculate an ROI to ensure their big data project makes
financial sense. At this point, the ROI is likely to have some assumptions and
unknowns, but hopefully these will have only a second order effect on the financials.

If the return is deemed too low it may make sense to repeat the prior steps and
look for other problems and opportunities to pursue. Moving forward, the following
steps will add clarity to what can be accomplished and raise the confidence level of the
ROI.
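A standard way to express the ROI referred to above, over the benefits and costs attributed to the big data project, is:

\[ \mathrm{ROI} = \frac{\text{monetary benefit attributed to the project} - \text{cost of the project}}{\text{cost of the project}} \times 100\% \]

For example, a solution costing 100,000 that is credited with 150,000 of additional margin has an ROI of (150,000 - 100,000) / 100,000 = 50 percent (figures invented for illustration).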

4.2.2 Data Preparation

Retail data used for big data analysis typically comes from multiple sources and
different operational systems, and may have inconsistencies. Transaction data was
supplied by several systems, which truncated UPC codes or stored them differently due
to special characters in the data field. More often than not, one of the first steps is to
cleanse, enhance, and normalize the data. The data cleansing tasks performed were the
following (a small normalization sketch is shown after the list):

• Populate missing data elements

• Scrub data for discrepancies

• Make data categories uniform

• Reorganize data fields

• Normalize UPC data
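As a hedged illustration of the UPC normalization step, the helper below strips non-digit characters, left-pads to the 12-digit UPC-A length, and verifies the standard UPC-A check digit; it is a hypothetical example, not the project's actual cleansing code.

public final class UpcNormalizer {

    /** Strip non-digits and left-pad with zeros to the 12-digit UPC-A length. */
    public static String normalize(String raw) {
        String digits = raw.replaceAll("\\D", "");
        if (digits.length() > 12) {
            throw new IllegalArgumentException("Too many digits for UPC-A: " + raw);
        }
        StringBuilder sb = new StringBuilder(digits);
        while (sb.length() < 12) {
            sb.insert(0, '0');
        }
        return sb.toString();
    }

    /** UPC-A check: 3 * (sum of odd positions) + (sum of even positions) plus the
     *  check digit must be a multiple of 10 (positions counted from 1). */
    public static boolean hasValidCheckDigit(String upc12) {
        int sum = 0;
        for (int i = 0; i < 11; i++) {
            int d = upc12.charAt(i) - '0';
            sum += (i % 2 == 0) ? 3 * d : d;  // index 0 is position 1 (odd)
        }
        int check = (10 - (sum % 10)) % 10;
        return check == upc12.charAt(11) - '0';
    }

    public static void main(String[] args) {
        String cleaned = normalize("036000-291452");  // well-known example UPC-A with a stray dash
        System.out.println(cleaned + " valid=" + hasValidCheckDigit(cleaned));
    }
}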

4.3 DATA FLOW DIAGRAM

A data flow diagram is a way of representing system requirements in graphic
form. A DFD, also known as a "bubble chart", has the purpose of clarifying system
requirements and identifying major transformations that will become programs in system
design. It is therefore the starting point of the design phase that functionally decomposes
the requirements specifications down to the lowest level of detail. A DFD consists of a
series of bubbles joined by lines. The bubbles represent data transformations and the
lines represent data flows in the system.

4.3.1 DFD Symbols

In a DFD there are four symbols:

A square defines a source or destination of system data.

An arrow identifies a data flow in motion. It is a pipeline through which information flows.

A circle or bubble represents a process that transforms incoming data flows into outgoing data flows.

An open rectangle represents a data store: data at rest, or a temporary repository of data.

4.3.2 Rules in Drawing DFD’S

 Processes should be named and numbered for easy reference.
 The direction of flow is from top to bottom and from left to right.
 Data traditionally flow from source to destination, although they may flow back to a source.
 When a process is exploded into lower levels, the sub-processes are numbered.
 The names of data sources and destinations are written in capital letters; process and data flow names have the first letter of each word capitalized.

A DFD is often used as a preliminary step to create an overview of the system, which can later be elaborated. The DFD is designed to aid communication. The rule of thumb is to explode the DFD down to a functional level, so that each sublevel does not contain an excessive number of processes.

Beyond that, it is best to take each function separately and expand it to show the explosion of a single process. Data flow diagrams are one of the three essential perspectives of the structured systems analysis and design method (SSADM). With a data flow diagram, users are able to visualize how the system will operate, what the system will accomplish, and how the system will be implemented. It is common practice to show the interaction between the system and external agents which act as data sources and data sinks.

Figure 4.1: Dataflow Diagram for Retail Analytics

4.4 MODULES

 MARKETING AND DEVELOPMENT PLANS
 SHARP, STRATEGIC FOCUS ON RETAIL BI
 MERCHANDISING
 STORES
 E-COMMERCE
 CUSTOMER RELATIONSHIP MANAGEMENT

4.4.1 Marketing and Development Plans

A Marketing and Development plan is a comprehensive document or blueprint that outlines a company's advertising and marketing efforts for the coming year. It describes the business activities involved in accomplishing specific marketing objectives within a set time frame. A marketing plan also includes a description of the current marketing position of a business, a discussion of the target market, and a description of the marketing mix that a business will use to achieve its marketing goals.

A Marketing and Development plan has a formal structure, but can be used as a
formal or informal document which makes it very flexible. It contains some historical
data, future predictions, and methods or strategies to achieve the marketing objectives.
Marketing and Development plans start with the identification of customer needs through market research and a determination of how the business can satisfy these needs while generating an acceptable level of return.

This includes processes such as market situation analysis, action programs,


budgets, sales forecasts, strategies and projected financial statements. A marketing and
Development plan can also be described as a technique that helps a business to decide
on the best use of its resources to achieve corporate objectives. It can also contain a full
analysis of the strengths and weaknesses of a company, its organization and its
products.

The Marketing and Development plan shows the steps or actions that will be used to achieve the plan goals. For example, a marketing and development plan may include a strategy to increase the business's market share by fifteen percent. The plan would then outline the objectives that need to be achieved in order to reach that fifteen percent increase in market share. The marketing and development plan can be used to describe the methods of applying a company's marketing resources to fulfil marketing objectives. Marketing and development planning segments the markets, identifies the market position, forecasts the market size, and plans a viable market share within each market segment.

Marketing and development planning can also be used to prepare a detailed case for introducing a new product, revamping current marketing strategies for an existing product, or putting together a company marketing plan to be included in the company's corporate or business plan.

4.4.2 Sharp, Strategic Focus on Retail BI

Given the enormous growth of data, retailers are suffering from their inability to effectively exploit their data assets. As a result, retailers must now capitalize on the capabilities of business intelligence (BI) software, particularly in areas such as customer intelligence, performance management, financial analysis, fraud detection, risk management, and compliance. Furthermore, retailers need to develop their real-time BI delivery ability in order to respond faster to business issues. This module rests on the following statements:

• BI must be fully aligned with business processes.

• The need to respond faster to customers, regulators and management generates the need for real-time automation.

4.4.3 Merchandising

Merchandising is the activity of promoting the sale of goods at retail. Merchandising activities may include display techniques, free samples, on-the-spot demonstrations, pricing, shelf talkers, special offers and other point-of-sale methods. According to the American Marketing Association, merchandising encompasses "planning involved in marketing the right merchandise or service at the right place, at the right time, in the right quantities and at the right price."

Retail Merchandising refers to the various activities which contribute to the sale
of products to the consumers for their end use. Every retail store has its own line of
merchandise to offer to the customers. The display of the merchandise plays an
important role in attracting the customers into the store and prompting them to purchase
as well.

 Merchandising helps in the attractive display of the products at the store in order
to increase their sale and generate revenues for the retail store.

 Merchandising helps in the sensible presentation of the products available for sale to entice customers and make them brand loyalists.

 Promotional Merchandising

 The ways the products are displayed and stocked on the shelves play an
important role in influencing the buying behaviour of the individuals.

4.4.4 Stores

Retail in-store analytics is vital for providing a near-real-time view of key in-store metrics, collecting footfall statistics using people-counting solutions (video-based, thermal-based or Wi-Fi-based). This information is fed into in-store analytics tools for measuring sales conversion by correlating traffic inflow with the number and value of transactions. The deep insights from such near-real-time views of traffic, sales and engagement help in planning store workforce needs based on traffic patterns. Reliable in-store data and analytics solutions enable better comparison of store performances, optimized campaign strategy, and improved assortment and merchandising decisions.

Happiest Minds In-Store Analytics Solution

Challenges related to change management, technology implementation, workforce management and backend operations discourage players from leveraging smart in-store analytics solutions in physical stores. Happiest Minds addresses these challenges. Its In-Store Analytics solution, which is available on cloud and on-premise, leverages digital technologies such as mobile, beacon/location-tracking devices, video-based people counting and analytics to monitor and improve in-store efficiency, employee productivity and sales. The in-store conversion dashboard, with multi-dimensional drill-down views, provides location, time and other drill-down views.

4.4.5 E-Commerce

E-commerce is growing day by day in both B-to-B and B-to-C contexts. The retailing industry, including fashion and grocery retailing, has jumped on the bandwagon and begun to offer e-trading or online shopping. In the early 1990s, companies set up websites with very little understanding of e-commerce and consumer behaviour. E-commerce as a model is totally different from traditional shopping in all respects. Companies have quickly realised the need for a separate e-commerce strategy that is nevertheless part of the overall retail strategy.

Retail strategy involves planning for business growth keeping in view current market trends, opportunities and threats, and building a strategic plan that helps the company deal with these external factors and stay on course to reach its goals. Further, the retail business strategy is concerned with identifying the markets to be in and building the product portfolio and bandwidth, coupled with brand positioning and the various elements of brand visibility, in-store promotions and so on. Business operations are more or less standard and proven models that are adapted as best practices.

4.4.6 Customer Relationship Management

A sound and well-rounded customer relationship management system is an important element in maintaining one's business in the retail marketing industry. Customer relationship management is not only a business strategy but also a powerful tool to connect retail companies with their consumers. Developing this bond is essential in driving the business to the next level of success.

4.4.7 Retail Marketing Landscape

Today's retail marketing landscape is changing, and retail industry organizations struggle to achieve or maintain good marketing communications with existing consumers as well as prospective customers. But unlike the shotgun approach of past years, most retail companies now aim at specific targets. Tracking a particular goal, as opposed to random, scattershot pursuits, allows these companies to channel their marketing efforts to obtain the highest possible return on investment.

Retailers need to identify consumer-related issues, better understand their customers and meet their needs with the company's products and services. By making accurate estimates regarding product or service demand in a given consumer market, one can formulate support and development strategies accordingly.

Integrating a Customer Relationship Management System in Retail Marketing

A retail business sells goods or services, strives to attract more consumers through marketing and advertising, and seeks customer feedback. And these are just some of the many things a retail business has to juggle: product supply, finance, operations, membership databases, and so on. With a customer relationship management integration (CRM Integration) system in place, managers and supervisors of retail businesses can set goals, implement processes and measures, and achieve them in a more efficient manner.

A CRM Integration system can combine several systems to allow a single view. Data can be integrated from consumer lifestyle, expenditure and brand choice. If the CRM system is implemented to track marketing strategies over products and services, it can provide a scientific, data-based approach to marketing and advertising analysis.

The CRM Integration system improves the overall efficiency of marketing campaigns since it allows retail companies to specifically target the right group of consumers. The right message, at the right time, to the right people delivers a positive marketing response from consumers and translates to more sales. Overall, this system provides a clear picture of the consumer segments, allowing retail companies to develop suitable business strategies, formulate appropriate marketing plans for their products or services, and anticipate changes in the business landscape.

4.5 ALGORITHM DEFINITION PROCESS

4.5.1. Footfall

The number of people visiting a shop or a chain of shops in a period of time is


called its footfall. Footfall is an important indicator of how successfully a company's
marketing, brand and format are bringing people into its shops. Footfall is an indicator
of the reach a retailer has, but footfall needs to be converted into sales and this is not
guaranteed to happen. Many retailers have struggled to turn high footfall into sales.

Trends in footfall do tell investors something useful. They may be an indicator of growth and help investors to understand why a retailer's sales growth (or decline) is happening. Investors may want to know whether sales growth is due to an increase in the number of people entering the shops (footfall) or to more success at turning visitors into buyers (which can be seen by comparing footfall to the number of transactions).

Sales growth may also come from selling more items to each buyer (compare
number of transactions to sales volumes), selling more expensive items (an
improvement in the sales mix), or increasing prices. Which of these numbers is
disclosed varies from company to company. Investors should look at whatever is
available.

Calculation of FootFall:

Customer conversion ratio = (Number of transactions / Customer traffic) x 100
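A minimal Hive sketch of footfall and the conversion ratio, assuming the visitors and placedorder tables used in the appendix queries (the datecreated and transactiondate columns follow Section 4.6; the month value 3 is a placeholder for the period of interest):

-- Footfall: number of store visits in the chosen month (3 is a placeholder)
SELECT COUNT(*) AS footfall
FROM visitors
WHERE month(datecreated) = 3;

-- Customer conversion ratio: transactions divided by traffic, as a percentage
SELECT (t.transactions / f.traffic) * 100 AS conversion_ratio
FROM (SELECT COUNT(*) AS transactions FROM placedorder WHERE month(transactiondate) = 3) t
CROSS JOIN (SELECT COUNT(*) AS traffic FROM visitors WHERE month(datecreated) = 3) f;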

4.5.2. Conversion Rate

A conversion is any action that you define. It could be a purchase, an old-fashioned phone call, a contact form submission, a newsletter signup, a social share, a specified length of time a visitor spends on a web page, playing a video, a download, and so on. Many small businesses only have a gut sense of where their new business comes from because they have not been tracking conversions. Knowing your conversion rate(s) is a first step in understanding how your sales funnel is performing and which marketing avenues are giving the greatest return on investment (ROI).

Once you have defined what conversions you want to track, you can calculate
the conversion rate. For the purposes of the following example, let’s call a conversion a
sale. Even if you’re still in the dark ages without a viable website, as long as you are
tracking the number of leads you get and the number of resulting sales (conversions),
you can calculate your conversion rate like so…

Conversion Rate = Total Number of Sales / Number of Leads * 100

Example: Let's say you made 20 sales last year and you had 100 inquiries/leads. Your sales-to-lead conversion rate would be 20%. If you're tracking conversions from website leads, your formula looks like so:

Conversion Rate = Total Number of Sales / Number of Unique Visitors* 100

4.5.3. Visits Frequency

The total number of visits divided by the total number of visitors during the
same timeframe.

Visits / Visitors = Average Visits per Visitor
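One possible Hive formulation over the visitors table used in the appendix (the datecreated and customerid columns follow Section 4.6, and the month filter is a placeholder for the chosen timeframe):

-- Average visits per visitor in a month: total visits / distinct visitors
SELECT COUNT(*) / COUNT(DISTINCT customerid) AS avg_visits_per_visitor
FROM visitors
WHERE month(datecreated) = 3;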

Sophisticated users may also want to calculate average visits per visitor for
different visitor segments. This can be especially valuable when examining the activity
of new and returning visitors or, for online retailers, customers and non-customers.

Presentation
The challenge with presenting average visits per visitor is the need to examine an appropriate timeframe for this KPI to make sense. Depending on the business model it may be daily or it may be annual: search engines like Google or Yahoo can easily justify examining this average on a daily, weekly and monthly basis, whereas marketing sites that support very long sales cycles waste their time with any granularity finer than monthly.

Consider changing the name of the indicator when presenting it, to reflect the timeframe under examination, e.g., "Average Daily Visits per Visitor" or "Average Monthly Visits per Visitor".

Expectation
Expectations for average visits per visitor vary widely by business model.

 Retail sites selling high-consideration items will ideally have a low average
number of visits indicating low barriers to purchase; those sites selling low
consideration items will ideally have a high average number of visits, ideally
indicating numerous repeat purchases. Online retailers are advised to segment
this KPI by customers and non-customers as well as new versus returning
visitors regardless of customer status.

 Advertising and marketing sites will ideally have high average visits per visitor,
a strong indication of loyalty and interest.

 Customer support sites will ideally have low average visits per visitor, suggesting either high satisfaction with the products being supported or easy resolution of problems. Support sites with a high frequency of visits per visitor should closely examine average page views per visit, average time spent on site and call centre volumes, especially if the KPI is increasing (i.e., getting worse).

4.5.4. Repeat Customer Percent

Repeat Customer Rate is calculated by dividing your repeat customers by your total paying customers. Every store has two types of customers: new customers and repeat customers. Knowing your Repeat Customer Rate will show you what percentage of customers are coming back to your store to shop again. This is an important factor in customer lifetime value. Customer acquisition costs (CAC) are among the highest expenses in the e-commerce world. When you pay buckets of marketing money to Google AdWords, Bing Ads, Facebook Ads and Pinterest, you want to ensure you are effectively encouraging customers to shop more than once.

Repeat customer % = (Repeat customers / Total paying customers) x 100
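One way to approximate this in Hive over the placedorder table from the appendix (customerid and orderid follow Section 4.6; treating customers with more than one order as repeat customers is an assumption about how repeat purchases are identified):

-- Repeat customer percentage: customers with more than one order / all paying customers
SELECT SUM(CASE WHEN order_count > 1 THEN 1 ELSE 0 END) / COUNT(*) * 100 AS repeat_customer_pct
FROM (
  SELECT customerid, COUNT(orderid) AS order_count
  FROM placedorder
  GROUP BY customerid
) c;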

4.5.5 Calculation of Top And Bottom 5 Revenue Generators Product/Store

The top and bottom 5 revenue generators by product, store or customer give the demand for a product corresponding to the store and its customers. With this information we can also identify the likelihood of customer purchases, and we get a clearer idea of the stock to be purchased.

We can also increase the availability of a product if we know which products are frequently purchased. Profit will be higher if we reduce the stock of the products generating the least revenue.

The calculation for the revenue generators, as a dashboard expression, is as follows:

=aggr(if(rank(sum(Revenue))<=5 or rank(-sum(Revenue))<=5, Customer), Customer)
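The expression above is a dashboard-side calculation. For reference, a rough Hive equivalent of the top-five half, mirroring query 2 in the appendix and using the placedorder columns from Section 4.6, is sketched below; the bottom five would simply sort ascending.

-- Top 5 revenue-generating customers (for the bottom 5, use ORDER BY revenue ASC)
SELECT customerid, SUM(totalorderamount) AS revenue
FROM placedorder
GROUP BY customerid
ORDER BY revenue DESC
LIMIT 5;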

4.5.6 Average Basket Size

On average, a customer with a basket will buy two items more on any given visit
than a customer left to fumble with two or three purchases, other shopping bags and a
handbag or purse. This is no small matter as store managers attempt to maximise
“basket size,” or “average transaction value” as it is known in the colourful language of
financial analysts.

Average basket size calculation = Turnover ÷ number of people passing through the cash desk. It is one of the merchandising ratios.

It measures the average amount spent by a customer at a point of sale on a visit. Average basket size also refers to the number of items sold in a single purchase, which is the equivalent of total units sold ÷ number of invoices.
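A simple Hive sketch of the unit-based definition, assuming the orders and order-details tables of Section 4.6 (here written placedorder and orderdetails; the join on orderid is an assumption based on those table structures):

-- Average basket size: total units sold / number of orders (invoices)
SELECT SUM(d.quantity) / COUNT(DISTINCT o.orderid) AS avg_basket_size
FROM placedorder o
JOIN orderdetails d ON o.orderid = d.orderid;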

4.5.7 Average Ticket Size

The "average ticket" is the average dollar amount of a final sale, including all items purchased, across a merchant's typical sales. Processors use the average ticket to better understand the merchant's business and to determine how much risk the merchant poses to the processor.

Many merchants find it difficult to determine the size of their average ticket due to widely varying sales amounts, or because the business is new. Often, processors only need an estimate rather than an exact figure. A related concept is a ticket that records all the terms, conditions and basic information of a trade agreement; such a deal ticket is created after the transaction of shares, futures contracts or other derivatives, and is also referred to as a "trading ticket".

Deal Ticket

A deal ticket is similar to a trading receipt. It tracks the price, volume, names and dates of a transaction, along with all other important information. Companies use deal tickets as part of an internal control system, allowing them organized access to the transaction history. Deal tickets can be kept in either electronic or physical form.
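For the retail case, the average ticket can be computed directly from the orders data. A minimal Hive sketch over the placedorder table (columns per Section 4.6), mirroring query 4 in the appendix:

-- Average ticket size: total order value / number of orders
SELECT SUM(totalorderamount) / COUNT(orderid) AS avg_ticket_size
FROM placedorder;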

4.5.8 Cost of Goods Sold (Cogs)

Cost of goods sold refers to the carrying value of goods sold during a particular period.

Costs are associated with particular goods using one of several formulas,
including specific identification, first-in first-out (FIFO), or average cost. Costs include
all costs of purchase, costs of conversion and other costs incurred in bringing the
inventories to their present location and condition. Costs of goods made by the business
include material, labour, and allocated overhead. The costs of those goods not yet sold
are deferred as costs of inventory until the inventory is sold or written down in value.

4.5.9 Gross Margin Return on Investment

GMROI (gross margin return on investment) demonstrates whether a retailer is able to make a profit on its inventory. GMROI is calculated by dividing the gross margin by the inventory cost. And keep in mind what gross margin is: the net sale of goods minus the cost of goods sold.

Retailers need to be well aware of the GMROI on their merchandise, because it lets them determine how much they are earning for every dollar they invest in inventory. Divide the sales by the average cost of inventory and multiply that by the gross margin percentage to get GMROI.
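As a worked illustration under stated assumptions: with annual sales of $200,000, an average inventory cost of $80,000 and a 40 percent gross margin, GMROI = (200,000 / 80,000) x 0.40 = 1.0, i.e., one dollar of gross margin is earned for each dollar invested in inventory. A Hive sketch follows; the inventory_cost table, its avg_cost column and the hard-coded 0.40 margin are hypothetical inputs that are not part of the Section 4.6 schema.

-- GMROI sketch: (sales / average inventory cost) * gross margin percentage
-- inventory_cost and the 0.40 margin are hypothetical, not part of the project schema
SELECT (s.total_sales / i.avg_cost) * 0.40 AS gmroi
FROM (SELECT SUM(totalorderamount) AS total_sales FROM placedorder) s
CROSS JOIN (SELECT AVG(avg_cost) AS avg_cost FROM inventory_cost) i;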
4.5.10 Sell Through Percentage
Sell-through rate is a calculation, commonly represented as a percentage, comparing the amount of inventory a retailer receives from a manufacturer or supplier against what is actually sold to the customer. The period examined (usually one month) is useful when comparing the sale of one product or style against another, or, more importantly, when comparing the sell-through of a specific product from one month to another to examine trends.

So, in your store, if you bought 100 chairs and after 30 days had sold 20 chairs
(meaning you had 80 chairs left in inventory) then your sell-through rate would be 20
percent. Using your beginning of month (BOM) inventory, you divide your sales by that
BOM. It is calculated this way:

Sell through = Sales / Stock on hand (BOM) x 100 (to convert to a percentage)

Or, in our example, (20 / 100) x 100 = 20 percent.

Sell-through is a healthy way to assess whether your inventory investment is returning well for you. For example, a sell-through rate of 5 percent might mean you either have too many units on hand (you are overbought) or have priced them too high. In comparison, a sell-through rate of 80 percent might mean you have too little inventory (you are underbought) or have priced too low. Ultimately, the analysis of the sell-through rate is based on what you want from the merchandise.

4.6 HIVE TABLE STRUCTURE

Field Name Field Type

Customerid Int

City String

State String

Zip Int

Table: 1 Customer address

Field Name Field Type

Customer_accountid String

Account_opened Timestamp

Table: 2 Customer_Details

Field Name Field Type

Customerid Int

FirstName String

LastName String

Datecreated Timestamp

Table: 3 Customer

Field Name Field Type

Itemid String

Itemtypeid String

Mfgname String

Createddate Timestamp

Table: 4 Itemdetails

Field Name Field Type

Orderdetailsid String

Orderid String

Itemid String

Quantity Int

Unitprice Int

Shipmethodid String

Table: 5 Order details

Field Name Field Type

Orderid String

Customerid String

Transactiondate Timestamp

Totalorderamount Int

Taxamount Int

Freightamount Int

Trdisamount Int

Subtotal Int

Paymenttypeid String

Ordersource String

Cancelreturn String

Datecreated Timestamp

Storeid String

Table: 6 Orders

Field Name Field Type

Itemtypeid String

Paymenttype String

Createduser String

Datecreated Timestamp

Modifieduser String

Datemodified Timestamp

Table: 7 Payment_type

Field Name Field Type

Shipmethodid String

Shipmentmethod String

Createduser String

Modifieduser String

Datemodified Timestamp

Table 8: Shipment_type

Field Name Field Type

Id Int

Name String

Region String

City String

Country String

State String

Zip String

Table 9: Store-Details

Field Name Field Type

Itemtype String

Typename String

Table 10: Items
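As an illustration of how these structures map onto Hive DDL, a sketch of the Orders table (Table 6) is shown below. The table name placedorder follows the appendix queries, and the delimited-text storage format is an assumption rather than a recorded project setting.

-- Hypothetical DDL for the Orders table (fields per Table 6)
CREATE TABLE IF NOT EXISTS placedorder (
  orderid          STRING,
  customerid       STRING,
  transactiondate  TIMESTAMP,
  totalorderamount INT,
  taxamount        INT,
  freightamount    INT,
  trdisamount      INT,
  subtotal         INT,
  paymenttypeid    STRING,
  ordersource      STRING,
  cancelreturn     STRING,
  datecreated      TIMESTAMP,
  storeid          STRING
)
ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
STORED AS TEXTFILE;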

4.7 INPUT DESIGN

Once the output requirements have been finalized, the next step is to find out what inputs are needed to produce the desired outputs. Inaccurate input data results in errors in data processing; errors introduced by the data entry operator can be controlled through input design. Input design is the process of converting user-originated inputs to a computer-based format. The objectives of input design should focus on:

 Controlling amount of input


 Avoiding delay
 Avoiding errors in data
 Avoiding extra steps
 Keeping the process simple

Input is considered the process of keying data into the system, which will then be converted to the system format. People all over the world who belong to different cultures and geographies will use a web site, so the input screens provided in the site should be flexible and fast to use. With the highly competitive environment existing today in web-based businesses, the success of the site depends on the number of users logging on to the site and transacting with the company. A smooth, easy-to-use site interface and flexible data entry screens are a must for the success of the site. Easy-to-use hyperlinks help in navigating between different pages of the site in a faster way.

A document should be concise, because longer documents contain more data, take longer to enter, and have a greater chance of data entry errors. The more quickly an error is detected, the closer the error is to the person who generated it, and so the more easily it is corrected. A data input specification is a detailed description of the individual fields (data elements) on an input document together with their characteristics. Error messages should be specific and precise, not general, ambiguous, or vague.

4.8 OUTPUT DESIGN

The outputs from computer systems are required mainly to communicate the results of processing to users. They are also used to provide a permanent ("hard") copy of these results for later consultation. Output is what the client is buying when he or she pays for a development project; inputs, databases, and processes exist to provide output. Printouts should be designed around the output requirements of the user. The output devices are chosen keeping in mind factors such as compatibility of the device with the system, response time requirements, expected print quality and the number of copies needed.

The output to be produced also depends on the following factors:

 Type of user and purpose


 Contents of output
 Format of the output
 Frequency and timing of output
 Volume of output
 Sequence
 Quality

The success or failure of software is decided by the integrity and correctness of the output produced by the system. One of the main objectives behind the automation of business systems is the fast and prompt generation of reports in a short time period. In today's competitive world of business, it is very important for companies to keep themselves up to date about happenings in the business. Prompt and reliable reports are considered to be the lifeline of every business today. At the same time, wrong reports can shatter the business itself and create huge and irreparable losses. So the outputs and reports generated by software systems are of paramount importance.

4.9 IMPLEMENTATION PLANNING

This section describes the implementation of the application and the details of how to access this control from any application. Implementation is the process of assuring that the information system is operational and then allowing users to take over its operation for use and evaluation. Implementation includes the following activities:

 Obtaining and installing the system hardware.

 Installing the system and making it run on its intended hardware.

 Providing user access to the system.

 Creating and updating the database.

 Documenting the system for its users and for those who will be responsible for
maintaining it in the future.

 Making arrangements to support the users as the system is used.

 Transferring on-going responsibility for the system from its developers to the
operations or maintenance part.

 Evaluating the operation and use of the system.

4.10 IMPLEMENTATION PHASE IN THIS PROJECT

The new retail analytics system has been implemented. The present system has been integrated with the already existing hardware. The database was put into Microsoft SQL Server and is accessible through the Internet from any geographic location. Documentation is provided in such a way that it is useful for users and maintainers.

4.11 MAINTENANCE

Maintenance is any work done to change the system after it is operational. The term maintenance describes activities that occur following the delivery of the product to the customer. The maintenance phase of the software life cycle is the time period in which a software product performs useful work. Maintenance activities involve making enhancements to products, adapting products to new environments, and correcting problems.

An integral part of software is maintenance, which requires an accurate maintenance plan to be prepared during software development. It should specify how users will request modifications or report problems. The budget should include resource and cost estimates, and a decision should be taken for the development of every new system feature and its quality objectives. Software maintenance, which can last for 5-6 years (or even decades) after the development process, calls for an effective plan which can address the scope of software maintenance, the tailoring of the post-delivery/deployment process, the designation of who will provide maintenance, and an estimate of the life-cycle costs. The selection and proper enforcement of standards is a challenging task right from the early stages of software engineering, and has not been given definite importance by the concerned stakeholders.

In this project, data is retrieved from the database by searching it. For maintaining data, the project has a backup facility so that an additional copy of the data is kept. Moreover, the project writes the annual data onto a CD, which can be used for later reference.

CHAPTER 5

CONCLUSION

In this research work, the goal is to propose an automatic and intelligent system able to analyze the data on inventory levels, supply chain movement, consumer demand, sales, etc., that are crucial for making marketing and procurement decisions. It also helps the organization to analyze the data against various factors and to slice and dice the data by region, time (year, quarter, month, week and day), product (product category, product type, model and product) and so on. The analytics on demand and supply data can be used for maintaining procurement levels and for taking marketing decisions. Retail analytics gives detailed customer insights along with insights into the business and processes of the organization, with scope and need for improvement.

CHAPTER 6

APPENDIX

SAMPLE SOURCE CODE

UDF FOR TEXT AND SPECIAL CHARACTER TRIMMING IN HIVE

package org.hardik.letsdobigdata;

import org.apache.commons.lang.StringUtils;

import org.apache.hadoop.hive.ql.exec.UDF;

import org.apache.hadoop.io.Text;

public class Strip extends UDF {

private Text result = new Text();

public Text evaluate(Text str, String stripChars) {

if(str == null) {

return null;}

result.set(StringUtils.strip(str.toString(), stripChars));

return result; }

public Text evaluate(Text str) {

if(str == null) {

return null; }

result.set(StringUtils.strip(str.toString()));

return result;}}
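A usage sketch for this UDF once the compiled JAR has been added to the Hive session; the JAR path and the characters being stripped are placeholders, while the itemdetails table and mfgname column follow Section 4.6:

-- Register and call the Strip UDF (JAR path and strip characters are placeholders)
ADD JAR /path/to/retail-udfs.jar;
CREATE TEMPORARY FUNCTION strip AS 'org.hardik.letsdobigdata.Strip';
SELECT strip(mfgname, '#*'), strip(mfgname) FROM itemdetails;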

UDAF FOR MEAN CALCULATION

package org.hardik.letsdobigdata;

import org.apache.commons.logging.Log;

import org.apache.commons.logging.LogFactory;

import org.apache.hadoop.hive.ql.exec.Description;

import org.apache.hadoop.hive.ql.exec.UDAF;

import org.apache.hadoop.hive.ql.exec.UDAFEvaluator;

import org.apache.hadoop.hive.ql.metadata.HiveException;

import org.hardik.letsdobigdata.MeanUDAF.MeanUDAFEvaluator.Column;

@Description(name = "Mean", value = "_FUNC(double) - computes mean", extended =


"select col1, MeanFunc(value) from table group by col1;")

public class MeanUDAF extends UDAF {

// Define Logging

static final Log LOG = LogFactory.getLog(MeanUDAF.class.getName());

public static class MeanUDAFEvaluator implements UDAFEvaluator {

/*** Use Column class to serialize intermediate computation. This is our


groupByColumn */

public static class Column {

double sum = 0;

int count = 0;}

private Column col = null;

public MeanUDAFEvaluator() {

super();

init(); }

// A - Initialize evaluator - indicating that no values have been

// aggregated yet.

public void init() {

LOG.debug("Initialize evaluator");

col = new Column(); }

// B- Iterate every time there is a new value to be aggregated

public boolean iterate(double value) throws HiveException {

LOG.debug("Iterating over each value for aggregation");

if (col == null)

throw new HiveException("Item is not initialized");

col.sum = col.sum + value;

col.count = col.count + 1;

return true;}

// C - Called when Hive wants partially aggregated results.

public Column terminatePartial() {

LOG.debug("Return partially aggregated results");

return col;}

// D - Called when Hive decides to combine one partial aggregation with another

public boolean merge(Column other) {

LOG.debug("merging by combining partial aggregation");

if(other == null) {

return true;}

col.sum += other.sum;

col.count += other.count;

return true; }

// E - Called when the final result of the aggregation needed.

public double terminate(){

LOG.debug("At the end of last record of the group - returning final result");

return col.sum/col.count;

} }}
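A usage sketch for this UDAF, following the example given in its @Description annotation; the JAR path is a placeholder, and the placedorder columns are taken from Section 4.6 (the cast to double matches the iterate(double) signature):

-- Register and call the mean UDAF (JAR path is a placeholder)
ADD JAR /path/to/retail-udfs.jar;
CREATE TEMPORARY FUNCTION MeanFunc AS 'org.hardik.letsdobigdata.MeanUDAF';
SELECT storeid, MeanFunc(cast(totalorderamount AS DOUBLE)) AS mean_order_amount
FROM placedorder
GROUP BY storeid;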

HIVE QUERY FOR ANALYTICS:

1.Select count(*) from visitor where month(created)='month';

Taking the number of visitors in the shop in a particular interval of time.

Footfall means counting the number of persons in the Shopping Mall/Retail Shop.

2. Select customerid, sum(totalorderamount) from placedorder group by customerid order by 2 desc limit 5;

This query is used to get the greatest revenue-producing products/customers/stores.

Select customerid, sum(totalorderamount) from placedorder group by customerid order by 2 asc limit 5;

This query is used to get the least revenue-producing products/customers/stores.

3. Select customerid, count(itemid) from placedorder join deliveredorder on placedorder.orderid = deliveredorder.orderid group by customerid;

This query is used to calculate the average basket value: the count of items purchased by each customer.

4. Select sum(totalorderamount)/count(orderid) from placedorder;

This query is used to get the average ticket size (total order amount divided by the number of orders).

5. Select storeid, customerid, count(1) from visitors group by storeid, customerid having count(1) > 1;

Repeat (regular) customers for each shop.

6. Select customerid, count(*) from visitors where month(datecreated) = 'month' group by customerid;

Number of times each customer visits the shop in a given month.

SCREENSHOTS

Figure A.6.1: UDF for Text splitting in Retail Analytics

Figure A.6.2: UDAF for Retail Analytics(Finding the mean)

Figure A.6.3: Sqoop transformation data

Figure A.6.4: Finding FootFall

Figure A.6.5: Finding conversion rate

Figure A.6.6: Finding basket size

Figure A.6.7: Frequency visitors

Figure A.6.8: Finding least 5 revenue product

Figure A.6.9: Finding Top revenue producing product

Figure A.6.10: Repeated customer percentage

Figure A.6.11: Creating the Integrating Service

Figure A.6.12: Connecting Hive with PowerBI

CHAPTER 7

REFERENCES

1 Apache Hadoop Project. http://hadoop.apache.org

2 Apache Hive Project. http://hadoop.apache.org/hive

3 Cloudera Distribution Including Apache Hadoop (CDH). http://www.cloudera.com

4 Greenplum Database. http://www.greenplum.com

5 GridMix Benchmark. http://hadoop.apache.org/docs/mapreduce/current/gridmix.html

6 Oracle Database - Oracle. http://www.oracle.com

7 PigMix Benchmark. https://cwiki.apache.org/confluence/display/PIG/PigMix

8 Teradata Database - Teradata Inc. http://www.teradata.com

9 TwinFin - Netezza, Inc. http://www.netezza.com/

10 TPC Benchmark DS, 2012.

PUBLICATIONS

Mrs. A. Sangeetha MCA., (M.Phil CS), Mrs. K. Akilandeswari MCA, M.Phil., Assistant Professor, "Survey on Data Democratization using Big Data", paper published in the International Journal of Science and Research.

