CHAPTER 1
INTRODUCTION
The volume, variety, and velocity of data being produced in all areas of the retail
industry are growing exponentially, creating both challenges and opportunities for those
analyzing this data to gain a competitive advantage. Although retailers have been using
data analytics to generate business intelligence for years, the scale and complexity of
today's data necessitate new approaches and tools. The retail industry has entered the
big data era, with access to more information that can be used to create compelling
shopping experiences and forge tighter connections between customers, brands, and
retailers.
Big data can, for example, improve inventory management by providing greater visibility
into the product pipeline. The real-world use cases described in Section 1.4 facilitate an
understanding of the edge-to-cloud business analytics value proposition and, at the same
time, provide insight into the technical architecture and integration needed for
implementation.
In order for retailers to realize the full potential of big data, they must find a new
approach for handling large amounts of data. Traditional tools and infrastructure are
struggling to keep up with larger and more varied data sets coming in at high velocity.
New technologies are emerging to make big data analytics scalable and cost-effective,
such as a distributed grid of computing resources. Processing is pushed out to the nodes
where the data resides, which is in contrast to long-established approaches that retrieve
data for processing from a central point.
1.4 REAL-WORLD USE CASES
To demonstrate big data use cases, consider Living Naturally, which works with roughly
3,000 natural food stores in the United States. Living Naturally develops and markets a
suite of online and mobile programs designed to enhance the productivity and marketing
capabilities of retailers and suppliers. Since 1999, Living Naturally has worked with
thousands of retail customers and 20,000 major industry brands.
Product Pipeline Tracking: When inventory levels are out of sync with demand, make
recommendations to retail buyers to remedy the situation and maximize return for the store.
Market Basket Analysis: When an item goes on sale, let retailers know about adjacent
products that benefit from a sales increase as well (a query sketch of this idea follows below).
Social Media Analysis: Before products go viral on social media, suggest that retail buyers
increase order sizes so they can respond to shifting consumer demand and avoid out-of-stocks.
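A minimal HiveQL sketch of the market basket idea, assuming the orderdetails table described later in Section 4.6; the query is illustrative only and not part of the original implementation:

-- Count how often two items appear in the same order (co-purchase pairs).
SELECT a.itemid AS item_a,
       b.itemid AS item_b,
       COUNT(*) AS times_bought_together
FROM orderdetails a
JOIN orderdetails b ON a.orderid = b.orderid
WHERE a.itemid < b.itemid
GROUP BY a.itemid, b.itemid
ORDER BY times_bought_together DESC
LIMIT 20;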
CHAPTER 2
LITERATURE REVIEW
The proposal is presented to the user for endorsement. The proposal is reviewed at the
user's request and suitable changes are made; this loop ends as soon as the user is
satisfied with the proposal. The preliminary study is a problem-solving activity that
requires intensive communication between the system users and the system developers, and
it includes various feasibility studies. These studies give a rough picture of the
system's activities, which can be used to take decisions regarding the strategies to be
followed for effective system development.
The tasks carried out in system analysis include: examining the documents and the
relevant aspects of the existing system, including its failures and problems; analysing
the findings and recording the results; defining and documenting the proposed system in
outline; testing the proposed design against the known facts; producing a detailed report
to support the proposals; and estimating the resources required to design and implement
the proposed system.
The objective of the system study is to determine whether there is any need for a new
system. All levels of feasibility have to be assessed; this establishes the performance
levels that the new system has to meet.
In the existing system, the addition of new features is very difficult and creates more
overhead, and the system is not easily customizable. The existing system records customer
details, employee details, and new service/event details manually. A change in one module
or any part of the system widely affects other parts or sections, and it is difficult for
the administrator to coordinate all the changes. The present system therefore becomes
more error prone when updates are considered, and it also has many security problems.
Keeping the problem definition in mind, the proposed system is designed to be easily
customizable, user friendly, and easy to extend with new features in the future.
2.2 EXISTING SYSTEM
These new data sources, including clickstream and web logs, social media interactions,
search queries, in-store sensors and video, marketing assets and interactions, and a
variety of public and purchased data sets provided by third parties, have put tremendous
pressure on traditional retail data systems. Retailers face challenges collecting big
data, which are often characterized by "the three Vs":
Volume, or the sheer amount of data being generated across channels and systems.
Velocity, or the speed at which this data is created, collected, and processed.
Variety, or the fact that the fastest growing types of data have little or no inherent
structure, or a structure that changes frequently.
These three Vs have individually and collectively outstripped the capabilities of
traditional storage and analytics solutions that were created in a world of more
predictable data. Compounding the challenges represented by the Vs is the fact that
collected data can have little or no value as individual or small groups of records. But
when explored in the aggregate, in very large quantities or over a longer historical time
frame, the combined data reveals patterns that feed advanced analytical applications. In
addition, retailers seeking to harness the power of big data face a distinct set of
challenges.
These include:
Skill Building. Many of the most valuable big data insights in retail come from
advanced techniques like machine learning and predictive analytics. However, to
use these techniques analysts and data scientists must build new skill sets.
Application of Insight. Incorporating big data analytics into retail will require a
shift in industry practices from those based on weekly or monthly historical
reporting to new techniques based on real-time decision-making and predictive
analytics.
To overcome the drawbacks of the existing system, retailers from across the industry have
turned to Apache Hadoop to meet these needs and to collect, manage, and analyze a wide
variety of data. In doing so, they are gaining new insights into the customer, offering
the right product at the right price, and improving operations and the supply chain.
Typical big data levers by retail function include the following.
Function: Marketing
Big data levers: cross-selling, location-based marketing, customer micro-segmentation,
sentiment analysis, pricing optimization, web-based markets
2.5 FACT FINDING TECHNIQUES
Requirements analysis encompasses all of the tasks that go into the investigation,
scoping, and definition of a new or altered system. The first activity in the analysis
phase is the preliminary investigation. During the preliminary investigation, data
collection is very important, and fact finding techniques can be used for this purpose.
The following fact finding techniques can be used for collecting the data:
Observation - This is a skill which the analysts have to develop. The analyst has to
identify the right information, choose the right person, and look in the right place to
achieve the objective. The analyst should have a clear view of how each department works
and of the workflow between departments, and for this should be a good observer.
2.6 REVIEW OF THE LITERATURE
Text summarization is an old challenge in text mining but is in dire need of researchers'
attention in the areas of computational intelligence, machine learning, and natural
language processing. A set of features is extracted from each sentence that helps to
identify its importance in the document. Reading the full text every time is time
consuming, and a clustering approach is useful to decide which type of data is present in
a document. The concept of k-means clustering for natural language processing of text for
word matching is used, and in order to extract meaningful information from large sets of
offline documents, data mining document clustering algorithms are adopted. In automated
text summarization, two main ideas have emerged to deal with this task: the first is how
a summarizer has to treat a huge quantity of data, and the second is how it may be
possible to produce a human-quality summary.
Depending on the nature of text representation in the summary, a summary can be
categorized as an abstract or an extract. An extract is a summary consisting of sentences
selected from the document that convey its overall idea, while an abstract is a summary
that represents the subject matter of the document in new wording. In general, the task
of document summarization covers generic summarization and query-oriented summarization.
The query-oriented method generates a summary of the documents with respect to a query,
and the generic method summarizes the overall sense of the document without any
additional information.
Traditional document clustering algorithms use the full text of documents to generate
feature vectors. Such methods often produce unsatisfactory results because there is much
noisy information in documents, and the varying-length problem of the documents is also a
significant negative factor affecting performance. This technique retrieves important
sentences with an emphasis on high information richness. The sentences with the maximum
scores are clustered to generate the summary of the document. Thus, k-means clustering is
used to group the highest-scoring sentences of the document and to find the relations
needed to extract the clusters with the most relevant sets in the document, which helps
in producing the summary of the document. The main purpose of the k-means clustering
algorithm here is to generate a summary of predefined length containing the most
informative sentences. The approach for automatic text summarization is demonstrated by
extraction of sentences from the Reuters-21578 corpus, which includes news articles.
CHAPTER 3
RESEARCH METHODOLOGY
Hadoop MapReduce – an implementation of the MapReduce programming
model for large scale data processing.
The term Hadoop has come to refer not just to the base modules and sub-modules above, but
also to the ecosystem, or collection of additional software packages that can be
installed on top of or alongside Hadoop, such as Apache Pig, Apache Hive, Apache HBase,
Apache Phoenix, Apache Spark, Apache ZooKeeper, Cloudera Impala, Apache Flume, Apache
Sqoop, Apache Oozie, and Apache Storm. Apache Hadoop's MapReduce and HDFS components were
inspired by Google papers on MapReduce and the Google File System.
3.1.1 Architecture
Hadoop consists of the Hadoop Common package, which provides file system and OS level
abstractions, a MapReduce engine (either MapReduce/MR1 or YARN/MR2), and the Hadoop
Distributed File System (HDFS). The Hadoop Common package contains the necessary Java
ARchive (JAR) files and scripts needed to start Hadoop.
For effective scheduling of work, every Hadoop-compatible file system should provide
location awareness, that is, the rack (or network switch) on which each worker node sits.
Hadoop applications can use this information to execute code on the node where the data
is and, failing that, on the same rack or switch, to reduce backbone traffic. HDFS uses
this method when replicating data for data redundancy across multiple racks. This
approach reduces the impact of a rack power outage or switch failure; if one of these
hardware failures occurs, the data will remain available.
HDFS stores large files (typically in the range of gigabytes to terabytes) across
multiple machines. It achieves reliability by replicating the data across multiple hosts,
and hence theoretically does not require RAID storage on hosts (but to increase I/O
performance some RAID configurations are still useful). With the default replication
value, 3, data is stored on three nodes: two on the same rack, and one on a different
rack. Data nodes can talk to each other to rebalance data, to move copies around, and to
keep the replication of data high. HDFS is not fully POSIX compliant, because the
requirements for a POSIX file-system differ from the target goals for a Hadoop
application. The trade-off of not having a fully POSIX-compliant file-system is
increased performance for data throughput and support for non-POSIX operations such
as Append.
Figure 3.1: Architecture of Hadoop
The HDFS namespace is kept by a single namenode, which can become a limiting factor as
the number of files grows. HDFS Federation, a new addition, aims to tackle this problem
to a certain extent by allowing multiple namespaces served by separate namenodes.
Moreover, there are some issues in HDFS, namely the small file issue, the scalability
problem, the Single Point of Failure (SPoF), and bottlenecks in huge metadata requests.
An advantage of using HDFS is data awareness between the job tracker and task tracker.
The job tracker schedules map or reduce jobs to task trackers with an awareness of the
data location.
For example: if node A contains data (x,y,z) and node B contains data (a,b,c),
the job tracker schedules node B to perform map or reduce tasks on (a,b,c) and node A
would be scheduled to perform map or reduce tasks on (x,y,z). This reduces the amount
of traffic that goes over the network and prevents unnecessary data transfer. When
Hadoop is used with other file systems, this advantage is not always available. This can
have a significant impact on job-completion times, which has been demonstrated when
running data-intensive jobs.
HDFS was designed for mostly immutable files and may not be suitable for systems
requiring concurrent write operations. HDFS can be mounted directly with a Filesystem in
Userspace (FUSE) virtual file system on Linux and some other Unix systems.
Monitoring end-to-end performance requires tracking metrics from datanodes,
namenodes, and the underlying operating system. There are currently several
monitoring platforms to track HDFS performance, including HortonWorks, Cloudera,
and Datadog.
Above the file systems comes the MapReduce Engine, which consists of one
JobTracker, to which client applications submit MapReduce jobs. The JobTracker
pushes work out to available TaskTracker nodes in the cluster, striving to keep the work
as close to the data as possible. With a rack-aware file system, the JobTracker knows
which node contains the data, and which other machines are nearby. If the work cannot
be hosted on the actual node where the data resides, priority is given to nodes in the
same rack.
This reduces network traffic on the main backbone network. If a TaskTracker fails
or times out, that part of the job is rescheduled. The TaskTracker on each node spawns a
separate Java Virtual Machine process to prevent the TaskTracker itself from failing if
the running job crashes its JVM. A heartbeat is sent from the TaskTracker to the
JobTracker every few minutes to check its status. The JobTracker and TaskTracker status
and information are exposed by Jetty and can be viewed from a web browser.
Known limitations of this approach are:
If one TaskTracker is very slow, it can delay the entire MapReduce job –
especially towards the end of a job, where everything can end up waiting for the
slowest task. With speculative execution enabled, however, a single task can be
executed on multiple slave nodes.
3.1.4 Scheduling
The fair scheduler was developed by Facebook. The goal of the fair scheduler is to
provide fast response times for small jobs and QoS (quality of service) for production
jobs. The fair scheduler has three basic concepts: jobs are grouped into pools, each pool
is assigned a guaranteed minimum share, and excess capacity is split between jobs.
By default, jobs that are uncategorized go into a default pool. Pools have to specify the
minimum number of map slots, reduce slots, and a limit on the number of running jobs.
3.2 APACHE HIVE
Apache Hive is a data warehouse software project built on top of Apache Hadoop for
providing data summarization, query, and analysis. Hive gives an SQL-like interface to
query data stored in various databases and file systems that integrate with Hadoop.
Without Hive, traditional SQL queries must be implemented in the MapReduce Java API to
execute SQL applications and queries over distributed data.
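For illustration, a simple HiveQL aggregation such as the following (using the orders table described in Section 4.6) is compiled by Hive into MapReduce jobs, so no hand-written Java MapReduce code is needed; the query is a sketch rather than part of the original implementation.

-- Total revenue per store; Hive translates this into one or more MapReduce jobs.
SELECT storeid, SUM(totalorderamount) AS total_revenue
FROM orders
GROUP BY storeid;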
3.2.1 Architecture
The major components of the Hive architecture are:
Metastore: Stores metadata for each of the tables, such as their schema and location. It
also includes the partition metadata, which helps the driver track the progress of the
various data sets distributed over the cluster. The data is stored in a traditional RDBMS
format. The metadata helps the driver keep track of the data, so it is highly crucial;
hence, a backup server regularly replicates the data, which can be retrieved in case of
data loss.
Driver: Acts like a controller which receives the HiveQL statements. It starts the
execution of a statement by creating sessions, and it monitors the life cycle and
progress of the execution. It stores the necessary metadata generated during the
execution of a HiveQL statement, and it also acts as a collection point of the data or
query result obtained after the Reduce operation.
Compiler: Performs compilation of the HiveQL query, converting the query into an
execution plan. This plan contains the tasks and steps needed to be performed by Hadoop
MapReduce to get the output as translated by the query. The compiler converts the query
to an abstract syntax tree (AST); after checking for compatibility and compile-time
errors, it converts the AST to a directed acyclic graph (DAG). The DAG divides the
operators into MapReduce stages and tasks based on the input query and data.
Executor: After compilation and optimization, the executor executes the tasks according
to the DAG. It interacts with the job tracker of Hadoop to schedule tasks to be run. It
takes care of pipelining the tasks by making sure that a task with a dependency gets
executed only if all of its prerequisites have run.
CLI, UI, and Thrift Server: The command line interface and user interface allow an
external user to interact with Hive by submitting queries and instructions and monitoring
the process status. The Thrift server allows external clients to interact with Hive just
as JDBC/ODBC clients do.
Apache Sqoop is a tool designed for efficiently transferring bulk data between Apache
Hadoop and external datastores such as relational databases and enterprise data
warehouses. Sqoop is used to import data from external datastores into the Hadoop
Distributed File System or related Hadoop ecosystems such as Hive and HBase. Similarly,
Sqoop can be used to extract data from Hadoop or its ecosystems and export it to external
datastores such as relational databases and enterprise data warehouses. Sqoop works with
relational databases such as Teradata, Netezza, Oracle, MySQL, Postgres, etc.
For Hadoop developers, the interesting work starts after data is loaded into HDFS.
Developers explore the data in order to find the insights concealed in that big data. For
this, data residing in relational database management systems needs to be transferred to
HDFS, explored, and possibly transferred back to relational database management systems.
In the big data world, developers find that transferring data between relational database
systems and HDFS is tedious and uninteresting, yet frequently required. Developers can
always write custom scripts to transfer data in and out of Hadoop, but Apache Sqoop
provides an alternative. Sqoop automates most of the process and depends on the database
to describe the schema of the data to be imported. Sqoop uses the MapReduce framework to
import and export the data, which provides a parallel mechanism as well as fault
tolerance. Sqoop makes a developer's life easier by providing a command line interface;
developers just need to provide basic information like the source, destination, and
database authentication details in the sqoop command, and Sqoop takes care of the rest.
Sqoop provides many salient features like:
Full Load
Incremental Load
Parallel import/export
Import results of SQL query
Compression
Connectors for all major RDBMS Databases
Kerberos Security Integration
Load data directly into Hive/Hbase
Support for Accumulo
As we are dealing with big data, Hadoop stores and processes it using different
processing frameworks such as MapReduce, Hive, HBase, Cassandra, and Pig, and storage
frameworks such as HDFS, to achieve the benefits of distributed computing and distributed
storage. In order to store and analyze big data from relational databases, data needs to
be transferred between the database systems and the Hadoop Distributed File System
(HDFS). Here, Sqoop comes into the picture: Sqoop acts as an intermediate layer between
Hadoop and relational database systems. You can import and export data between relational
database systems and Hadoop and its ecosystems directly using Sqoop.
Sqoop provides a command line interface to the end users, and it can also be accessed
using Java APIs. A Sqoop command submitted by the end user is parsed by Sqoop, which
launches a map-only Hadoop job to import or export the data; a reduce phase is required
only when aggregations are needed, and Sqoop just imports and exports the data without
doing any aggregations.
Sqoop parses the arguments provided on the command line and prepares the map job. The map
job launches multiple mappers, depending on the number defined by the user on the command
line. For a Sqoop import, each mapper task is assigned a part of the data to be imported,
based on the split key defined on the command line. Sqoop distributes the input data
equally among the mappers to get high performance. Each mapper then creates a connection
with the database using JDBC, fetches the part of the data assigned by Sqoop, and writes
it into HDFS, Hive, or HBase based on the options provided on the command line.
Sqoop supports incremental loads of a single table or a free-form SQL query, as well as
saved jobs which can be run multiple times to import updates made to a database since the
last import. Imports can also be used to populate tables in Hive or HBase, and exports
can be used to put data from Hadoop into a relational database. Sqoop gets its name from
SQL plus Hadoop. Sqoop became a top-level Apache project in March 2012.
Informatica provides a Sqoop-based connector from version 10.1. Pentaho provides open
source Sqoop-based connector steps, Sqoop Import and Sqoop Export, in their ETL suite
Pentaho Data Integration since version 4.5 of the software. Microsoft uses a Sqoop-based
connector to help transfer data from Microsoft SQL Server databases to Hadoop.
3.3 POWERBI
Power BI is not a new name in the BI market; components of Power BI have been in the
market for different periods of time. Some components, such as Power BI Desktop, are so
new that they only recently reached general availability, on the 24th of July, while
Power Pivot, on the other hand, was first released in 2010. The Microsoft team worked
over a long period of time to build a big umbrella called Power BI. This big umbrella is
not just a visualization tool such as Tableau, not just a self-service data analysis tool
such as PivotTable and PivotChart in Excel, and not just a cloud-based tool for data
analysis; Power BI is a combination of all of those, and much more. With Power BI you can
connect to many data sources (a wide range of data sources is supported, and more are
added to the list every month). You can mash up the data as you want with a very powerful
data mash-up engine. You can model the data, build your star schema, or add measures and
calculated columns with a very fast in-memory engine. You can visualize data with a great
range of data visualization elements and customize them to tell the story behind the data.
You can publish your dashboards and visualizations in the cloud and share them with
whomever you want. You can work with on-premises as well as Azure/cloud-based data
sources, and there are many more things you can do with Power BI that you cannot easily
do with other products.
Power BI is a cloud-based data analysis service that can be used for reporting and data
analysis from a wide range of data sources. Power BI is simple and user friendly enough
that business analysts and power users can work with it and benefit from it. On the other
hand, Power BI is powerful and mature enough that it can be used in enterprise systems by
BI developers for complex data mash-up and modelling scenarios.
Power BI is made of six main components; these components were released in the market
separately, and they can even be used individually. The components of Power BI are Power
Query, Power Pivot, Power View, Power Map, Power Q&A, and Power BI Desktop.
There are many other parts of Power BI as well, such as:
PowerBI.com website, through which Power BI data analysis can be shared and hosted as a
cloud service.
Some of the above components are mature and have been tested for a long time; others are
newer and receive frequent regular updates. Power BI provides easy-to-follow graphical
user interfaces, so a business user can simply use Power Query or Power BI Desktop to
mash up data without writing a single line of code. On the other hand, it is powerful
enough, with the Power Query Formula Language (M) and Data Analysis Expressions (DAX),
that developers can write complex code for data mash-up and calculated measures to meet
challenging requirements.
Power BI components can be used individually or in combination. Power Query has an add-in
for Excel 2010 and Excel 2013, and it is embedded in Excel 2016. The Power Query add-in
is available for free for everyone to download and use alongside an existing Excel
installation (as long as it is Excel 2010 or a higher version). Power Pivot has been
available as an add-in for Excel 2010, and from Excel 2013 Power Pivot is embedded in
Excel; this add-in is again free to use. Power View is an add-in for Excel 2013 and is
also free to use. Power Map is an add-in for Excel 2013 and is embedded in Excel 2016 as
3D Maps. Power Q&A does not require any installation or add-in; it is just an engine for
question answering that works on top of models built in Power BI with the other components.
Components above can be used in a combination. You can mash up the data with
Power Query, and load the result set into Power Pivot model. You can use the model
you’ve built in Power Pivot for data visualization in Power View or Power Map. There
is fortunately a great development tool that combines three main components of Power
BI. Power BI Desktop is the tool that gives you combined editor of Power Query, Power
Pivot, and Power View. Power BI Desktop is available as stand-alone product that can
be downloaded separately. With Power BI Desktop you will have all parts of the
solution in one holistic view.
Power Query is a data transformation and mash-up engine. Power Query can be downloaded as
an add-in for Excel or be used as part of Power BI Desktop. With Power Query you can
extract data from many different data sources. You can read data from databases such as
SQL Server, Oracle, MySQL, DB2, and many others, and you can fetch data from files such
as CSV, text, and Excel.
You can even loop through a folder. You can use Microsoft Exchange, Outlook, or Azure as
a source. You can connect to Facebook as a source, and to many other applications. You
can use online search or use a web address as the source to fetch the data from that web
page. Power Query gives you a graphical user interface to transform data as you need;
adding columns, changing types, transformations for date, time, and text, and many other
operations are available. Power Query can load the result set into Excel or into the
Power Pivot model.
Power Query also uses a powerful formula language called M as its code behind. M is much
more powerful than the GUI built for it; there is much functionality in M that cannot be
accessed through the graphical user interface, and it allows complex transformations to
be written and applied to the data.
3.3.2 Power Pivot
Power Pivot is the data modelling component of Power BI; as noted above, it provides an
in-memory engine for building star schemas, measures, and calculated columns, and it is
available as an Excel add-in (embedded in Excel from Excel 2013 onwards).
3.3.3 Power View
Power View has many charts for visualization in its list. Power View gives you the
ability to filter data for each data visualization element or for the entire report. You
can use slicers for better slicing and dicing of the data. Power View reports are
interactive; the user can highlight part of the data, and the different elements in Power
View talk to each other.
3.3.4 Power Map
Power Map works with Bing Maps to provide the best visualization based on geographical
information, either latitude and longitude or country, state, city, and street address.
Power Map is an add-in for Excel 2013 and is embedded in Excel 2016 as 3D Maps.
3.3.6 Power BI Website
Power BI data analysis can be published to, shared through, and hosted on the
PowerBI.com website as a cloud service.
3.3.7 Power Q&A
Power Q&A is a natural language engine for questions and answers against your data model.
Once you have built your data model and deployed it to the Power BI website, you or your
users can ask questions and get answers easily. There are some tips and tricks about how
to build your data model so that it can answer questions in the best way. Power Q&A works
with Power View for the data visualizations.
Users can simply ask a question such as "Number of Customers by Country", and Power Q&A
will answer the question in a map view with numbers shown as bubbles.
3.3.8 Power BI Mobile Apps
There are mobile apps for the three main mobile OS providers: Android, Apple, and Windows
Phone. These apps give you an interactive view of the dashboards and reports on the Power
BI site, and you can share them directly from the mobile app. You can highlight part of a
report, write a note on it, and share it with others.
Figure 3.8: Power Q&A
3.3.9 Power BI Pricing
Power BI provides many of these services for free. You can create your account on the
PowerBI.com website right now for free, and many components of Power BI can be used
individually for free as well: you can download and install Power BI Desktop, the Power
Query add-in, the Power Pivot add-in, the Power View add-in, and the Power Map add-in,
all for free. However, some features are reserved for the paid offering, Power BI Pro,
which gives you additional capabilities.
CHAPTER 4
EXPERIMENTAL RESULTS
4.1 OVERVIEW
Retailers need to have a clear idea of the problems they want to solve and
understand the value of solving them. For instance, a clothing store may find four of ten
customers looking for blue jeans cannot find their size on the shelf, which results in lost
sales. Using big data, the retailer's goal could be to reduce the frequency of
out-of-stocks to less than one in ten customers, thus increasing the sales potential for
jeans by over 50 percent (if six of ten shoppers can currently find their size and that
rises to nine of ten, potential sales rise by half).
Big data can be used to solve many problems, such as reducing spoilage, increasing
margins, increasing transaction size or sales volume, improving new product launch
success, or increasing customer dwell time in stores. For example, a popular medical
doctor with a health-oriented television show recommended raspberry ketone pills as a
weight reducer to his very large following on Twitter.
This caused a run on the product at health stores and ultimately led to empty shelves,
which took a long time to restock since there was little inventory in the supply chain.
Retailers collect a variety of data from point-of-sale (POS) systems, including the
typical sales by product and sales by customer via loyalty cards. However, there could
be less obvious sources of useful data that retailers can find by “walking the store”, a
process intended to provide a better understanding of a product's end-to-end life cycle.
This exercise allows retailers to think about problems with a fresh set of eyes, avoid
thinking of their data in siloed terms, and consider new opportunities that present
themselves when big data is used to find correlations between existing and new data
sources, such as:
Supply chain: orders, shipments, invoices, inventory, sales receipts, and payments.
After listing all the possible data sources, the next step is to explore the data through
queries that help determine which combinations of data could be used to solve specific
retail problems. In this case, that meant using social media data to inform health stores
up to three weeks before a product is likely to go viral, providing enough time for
retail buyers to increase their orders.
The team queried social media feeds, compared them to historical store inventory levels,
and found it was possible to give retail buyers an early warning of potentially rising
product demand. In this case, it was important to find a reliable leading indicator that
could be used to predict how much product should be ordered and when.
The team focused on metrics related to reducing shrinkage and spoilage, and
increasing profitability by suggesting which products a store should pick for promotion.
A workable big data solution hinges on presenting the findings in a clear and
acceptable way to employees. As described previously, the solution provided retail
buyers with timely product ordering recommendations displayed with an intuitive UI
running on devices used to enter orders.
By definition, the outcomes from a big data project can change the way people
do their jobs. It is important to consider the costs and time of the business process re-
engineering that may be needed to fully implement the big data solution.
Retailers should calculate an ROI to ensure their big data project makes financial sense.
At this point, the ROI is likely to rest on some assumptions and unknowns, but hopefully
these will have only a second-order effect on the financials.
If the return is deemed too low it may make sense to repeat the prior steps and
look for other problems and opportunities to pursue. Moving forward, the following
steps will add clarity to what can be accomplished and raise the confidence level of the
ROI.
Retail data used for big data analysis typically comes from multiple sources and
different operational systems, and may have inconsistencies. In this case, transaction
data was supplied by several systems, which truncated UPC codes or stored them
differently due to special characters in the data field. More often than not, one of the
first steps is to cleanse, enhance, and normalize the data. The data cleansing tasks
performed are:
4.3.1 DFD Symbols
Beyond that, it is best to take each function separately and expand it to show the
explosion of a single process. Data flow diagrams are one of the three essential
perspectives of the structured systems analysis and design method (SSADM). With a data
flow diagram, users are able to visualize how the system will operate, what the system
will accomplish, and how the system will be implemented. It is common practice to show
the interaction between the system and external agents which act as data sources and
data sinks.
4.4 MODULES
4.4.1 Marketing and Development Plans
A Marketing and Development plan has a formal structure, but it can be used as a formal
or informal document, which makes it very flexible. It contains some historical data,
future predictions, and the methods or strategies used to achieve the marketing
objectives. Marketing and Development plans start with the identification of customer
needs through market research and an assessment of how the business can satisfy these
needs while generating an acceptable level of return.
The Marketing and Development plan shows the steps or actions that will be taken in order
to achieve the plan goals. For example, a marketing and development plan may include a
strategy to increase the business's market share by fifteen percent. The plan would then
outline the objectives that need to be achieved in order to reach the fifteen percent
increase in market share. The marketing and development plan can be used to describe the
methods of applying a company's marketing resources to fulfil marketing objectives.
Marketing and development planning segments the markets, identifies the market position,
forecasts the market size, and plans a viable market share within each market segment.
Marketing and development planning can also be used to prepare a detailed case for
introducing a new product, revamping the current marketing strategies for an existing
product, or putting together a company marketing plan to be included in the company's
corporate or business plan.
Given the enormous growth of data, retailers are suffering from an inability to
effectively exploit their data assets. As a result, retailers must now capitalize on the
capabilities of business intelligence (BI) software, particularly in areas such as
customer intelligence, performance management, financial analysis, fraud detection, risk
management, and compliance. Furthermore, retailers need to develop their real-time BI
delivery ability in order to respond faster to business issues.
4.4.3 Merchandising
Retail Merchandising refers to the various activities which contribute to the sale
of products to the consumers for their end use. Every retail store has its own line of
merchandise to offer to the customers. The display of the merchandise plays an
important role in attracting the customers into the store and prompting them to purchase
as well.
Merchandising helps in the attractive display of the products at the store in order
to increase their sale and generate revenues for the retail store.
Promotional Merchandising
The ways the products are displayed and stocked on the shelves play an
important role in influencing the buying behaviour of the individuals.
4.4.4 Stores
Retail in-store analytics is vital for providing a near-real-time view of key in-store
metrics, collecting footfall statistics using people counting solutions (video based,
thermal based, or Wi-Fi based). This information is fed into in-store analytics tools for
measuring sales conversion by correlating traffic inflow with the number and value of
transactions. The deep insights from such near-real-time views of traffic, sales, and
engagement help in planning store workforce needs based on traffic patterns. Reliable
in-store data and analytics solutions enable better comparison of store performance,
optimized campaign strategy, and improved assortment and merchandising decisions.
4.4.5 E-Commerce
Retail strategy involves planning for business growth keeping in view the current market
trends, opportunities, and threats, and building a strategic plan that helps the company
deal with these external factors and stay on course to reach its goals. Further, the
retail business strategy is concerned with identifying the markets to be in and building
the product portfolio and bandwidth, coupled with brand positioning and the various
elements of brand visibility, in-store promotions, and so on. Business operations are
more or less standard, proven models that are adapted as best practices.
There is a need to identify consumer-related issues, better understand customers, and
meet their needs with the company's products and services. By making accurate estimates
regarding product or service demand in a given consumer market, one can formulate support
and development strategies accordingly.
A CRM integration system can combine several systems to allow a single view. Data can be
integrated from consumer lifestyle, expenditure, and brand choice. If the CRM system is
implemented to track marketing strategies across products and services, then it can
provide a scientific, data-based approach to marketing and advertising analysis.
4.5 ALGORITHM DEFINITION PROCESS
4.5.1 Footfall
Footfall is a count of the number of people visiting a store over a period. Sales growth
may also come from selling more items to each buyer (compare the number of transactions
to sales volumes), selling more expensive items (an improvement in the sales mix), or
increasing prices. Which of these numbers is disclosed varies from company to company, so
investors should look at whatever is available.
Calculation of footfall:
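A minimal HiveQL sketch of this calculation, assuming a visitors table like the one queried in the appendix, with one row per store visit and an assumed visitdate column:

-- Hypothetical footfall sketch: number of visits per store per day.
SELECT storeid,
       to_date(visitdate) AS visit_day,
       COUNT(*) AS footfall
FROM visitors
GROUP BY storeid, to_date(visitdate);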
Once you have defined what conversions you want to track, you can calculate the
conversion rate. For the purposes of the following example, let us call a conversion a
sale. Even without a website, as long as you are tracking the number of leads you get and
the number of resulting sales (conversions), you can calculate your conversion rate.
Example: if you made 20 sales last year and had 100 inquiries/leads, your sales-to-lead
conversion rate would be 20%. If you are tracking conversions from website leads, the
formula is the same: conversion rate = (number of conversions / number of leads) x 100.
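Against the tables in Section 4.6, a store-level conversion rate could be sketched in HiveQL as the ratio of orders to store visits; the visitors table and its use here are assumptions for illustration, not part of the documented design:

-- Hypothetical sketch: store-level conversion rate = orders / visits.
SELECT v.storeid,
       o.order_count / v.visit_count AS conversion_rate
FROM (SELECT storeid, COUNT(*) AS visit_count FROM visitors GROUP BY storeid) v
JOIN (SELECT storeid, COUNT(*) AS order_count FROM orders GROUP BY storeid) o
  ON v.storeid = o.storeid;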
The total number of visits divided by the total number of visitors during the
same timeframe.
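A hedged HiveQL sketch of this KPI, again assuming the visitors table used in the appendix query:

-- Average visits per visitor for each store over the loaded timeframe.
SELECT storeid,
       COUNT(*) / COUNT(DISTINCT customerid) AS avg_visits_per_visitor
FROM visitors
GROUP BY storeid;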
Sophisticated users may also want to calculate average visits per visitor for
different visitor segments. This can be especially valuable when examining the activity
of new and returning visitors or, for online retailers, customers and non-customers.
Presentation
The challenge with presenting average visits per visitor is that you need to examine an
appropriate timeframe for this KPI to make sense. Depending on the business model, it may
be daily or it may be annual: search engines like Google or Yahoo can easily justify
examining this average on a daily, weekly, and monthly basis, while marketing sites that
support very long sales cycles would waste their time with any greater granularity than
monthly.
Consider changing the name of the indicator when presenting it, to reflect the timeframe
under examination, e.g., "Average Daily Visits per Visitor" or "Average Monthly Visits
per Visitor".
Expectation
Expectations for average visits per visitor vary widely by business model.
Retail sites selling high-consideration items will ideally have a low average
number of visits indicating low barriers to purchase; those sites selling low
consideration items will ideally have a high average number of visits, ideally
indicating numerous repeat purchases. Online retailers are advised to segment
this KPI by customers and non-customers as well as new versus returning
visitors regardless of customer status.
Advertising and marketing sites will ideally have high average visits per visitor,
a strong indication of loyalty and interest.
Customer support sites will ideally have a low average number of visits per visitor,
suggesting either high satisfaction with the products being supported or easy resolution
of problems. Support sites with a high frequency of visits per visitor should closely
examine average page views per visit, average time spent on site, and call centre
volumes, especially if the KPI is increasing (i.e., getting worse).
4.5.5 Calculation of Top and Bottom 5 Revenue Generators (Product/Store)
=aggr(if(rank(sum(Revenue))<=5 or rank(-sum(Revenue))<=5, Customer), Customer)
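An equivalent calculation can be sketched in HiveQL with window functions over the orderdetails table from Section 4.6; this is an assumed illustration, not the expression used in the reporting tool above:

-- Top 5 and bottom 5 revenue-generating items.
SELECT itemid, revenue
FROM (
  SELECT itemid,
         SUM(quantity * unitprice) AS revenue,
         RANK() OVER (ORDER BY SUM(quantity * unitprice) DESC) AS top_rank,
         RANK() OVER (ORDER BY SUM(quantity * unitprice) ASC) AS bottom_rank
  FROM orderdetails
  GROUP BY itemid
) ranked
WHERE top_rank <= 5 OR bottom_rank <= 5;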
On average, a customer with a basket will buy two items more on any given visit
than a customer left to fumble with two or three purchases, other shopping bags and a
handbag or purse. This is no small matter as store managers attempt to maximise
“basket size,” or “average transaction value” as it is known in the colourful language of
financial analysts.
The “Average Ticket” is the average dollar amount of a final sale, including all
items purchased, of a merchant’s typical sales. Processors use the Average Ticket to
better understand the merchant’s business and to determine how much risk the merchant
poses to the processor.
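Using the orders table from Section 4.6, the average ticket per store could be sketched in HiveQL as follows (illustrative only):

-- Average ticket: average final order amount per store.
SELECT storeid, AVG(totalorderamount) AS average_ticket
FROM orders
GROUP BY storeid;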
Many merchants find it difficult to determine the size of their average ticket due to
widely varying sales amounts, or because the business is new. Often, processors only need
an estimate and not an exact figure.
A deal ticket is a record of all the terms, conditions, and basic information of a trade
agreement. A deal ticket is created after the transaction of shares, futures contracts,
or other derivatives, and is also referred to as a "trading ticket".
Deal Ticket
A deal ticket is similar to a trading receipt. It tracks the price, volume, names, and
dates of a transaction, along with all other important information. Companies use deal
tickets as part of an internal control system, allowing them organized access to the
transaction history. Deal tickets can be kept in either electronic or physical form.
Cost of goods sold refers to the carrying value of goods sold during a particular period.
Costs are associated with particular goods using one of several formulas,
including specific identification, first-in first-out (FIFO), or average cost. Costs include
all costs of purchase, costs of conversion and other costs incurred in bringing the
inventories to their present location and condition. Costs of goods made by the business
include material, labour, and allocated overhead. The costs of those goods not yet sold
are deferred as costs of inventory until the inventory is sold or written down in value.
So, in your store, if you bought 100 chairs and after 30 days had sold 20 chairs (meaning
you had 80 chairs left in inventory), then your sell-through rate would be 20 percent.
Using your beginning of month (BOM) inventory, you divide your unit sales for the period
by that BOM inventory. It is calculated this way:
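A HiveQL sketch of the sell-through calculation; the bom_inventory table (itemid, bom_units) is assumed here for illustration, since beginning-of-month stock is not part of the schema listed in Section 4.6:

-- Sell-through rate = units sold during the period / beginning-of-month inventory.
SELECT d.itemid,
       SUM(d.quantity) / MAX(b.bom_units) AS sell_through_rate
FROM orderdetails d
JOIN bom_inventory b ON d.itemid = b.itemid
GROUP BY d.itemid;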
Sell-through is a healthy way to assess whether your investment is returning well for
you. For example, a sell-through rate of 5 percent might mean you either have too many
units on hand (you are overbought) or they are priced too high. In comparison, a
sell-through rate of 80 percent might mean you have too little inventory (under-bought)
or the items are priced too low. Ultimately, the analysis of the sell-through rate is
based on what you want from the merchandise.
4.6 HIVE TABLE STRUCTURE
Field Name Field Type
Customerid Int
City String
State String
Zip Int
Customer_accountid String
Account_opened Timestamp
Table: 2 Customer_Details

Field Name Field Type
Customerid Int
FirstName String
LastName String
Datecreated Timestamp
Table: 3 Customer

Field Name Field Type
Itemid String
Itemtypeid String
Mfgname String
Createddate Timestamp
Table: 4 Itemdetails

Field Name Field Type
Orderdetailsid String
Orderid String
Itemid String
Quantity Int
Unitprice Int
Shipmethodid String
Table: 5 Orderdetails

Field Name Field Type
Orderid String
Customerid String
Transactiondate Timestamp
Totalorderamount Int
Taxamount Int
Freightamount Int
Trdisamount Int
Subtotal Int
Paymenttypeid String
Ordersource String
Cancelreturn String
Datecreated Timestamp
Storeid String
Table: 6 Orders

Field Name Field Type
Itemtypeid String
Paymenttype String
Createduser String
Datecreated Timestamp
Modifieduser String
Datemodified Timestamp
Table: 7 Payment_type

Field Name Field Type
Shipmethodid String
Shipmentmethod String
Createduser String
Modifieduser String
Datemodified Timestamp
Table: 8 Shipment_type

Field Name Field Type
Id Int
Name String
Region String
City String
Country String
State String
Zip String
Table: 9 Store-Details

Field Name Field Type
Itemtype String
Typename String
Table: 10 Itemtype
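For reference, the Orders table above could be declared in Hive roughly as follows; the storage format and field delimiter are assumptions for this sketch, not part of the documented design:

-- Hypothetical Hive DDL for the Orders table described above.
CREATE TABLE IF NOT EXISTS orders (
  orderid          STRING,
  customerid       STRING,
  transactiondate  TIMESTAMP,
  totalorderamount INT,
  taxamount        INT,
  freightamount    INT,
  trdisamount      INT,
  subtotal         INT,
  paymenttypeid    STRING,
  ordersource      STRING,
  cancelreturn     STRING,
  datecreated      TIMESTAMP,
  storeid          STRING
)
ROW FORMAT DELIMITED
FIELDS TERMINATED BY ','
STORED AS TEXTFILE;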
4.7 INPUT DESIGN
Once the output requirements have been finalized, the next step is to find out what
inputs are needed to produce the desired outputs. Inaccurate input data results in errors
in data processing. Errors entered by the data entry operator can be controlled by input
design. Input design is the process of converting user-originated inputs to a
computer-based format. The objectives of the input design should focus on:
Input is considered the process of keying data into the system, which will then be
converted to the system format. People all over the world, belonging to different
cultures and geographies, will use a web site, so the input screens provided in the site
should be flexible and fast to use. With the highly competitive environment existing
today in web-based businesses, the success of the site depends on the number of users
logging on to the site and transacting with the company. A smooth and easy-to-use site
interface and flexible data entry screens are a must for the success of the site. The
easy-to-use hyperlinks of the site help in navigating between different pages of the site
in a faster way.
A document should be concise, because longer documents contain more data, take longer to
enter, and have a greater chance of data entry errors. The more quickly an error is
detected, the closer the error is to the person who generated it, and so the error is
more easily corrected. A data input specification is a detailed description of the
individual fields (data elements) on an input document together with their
characteristics. Error messages should be specific and precise, not general, ambiguous,
or vague.
4.8 OUTPUT DESIGN
The outputs from computer systems are required mainly to communicate the
results of processing to users. They are also used to provide permanent (“hard”) copy of
these results for later consultation. Output is what the client is buying when he or she
pays for a development project. Inputs, databases, and processes exist to provide output.
Printouts should be designed around the output requirements of the user. The output
devices are chosen keeping in mind factors such as compatibility of the device with the
system, response time requirements, expected print quality, and the number of copies
needed.
This section describes the implementation of the application and the details of how to
access this control from any application. Implementation is the process of assuring that
the information system is operational and then allowing users to take over its operation
for use and evaluation. Implementation includes the following activities:
Documenting the system for its users and for those who will be responsible for
maintaining it in the future.
Transferring on-going responsibility for the system from its developers to the operations
or maintenance team.
The new system of Web Based Digital Security Surveillance has been
implemented. The present system has been integrated with the already existing
hardware. The database was put into the Microsoft SQL server. The database is
accessible through Internet on any geographic location. Documentation is provided
well in such a way that it is useful for users and maintainers.
4.11 MAINTENANCE
In this phase we retrieve data from the database design by searching the database. For
maintaining the data, the project has a backup facility so that an additional copy of the
data is maintained. Moreover, the project would write the annual data onto a CD, which
could be used for later reference.
CHAPTER 5
CONCLUSION
In this research work, the goal is to propose an automatic and intelligent system able to
analyze data on inventory levels, supply chain movement, consumer demand, sales, and
other factors that are crucial for making marketing and procurement decisions. The system
also helps the organization analyze the data across various dimensions and slice and dice
it by region, time (year, quarter, month, week, and day), product (product category,
product type, model, and product), and so on. The analytics on demand and supply data can
be used for maintaining procurement levels and also for making marketing decisions.
Retail analytics gives detailed customer insights along with insights into the business
and processes of the organization, with scope and need for improvement.
CHAPTER 6
APPENDIX
UDF FOR STRIPPING CHARACTERS
package org.hardik.letsdobigdata;

import org.apache.commons.lang.StringUtils;
import org.apache.hadoop.hive.ql.exec.UDF;
import org.apache.hadoop.io.Text;

// Hive UDF that strips the given characters (or whitespace) from both ends of a string.
public class Strip extends UDF {
  private final Text result = new Text();

  public Text evaluate(Text str, String stripChars) {
    if (str == null) {
      return null;
    }
    result.set(StringUtils.strip(str.toString(), stripChars));
    return result;
  }

  public Text evaluate(Text str) {
    if (str == null) {
      return null;
    }
    result.set(StringUtils.strip(str.toString()));
    return result;
  }
}
UDAF FOR MEAN CALCULATION
package org.hardik.letsdobigdata;
import org.apache.commons.logging.Log;
import org.apache.commons.logging.LogFactory;
import org.apache.hadoop.hive.ql.exec.Description;
import org.apache.hadoop.hive.ql.exec.UDAF;
import org.apache.hadoop.hive.ql.exec.UDAFEvaluator;
import org.apache.hadoop.hive.ql.metadata.HiveException;
import org.hardik.letsdobigdata.MeanUDAF.MeanUDAFEvaluator.Column;

@Description(name = "mean", value = "_FUNC_(double) - computes the mean of a column")
public class MeanUDAF extends UDAF {
  public static class MeanUDAFEvaluator implements UDAFEvaluator {
    // Define Logging
    static final Log LOG = LogFactory.getLog(MeanUDAFEvaluator.class.getName());
    // Partial aggregation buffer: running sum and count of the group.
    public static class Column {
      double sum = 0;
      int count = 0;
    }
    private Column col = null;
    public MeanUDAFEvaluator() {
      super();
      init(); }
    // A - Initialize the evaluator, indicating that no values have been
    // aggregated yet.
    public void init() {
      LOG.debug("Initialize evaluator");
      col = new Column(); }
    // B - Called once for each new value to be aggregated.
    public boolean iterate(double value) throws HiveException {
      if (col == null)
        throw new HiveException("Item is not initialized");
      col.sum = col.sum + value;
      col.count = col.count + 1;
      return true;}
    // C - Called when Hive wants the partially aggregated result.
    public Column terminatePartial() {
      return col;}
    // D - Called when Hive decides to combine one partial aggregation with another
    public boolean merge(Column other) {
      if(other == null) {
        return true;}
      col.sum += other.sum;
      col.count += other.count;
      return true; }
    // E - Called when the final result of the aggregation is needed.
    public double terminate() {
      LOG.debug("At the end of last record of the group - returning final result");
      return col.sum/col.count;
    } }}
HIVE QUERY FOR ANALYTICS
Footfall means counting the number of persons visiting the shopping mall/retail shop.
Select storeid, customerid, count(1) from visitors group by storeid, customerid having
count(1) > 1;
SCREENSHOTS
Figure A.6.3: Sqoop transformation data
Figure A.6.5: Finding conversion rate
Figure A.6.7: Frequency visitors
Figure A.6.9: Finding Top revenue producing product
Figure A.6.11: Creating the Integrating Service
CHAPTER 7
REFERENCES
PUBLICATIONS