Advertising Promotions for Restaurants of Delhi Using Big Data Analytics
Project Report
By
Md. Masum Hossain
Lakshmi
Anugya Saraswat
Pranjal Sinha
A PROJECT REPORT
Submitted to the
Department of Computer Science & Engineering
In partial fulfillment of the requirements
for the award of the degree
of
Bachelor of Technology
April, 2016
DECLARATION
I hereby declare that this project work is my own and that, to the best of my knowledge and belief, it contains no material previously published or written by another person, nor material which has been accepted for the award of any other degree or diploma of the university or any other institute of higher learning, except where due acknowledgment has been made in the text.
Place:
Date:
CERTIFICATE
This is to certify that the report entitled "Advertising Promotions for Restaurants of Delhi Using Big Data Analytics" by Mr. Md. Masum Hossain (Roll No. 120102019), Ms. Lakshmi (Roll No. 120102015), Anugya Saraswat (Roll No. 130102801), and Pranjal Sinha (Roll No. 110101177), submitted to Sharda University towards the fulfillment of the requirements of the degree of Bachelor of Technology, is a record of bona fide final-year project work carried out by them in the Department of Computer Science, School of Engineering and Technology, Sharda University. The results/findings contained in this project have not been submitted in part or in full to any other University/Institute for the award of any other Degree/Diploma.
Signature of Supervisor
Name: Ms. Supriya Khaitan
Designation: Asst. Professor
Place:
Date:
Abstract
This project deals with advertisement promotion for the restaurants of Delhi using Big Data analytics. In the current era, usage of the Internet is increasing rapidly. Distributed processing of mass data across many machines, and personalized search services based on user profiles, have become hotspots of research and development.
Hadoop is a software platform that makes it easy to develop applications for processing mass data. It is written in Java. Hadoop is scalable, economical, efficient, and reliable, and it can be deployed on a big cluster composed of hundreds of low-cost machines.
The main purpose of the analysis is extraction of a food menu matching the user's requirements. The system finds out the user's fields of interest by receiving, organizing, and collating the user's web-browsing information, or by mining data from history, such as browser temporary files and personal favourites.
Choosing food from online stores is quite confusing for most customers. Customers are interested in buying food that has been widely acclaimed or is readily available. On the other side, the owners of restaurants are interested in knowing where their foods stand in the competition. Both these issues are tackled by click-stream analysis. Analysis of click-streams shows how a website is navigated and used by its visitors. Click-stream data of the online food stores of Delhi contains information that is useful for understanding the effectiveness of marketing and merchandising efforts, such as how customers find the store, what food they see, what food they purchase, and finally what their feedback is.
In our project, we have tried to help customers find popular, widely sold food items from the restaurants of Delhi much faster than with the normally available methods. For this purpose we intend to create a platform that maintains user profiles as well as food product advertisements. We intend to use Hadoop for click-stream analysis; based on the user profile and the click-stream analysis, our website will display only those advertisements which help the customer in arriving at a decision. In other words, the contents of the web page displayed to a customer will be determined on the basis of the user profile. With the advancement of technology, a large number of people are buying and selling food online. There are commonly used techniques for online marketing, such as banner ads and email campaigns. Delhi being one of the biggest cosmopolitan cities in India, its restaurants' websites are very busy sites, so effective marketing depends largely on the success of online advertising and on how fast a response is given. Analysing the effectiveness of a website is a matter of concern for corporates that rely on web marketing. Web-based food purchasing involves attracting and retaining customers. Traditional database technology is indeed useful in managing the online stores. However, it has serious limitations. The data generated by mouse clicks and the corresponding logs are too large to be analysed by traditional technology. New technology such as big data is being explored to find a solution to the above problems. In our project, we have decided to use the open source technology Hadoop. Today the term big data draws a lot of attention, but behind the hype there's a simple story. For decades, companies have been making business decisions based on transactional data stored in relational databases. Beyond that critical data, however, is a potential treasure trove of non-traditional, less structured data: blogs, social media, email, sensors, and photographs that can be mined for useful information.
Decreases in the cost of storage and increases in computing power have made it possible to collect large data. As a result, more and more companies are now compelled to include non-traditional yet potentially valuable data with their traditional enterprise data and use it for their business intelligence analysis. To derive real business value from big data, we need the right tools to capture and organize a wide variety of data types from different sources, and to be able to easily analyse it within the context of all enterprise data.
As the world turns toward the Internet for every day-to-day activity, the ability to view and select food of one's choice is of prime importance to restaurants. The same goes for the customers who buy food online: a list of irrelevant advertisements frustrates the user, and slow or delayed responses to queries prove to be the main reason for the failure of most sites. In our project we have tried to make it smooth for users to select products, by filtering the available products based on individual customers' interests. The e-commerce field is emerging rapidly. Advertisers need a way to promote their products in the market, and personalized websites like this one provide that way. The reports provided by our website make it easier for advertisers to know the status of their products and hence to take the necessary measures to recover from losses. To summarize, our project is a demonstration of using a new technology like big data as an adapter between the advertiser and the customer, saving money and time.
ACKNOWLEDGEMENT
Table of Contents
Chapter-1: Project Introduction
1.1: Motivation
1.2: Overview
1.3: Expected outcome
1.4: Gantt chart
1.5: Possible risks
Chapter-2: Methodology
2.1: System view
2.2: System components & functionalities
2.3: Data & relational views
Chapter-3: Design Criteria
3.1: System design
3.2: Design diagrams
3.3: Existing system
3.4: Application areas
3.5: Advantages of proposed system
3.6: System analysis
Chapter-4: Development & Implementation
4.1: Developmental feasibility
4.2: Implementation specifications
Chapter-5: Results & Testing
5.1: Result
5.2: Testing
Chapter-6: Conclusion & Future Improvements
References
List of Tables
2.1: Features of HDFS
List of Figures
2.1: Showing the work progress over time on the project
2.2: Showing the work progress over time
2.3: Showing big data insights
2.4: Architecture of the online food promotion system
2.5: Working of a general restaurant
2.6: How HDFS stores an incoming file across the cluster
2.7: How the MapReduce software framework works
2.8: Outlook of the website where customers will be looking for food
2.9: Section in the website where customers can see other customers' feedback
2.10: Showing the section in the website for customer feedback
2.11: Database for menu and other details entry
2.12: Adding an item to the database
2.13: Customer feedback
2.14: Cluster summary
2.15: Hadoop task tracker
2.16: Failure in connecting to localhost:50070
2.17: Showing jdk1.8.0_77 installed in the system
2.18: Hadoop version
2.19: Successful startup of MapReduce
2.20: Showing the starting of the database file system
CHAPTER-1
Introduction
1.1 Motivation
In restaurants all around the country, it seems to have become acceptable for customers to send back a dish because they don't like it. Not because it was cold, or too salty, or because of an untimely delay in the delivery, but purely because they happened to make the wrong call. We have some sympathy. It happens to us a lot: we order some food and then later change our mind, or we want to see foods of similar kinds together in one place but are too tired to search each restaurant. We're all familiar with that pang of envy when our dining companion's choice looks better than our own. And, with the not inconsiderable costs of eating out in some places, it is only human to begrudge paying for something we've decided we are not going to enjoy. However, what people don't tend to think about is that the moment we complain to a server about a dish, we are not just wasting our valuable time; we subtly alter the very experience we are paying for. Any restaurant will tell us this. Everything changes. We tried to see how to reduce the hassle for customers by not making them wait long to order or receive food. So we are trying to introduce restaurants to the new technology, Hadoop big data, the use of which will make the whole process of ordering food much easier for customers, and it won't be necessary for them to wait and search in the different restaurants.
Customers today spend a lot of time looking for food from different restaurants and trying to order the right food of their choice.
The general database systems available in the market have a slow response time, and as pressure builds up on a database it becomes slower and slower to respond to the queries made of it.[3] But with the recently evolved big data technology we can make sure that the response is much faster, as the database works in parallel. So our aim is to make a database with a faster response rate to reduce the congestion of customers ordering food online. We assume that faster responses and reduced congestion can make online food purchasing much easier and more convenient. We do have a few limitations, as a big data database requires a high-capacity system (8-10 GB of RAM). And of course, real-time big data analytics is not only positive; it also presents some challenges. It requires special computer programming: the standard version of Hadoop is, at the moment, not yet suitable for real-time analysis, so new tools need to be bought and used. There are, however, quite some tools available to do the job, and Hadoop will be able to process data in real time in the future. Using real-time insights requires a different way of working within any workflow: for example, if an organization normally receives insights only once a week, which is very common in a lot of organizations, receiving these insights every second will require a different approach and way of working. Insights require action, and instead of acting on a weekly basis, that action is now required in real time. This will have an effect on the culture. The objective should be to make a place, be it an organization, restaurant, or office, an information-centric place.[7,8]
Real-time insights give employees a sharper view of the business. So, we expect our system to show restaurants a path to staying one step ahead, for the overall betterment, development, and progress of the business, while trying to reduce the hassle for customers.
Figure 2.1: Showing the work progress over time on the project
This Gantt chart shows the time taken by each task in this project, such as creating the website, collecting website details, adding different closures to the website, user details, database, price details, offers and user tactics, cluster 1 size, cluster 2 size, success rate, response rate among clusters, and bug rate.
The key problems have been:[4,5]
I. Cost:
Data collection, aggregation, storage, analysis, and reporting all cost money. On top of this, there will be compliance costs to avoid falling foul of the issues raised in the previous point. These costs can be mitigated by careful budgeting during the planning stages, but getting it wrong at that point can lead to spiralling costs, potentially negating any value added to the bottom line by the data-driven initiative. This is why starting with strategy is so vital. A well-developed strategy will clearly set out what we intend to achieve and the benefits that can be gained, so they can be balanced against the resources allocated to the project. One of the restaurants coordinating with us was worried about the costs of storing and maintaining all the data it was collecting, to the point that it was considering pulling the plug on one particular analytics project, as the costs looked likely to exceed any potential savings. By identifying and eliminating irrelevant data from the project, the restaurant was able to bring costs back under control and achieve its objectives.
II. Bad Data:
We have come across many big data projects that start off on the wrong foot by collecting irrelevant, out-of-date, or erroneous data. This usually comes down to insufficient time being spent on designing the project strategy. The big data gold rush has led to a "collect everything and think about analyzing it later" approach at many organizations. This not only adds to the growing cost of storing the data and ensuring compliance; it leads to large amounts of data that can become outdated very quickly. The real danger here is falling behind the competition. If restaurants are not analyzing the right data, they won't be drawing the right insights that provide value. Meanwhile, competitors most likely will be running their own data projects, and if they are getting it right, they'll take the lead. Working with these restaurants, we were able to show them how to cut the data down, mostly to infographics, which clearly showed the relevant data while omitting a lot of the noise. That is just a simple checklist of the risks that every big data project needs to account for before one cent is spent on infrastructure or data collection. Businesses of all sizes should engage wholeheartedly with big data projects; if they don't, they run the serious risk of being left behind. But they should also be aware of the risks and enter into big data projects with their eyes wide open. In one example, NBC used test audiences but paid for it when many in the audience ranked successful shows such as Seinfeld poorly, and cheap copycats better. Eventually, the marketers discovered people were only responding to familiarity, not quality.[22]
Non-Functional Requirements:
Displaying food menus on the website.
Collecting data from the database.
Establishing the cluster connection.
Users finding the right food easily.
b) External interfaces.
When people visit the website, they will create an ID for individual service, so that they can be provided with the right food from all the possible restaurants of Delhi.
The system will generate a query via HiveQL and fetch results from the big data database, as sketched below.
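A minimal sketch of such a lookup in Java over Hive's JDBC driver: the connection URL assumes a default local HiveServer2 installation, and the table and column names (menu, menu_item, restaurant, price, clicks) are hypothetical stand-ins for our actual schema.

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.Statement;

public class MenuLookup {
    public static void main(String[] args) throws Exception {
        // Load the Hive JDBC driver (assumes HiveServer2).
        Class.forName("org.apache.hive.jdbc.HiveDriver");
        // Assumed default local HiveServer2 address.
        Connection con = DriverManager.getConnection(
                "jdbc:hive2://localhost:10000/default", "", "");
        Statement stmt = con.createStatement();
        // Hypothetical schema: menu(menu_item, restaurant, price, clicks).
        // Return the ten most-clicked matches for the user's search term.
        ResultSet rs = stmt.executeQuery(
                "SELECT menu_item, restaurant, price FROM menu " +
                "WHERE menu_item LIKE '%biryani%' " +
                "ORDER BY clicks DESC LIMIT 10");
        while (rs.next()) {
            System.out.println(rs.getString(1) + "\t"
                    + rs.getString(2) + "\t" + rs.getDouble(3));
        }
        rs.close();
        stmt.close();
        con.close();
    }
}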
c) Performance.
The main reason for choosing big data with Hadoop is speed: when the same query was run against both the RDBMS and HDFS, the results showed the difference.
d) Attributes:
Our software is easily portable once the systems are enabled with, or have installed, HDFS (it includes the HiveQL method).
To maintain the software, owners or supervisors have to update the database from time to time, and users must give their feedback regarding which food they prefer and appreciate.
To ensure security for users, a separate user ID is created, so that users can see only the food they are looking for.
We value customers' personal security: no bank details or any other sensitive details are taken when registering for the first time.
e) Design constraints:
The most difficult part of the project is maintaining a big database, which includes 1000 GB of data from different restaurants, covering customers' needs.
HiveQL is used for fetching data, and HDFS is the base on which the big data store is built. The website is made with HTML5, JavaScript, and CSS.
Resources are limited, since not all the restaurants agree to share their information.
The operating system should be Windows 2000/ME or later.
SAS/ACCESS engine.
We were able to run our test queries through the SAS interface, and they executed in the Hive environment within our Hadoop cluster.[19,21]
Chapter-2
Methodology
This approach allows the restaurants to construct the big data framework of the future while building valuable resources and proprietary knowledge within the company. It provides complete internal control in exchange for duplicating much of the functionality of the current system, and it allows for a future migration to a full-fledged big data platform on which the two systems (conventional and big data) eventually merge.[23]
Functionalities:
The US computer software company is the latest to develop its products to cope with the increasingly complex world of big data. The latest version of its Hadoop-based marketing suite is designed to allow those within the marketing functions of a business to perform analysis on masses of historical data to predict future trends. The company states that the new developments will allow digital marketers to improve a variety of digital marketing strategies, including personalized engagement, multi-channel campaign execution, and media monetization. This will be achieved through the ability to forecast campaign results and perform risk analysis, all through a predictive marketing dashboard. Brad Rencher, senior vice president and general manager of its digital marketing business, said:[5,9]
"In the early days of digital marketing, analytics emerged to tell us what happened and, as analytics got better, why it happened. Then solutions emerged to make it easier to act on data and optimize results."
But the sheer amount of available data presents a challenge: to quickly extract insights and act while those insights are still valuable. The new predictive capabilities within the digital marketing suite address these challenges and help marketers turn big data into a big opportunity.
The announcement is indicative of something we hear a lot at the Big Data Insight Group: that big data analytics is going to become an invaluable tool for different teams throughout all departments of a restaurant, rather than being controlled by the IT department alone. Finance, business, marketing, and product development will all be using masses of data to gain insights into various aspects of their organization and improve their planning and performance accordingly.
To learn more about big data and the opportunities it could present to an organization, regardless of size or sector, one may wish to attend the 1st Big Data Insight Group Forum. So, more than anything, our project will focus on navigating a path to insight and business value using technology's hottest new trend.
Market data can show which keywords and phrases are trending online, the average price of a certain dish, and which menu items are growing or shrinking in popularity.[15,18]
"The food industry for far too long has made decisions based on its gut," says Justin Massa, CEO and founder of Food Genius. The Chicago-based company tracks menu items at more than 350,000 locations and has partnerships with the food delivery services Seamless and GrubHub. Massa said that the data can help restaurants seize opportunities in their niches: "The data is going to tell you something and give you important context, but the thing it comes down to is the identity of the brand. That's going to tell us how we're going to explore that data."
Increasing customer satisfaction with internal data: some technology companies are helping restaurants improve operational efficiency. Avero, a restaurant software company, tracks purchases and voided items at the point of sale. Restaurants use the data to improve server performance, develop tactics to increase sales, and even identify thieving employees, says Sandhya Rao, vice president of marketing and products. Rao says that restaurants may target promotions to certain days or times of the month. According to a company case study, among the 30-plus upscale casual restaurants that Avero works with, the average sales increase was five percent, or 250,000 each, over the course of a year. In the future, mobile apps will allow customers to leave reviews, sign up for loyalty programs, take surveys, and order food through their devices.
"Operators can know customers better and customers can enjoy better experiences, which is an encouraging environment to keep them coming back and bringing their friends," said Jitendra Gupta, CEO. Another company, TapSavvy, is also using customer insights to assist restaurants. After customers eat at one of the restaurants that TapSavvy serves, they receive a tablet to fill out a survey and express criticisms or compliments. By letting customers give feedback while they're still in the restaurant, they're less likely to take out their aggression online, says TapSavvy co-founder Yaniv Tal.
"If a customer leaves unhappy, word spreads very quickly," Tal says.
To be sure, many restaurants are still not using big data. Yet Massa says that they are missing a potential opportunity to improve performance, and he suggests that these businesses might begin by collecting information themselves. For our tests, we simulated a typical data warehouse-type workload where data is loaded in batch, and queries are then executed to answer strategic (not operational) business questions.
Chapter-3
Design Criteria
We have decided to use the open source technology Hadoop. Today the term big data draws a lot of attention, but behind the hype there's a simple story. For decades, companies have been making business decisions based on transactional data stored in relational databases. Beyond that critical data, however, is a potential treasure trove of non-traditional, less structured data: blogs, social media, email, sensors, and photographs that can be mined for useful information.
Decreases in the cost of storage and increases in computing power have made it possible to collect large data. As a result, more and more companies are now compelled to include non-traditional yet potentially valuable data with their traditional enterprise data and use it for their business intelligence analysis. To derive real business value from big data, we need the right tools to capture and organize a wide variety of data types from different sources, and to be able to easily let the customer find the right food in the shortest possible time.
3.2: Design Diagrams
Large organizations will have begun this type of do-it-yourself approach. As we've seen, as open source software, the price of a Hadoop-type framework (free) is attractive, and it is relatively easy, provided the company has employees with the requisite skills, to begin to work up Hadoop applications using in-house data or data stored in the cloud.
Currently many organizations are trying to implement big data technology in their database systems, which is a big step towards fast data transfer. Existing systems in the market have shifted their databases to Hadoop-based big data analytics once their owners got to know its working methodology. Today big companies such as Twitter and Facebook use big data to handle and manage their databases. But experimenting with some Hadoop/NoSQL applications for the marketing department is a far cry from developing a fully integrated big data system capable of capturing, storing, and analyzing large, multi-structured data sets. In fact, successful implementation of enterprise-wide Hadoop frameworks is still relatively uncommon, and mostly the domain of very large and experienced data-intensive companies in the financial services or pharmaceutical industries. As we have seen, many of those big data projects still primarily involve structured data and depend on SQL and relational data models. Large-scale analysis of totally unstructured data, for the most part, still remains in the rarefied realm of powerful Internet tech companies like Google, Yahoo, Facebook, and Amazon, or massive retailers like Wal-Mart.
Although cloud-based tools have obvious advantages, every company has different data and different analytical requirements. Because so many big data projects are still largely based on structured or semi-structured data and relational data models that complement current data management operations, many companies turn to their primary support vendors, like Oracle or SAP, to help them create a bridge between old and new and to incorporate Hadoop-like technologies directly into their existing data management approach. Oracle's Big Data Appliance, for example, is asserted by Oracle to be, once various costs are taken into account, nearly 40% less expensive than an equivalent do-it-yourself system, and to be up and running in a third less time. And, of course, the more fully big data technologies are incorporated directly into a company's IT framework, the more the complexity and potential for data sprawl grow. Depending on configurations, full integration into a single, massive data pool (as advocated by big data purists) means pulling unstructured, unclean data into a company's central data reservoir (even if that data is distributed) and potentially sharing it out to be analyzed, copied, and possibly altered by various users throughout the enterprise, often using different configurations of Hadoop or NoSQL written by different programmers for different reasons. Add to that the need to hire expensive Hadoop programmers and data scientists. For traditional RDB managers, that type of approach raises the specter of untold additional data disasters, costs, and rescue-work requests to already overwhelmed IT staff.
3.5: Advantages of Proposed System
Fraud can be detected the moment it happens, and proper measures can be taken to limit the damage. The financial world is very attractive to criminals. With a real-time safeguard system, attempts to hack into any restaurant's website are notified instantly, so the IT security department of the restaurant can take immediate, appropriate action.
Cost savings: the implementation of real-time big data analytics tools may be expensive, but they will eventually save a lot of money. There is no waiting time for business leaders, and in-memory databases (useful for real-time analytics) also reduce the burden on a restaurant's overall IT landscape, freeing up resources previously devoted to responding to requests for reports.
Better sales insights, which could lead to additional revenue: real-time analytics tell exactly how well sales are doing, and in case an internet retailer sees that a product is doing extremely well, it can take action to avoid missing out or losing revenue.
Keeping up with customer trends: insight into competitive offerings, promotions, or customer movements provides valuable information regarding coming and going customer trends. Faster decisions that better suit the (current) customer can be made with real-time analytics.
3.6 System Analysis
Hadoop is an open-source software framework for storing data and running applications on clusters of commodity hardware. It provides massive storage for any kind of data, enormous processing power, and the ability to handle virtually limitless concurrent tasks or jobs. Big data represents a shift to scalable, elastic computing infrastructure and an explosion in the complexity and variety of available data; the power and value that come from combining disparate data for comprehensive analysis make Hadoop a critical new platform for data-driven enterprises like restaurants.
Our database consists of two main components:
1. HDFS (Hadoop Distributed File System)
2. MapReduce
1. HDFS
The file store is called the Hadoop Distributed File System, or HDFS. HDFS provides scalable, fault-tolerant storage at low cost. The HDFS software detects and compensates for hardware issues, including disk problems and server failure. HDFS stores files across a collection of servers in a cluster. Files are decomposed into blocks, and each block is written to more than one of the servers (the number is configurable, but three is common). This replication provides both fault tolerance (loss of a single disk or server does not destroy a file) and performance (any given block can be read from one of several servers, improving system throughput). HDFS ensures data availability by continually monitoring the servers in a cluster and the blocks that they manage. Individual blocks include checksums. When a block is read, the checksum is verified, and if the block has been damaged it will be restored from one of its replicas. If a server or disk fails, all of the data it stored is replicated to some other node or nodes in the cluster from the collection of replicas. As a result, HDFS runs very well on commodity hardware. It tolerates, and compensates for, failures in the cluster. As clusters get large, even very expensive fault-tolerant servers are likely to fail. Because HDFS expects failure, organizations can spend less on servers and let software compensate for hardware issues.
Figure 2.6: How HDFS stores an incoming file across the cluster
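As a concrete sketch, the following Java fragment writes a file into HDFS through the Hadoop FileSystem API. The NameNode address, the path, and the sample record are assumptions for a default single-node setup; the replication factor of 3 mirrors the common setting described above.

import java.io.BufferedWriter;
import java.io.OutputStreamWriter;
import java.net.URI;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class HdfsWriteExample {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        // Assumed NameNode address for a local single-node cluster.
        FileSystem fs = FileSystem.get(URI.create("hdfs://localhost:9000"), conf);
        // Hypothetical location for a restaurant menu file.
        Path path = new Path("/data/menus/restaurant_menus.txt");
        // Create the file with replication factor 3, so each 64 MB block
        // is stored on three different servers in the cluster.
        BufferedWriter writer = new BufferedWriter(new OutputStreamWriter(
                fs.create(path, true, 4096, (short) 3, 64L * 1024 * 1024)));
        writer.write("Restaurant A\tMutton Biryani\t250\n");
        writer.close();
        fs.close();
    }
}

If a server holding one replica later fails, HDFS re-replicates the affected blocks from the surviving copies, exactly as described above.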
Rack awareness: considers a node's physical location when allocating storage and scheduling tasks.
Minimal data motion: Hadoop moves compute processes to the data on HDFS, not the other way around. Processing tasks can occur on the physical node where the data resides, which significantly reduces network I/O and provides very high aggregate bandwidth.
Utilities: dynamically diagnose the health of the file system and rebalance the data across nodes.
Rollback: allows operators to bring back the previous version of HDFS after an upgrade, in case of human or systemic errors.
Standby NameNode: provides redundancy and supports high availability (HA).
Operability: HDFS requires minimal operator intervention, allowing a single operator to maintain a cluster of thousands of nodes.
Table 2.1: Features of HDFS
2. MapReduce
HDFS delivers inexpensive, reliable, and available file storage. That service alone, though, would not be enough to create the level of interest, or to drive the rate of adoption, that has characterized Hadoop over the past several years. The second major component of Hadoop is the parallel data processing system called MapReduce. Conceptually, MapReduce is simple. MapReduce includes a software component called the job scheduler. The job scheduler is responsible for choosing the servers that will run each user job, and for scheduling the execution of multiple user jobs on a shared cluster. The job scheduler consults the NameNode for the location of all of the blocks that make up the file or files required by a job. Each of those servers is then instructed to run the user's analysis code against its local blocks of data, as the sketch below illustrates.
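To make this concrete, here is a minimal sketch of a MapReduce job in the spirit of this project: it counts clicks per food item from click-stream logs. The input layout (one "user,foodItem" record per line), the class names, and the paths are illustrative assumptions, not the exact code of our system.

import java.io.IOException;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class FoodClickCount {

    // Map phase: emit (foodItem, 1) for every click record.
    public static class ClickMapper
            extends Mapper<LongWritable, Text, Text, IntWritable> {
        private static final IntWritable ONE = new IntWritable(1);
        @Override
        protected void map(LongWritable key, Text value, Context context)
                throws IOException, InterruptedException {
            String[] fields = value.toString().split(",");
            if (fields.length >= 2) {
                context.write(new Text(fields[1].trim()), ONE);
            }
        }
    }

    // Reduce phase: sum the click counts for each food item.
    public static class ClickReducer
            extends Reducer<Text, IntWritable, Text, IntWritable> {
        @Override
        protected void reduce(Text key, Iterable<IntWritable> values,
                Context context) throws IOException, InterruptedException {
            int sum = 0;
            for (IntWritable v : values) {
                sum += v.get();
            }
            context.write(key, new IntWritable(sum));
        }
    }

    public static void main(String[] args) throws Exception {
        // The job scheduler places the map tasks on the nodes that
        // hold the input blocks, as described above.
        Job job = new Job(new Configuration(), "food click count");
        job.setJarByClass(FoodClickCount.class);
        job.setMapperClass(ClickMapper.class);
        job.setCombinerClass(ClickReducer.class);
        job.setReducerClass(ClickReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);
        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}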
Chapter-4:
Development & Implementation
Use High Availability (HA) and dual power supplies for the master node's host machine.
Provide 4-8 GB of memory per processor core, with 6% overhead for virtualization.
Though big data requires powerful systems for processing, it is not impossible to assemble a handful of capable systems rather than using many systems with limited or low power.
When installing Big Data Extensions, we must use VMware vCenter Single Sign-On to provide user authentication. When logging in, authentication is passed to the VMware Single Sign-On server, which can be configured with multiple identity sources such as Active Directory and OpenLDAP (a free, open source implementation of the Lightweight Directory Access Protocol developed by the OpenLDAP Project). On successful authentication, the username and password are exchanged for a security token, which is then used to access VMware components such as Big Data Extensions.
Enable the vSphere Network Time Protocol on the ESXi hosts. The Network Time Protocol (NTP) daemon ensures that time-dependent processes occur in sync across hosts.
Cluster Settings
We had to configure our cluster with the following settings.
Enabled Admission Control and set the desired policy. The default policy is to tolerate one host failure.
Set virtual machine monitoring to Virtual Machine and Application Monitoring.
The Management Network VMkernel port has vMotion and Fault Tolerance logging enabled.
Network Settings
Big Data Extensions deploys clusters on a single network. Virtual machines are deployed
with one NIC, which is attached to a specific Port Group. The environment determines how
this Port Group is configured and which network backs the Port Group.
Either a vSwitch or a vSphere Distributed Switch (vDS) can be used to provide the Port Group backing a Serengeti cluster. A vDS acts as a single virtual switch across all attached hosts, while a vSwitch is per-host and requires the Port Group to be configured manually.
When configuring the network for use with Big Data Extensions, the following ports must be open as listening ports.
Ports 8080 and 8443 are used by the Big Data Extensions plug-in user interface and the Serengeti Command-Line Interface Client.
To avoid having to open a network firewall port to access Hadoop services, log in to the Hadoop client node and access the cluster from that node.
To connect to the Internet (for example, to create an internal Yum repository from which to install Hadoop distributions), we may use a proxy.
Direct Attached Storage
Direct Attached Storage should be attached and configured on the physical controller to present each disk separately to the operating system. This configuration is commonly described as Just a Bunch of Disks (JBOD). We had to create VMFS datastores on Direct Attached Storage using the following disk drive recommendations: 6-8 disk drives per host; the more disk drives per host, the better the performance.
Provide 40 GB or more (recommended) of disk space for the management server and Hadoop template virtual disks.
Resource Requirements for the Hadoop Cluster
Datastore free space should be no less than the total size needed by the Hadoop cluster, plus swap disks for each Hadoop node equal to the memory size requested.
The network must be configured across all relevant hosts and have connectivity with the network in use by the management server.
HA is enabled for the master node if HA protection is needed. We used shared storage in order to use HA or FT to protect the Hadoop master node.
Hardware Requirements
Host hardware should be listed in the VMware Compatibility Guide. To run at optimal performance, install the vSphere and Big Data Extensions environment on the following hardware:
Dual quad-core CPUs or greater with Hyper-Threading enabled. If the computing workload can be estimated, consider using a more powerful CPU.
High Availability (HA) and dual power supplies for the master node's host machine.
4-8 GB of memory per processor core, with 6% overhead for virtualization.
Chapter-5
Results & Testing
5.1: Result:
After collecting the menus of different restaurants around Delhi and Chittagong, we finally gathered around 135 GB of data, which includes pictures, videos, menus, and restaurant details. Our project has a vast area of exploration; we began by creating a database that holds the menu, the delivery details of each food item, time, price, and a picture of the food. We have also added mail as feedback from the user.
Figure 2.8: Outlook of the website where customers will be looking for food
We tried to make the look of the site as good as we could, so that customers feel as comfortable as possible and spend some time looking through the items. It is also user friendly, as all the options are close at hand for a new user. We are even planning to add immediate online help so that customers can get the necessary assistance regarding their orders. All this will allow them to search for and find the right food in a much easier and faster way.
Figure 2.9: Section in the website where customers can see other customers' feedback
The cluster summary shows that we have successfully initiated a Hadoop single-node cluster in our system. It includes the total running nodes, the running MapReduce tasks, the occupied MapReduce task capacity, and the average tasks per node. In the task tracker status we can see Hadoop's running tasks and their status, non-running tasks and their status, tasks from running jobs, and local logs. Successfully installing Hadoop required a successful installation of the JDK. Three primary steps had to be carried out before testing successful integration of the Hadoop cluster in the system:
Formatting the NameNode
Starting the database file system daemons (dfs)
Starting all MapReduce functions in Hadoop (mapred)
These steps allow the user or admin to enable login permission for users to access the database made with Hadoop. After starting dfs and mapred at the command prompt, the output confirms that both are successfully loading into memory; the local host then starts responding to the system for any query by the user, serving entries made by the admin and inquiries requested by users. The corresponding commands are sketched below.
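For reference, a sketch of the corresponding commands, assuming a default Hadoop 1.2.1 single-node installation and run from the Hadoop home directory:

bin/hadoop namenode -format   # one-time formatting of the NameNode
bin/start-dfs.sh              # start the NameNode and DataNode daemons
bin/start-mapred.sh           # start the JobTracker and TaskTracker daemons
jps                           # list the running Hadoop Java processes to verify startup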
5.2: Testing
We tested the single-node cluster that we made and successfully got the two status addresses (the NameNode status page at localhost:50070 and the MapReduce cluster status page) working properly, which at least confirms that a Hadoop single-node cluster has been set up in the system and that data entry into it can be initiated. The process involves multiple steps. We installed Hadoop version 1.2.1, which works on the MapReduce model and will allow users to access the database gradually once it reaches its estimated size of 135 GB. Test cases 1 and 2 were successful, as the local host responded after starting MapReduce and the database file system and formatting the NameNode. This provided the information confirming single-node status on the system we are currently working on.
Next we tested the startup of MapReduce in the system: if it cannot load, it shows an error message; if successful, it asks for the user password and starts the MapReduce connection.
Chapter-6:
Conclusion & Future Improvements
In the first round of testing, no indexes or partitions were used in Hadoop or in the RDBMS for our queries. During subsequent rounds of testing, we used compression and added indexes and partitions to tune the data, following Data Modeling Considerations in Hadoop and Hive [18]. As a final test, we ran the same queries against our final data structures using Impala. Impala bypasses the MapReduce layer used by Hive.
Storage size of the PAGE_CLICK_FACT table by format:
RDBMS: 573.18 GB
Hadoop (Text File): 328.30 GB
Hadoop (Compressed Sequence File): 42.28 GB
Impala (Parquet File): 124.59 GB
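As a sketch of the compression and partitioning tuning mentioned above, using the same kind of Hive JDBC connection as in the earlier example: the SET options are standard Hive/Hadoop settings, while the table name and columns are illustrative assumptions rather than our exact schema.

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.Statement;

public class HiveTuningExample {
    public static void main(String[] args) throws Exception {
        Class.forName("org.apache.hive.jdbc.HiveDriver");
        Connection con = DriverManager.getConnection(
                "jdbc:hive2://localhost:10000/default", "", "");
        Statement stmt = con.createStatement();
        // Standard Hive settings to compress job output.
        stmt.execute("SET hive.exec.compress.output=true");
        stmt.execute("SET mapred.output.compression.type=BLOCK");
        // Hypothetical partitioned, compressed version of the fact table.
        stmt.execute("CREATE TABLE page_click_fact_seq ("
                + "user_id STRING, food_item STRING, click_time STRING) "
                + "PARTITIONED BY (click_date STRING) "
                + "STORED AS SEQUENCEFILE");
        stmt.close();
        con.close();
    }
}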
6.2: Limitations
Due to the limited amount of time, we could not collect an ample amount of data. We also created three clusters, out of which we had to shift to a single mirror cluster, as running three clusters with large capabilities is not an easy task. Moreover, collecting different restaurant menus requires the restaurant authorities to approve and accept the proposal. In this project we wanted to answer the following questions:
What is the percentage of viewers who click on the advertisement?
How many of the visitors actually purchase food from the store?
How much revenue/profit is generated by the advertisement?
But it was never easy to find out the actual benefit, as due to security issues not all companies or restaurants were willing to provide the information. While installing Hadoop, we also had to keep in mind the capacity of the system: it needs more than 3.5 GB of RAM and more than 2.00 GHz of processing speed, which was not easy to get in order to make different clusters. In our computer lab at SET I-214, most of the computers have 1 GB of RAM, and the processing speed was not up to standard; combining the RAM of three computers could make one adequate machine. And if a single cluster goes down, the whole system becomes unresponsive, which is a difficult situation to resolve.
6.4: Scope of Improvement
There is ample scope for us to improve this project, as we are covering only the restaurants of Delhi and not sales support. It focuses on food products, advertisers, and customers. More specifically, the system is designed to manage product information. The system is also used to provide:
1. Statistical analysis and offers to the advertiser.
2. Statistical analysis limited to the most-clicked food products, food matching the user's interest, and report generation about a food's position in the market.
3. Food displayed on the dashboard that matches the customer's profile, with the benefit of collecting the delivery report as feedback, which can be used to improve food matters.
With the advancement of technology, a large number of people are buying and selling food online. There are commonly used techniques for online marketing, such as banner ads and email campaigns. Delhi being one of the biggest cosmopolitan cities in India, its restaurants' websites are very busy sites, so effective marketing depends largely on the success of online advertising. Analyzing the effectiveness of a website is a matter of concern for corporates that rely on web marketing; web-based food purchasing involves attracting and retaining customers. Traditional database technology is indeed useful in managing the online stores. However, it has serious limitations when it comes to analyzing the effectiveness of online ads. Here, we need to find answers to daunting questions such as: what is the percentage of viewers who click on the advertisement?
From this point of view, the study of online food product promotion for Delhi restaurants becomes an important aspect of web marketing. The data generated by mouse clicks and the corresponding logs are too large to be analyzed by traditional technology. New technology such as big data is being explored to find a solution to the above problems. In this report, we have decided to use the open source technology Hadoop. Today the term big data draws a lot of attention, but behind the hype there's a simple story. For decades, companies have been making business decisions based on transactional data stored in relational databases. Beyond that critical data, however, is a potential treasure trove of non-traditional, less structured data: blogs, social media, email, sensors, and photographs that can be mined for useful information. Decreases in the cost of storage and increases in computing power have made it possible to collect large data. As a result, more and more companies are now compelled to include non-traditional yet potentially valuable data with their traditional enterprise data and use it for their business intelligence analysis. To derive real business value from big data, we need the right tools to capture and organize a wide variety of data types from different sources, and to be able to easily analyse it within the context of all our enterprise data.
Conclusion:
As the world turns toward the Internet for every day-to-day activity, the ability to view and select food of one's choice is of prime importance to restaurants. The list of irrelevant advertisements frustrates the user, which proves to be the main reason for the failure of most sites. But our website makes it smooth for users to select products by filtering the available products based on individual customers' interests. The e-commerce field is emerging rapidly. Advertisers need a way to promote their products in the market, and personalized websites like this one provide that way. The reports provided by our website make it easier for advertisers to know the status of their products and hence to take the necessary measures to recover from losses. So, our project is an effort to minimize the hard work of people and restaurants and to get them the things they want in the shortest possible time, from one place. We tried to explore the possibilities of handling people's food matters by connecting restaurants with evolving technology that can reduce hassle and make ordering food online a wonderful experience for customers. We try to make sure customers don't need to visit different restaurant sites: in one place they get all their necessary items, prices, and menus, and can also view feedback from other customers. Although databases don't solve all aspects of the big data problem, several tools, some based on databases, get part-way there. What's missing is twofold: first, we must improve statistics and machine learning algorithms to be more robust and easier for unsophisticated users to apply, while simultaneously training students in their intricacies; second, we need to develop a data management ecosystem around these algorithms so that users can manage and evolve their data, enforce consistency properties over it, and browse, visualize, and understand their algorithms' results.
References
[1] Running Hadoop on Ubuntu Linux (single-node cluster). http://www.michael-noll.com/tutorials/running-hadoop-on-ubuntu-linux-single-node-cluster, December 2012. [page 32,39,49,120-140]
[2] Hadoop: The Definitive Guide (From Avro to ZooKeeper). O'Reilly Media, May 2012. [page 3-7]
[3] The Unified Modeling Language User Guide. Addison-Wesley, October 1998. [page 9-20]
[4] Ari Zilka (CTO, Hortonworks). Hadoop. 2011. [page 3,6,19,27]
[5] Jeffrey Dean and Sanjay Ghemawat. MapReduce: Simplified Data Processing on Large Clusters. 2004. [page 40-60]
[6] ZHAI Yan-dong, YANG Bin, HUANG Lan, and WANG Xiao-wei. Extraction of User Profile Based on the Hadoop Framework. IEEE, 2009. [page 19-29]
[7] LI Chao-qing and LI Xiang-yang. Several Technical Problems and Solutions of Mass Data Processing. Journal of China College of Insurance Management. [page 4,29,33,40,52]
[8] MIKE2.0. Big Data Definition. [page 1-7]
[9] Roger S. Pressman. Software Engineering: A Practitioner's Approach. 7th edition, McGraw-Hill, 2012. [page 9]
[10] Sanjay Ghemawat, Howard Gobioff, and Shun-Tak Leung. The Google File System. 2003. [page 9,12,33]
[11] Alan Gates. Programming Pig. O'Reilly Media, October 2011. [page 3-100]
[12] Kathleen Ting and Jarek Jarcec Cecho. Apache Sqoop Cookbook. O'Reilly Media, July 2013.
[13] Indian restaurant scenario in current days over time. http://india.blogs.nytimes.com/2012/05/01/in-india-more-food-and-more-suffering/?_r=0
[14] Hsinchun Chen, Roger H. L. Chiang, and Veda C. Storey. Business Intelligence and Analytics: From Big Data to Big Impact. [page 19-65]
[15] Viktor Mayer-Schonberger and Kenneth Cukier. Big Data: A Revolution That Will Transform How We Live, Work and Think. John Murray Publishers Ltd. [page 1-7]
[16] Sam Madden (Massachusetts Institute of Technology). From Databases to Big Data.
[17] Jules Polonetsky, Omer Tene, and Joseph Jerome. Benefit-Risk Analysis for Big Data Projects. [page 33-50]
[18] Clark Bradley, Ralph Hollinshead, Scott Kraus, Jason Lefler, and Roshan Taheri. Data Modeling Considerations in Hadoop and Hive. October 2013. [page 17,40,49]
[19] How big data is changing the database scenario for good. http://www.infoworld.com/article/3003647/database/how-big-data-is-changing-the-database-landscape-for-good.html