
Big Data Researchpaper

This technical report discusses how big data analytics can provide suggestions to improve businesses by analyzing reviews and attributes of similar businesses. It also discusses related work on predicting future business attention, identifying product features in reviews, and discovering latent factors from ratings and review text to more accurately predict user preferences.
See discussions, stats, and author profiles for this publication at: https://www.researchgate.net/publication/344377369

Big Data Analytics on Business Improvements

Technical Report · September 2020


DOI: 10.13140/RG.2.2.26285.92642


3 authors:

Bharath Vivekananda Swamy, New York University
Naveen Prasath, Karpagam Academy of Higher Education
Rohith Shridhar Bukkambudhi, Polytechnic Institute of New York University


All content following this page was uploaded by Rohith Shridhar Bukkambudhi on 25 September 2020.



Big Data Analytics on Business Improvements
Bharath Swamy, New York University, New York, [email protected]
Naveen Prasath, New York University, New York, [email protected]
Rohith Shridhar, New York University, New York, [email protected]

Abstract—This analytic suggests actions that can improve a particular business. We leverage Big Data technologies to identify what makes other, similar businesses “tick”. The analytic was developed using a sample dataset provided by Yelp Inc. Questions that it could answer include: How much of a business’s success is based on its location? How important is a particular service to the success or failure of a business?

Keywords—Analytics, Business Reviews, Hive, Pig, HBase, MapReduce, Yelp

I. INTRODUCTION
The idea is to provide suggestions to a selected business on what it could do to improve its sales or profit. The core of this analytic involves analyzing the reviews and tips and the attributes available for that business. Attributes of a business are the features or services it offers, e.g. Wi-Fi, Parking, Delivery.

For example, suppose we have selected a Chinese restaurant to run our analytic on. We select all other similar businesses located in close proximity to the selected business (within a 1 mile radius). We then identify the attributes that are offered by the other businesses but not by our selected business. We also use the reviews and tips belonging to these businesses in our analysis; specifically, we choose reviews that reference the various attributes. We are also looking into the effects of increasing or decreasing operating hours on sales and profit, which in turn helps improve the business.

We are also considering a study on the ratings of chain restaurants: how the ratings of a particular branch of a restaurant compare with those of the rest of the branches as a whole.

II. MOTIVATION
Big data analytics refers to the process of collecting, organizing and analyzing large sets of data ("big data") to discover patterns and other useful information. Not only will big data analytics help you to understand the information contained within the data, but it will also help identify the data that is most important to the business and future business decisions. Big data analysts basically want the knowledge that comes from analyzing the data.

Enterprises are increasingly looking to find actionable insights into their data. Many big data projects originate from the need to answer specific business questions. With the right big data analytics platforms in place, an enterprise can boost sales, increase efficiency, and improve operations, customer service and risk management.

III. RELATED WORK

“Inferring Future Business Attention.” Bryan Hood, Victor Hwang, Jennifer King. Carnegie Mellon University.
This paper predicts the attention a business will receive at different points in time. Specifically, the authors predict the number of reviews that a particular business should have at the current time, and how many reviews it should have in the next six months. The majority of their work involves generating sets of features to use in the model by two methods: simple manipulation of the given data and sentiment analysis on user-provided reviews.

Mining Opinion Features in Customer Reviews
Minqing Hu and Bing Liu, Department of Computer Science, University of Illinois at Chicago
This paper is about summarizing all the customer reviews of a product sold online. The opinion summarization system works in two steps:
1. Identify the features of the product that customers have expressed opinions on (called opinion features) and rank the features according to the frequency with which they appear in the reviews.
2. For each feature, identify how many customer reviews express positive or negative opinions.

Hidden Factors and Hidden Topics: Understanding Rating Dimensions with Review Text
According to the authors of this paper, we can increase the accuracy of predicting user interest in a product by evaluating the user feedback. Traditional methods discard review text, which makes the latent factors they learn difficult to interpret, since they ignore the very text that justifies a user’s rating. In order to predict how a user will respond to a product, we must uncover
the tastes of the user and the properties of the product. For example, in order to predict whether a user will enjoy Harry Potter, it helps to know that the book is about wizards, as well as the user’s level of interest in wizardry. User feedback, which comes in the form of ratings and reviews, is required to discover these dimensions. In this paper, the authors fuse latent rating dimensions (such as those of latent-factor recommender systems) with latent review topics (such as those learned by LDA). This approach has several advantages. First, it yields highly interpretable textual labels for latent rating dimensions, which helps to ‘justify’ ratings with text. Second, it predicts product ratings more accurately by harnessing the information present in review text; this is especially true for new products and users, who may have too few ratings to model their latent factors, yet may still provide substantial information from the text of even a single review. Third, the newly discovered topics can be used to facilitate other tasks such as automated genre discovery, and to identify useful and representative reviews.

Very Fast Estimation for Result and Accuracy of Big Data Analytics: the EARL System
This paper proposes a framework called EARL (Early Accurate Result Library). It works by predicting the learning curve and choosing the appropriate sample size for achieving the desired error bound specified by the user. The error estimation is based on a technique called bootstrapping, which can be applied to arbitrary functions and data distributions. The paper demonstrates (a) the functionality of EARL and its intuitive GUI, whereby first-time users can appreciate the accuracy obtainable from increasing sample sizes by simply viewing the learning curve displayed by EARL, and (b) the usability of EARL, whereby conference participants can interact with the system to quickly estimate the sample sizes needed to obtain the desired accuracies or response times, and then compare them against the accuracies and response times obtained in the actual computations.
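The bootstrapping idea summarized above for EARL can be illustrated with a minimal Python sketch. It is not EARL's implementation: it does not predict the learning curve, it simply walks through a list of candidate sample sizes and stops at the first one whose bootstrap standard error meets the requested bound. The function names, the candidate sizes, and the synthetic ratings are all assumptions made for illustration.

```python
import random
import statistics


def bootstrap_std_error(sample, statistic, n_resamples=200, rng=None):
    """Estimate the standard error of `statistic` on `sample` by bootstrapping."""
    if rng is None:
        rng = random.Random(0)
    estimates = []
    for _ in range(n_resamples):
        # Resample with replacement, keeping the original sample size.
        resample = [rng.choice(sample) for _ in sample]
        estimates.append(statistic(resample))
    return statistics.stdev(estimates)


def smallest_sample_size(data, statistic, error_bound, sizes, rng=None):
    """Return (size, error) for the first candidate size whose bootstrap
    standard error is within `error_bound`, or None if no size qualifies."""
    if rng is None:
        rng = random.Random(0)
    for size in sizes:
        sample = rng.sample(data, size)
        error = bootstrap_std_error(sample, statistic, rng=rng)
        if error <= error_bound:
            return size, error
    return None


if __name__ == "__main__":
    rng = random.Random(42)
    # Synthetic star ratings (1 to 5); purely illustrative, not Yelp data.
    ratings = [rng.randint(1, 5) for _ in range(10_000)]
    print(smallest_sample_size(
        ratings, statistics.mean, error_bound=0.07,
        sizes=[50, 200, 800, 3200], rng=rng))
```

Because the statistic is passed in as a function, the same loop works for a median or any other aggregate, which mirrors the claim that bootstrapping applies to arbitrary functions and data distributions.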
RATE: Recommendation-aware Trust Evaluation in Online Social Networks
In online social networks (OSNs), it is an open challenge to select proper recommenders for predicting the trustworthiness of a target. In real life, people who are close and influential to us can usually make more proper and acceptable recommendations. Based on this observation, the authors present the idea of recommendation-aware trust evaluation (RATE). They further model the recommender selection problem as an optimization problem, with the objectives of higher accuracy, lower risk (uncertainty), and less cost. Four metrics (trustworthiness, influence, uncertainty, and cost) are identified to measure the quality of recommenders. Experimental results, with the real social network data set of opinions, validate the effectiveness of RATE: it can predict trust with higher accuracy (at least 24.64% higher), lower risk, and less cost (about 30% lower). Keywords: recommendation-aware, recommender selection, trust evaluation, online social networks (OSNs).

Improving Restaurants by Extracting Subtopics from Yelp Reviews
This paper describes how latent subtopics are discovered from Yelp restaurant reviews. For this, the authors use the online Latent Dirichlet Allocation (LDA) algorithm. For problems with high-dimensional data, it becomes difficult to extract prominent or relevant features. However, such data often has a simpler underlying structure, such as topics in documents, user preferences, or themes in discussions, so these effects can be approximated with lower-dimensional models such as LSI or LDA. By breaking the reviews down into latent subtopics using LDA, the authors are able to predict a restaurant’s star rating per hidden topic. These per-topic ratings make it possible to infer the reasons behind a restaurant’s Yelp rating other than food quality.

IV. DESIGN
The Yelp dataset is parsed and stored in HDFS as NoSQL files, which gives us the Business, Review, User, Check-in, and Tip data. Sentiment analysis using MapReduce is performed on the Review and Tip data, and its output, together with the other inputs, is sent to a predictive algorithm that also uses MapReduce. The raw output is shown as graphs.

Yelp Dataset: The dataset is provided by Yelp (https://www.yelp.com/dataset_challenge/dataset).
Because the dataset is very large, we use HBase to store the data. HBase is the Hadoop database, a distributed, scalable, big data store. We parsed the JSON data using the JSON-simple package and populated the distributed database with it. We then use Hive and Pig to process the data and obtain the desired results.
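The JSON parsing step mentioned above can be sketched as follows. The authors used the JSON-simple package; the sketch below uses Python's standard `json` module purely for illustration and is not their code. It assumes the input holds one JSON object per line and that nested objects should become flat columns named `parent_child`; the sample record, its field names, and that naming scheme are hypothetical.

```python
import json


def flatten(record, parent_key="", sep="_"):
    """Flatten nested JSON objects into a single-level dict of column names."""
    flat = {}
    for key, value in record.items():
        column = f"{parent_key}{sep}{key}" if parent_key else key
        if isinstance(value, dict):
            # Recurse into nested objects, prefixing with the parent name.
            flat.update(flatten(value, column, sep))
        else:
            flat[column] = value
    return flat


def parse_json_lines(lines):
    """Parse one JSON object per line and yield a flattened row for each."""
    for line in lines:
        line = line.strip()
        if line:  # skip blank lines
            yield flatten(json.loads(line))


if __name__ == "__main__":
    # Hypothetical review record used only to exercise the functions.
    sample = [
        '{"review_id": "r1", "stars": 4, '
        '"votes": {"cool": 1, "funny": 0, "useful": 2}}',
    ]
    for row in parse_json_lines(sample):
        print(row)
```

Flat rows of this kind map directly onto CSV columns or HBase column qualifiers, which is why flattening before loading keeps the later Hive and Pig queries simple.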
Database Schema Design:
This simple database schema gives a better understanding of how the entities are related to each other. It allowed us to proceed with the project with a clear idea.

Import Data into HBase
The data in the dataset is in the form of JSON objects. We wrote a Java program to convert the JSON files into CSV files. To make the data easier to work with, we flatten the nested fields. For instance:
1. Reviews have a field called Votes, as below:
{"cool":1,"funny":1,"useful":1}
We broke it down into 3 fields: Votes_Cool, Votes_Funny, Votes_Useful.
2. Businesses have a field called Attributes, as below:
"attributes": {"Take-out": true, "Wi-Fi": "no", "Good For": {"dessert": false, "latenight": false, "lunch": true, "dinner": false, "brunch": false, "breakfast": false}, "Caters": true, "Noise Level": "average", "Takes Reservations": false, "Has TV": true, "Delivery": false, "Ambience": {"romantic": false, "intimate": false, "touristy": false, "hipster": false, "divey": false, "classy": false, "trendy": false, "upscale": false, "casual": true}, "Parking": {"garage": false, "street": false, "validated": false, "lot": true, "valet": false}, "Wheelchair Accessible": true, "Outdoor Seating": true}
We broke it down into a list of 81 fields, retrieved from all the business details in the dataset.
The tables are created in HBase, and a Pig script is used to populate those HBase tables.

V. RESULTS
The process of extracting data from source systems and bringing it into the data warehouse is commonly called ETL, which stands for extraction, transformation, and loading. As already noted, the data is in JSON format. We use PHP in the extraction and transformation phases, in which the JSON files are converted to CSV files. All the CSV files are then loaded into HBase. Hive queries were written to retrieve the data needed for the experiment.
Pig scripts are used to find the attributes that people consider most in their reviews. Business Intelligence and Reporting Tools (BIRT) is used to visualize the top 30 of these attributes.

All the data we obtained was loaded into Google Fusion Tables. Google Fusion Tables is an experimental application that lets you store, share, query, and visualize data tables. It offers a REST API to manage tables, info window templates, and styles. The query endpoint allows you to manage data rows (insert/update/delete), and query the table for all rows that match spatial or data conditions. The results of queries can be CSV or JSON, or used in the Google Maps API or Google Visualization API. We used the Google Maps API for visualization.

VI. FUTURE WORK
We faced a few obstacles while working on this project. One important obstacle was that the analytic became unstable when used on sub-categories, e.g. the sub-category Chinese restaurants versus the category Restaurants. We would like to fix this issue. We also plan to build a good-looking GUI and present our work to Yelp for the Dataset Challenge.

VII. CONCLUSION
Our analytic was a success: we could provide statistical data that could help businesses make informed decisions related to improving their business.

ACKNOWLEDGMENT
We thank Yelp for providing their dataset and other resources. As students, we gained a good understanding of Big Data technologies. Thank you, Yelp, for giving us this opportunity.
We also thank Prof. Suzanne McIntosh for all the help and guidance provided while we were working on this project.

REFERENCES
http://dl.acm.org/citation.cfm?id=2507163
http://ieeexplore.ieee.org/stamp/stamp.jsp?tp=&arnumber=6623655
http://www.yelp.com/html/pdf/YelpDatasetChallengeWinner_InferringFuture.pdf
http://www.cs.uic.edu/~liub/publications/aaai04-featureExtract.pdf
http://www.yelp.com/html/pdf/YelpDatasetChallengeWinner_ImprovingRestaurants.pdf
