Amazon Review Data Spark Example

The document describes how a pipeline model in Spark's machine learning library (MLlib) works. It discusses how a pipeline of Transformer and Estimator stages processes a DataFrame. The first stages tokenize the text and convert it to feature vectors. Logistic regression is then fit as an Estimator and can transform test data as part of the pipeline model. Problems faced included large dataset sizes and improper JSON formatting, which were addressed.

We had almost 10,000 records for the product category "Musical Instruments", which we divided into 5,000 records for the training set and 5,000 for the test set. The Pipeline model of Spark's MLlib is used to train the classifier. Below is the sequence of steps carried out in the pipeline.

Matching versions of Spark and the Hadoop file system are very important when installing Spark. I have installed Hadoop version 1.4.1, so the Spark version also has to be 1.4.0.

http://spark.apache.org/downloads.html

CIS 612 – Advance topics in Database system 32


Pipeline model of Spark's MLlib:
How it works
A Pipeline is specified as a sequence of stages, and each stage is either a Transformer or an Estimator. These stages are run in order, and the input DataFrame is transformed as it passes through each stage. For Transformer stages, the transform() method is called on the DataFrame. For Estimator stages, the fit() method is called to produce a Transformer (which becomes part of the PipelineModel, or fitted Pipeline), and that Transformer's transform() method is called on the DataFrame.

In the figure above, the top row represents a Pipeline with three stages. The first two (Tokenizer and HashingTF) are Transformers, and the third (LogisticRegression) is an Estimator. The bottom row represents data flowing through the pipeline, where cylinders indicate DataFrames.
In the code below, the DataFrame is created. The first parameter passed to the createDataFrame function is the structure, which has to be of the format "ID", "text", and "label". The split method is used to split the text file created by the Python program into these three fields. The split fields and the schema of the records are passed to the createDataFrame function, and the resulting DataFrame flows into the pipeline's first stage, the Tokenizer.
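The splitting step might look like the sketch below. The exact delimiter of the file produced by the report's Python program is not shown, so tab-separation is assumed here; adjust the delimiter to match the real file.

```python
# Hypothetical parser for one line of the "ID <tab> text <tab> label" file.
def parse_line(line):
    doc_id, text, label = line.rstrip("\n").split("\t", 2)
    return int(doc_id), text, float(label)

lines = [
    "0\tgreat sound quality\t1",
    "1\tstrings broke immediately\t0",
]
rows = [parse_line(l) for l in lines]
print(rows)
# rows, together with the ["id", "text", "label"] schema, can then be
# handed to spark.createDataFrame to enter the pipeline.
```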

The Tokenizer.transform() method splits the raw text documents into words, adding a new column with the words to the DataFrame.

The HashingTF.transform() method converts the words column into feature vectors, adding a new column with those vectors to the DataFrame. Since LogisticRegression is an Estimator, the Pipeline first calls LogisticRegression.fit() to produce a LogisticRegressionModel. There are two main ways to pass parameters to an algorithm:
1. Set parameters for an instance. E.g., if lr is an instance of LogisticRegression, one could call lr.setMaxIter(10) to make lr.fit() use at most 10 iterations. This API resembles the API used in the spark.mllib package.
2. Pass a ParamMap to fit() or transform(). Any parameters in the ParamMap will override parameters previously specified via setter methods.

Parameters belong to specific instances of Estimators and Transformers. For example, if we have two LogisticRegression instances lr1 and lr2, then we can build a ParamMap with both maxIter parameters specified: ParamMap(lr1.maxIter -> 10, lr2.maxIter -> 20). This is useful if there are two algorithms with the maxIter parameter in a Pipeline.

If the Pipeline had more stages, it would call the LogisticRegressionModel's transform() method on the DataFrame before passing the DataFrame to the next stage.

The Pipeline.fit() method is called on the original DataFrame, which has raw text documents and labels. A Pipeline is an Estimator; thus, after a Pipeline's fit() method runs, it produces a PipelineModel, which is a Transformer.

This PipelineModel can then be used on test data. In the testing phase, the PipelineModel has the same number of stages as the original Pipeline, but all Estimators in the original Pipeline have become Transformers. When the PipelineModel's transform() method is called on a test dataset, the data is passed through the fitted pipeline in order. Each stage's transform() method updates the dataset and passes it to the next stage.

Start the PySpark shell with the command bin/pyspark.

Execute the script with the ./bin/spark-submit command.

The execution throws an error that NumPy 1.4 or a higher version is required. It is solved by correcting a bug in the __init__.py file of MLlib.

The input file has to be in the format ID, Text, and Label for the DataFrame to be processed, so the file is changed to the required format before being passed to the model.
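The reformatting step might look like the sketch below. The original file's raw layout is not shown in the report, so this assumes raw rows carrying only (text, label) and prepends a running integer ID to reach the required ID, Text, Label shape.

```python
# Hypothetical reshaping helper: prepend a sequential ID to (text, label)
# rows so they match the (id, text, label) schema the model expects.
def add_ids(rows):
    return [(i, text, float(label)) for i, (text, label) in enumerate(rows)]

raw = [("great sound quality", "1"), ("strings broke immediately", "0")]
print(add_ids(raw))
```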

Here, I have printed a few of the records; the rest are written to the file.

The correct label is predicted for the test set, and the results are saved in the amazon_test_predict file.

Problems Faced During the Project:

1. The Amazon review dataset is very large; some of the category datasets contain millions of rows. We faced errors in opening and processing these datasets and were not able to process all of the records. We chose the Musical Instruments dataset because it has about 10k review rows and 89k metadata rows.

2. After downloading the Musical Instruments dataset, we uploaded it to HDFS. The column field names in the downloaded dataset were a mix of lower-case and capital letters. Hive by default processes column names in lower case, so when we fetched records from the Musical Instruments dataset (from HDFS), values were returned only for columns whose names were in lower case. We thought this was a problem with the JSONSerDe JAR file, but later we realized it was caused by the letter case: columns whose names contained capital letters gave null values.
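One way to work around this, sketched below under the assumption that the records are preprocessed before loading into Hive, is to lowercase every JSON key so Hive's lower-case column matching finds each field. The example record is invented.

```python
import json

def lowercase_keys(obj):
    # Recursively lowercase dictionary keys so that Hive's lower-case
    # column name matching can resolve every field.
    if isinstance(obj, dict):
        return {k.lower(): lowercase_keys(v) for k, v in obj.items()}
    if isinstance(obj, list):
        return [lowercase_keys(v) for v in obj]
    return obj

record = json.loads('{"ASIN": "B0002E1G5C", "Price": 9.99}')
print(json.dumps(lowercase_keys(record)))
```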

3. The metadata was not in strict JSON format. The metadata file was very important in our project because it contains the product names and their prices, so we had to convert it to strict JSON format. We wrote a Python program to convert the improper JSON file to strict JSON. The improper format was caused by single quotes (') around the key and value fields, and quotes inside the value fields were not escaped with '\'.
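The conversion might look like the sketch below: single-quoted, Python-style dict lines are parsed with ast.literal_eval and re-serialized as strict, double-quoted JSON. This is one plausible implementation of such a converter, not necessarily the report's program, and the example record is invented.

```python
import ast
import json

def to_strict_json(line):
    # ast.literal_eval safely parses a Python-style dict literal with
    # single quotes; json.dumps re-serializes it as strict JSON.
    return json.dumps(ast.literal_eval(line))

loose = "{'asin': 'B0002E1G5C', 'price': 9.99}"
print(to_strict_json(loose))
```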

Conclusion: There has been consistent growth in the number of reviews as well as reviewers over the time span considered in the analysis. This shows that more and more buyers are relying on the reviews of customers who have already bought the product and are also writing reviews themselves. This implies that the review system is an important feature of the Amazon online shopping system, and improving it will enhance the shopping experience. Higher star ratings are common across all reviewers and have a high overall helpfulness rating as well.

Future Work: The analysis done in this project is limited to the Musical Instruments product category because of RAM (speed), space, and dataset size limitations. Something similar can be done for other categories, as well as for all the categories together.

Machine Learning: Other classification algorithms available in Spark MLlib can be applied to predict spam reviews.
