Amazon Review Data Spark Example

The document describes how a pipeline model in Spark's machine learning library (MLlib) works. It discusses how a pipeline of Transformer and Estimator stages processes a DataFrame. The first stages tokenize the text and convert it to feature vectors. Logistic regression is then fit as an Estimator and can transform test data as part of the pipeline model. Problems faced included large dataset sizes and improper JSON formatting, which were addressed.

We had almost 10,000 records for the product category "Musical Instruments", which we divided into 5,000 records for the training set and 5,000 for the test set. The Pipeline model of Spark's MLlib is used to train the classifier. Below is the sequence of steps carried out in the pipeline.

Matching versions of Spark and the Hadoop file system are very important when installing Spark. I have installed Hadoop version 1.4.1, so the Spark version also has to be 1.4.0.

http://spark.apache.org/downloads.html

CIS 612 – Advance topics in Database system 32


Pipeline model of Spark's MLlib:
How it works
A Pipeline is specified as a sequence of stages, and each stage is either a Transformer or an Estimator. These stages are run in order, and the input DataFrame is transformed as it passes through each stage. For Transformer stages, the transform() method is called on the DataFrame. For Estimator stages, the fit() method is called to produce a Transformer (which becomes part of the PipelineModel, or fitted Pipeline), and that Transformer's transform() method is called on the DataFrame.

In the figure above, the top row represents a Pipeline with three stages. The first two (Tokenizer and HashingTF) are Transformers, and the third (LogisticRegression) is an Estimator. The bottom row represents data flowing through the pipeline, where cylinders indicate DataFrames.
In the code below, the DataFrame is created. The first parameter passed to the createDataFrame function is the structure, which has to be of the format "ID", "text", and "label". The split method is used to split the text file created by the Python program into these three fields. The split fields and the schema of the records are passed to the createDataFrame function, and the resulting DataFrame flows into the pipeline's first stage, the Tokenizer.
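The splitting step might look like the sketch below. The exact delimiter of the file produced by the report's Python program is not shown, so tab-separation is assumed here; adjust the delimiter to match the real file.

```python
# Hypothetical parser for one line of the "ID <tab> text <tab> label" file.
def parse_line(line):
    doc_id, text, label = line.rstrip("\n").split("\t", 2)
    return int(doc_id), text, float(label)

lines = [
    "0\tgreat sound quality\t1",
    "1\tstrings broke immediately\t0",
]
rows = [parse_line(l) for l in lines]
print(rows)
# rows, together with the ["id", "text", "label"] schema, can then be
# handed to spark.createDataFrame to enter the pipeline.
```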

The Tokenizer.transform() method splits the raw text documents into words, adding a new column with the words to the DataFrame.

The HashingTF.transform() method converts the words column into feature vectors, adding a new column with those vectors to the DataFrame. Since LogisticRegression is an Estimator, the Pipeline first calls LogisticRegression.fit() to produce a LogisticRegressionModel. There are two main ways to pass parameters to an algorithm:
1. Set parameters for an instance. E.g., if lr is an instance of LogisticRegression, one could call lr.setMaxIter(10) to make lr.fit() use at most 10 iterations. This API resembles the API used in the spark.mllib package.
2. Pass a ParamMap to fit() or transform(). Any parameters in the ParamMap will override parameters previously specified via setter methods.

Parameters belong to specific instances of Estimators and Transformers. For example, if we have two LogisticRegression instances lr1 and lr2, then we can build a ParamMap with both maxIter parameters specified: ParamMap(lr1.maxIter -> 10, lr2.maxIter -> 20). This is useful if there are two algorithms with the maxIter parameter in a Pipeline.

If the Pipeline had more stages, it would call the LogisticRegressionModel's transform() method on the DataFrame before passing the DataFrame to the next stage.

The Pipeline.fit() method is called on the original DataFrame, which has raw text documents and labels. A Pipeline is an Estimator; thus, after a Pipeline's fit() method runs, it produces a PipelineModel, which is a Transformer.

This PipelineModel can then be used on test data. In the testing phase, the PipelineModel has the same number of stages as the original Pipeline, but all Estimators in the original Pipeline have become Transformers. When the PipelineModel's transform() method is called on a test dataset, the data is passed through the fitted pipeline in order. Each stage's transform() method updates the dataset and passes it to the next stage.

Start the PySpark shell with the command bin/pyspark.

Execute the script with the ./bin/spark-submit command.

The execution throws an error that NumPy 1.4 or a higher version is required. It is solved by correcting a bug in the __init__.py file of MLlib.

The input file has to be in the format ID, Text, and Label for the DataFrame to be processed, so the file is changed to the required format before being passed to the model.
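The reformatting step might look like the sketch below. The original file's raw layout is not shown in the report, so this assumes raw rows carrying only (text, label) and prepends a running integer ID to reach the required ID, Text, Label shape.

```python
# Hypothetical reshaping helper: prepend a sequential ID to (text, label)
# rows so they match the (id, text, label) schema the model expects.
def add_ids(rows):
    return [(i, text, float(label)) for i, (text, label) in enumerate(rows)]

raw = [("great sound quality", "1"), ("strings broke immediately", "0")]
print(add_ids(raw))
```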

Here, I have printed a few of the records; the rest are written to the file.

The correct label is predicted for the test set, and the results are saved in the amazon_test_predict file.

Problems Faced During the Project:

1. The Amazon review dataset is very large; some of the category datasets contain millions of rows. We faced errors in opening and processing these datasets and were not able to process all of the records. We chose the Musical Instruments dataset because it has about 10k review rows and 89k metadata rows.

2. After downloading the Musical Instruments dataset, we uploaded it to HDFS. The column field names in the downloaded dataset were a mix of lower-case and capital letters. Hive by default processes column names in lower case, so when we fetched records from the Musical Instruments dataset (from HDFS), values were returned only for columns whose names were in lower case. We thought this was a problem with the JSONSerDe JAR file, but later we realized it was caused by the letter case: columns whose names contained capital letters gave null values.
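One way to work around this, sketched below under the assumption that the records are preprocessed before loading into Hive, is to lowercase every JSON key so Hive's lower-case column matching finds each field. The example record is invented.

```python
import json

def lowercase_keys(obj):
    # Recursively lowercase dictionary keys so that Hive's lower-case
    # column name matching can resolve every field.
    if isinstance(obj, dict):
        return {k.lower(): lowercase_keys(v) for k, v in obj.items()}
    if isinstance(obj, list):
        return [lowercase_keys(v) for v in obj]
    return obj

record = json.loads('{"ASIN": "B0002E1G5C", "Price": 9.99}')
print(json.dumps(lowercase_keys(record)))
```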

3. The metadata was not in strict JSON format. The metadata file was very important in our project because it contains the product names and their prices, so we had to convert it to strict JSON format. We wrote a Python program to convert the improper JSON file to strict JSON. The improper format was caused by single quotes (') around the key and value fields, and quotes inside the value fields were not escaped with '\'.
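The conversion might look like the sketch below: single-quoted, Python-style dict lines are parsed with ast.literal_eval and re-serialized as strict, double-quoted JSON. This is one plausible implementation of such a converter, not necessarily the report's program, and the example record is invented.

```python
import ast
import json

def to_strict_json(line):
    # ast.literal_eval safely parses a Python-style dict literal with
    # single quotes; json.dumps re-serializes it as strict JSON.
    return json.dumps(ast.literal_eval(line))

loose = "{'asin': 'B0002E1G5C', 'price': 9.99}"
print(to_strict_json(loose))
```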

Conclusion: There has been consistent growth in the number of reviews as well as reviewers over the time span considered in the analysis. This shows that more and more buyers are relying on the reviews of customers who have already bought the product and are also writing reviews themselves. This implies that the review system is an important feature of the Amazon online shopping system, and improving it will enhance the shopping experience. Higher star ratings are common across all reviewers and have a high overall helpfulness rating as well.

Future Work: The analysis done in this project is limited to the Musical Instruments product category because of RAM (speed), space, and dataset size limitations. Something similar can be done for other categories, as well as for all the categories together.

Machine Learning: Other classification algorithms available in Spark MLlib can be applied to predict spam reviews.
