Amazon Review Data Spark Example
Matching the Spark build to the installed Hadoop file system version is very important when installing Spark. We installed Hadoop version 1.4.1, so the Spark version also has to be 1.4.0.
http://spark.apache.org/downloads.html
In the figure above, the top row represents a Pipeline with three stages. The first two
(Tokenizer and HashingTF) are Transformers, and the third (LogisticRegression) is an
Estimator. The bottom row represents data flowing through the pipeline, where cylinders indicate
DataFrames.
In the code below, a DataFrame is created. The first parameter passed to the createDataFrame
function is the data, which has to follow the format "id", "text", "label".
The split method is used to split the text file created by the Python program into these three
fields. The split fields and the schema of the records are passed to the createDataFrame function,
and the resulting DataFrame flows into the first pipeline stage, the Tokenizer.
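The parsing and createDataFrame steps described above can be sketched as follows. The comma-separated layout and the helper names (parse_line, build_dataframe) are assumptions for illustration, since the exact file format produced by the preprocessing program is not shown here:

```python
# Sketch: parse the preprocessed file into (id, text, label) rows and build
# the DataFrame that feeds the Tokenizer stage. The comma-separated layout
# is an assumption.

def parse_line(line):
    """Split one preprocessed line into the (id, text, label) triple."""
    doc_id, rest = line.strip().split(",", 1)   # id is the first field
    text, label = rest.rsplit(",", 1)           # label is the last field
    return (int(doc_id), text, float(label))

def build_dataframe(spark, path):
    """Build the (id, text, label) DataFrame from the preprocessed file.

    `spark` is an active SparkSession; `path` points at the file written
    by the Python preprocessing program.
    """
    rows = spark.sparkContext.textFile(path).map(parse_line)
    return spark.createDataFrame(rows, ["id", "text", "label"])
```

Splitting the id off the left and the label off the right keeps the middle text field intact even when a review contains commas.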
The HashingTF.transform() method converts the words column into feature vectors, adding a
new column with those vectors to the DataFrame. Since LogisticRegression is an
Estimator, the Pipeline first calls LogisticRegression.fit() to produce a
LogisticRegressionModel. There are two main ways to pass parameters to an algorithm:
1. Set parameters for an instance. E.g., if lr is an instance of LogisticRegression, one could call
lr.setMaxIter(10) to make lr.fit() use at most 10 iterations. This API resembles the API
used in the spark.mllib package.
2. Pass a ParamMap to fit() or transform(). Any parameters in the ParamMap override
parameters previously specified via setter methods.
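Both styles can be sketched in one place. This assumes `lr` is a pyspark.ml.classification.LogisticRegression instance and `train_df` a training DataFrame; the function name and hyperparameter values are illustrative:

```python
# Sketch of the two parameter-passing styles. In PySpark a ParamMap is a
# plain dict mapping Param objects to values.

def fit_with_overrides(lr, train_df):
    # Way 1: setter on the instance; lr.fit() now uses at most 10 iterations
    lr.setMaxIter(10)
    # Way 2: a ParamMap passed to fit(); these values override the setter
    # above, so training actually runs with maxIter=20
    param_map = {lr.maxIter: 20, lr.regParam: 0.01}
    return lr.fit(train_df, params=param_map)
```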
If the Pipeline had more stages, it would call the LogisticRegressionModel's transform() method
on the DataFrame before passing the DataFrame to the next stage.
The Pipeline.fit() method is called on the original DataFrame, which has raw text documents
and labels. A Pipeline is an Estimator; thus, after a Pipeline's fit() method runs, it produces a
PipelineModel, which is a Transformer.
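The three-stage pipeline from the figure can be sketched as below. Column names and hyperparameter values are illustrative assumptions, not the project's exact settings:

```python
# Sketch of the Tokenizer -> HashingTF -> LogisticRegression pipeline.

def build_and_fit(training_df):
    # Local imports keep this sketch readable without a Spark installation
    from pyspark.ml import Pipeline
    from pyspark.ml.classification import LogisticRegression
    from pyspark.ml.feature import HashingTF, Tokenizer

    tokenizer = Tokenizer(inputCol="text", outputCol="words")
    hashing_tf = HashingTF(inputCol="words", outputCol="features")
    lr = LogisticRegression(maxIter=10, regParam=0.001)
    pipeline = Pipeline(stages=[tokenizer, hashing_tf, lr])
    # fit() runs the two Transformers on the DataFrame, then calls
    # LogisticRegression.fit(); the result is a PipelineModel, itself a
    # Transformer that can score new data
    return pipeline.fit(training_df)
```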
The input file has to be in the format id, text, label for the DataFrame to be processed, so the
file is converted to the required format before being passed to the model.
The correct label is predicted for the test set and the results are saved in amazon_test_predict.
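Scoring the test set and saving the predictions can be sketched as follows. The output name amazon_test_predict comes from the text above; the JSON output format and the selected columns are assumptions:

```python
# Sketch: score the test DataFrame with the fitted PipelineModel and save
# the predictions.

def predict_and_save(model, test_df, out_path="amazon_test_predict"):
    # transform() adds a "prediction" column; the label column is not
    # needed at prediction time
    predictions = model.transform(test_df)
    predictions.select("id", "text", "prediction") \
               .write.mode("overwrite").json(out_path)
    return predictions
```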
1. The Amazon review dataset is very large; some of the category datasets contain millions or
billions of rows. We faced errors opening and processing these datasets and were
not able to process all the records. We chose the Musical Instruments dataset because
it has roughly 10k review rows and 89k metadata rows.
Ø Future Work: The analysis done in this project is limited to the Musical Instruments category
only because of RAM (speed), disk space, and dataset size limitations. Something similar can be
done for other categories individually, as well as for all the categories together.
Ø Machine Learning: Other classification algorithms available in Spark MLlib can be applied
to predict spam reviews.