Assignment 02
: 6107
Subject: 510303 - BDA
ASSIGNMENT: 02
Aim: Take any text or image dataset (e.g. Stanford Sentiment Treebank, Sentiment140, Amazon Product data)
and perform analysis on it.
Requirements:
• Software: PyCharm Professional
• Libraries: PySpark Module
• Dataset: movie.csv from Kaggle
Theory: Text analysis, particularly Sentiment Analysis, is the process of determining whether a piece of text
conveys a positive, negative, or neutral sentiment. In this task, we analyze the IMDB movie reviews dataset
to classify each review as either positive or negative using PySpark, which allows for distributed processing of
large datasets.
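To make the idea concrete, the following is a minimal pure-Python sketch of how review text can be turned into numeric features before classification: tokenization, stop-word removal, and TF-IDF weighting. It is not the PySpark implementation; the stop-word list, the sample reviews, and the function names are assumptions chosen for illustration. The smoothed IDF formula log((N + 1) / (df + 1)) is, to the best of our understanding, the one Spark's IDF stage applies, and the hashing step of HashingTF is left out for readability.

```python
import math
import re
from collections import Counter

# Tiny illustrative stop-word list (an assumption; Spark ships a much longer one).
STOP_WORDS = {"a", "an", "the", "was", "is", "this", "and", "it"}


def tokenize(text):
    """Lowercase the text and split it into word tokens."""
    return re.findall(r"[a-z']+", text.lower())


def remove_stop_words(tokens):
    """Drop tokens that appear in the stop-word list."""
    return [t for t in tokens if t not in STOP_WORDS]


def tf_idf(docs):
    """Return one {term: weight} dict per tokenized document.

    Term frequency is the raw count in the document; the inverse document
    frequency is the smoothed form log((N + 1) / (df + 1)).
    """
    n_docs = len(docs)
    doc_freq = Counter(term for doc in docs for term in set(doc))
    idf = {t: math.log((n_docs + 1) / (df + 1)) for t, df in doc_freq.items()}
    return [{t: c * idf[t] for t, c in Counter(doc).items()} for doc in docs]


# Made-up reviews, for illustration only.
reviews = [
    "This movie was absolutely wonderful",
    "This movie was boring and the plot was weak",
    "A wonderful cast and a wonderful story",
]
docs = [remove_stop_words(tokenize(r)) for r in reviews]
weights = tf_idf(docs)

for review, w in zip(reviews, weights):
    print(review, "->", {t: round(v, 3) for t, v in w.items()})
```

Words that occur in fewer reviews (such as "absolutely") receive a higher weight than words shared by several reviews (such as "movie"), which is what lets the classifier focus on the more distinctive terms.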
Code:
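from pyspark.sql import SparkSession
# Create (or reuse) a Spark session; the application name is arbitrary.
spark = SparkSession.builder.appName("SentimentAnalysis").getOrCreate()
#%%
df = spark.read.csv("movie.csv", header=True, inferSchema=True)
df.printSchema()
df.show(5)
#%%
from pyspark.sql.functions import col, lower
# Drop rows with missing values and lowercase the review text.
df = df.na.drop()
df = df.withColumn('review', lower(col('review')))
#%%
from pyspark.ml.feature import Tokenizer, StopWordsRemover, HashingTF, IDF
from pyspark.ml.classification import LogisticRegression
from pyspark.ml.pipeline import Pipeline
from pyspark.ml.evaluation import MulticlassClassificationEvaluator
# Stages: tokenize -> remove stop words -> term frequencies -> IDF weighting -> classifier
tokenizer = Tokenizer(inputCol='review', outputCol='words')
remover = StopWordsRemover(inputCol='words', outputCol='filtered')
hashing_tf = HashingTF(inputCol='filtered', outputCol='rawFeatures')
idf = IDF(inputCol='rawFeatures', outputCol='features')
lr = LogisticRegression(featuresCol='features', labelCol='sentiment')
pipeline = Pipeline(stages=[tokenizer, remover, hashing_tf, idf, lr])
# The 80/20 split and the seed are assumed values; adjust as needed.
train_data, test_data = df.randomSplit([0.8, 0.2], seed=42)
model = pipeline.fit(train_data)
#%%
predictions = model.transform(test_data)
# Accuracy = fraction of test reviews whose predicted label matches the true label.
evaluator = MulticlassClassificationEvaluator(
    labelCol='sentiment', predictionCol='prediction', metricName='accuracy')
accuracy = evaluator.evaluate(predictions)
print(f"Accuracy: {accuracy}")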
Output:
+--------------------+---------+
|              review|sentiment|
+--------------------+---------+
|Absolutely wonder...|        1|
+--------------------+---------+

+--------------------+---------+--------------------+
|              review|sentiment|            filtered|
+--------------------+---------+--------------------+
+--------------------+---------+--------------------+

+--------------------+---------+--------------------+----------+
|              review|sentiment|         rawFeatures|prediction|
+--------------------+---------+--------------------+----------+
+--------------------+---------+--------------------+----------+

Accuracy: 0.88
Conclusion: We used PySpark to perform sentiment analysis on the IMDB movie reviews dataset. We
lowercased the text, tokenized it, removed stop words, applied TF-IDF for feature extraction, and trained a
Logistic Regression model to classify reviews as positive or negative. The model achieved an accuracy of
about 88% on the test data, which indicates that a simple TF-IDF and Logistic Regression pipeline is a
reasonable baseline for this task. Because the whole workflow runs as a Spark ML pipeline, the same code can
be applied to larger datasets on a cluster. With further tuning or more advanced models, performance could
be improved.
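For reference, the accuracy figure is the fraction of test reviews whose predicted label equals the true label. The sketch below computes it in plain Python on a handful of made-up labels and predictions (hypothetical values, not taken from the experiment), together with a simple count of each (actual, predicted) pair, which helps show where the errors occur.

```python
from collections import Counter


def confusion_counts(labels, predictions):
    """Count how often each (actual, predicted) pair occurs."""
    return Counter(zip(labels, predictions))


def accuracy(labels, predictions):
    """Fraction of examples whose prediction matches the true label."""
    correct = sum(1 for y, p in zip(labels, predictions) if y == p)
    return correct / len(labels)


# Hypothetical labels and predictions, made up for illustration only.
labels = [1, 0, 1, 1, 0, 0, 1, 0]
predictions = [1, 0, 0, 1, 0, 1, 1, 0]

counts = confusion_counts(labels, predictions)
print("Accuracy:", accuracy(labels, predictions))
print("True positives:", counts[(1, 1)])
print("True negatives:", counts[(0, 0)])
print("False negatives:", counts[(1, 0)])
print("False positives:", counts[(0, 1)])
```

Looking at the false positives and false negatives separately can reveal whether the model errs more on one class, which a single accuracy number does not show.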