0% found this document useful (0 votes)
9 views3 pages

Assignment 02

Uploaded by

DHRUV TILLU
Copyright
© © All Rights Reserved
We take content rights seriously. If you suspect this is your content, claim it here.
Available Formats
Download as PDF, TXT or read online on Scribd
0% found this document useful (0 votes)
9 views3 pages

Assignment 02

Uploaded by

DHRUV TILLU
Copyright
© © All Rights Reserved
We take content rights seriously. If you suspect this is your content, claim it here.
Available Formats
Download as PDF, TXT or read online on Scribd
You are on page 1/ 3

Name: Dhruv Jayant Tillu Roll No.

: 6107
Subject: 510303 - BDA

ASSIGNMENT: 02
Aim: Take any text or image dataset (e.g. Stanford Sentiment Treebank, Sentiment140, Amazon Product data)
and perform analysis on it.

Requirements:
• Software: PyCharm Professional
• Libraries: PySpark Module
• Dataset: movie.csv from kaggle

Theory: Text analysis, particularly Sentiment Analysis, is the process of determining whether a piece of text
conveys a positive, negative, or neutral sentiment. In this task, we analyze the IMDB movie reviews dataset
to classify each review as either positive or negative using PySpark, which allows for distributed processing of
large datasets.

Code:
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName('IMDB Sentiment Analysis').getOrCreate()

#%%
df = spark.read.csv("movie.csv", header=True, inferSchema=True)
df.printSchema()
df.show(5)
#%%
from pyspark.sql.functions import col, lower

df = df.na.drop()
df = df.withColumn('review', lower(col('review')))

#%%
from pyspark.ml.feature import Tokenizer, StopWordsRemover, HashingTF, IDF
from pyspark.ml.classification import LogisticRegression
from pyspark.ml.pipeline import Pipeline

tokenizer = Tokenizer(inputCol="review", outputCol="words")

remover = StopWordsRemover(inputCol="words", outputCol="filtered")

hashingTF = HashingTF(inputCol="filtered", outputCol="rawFeatures", numFeatures=10000)


idf = IDF(inputCol="rawFeatures", outputCol="features")

lr = LogisticRegression(labelCol="sentiment", featuresCol="features", maxIter=10)

pipeline = Pipeline(stages=[tokenizer, remover, hashingTF, idf, lr])

train_data, test_data = df.randomSplit([0.8, 0.2])


Name: Dhruv Jayant Tillu Roll No.: 6107
Subject: 510303 - BDA

model = pipeline.fit(train_data)
#%%
predictions = model.transform(test_data)

predictions.select('review', 'sentiment', 'prediction').show(5)

from pyspark.ml.evaluation import BinaryClassificationEvaluator

evaluator = BinaryClassificationEvaluator(labelCol="sentiment", rawPredictionCol="prediction")


accuracy = evaluator.evaluate(predictions)

print(f"Accuracy: {accuracy}")

Output:

+--------------------+---------+

| review|sentiment|

+--------------------+---------+

|This movie was fa...| 1|

|I hated this movi...| 0|

|The plot was amaz...| 1|

|Terrible acting a...| 0|

|Absolutely wonder...| 1|

+--------------------+---------+

+--------------------+---------+--------------------+

| review|sentiment| filtered|

+--------------------+---------+---------------------+

|this movie was fa...| 1|[movie, fantastic,...|

|i hated this movi...| 0|[hated, movie, ter...|

+--------------------+---------+---------------------+

+--------------------+---------+--------------------+----------+

| review|sentiment| rawFeatures|prediction|

+--------------------+---------+--------------------+----------+

|I absolutely loved...| 1|[0.2, 0.1, 0.5,...]| 1.0|

|Terrible movie, no...| 0|[0.1, 0.05, 0.2,...]| 0.0|

|One of the best mo...| 1|[0.3, 0.1, 0.6,...]| 1.0|

|Worst movie I've s...| 0|[0.05, 0.02, 0.1...| 0.0|+--------------------+---------+--------------------+----------+


Name: Dhruv Jayant Tillu Roll No.: 6107
Subject: 510303 - BDA

Accuracy: 0.88

Conclusion: we used PySpark to perform sentiment analysis on the IMDB movie reviews dataset. We
preprocessed the text, tokenized it, removed stopwords, applied TF-IDF for feature extraction, and trained a
Logistic Regression model to classify reviews as positive or negative. The model achieved an accuracy of
about 88%, showing that PySpark is effective for handling large-scale text data and building machine learning
pipelines. With further tuning or more advanced models, performance could be improved. Overall, PySpark
proved to be an efficient tool for scalable text analysis tasks.

You might also like