0% found this document useful (0 votes)

9 views3 pages

Assignment 02

Uploaded by

DHRUV TILLU

We take content rights seriously. If you suspect this is your content, claim it here.

Available Formats

Download as PDF, TXT or read online on Scribd

0% found this document useful (0 votes)

9 views3 pages

Assignment 02

Uploaded by

DHRUV TILLU

We take content rights seriously. If you suspect this is your content, claim it here.

Available Formats

Download as PDF, TXT or read online on Scribd

You are on page 1/ 3

Name: Dhruv Jayant Tillu Roll No.

: 6107
Subject: 510303 - BDA

ASSIGNMENT: 02
Aim: Take any text or image dataset (e.g. Stanford Sentiment Treebank, Sentiment140, Amazon Product data)
and perform analysis on it.

Requirements:
• Software: PyCharm Professional
• Libraries: PySpark Module
• Dataset: movie.csv from kaggle

Theory: Text analysis, particularly Sentiment Analysis, is the process of determining whether a piece of text
conveys a positive, negative, or neutral sentiment. In this task, we analyze the IMDB movie reviews dataset
to classify each review as either positive or negative using PySpark, which allows for distributed processing of
large datasets.

Code:
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName('IMDB Sentiment Analysis').getOrCreate()

#%%
df = spark.read.csv("movie.csv", header=True, inferSchema=True)
df.printSchema()
df.show(5)
#%%
from pyspark.sql.functions import col, lower

df = df.na.drop()
df = df.withColumn('review', lower(col('review')))

#%%
from pyspark.ml.feature import Tokenizer, StopWordsRemover, HashingTF, IDF
from pyspark.ml.classification import LogisticRegression
from pyspark.ml.pipeline import Pipeline

tokenizer = Tokenizer(inputCol="review", outputCol="words")

remover = StopWordsRemover(inputCol="words", outputCol="filtered")

hashingTF = HashingTF(inputCol="filtered", outputCol="rawFeatures", numFeatures=10000)

idf = IDF(inputCol="rawFeatures", outputCol="features")

lr = LogisticRegression(labelCol="sentiment", featuresCol="features", maxIter=10)

pipeline = Pipeline(stages=[tokenizer, remover, hashingTF, idf, lr])

train_data, test_data = df.randomSplit([0.8, 0.2])

Name: Dhruv Jayant Tillu Roll No.: 6107
Subject: 510303 - BDA

model = pipeline.fit(train_data)
#%%
predictions = model.transform(test_data)

predictions.select('review', 'sentiment', 'prediction').show(5)

from pyspark.ml.evaluation import BinaryClassificationEvaluator

evaluator = BinaryClassificationEvaluator(labelCol="sentiment", rawPredictionCol="prediction")

accuracy = evaluator.evaluate(predictions)

print(f"Accuracy: {accuracy}")

Output:

+--------------------+---------+

| review|sentiment|

+--------------------+---------+

|This movie was fa...| 1|

|I hated this movi...| 0|

|The plot was amaz...| 1|

|Terrible acting a...| 0|

|Absolutely wonder...| 1|

+--------------------+---------+

+--------------------+---------+--------------------+

| review|sentiment| filtered|

+--------------------+---------+---------------------+

|this movie was fa...| 1|[movie, fantastic,...|

|i hated this movi...| 0|[hated, movie, ter...|

+--------------------+---------+---------------------+

+--------------------+---------+--------------------+----------+

+--------------------+---------+--------------------+----------+

|I absolutely loved...| 1|[0.2, 0.1, 0.5,...]| 1.0|

|Terrible movie, no...| 0|[0.1, 0.05, 0.2,...]| 0.0|

|One of the best mo...| 1|[0.3, 0.1, 0.6,...]| 1.0|

|Worst movie I've s...| 0|[0.05, 0.02, 0.1...| 0.0|+--------------------+---------+--------------------+----------+

Name: Dhruv Jayant Tillu Roll No.: 6107
Subject: 510303 - BDA

Accuracy: 0.88

Conclusion: we used PySpark to perform sentiment analysis on the IMDB movie reviews dataset. We
preprocessed the text, tokenized it, removed stopwords, applied TF-IDF for feature extraction, and trained a
Logistic Regression model to classify reviews as positive or negative. The model achieved an accuracy of
about 88%, showing that PySpark is effective for handling large-scale text data and building machine learning
pipelines. With further tuning or more advanced models, performance could be improved. Overall, PySpark
proved to be an efficient tool for scalable text analysis tasks.

Sledge - Thesis
No ratings yet
Sledge - Thesis
105 pages
Foc QP 3
No ratings yet
Foc QP 3
18 pages
The Unconcept PDF
No ratings yet
The Unconcept PDF
16 pages
GC WebCollect Error Codes Overview 6.8.2
100% (1)
GC WebCollect Error Codes Overview 6.8.2
65 pages
Writing A Good Summary
No ratings yet
Writing A Good Summary
28 pages
Experiment Ii Introduction To Computer Numerical Control Ii
No ratings yet
Experiment Ii Introduction To Computer Numerical Control Ii
5 pages
MM Configuration Tips
No ratings yet
MM Configuration Tips
10 pages
Social N Regional Notes
No ratings yet
Social N Regional Notes
5 pages
Evaluation of The Bangor Dyslexia Test (BDT) For Use With Adults
No ratings yet
Evaluation of The Bangor Dyslexia Test (BDT) For Use With Adults
38 pages
Soal UAS Bahasa Inggris Kelas 6 Semester 1
100% (2)
Soal UAS Bahasa Inggris Kelas 6 Semester 1
2 pages
Linux Cheat Sheet
No ratings yet
Linux Cheat Sheet
3 pages
Philosophical Counselling, Truth and Self-Interpretation: David A. Jopling
No ratings yet
Philosophical Counselling, Truth and Self-Interpretation: David A. Jopling
14 pages
Simple Present Tense
No ratings yet
Simple Present Tense
50 pages
6.4.5 Packet Tracer - Configure Static NAT
No ratings yet
6.4.5 Packet Tracer - Configure Static NAT
2 pages
Present Simple + Present Continuous Theory
No ratings yet
Present Simple + Present Continuous Theory
5 pages
"Sentiment Analysis of Imdb Movie Reviews": A Project Report
0% (1)
"Sentiment Analysis of Imdb Movie Reviews": A Project Report
22 pages
IMDb Movie Review Sentiment Analysis
No ratings yet
IMDb Movie Review Sentiment Analysis
18 pages
Holographic Microscopy With Python and Holopy
No ratings yet
Holographic Microscopy With Python and Holopy
8 pages
Sentiment Analysis Using Text Mining PDF
100% (1)
Sentiment Analysis Using Text Mining PDF
12 pages
The Study of Surah Yaseen Lesson 04
No ratings yet
The Study of Surah Yaseen Lesson 04
6 pages
6 Strategies For Teaching Special Education Classes
No ratings yet
6 Strategies For Teaching Special Education Classes
2 pages
19cse214: Theory of Computation: Case Study Report
No ratings yet
19cse214: Theory of Computation: Case Study Report
5 pages
Teacher Ila'S English Lesson FORM 4 2020: Activities
No ratings yet
Teacher Ila'S English Lesson FORM 4 2020: Activities
2 pages
NLP Final Mini Project
No ratings yet
NLP Final Mini Project
17 pages
L102 Mid 2022
No ratings yet
L102 Mid 2022
4 pages
Sentiment Analysis Based On Performance of Linear Support Vector Machine and Multinomial Naïve Bayes Using Movie Reviews With Baseline Techniques
No ratings yet
Sentiment Analysis Based On Performance of Linear Support Vector Machine and Multinomial Naïve Bayes Using Movie Reviews With Baseline Techniques
19 pages
Memorandum Dated June 26, 2023 (Additional Instructions For The Deadline Extension For The Lra's Data Conversion Close-Out Activities)
No ratings yet
Memorandum Dated June 26, 2023 (Additional Instructions For The Deadline Extension For The Lra's Data Conversion Close-Out Activities)
3 pages
Assignment 2
No ratings yet
Assignment 2
6 pages
Data Science Project
No ratings yet
Data Science Project
24 pages
Taiko Drums - Trio
No ratings yet
Taiko Drums - Trio
5 pages
Alchemy of The Heart - Week 3 Article
No ratings yet
Alchemy of The Heart - Week 3 Article
2 pages
Sentiment Analysis On Online Reviews
No ratings yet
Sentiment Analysis On Online Reviews
11 pages
DL Project
No ratings yet
DL Project
21 pages
Sentiment Analysis Using Recurrent Neural Network
No ratings yet
Sentiment Analysis Using Recurrent Neural Network
7 pages
COMP 4650 6490 Assignment 3 2023-v1.1
No ratings yet
COMP 4650 6490 Assignment 3 2023-v1.1
6 pages
RPS Bhs Arab 2 Ekos 18 - 19
No ratings yet
RPS Bhs Arab 2 Ekos 18 - 19
5 pages
Cs221 Report
No ratings yet
Cs221 Report
16 pages
Maneesha Nidigonda Major Project
No ratings yet
Maneesha Nidigonda Major Project
11 pages
Synopsis
No ratings yet
Synopsis
8 pages
Pe3 - Week 3 4 - Classification of Dance
100% (1)
Pe3 - Week 3 4 - Classification of Dance
25 pages
Sentiment Analysis On IMDB Movie Reviews Using Machine Learning and Deep Learning Algorithms
No ratings yet
Sentiment Analysis On IMDB Movie Reviews Using Machine Learning and Deep Learning Algorithms
6 pages
Actual Reading 14
No ratings yet
Actual Reading 14
103 pages
Maneesha Nidigonda Verzeo Major Project
No ratings yet
Maneesha Nidigonda Verzeo Major Project
11 pages
An Enhanced Sentiment Analysis Using Machine Learning Methods in Imbalanced Movie Review Streams
No ratings yet
An Enhanced Sentiment Analysis Using Machine Learning Methods in Imbalanced Movie Review Streams
6 pages
Abstract
No ratings yet
Abstract
8 pages
Neural Networks
No ratings yet
Neural Networks
8 pages
Assignment Grade 11
No ratings yet
Assignment Grade 11
6 pages
F13 Final
No ratings yet
F13 Final
23 pages
"Sentiment Analysis of Imdb Movie Reviews": A Project Report
No ratings yet
"Sentiment Analysis of Imdb Movie Reviews": A Project Report
27 pages
Final Presentation
No ratings yet
Final Presentation
18 pages
BDA Report-Numbered
No ratings yet
BDA Report-Numbered
11 pages
AIML IA3 Loki & SG
No ratings yet
AIML IA3 Loki & SG
31 pages
Student Report Card Management Report
No ratings yet
Student Report Card Management Report
6 pages
Analyzing The Performance of Sentiment Analysis Using BERT DistilBERT and RoBERTa
No ratings yet
Analyzing The Performance of Sentiment Analysis Using BERT DistilBERT and RoBERTa
6 pages
FALLSEM2024-25 BCSE332P LO VL2024250102168 2024-10-07 Reference-Material-I
No ratings yet
FALLSEM2024-25 BCSE332P LO VL2024250102168 2024-10-07 Reference-Material-I
18 pages
Ebooks File History of Modern Psychology 4th Edition by C. James Goodwin A All Chapters
100% (3)
Ebooks File History of Modern Psychology 4th Edition by C. James Goodwin A All Chapters
24 pages
Iscs 476
No ratings yet
Iscs 476
18 pages
Sentiment Analysis of IMDb Movie Reviews Using LSTM
No ratings yet
Sentiment Analysis of IMDb Movie Reviews Using LSTM
4 pages
BDA Report Final
No ratings yet
BDA Report Final
11 pages
Ai Project
No ratings yet
Ai Project
15 pages
Q 3
No ratings yet
Q 3
2 pages
Research Paper Text Classification
No ratings yet
Research Paper Text Classification
17 pages
Satish Deep Learning Lab MAnual
No ratings yet
Satish Deep Learning Lab MAnual
85 pages
Unit 3 4
No ratings yet
Unit 3 4
6 pages
Chatgpt Tweets Sentiment Analysis Using Machine Learning and Data Classification
No ratings yet
Chatgpt Tweets Sentiment Analysis Using Machine Learning and Data Classification
11 pages
Project Report
No ratings yet
Project Report
9 pages
Case Stufy
No ratings yet
Case Stufy
4 pages
Amazon Sentiment Analysis Documentation
No ratings yet
Amazon Sentiment Analysis Documentation
4 pages
Prac - 5 (Aam)
No ratings yet
Prac - 5 (Aam)
1 page
431 Paper
No ratings yet
431 Paper
5 pages
Sentiment Analysis
100% (1)
Sentiment Analysis
35 pages
MLT 09
No ratings yet
MLT 09
3 pages
Detailed Report
No ratings yet
Detailed Report
6 pages
Kindle Review Sentiment Analysis - Ipynb - Colab
No ratings yet
Kindle Review Sentiment Analysis - Ipynb - Colab
5 pages
OKE JUGA - Sentiment Analysis of IMDb Movie Reviews Using Long Short-Term Memory
No ratings yet
OKE JUGA - Sentiment Analysis of IMDb Movie Reviews Using Long Short-Term Memory
4 pages
MN10
No ratings yet
MN10
13 pages
RajSingh WIexp7
No ratings yet
RajSingh WIexp7
8 pages
Deep Learning IMDB Model
No ratings yet
Deep Learning IMDB Model
2 pages
MN2
No ratings yet
MN2
17 pages
Sentiment Analysis Using LSTM
No ratings yet
Sentiment Analysis Using LSTM
2 pages
Sentiment Analysis of Movie Reviews Using Machine Learning: Members
No ratings yet
Sentiment Analysis of Movie Reviews Using Machine Learning: Members
17 pages
Practical 2
No ratings yet
Practical 2
4 pages
Dupesh
No ratings yet
Dupesh
9 pages
DS - Lab Report.
No ratings yet
DS - Lab Report.
25 pages
PES1PG24CS018 Debjit DLTP Assignment-2 Sentiment Analysis Report
No ratings yet
PES1PG24CS018 Debjit DLTP Assignment-2 Sentiment Analysis Report
8 pages
PDS - Proj - Report-2 RISHI B VATSAL P ANISHA M
No ratings yet
PDS - Proj - Report-2 RISHI B VATSAL P ANISHA M
49 pages
NM Project
No ratings yet
NM Project
18 pages
Case Study NLP
No ratings yet
Case Study NLP
4 pages
Cream and Dark Brown Aesthetic Abstract Corner Project Presentation - 20250702 - 205800 - 0000
No ratings yet
Cream and Dark Brown Aesthetic Abstract Corner Project Presentation - 20250702 - 205800 - 0000
17 pages
OpenCV 3 Blueprints
From Everand
OpenCV 3 Blueprints
Joseph Howse
No ratings yet