Report Dhruv

The document is a summer internship report by Yash Goel, a BTech IT student, detailing his experience at Suven Consultants & Technology Pvt Ltd, where he developed a sentiment analysis project using the IMDb dataset. The report covers various aspects including the organization profile, project description, tech stack, and results of the internship. It highlights the skills gained in natural language processing and machine learning throughout the project.


UNIVERSITY SCHOOL OF INFORMATION,

COMMUNICATION AND TECHNOLOGY

ACADEMIC YEAR: 2024-2025

Summer Internship Report

at Suven Consultants & Technology Pvt Ltd.

Submitted By: Yash Goel


Course: BTech IT
Semester: 5th
Enrolment No: 00816401522
Index

S.No Topics

1 Declaration
2 Acknowledgement
3 Abstract
3.1 Organization Profile
3.2 Role
4 Introduction
4.1 About Internship
4.2 Tech Stack
5 Problem Statement
6 Dataset Description
7 Project Description
8 Code Snippets
9 Result
10 Bibliography
11 Certification

Declaration
I, Yash Goel, a student of Computer Science Engineering, 5th Semester at the
University School of Information, Communication & Technology, Dwarka,
hereby declare that the work presented in this project report was undertaken
in October 2024 under the mentorship of Mr. Rocky Jagtiani.

The matter embodied in this project report has not been submitted by me or
anybody else to any institution for award of any other degree or diploma
except to University School of Information, Communication & Technology,
for the fulfilment of the requirements for the award of degree of Bachelor of
Technology.

Yash Goel
00816401522

Acknowledgement
I would like to take this opportunity to express my sincere gratitude to
Suven Consultants & Technology Pvt Ltd. for providing me with an
internship opportunity at their organization. I am truly grateful for the chance
to gain practical experience and knowledge in my field of study.

Working here has been a wonderful learning experience for me, and I
have greatly appreciated the support and guidance of my supervisors. I
have also enjoyed getting to know my fellow interns and mentoring them
as part of a team.

I am thankful for the valuable skills and experience that I have gained during
my time here, and I am confident that they will be of great benefit to me in
my future endeavours.

I would also like to thank my teachers at University
School of Information Communication and Technology for encouraging
students to develop their skills, my parents for their consistent support,
and my mentor at USIC&T for their guidance.

Yash Goel
00816403221

Abstract
Organization Profile

Suven Consultants and Technology Pvt. Ltd., headquartered in Mumbai,
Maharashtra, is a premier IT training and consulting firm specializing in
providing industry-relevant technical skills and fostering talent development
for a variety of organizations and individual learners. Established with a
mission to bridge the skill gap between industry demands and job-ready
candidates, Suven Consultants offers a wide range of training programs and
services, particularly in cutting-edge technologies and engineering fields.

Suven Consultants and Technology Pvt. Ltd. strives to empower individuals
by equipping them with the knowledge and practical skills needed to succeed
in the ever-evolving tech landscape. Their vision is to be a leading force in
tech education and consultancy, continually adapting to technological
advancements and industry trends.

Role
During my tenure at Suven Consultants & Technology Pvt Ltd., I served as
an intern and single-handedly built a sentiment analysis project from
scratch, using a tech stack that was new to me and taking the project from
inception to completion within one month.

Introduction
About Internship
Throughout my internship, I independently executed a sentiment analysis
project using the IMDb dataset, building a complete machine learning
pipeline from data preprocessing to model evaluation. This journey
showcased a wide range of skills in NLP, data handling, and machine
learning, contributing to a deeper understanding of natural language
processing and its applications in sentiment analysis.

Data Preprocessing
Starting with raw data, I applied essential
preprocessing techniques to transform the text data into a format suitable
for machine learning. This stage involved handling missing values,
tokenization, stopword removal, and stemming/lemmatization, ensuring that
the data was clean and well-prepared for feature extraction.
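
The preprocessing steps can be illustrated with a minimal sketch. This is not the project's actual code: the function name preprocess and the tiny STOPWORDS set are assumptions made for illustration (NLTK's English stopword list is much longer), and lemmatization is left out because it needs NLTK's WordNet data to be downloaded first.

```python
import re

# Assumption: a tiny illustrative stopword set, not NLTK's full English list.
STOPWORDS = {"a", "an", "the", "is", "was", "it", "this", "and", "of", "to", "in", "i"}


def preprocess(review):
    """Lowercase a review, split it into word tokens, and drop stopwords."""
    if review is None:  # treat a missing review as empty text
        return []
    tokens = re.findall(r"[a-z']+", review.lower())  # simple regex tokenizer
    return [tok for tok in tokens if tok not in STOPWORDS]


print(preprocess("This movie was GREAT! I loved the acting."))
# ['movie', 'great', 'loved', 'acting']
```

In a fuller pipeline, each remaining token would then be passed through a stemmer or lemmatizer so that forms such as "loved" and "loving" map to the same base word.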

Feature Engineering
To enhance the predictive capability of the models, I carefully engineered
features that could capture the nuances of the text data. This involved using
techniques like bag-of-words and custom features based on the length and
structure of the reviews. These features helped improve the model's ability
to detect sentiment effectively.
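
To make the idea concrete, the sketch below builds bag-of-words counts by hand and appends two length-based features. The report does not list the exact custom features used, so the token count and average token length here are hypothetical choices, as are the function names.

```python
from collections import Counter


def build_vocabulary(token_lists):
    """Map every distinct token to a column index (sorted for determinism)."""
    vocab = sorted({tok for tokens in token_lists for tok in tokens})
    return {tok: i for i, tok in enumerate(vocab)}


def featurize(tokens, vocab):
    """Return bag-of-words counts followed by two hand-made length features."""
    counts = Counter(tokens)
    bow = [counts.get(tok, 0) for tok in vocab]  # dict keeps insertion order
    n_tokens = len(tokens)
    avg_len = sum(len(t) for t in tokens) / n_tokens if n_tokens else 0.0
    return bow + [n_tokens, avg_len]


# Made-up, already tokenized reviews.
docs = [["great", "movie", "great", "cast"], ["boring", "movie"]]
vocab = build_vocabulary(docs)
rows = [featurize(d, vocab) for d in docs]

print(list(vocab))  # ['boring', 'cast', 'great', 'movie']
print(rows[0])      # [0, 1, 2, 1, 4, 4.75]
print(rows[1])      # [1, 0, 0, 1, 2, 5.5]
```

Each review becomes one fixed-length row, so reviews of any length can be fed to the same classifier.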

Model Selection
I explored different algorithms for sentiment classification, including logistic
regression and a random classifier, to analyze and compare their
performance on the IMDb dataset. Using Scikit-learn, I trained both models
on the dataset, fine-tuning parameters to optimize their predictive accuracy
and efficiency.
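
A comparison of this kind might look like the following sketch. The report does not say which Scikit-learn estimator "random classifier" refers to, so this example assumes a chance-level baseline (DummyClassifier with strategy="uniform"); if a random forest was meant, RandomForestClassifier could be swapped in. The eight reviews are made up and stand in for the IMDb data.

```python
from sklearn.dummy import DummyClassifier
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

# Made-up toy reviews (1 = positive, 0 = negative), not the IMDb data.
reviews = [
    "a wonderful and moving film",
    "great acting and a great story",
    "I loved every minute",
    "brilliant, funny and touching",
    "a boring and terrible film",
    "awful acting and a weak story",
    "I hated every minute",
    "dull, slow and forgettable",
]
labels = [1, 1, 1, 1, 0, 0, 0, 0]

X_train, X_test, y_train, y_test = train_test_split(
    reviews, labels, test_size=0.25, stratify=labels, random_state=0
)

# Learn the vocabulary on the training text only, then reuse it for the test text.
vectorizer = CountVectorizer()
X_train_bow = vectorizer.fit_transform(X_train)
X_test_bow = vectorizer.transform(X_test)

models = {
    "logistic_regression": LogisticRegression(max_iter=1000),
    "random_baseline": DummyClassifier(strategy="uniform", random_state=0),
}

scores = {}
for name, model in models.items():
    model.fit(X_train_bow, y_train)
    scores[name] = accuracy_score(y_test, model.predict(X_test_bow))

print(scores)
```

With so little data the scores themselves mean nothing. The point is the shape of the comparison: both models see the same split and the same features.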

Model Evaluation
Finally, I evaluated each model's performance using metrics like accuracy,
precision, and recall, gaining insights into each model’s strengths and
limitations. This step allowed me to assess which model was best suited for
sentiment analysis on this dataset, refining the solution for real-world
applications.
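
For reference, the three metrics can be computed directly from the counts of true positives (TP), false positives (FP), and false negatives (FN): accuracy = correct / total, precision = TP / (TP + FP), recall = TP / (TP + FN). The sketch below uses made-up labels and a hypothetical helper name; Scikit-learn's accuracy_score, precision_score, and recall_score compute the same values.

```python
def classification_metrics(y_true, y_pred, positive=1):
    """Compute accuracy, precision, and recall for a binary classifier."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    correct = sum(1 for t, p in pairs if t == p)
    return {
        "accuracy": correct / len(pairs),
        "precision": tp / (tp + fp) if tp + fp else 0.0,
        "recall": tp / (tp + fn) if tp + fn else 0.0,
    }


# Made-up labels: 1 = positive review, 0 = negative review.
y_true = [1, 1, 1, 0, 0, 0, 1, 0]
y_pred = [1, 0, 1, 0, 1, 0, 0, 0]

metrics = classification_metrics(y_true, y_pred)
print(metrics)  # accuracy 0.625, precision 2/3, recall 0.5
```

Here the model finds only half of the positive reviews (recall 0.5), but two of its three positive predictions are right (precision 2/3), which is the kind of trade-off that accuracy alone would hide.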

Tech Stack
The core of the sentiment analysis project is developed in Python, using libraries for
data processing, feature extraction, model building, and evaluation.

Pandas: Used for data manipulation and preprocessing, allowing efficient
cleaning, filtering, and transformation of the IMDb dataset.

NumPy: Employed for numerical operations, enabling efficient data handling
and mathematical calculations.

Matplotlib: Used for data visualization, helping to generate insightful plots
and charts for understanding data distribution and model performance.

Plotly: Provides interactive visualizations, making data exploration and
analysis more intuitive and engaging.

NLTK: Used for various natural language processing tasks, such as
tokenization, stopword removal, and lemmatization, to prepare text data for
machine learning.

WordCloud: Employed to generate word clouds, providing visual insight into
the most common words in positive and negative reviews.

Scikit-learn: Implements key machine learning components, including:

CountVectorizer: For feature extraction by transforming text data into a
bag-of-words model.

Logistic Regression and Random Classifier: Used for training models on
sentiment data, providing baseline and performance comparison.

Train-test Split: Splits data into training and testing sets, enabling model
validation and performance measurement.

Tokenization: Segments text into individual words or tokens, facilitating
further processing.

Stopword Removal: Eliminates common words that do not contribute to
sentiment, reducing noise in the data.

Lemmatization: Reduces words to their base form, ensuring uniformity and
improving model accuracy.

Problem Statement

In today’s digital landscape, online reviews play a significant role in shaping
consumer opinions, especially in industries like entertainment where feedback
on movies and shows can influence audience decisions. With the ever-increasing
volume of online reviews, manually analyzing each review to
determine its sentiment (positive or negative) is neither efficient nor scalable.

The goal of this project is to develop a Natural Language Processing (NLP)
solution to automate sentiment analysis on the IMDb movie review dataset. By
leveraging the capabilities of NLP, specifically using NLTK, and machine
learning algorithms, this project aims to classify each review as either positive
or negative. This classification will provide valuable insights into overall
audience sentiment for movies and aid in understanding the public's reaction
to specific content. Furthermore, it will enable filmmakers, marketers, and other
stakeholders in the entertainment industry to make data-driven decisions
based on the collective sentiment expressed in these reviews.

By utilizing the NLTK library, the project will implement various Natural
Language Processing techniques, such as text normalization and feature
extraction, to enhance the performance of machine learning models. The
analysis will involve training classifiers on a labeled dataset, allowing the
system to learn patterns associated with positive and negative sentiments.

Dataset Description
Basic Statistics of Data:
• The IMDb review dataset contains four columns: Ratings, Reviews,
Movies, Resenhas
• Number of Reviews: 149780
• Number of Movies: 14205

Attribute Information:
1. Reviews: User review in English
2. Ratings: Rating from 1 to 10
3. Movies: Movie names
4. Resenhas: User review translated into Portuguese
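
Because the dataset provides ratings from 1 to 10 rather than positive/negative labels, a binary target has to be derived before training. The report does not state how this was done, so the sketch below is only one possible approach under explicit assumptions: ratings of 7 and above count as positive, ratings of 4 and below as negative, and middle ratings are left out. Dropping the Portuguese column and rows with missing reviews is likewise an assumption. The rows are made up and only mirror the four column names.

```python
import pandas as pd

# Made-up rows that mirror the four columns named above.
df = pd.DataFrame({
    "Ratings": [9, 2, 5, 8, 1],
    "Reviews": ["Loved it", "Terrible plot", "It was okay", None, "Awful"],
    "Movies": ["Movie A", "Movie B", "Movie C", "Movie D", "Movie E"],
    "Resenhas": ["Adorei", "Enredo terrível", "Foi ok", None, "Horrível"],
})

# Assumption: work with English text only and skip rows with no review.
df = df.drop(columns=["Resenhas"]).dropna(subset=["Reviews"])

# Assumed thresholds: >= 7 is positive, <= 4 is negative, the rest is dropped.
df = df[(df["Ratings"] >= 7) | (df["Ratings"] <= 4)].copy()
df["sentiment"] = (df["Ratings"] >= 7).astype(int)

print(df[["Ratings", "Reviews", "sentiment"]])
```

Leaving out the middle ratings keeps ambiguous reviews from blurring the two classes, at the cost of discarding some data.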

Project Description

This project involves the development of a sentiment analysis system
utilizing the IMDb movie review dataset, aimed at classifying reviews as
either positive or negative. The primary objective is to leverage Natural
Language Processing (NLP) techniques to automate the analysis of
user-generated reviews, providing insights into audience sentiment and
helping stakeholders in the entertainment industry make informed
decisions.

The project consists of several stages, beginning with data
preprocessing, where raw text data from reviews is cleaned and
prepared for analysis. This includes tokenization, which breaks the text
into manageable pieces, and stopword removal to eliminate common
words that do not contribute significantly to sentiment. Additionally,
lemmatization is applied to standardize words, ensuring that different
forms of a word are treated uniformly.

Following preprocessing, the project moves into feature engineering.
This stage focuses on converting the textual data into a numerical
format suitable for machine learning algorithms. Techniques such as the
bag-of-words model are employed to create feature vectors that capture
the essence of the reviews, highlighting sentiment-related patterns.

The next phase is model selection, where various machine learning
algorithms are trained on the prepared dataset. The emphasis is placed
on using logistic regression and a random classifier to evaluate their
effectiveness in accurately classifying the sentiments of the reviews.
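
The stages described above can be chained into a single object, so that unseen reviews go through the same vectorization as the training data. The following is a hedged sketch using Scikit-learn's make_pipeline, not the project's actual code: the training sentences are made up, and stop_words="english" uses Scikit-learn's built-in stopword list in place of NLTK's.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

# Made-up training reviews (1 = positive, 0 = negative), not the IMDb data.
train_reviews = [
    "a wonderful and moving film",
    "great acting and a great story",
    "brilliant, funny and touching",
    "a boring and terrible film",
    "awful acting and a weak story",
    "dull, slow and forgettable",
]
train_labels = [1, 1, 1, 0, 0, 0]

# Bag-of-words features feed straight into the classifier.
model = make_pipeline(
    CountVectorizer(stop_words="english"),
    LogisticRegression(max_iter=1000),
)
model.fit(train_reviews, train_labels)

new_reviews = ["a wonderful and touching story", "boring and terrible"]
predictions = model.predict(new_reviews)          # one 0/1 label per review
probabilities = model.predict_proba(new_reviews)  # columns follow model.classes_

print(predictions)
print(probabilities.round(2))
```

Keeping the vectorizer inside the pipeline means its vocabulary is fitted only on the training reviews, which avoids leaking information from the test set.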

Code Snippets

Result

My journey with Suven Consultants & Technology Pvt Ltd. on the
sentiment analysis project utilizing the IMDb movie review dataset has
been highly rewarding, significantly enhancing my skills in natural
language processing and machine learning. Over the course of the
project, I successfully progressed through various stages, from data
preprocessing and feature engineering to model selection and
evaluation. This hands-on experience allowed me to apply theoretical
concepts in a practical context, deepening my understanding of
sentiment analysis techniques.

The successful classification of reviews as positive or negative
highlighted my ability to navigate challenges and make informed
decisions throughout the development process. Achieving reliable
sentiment analysis results not only showcased my technical skills but
also emphasized my commitment to delivering meaningful insights into
audience feedback. This project has enriched my knowledge and laid a
strong foundation for future endeavors in the field of data science and
machine learning.

Bibliography

1. https://www.kaggle.com
2. https://www.imdb.com/
3. https://www.nltk.org/
4. https://scikit-learn.org/stable/
5. https://pandas.pydata.org/
6. https://towardsdatascience.com
7. https://medium.com
8. https://stanfordnlp.github.io/CoreNLP/

Certification
