Project Report Sample-1
Submitted By
Bazgha Razi
(Reg No: 191380214)
Signature
Delhi Global Institute of Technology
ACKNOWLEDGEMENT
Bazgha Razi
Reg No: 191380214
B.Tech (CSE)
DECLARATION
Bazgha Razi
Reg No: 191380214
B.Tech (CSE)
CONTENTS
CERTIFICATE
ACKNOWLEDGEMENT
DECLARATION
1. INTRODUCTION
1.1 Introduction
2. METHODOLOGY
2.1 Methodology
3. PROJECT DETAILS
3.1 Name of the Project
3.2 Hardware Used
3.3 Software Used
3.4 Libraries Used
3.5 Version Control / IDE
4. TWITTER SENTIMENT OPINION MINING
4.1 What is Twitter Sentiment Opinion Mining?
4.2 Problem Statement
4.3 Why Twitter Sentiment Opinion Mining?
4.4 Application Area
4.5 Block Diagram
5. WEEKLY TIME MANAGEMENT
5.1 Weekly Work for the Project
6. TECHNOLOGY USED
6.1 What is an IDE?
6.2 Python
6.3 Matplotlib
6.4 NumPy
6.5 NLTK
6.6 Scikit-Learn
7. BUILDING TWITTER SENTIMENT ANALYSIS
7.1 Import the Necessary Dependencies
7.2 Read and Load the Dataset
7.3 Exploratory Data Analysis
7.4 Data Visualization of Target Variables
7.5 Data Preprocessing
7.6 Splitting Our Data into Train and Test Subsets
7.7 Transforming the Dataset Using the TF-IDF Vectorizer
7.8 Function for Model Evaluation
7.9 Model Building
7.10 Model Evaluation
8. TESTING
8.1 Testing
9. LEARNINGS AND VALUE ADDITIONS
9.1 Learnings and Value Additions
9.2 Theoretical vs. Practical Knowledge
10. LIMITATIONS
10.1 Limitations
11. CONCLUSION
11.1 Conclusion
12. FUTURE SCOPE
12.1 Future Scope
13. REFERENCES
13.1 References
INTRODUCTION
Software Used
Windows (7,10,11)
VS Code (Python IDE)
Problem Statement
In this project, we implement an NLP Twitter sentiment
opinion mining model that helps to overcome the challenges of
sentiment classification of tweets, classifying each tweet as carrying
positive or negative sentiment. The key details of the dataset used
for this Twitter sentiment analysis project are:
The dataset provided is the Sentiment140 Dataset which consists
of 1,600,000 tweets that have been extracted using the Twitter
API.[7] The various columns present in this Twitter data are:
• target: the polarity of the tweet (positive or negative)
• ids: the unique ID of the tweet
• date: the date of the tweet
• flag: the query used to collect the tweet; if there is no query,
the value is NO_QUERY
• user: the name of the user who tweeted
• text: the text of the tweet
Why Twitter Sentiment Opinion Mining?
1. Understanding Customer Feedback: By analyzing the
sentiment of customer feedback, companies can identify areas
where they need to improve their products or services.
2. Reputation Management: Sentiment analysis can help
companies monitor their brand reputation online and quickly
respond to negative comments or reviews.
3. Political Analysis: Sentiment analysis can help political
campaigns understand public opinion and tailor their messaging
accordingly.
4. Crisis Management: In the event of a crisis, sentiment analysis
can help organizations monitor social media and news outlets
for negative sentiment and respond appropriately.
5. Marketing Research: Sentiment analysis can help marketers
understand consumer behavior and preferences, and develop
targeted advertising campaigns.
The key here is the interpreter, which is responsible for translating
high-level Python code into low-level machine language.
The way Python works is as follows:
1. A Python virtual environment is created where the packages
(libraries) are installed. Think of a virtual environment as a
container.
2. The Python code is then written in .py files.
3. CPython compiles the Python code to bytecode. This bytecode
is for the Python virtual machine.
4. When you want to execute the bytecode then the code will be
interpreted at runtime. The code will then be translated from the
bytecode into the machine code. The bytecode is not dependent
on the machine on which you are running the code. This makes
Python machine-independent.
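As a small illustration of steps 3 and 4, CPython's built-in dis module can show the bytecode a function is compiled to before the interpreter executes it (a minimal sketch; the function itself is just an example):

    import dis

    def greet(name):
        # CPython compiles this function body to bytecode when it is defined.
        return "Hello, " + name

    # Disassemble the function to inspect the bytecode the interpreter runs.
    dis.dis(greet)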
Matplotlib
Matplotlib is a comprehensive library for creating static, animated,
and interactive visualizations in Python. Matplotlib makes easy things
easy and hard things possible.
• Create publication quality plots.
• Make interactive figures that can zoom, pan, update.
• Customize visual style and layout.
• Export to many file formats.
• Embed in JupyterLab and Graphical User Interfaces.
• Use a rich array of third-party packages built on Matplotlib.
Matplotlib is a Python 2D plotting library which produces publication
quality figures in a variety of hardcopy formats and interactive
environments across platforms. Matplotlib can be used in Python
scripts, the Python and IPython shells, the Jupyter notebook, web
application servers, and four graphical user interface toolkits.
There are several toolkits available that extend Matplotlib's
functionality. Some of them are separate downloads; others ship with
the Matplotlib source code but have external dependencies.
• Basemap: It is a map plotting toolkit with various map
projections, coastlines, and political boundaries.
• Cartopy: It is a mapping library featuring object-oriented map
projection definitions, and arbitrary point, line, polygon and
image transformation capabilities.
• Excel tools: Matplotlib provides utilities for exchanging data
with Microsoft Excel.
• Mplot3d: It is used for 3-D plots.
• Natgrid: It is an interface to the natgrid library for gridding
irregularly spaced data.
There are various plots which can be created using Matplotlib, such
as the line plot, bar graph, histogram, scatter plot, area plot, and pie
chart.
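As a minimal sketch of creating one such plot (the data points are made up for illustration):

    import matplotlib.pyplot as plt

    # Illustrative data only.
    x = [1, 2, 3, 4, 5]
    y = [1, 4, 9, 16, 25]

    plt.plot(x, y, marker="o")   # line plot with point markers
    plt.xlabel("x")
    plt.ylabel("y")
    plt.title("A simple Matplotlib line plot")
    plt.show()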
NumPy
NumPy is the fundamental package for scientific computing in
Python. It is a Python library that provides a multidimensional array
object, various derived objects (such as masked arrays and matrices),
and an assortment of routines for fast operations on arrays, including
mathematical, logical, shape manipulation, sorting, selecting, I/O,
discrete Fourier transforms, basic linear algebra, basic statistical
operations, random simulation and much more.
At the core of the NumPy package, is the ndarray object. This
encapsulates n-dimensional arrays of homogeneous data types, with
many operations being performed in compiled code for performance.
There are several important differences between NumPy arrays and
the standard Python sequences:
• NumPy arrays have a fixed size at creation, unlike Python lists
(which can grow dynamically). Changing the size of
an ndarray will create a new array and delete the original.
• The elements in a NumPy array are all required to be of the
same data type, and thus will be the same size in memory. The
exception: one can have arrays of (Python, including NumPy)
objects, thereby allowing for arrays of different sized elements.
• NumPy arrays facilitate advanced mathematical and other types
of operations on large numbers of data. Typically, such
operations are executed more efficiently and with less code than
is possible using Python’s built-in sequences.
• A growing plethora of scientific and mathematical Python-based
packages are using NumPy arrays; though these typically
support Python-sequence input, they convert such input to
NumPy arrays prior to processing, and they often output NumPy
arrays. In other words, in order to efficiently use much (perhaps
even most) of today’s scientific/mathematical Python-based
software, just knowing how to use Python’s built-in sequence
types is insufficient - one also needs to know how to use NumPy
arrays.
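A short sketch illustrating these points, namely the homogeneous data type, the fixed size at creation, and vectorized operations (the values are illustrative):

    import numpy as np

    # Create a homogeneous, fixed-size array from a Python list.
    a = np.array([1, 2, 3, 4], dtype=np.int64)

    # Vectorized arithmetic runs in compiled code, element by element.
    b = a * 2                   # array([2, 4, 6, 8])

    # "Resizing" produces a new array; the original's size is fixed at creation.
    c = np.append(a, 5)

    print(a.dtype, a.shape)     # int64 (4,)
    print(b.sum())              # 20
    print(c)                    # [1 2 3 4 5]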
NLTK
The Natural Language Toolkit (NLTK) is a platform for building
Python programs that work with human language data, for use in
statistical natural language processing (NLP).
It contains text processing libraries for tokenization, parsing,
classification, stemming, tagging and semantic reasoning. It also
includes graphical demonstrations and sample data sets, and is
accompanied by a cookbook and a book that explain the principles
behind the underlying language processing tasks that NLTK
supports.
The Natural Language Toolkit is an open source library for the
Python programming language originally written by Steven Bird,
Edward Loper and Ewan Klein for use in development and education.
It comes with a hands-on guide that introduces topics in
computational linguistics as well as programming fundamentals for
Python, which makes it suitable for linguists with no deep
programming knowledge, for engineers and researchers who need to
delve into computational linguistics, and for students and educators.
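A minimal sketch of NLTK tokenization and stemming (it assumes the 'punkt' tokenizer data has been downloaded once):

    import nltk
    from nltk.stem import PorterStemmer
    from nltk.tokenize import word_tokenize

    nltk.download("punkt")          # tokenizer models; needed once

    stemmer = PorterStemmer()
    tokens = word_tokenize("The movies were entertaining!")
    stems = [stemmer.stem(t) for t in tokens]

    print(tokens)   # ['The', 'movies', 'were', 'entertaining', '!']
    print(stems)    # ['the', 'movi', 'were', 'entertain', '!']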
Scikit-Learn
Scikit-Learn is a library in Python that provides many unsupervised
and supervised learning algorithms. It’s built upon some of the
technology you might already be familiar with, like NumPy, pandas,
and Matplotlib!
The functionality that scikit-learn provides includes:
• Regression, including Linear and Logistic Regression
• Classification, including K-Nearest Neighbors
• Clustering, including K-Means and K-Means++
• Model selection
• Preprocessing, including Min-Max Normalization
Scikit-learn (Sklearn) is the most useful and robust library for
machine learning in Python. It provides a selection of efficient tools
for machine learning and statistical modeling including classification,
regression, clustering and dimensionality reduction via a consistent
interface in Python.
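A small sketch of this consistent fit/predict interface, using a toy dataset bundled with scikit-learn:

    from sklearn.datasets import load_iris
    from sklearn.linear_model import LogisticRegression
    from sklearn.model_selection import train_test_split

    # Load a bundled toy dataset and hold out 20% of it for testing.
    X, y = load_iris(return_X_y=True)
    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=0.2, random_state=42)

    # Every scikit-learn estimator exposes the same fit/predict interface.
    model = LogisticRegression(max_iter=200)
    model.fit(X_train, y_train)
    print(model.score(X_test, y_test))   # mean accuracy on the held-out split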
Building Twitter Sentiment Opinion Mining
Step-3: Exploratory Data Analysis
3.1: Top five records of the data
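A sketch of how loading the data and viewing the top records might look; the file name and encoding follow the common Sentiment140 distribution, and the column names follow the dataset description above:

    import pandas as pd

    # Column order as described for the Sentiment140 dataset.
    columns = ["target", "ids", "date", "flag", "user", "text"]
    df = pd.read_csv("training.1600000.processed.noemoticon.csv",
                     encoding="ISO-8859-1", names=columns)

    print(df.head())   # top five records of the data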
3.7: Checking for null values
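Continuing the sketch above, the null check is a one-liner:

    # Count missing values per column; Sentiment140 normally has none.
    print(df.isnull().sum())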
Step-4: Data Visualization of Target Variables
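A possible sketch of this visualization, reusing the df from the loading step (in the raw Sentiment140 labels, 0 is negative and 4 is positive):

    import matplotlib.pyplot as plt

    # Bar chart of how many tweets fall into each sentiment class.
    df["target"].value_counts().plot(kind="bar")
    plt.xlabel("Sentiment class (0 = negative, 4 = positive)")
    plt.ylabel("Number of tweets")
    plt.title("Distribution of the target variable")
    plt.show()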
Step-5: Data Preprocessing
For the problem statement given above, before training the model we
performed various preprocessing steps on the dataset. These mainly
dealt with removing stopwords and special characters such as emojis
and hashtags. The text is then converted to lowercase for better
generalization. Subsequently, punctuation is cleaned and removed,
reducing unnecessary noise in the dataset. After that, we also
collapsed repeating characters in words and removed URLs, as they
carry no significant importance.
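A condensed sketch of these preprocessing steps; the regular expressions and stopword list here are illustrative stand-ins, not the exact ones used:

    import re
    import nltk
    from nltk.corpus import stopwords

    nltk.download("stopwords")               # needed once
    stop_words = set(stopwords.words("english"))

    def preprocess(text):
        text = text.lower()                                  # lowercase
        text = re.sub(r"https?://\S+|www\.\S+", "", text)    # remove URLs
        text = re.sub(r"[^a-z\s]", "", text)                 # punctuation, emojis, digits
        text = re.sub(r"(.)\1{2,}", r"\1\1", text)           # collapse repeating characters
        words = [w for w in text.split() if w not in stop_words]
        return " ".join(words)

    df["text"] = df["text"].apply(preprocess)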
5.1: Selecting the text and target columns for our further analysis
5.17: Separating input feature and label
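A sketch of separating the input feature and label, followed by the train/test split and the TF-IDF transformation listed in the contents; the split ratio and vectorizer parameters are assumptions:

    from sklearn.feature_extraction.text import TfidfVectorizer
    from sklearn.model_selection import train_test_split

    # Separate input feature and label.
    X = df["text"]
    y = df["target"]

    # Hold out part of the data for testing.
    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=0.2, random_state=42)

    # Fit TF-IDF on the training text only, then transform both subsets.
    vectorizer = TfidfVectorizer(ngram_range=(1, 2), max_features=500000)
    X_train = vectorizer.fit_transform(X_train)
    X_test = vectorizer.transform(X_test)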
The models are evaluated using:
• Accuracy Score
• ROC-AUC Curve
Among the classifiers built is:
• Logistic Regression
The idea behind choosing these models is that we want to try all the classifiers
on the dataset ranging from simple ones to complex models, and then try to find
out the one which gives the best performance among them.
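A sketch of the shared evaluation helper and the model-building pattern; Logistic Regression is the model named in this report, while its hyperparameters here are assumptions:

    from sklearn.linear_model import LogisticRegression
    from sklearn.metrics import accuracy_score, classification_report

    def model_evaluate(model):
        # Shared helper: accuracy score plus a per-class report.
        y_pred = model.predict(X_test)
        print("Accuracy:", accuracy_score(y_test, y_pred))
        print(classification_report(y_test, y_pred))

    model = LogisticRegression(C=2, max_iter=1000)
    model.fit(X_train, y_train)
    model_evaluate(model)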
9.1: Model-1
9.4: Plot the ROC-AUC Curve for model-2
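A sketch of plotting the ROC-AUC curve; it assumes a fitted linear model such as the one above, which exposes decision_function:

    import matplotlib.pyplot as plt
    from sklearn.metrics import auc, roc_curve

    # Scores for the positive class (label 4 in the raw Sentiment140 data).
    y_scores = model.decision_function(X_test)
    fpr, tpr, _ = roc_curve(y_test, y_scores, pos_label=4)
    roc_auc = auc(fpr, tpr)

    plt.plot(fpr, tpr, label="ROC curve (AUC = %0.2f)" % roc_auc)
    plt.plot([0, 1], [0, 1], linestyle="--")   # chance line
    plt.xlabel("False Positive Rate")
    plt.ylabel("True Positive Rate")
    plt.title("ROC-AUC Curve")
    plt.legend(loc="lower right")
    plt.show()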
9.5: Model-3
AUC Score: All three models have the same ROC-AUC score.
We therefore conclude that Logistic Regression is the best model for
the given dataset.
• Unit testing is done to show that a unit does not satisfy its
functional specification and/or that its implemented structure does
not match the intended design structure.
Integration Testing
Focuses on combining units to evaluate the interactions among them.
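As a sketch of what a unit test for this project could look like; the preprocess function under test is the hypothetical cleaning helper sketched in the preprocessing step, and both expected behaviors are assumptions about it:

    import unittest

    class TestPreprocess(unittest.TestCase):
        def test_removes_urls_and_lowercases(self):
            # Assumed behaviors of the hypothetical preprocess helper.
            result = preprocess("Check THIS out: https://example.com")
            self.assertNotIn("https", result)
            self.assertEqual(result, result.lower())

    if __name__ == "__main__":
        unittest.main()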
1. UNREPRESENTATIVE USERS
It is tempting to assume that Twitter's users mirror the general
population. Sadly, this isn't really the case. Twitter's users do not
represent the general population of the world, or even the population
of their particular regions. They tend to be younger, more left-leaning,
and more affluent than the overall population. Just think about it: when
was the last time your grandfather used Twitter? This means that if
you're doing research on a possible customer base or political issue,
focusing on Twitter alone can lead you to somewhat erroneous
conclusions.
2. RETWEETS
A major part of Twitter is the ability to ‘retweet’ a specific tweet, often with
your own commentary added to it. Retweets are often understood as
endorsements of a position, but it is not rare to see a tweet retweeted along with
a criticism of it.
3. CHARACTER LIMITS AND ABBREVIATIONS
The character limit on tweets often means users abbreviate words. These words
tend to be stop words, so this may not be a particularly problematic issue. But
when conversations focus on specific topics, people often invent acronyms on
the fly or omit words whose presence can be inferred by humans. These
acronyms and omissions could possibly be detected by a machine learning
technique, but that’s a whole other layer that complicates the matter.
In addition, Twitter data is going to have some errors due to autocorrect (and
normal spelling errors too, of course). Unlike spelling errors, autocorrect errors
are harder to detect since the problematic word is a real word, just used in the
wrong way. Again, there are certainly tools that can correct for this, but using
them adds an additional layer of complexity to any Twitter-based research
project.
[15] Liu, S., Li, F., Li, F., Cheng, X., & Shen, H. Adaptive
co-training SVM for sentiment classification on tweets. In
Proceedings of the 22nd ACM International Conference on
Information & Knowledge Management (pp. 2079-2088). ACM, 2013.