
Sentiment Analysis

The document outlines a project to build a sentiment analysis model for classifying tweets as positive, neutral, or negative using Python and various libraries. It details the process of data collection and preprocessing using the Sentiment140 dataset, including steps like tokenization and feature engineering. Additionally, it provides Python code snippets for loading and analyzing the dataset, as well as visualizing the distribution of sentiments.

1. Sentiment Analysis of Tweets (Text Classification)


Objective: Build a sentiment analysis model to classify tweets as positive, neutral, or negative
based on the text content.

Tools and Technologies:


Programming Language: Python
Libraries: pandas, nltk, scikit-learn, matplotlib, seaborn
Dataset: Use the Sentiment140 dataset (available on Kaggle) or the Twitter API to gather
tweet data.
Day-by-Day Breakdown:
Data Collection and Preprocessing

Load the dataset (e.g., Sentiment140) using pandas.


Preprocess the text by converting it to lowercase and removing stopwords and special characters.
Tokenize the text and apply stemming/lemmatization.
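The preprocessing steps above can be sketched as follows. This is a minimal illustration using only the standard library, not the project's actual pipeline: the stopword set is a small hypothetical subset, tokenization is a plain whitespace split, and the `normalize` parameter is an assumed hook where a stemmer or lemmatizer (for example, the `lemmatize` method of nltk's WordNetLemmatizer) could be plugged in.

```python
import re

# Illustrative stopword subset (an assumption); a real project would likely
# use a full list such as nltk.corpus.stopwords.words('english').
STOPWORDS = {"a", "an", "and", "the", "is", "it", "to", "of", "in", "i", "my", "so", "at"}

URL_OR_MENTION = re.compile(r"https?://\S+|www\.\S+|@\w+")
NON_LETTERS = re.compile(r"[^a-z\s]")


def preprocess(tweet, normalize=None):
    """Lowercase, strip URLs/mentions/special characters, tokenize, drop stopwords.

    `normalize` is an optional per-token callable, e.g. a stemmer's `stem`
    method or a lemmatizer's `lemmatize` method.
    """
    lowered = tweet.lower()
    no_links = URL_OR_MENTION.sub(" ", lowered)
    letters_only = NON_LETTERS.sub(" ", no_links)
    tokens = letters_only.split()  # simple whitespace tokenization
    tokens = [t for t in tokens if t not in STOPWORDS]
    if normalize is not None:
        tokens = [normalize(t) for t in tokens]
    return tokens


print(preprocess("@user I LOVED the new phone!!! http://example.com so good :)"))
# prints ['loved', 'new', 'phone', 'good']
```

Lowercasing happens first so that the stopword check and the letters-only pattern only have to handle lowercase input.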
Feature Engineering and Model Selection

PYTHON CODE
import re
import numpy as np
import pandas as pd
import seaborn as sns
from wordcloud import WordCloud
import matplotlib.pyplot as plt
from nltk.stem import WordNetLemmatizer
from sklearn.svm import LinearSVC
from sklearn.naive_bayes import BernoulliNB
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics import confusion_matrix, classification_report
# Importing the dataset
DATASET_COLUMNS = ['target', 'ids', 'date', 'flag', 'user', 'text']
DATASET_ENCODING = "ISO-8859-1"
df = pd.read_csv('Project_Data.csv', encoding=DATASET_ENCODING,
                 names=DATASET_COLUMNS)
df.sample(5)

df.head()

df.columns

output: Index(['target', 'ids', 'date', 'flag', 'user', 'text'], dtype='object')

print('length of data is', len(df))

output: length of data is 241985

df.shape

output: (241985, 6)

df.info()

output:
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 241985 entries, 0 to 241984
Data columns (total 6 columns):
 #   Column  Non-Null Count   Dtype
---  ------  --------------   -----
 0   target  241985 non-null  int64
 1   ids     241985 non-null  int64
 2   date    241985 non-null  object
 3   flag    241984 non-null  object
 4   user    241984 non-null  object
 5   text    241932 non-null  object
dtypes: int64(2), object(4)
memory usage: 11.1+ MB
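The df.info() output shows fewer non-null values in the text column (241932) than rows (241985), so some rows have no tweet text. One possible way to handle this before preprocessing is to drop those rows. The sketch below uses a small made-up frame with the same column names rather than the real file; the values are hypothetical.

```python
import pandas as pd

# Toy frame with the same column names as the dataset; the values are made up.
toy = pd.DataFrame({
    "target": [0, 4, 0],
    "text": ["bad day", None, "so tired"],
})

# Count missing values per column, then drop rows that have no tweet text.
missing_per_column = toy.isnull().sum()
clean = toy.dropna(subset=["text"]).reset_index(drop=True)

print(missing_per_column["text"])  # number of rows without text
print(len(clean))                  # rows remaining after the drop
```

Applied to the real data, the same two calls would be made on df instead of toy.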

df.dtypes

output:
target     int64
ids        int64
date      object
flag      object
user      object
text      object
dtype: object

print('Count of columns in the data is: ', len(df.columns))

print('Count of rows in the data is: ', len(df))

output:
Count of columns in the data is: 6
Count of rows in the data is: 241985

# Count tweets per target value and plot the class distribution
ax = df.groupby('target').size().plot(kind='bar', title='Distribution of data', legend=False)

# Sentiment140 encodes sentiment as 0 = negative, 2 = neutral, 4 = positive
label_names = {'0': 'Negative', '2': 'Neutral', '4': 'Positive'}
labels = [label_names.get(item.get_text(), item.get_text()) for item in ax.get_xticklabels()]
ax.set_xticklabels(labels, rotation=0)

# Keep the raw tweet text and its sentiment label as parallel lists
text, sentiment = list(df['text']), list(df['target'])

output: (bar chart titled 'Distribution of data', one bar per target value)

import seaborn as sns

sns.countplot(x='target', data=df)

output: (count plot of the target column)
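The outline lists "Feature Engineering and Model Selection" as the next step, and the imports at the top include TfidfVectorizer, train_test_split, LogisticRegression, and classification_report. A minimal sketch of how those pieces might fit together is shown below. It is an assumption about the intended workflow, not the project's actual code: the corpus is made up, and the n-gram range, split size, and choice of LogisticRegression over the other imported models are illustrative.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import classification_report
from sklearn.model_selection import train_test_split

# Made-up toy corpus; 0 = negative, 4 = positive as in Sentiment140.
texts = [
    "i love this phone", "what a great day", "so happy with the result",
    "best movie ever", "really good service", "i am very pleased",
    "i hate this phone", "what a terrible day", "so sad about the result",
    "worst movie ever", "really bad service", "i am very annoyed",
]
labels = [4] * 6 + [0] * 6

# Hold out a stratified test split so both classes appear in train and test.
X_train, X_test, y_train, y_test = train_test_split(
    texts, labels, test_size=0.25, stratify=labels, random_state=0
)

# Fit TF-IDF on the training split only, then reuse it for the test split.
vectorizer = TfidfVectorizer(ngram_range=(1, 2))
X_train_vec = vectorizer.fit_transform(X_train)
X_test_vec = vectorizer.transform(X_test)

model = LogisticRegression(max_iter=1000)
model.fit(X_train_vec, y_train)

predictions = model.predict(X_test_vec)
print(classification_report(y_test, predictions, zero_division=0))
```

Fitting the vectorizer on the training split alone keeps test-set vocabulary from leaking into the features. LinearSVC or BernoulliNB could be substituted for LogisticRegression with the same fit and predict calls.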
