CCL MiniProject


Dept: Computer Engineering

Subject: Cloud Computing Lab Subject Code: CSL605


Year/Semester: TE-VI Date: Page No:
Student Name: Roll No. Division:

Mini Project
Analysing Social Media Reaction on Political Issues using Machine
Learning

Abstract:
Twitter is currently one of the most popular social media platforms. It lets its users post
their thoughts on any topic, usually as short, length-limited messages. Its massive user base
has made Twitter a valuable source of data for analysing people's behaviour and how they tend
to react to a given political issue. Unfortunately, textual postings are difficult to analyse
because the dimensionality of the data is very high, which makes clustering hard. One needs to
find an appropriate method that clusters Twitter postings with an acceptable result. This study
presents the clustering of Twitter users based on the most common words they use when
reacting to a trending political issue. A comparative study of hierarchical clustering and
k-means clustering is presented and discussed, together with the word trend or main topic of
the issue, shown as a histogram and a word cloud.
List of Abbreviations:
AI: Artificial Intelligence
DFD: Data Flow Diagram
Et al: And Others
ML: Machine Learning
scikit-learn: Python library for Machine Learning
UML: Unified Modelling Language
1. INTRODUCTION

1.1 Introduction
Social media has a massive number of users who share their thoughts on particular topics,
which makes it a valuable dataset for analysing human behaviour and trends at a certain
time and place. Social media postings vary from individual facts and opinions to facts and
opinions cited from the news. Twitter is considered an appropriate site for collecting
what people share publicly, thanks to its large number of users and the volume of data they
produce.
Clustering was chosen as the method because the target variable is not set or labelled.
Each target therefore has to be identified with the help of unsupervised machine learning,
in this case clustering. The texts will be clustered based on the most common similar words
tweeted by the users.
Key Features:
i. Grouping of similar data points to discover underlying patterns using
K-means.
ii. A most-frequent-words plot that shows the words which appear in relation
to a particular topic.
iii. A word cloud that shows the most frequent words regarding the issue.
The words that appear in the word cloud indicate the main topic or trend
among the Twitter users.
iv. A dendrogram that shows the relationships between similar sets of data.
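
The word counts behind the most-frequent-words plot and the word cloud can be sketched as
follows. This is a minimal illustration, not the project's actual code: the stop-word list,
the cleaning rules, the function names and the sample tweets are all assumptions made for
the example.

```python
import re
from collections import Counter

# Hypothetical, deliberately tiny stop-word list; a real project would use a fuller one.
STOP_WORDS = {"the", "a", "is", "to", "in", "of", "and", "for", "on", "rt"}


def clean_tweet(text):
    """Lowercase a tweet and strip URLs, @mentions and non-letter characters."""
    text = text.lower()
    text = re.sub(r"https?://\S+", " ", text)  # remove links
    text = re.sub(r"@\w+", " ", text)          # remove user mentions
    text = re.sub(r"[^a-z\s]", " ", text)      # drop digits, punctuation and '#'
    return [word for word in text.split() if word not in STOP_WORDS]


def top_words(tweets, n=3):
    """Return the n most frequent cleaned words across all tweets."""
    counts = Counter()
    for tweet in tweets:
        counts.update(clean_tweet(tweet))
    return counts.most_common(n)


# Made-up sample tweets, for illustration only.
sample_tweets = [
    "RT @news: The election is on Monday! https://example.com/poll",
    "Vote in the #election, every vote counts",
    "Election debate tonight: tax and jobs",
]

print(top_words(sample_tweets, n=2))  # [('election', 3), ('vote', 2)]
```

The resulting (word, count) pairs could then be passed to a plotting library to draw the
histogram or the word cloud.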

1.2 Motivation
Politics, in general, is the platform by which people create, maintain, and change the laws
that govern their lives. As a result, conflict and collaboration are inextricably connected
in politics. The presence of conflicting views, competing expectations, competing needs,
and competing interests is expected to result in conflict over the rules under which
people live.

1.3 Problem Statement and Objective


Because conflict and collaboration are inextricably connected in politics, political issues
are among the most important topics of discussion for Twitter users. If scientists or
governments want to monitor the trend of opinion on a political issue at a certain time,
they need a computer program that extracts the dataset from Twitter. The statistical and
analytical results can then serve as a useful report for monitoring purposes.

1.4 Organization of the Report


This report is divided into three parts:
1. Introduction, which is discussed above.
2. Literature Survey, which gives an idea of the existing systems, their limitations, and
the mini project objectives.
3. Implementation, which gives details of the hardware and software required for this
project, followed by the conclusion and the future work identified during development of
the project.
2. Literature Survey

2.1 Survey of Existing System


1. Twitter as Valuable and Significant Dataset Source
Twitter is a social media platform which enables its users to "tweet" (the term used for
Twitter postings). Several studies (Doshi et al., 2017; El Rahman et al., 2019) have
exploited Twitter datasets, clustering them for different purposes. The use of Twitter
datasets ranges from human studies to market research.
A study presented in Sechelea et al. (2015) aimed to create an algorithm able to
determine the main topics of interest in a dataset of tweets. The study also presented a
visualization method for tracking Twitter activity by geographical location.

2. Hierarchical Clustering and K-Means as Effective Clustering Methods


Hierarchical clustering is an unsupervised machine learning method that organises the data
in a tree structure; cutting the tree at a chosen level also yields a flat clustering. There
are many studies on increasing the effectiveness of hierarchical clustering.
The other clustering method is k-means, which is also part of machine learning. Owing to its
simplicity and straightforwardness, k-means has become one of the most commonly used
clustering algorithms for big data. It can be used on large datasets, and the data being
clustered can be distributed across several machines.
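
The two methods can be compared on a small set of texts as sketched below. This is only an
illustration under stated assumptions: the sample tweets are made up, TF-IDF is assumed as
the text representation, and Ward linkage is assumed for the hierarchical method; the report
does not specify these choices.

```python
from sklearn.cluster import AgglomerativeClustering, KMeans
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics import adjusted_rand_score

# Made-up tweets: the first three are about an election, the last three about the budget.
tweets = [
    "election vote counting starts today",
    "cast your vote in the election",
    "election results and vote share announced",
    "new tax rules in the budget",
    "budget raises fuel tax again",
    "tax cuts promised in next budget",
]

# Turn each tweet into a TF-IDF vector (assumed representation).
vectorizer = TfidfVectorizer(stop_words="english")
X = vectorizer.fit_transform(tweets)

# Flat clustering with k-means; k is chosen manually here.
kmeans_labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(X)

# Hierarchical (agglomerative) clustering needs a dense array.
hier_labels = AgglomerativeClustering(n_clusters=2, linkage="ward").fit_predict(X.toarray())

# An adjusted Rand index of 1.0 means both methods produced the same grouping.
agreement = adjusted_rand_score(kmeans_labels, hier_labels)
print("k-means:     ", kmeans_labels)
print("hierarchical:", hier_labels)
print("agreement:   ", agreement)
```

On real tweets the two methods will usually not agree this cleanly, which is what makes the
comparison informative.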

2.2 Limitation Existing System or Research Gap


Several clustering algorithms exist, but they have limitations.
One problem with k-means is that it implicitly assumes roughly spherical clusters (circular
in two dimensions). Each point is assigned to its nearest centroid by Euclidean distance, so
elongated or otherwise non-circular clusters are often not clustered correctly.
Other known limitations of k-means are:
1. Choosing k manually.
2. Being dependent on initial values.
3. Clustering data of varying sizes and density.
4. Clustering outliers.
5. Scaling with the number of dimensions.
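
The dependence on initial values (item 2) can be illustrated with a small pure-Python
sketch. The `kmeans` function, the four points and the two sets of starting centroids are
assumptions chosen for the example; this is a bare-bones version of Lloyd's algorithm, not
the scikit-learn implementation.

```python
def squared_distance(p, q):
    """Squared Euclidean distance between two points."""
    return sum((a - b) ** 2 for a, b in zip(p, q))


def kmeans(points, centroids, iterations=10):
    """Run a fixed number of Lloyd iterations from the given starting centroids."""
    for _ in range(iterations):
        # Assignment step: each point joins the cluster of its nearest centroid.
        clusters = [[] for _ in centroids]
        for p in points:
            distances = [squared_distance(p, c) for c in centroids]
            clusters[distances.index(min(distances))].append(p)
        # Update step: move each centroid to the mean of its cluster.
        centroids = [
            tuple(sum(axis) / len(cluster) for axis in zip(*cluster)) if cluster else c
            for cluster, c in zip(clusters, centroids)
        ]
    # Inertia: total squared distance of every point to its nearest centroid.
    inertia = sum(min(squared_distance(p, c) for c in centroids) for p in points)
    return centroids, inertia


# Four corners of a wide rectangle: the natural split is left pair vs right pair.
points = [(0, 0), (0, 1), (4, 0), (4, 1)]

_, good_inertia = kmeans(points, centroids=[(0, 0), (4, 0)])
_, bad_inertia = kmeans(points, centroids=[(2, 0), (2, 1)])

print(good_inertia)  # 1.0  (left/right split)
print(bad_inertia)   # 16.0 (stuck in a top/bottom split)
```

Both runs converge, but the second one settles in a much worse local minimum. This is why
practical implementations typically restart k-means from several initialisations and keep
the best result.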

2.3 Mini Project Contribution/ Objectives


Some of the algorithms have limitations, and the cleaning of the text needs to be improved
so that insignificant words do not distort the clustering analysis. This project therefore
aims to address those limitations and improve on them.
Social media can quickly become a breeding ground for misinformation. For example, during
the COVID-19 pandemic, almost 50% of US adults saw a lot or some fake news about the
crisis, and almost 70% say that fake news causes a great deal of confusion.
To counteract this, governments must invest in social media listening to help them identify
inaccuracies and respond accordingly, especially as citizens will be looking to government
social media accounts to provide them with accurate and objective information.
Such analysis can help estimate how many false tweets have been posted.
3. Implementation

3.1 Details of Hardware & Software


The design of “Analysing Social Media Reaction on Political Issues using
Machine Learning” is divided into two parts: hardware and software. The hardware part is
completed first, because it provides the platform on which the software runs; only then
can the software part be developed. Some libraries need to be installed for the application
to work effectively. We install scikit-learn and its dependencies through Python.

Hardware Requirement:
Processor: Intel i3 or more
Ram: 8GB

Libraries Development:
Scikit-learn (sklearn) is one of the most widely used and robust libraries for machine
learning in Python. It provides a selection of efficient tools for machine learning and
statistical modelling, including classification, regression, clustering and dimensionality
reduction, via a consistent interface in Python. This library, which is largely written in
Python, is built upon NumPy, SciPy and Matplotlib.

Programming Language:
Python is a mature and very popular language, created by Guido van Rossum and first released
in 1991. It is open source and is used for web and Internet development (with frameworks
such as Django, Flask, etc.), scientific and numeric computing (with the help of libraries
such as NumPy, SciPy, etc.), software development, and much more.

Operating System Support:


All of the new developments and algorithms in scikit-learn run on the following desktop
operating systems: Windows, Linux, macOS, FreeBSD, NetBSD, OpenBSD.

NumPy:
NumPy is a package that defines a multi-dimensional array object and associated fast math
functions that operate on it. It also provides simple routines for linear algebra and Fourier
transforms, and sophisticated random-number generation. NumPy replaces both Numeric
and Numarray.

Pandas:
Pandas is a Python package providing fast, flexible, and expressive data structures
designed to make working with “relational” or “labelled” data both easy and intuitive. It
aims to be the fundamental high-level building block for doing practical, real-world data
analysis in Python. Additionally, it has the broader goal of becoming the most powerful
and flexible open-source data analysis/manipulation tool available in any language.
Matplotlib:
Matplotlib is a visualization library in Python for 2D plots of arrays. It is a
multi-platform data visualization library built on NumPy arrays and designed to work
with the broader SciPy stack. It was introduced by John Hunter in 2002.

Flask:
Flask is a web framework written in Python that provides developers with tools to build
web applications. It is based on the Werkzeug WSGI toolkit and the Jinja templating engine.
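
A Flask application that serves analysis results could look roughly like the sketch below.
The `/top-words` route, the JSON payload shape and the simple whitespace tokenisation are
assumptions made for illustration; the report does not describe the project's actual
endpoints.

```python
from collections import Counter

from flask import Flask, jsonify, request

app = Flask(__name__)


@app.route("/top-words", methods=["POST"])
def top_words():
    """Return the five most frequent words in the posted tweets (hypothetical endpoint)."""
    payload = request.get_json(silent=True) or {}
    tweets = payload.get("tweets", [])
    # Naive tokenisation: lowercase and split on whitespace.
    counts = Counter(word for tweet in tweets for word in tweet.lower().split())
    return jsonify(dict(counts.most_common(5)))


if __name__ == "__main__":
    # Local development server only; a hosting platform would start the app its own way.
    app.run()
```

A client would POST a JSON body such as `{"tweets": ["vote vote now"]}` and receive the
word counts as a JSON object.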

3.2 Cloud Platform Used for Deployment

Render:
Render is a unified cloud to build and run all your apps and websites with free TLS
certificates, a global CDN, DDoS protection, private networks, and auto deploys from
Git.

Steps for Deployment:


1. Create a GitHub Repository
Create a new repository on GitHub.com.
• Open your terminal or command prompt.
• Change the current working directory to your local project using
the cd command.
• Use the git init command to initialize the local directory as a Git repository.
• Add the files in your new local repository using the git add . command.
• Commit the files that you’ve staged in your local repository using the git
commit -m "Initial commit" command.
• At the top of your repository on GitHub.com’s Quick Setup page, click to
copy the remote repository URL.
• In your terminal or command prompt, add the URL for the remote repository
where your local repository will be pushed using git remote add origin
<remote-repository-URL>.
• Verify that a remote named ‘origin’ has been added by running git remote -v.
• Push your local commits to GitHub by running git push origin main.

2. On Render
• Create a new Web Service.
• Connect it to the GitHub repository.
3. Environment Setup

4. Deployed Status

3.3 Conclusion and Future Work


Social media analysis is an emerging field in which there are more problems than ready
solutions. In this report we have presented a proposed system for the acquisition, analysis
and visualization of Twitter data related to political issues. Twitter messages are harvested
and stored in a database, and the data is processed to eliminate trivial terms.
Removing noise words from the Twitter data gives a good result when identifying the most
common words used by Twitter users. We presented a clustering algorithm capable of
identifying hot topics of interest in a tweet dataset. There are several directions for
future work. The first is to improve the clustering result so that insignificant words do
not affect the clustering analysis.
4. Annexure
4.1 Published Papers/ Proof of concept
[1] Analyzing the Behavior of Youth to Sociality
Using Social Media Mining
Ardra, Blessy Merin Varughese, Merline Susan Joseph, Preethi Elsa Thomas, Sherly K.
K.,
Department of Information Technology,
Rajagiri School of Engineering and Technology
Kochi, India
International Conference on Intelligent Computing and Control System (ICICCS 2017)
Published by: ©2017 IEEE

[2] Clustering and Sentiment Analysis on Twitter Data


Shreya Ahuja, Gaurav Dubey
Department of Computer Science and Engineering,
Amity University,
Noida, India
2nd International Conference on Telecommunication and Networks (TEL-NET 2017)
Published by: ©2017 IEEE

[3] Twitter Topic Progress Visualization using Micro-clustering


Hashimoto, T., Kusaba, A., Shepard, D., Kuboyama, T., Shin, K. and Uno, T.
9th International Conference on Pattern Recognition Applications and Methods (ICPRAM
2020)
Published by: ©2022 by SCITEPRESS – Science and Technology Publications.

Marks:

R1 R2 R3 Total
Sign
(3 Marks) (5 Marks) (7 Marks) (15 Marks)

R1 R2 Total
Sign
(5 Marks) (5 Marks) (10 Marks)
