
A Mini Project Report on

AUTOMATED REVIEW
CLASSIFICATION USING ML
Submitted to

Jawaharlal Nehru Technological University, Hyderabad

in partial fulfillment of requirements for the award of the degree of

BACHELOR OF TECHNOLOGY

in

COMPUTER SCIENCE AND ENGINEERING

By

Jogu Kranthi Kumar (18BD1A054W)
Gurram Nithish Kumar (18BD1A054Q)
Gattu Prathyusha (18BD1A054L)
Madishetty Sindhuja (18BD1A0554)

Under the guidance of

Mrs. D. SRIVENI
Assistant Professor
Department of CSE

Department of Computer Science and Engineering


KESHAV MEMORIAL INSTITUTE OF TECHNOLOGY
Approved by AICTE, Affiliated to JNTUH
3-5-1206, Narayanaguda, Hyderabad – 500029
2021-2022
KESHAV MEMORIAL INSTITUTE OF TECHNOLOGY
(Accredited by NBA & NAAC, Approved By A.I.C.T.E., Reg by Govt of Telangana
State & Affiliated to JNTU, Hyderabad)

DEPARTMENT OF COMPUTER SCIENCE AND ENGINEERING

CERTIFICATE

This is to certify that the project entitled AUTOMATED REVIEW CLASSIFICATION USING ML

being submitted by

J. Kranthi Kumar (18BD1A054W)

G. Nithish Kumar (18BD1A054Q)

G. Prathyusha (18BD1A054L)

M. Sindhuja (18BD1A0554)

in partial fulfilment of the requirements for the award of the degree of Bachelor of Technology in Computer Science and Engineering, affiliated to Jawaharlal Nehru Technological University, Hyderabad, during the academic year 2021-22.

Internal Guide Head of the Department

(Mrs. D. Sriveni) (Dr. S. Padmaja)

Submitted for Viva Voce Examination held on

External Examiner

Unit of Keshav Memorial Educational Society


#: 3-5-1026, Narayanaguda, Hyderabad - 500029.
040-3261407 | www.kmit.in | e-mail: [email protected]

Vision of KMIT
Producing quality graduates trained in the latest technologies and related tools, and striving to make India a world leader in software and hardware products and services. To achieve academic excellence by imparting in-depth knowledge to the students, facilitating research activities and catering to the fast-growing and ever-changing industrial demands and societal needs.

Mission of KMIT
• To provide a learning environment that inculcates problem-solving skills, professional and ethical responsibilities, and lifelong learning through multi-modal platforms, and prepares students to become successful professionals.
• To establish industry-institute interaction to make students ready for the industry.
• To provide exposure to students on the latest hardware and software tools.
• To promote research-based projects/activities in the emerging areas of technology convergence.
• To encourage and enable students to not merely seek jobs from the industry but also to create new enterprises.
• To induce a spirit of nationalism which will enable the students to develop an understanding of India's challenges and to encourage them to develop effective solutions.
• To support the faculty to accelerate their learning curve to deliver excellent service to students.
Vision & Mission of CSE

Vision of the CSE
To be among the region's premier teaching and research Computer Science and Engineering departments, producing globally competent and socially responsible graduates in the most conducive academic environment.

Mission of the CSE
• To provide faculty with state-of-the-art facilities for continuous professional development and research, both in foundational aspects and in areas of relevance to emerging computing trends.
• To impart skills that enable students to develop technical solutions for societal needs and inculcate entrepreneurial talents.
• To inculcate in students an ability to pursue the advancement of knowledge in various specializations of Computer Science and Engineering and make them industry-ready.
• To engage in collaborative research with academia and industry and generate adequate resources for research activities for seamless transfer of knowledge, resulting in sponsored projects and consultancy.
• To cultivate responsibility through sharing of knowledge and innovative computing solutions that benefit society at large.
• To collaborate with academia, industry and the community to set high standards in academic excellence and in fulfilling societal responsibilities.
PROGRAM OUTCOMES (POs) 

1. Engineering Knowledge: Apply the knowledge of mathematics, science, engineering fundamentals and
an engineering specialization to the solution of complex engineering problems.

2. Problem analysis: Identify, formulate, review research literature, and analyse complex engineering problems, reaching substantiated conclusions using first principles of mathematics, natural sciences, and engineering sciences.
3. Design/development of solutions: Design solutions for complex engineering problems and design system components or processes that meet the specified needs with appropriate consideration for public health and safety, and cultural, societal, and environmental considerations.
4. Conduct investigations of complex problems: Use research-based knowledge and research methods
including design of experiments, analysis and interpretation of data, and synthesis of the information to
provide valid conclusions. 
5. Modern tool usage: Create, select, and apply appropriate techniques, resources, and modern engineering and IT tools, including prediction and modelling, to complex engineering activities with an understanding of the limitations.
6. The engineer and society: Apply reasoning informed by contextual knowledge to assess societal, health, safety, legal and cultural issues and the consequent responsibilities relevant to professional engineering practice.
7. Environment and sustainability: Understand the impact of the professional engineering solutions in
societal and environmental contexts and demonstrate the knowledge of, and need for sustainable
development. 
8. Ethics: Apply ethical principles and commit to professional ethics and responsibilities and norms of the
engineering practice. 
9. Individual and team work: Function effectively as an individual, and as a member or leader in diverse
teams and in multidisciplinary settings. 
10. Communication: Communicate effectively on complex engineering activities with the engineering community and with society at large, such as being able to comprehend and write effective reports and design documentation, make effective presentations, and give and receive clear instructions.
11. Project management and finance: Demonstrate knowledge and understanding of the engineering and
management principles and apply these to one's own work, as a member and leader in a team, to manage
projects and in multidisciplinary environments. 
12. Life-long learning: Recognize the need for, and have the preparation and ability to engage in
independent and life-long learning in the broadest context of technological change.

PROGRAM SPECIFIC OUTCOMES (PSOs) 

PSO1: An ability to analyse the common business functions to design and develop appropriate Information
Technology solutions for social upliftment. 
PSO2: Shall have expertise on the evolving technologies like Mobile Apps, CRM, ERP, Big Data, etc. 

PROGRAM EDUCATIONAL OBJECTIVES (PEOs) 

PEO1: Graduates will have successful careers in computer related engineering fields or will be able to
successfully pursue advanced higher education degrees. 
PEO2: Graduates will try and provide solutions to challenging problems in their profession by applying
computer engineering principles. 
PEO3: Graduates will engage in life-long learning and professional development by rapidly adapting to changing work environments.
PEO4: Graduates will communicate effectively, work collaboratively and exhibit high levels of
professionalism and ethical responsibility.
PROJECT OUTCOMES

P1: To pick current-day problems and simplify the process using modern tools and technologies.
P2: One learns which data pre-processing technique to select when several techniques are available.
P3: One can learn how to choose a particular algorithm to meet the project requirements.
P4: Among all the metrics used to judge a model's performance, one can learn which metric to use.

LOW - 1
MEDIUM - 2
HIGH - 3

PROJECT OUTCOMES MAPPING PROGRAM OUTCOMES

PO   PO1  PO2  PO3  PO4  PO5  PO6  PO7  PO8  PO9  PO10  PO11  PO12
P1   1 3 3 1 2 1 3 1
P2   3 2 1 2 1 2 2
P3   3 3 2 1 1 2 2
P4   2 1 3 1 2

PROJECT OUTCOMES MAPPING PROGRAM SPECIFIC OUTCOMES

PSO   PSO1  PSO2
P1   3 1
P2   2
P3   1
P4   1
PROJECT OUTCOMES MAPPING PROGRAM EDUCATIONAL OBJECTIVES

PEO   PEO1  PEO2  PEO3  PEO4
P1   3 3 1 2
P2   3 3 1
P3   1 3 2
P4   2

Mini Project Title: AUTOMATED REVIEW CLASSIFICATION USING ML


Internal Guide Name: Mrs. D. SRIVENI

Internal Guide Signature:

Date: 15/12/2021 (CO-PO submission date)

Team Number: 14

Team details:

JOGU KRANTHI KUMAR (18BD1A054W)

GURRAM NITHISH KUMAR (18BD1A054Q)

GATTU PRATHYUSHA (18BD1A054L)

MADISHETTY SINDHUJA (18BD1A0554)


DECLARATION
We hereby declare that the project report entitled "AUTOMATED REVIEW CLASSIFICATION USING ML" has been carried out in partial fulfillment of the requirements for the award of the degree of Bachelor of Technology in Computer Science and Engineering, affiliated to Jawaharlal Nehru Technological University, Hyderabad. This project has not been submitted anywhere else.

JOGU KRANTHI KUMAR (18BD1A054W)

GURRAM NITHISH KUMAR (18BD1A054Q)

GATTU PRATHYUSHA (18BD1A054L)

MADISHETTY SINDHUJA (18BD1A0554)


ACKNOWLEDGMENT
We take this opportunity to thank all the people who have rendered their full support to our project work.

We render our thanks to Dr. Maheshwar Dutta, B.E., M.Tech., Ph.D., Principal, who encouraged us to do the project.

We are grateful to Mr. Neil Gogte, Director, for facilitating all the amenities required for carrying out this project.

We express our sincere gratitude to Mr. S. Nitin, Director, and Mrs. Anuradha, Dean Academics, for providing an excellent environment in the college.

We are also thankful to Dr. S. Padmaja, Head of the Department, for providing us with both time and amenities to make this project a success within the given schedule.

We are also thankful to our guide, Mrs. D. Sriveni, for her valuable guidance and encouragement given to us throughout the project work.

We would like to thank the entire CSE Department faculty, who helped us directly and indirectly in the completion of the project. We sincerely thank our friends and family for their constant motivation during the project work.

JOGU KRANTHI KUMAR (18BD1A054W)

GURRAM NITHISH KUMAR (18BD1A054Q)

GATTU PRATHYUSHA (18BD1A054L)

MADISHETTY SINDHUJA (18BD1A0554)


CONTENT

ABSTRACT
LIST OF FIGURES
CHAPTERS
1. INTRODUCTION
1.1. Purpose of the Project
1.2. Problem with the Existing System
1.3. Proposed System
1.4. Scope of the Project
1.5. Architecture Diagram
2. SOFTWARE REQUIREMENTS SPECIFICATIONS
2.1. What is SRS
2.2. Role of SRS
2.3. Requirements Specification Document
2.4. Functional Requirements
2.5. Non-Functional Requirements
2.6. Software Requirements
2.7. Hardware Requirements
3. LITERATURE SURVEY
3.1. Technologies Used
4. SYSTEM DESIGN
4.1. Introduction to UML
4.2. UML Diagrams
4.2.1. Use Case Diagram
4.2.2. Sequence Diagram
4.2.3. Class Diagram
5. IMPLEMENTATION
5.1. Pseudo Code
5.2. Code Snippets
6. TESTING
6.1. Introduction to Testing
6.2. Software Test Lifecycle
6.3. Test Cases
7. SCREENSHOTS
7.1. Mounting Data and Loading Required Libraries
7.2. Loading Text from Training and Test Data
7.3. Vectorization and Fitting Data
7.4. Plotting Confusion Matrix and Testing
7.5. Testing Results
FUTURE ENHANCEMENTS
CONCLUSION
REFERENCES
ABSTRACT

Social media has given ample opportunity to consumers in terms of gauging the quality of products by reading and examining the reviews posted by users of online shopping platforms. Moreover, online platforms such as Amazon.com provide an option for users to label a review as 'Helpful' if they find the content of the review valuable. This helps both consumers and manufacturers evaluate general preferences in an efficient manner by focusing mainly on the selected helpful reviews. However, recently posted reviews get comparatively fewer votes, and the higher-voted reviews get onto users' radars first. This study deals with these issues by building an automated text classification system to predict the helpfulness of online reviews irrespective of the time they are posted. The study is conducted on data collected from Amazon.com consisting of reviews on fine food. The focus of previous research has mostly remained on finding a correlation between the review helpfulness measure and review content-based features. In addition to finding significant content-based features, this study uses three different approaches to predict review helpfulness: vectorized features, review- and summary-centric features, and word embedding-based features. Moreover, the conventional classifiers used for text classification, such as Support Vector Machine, Logistic Regression, and Multinomial Naive Bayes, are compared with a decision tree-based ensemble classifier, namely Extremely Randomized Trees. It is found that the Extremely Randomized Trees classifier outperforms the conventional classifiers except in the case of vectorized features with unigrams and bigrams. Among the features, vectorized features perform much better compared to other features. This study also finds that content-based features such as review polarity, review subjectivity, review character and word count, review average word length, and summary character count are significant predictors of review helpfulness.
LIST OF FIGURES

Fig 1.4   Proposed System
Fig 1.5   Architecture Diagram
Fig 4.2.1 Use Case Diagram
Fig 4.2.2 Sequence Diagram
Fig 4.2.3 Class Diagram

CHAPTER -1

1. INTRODUCTION


1.1 Purpose of the Project

The purpose of this project is to provide review classification for any product. Reviews are important because they help boost customer loyalty towards a brand: a person who takes the time to leave a positive review on Yelp about a particular brand or product is likely to come back for more business should the need arise. There are a number of products available in the market, and customers need to read the positive and negative reviews for each product to choose the best one. The proposed system predicts whether reviews are positive or negative and thereby helps in gauging how good a product is.

1.2 Problems with Existing System

The focus of previous research has mostly remained on finding a correlation between the review helpfulness measure and review content-based features. In addition to finding significant content-based features, this study uses three different approaches to predict review helpfulness: vectorized features, review- and summary-centric features, and word embedding-based features.

1.3 Proposed System

A sentiment model predicts whether the opinion given in a piece of text is positive, negative, or neutral. Text classification is the process of categorizing text into groups. Using NLP, text classification can automatically analyze text and then assign a set of predefined tags or categories based on its context. NLP is used for sentiment analysis, topic detection, and language detection. In a bag-of-words model, a vector represents the frequency of words from a predefined dictionary (word list).

Of all data, text is the most unstructured form, which means there is a lot of cleaning to do. Pre-processing steps help reduce noise by mapping high-dimensional features to a lower-dimensional space so that as much accurate information as possible can be obtained from the text. Data classification is the process of classifying or categorizing raw texts into predefined groups; in other words, it is the process of labeling unstructured texts with their relevant tags, predicted from a set of predefined categories.

1.4 Scope of the Project

The scope of text classification using sentiment analysis mainly covers two categories. In the bag-of-words model, all the sentences in our dataset are tokenized to form a bag of words that denotes our vocabulary, as illustrated in the sketch below.
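A minimal bag-of-words sketch using scikit-learn's CountVectorizer (the sample reviews and variable names are illustrative, not taken from the project dataset, and get_feature_names_out assumes scikit-learn 1.0 or newer):

# Bag-of-words sketch: tokenize a few sample reviews into a vocabulary
# and represent each review as a vector of word counts.
from sklearn.feature_extraction.text import CountVectorizer

sample_reviews = [  # hypothetical reviews, for illustration only
    "great product, works great",
    "terrible quality, would not buy again",
]

vectorizer = CountVectorizer()
counts = vectorizer.fit_transform(sample_reviews)

print(vectorizer.get_feature_names_out())  # the learned vocabulary (the bag of words)
print(counts.toarray())                    # each row is one review's word-count vector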

1.5 Architecture Diagram


The data or review set with which the model is trained is converted to vectors, and the machine learning algorithm processes them. The algorithm is trained for maximum accuracy. A predictive model is generated to classify reviews or data on its own. When the vectorised documents of the test data are given, this predictive model should classify them as positive or negative, as in the pipeline sketch below.
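As a rough illustration of this flow (not the project's actual code, which appears in Chapter 5), the following sketch vectorises a couple of toy reviews, trains a classifier, and uses the resulting model to label an unseen review; all data and names here are hypothetical:

# Sketch of the architecture flow: vectorise reviews, train a classifier,
# then use the fitted pipeline as a predictive model for unseen reviews.
from sklearn.pipeline import make_pipeline
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression

train_reviews = ["loved it, excellent taste", "awful, arrived stale"]  # toy data
train_labels = ["Positive", "Negative"]

model = make_pipeline(TfidfVectorizer(), LogisticRegression())
model.fit(train_reviews, train_labels)  # training phase

print(model.predict(["excellent taste, loved it"]))  # should print ['Positive'] for this toy example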


CHAPTER -2


2. SOFTWARE REQUIREMENTS SPECIFICATIONS

2.1 What is SRS?


Software Requirement Specification (SRS) is the starting point of the software development activity. As systems grew more complex, it became evident that the goals of the entire system could not be easily comprehended; hence the need for the requirements phase arose. A software project is initiated by the client's needs. The SRS is the means of translating the ideas in the minds of the clients (the input) into a formal document (the output of the requirements phase).

The SRS phase consists of two basic activities:

Problem/Requirement Analysis:
This is the more nebulous of the two activities and deals with understanding the problem, the goals and the constraints.

Requirement Specification:
Here, the focus is on specifying what has been found during analysis; issues such as representation, specification languages and tools, and checking of the specifications are addressed during this activity.

The requirements phase terminates with the production of the validated SRS document. Producing the SRS document is the basic goal of this phase.

2.2 Role of SRS


The purpose of the Software Requirement Specification is to reduce the communication gap between the clients and the developers. The Software Requirement Specification is the medium through which the client and user needs are accurately specified. It forms the basis of software development. A good SRS should satisfy all the parties involved in the system.

2.3 Requirements Specification Document


A Software Requirements Specification (SRS) is a document that describes the nature of a project, software or application. In simple words, an SRS document is a manual of a project, provided it is prepared before you kick-start the project/application. This document is also known as an SRS report or software document. A software document is primarily prepared for a project, software or any kind of application.

There is a set of guidelines to be followed while preparing the software requirement specification document. This includes the purpose, scope, functional and non-functional requirements, and software and hardware requirements of the project. In addition, it also contains information about the environmental conditions required, safety and security requirements, software quality attributes of the project, etc.

The purpose of the SRS (Software Requirement Specification) document is to describe the external behaviour of the application or software developed. It defines the operations, performance, interfaces and quality assurance requirements of the application or software. The complete software requirements for the system are captured by the SRS. This section introduces the requirement specification document for Automated Review Classification using ML, which enlists the functional as well as non-functional requirements.

2.4 Functional Requirements


For documenting the functional requirements, the set of functionalities supported by the system is to be specified. A function can be specified by identifying the state at which data is input to the system, its input data domain, the output domain, and the type of processing to be carried out on the input data to obtain the output data. Functional requirements define specific behaviours or functions of the application. Following are the functional requirements:

FR1) Log in to Google Colab and mount the dataset through Google Drive.

FR2) Give input: enter the dataset that is to be classified.

FR3) Execute the code by pressing the play button.

2.5 Non-Functional Requirements


A non-functional requirement is a requirement that specifies criteria that can be used to judge the operation of a system, rather than specific behaviours. In particular, these are the constraints the system must work within. Following are the non-functional requirements:

NFR 1) Performance Requirements:
● The application must have a minimum processor speed, so there are some restrictions on what type of computer can use it. However, this requirement will be kept as small as possible to enable a broad range of clients to use the application.
● From studies it can be seen that speed is a common issue while dealing with large sets of data/reviews.
● The system must also aim to use minimum hard disk space yet keep the quality of the available facility as high as possible.

NFR 2) Safety Requirements:

● The biggest risk with this is security and fraud.


● If the dataset is biased, the algorithm will not produce optimal results.

NFR3) Security Requirements:

● The system must automatically log out all customers after a period of inactivity.
● The system should not leave any cookies on the customer’s computer containing the
user’s password.
● Information of users such as IP addresses will be kept private so that third parties
cannot gain access to this personal information in order to keep within the Data
Protection Act.

NFR 4) Maintainability:
● We can save the information in our Google Drive and we can use the tools that are available on Google Colab.

NFR 5) Portability:
● The end-user part is fully portable, and any system using any web browser should be able to use the features of the system, including any hardware platform that is available now or will be available in the future.
● An end user can use this system on any OS.
● The system shall run on PCs and laptops.

NFR 6) Performance:
● The performance of the developed application can be evaluated using the following methods: measuring enables you to identify how the performance of your application stands in relation to your defined performance goals and helps you to identify the bottlenecks that affect your application's performance.
● It helps you identify whether your application is moving toward or away from your performance goals.
● Defining what you will measure, that is, your metrics, and defining the objectives for each metric is a critical part of your testing plan.
● Performance objectives include the following: response time, latency, throughput and resource utilization.

2.6 Software Requirements


Operating System : Windows 10 or macOS
Platform : Google Colab
Programming Language : Python

2.7 Hardware Requirements


Processor : Pentium 4 and above.
Hard Disk : 300 GB or above.
RAM : 2 GB or above.
Internet : 4 Mbps or above (Wireless).


CHAPTER -3


3 LITERATURE SURVEY

In the paper [1] proposed by Vrusha U. Suryawanshi et al., the methodology explained proposes that the text document being considered for classification should be assigned to some pre-defined classes before classification. The best and most efficient method to classify any particular document is the k-nearest neighbour method. The classification is done by extracting a set of the most important keywords out of the text document and then analyzing them. There are many algorithms for classifying documents; the best-suited algorithm in this particular proposed method is the k-nearest neighbour method. The proposed method uses Virtual Private Networks for security reasons. The provisions of this system enable the user to store large amounts of information so that it can be used for training the classifier and classifying the documents as well. The system is completely automated and eliminates the need for human interaction. The security systems keep the process protected. This methodology eliminates the overhead processing.

In the paper [2] proposed by Aiman Moldagulova and Rosnafisah Bte. Sulaiman, the proposed method uses the k-nearest neighbour algorithm to classify documents. The value of k is a very crucial part of this system: it determines against which categories the selected document is tested. K-nearest neighbours is easy to implement and more efficient than other classification algorithms. The two criteria kept in mind while determining the value of k are the validation error rate and the training error rate. The results show the best efficiency when the value of k is between 1 and 50, as sketched below.
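A minimal sketch of how such a k could be chosen with scikit-learn, comparing validation accuracy over a few candidate values on a toy labelled corpus (all data and names are illustrative and not taken from the cited paper):

# Pick k for a k-nearest-neighbour text classifier by comparing
# validation accuracy over a small range of candidate values.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.neighbors import KNeighborsClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score

docs = ["good product", "bad product", "great quality", "poor quality",
        "excellent item", "terrible item", "loved it", "hated it"]  # toy corpus
labels = ["pos", "neg", "pos", "neg", "pos", "neg", "pos", "neg"]

X = TfidfVectorizer().fit_transform(docs)
X_tr, X_val, y_tr, y_val = train_test_split(X, labels, test_size=0.25, random_state=0)

for k in (1, 3, 5):  # candidate values of k
    knn = KNeighborsClassifier(n_neighbors=k).fit(X_tr, y_tr)
    print(k, accuracy_score(y_val, knn.predict(X_val)))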

In the paper [3] proposed by Ari Aulia Hakim et al., the proposed system employs term frequency and inverse document frequency to prioritize the keywords by weighting them. The keywords are weighted using term frequency (TF) in order to find the number of times a keyword has occurred in the document. Then inverse document frequency (IDF) is used to find the number of documents in which the keyword is found. This method is found to be very efficient, but there seems to be a small defect: since the method considers all the keywords and weights them, the keyword with the highest frequency might not relate to the corresponding category. Sometimes this system may give a wrong output. A small worked sketch of TF-IDF weighting follows.
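A minimal sketch of TF-IDF weighting with scikit-learn on a toy corpus (the documents are illustrative); a term that occurs in every document, such as "the", receives a lower weight than rarer, more discriminative terms:

# Compute TF-IDF weights for a toy corpus; terms occurring in every
# document get a lower weight than terms confined to one document.
import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer

corpus = ["the food was great", "the delivery was slow", "the packaging was great"]

tfidf = TfidfVectorizer()
weights = tfidf.fit_transform(corpus)

# Per-document TF-IDF weight of each term, rounded for readability.
print(pd.DataFrame(weights.toarray(), columns=tfidf.get_feature_names_out()).round(2))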

In the paper [4] proposed by Seyyed Mohammad Hossein Dadgar et al., the proposed methodology suggests three parts to the system:
1. Pre-processing
2. Feature extraction using TF-IDF
3. Classification based on Support Vector Machine (SVM)

SVM is a very good classifier. It classifies the documents based on the structural risk minimization principle and the creation of hyperplanes. The main disadvantage of this system is that the SVM uses all the keywords regardless of whether they are important or not. Sometimes this might give a wrong output. A sketch of this TF-IDF plus SVM pipeline follows.

In the paper [5] proposed by Akshita Bhandari et al., the proposed idea suggests an improvement to the Apriori algorithm. This algorithm is used to extract frequent itemsets from the database and then derive association rules for discovering knowledge. There are two parameters needed for this algorithm, namely minimum support and minimum confidence. Minimum support is used to find frequent itemsets and minimum confidence is used to find the association rules. In this system, the size of the database is drastically reduced; due to this, the Apriori algorithm gives promising results.
3.1. Technologies Used :

PYTHON

Python is the most widely used programming language for AI and ML. The reason it is the leader here is its simple syntax and versatility, as well as its open-source availability. It includes many built-in libraries and packages for AI and ML. Examples:
• Scikit-Learn
• TensorFlow
• Keras

Key Features:
• Python is easy to learn.
• There is no need to recompile the source code; make a modification and the results can be seen immediately.
• It is independent of the operating system.

PANDAS

Pandas is an open-source library that provides high-performance data manipulation in Python. A lot of processing is required for data analysis, such as cleaning, restructuring, merging, etc. There are different tools for this purpose, but Pandas is mostly preferred due to its simplicity and speed. A short usage sketch follows the feature list below.

Key Features:
• It works on DataFrame objects, which are fast and efficient.
• Used for dataset reshaping and pivoting.
• Alignment and integration of data is done in case of any missing data.
• Provides time-series functionality.
• Datasets can be handled in different formats, such as matrix, tabular, heterogeneous and time series.
• It can be integrated with other libraries such as SciPy and scikit-learn.
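A minimal sketch of how Pandas is typically used in this kind of project to hold labelled reviews (the column names and rows below are illustrative, not the project dataset):

# Hold labelled reviews in a DataFrame and inspect the label balance.
import pandas as pd

df = pd.DataFrame({
    "label": ["Positive", "Negative", "Positive"],  # illustrative labels
    "reviews": ["tasty and fresh", "stale and bland", "would buy again"],
})

print(df.head())                   # first rows of the dataset
print(df["label"].value_counts())  # number of reviews per class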
NUMPY

NumPy means Numeric Python and is a Python library used to work with single- and multi-dimensional arrays. It has a very powerful data structure, so multi-dimensional arrays and matrices are computed in an optimal way. It can handle vast amounts of data in a convenient and efficient way.

Advantages:
• Used to compute arrays.
• Multidimensional arrays are implemented efficiently.
• It has the ability to perform scientific calculations.
• Used to reshape the data stored in multidimensional arrays.
• Contains built-in functions for linear algebra and random number generation.
MATPLOTLIB

Matplotlib is a Python package used for data visualization. It can be used in Jupyter notebooks and web applications. It has a procedural interface named Pylab, which is designed to resemble MATLAB, a proprietary programming language developed by MathWorks.
SCIKIT LEARN

In Python, Sklearn (scikit-learn) is a robust library for machine learning. It provides a selection of efficient tools for statistical modelling, including classification, regression, clustering and dimensionality reduction, via a consistent interface. This library is built upon NumPy, SciPy and Matplotlib. It focuses not only on loading, manipulating and summarizing data, but also on modelling the data.

TENSORFLOW

TensorFlow is an open-source platform used for creating ML applications. It allows developers to create machine learning applications using various tools, libraries and community resources. It uses dataflow and differentiable programming to perform various tasks.

The architecture of TensorFlow works in three parts:
1. Data pre-processing
2. Model building
3. Training and estimating the model

It is called TensorFlow because it takes inputs as multi-dimensional arrays (tensors). A flowchart can be constructed with the operations that are to be performed on that input: the input is given at one end, flows through the system while multiple operations are performed, and the output is produced at the other end.

Components:
• Tensor: a tensor is a vector or an n-dimensional matrix which can represent all types of data. In a tensor, every value holds an identical data type with a known shape, and this shape represents the dimensionality of the matrix or array.
• Graph: TensorFlow uses a graph framework. All the computations done during training are gathered and described by the graph.

KERAS

Keras is an open-source, high-level neural network library. It can run on Theano, TensorFlow and CNTK. Francois Chollet, one of the Google engineers, developed this library. It supports convolutional networks, recurrent networks and their combinations. It is made user-friendly, extensible and modular to facilitate faster experimentation with deep neural networks. Because it cannot handle low-level computations itself, it uses a backend library to perform them.

Features:
• It is multi-backend and supports multiple platforms, which helps developers come together for coding.
• All concepts can be easily grasped.
• Supports fast prototyping.
• It runs on both CPU and GPU.
• Models can be produced easily using Keras.

GOOGLE COLAB

• Google Colaboratory, or Colab for short, is a Google Research product which allows developers to write and execute Python code through their browser. Google Colab is an excellent tool for deep learning tasks.


CHAPTER -4


4. SYSTEM DESIGN

4.1 Introduction to UML

The Unified Modeling Language allows the software engineer to express an analysis model using a modeling notation that is governed by a set of syntactic, semantic and pragmatic rules. A UML system is represented using five different views that describe the system from distinctly different perspectives. Each view is defined by a set of diagrams, as follows:
1. User Model View
This view represents the system from the user's perspective. The analysis representation describes a usage scenario from the end user's perspective.
2. Structural Model View
In this model, the data and functionality are viewed from inside the system. This model view models the static structures.
3. Behavioural Model View
It represents the dynamic or behavioural parts of the system, depicting the interactions between the various structural elements described in the user model and structural model views.
4. Implementation Model View
In this view, the structural and behavioural parts of the system are represented as they are to be built.
5. Environmental Model View
In this view, the structural and behavioural aspects of the environment in which the system is to be implemented are represented.

4.2 UML Diagrams

4.2.1 Use Case Diagram

To model a system, the most important aspect is to capture its dynamic behaviour. To clarify in a bit more detail, dynamic behaviour means the behaviour of the system when it is running/operating.

Static behaviour alone is not sufficient to model a system; dynamic behaviour is more important than static behaviour. In UML there are five diagrams available to model the dynamic nature, and the use case diagram is one of them. Since the use case diagram is dynamic in nature, there should be some internal or external factors for making the interaction.

These internal and external agents are known as actors. Use case diagrams therefore consist of actors, use cases and their relationships. The diagram is used to model the system/subsystem of an application. A single use case diagram captures a particular functionality of a system, so to model the entire system a number of use case diagrams are used.

Fig 4.2.1 – Use Case Diagram


Use case diagrams are used to gather the requirements of a system, including internal and external influences. These requirements are mostly design requirements. So when a system is analysed to gather its functionalities, use cases are prepared and actors are identified. In brief, the purposes of use case diagrams are as follows:
a. Used to gather requirements of a system.
b. Used to get an outside view of a system.
c. Identify external and internal factors influencing the system.
d. Show the interaction among the requirements, which are the actors.

4.2.2 Sequence Diagram

Sequence diagrams describe interactions among classes in terms of an exchange of messages over time. They are also called event diagrams. A sequence diagram is a good way to visualize and validate various runtime scenarios. These can help to predict how a system will behave and to discover responsibilities a class may need to have in the process of modelling a new system.

The aim of a sequence diagram is to define the event sequences which lead to a desired outcome. The focus is more on the order in which messages occur than on the messages per se. However, the majority of sequence diagrams will communicate what messages are sent and the order in which they tend to occur.

Basic Sequence Diagram Notations

Class Roles or Participants
Class roles describe the way an object will behave in context. Use the UML object symbol to illustrate class roles, but don't list object attributes.

Activation or Execution Occurrence
Activation boxes represent the time an object needs to complete a task. When an object is busy executing a process or waiting for a reply message, use a thin grey rectangle placed vertically on its lifeline.

Messages
Messages are arrows that represent communication between objects. Use half-arrowed lines to represent asynchronous messages. Asynchronous messages are sent from an object that will not wait for a response from the receiver before continuing its tasks.

Lifelines
Lifelines are vertical dashed lines that indicate the object's presence over time.

Destroying Objects
Objects can be terminated early using an arrow labelled "<< destroy >>" that points to an X. When that object's lifeline ends, place an X at the end of the lifeline to denote a destruction occurrence; the object is then removed from memory.

Loops
A repetition or loop within a sequence diagram is depicted as a rectangle. Place the condition for exiting the loop at the bottom left corner in square brackets [].

Guards
When modelling object interactions, there will be times when a condition must be met for a message to be sent to an object. Guards are conditions used throughout UML diagrams to control flow.

Fig 4.2.2 – Sequence Diagram

4.2.3 Class Diagram

Class diagrams are the main building blocks of every object-oriented method. A class diagram can be used to show classes, relationships, interfaces, associations, and collaborations. UML standardizes class diagrams. Since classes are the building blocks of an application based on OOP, the class diagram has an appropriate structure to represent classes, inheritance, relationships, and everything that OOP involves. It describes various kinds of objects and the static relationships between them.

Fig 4.2.3 – Class Diagram


The main purposes of using class diagrams are:

1. It is the only UML diagram which can appropriately depict various aspects of the OOP concept.
2. Proper design and analysis of the application can be made faster and more efficient.
3. It is the base for deployment and component diagrams.

Each class is represented by a rectangle having a subdivision of three compartments: name, attributes and operations.


CHAPTER -5


5. IMPLEMENTATION

5.1 Pseudo Code

Step 1: Mount the training and testing datasets into Colab in order to get access to them.
Step 2: Load the required libraries.
Step 3: Create a function to load the text and labels from the train and test sets.
Step 4: Import the required libraries for text preprocessing.
Step 5: Write the code for vectorizing the data.
Step 6: Split the data into train and test data.
Step 7: Train the model with the training data.
Step 8: Test the algorithm with the test data.


5.2 Code Snippets

from google.colab import drive
drive.mount('/content/drive')

#loading the required libraries

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from tensorflow.python.keras import models, layers, optimizers
import tensorflow
from tensorflow.keras.preprocessing.text import Tokenizer, text_to_word_sequence
from tensorflow.keras.preprocessing.sequence import pad_sequences
import bz2
from sklearn.metrics import f1_score, roc_auc_score, accuracy_score
import re

#Creating a function to load the text and labels from train and test set

def get_labels_and_texts(file):
    # Each line of the fastText-format file is "__label__1 <text>" (negative)
    # or "__label__2 <text>" (positive).
    target = {0: 'Negative', 1: 'Positive'}
    with open(file, 'r') as file1:
        lines = file1.readlines()

    # Character 9 of "__label__X" is the digit X; map 1/2 to Negative/Positive.
    label = [target[int(line[9]) - 1] for line in lines]
    # Everything after "__label__X " (11 characters) is the review text.
    reviews = [line[11:] for line in lines]

    df = pd.DataFrame(data={"label": label, "reviews": reviews})
    return df
train = get_labels_and_texts('/content/drive/MyDrive/data/train.ft.txt').sample(140000, random_state=42)
test = get_labels_and_texts('/content/drive/MyDrive/data/test.ft.txt')

#text pre-processing

from sklearn.linear_model import LogisticRegression
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.preprocessing import LabelEncoder
from sklearn.metrics import plot_confusion_matrix, plot_precision_recall_curve

#vectorization
vec = TfidfVectorizer(ngram_range=(1, 2), min_df=3, max_df=0.9, strip_accents='unicode',
                      use_idf=1, smooth_idf=1, sublinear_tf=1)
encoder = LabelEncoder()
X_train = vec.fit_transform(train['reviews'])
X_test = vec.transform(test['reviews'])
Y_train = encoder.fit_transform(train['label'])
Y_test = encoder.transform(test['label'])

#training the model

log_model = LogisticRegression(C=4, dual=True, solver='liblinear', random_state=42)
log_model.fit(X_train, Y_train)

#testing the data for accuracy

predicts = log_model.predict_proba(X_test)
plot_confusion_matrix(log_model, X_test, Y_test)


#testing the data with some sentences

data = pd.DataFrame(data={"label": [1], "reviews": ['akhanda is a blockbuster movie. Music was excellent']})
my_test_X = vec.transform(data['reviews'])
predicts = log_model.predict_proba(my_test_X)
print(predicts)
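As a possible follow-up (a sketch that assumes the log_model, X_test and Y_test variables defined above), the metrics imported earlier but not used, such as accuracy_score and f1_score, can be applied to hard predictions to report scores alongside the confusion matrix:

#evaluating hard predictions with the imported metrics (assumes log_model, X_test, Y_test exist)
Y_pred = log_model.predict(X_test)
print("Accuracy:", accuracy_score(Y_test, Y_pred))
print("F1 score:", f1_score(Y_test, Y_pred))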


CHAPTER -6


6. TESTING

6.1 Introduction to Testing

Testing is the process of evaluating a system or its component(s) with the intent of finding whether it satisfies the specified requirements or not. Testing means executing a system in order to identify any gaps, errors, or missing requirements contrary to the actual requirements.

According to the ANSI/IEEE 1059 standard, testing can be defined as: a process of analyzing a software item to detect the differences between existing and required conditions (that is, defects/errors/bugs) and to evaluate the features of the software item.

Who does Testing?

It depends on the process and the associated stakeholders of the project(s). In the IT industry, large companies have a team with responsibilities to evaluate the developed software in the context of the given requirements. Moreover, developers also conduct testing, which is called unit testing. In most cases, the following professionals are involved in testing a system within their respective capacities:

● Software Tester
● Software Developer
● Project Lead/Manager
● End User
Levels of testing include different methodologies that can be used while conducting
software testing. The main levels of software testing are:

● Functional Testing
● Non-functional Testing

Functional Testing
This is a type of black-box testing that is based on the specifications of the software that is to be tested. The application is tested by providing input, and then the results are examined; they need to conform to the functionality the application was intended for. Functional testing of software is conducted on a complete, integrated system to evaluate the system.


6.2 Software Testing Life Cycle


The process of testing software in a well-planned and systematic way is known as the software testing life cycle (STLC). Different organizations have different phases in the STLC; however, a generic Software Test Life Cycle (STLC) for the waterfall development model consists of the following phases.

1. Requirements Analysis
2. Test Planning
3. Test Analysis
4. Test Design

● Requirements Analysis
In this phase testers analyze the customer requirements and work with developers during the design phase to see which requirements are testable and how they are going to test those requirements. It is very important to start testing activities from the requirements phase itself, because the cost of fixing a defect is much lower if it is found in the requirements phase rather than in later phases.

● Test Planning
In this phase all the planning about testing is done: what needs to be tested, how the testing will be done, the test strategy to be followed, what the test environment will be, what test methodologies will be followed, hardware and software availability, resources, risks, etc. A high-level test plan document is created which includes all the planning inputs mentioned above and is circulated to the stakeholders.

● Test Analysis
After the test planning phase is over, the test analysis phase starts. In this phase we need to dig deeper into the project and figure out what testing needs to be carried out in each SDLC phase. Automation activities are also decided in this phase: if automation needs to be done for the software product, how the automation will be done, how much time it will take to automate, and which features need to be automated. Non-functional testing areas (stress and performance testing) are also analyzed and defined in this phase.

● Test Design
In this phase various black-box and white-box test design techniques are used to design the test cases. Testers start writing test cases by following those design techniques; if automation testing needs to be done, then the automation scripts also need to be written in this phase.

6.3 Test Cases

● A review that looks positive but is actually negative should be classified as negative.
● A review that looks negative but is actually positive should be classified as positive (see the sketch below).
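A minimal sketch of how such sentences can be checked, assuming the fitted vec, encoder and log_model from Chapter 5 are available (the example sentences themselves are illustrative):

# Feed tricky sentences through the trained vectorizer and model and
# compare the predicted class with the expected one.
test_sentences = [
    ("It looked great in the photos, but it broke on the first day.", "Negative"),
    ("Not bad at all, I would definitely buy this again.", "Positive"),
]

for text, expected in test_sentences:
    pred = log_model.predict(vec.transform([text]))[0]       # encoded class (0 or 1)
    predicted_label = encoder.inverse_transform([pred])[0]   # back to 'Negative'/'Positive'
    print(predicted_label, "(expected", expected + "):", text)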


CHAPTER -7


7. SCREENSHOTS

7.1 Mounting the data and loading required libraries

Fig 7.1 - Mounting data and loading required libraries


7.2 Creating a function to load text from training and test data

Fig 7.2 – Creating a function to load text from training and test data


7.3 Vectorization and fitting data into ML algorithm

Fig 7.3 – Vectorization and fitting data into ML algorithm


7.4 Plotting confusion matrix and testing the algorithm

Fig 7.4 – Plotting confusion matrix and testing the algorithm


7.5 Test results

Fig 7.5 – Test result


CHAPTER -8


FUTURE ENHANCEMENTS

Sentiment analysis is a uniquely powerful tool for businesses that are looking to measure
attitudes, feelings and emotions regarding their brand. To date, the majority of sentiment
analysis projects have been conducted almost exclusively by companies and brands
through the use of social media data, survey responses and other hubs of user-generated
content. By investigating and analyzing customer sentiments, these brands are able to get
an inside look at consumer behaviors and, ultimately, better serve their audiences with
the products, services and experiences they offer.
Algorithms have long been at the foundation of most forms of analytics, including social
media and sentiment analysis. With recent years bringing big leaps in machine learning
and artificial intelligence, many analytics solutions are looking to these technologies to
replace algorithms. Unfortunately for organizations looking to leverage sentiment
analysis to measure audience emotions, machine learning isn’t yet ready to tackle the
complex nuances of text and how we talk, especially on social media channels that are
rife with slang, sarcasm, double meanings and misspellings. These make it difficult for
artificial intelligence systems to accurately sort and classify sentiments on social media.
And, with any analysis project, accuracy is crucial. It is uncertain if machine learning
will progress to the point that it is capable of accurately analyzing text, or if sentiment
analysis projects will have to find a new basis to avoid the current plateau of algorithms.


CHAPTER -9


CONCLUSION

Text classification is a fundamental machine learning problem with applications across various products and is a commonly used NLP task. In this report, we have broken down the text classification workflow into several steps. In this project, we saw how text classification can be performed using Python and its libraries. We performed sentiment analysis of Amazon product reviews. The accuracy value shows the percentage of the testing data set that was classified correctly by the model.

With more and more organizations turning to sentiment analysis to measure and predict outcomes, as well as to better understand consumer behaviour, these tools are quickly building a reputation that is going to help propel them forward into the future and towards deeper and more accurate conclusions and insights.


CHAPTER -10


REFERENCES

1. Text Classification with Python and Scikit-Learn, stackabuse.com.
2. D. D. Lewis, "Representation and Learning in Information Retrieval", PhD dissertation, Dept. of Computer Science, Univ. of Massachusetts, Amherst, 1992.
3. A. McCallum and K. Nigam, "A comparison of event models for Naive Bayes text classification", in Learning for Text Categorization: Papers from the AAAI Workshop, AAAI Press, 1998, pp. 41-48, Technical Report WS-98-05.
4. Y. Yang and J. O. Pedersen, "A comparative study on feature selection in text categorization", in International Conference on Machine Learning, 1997, pp. 412-420.
5. M. Devi, M. Phore and P. Kumari, "Mind Reading Computers", International Journal of Advanced Research in Computer Engineering & Technology, 2(12), 2013.
6. A. Dhall and R. Goecke, "Group Expression Intensity Estimation in Videos via Gaussian Processes", in Proceedings of the International Conference on Pattern Recognition (ICPR 2012), Tsukuba, Japan, 11-15 Nov 2012.
7. A. G. Hart, W. S. Carpenter, E. Hlustik-Smith, M. Reed and A. E. Goodenough, "Testing the potential of Twitter mining methods for data acquisition: Evaluating novel opportunities for ecological research in multiple taxa", Methods in Ecology and Evolution, 9(11):2194-2205, 2018.
8. P. Singh, R. S. Sawhney and K. S. Kahlon, "Sentiment analysis of demonetization of 500 & 1000 rupee banknotes by Indian government", ICT Express, 4(3):124-129, 2018.

9. A. Kehagias, V. Petridis, V. Kaburlasos and P. Fragkou, "A Comparison of Word- and Sense-Based Text Categorization Using Several Classification Algorithms", JIIS, Volume 21, Issue 3, 2003, pp. 227-247.