
A PROJECT REPORT

ON
SENTIMENTAL ANALYSIS USING AI-DEEP LEARNING

A project report submitted in partial fulfillment of the requirements for the award of
the degree of

BACHELOR OF TECHNOLOGY
IN
COMPUTER SCIENCE & ENGINEERING

By
A. Supraja 172P1A0505
D. Indira 172P1A0517
H. Neelesh 172P1A0534
G. Yashaswini 172P1A0523
C. Vyshnavi 172P1A0513

UNDER THE ESTEEMED GUIDANCE OF


P. Narasimhaiah
Assistant Professor

DEPARTMENT OF COMPUTER SCIENCE AND ENGINEERING


AN ISO 9001:2015 CERTIFIED INSTITUTION
CHAITANYA BHARATHI INSTITUTE OF TECHNOLOGY
(Sponsored by Bharathi Educational Society)
(Affiliated to J.N.T.U.A., Anantapuramu, Approved by AICTE, New Delhi)
Recognized by UGC Under the Sections 2(f)&12(B) of UGC Act,1956
(Accredited by NAAC)
Vidyanagar, Proddatur-516 360, Y.S.R.(Dist.), A.P.
2017-2021
DEPARTMENT OF COMPUTER SCIENCE AND ENGINEERING
AN ISO 9001:2015 CERTIFIED INSTITUTION
CHAITANYA BHARATHI INSTITUTE OF TECHNOLOGY
(Sponsored by Bharathi Educational Society)
(Affiliated to J.N.T.U.A. Anantapuramu, Approved by AICTE, New Delhi)
Recognized by UGC under the Sections 2(f) &12(B) of UGC Act, 1956
Accredited by NAAC
Vidyanagar, Proddatur-516 360, Y.S.R. (Dist.), A.P.

CERTIFICATE
This is to certify that the project work entitled “SENTIMENTAL ANALYSIS USING AI-DEEP LEARNING” is a bonafide work of A. SUPRAJA (172P1A0505), D. INDIRA (172P1A0517), H. NEELESH (172P1A0534), G. YASHASWINI (172P1A0523), and C. VYSHNAVI (172P1A0513), submitted to Chaitanya Bharathi Institute of Technology, Proddatur in partial fulfillment of the requirements for the award of the degree of Bachelor of Technology in COMPUTER SCIENCE AND ENGINEERING. The work reported herein does not form part of any other thesis on which a degree has been awarded earlier.
This is to further certify that they have worked for a period of one semester in preparing their work under our supervision and guidance.

INTERNAL GUIDE HEAD OF THE DEPARTMENT


P. NARASIMHAIAH G. SREENIVASA REDDY
Assistant Professor Associate Professor

PROJECT CO-ORDINATOR
N. SRINIVASAN, M.Tech., (Ph.D.)

INTERNAL EXAMINER EXTERNAL EXAMINER


DECLARATION BY THE CANDIDATES

We, A. Supraja, D. Indira, H. Neelesh, G. Yashaswini, and C. Vyshnavi, bearing respective Roll Nos. 172P1A0505, 172P1A0517, 172P1A0534, 172P1A0523, and 172P1A0513, hereby declare that the Project Report entitled “SENTIMENTAL ANALYSIS USING AI-DEEP LEARNING”, carried out under the guidance of P. NARASIMHAIAH, Assistant Professor, Department of CSE, is submitted in partial fulfillment of the requirements for the award of the degree of Bachelor of Technology in Computer Science & Engineering.
This is a record of bonafide work carried out by us, and the results embodied in this Project Report have not been reproduced or copied from any source. The results embodied in this Project Report have not been submitted to any other University or Institute for the award of any other Degree or Diploma.

A.SUPRAJA 172P1A0505
D.INDIRA 172P1A0517
H.NEELESH 172P1A0534
G.YASHASWINI 172P1A0523
C.VYSHNAVI 172P1A0513

Dept. of Computer Science & Engineering

Chaitanya Bharathi Institute of Technology

Vidyanagar, Proddatur, Y.S.R. (Dist.)

ACKNOWLEDGEMENT
An endeavor over a long period can be successful only with the advice and support of many well-wishers. We take this opportunity to express our gratitude and appreciation to all of them.

We are extremely thankful to our beloved Chairman Sri V. Jayachandra Reddy, who took keen interest and encouraged us in every effort throughout this course.

We owe our gratitude to our Principal Dr. G. Sreenivasula Reddy, M.Tech., Ph.D., for permitting us to use the facilities available to accomplish the project successfully.

We express our heartfelt thanks to G. Sreenivasa Reddy, B.Tech., Ph.D., Head of the Department – CSE, for his kind attention and valuable guidance to us throughout this course.

We also express our deep sense of gratitude towards N. Srinivasan, Project Co-ordinator, Dept. of Computer Science and Engineering, for the support and guidance in completing our project.

We express our profound respect and gratitude to our project guide P. Narasimhaiah, B.Tech., for the valuable support and guidance in completing the project successfully.

We are highly thankful to Mr. M. Naresh Raju, Try Logic Soft Solutions AP Pvt. Limited, Hyderabad, who has been kind enough to guide us in the preparation and execution of this project.

We also thank all the teaching and non-teaching staff of the Dept. of Computer Science and Engineering for their support throughout our B.Tech. course.

We express our heartfelt thanks to our parents for their valuable support and encouragement in the completion of this course. We also express our heartfelt regards to our friends for being supportive in the completion of this project.
TABLE OF CONTENTS

CONTENT PAGE NO
ABSTRACT I

TABLE OF CONTENTS Ⅱ

1. INTRODUCTION 1

1.1 Domain Description 1

1.2 About Project 6

1.2.1 Problem Definition 6

1.2.2 Proposed Solution 6

1.3 Objective 8

2. SENTIMENTAL ANALYSIS SURVEY 9

2.1 Theoretical Background 9

2.2 Existing System 9

2.3 Proposed System 9

2.4 Advantages of proposed system 10

2.5 Feasibility Study 10

2.5.1 Operational Feasibility 10

2.5.2 Technical Feasibility 11

2.5.2.1 Survey of Technology 11

2.5.2.2 Feasibility of Technology 11

2.5.3 Economic Feasibility 11


3. SYSTEM ANALYSIS 12

Specifications 12

Software Requirements 13

Hardware Requirements 13

Module Description 13

4. DESIGN 17
Block Diagram 17

Data Flow Diagrams 18

Context Level DFD 19

Top Level DFD 20

Detailed Level DFD 20

Unified Modelling Language 21

Use Case Diagram 22

Sequence Diagram 24

Collaboration Diagram 25

Activity Diagram 26

5. IMPLEMENTATION 28

6. TESTING 47

Black Box Testing 47

White Box Testing 48

7. OUTPUT SCREENS 49
8. CONCLUSION 55

9. FUTURE ENHANCEMENT 56

10. BIBLIOGRAPHY 57
ABSTRACT

We are carrying out this major project on SENTIMENT ANALYSIS USING AI-DEEP LEARNING.

This project addresses the problem of sentiment analysis, or opinion mining, in social media such as Facebook and Twitter, that is, classifying tweets or people's opinions according to the sentiment expressed in them: positive, negative, or neutral.

Social media contains information in very large amounts, and extracting it has several uses in various fields. In the field of biomedical and healthcare, extracting information from social media provides a number of benefits, such as knowledge about the latest technology and updates on the current situation in the medical field.

Due to the enormous amount of data and opinions being produced, shared, and transferred every day across the internet and other media, sentiment analysis has become one of the most active research fields in natural language processing.

Sentiment analysis, or opinion mining, is performed using Machine Learning models and Deep Learning models. Deep learning has made great breakthroughs in the fields of speech and image recognition.

In the implemented system, tweets or people's opinions are collected and sentiment analysis is performed on them. Based on the results of the sentiment analysis, a few suggestions can be provided to the user.

In our project, we use NLP for processing the data, and Deep Learning and Machine Learning algorithms to build a model.


LIST OF FIGURES

Figure Number    Name of Figure    Page Number

1.1.1 Image for Deep Learning 1

1.1.2 Image for Machine Learning 2

1.1.3 Image for Computer Vision 3

1.1.4 Image for Autonomous Vehicles 4

1.1.5 Image for Bots based on Deep learning 5

4.1.1 Block Diagram for Sentimental Analysis 17

4.2.1 Context Level DFD for Sentimental Analysis 19

4.2.2 Top Level DFD for Sentimental Analysis 20

4.2.3.1 Detailed Level DFD for Sentimental Analysis 21

4.3.1 Use Case Diagram 23

4.3.2 Sequence Diagram 24

4.3.3 Collaboration Diagram 25

4.3.4 Activity Diagram 26

4.3.5 Data Dictionary 27

5.2.1 Image for Random Forest Algorithm 39


CHAPTER 1

INTRODUCTION

1.1 DOMAIN DESCRIPTION

What is Deep Learning?

Deep Learning is a class of Machine Learning which performs much better on unstructured data. Deep learning techniques are outperforming current machine learning techniques. Deep learning enables computational models to learn features progressively from data at multiple levels. Its popularity has amplified as the amount of available data has increased, along with advances in hardware that provide powerful computers.

Fig 1.1.1 Image for Deep Learning

Deep learning has emerged as a powerful machine learning technique that learns
multiple layers of representations or features of the data and produces state-of-the-art
prediction results. Along with the success of deep learning in many other application domains,
deep learning is also popularly used in sentiment analysis in recent years.

What is Machine Learning?

Machine learning is an application of artificial intelligence (AI) that provides systems the ability to automatically learn and improve from experience without being explicitly programmed. Machine learning focuses on the development of computer programs that can access data and use it to learn for themselves.

Fig 1.1.2 Image for Machine Learning

Machine Learning (ML) is coming into its own, with a growing recognition that ML can play a key role in a wide range of critical applications, such as data mining, natural language processing, image recognition, and expert systems. ML provides potential solutions in all these domains and more, and is set to be a pillar of our future civilization.

“A computer program is said to learn from experience E with respect to some task T
and some performance measure P, if its performance on T, as measured by P, improves with
experience E.” -- Tom Mitchell, Carnegie Mellon University

Some machine learning methods:

Machine learning algorithms are often categorized as supervised or unsupervised.


Supervised machine learning:
Supervised machine learning algorithms can apply what has been learned in the past
to new data using labelled examples to predict future events.

Unsupervised machine learning:


Unsupervised machine learning algorithms are used when the information used to
train is neither classified nor labelled.

Deep Learning Examples in Real Life:

1.Computer Vision

High-end gamers interact with deep learning modules on a very frequent basis.
Deep neural networks power bleeding-edge object detection, image classification, image
restoration, and image segmentation.

Fig 1.1.3 Image for Computer Vision

So much so that they even power the recognition of hand-written digits on a computer system. In effect, deep learning rides on extraordinary neural networks to empower machines to replicate the mechanism of the human visual system.

2.Autonomous Vehicles

The next time you are lucky enough to witness an autonomous vehicle driving down the road, understand that there are several AI models working simultaneously. While some models pinpoint pedestrians, others are adept at identifying street signs. A single car can be informed by millions of AI models while driving down the road. Many consider AI-powered driving to be safer than human driving.

Fig 1.1.4 Image for Autonomous Vehicles

3.Automated Translation

Automated translations did exist before the addition of deep learning. But deep
learning is helping machines make enhanced translations with the guaranteed accuracy that
was missing in the past. Plus, deep learning also helps in translation derived from images –
something totally new that could not have been possible using traditional text-based
interpretation.

4.Bots Based on Deep Learning

Take a moment to digest this – Nvidia researchers have developed an AI system that
helps robots learn from human demonstrative actions. Housekeeping robots that perform
actions based on artificial intelligence inputs from several sources are rather common. Like
human brains process actions based on past experiences and sensory inputs, deep-learning
infrastructures help robots execute tasks depending on varying AI opinions.
Fig 1.1.5 Image for Bots based on Deep Learning

5.Sentiment based News Aggregation

Carolyn Gregoire writes in her Huffington Post piece: “the world isn’t falling apart, but it can sure feel like it.” And we couldn’t agree more. We are not naming names here, but you cannot scroll down any of your social media feeds without stumbling across a couple of global disasters – with the exception of Instagram, perhaps.

News aggregators are now using deep learning modules to filter out negative news and
show you only the positive stuff happening around. This is especially helpful given how
blatantly sensationalist a section of our media has been of late.

Machine Learning Examples in real life:

1. Image Recognition

Image recognition is a well-known and widespread example of machine learning in the real world. It can identify an object in a digital image based on the intensity of the pixels in black-and-white or colour images.

Real-world examples of image recognition:

• Label an X-ray as cancerous or not
• Assign a name to a photographed face (aka “tagging” on social media)
2. Medical Diagnosis

Machine learning can help with the diagnosis of diseases. Many physicians use chatbots with speech recognition capabilities to discern patterns in symptoms.

In the case of rare diseases, the joint use of facial recognition software and
machine learning helps scan patient photos and identify phenotypes that correlate with rare
genetic diseases.

3. Sentimental Analysis

Sentiment analysis is a top-notch machine learning application that refers to sentiment classification, opinion mining, and analyzing emotions. Using this model, machines groom themselves to analyze sentiments based on words. They can identify whether the words are said in a positive, negative, or neutral tone, and they can also determine the magnitude of these words.

1.2 ABOUT PROJECT

The growth of the internet due to social networks such as Facebook, Twitter, LinkedIn, Instagram, etc. has led to significant user interaction and has empowered users to express their opinions about products, services, events, and their preferences, among others. It has also provided opportunities for users to share their wisdom and experiences with each other. The rapid development of social networks is causing explosive growth of digital content. It has turned online opinions, blogs, tweets, and posts into a very valuable asset for corporates to gain insights from the data and plan their strategy. Business organizations need to process and study these sentiments to investigate the data and to gain business insights. The traditional approach of manually extracting complex features, identifying which features are relevant, and deriving patterns from this huge amount of information is very time consuming and requires significant human effort. However, Deep Learning can exhibit excellent performance via Natural Language Processing (NLP) techniques to perform sentiment analysis on this massive information. The core idea of Deep Learning techniques is to identify complex features extracted from this vast amount of data without much external intervention, using deep neural networks. These algorithms automatically learn new complex features. Both automatic feature extraction and the availability of resources are very important when comparing the traditional machine learning approach and deep learning techniques. Here the goal is to classify the opinions and sentiments expressed by users.

Sentiment analysis is a set of techniques/algorithms used to detect the sentiment (positive, negative, or neutral) of a given text. It is a very powerful application of natural language processing (NLP) and finds usage in a large number of industries. It refers to the use of NLP, text analysis, computational linguistics, and biometrics to systematically identify, extract, quantify, and study different states and subjective information. Sentiment analysis sometimes goes beyond the categorization of texts to find opinions and categorizes them as positive or negative, desirable or undesirable. The figure below describes the architecture of sentiment classification on texts. In this, we modify the provided reviews by applying specific filters, prepare the dataset by applying the parameters, and implement our proposed model for evaluation.

Another challenge of microblogging is the incredible breadth of topics that are covered. It is not an exaggeration to say that people tweet about anything and everything. Therefore, to be able to build systems to mine sentiment about any given topic, we need a method for quickly identifying data that can be used for training. In this report, we explore one method for building such data: using hashtags (e.g., #bestfeeling, #epicfail, #news) to identify positive, negative, and neutral tweets to use for training three-way sentiment classifiers.

The online medium has become a significant way for people to express their opinions, and with social media there is an abundance of opinion information available. Using sentiment analysis, the polarity of an opinion, such as positive, negative, or neutral, can be found by analyzing the text of the opinion. Sentiment analysis has been useful for companies to get their customers' opinions on their products, to predict the outcomes of elections, and to gather opinions from movie reviews. The information gained from sentiment analysis is useful for companies making future decisions. Many traditional approaches in sentiment analysis use the bag-of-words method. The bag-of-words technique does not consider language morphology, and it could incorrectly classify two phrases as having the same meaning because they could have the same bag of words. The relationship between the collection of words is considered instead of the relationship between individual words. When determining the overall sentiment, the sentiment of each word is determined and combined using a function. Bag of words also ignores word order, which leads to phrases with negation in them being incorrectly classified.

1.3 OBJECTIVE

To address this problem, we first collect data from various sources such as different websites, PDFs, and Word documents. After collecting the data, we convert it into a CSV file and then break the data into individual sentences. Then, using Natural Language Processing (NLP), we eliminate stop words. Stop words are words that are considered useless in a sentence, i.e., extra data that is of no use; for example, "the", "a", "an", and "in" are some stop words in English. After that, the Naive Bayes algorithm is used to train the model, and an ANN algorithm works in the backend to generate a pickle file.

A confusion matrix is used as the validation technique, and accuracy is used to evaluate the model.
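A minimal sketch of this pipeline is shown below. The file name tweets.csv, its column names, and the use of NLTK and scikit-learn here are illustrative assumptions rather than the project's exact implementation; the ANN stage that produces the final pickle file would plug in at the same point once the features are extracted.

import pickle

import pandas as pd
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB

# Requires nltk.download("stopwords") and nltk.download("punkt") once beforehand.
# Load the collected data after it has been converted into a CSV file
# (file name and column names are assumptions for illustration).
data = pd.read_csv("tweets.csv")           # columns: "text", "sentiment"

# Remove English stop words such as "the", "a", "an", "in".
stop_words = set(stopwords.words("english"))

def remove_stop_words(sentence):
    tokens = word_tokenize(sentence.lower())
    return " ".join(t for t in tokens if t not in stop_words)

data["clean_text"] = data["text"].apply(remove_stop_words)

# Convert the cleaned sentences into count features and train Naive Bayes.
vectorizer = CountVectorizer()
X = vectorizer.fit_transform(data["clean_text"])
y = data["sentiment"]                      # positive / negative / neutral

model = MultinomialNB()
model.fit(X, y)

# Persist the trained model and vectorizer as a pickle file.
with open("sentiment_model.pkl", "wb") as f:
    pickle.dump({"model": model, "vectorizer": vectorizer}, f)

Training on word counts with Multinomial Naive Bayes mirrors the Naive Bayes step described above; a neural network trained on the same features could be pickled in exactly the same way.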
CHAPTER 2

SENTIMENTAL ANALYSIS SURVEY

2.1 THEORETICAL BACKGROUND

This project addresses the problem of sentiment analysis, or opinion mining, in social media such as Facebook and Twitter, that is, classifying tweets or people's opinions according to the sentiment expressed in them: positive, negative, or neutral.

Social media contains information in very large amounts, and extracting information from social media has several uses in various fields. In the field of biomedical and healthcare, extracting information from social media provides a number of benefits, such as knowledge about the latest technology and updates on the current situation in the medical field.

2.2 EXISTING SYSTEM WITH DRAWBACKS

Existing approaches to sentiment analysis are knowledge-based techniques (lexicon-based approaches), statistical methods, and hybrid approaches.

Knowledge-based techniques make use of prebuilt lexicon sources containing the polarity of sentiment words, such as SentiWordNet (SWN), for determining the polarity of a tweet. The lexicon-based approach suffers from poor recognition of sentiment.

Statistical methods involve machine learning (such as SVM) and deep learning approaches; both require labeled training data for polarity detection.

The hybrid approach to sentiment analysis exploits both statistical methods and knowledge-based methods for polarity detection. It inherits high accuracy from machine learning (statistical methods) and stability from the lexicon-based approach.

2.3 PROPOSED SYSTEM WITH FEATURES

In the proposed system, sentiment analysis is done using natural language processing, which defines a relation between a user-posted tweet and its opinion and, in addition, people's suggestions.

Truly listening to a customer's voice requires deeply understanding what they have expressed in natural language. NLP is the best way to understand the natural language used and uncover the sentiment behind it. NLP makes speech analysis easier.

Without NLP and access to the right data, it is difficult to discover and collect the insight necessary for driving business decisions. Deep Learning algorithms are used to build a model.

2.4 ADVANTAGES OF PROPOSED SYSTEM

Advanced techniques like natural language processing are used for the sentiment analysis, which makes our project very accurate.

NLP defines a relation between a user-posted tweet and its opinion and, in addition, people's suggestions.

NLP is the best way to understand the natural language used by people and uncover the sentiment behind it. NLP makes speech analysis easier.

2.5 FEASIBILITY STUDY:


As the name implies, a feasibility analysis is used to determine the viability of an idea,
such as ensuring a project is legally and technically feasible as well as economically
justifiable. It tells us whether a project is worth the investment—in some cases, a project may
not be doable. There can be many reasons for this, including requiring too many resources,
which not only prevents those resources from performing other tasks but also may cost more
than an organization would earn back by taking on a project that isn’t profitable.

2.5.1 Operational Feasibility:


The number of people working on this project is 3 to 4. These persons should have knowledge of the technologies from the domain of Artificial Intelligence (A.I.), namely an understanding of Machine Learning (M.L.) and its types, and the working of Natural Language Processing (N.L.P.).
2.5.2 Technical Feasibility:
Technical feasibility is the study which assesses the details of how you intend to
deliver a product or service to customers. Think materials, labour, transportation, where your
business will be located, and the technology that will be necessary to bring all this together.
It’s the logistical or tactical plan of how your business will produce, store, deliver and track
its products or services.

2.5.2.1 Survey of Technology:


For our project we have chosen Artificial Intelligence (A.I.) technology, as we found that by using this technology we can complete our project and get our desired output for the users.

2.5.2.2 Feasibility of Technology:
For our project, from Machine Learning (M.L.) we have chosen an unsupervised machine learning task to train our data using GloVe, i.e., the Global Vectors for Word Representation dataset. After training on this dataset, we will give our inputs to the model and it will display the top N sentences.
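As a rough illustration of how pre-trained GloVe vectors might be used, the sketch below loads a GloVe text file into a dictionary and represents a sentence as the average of its word vectors. The file name glove.6B.100d.txt and the averaging scheme are assumptions for illustration only, not the project's exact setup.

import numpy as np

def load_glove(path):
    # Load GloVe word vectors from a whitespace-separated text file.
    vectors = {}
    with open(path, encoding="utf-8") as f:
        for line in f:
            parts = line.rstrip().split(" ")
            vectors[parts[0]] = np.asarray(parts[1:], dtype=np.float32)
    return vectors

# Assumed file name; the file must be downloaded from the GloVe project page first.
glove = load_glove("glove.6B.100d.txt")

def sentence_vector(sentence, dim=100):
    # Represent a sentence as the average of its known word vectors.
    words = [w for w in sentence.lower().split() if w in glove]
    if not words:
        return np.zeros(dim, dtype=np.float32)
    return np.mean([glove[w] for w in words], axis=0)

print(sentence_vector("the service was really good").shape)   # (100,)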

2.5.3 Economic Feasibility:


Our project is economically feasible, as it requires only a small to medium amount of resources, which will cost a correspondingly modest amount.
CHAPTER 3

SYSTEM ANALYSIS

System analysis is conducted for the purpose of studying a system or its parts in order to identify its objectives. It is a problem-solving technique that improves the system and ensures that all the components of the system work efficiently to accomplish their purpose.

3.1 SPECIFICATION
Functional requirements

The following are the functional requirements of our project:

• A training dataset has to be created on which training is performed.

• A testing dataset has to be created on which testing is performed.

Non Functional Requirements:

• Maintainability: Maintainability makes future maintenance easier and helps meet new requirements.
• Robustness: Robustness is the quality of being able to withstand stresses, pressures, or changes in procedure or circumstance.
• Reliability: Reliability is the ability of a person or system to perform and maintain its functions under the given circumstances.
• Size: The size of a particular application plays a major role; if the size is small, the efficiency will be high.
• Speed: If the speed is high, then it is good. Since the number of lines in our code is small, the speed is high.
3.2 SOFTWARE REQUIREMENTS

One of the most difficult tasks is the selection of software: once the system requirements are known, determining whether a particular software package fits those requirements.

PROGRAMMING LANGUAGE : Python

TECHNOLOGY : PyCharm

OPERATING SYSTEM : Windows 10

BROWSER : Google Chrome

Table 3.2.1 Software Requirements

3.3 HARDWARE REQUIREMENTS


The selection of hardware is very important for the existence and proper working of any software. In the selection of hardware, the size and the capacity requirements are also important.

PROCESSOR : Intel Core

RAM CAPACITY : 4 GB

HARD DISK : 1 TB

I/O DEVICES : Keyboard, Monitor, Mouse

Table 3.3.1 Hardware Requirements

3.4 MODULE DESCRIPTION

For predicting the literacy rate of India, our project has been divided into the following modules:

1. Data Analysis & Pre-processing


2. Model Training &Testing

3. Accuracy Measures

4. Prediction & Visualization

1. Data Analysis & Pre-processing

Data analysis is done by collecting raw data from different literacy websites. Data pre-processing involves transforming raw data into an understandable format. Real-world data is often incomplete, inconsistent, and/or lacking in certain behaviours or trends, and is likely to contain many errors. Data pre-processing is a proven method of resolving such issues. We use the pandas module for data analysis and pre-processing.

Pandas:

In order to be able to work with the data in Python, we'll need to read the CSV file into a pandas DataFrame. A DataFrame is a way to represent and work with tabular data. Tabular data has rows and columns, just like our CSV file.
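A minimal sketch of this step is shown below, assuming a hypothetical file named data.csv; the real file name and columns depend on the dataset actually used.

import pandas as pd

# Read the CSV file into a DataFrame (the file name is an assumption).
df = pd.read_csv("data.csv")

# Inspect the tabular structure: rows, columns, and basic statistics.
print(df.shape)           # (number of rows, number of columns)
print(df.head())          # first five rows
print(df.describe())      # summary statistics for numeric columns
print(df.isnull().sum())  # count of missing values per column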

2. Model Training &Testing

For literacy rate prediction, we perform “converting into a 2D array” and “scaling using normalization” operations on the data for further processing. We use fit_transform to center the data so that it has zero mean and unit standard deviation. Then, we divide the data into x_train and y_train. Our model will take the 0-th element from x_train and try to predict the 0-th element from y_train. Finally, we reshape the x_train data to match the requirements for training using Keras. Now we need to train our model using the above data.
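The scaling and reshaping described above might look like the following sketch, using scikit-learn's StandardScaler and NumPy; the toy values, array shapes, and variable names are illustrative assumptions.

import numpy as np
from sklearn.preprocessing import StandardScaler

# Toy series standing in for the real data (values are invented).
values = np.array([61.0, 64.8, 68.3, 72.9, 74.0], dtype=np.float32)

# Convert into a 2D array, then scale to zero mean and unit variance.
values_2d = values.reshape(-1, 1)
scaler = StandardScaler()
scaled = scaler.fit_transform(values_2d)

# Build x_train / y_train pairs: each input element is used to predict the next value.
x_train = scaled[:-1]
y_train = scaled[1:]

# Reshape inputs to (samples, timesteps, features), the shape Keras sequence
# layers expect; a plain Dense model would keep the 2D shape instead.
x_train = x_train.reshape((x_train.shape[0], 1, 1))
print(x_train.shape, y_train.shape)   # (4, 1, 1) (4, 1)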

The algorithm that we have used is Linear Regression


Linear Regression:

Linear Regression is a machine learning algorithm based on supervised learning. It is a statistical approach for modeling the relationship between a dependent variable and a given set of independent variables. Here we refer to the dependent variable as the response and the independent variables as features for simplicity.

Simple Linear Regression is an approach for predicting a response using a single feature. It is assumed that the two variables are linearly related. Hence, we try to find a linear function that predicts the response value (y) as accurately as possible as a function of the feature or independent variable (x).
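A minimal scikit-learn sketch of simple linear regression is given below; the numbers are made up purely for illustration.

import numpy as np
from sklearn.linear_model import LinearRegression

# Illustrative data: one feature (x) and one response (y).
x = np.array([[2001], [2005], [2009], [2013], [2017]])   # e.g. year
y = np.array([61.0, 64.8, 68.3, 72.9, 74.0])             # e.g. observed rate

# Fit a straight line y = a + b*x to the data.
model = LinearRegression()
model.fit(x, y)

print("intercept a:", model.intercept_)
print("slope b:", model.coef_[0])
print("prediction for 2021:", model.predict([[2021]])[0])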

For predicting the literacy rate of any given year, first we need to predict the population for that year. Then the predicted population is given as input to the model which predicts the literacy rate.

For the algorithm which predicts the population, the year is taken as the independent variable, and the predicted population is taken as the independent variable for the literacy prediction algorithm.

Testing:

In testing, we now predict the data. Here we have two steps: predict the literacy rate and plot it to compare with the real results. We use fit_transform to scale the data and then reshape it for prediction, predict the data, and rescale the predicted data to match its real values. Then we plot the real and predicted literacy rates on a graph and calculate the accuracy.

We use the Sklearn and NumPy Python modules for training and testing.

Sklearn:

Scikit-learn features various classification, regression, and clustering algorithms, including support vector machines, random forests, gradient boosting, k-means, and DBSCAN, and is designed to interoperate with the Python numerical and scientific libraries NumPy and SciPy.
Numpy:

NumPy is the core library for scientific computing in Python. It provides a high-performance multidimensional array object and tools for working with these arrays. It is used for numerical calculations.

3. Accuracy Measures

The accuracy of the model is evaluated to determine the correctness of the prediction. The proposed model achieved 87% accuracy.

4. Prediction & Visualization

Using the proposed model, predictions are made for the coming years. Graphs are used to visualize state-wise literacy rate predictions. We use the Matplotlib Python module for visualization.

Matplotlib:

Matplotlib is a plotting library for the Python programming language and its numerical mathematics extension NumPy. It provides an object-oriented API for embedding plots into applications using general-purpose GUI toolkits like Tkinter, wxPython, Qt, or GTK+. There is also a procedural "pylab" interface based on a state machine (like OpenGL), designed to closely resemble that of MATLAB, though its use is discouraged.[3] SciPy makes use of Matplotlib.
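As a small illustration of the visualization step, the sketch below plots real versus predicted values with Matplotlib; the numbers are invented for demonstration.

import matplotlib.pyplot as plt

# Invented values purely for demonstration.
years = [2001, 2005, 2009, 2013, 2017]
real = [61.0, 64.8, 68.3, 72.9, 74.0]
predicted = [60.5, 65.2, 68.0, 72.1, 74.6]

plt.plot(years, real, marker="o", label="Real")
plt.plot(years, predicted, marker="x", linestyle="--", label="Predicted")
plt.xlabel("Year")
plt.ylabel("Value")
plt.title("Real vs. predicted values")
plt.legend()
plt.show()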
CHAPTER 4

DESIGN

4.1 BLOCK DIAGRAM


The block diagram is typically used for a higher level, less detailed description aimed
more at understanding the overall concepts and less at understanding the details of
implementation.

Figure 4.1.1 Block Diagram for Sentimental Analysis


4.2 DATA FLOW DIAGRAMS:

A data flow diagram (DFD) is a graphical representation of the “flow” of data through an information system, modelling its process aspects. Often it is a preliminary step used to create an overview of the system, which can later be elaborated. DFDs can also be used for the visualization of data processing (structured design).

A DFD shows what kinds of information will be input to and output from the system, where the data will come from and go to, and where the data will be stored. It does not show information about the timing of processes, or about whether processes will operate in sequence or in parallel. A DFD is also called a “bubble chart”.

DFD Symbols:

In the DFD, there are four symbols:

• A square defines a source or destination of system data.

• An arrow indicates dataflow. It is the pipeline through which the information flows.

• A circle or a bubble represents a process that transforms incoming dataflow into outgoing dataflow.

• An open rectangle is a data store: data at rest, or a temporary repository of data.

Dataflow: Data move in a specific direction from an origin to a destination.

Process: People, procedures, or devices that use or produce (transform) data. The physical component is not identified.

Sources: External sources or destinations of data, which may be programs, organizations, or other entities.

Data store: Here data is stored or referenced by a process in the system.

In our project, we built the data flow diagrams at the very beginning of business process modelling in order to model the functions that our project has to carry out and the interaction between those functions, together with a focus on data exchanges between processes.

4.2.1 Context level DFD:

A Context-level data flow diagram is created using the Structured Systems Analysis and Design Method (SSADM). This level shows the overall context of the system and its operating environment, and shows the whole system as just one process. It does not usually show data stores, unless they are “owned” by external systems, e.g. accessed by but not maintained by this system; however, these are often shown as external entities. The Context Level DFD is shown in Fig. 4.2.1.

Figure 4.2.1 Context Level DFD for Sentimental Analysis

The Context Level Data Flow Diagram shows the data flow from the application
to the database and to the system.

4.2.2 Top level DFD:


A data flow diagram is that which can be used to indicate the clear progress of a
business venture. In the process of coming up with a data flow diagram, the level one
provides an overview of the major functional areas of the undertaking. After presenting
the values for most important fields of discussion, it gives room for level two to be
drawn.

Figure 4.2.2 Top Level DFD for Sentimental Analysis

After starting and executing the application, training and testing of the dataset can be done as shown in the above figure.

4.2.3 Detailed Level Diagram

This level explains each process of the system in a detailed manner. The first detailed-level DFD (generation of individual fields) shows how data flows through the individual processes/fields.

The second detailed-level DFD (generation of the detailed process of the individual fields) shows how data flows through the system to form a detailed description of the individual processes.
Figure 4.2.3.1 Detailed level DFD for Sentimental Analysis

After starting and executing the application, the dataset is trained by converting it into a 2D array and scaling it using normalization, and then testing is done.

Figure 4.2.3.2 Detailed level dfd for Sentimental Analysis

After starting and executing the application, the dataset is trained using linear regression and then testing is done.

4.3 UNIFIED MODELLING LANGUAGE DIAGRAMS:

The Unified Modelling Language (UML) is a standard language for specifying, visualizing, constructing, and documenting a software system and its components. The UML focuses on the conceptual and physical representation of the system. It captures the decisions and understandings about systems that must be constructed. A UML system is represented using five different views that describe the system from distinctly different perspectives. Each view is defined by a set of diagrams, which are as follows.
• User Model View

i. This view represents the system from the user's perspective.

ii. The analysis representation describes a usage scenario from the end-user's perspective.

• Structural Model View

i. In this model, the data and functionality are viewed from inside the system.

ii. This model view models the static structures.

• Behavioral Model View

It represents the dynamic (behavioral) parts of the system, depicting the interactions among the various structural elements described in the user model and structural model views.

• Implementation Model View

In this view, the structural and behavioral parts of the system are represented as they are to be built.

• Environmental Model View

In this view, the structural and behavioral aspects of the environment in which the system is to be implemented are represented.

4.3.2 Use Case Diagram:

Use case diagrams are one of the five diagrams in the UML for modeling the dynamic aspects of systems (activity diagrams, sequence diagrams, statechart diagrams, and collaboration diagrams are the four other kinds of diagrams in the UML for modeling the dynamic aspects of systems). Use case diagrams are central to modeling the behavior of a system, a sub-system, or a class. Each one shows a set of use cases, actors, and their relationships.
Figure 4.3.1 USECASE DIAGRAM
4.3.3 Sequence Diagram:
A sequence diagram is an interaction diagram that focuses on the time ordering of messages. It shows a set of objects and the messages exchanged between these objects. This diagram illustrates the dynamic view of a system.

Figure 4.3.2 Sequence Diagram


4.3.4 Collaboration Diagram:
A collaboration diagram is an interaction diagram that emphasizes the structural organization of the objects that send and receive messages. Collaboration diagrams and sequence diagrams are isomorphic.

Figure 4.3.3 Collaboration Diagram


4.3.5 Activity Diagram:
An activity diagram shows the flow from activity to activity within a system; it emphasizes the flow of control among objects.

Figure 4.3.4 Activity Diagram


4.3.6 DATA DICTIONARY

Fig 4.3.5 Data Dictionary


CHAPTER 5

IMPLEMENTATION

Implementation is the stage of the project when the theoretical design is turned into a working system. Thus it can be considered the most critical stage in achieving a successful new system and in giving the user confidence that the new system will work and be effective.

The implementation stage involves careful planning, investigation of the existing system and its constraints on implementation, design of methods to achieve the changeover, and evaluation of the changeover methods.

The project is implemented by accessing it simultaneously from more than one system and from more than one window on one system. The application is implemented on the Internet Information Services 5.0 web server under Windows XP and accessed from various clients.
5.1 TECHNOLOGIES USED

What is Python?

Python is an interpreted, high-level programming language for general-purpose programming, created by Guido van Rossum and first released in 1991. Python has a design philosophy that emphasizes code readability and a syntax that allows programmers to express concepts in fewer lines of code, notably using significant whitespace. It provides constructs that enable clear programming on both small and large scales. Python features a dynamic type system and automatic memory management. It supports multiple programming paradigms, including object-oriented, imperative, functional, and procedural, and has a large and comprehensive standard library.

Python interpreters are available for many operating systems. CPython, the reference implementation of Python, is open-source software and has a community-based development model, as do nearly all of its variant implementations. Python is managed by the non-profit Python Software Foundation.

Python is a general-purpose, dynamic, high-level, and interpreted programming language. It supports an object-oriented programming approach to developing applications. It is simple and easy to learn and provides lots of high-level data structures.

• Windows XP
• Python Programming
• Open source libraries: Pandas, NumPy, SciPy, Matplotlib, OpenCV

Python Versions
Python 2.0 was released on 16 October 2000 and had many major new features, including a cycle-detecting garbage collector and support for Unicode. With this release, the development process became more transparent and community-backed.

Python 3.0 (initially called Python 3000 or py3k) was released on 3 December 2008 after a long testing period. It is a major revision of the language that is not completely backward-compatible with previous versions. However, many of its major features have been back-ported to the Python 2.6.x and 2.7.x version series, and releases of Python 3 include the 2to3 utility, which automates the translation of Python 2 code to Python 3.

Python 2.7's end-of-life date (a.k.a. EOL, sunset date) was initially set for 2015, then postponed to 2020 out of concern that a large body of existing code could not easily be forward-ported to Python 3. In January 2017, Google announced work on a Python 2.7 to Go transcompiler to improve performance under concurrent workloads.

Python 3.6 had changes regarding UTF-8 (on Windows, PEP 528 and PEP 529), and Python 3.7.0b1 (PEP 540) adds a new "UTF-8 Mode" (and overrides the POSIX locale).

Why Python?
• Python is a scripting language like PHP, Perl, and Ruby.
• No licensing, distribution, or development fees.
• It can be used to build desktop applications.
• It runs on Linux and Windows.
• Excellent documentation.
• Thriving developer community.
• Good job opportunities.

Libraries of Python:

Python's large standard library, commonly cited as one of its greatest strengths, provides tools suited to many tasks. For Internet-facing applications, many standard formats and protocols such as MIME and HTTP are supported. It includes modules for creating graphical user interfaces, connecting to relational databases, generating pseudorandom numbers, performing arithmetic with arbitrary-precision decimals, manipulating regular expressions, and unit testing.

Some parts of the standard library are covered by specifications (for example, the Web Server Gateway Interface (WSGI) implementation wsgiref follows PEP 333), but most modules are not.

They are specified by their code, internal documentation, and test suites (if supplied).
However, because most of the standard library is cross-platform Python code, only a few
modules need altering or rewriting for variant implementations.

As of March 2018, the Python Package Index (PyPI), the official repository for third-party Python software, contains over 130,000 packages with a wide range of functionality, including:

• Graphical user interfaces

• Web frameworks

• Multimedia

• Databases

• Networking

• Test frameworks

• Automation

• Web scraping

• Documentation
• System administration

5.2 MACHINE LEARNING

Machine Learning is an application of artificial intelligence (AI) that provides systems the ability to automatically learn and improve from experience without being explicitly programmed. Machine learning focuses on the development of computer programs that can access data and use it to learn for themselves.
Basics of python machine learning:

• You'll know how to use Python and its libraries to explore your data with the help of
matplotlib and Principal Component Analysis (PCA).
• And you'll preprocess your data with normalization and you'll split your data into training
and test sets.
• Next, you'll work with the well-known K-Means algorithm to construct an unsupervised
model, fit this model to your data, predict values, and validate the model that you have
built.
• As an extra, you'll also see how you can also use Support Vector Machines (SVM) to
construct another model to classify your data.

Why Machine Learning?

• It was born from pattern recognition and the theory that computers can learn without being programmed for specific tasks.
• It is a method of data analysis that automates analytical model building.

Machine learning tasks are typically classified into two broad categories, depending on whether there is a learning "signal" or "feedback" available to the learning system. They are:

Supervised learning: The computer is presented with example inputs and their desired outputs, given by a "teacher", and the goal is to learn a general rule that maps inputs to outputs. As special cases, the input signal can be only partially available, or restricted to special feedback:

Semi-supervised learning: The computer is given only an incomplete training signal: a training set with some (often many) of the target outputs missing.

Active learning: The computer can only obtain training labels for a limited set of instances (based on a budget), and it also has to optimize its choice of objects to acquire labels for. When used interactively, these can be presented to the user for labelling.

Reinforcement learning: Training data (in the form of rewards and punishments) is given only as feedback to the program's actions in a dynamic environment, such as driving a vehicle or playing a game against an opponent.

Unsupervised learning: No labels are given to the learning algorithm, leaving it on its own to
find structure in its input. Unsupervised learning can be a goal in itself (discovering hidden
patterns in data) or a means towards an end (feature learning).

In regression, also a supervised problem, the outputs are continuous rather than discrete.

Regression: The analysis or measure of the association between one variable (the dependent
variable) and one or more other variables (the independent variables), usually formulated in
an equation in which the independent variables have parametric coefficients, which may
enable future values of the dependent variable to be predicted.

Figure 4.1.1 Regression Structure

What is Regression Analysis?


Regression analysis is a form of predictive modelling technique which investigates the relationship between a dependent (target) variable and independent (predictor) variables. This technique is used for forecasting, time series modelling, and finding the causal-effect relationship between variables. For example, the relationship between rash driving and the number of road accidents can be modelled with regression.

Types of Regression:
1. Linear Regression
2. Logistic Regression
3. Polynomial Regression
4. Stepwise Regression
5. Ridge Regression
6. Lasso Regression
7. Elastic Net Regression

1. Linear Regression: It is one of the most widely known modelling techniques. Linear regression is usually among the first few topics which people pick while learning predictive modelling. In this technique, the dependent variable is continuous, the independent variable(s) can be continuous or discrete, and the nature of the regression line is linear.

Linear Regression establishes a relationship between the dependent variable (Y) and one or more independent variables (X) using a best-fit straight line (also known as the regression line).

2. Logistic Regression: Logistic regression is used to find the probability of event = Success and event = Failure. We should use logistic regression when the dependent variable is binary (0/1, True/False, Yes/No) in nature. Here the value of Y ranges from 0 to 1, and it can be represented by the following equations:

odds = p / (1 - p) = probability of event occurrence / probability of no event occurrence

ln(odds) = ln(p / (1 - p))

logit(p) = ln(p / (1 - p)) = b0 + b1X1 + b2X2 + b3X3 + ... + bkXk

3. Polynomial Regression: A regression equation is a polynomial regression equation if the power of the independent variable is more than 1. The equation below represents a polynomial equation:

y = a + b*x^2
4. Stepwise Regression: This form of regression is used when we deal with multiple independent variables. In this technique, the selection of independent variables is done with the help of an automatic process, which involves no human intervention.

This feat is achieved by observing statistical values like R-squared, t-stats, and the AIC metric to discern significant variables. Stepwise regression basically fits the regression model by adding/dropping covariates one at a time based on a specified criterion. Some of the most commonly used stepwise regression methods are listed below:

• Standard stepwise regression does two things. It adds and removes predictors as
needed for each step.
• Forward selection starts with most significant predictor in the model and adds variable
for each step.
• Backward elimination starts with all predictors in the model and removes the least
significant variable for each step.
The aim of this modelling technique is to maximize the prediction power with
minimum number of predictor variables. It is one of the methods to handle higher
dimensionality of data set.

5. Ridge Regression: Ridge regression is a technique used when the data suffers from multicollinearity (independent variables are highly correlated). In multicollinearity, even though the least squares estimates (OLS) are unbiased, their variances are large, which deviates the observed value far from the true value. By adding a degree of bias to the regression estimates, ridge regression reduces the standard errors.

Above, we saw the equation for linear regression. Remember? It can be represented as:

y = a + b*x

This equation also has an error term. The complete equation becomes:

y = a + b*x + e (error term), [the error term is the value needed to correct for a prediction error between the observed and predicted value]

=> y = a + b1x1 + b2x2 + ... + e, for multiple independent variables.


In a linear equation, prediction errors can be decomposed into two sub-components. The first is due to bias and the second is due to variance. Prediction error can occur due to either of these two components or both. Here, we'll discuss the error caused due to variance.

Ridge regression solves the multicollinearity problem through the shrinkage parameter λ (lambda). Look at the equation below:

minimize: Σ(y − ŷ)² + λΣβ²

In this equation, we have two components. The first one is the least squares term and the other one is lambda times the summation of β² (beta squared), where β is the coefficient. This is added to the least squares term in order to shrink the parameter estimates towards a very low variance.

Important Points:

• The assumptions of this regression are the same as least squares regression, except normality is not to be assumed.
• It shrinks the value of the coefficients but does not reach zero, which means it has no feature selection capability.
• This is a regularization method and uses L2 regularization.

6. Lasso Regression: Similar to ridge regression, Lasso (Least Absolute Shrinkage and Selection Operator) also penalizes the absolute size of the regression coefficients. In addition, it is capable of reducing the variability and improving the accuracy of linear regression models. Look at the equation below:

minimize: Σ(y − ŷ)² + λΣ|β|

Lasso regression differs from ridge regression in that it uses absolute values in the penalty function instead of squares. This penalizes (or, equivalently, constrains the sum of the absolute values of) the estimates, which causes some of the parameter estimates to turn out exactly zero. The larger the penalty applied, the further the estimates get shrunk towards absolute zero. This results in variable selection out of the given n variables.

Important Points:

• The assumptions of this regression are the same as least squares regression, except normality is not to be assumed.
• It shrinks coefficients to zero (exactly zero), which certainly helps in feature selection.
• This is a regularization method and uses L1 regularization.
• If a group of predictors is highly correlated, lasso picks only one of them and shrinks the others to zero.
7. Elastic Net Regression: Elastic Net is a hybrid of the Lasso and Ridge regression techniques. It is trained with L1 and L2 priors as regularizers. Elastic Net is useful when there are multiple correlated features: Lasso is likely to pick one of these at random, while Elastic Net is likely to pick both.

A practical advantage of trading off between Lasso and Ridge is that it allows Elastic Net to inherit some of Ridge's stability under rotation.

Important Points:

• It encourages a group effect in the case of highly correlated variables.
• There are no limitations on the number of selected variables.
• It can suffer from double shrinkage.

Beyond these 7 most commonly used regression techniques, you can also look at other models like Bayesian, Ecological, and Robust regression.
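A quick, hedged illustration of how the penalized variants above can be compared side by side with scikit-learn; the synthetic data and alpha values are arbitrary.

import numpy as np
from sklearn.linear_model import ElasticNet, Lasso, LinearRegression, Ridge

# Synthetic data with two highly correlated features (values are arbitrary).
rng = np.random.default_rng(0)
x1 = rng.normal(size=100)
x2 = x1 + rng.normal(scale=0.01, size=100)    # nearly identical to x1
X = np.column_stack([x1, x2])
y = 3 * x1 + rng.normal(scale=0.1, size=100)

for model in [LinearRegression(),
              Ridge(alpha=1.0),                     # L2 penalty: shrinks coefficients
              Lasso(alpha=0.1),                     # L1 penalty: can zero coefficients out
              ElasticNet(alpha=0.1, l1_ratio=0.5)]: # mix of L1 and L2
    model.fit(X, y)
    print(type(model).__name__, np.round(model.coef_, 3))

With highly correlated features, Ridge tends to share the weight between the two columns, Lasso tends to keep one and zero the other, and Elastic Net sits in between, which matches the behaviour described above.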

Classification

A classification problem is when the output variable is a category, such as “red” or “blue”, or “disease” and “no disease”. A classification model attempts to draw some conclusion from observed values. Given one or more inputs, a classification model will try to predict the value of one or more outcomes; for example, when filtering emails, “spam” or “not spam”, and when looking at transaction data, “fraudulent” or “authorized”. In short, classification either predicts categorical class labels or classifies data (constructs a model) based on the training set and the values (class labels) of the classifying attributes, and uses it in classifying new data.

There are a number of classification models. Classification models include:

1. Logistic Regression
2. Decision Tree
3. Random Forest
4. Naive Bayes

1. Logistic Regression

Logistic regression is a supervised learning classification algorithm used to predict the probability of a target variable. The nature of the target or dependent variable is dichotomous, which means there are only two possible classes. In simple words, the dependent variable is binary in nature, with data coded as either 1 (which stands for success/yes) or 0 (which stands for failure/no).

Mathematically, a logistic regression model predicts P(Y=1) as a function of X. It is one of the simplest ML algorithms and can be used for various classification problems such as spam detection, diabetes prediction, cancer detection, etc.
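A minimal scikit-learn sketch of binary classification with logistic regression; the toy data is invented for illustration.

import numpy as np
from sklearn.linear_model import LogisticRegression

# Toy binary data: one feature, label 1 above a threshold and 0 below it.
X = np.array([[1.0], [2.0], [3.0], [4.0], [5.0], [6.0]])
y = np.array([0, 0, 0, 1, 1, 1])

clf = LogisticRegression()
clf.fit(X, y)

# P(Y=1) as a function of X, plus the hard class prediction.
print(clf.predict_proba([[3.5]]))   # probabilities for classes 0 and 1
print(clf.predict([[3.5]]))         # predicted class label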

2. Decision Tree

Decision Trees are a type of supervised machine learning (that is, you explain what the input is and what the corresponding output is in the training data) where the data is continuously split according to a certain parameter. The tree can be explained by two entities, namely decision nodes and leaves.

3. Random Forest

Random Forest is a popular machine learning algorithm that belongs to the supervised learning technique. It can be used for both classification and regression problems in ML. It is based on the concept of ensemble learning, which is a process of combining multiple classifiers to solve a complex problem and to improve the performance of the model.

As the name suggests, “Random Forest is a classifier that contains a number of decision trees on various subsets of the given dataset and takes the average to improve the predictive accuracy of that dataset.” Instead of relying on one decision tree, the random forest takes the prediction from each tree and, based on the majority votes of the predictions, predicts the final output.

A greater number of trees in the forest leads to higher accuracy and prevents the problem of overfitting.

The below diagram explains the working of the Random Forest algorithm:
Fig 5.2.1 Image for Random Forest Algorithm
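A brief scikit-learn sketch of the random forest idea described above, using a toy dataset invented for illustration.

import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

# Toy data: two features, binary label (values are arbitrary).
rng = np.random.default_rng(42)
X = rng.normal(size=(200, 2))
y = (X[:, 0] + X[:, 1] > 0).astype(int)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=42)

# An ensemble of decision trees; the final class is decided by majority vote.
forest = RandomForestClassifier(n_estimators=100, random_state=42)
forest.fit(X_train, y_train)

print("test accuracy:", forest.score(X_test, y_test))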

4. Naive Bayes

Naive Bayes is a classification technique based on Bayes’ Theorem with an assumption of independence among predictors. In simple terms, a Naive Bayes classifier assumes that the presence of a particular feature in a class is unrelated to the presence of any other feature. A Naive Bayes model is easy to build and particularly useful for very large data sets. Along with its simplicity, Naive Bayes is known to outperform even highly sophisticated classification methods.

Bayes’ theorem provides a way of calculating the posterior probability P(c|x) from P(c), P(x), and P(x|c). Look at the equation below:

P(c|x) = P(x|c) * P(c) / P(x)
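To connect this to the project's task, the following hedged sketch trains a Multinomial Naive Bayes sentiment classifier on a handful of made-up example sentences; the real training data and labels would come from the collected tweets.

from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

# Tiny, made-up training set purely for illustration.
texts = ["i love this phone", "what a great day", "this is awful",
         "i hate waiting", "the update is okay", "nothing special today"]
labels = ["positive", "positive", "negative",
          "negative", "neutral", "neutral"]

# Bag-of-words counts feed a Multinomial Naive Bayes classifier.
clf = make_pipeline(CountVectorizer(), MultinomialNB())
clf.fit(texts, labels)

print(clf.predict(["i really love this great update"]))   # likely 'positive'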
5.3 Deep learning

Deep Learning is a class of Machine Learning which performs much better on unstructured data. Deep learning techniques are outperforming current machine learning techniques. Deep learning enables computational models to learn features progressively from data at multiple levels. Its popularity has amplified as the amount of available data has increased, along with advances in hardware that provide powerful computers.

Deep learning has emerged as a powerful machine learning technique that learns
multiple layers of representations or features of the data and produces state-of-the-art
prediction results. Along with the success of deep learning in many other application domains,
deep learning is also popularly used in sentiment analysis in recent years.

Deep Learning Algorithms:

There are two types of algorithms

1.Structured Algorithm

2.Unstructured Algorithm

1. Structured Algorithm

One of the structured algorithms is the

Artificial Neural Network

Artificial Neural Networks are computational models inspired by the human brain.
Many recent advancements in the field of Artificial Intelligence, including voice recognition,
image recognition, and robotics, have been made using Artificial Neural Networks.
Artificial Neural Networks are biologically inspired simulations performed on the
computer to perform certain specific tasks such as:

 Clustering
 Classification
 Pattern Recognition
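
A minimal illustrative sketch of a simple Artificial Neural Network for binary sentiment classification, written here with Keras purely as an example (the project's actual network and library may differ); the input dimension and data are hypothetical:

import numpy as np
from tensorflow import keras
from tensorflow.keras import layers

# Hypothetical data: 100 reviews encoded as 50-dimensional feature vectors; label 1 = positive, 0 = negative.
X = np.random.rand(100, 50)
y = np.random.randint(0, 2, size=100)

model = keras.Sequential([
    keras.Input(shape=(50,)),
    layers.Dense(16, activation="relu"),    # single hidden layer of neurons
    layers.Dense(1, activation="sigmoid"),  # output neuron gives P(positive)
])
model.compile(optimizer="adam", loss="binary_crossentropy", metrics=["accuracy"])
model.fit(X, y, epochs=5, batch_size=16, verbose=0)
print(model.predict(X[:3]))  # predicted probabilities for three samples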

2. Unstructured Algorithm

One of the unstructured algorithms is

Deep Neural Network

A deep neural network (DNN) is an artificial neural network (ANN) with multiple
layers between the input and output layers. There are different types of neural networks, but
they always consist of the same components: neurons, synapses, weights, biases, and
functions. These components function similarly to those of the human brain and can be
trained like any other ML algorithm.

For example, a DNN that is trained to recognize dog breeds will go over the given
image and calculate the probability that the dog in the image is a certain breed. The user can
review the results and select which probabilities the network should display (above a certain
threshold, etc.) and return the proposed label. Each mathematical manipulation as such is
considered a layer, and complex DNNs have many layers, hence the name "deep" networks.
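
As an illustrative sketch only (the project's actual architecture may differ), a "deep" network simply stacks several hidden layers; here is a hypothetical Keras example with a softmax output whose probabilities can be thresholded before returning a label:

import numpy as np
from tensorflow import keras
from tensorflow.keras import layers

# Hypothetical data: 200 samples with 100 features, 3 classes (e.g. negative/neutral/positive).
X = np.random.rand(200, 100)
y = np.random.randint(0, 3, size=200)

dnn = keras.Sequential([
    keras.Input(shape=(100,)),
    layers.Dense(128, activation="relu"),   # multiple hidden layers make the network "deep"
    layers.Dense(64, activation="relu"),
    layers.Dense(32, activation="relu"),
    layers.Dense(3, activation="softmax"),  # probability for each class
])
dnn.compile(optimizer="adam", loss="sparse_categorical_crossentropy", metrics=["accuracy"])
dnn.fit(X, y, epochs=5, verbose=0)

probs = dnn.predict(X[:1])[0]
label = probs.argmax() if probs.max() > 0.5 else None  # only return a label above a chosen threshold
print(probs, label)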

Modules in Python

Module: - A module allows you to logically organize your Python code.
Grouping related code into a module makes the code easier to understand and use. A
module is a Python object with arbitrarily named attributes that you can bind and
reference.

Pandas: -

Pandas is a Python package providing fast, flexible, and expressive data
structures designed to make working with "relational" or "labelled" data both easy and
intuitive. It aims to be the fundamental high-level building block for doing practical,
real-world data analysis in Python. Additionally, it has the broader goal of becoming the
most powerful and flexible open-source data analysis / manipulation tool available in
any language. It is already well on its way toward this goal.

Pandas is well suited for many different kinds of data:

• Tabular data with heterogeneously-typed columns, as in an SQL table or Excel spreadsheet
• Ordered and unordered (not necessarily fixed-frequency) time series data
• Arbitrary matrix data (homogeneously or heterogeneously typed) with row and column labels
• Any other form of observational / statistical data sets; the data need not be labelled at all to be placed into a pandas data structure

The two primary data structures of pandas, Series (1-dimensional) and DataFrame
(2-dimensional), handle the vast majority of typical use cases in finance, statistics,
social science, and many areas of engineering. For R users, DataFrame provides
everything that R's data frame provides and much more. Pandas is built on top of
NumPy and is intended to integrate well within a scientific computing environment with
many other third-party libraries. A few of the things that pandas does well:

• Easy handling of missing data (represented as NaN) in floating-point as well as non-floating-point data
• Size mutability: columns can be inserted into and deleted from DataFrame and higher-dimensional objects
• Automatic and explicit data alignment: objects can be explicitly aligned to a set of labels, or the user can simply ignore the labels and let Series, DataFrame, etc. automatically align the data in computations
• Powerful, flexible groupby functionality to perform split-apply-combine operations on data sets, for both aggregating and transforming data
• Easy conversion of ragged, differently-indexed data in other Python and NumPy data structures into DataFrame objects
• Intelligent label-based slicing, fancy indexing, and subsetting of large data sets
• Intuitive merging and joining of data sets
• Flexible reshaping and pivoting of data sets
• Hierarchical labelling of axes (possible to have multiple labels per tick)
• Robust IO tools for loading data from flat files (CSV and delimited), Excel files, and databases, and saving / loading data from the ultrafast HDF5 format
• Time series-specific functionality: date range generation and frequency conversion, moving window statistics, moving window linear regressions, date shifting and lagging, etc.

Many of these principles are here to address the shortcomings frequently
experienced using other languages / scientific research environments. For data
scientists, working with data is typically divided into multiple stages: munging and
cleaning data, analysing / modelling it, then organizing the results of the analysis into a
form suitable for plotting or tabular display. Pandas is the ideal tool for all of these
tasks.

• Pandas is fast. Many of the low-level algorithmic bits have been extensively
tuned in Cython code. However, as with anything else, generalization usually
sacrifices performance, so if you focus on one feature for your application you
may be able to create a faster specialized tool.
• Pandas is a dependency of statsmodels, making it an important part of the
statistical computing ecosystem in Python.
• Pandas has been used extensively in production in financial applications.
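
To illustrate a few of these features (missing-data handling, groupby, and label-based slicing), here is a small hypothetical sketch; the column names and values are made up:

import numpy as np
import pandas as pd

# Hypothetical table of reviews with one missing rating (NaN).
df = pd.DataFrame({
    "review": ["good food", "bad service", "okay", "great"],
    "sentiment": ["positive", "negative", "neutral", "positive"],
    "rating": [4.5, np.nan, 3.0, 5.0],
})

df["rating"] = df["rating"].fillna(df["rating"].mean())  # easy handling of missing data
print(df.groupby("sentiment")["rating"].mean())          # split-apply-combine with groupby
print(df.loc[df["sentiment"] == "positive", "review"])   # intelligent label-based slicing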

NumPy: -

NumPy, which stands for Numerical Python, is a library consisting of
multidimensional array objects and a collection of routines for processing those arrays.
Using NumPy, mathematical and logical operations on arrays can be performed. This
section covers the basics of NumPy, such as its architecture and environment, the
various array functions, types of indexing, etc. An introduction to Matplotlib is also
provided, and all of this is explained with the help of examples for better understanding.

Numeric, the ancestor of NumPy, was developed by Jim Hugunin. Another
package, Numarray, was also developed, having some additional functionalities. In 2005,
Travis Oliphant created the NumPy package by incorporating the features of Numarray
into Numeric. There are many contributors to this open-source project.

Operations using NumPy: -

Using NumPy, a developer can perform the following operations −

• Mathematical and logical operations on arrays.

• Fourier transforms and routines for shape manipulation.

• Operations related to linear algebra. NumPy has in-built functions for linear
algebra and random number generation.
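
A small hypothetical sketch of these operations (array arithmetic, linear algebra, Fourier transforms, and random number generation):

import numpy as np

a = np.array([[1.0, 2.0], [3.0, 4.0]])
b = np.array([[5.0, 6.0], [7.0, 8.0]])

print(a + b)                 # element-wise mathematical operation
print(a > 2)                 # element-wise logical operation
print(np.dot(a, b))          # linear algebra: matrix multiplication
print(np.linalg.inv(a))      # linear algebra: matrix inverse
print(np.fft.fft(a[0]))      # Fourier transform of a 1-D array
print(np.random.rand(2, 2))  # random number generation
print(a.reshape(4))          # shape manipulation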

NumPy – A Replacement for MATLAB

NumPy is often used along with packages like SciPy (Scientific Python) and
Matplotlib (plotting library). This combination is widely used as a replacement for
MATLAB, a popular platform for technical computing. The Python alternative to
MATLAB is now seen as a more modern and complete programming language, and the
fact that it is open source is an added advantage of NumPy.

Scikit-learn: -

Scikit-learn (formerly scikits.learn) is a free software machine learning library
for the Python programming language. It features various classification, regression and
clustering algorithms, including support vector machines, random forests, gradient
boosting, k-means and DBSCAN, and is designed to interoperate with the Python
numerical and scientific libraries NumPy and SciPy.

The scikit-learn project started as scikits.learn, a Google Summer of Code
project by David Cournapeau. Its name stems from the notion that it is a "SciKit" (SciPy
Toolkit), a separately developed and distributed third-party extension to SciPy.

The original codebase was later rewritten by other developers. In 2010, Fabian
Pedregosa, Gaël Varoquaux, Alexandre Gramfort and Vincent Michel, all from INRIA, took
leadership of the project and made the first public release on February 1st, 2010. Of
the various scikits, scikit-learn as well as scikit-image were described as "well-
maintained and popular" in November 2012. Scikit-learn is largely written in Python,
with some core algorithms written in Cython to achieve performance. Support vector
machines are implemented by a Cython wrapper around LIBSVM; logistic regression
and linear support vector machines by a similar wrapper around LIBLINEAR.

Some popular groups of models provided by scikit-learn include:

• Ensemble methods: for combining the predictions of multiple supervised models
• Feature extraction: for defining attributes in image and text data
• Feature selection: for identifying meaningful attributes from which to create supervised models
• Parameter tuning: for getting the most out of supervised models
• Manifold learning: for summarizing and depicting complex multi-dimensional data
• Supervised models: a vast array, not limited to generalized linear models, discriminant analysis, naive Bayes, lazy methods, neural networks, support vector machines and decision trees
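
As a hypothetical sketch tying these pieces together for sentiment analysis (feature extraction plus a supervised model; not the project's actual code):

from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

# Hypothetical labelled reviews.
reviews = ["food is great", "food is bad", "service was slow", "really loved it"]
labels = ["positive", "negative", "negative", "positive"]

model = make_pipeline(CountVectorizer(), MultinomialNB())  # feature extraction + supervised model
model.fit(reviews, labels)
print(model.predict(["the food was bad"]))                 # predicted label for a new review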
Matplotlib: -
Matplotlib is a plotting library for the Python programming language and its
numerical mathematics extension NumPy. It provides an object-oriented API for
embedding plots into applications using general-purpose GUI toolkits like Tkinter,
wxPython, Qt, or GTK+. There is also a procedural "pylab" interface based on a state
machine (like OpenGL), designed to closely resemble that of MATLAB, though its use
is discouraged. SciPy makes use of matplotlib.
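
A hypothetical sketch of a simple plot using the object-oriented API, e.g. the sentiment distribution of a set of reviews (the counts are made up):

import matplotlib.pyplot as plt

# Hypothetical sentiment counts.
labels = ["positive", "negative", "neutral"]
counts = [120, 45, 35]

fig, ax = plt.subplots()  # object-oriented API
ax.bar(labels, counts)
ax.set_xlabel("Sentiment")
ax.set_ylabel("Number of reviews")
ax.set_title("Sentiment distribution")
plt.show()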
CHAPTER 6
TESTING
Testing is the process of checking functionality by executing a program with the
intent of finding an error. A good test case is one that has a high probability of finding
an as-yet-undiscovered error; a successful test is one that uncovers such an error.
Software testing is usually performed for one of two reasons:

• Defect Detection

• Reliability estimation

6.1 BLACK BOX TESTING:

The base of the black box testing strategy lies in the selection of appropriate data
as per functionality and testing it against the functional specifications in order to check
for normal and abnormal behaviour of the system. Nowadays, it is becoming common to route
the testing work to a third party, since the developer of the system knows too much of the
internal logic and coding of the system, which makes the developer unfit to test the
application. The following are the different techniques involved in black box testing:

• Decision Table Testing

• All pairs testing

• State transition tables testing

• Equivalence Partitioning

Software testing is used in association with verification and validation. Verification
is the checking or testing of items, including software, for conformance and
consistency with an associated specification. Software testing is just one kind of
verification, which also uses techniques such as reviews, inspections, and walk-throughs.
Validation is the process of checking that what has been specified is what the user actually
wanted.

• Validation: Are we doing the right job?

• Verification: Are we doing the job right?

In order to achieve consistency in the testing style, it is imperative to have and
follow a set of testing principles. This enhances the efficiency of testing within the SQA
team members and thus contributes to increased productivity. The purpose of this
document is to provide an overview of the testing, plus the techniques. Here, after training
is done on the training dataset, testing is done.

6.2 WHITE BOX TESTING:

White box testing [10] requires access to the source code. Though white box testing [10] can
be performed at any time in the life cycle after the code is developed, it is a good practice to
perform white box testing [10] during the unit testing phase.

In white box testing, the flow of specific inputs through the code, the expected output and
the functionality of conditional loops are tested.

At SDEI, three levels of software testing are done at various SDLC phases:

• UNIT TESTING: each unit (basic component) of the software is tested to verify
that the detailed design for the unit has been correctly implemented.
• INTEGRATION TESTING: progressively larger groups of tested software
components corresponding to elements of the architectural design are integrated
and tested until the software works as a whole.
• SYSTEM TESTING: the software is integrated into the overall product and
tested to show that all requirements are met. Further levels of testing are also
done, in accordance with requirements:
• REGRESSION TESTING: refers to the repetition of earlier successful tests to
ensure that changes made to the software have not introduced new bugs/side
effects.
• ACCEPTANCE TESTING: testing to verify that a product meets customer-specified
requirements. The acceptance test suite is run against the supplied input data, and
the results obtained are compared with the expected results of the client. A correct
match was obtained.
CHAPTER 7
OUTPUT SCREENS
We use Machine Learning models to evaluate the data. At the back end, a Deep Learning
algorithm, ANN (Artificial Neural Network), is used to evaluate the model.
The following algorithms are used:
1. Naive Bayes
It is a classification technique based on Bayes’ Theorem with an assumption of
independence among predictors. In simple terms, a Naive Bayes classifier assumes that the
presence of a particular feature in a class is unrelated to the presence of any other feature.
A Naive Bayes model is easy to build and particularly useful for very large data sets. Along
with its simplicity, Naive Bayes can outperform even highly sophisticated classification
methods on some problems.
2. Random Forest
Random Forest is a popular machine learning algorithm that belongs to the supervised
learning technique. It can be used for both Classification and Regression in ML.

It is based on the concept of ensemble learning, which is a process of combining
multiple classifiers to solve a complex problem and to improve the performance of the model.

As the name suggests, "Random Forest is a classifier that contains a number of
decision trees on various subsets of the given dataset and takes the average to improve the
predictive accuracy of that dataset." Instead of relying on one decision tree, the random forest
takes the prediction from each tree and, based on the majority vote of predictions, predicts
the final output.

At the back end, the ANN algorithm is used.

1.ANN(Artificial Neural Network)

Artificial Neural Networks are computational models inspired by the human brain.
Many recent advancements in the field of Artificial Intelligence, including voice recognition,
image recognition, and robotics, have been made using Artificial Neural Networks.
Artificial Neural Networks are biologically inspired simulations performed on the
computer to perform certain specific tasks such as:

 Clustering
 Classification
 Pattern Recognition
[Output screens: the predicted sentiment is shown for sample reviews, e.g. for the review "Food is Bad".]
CHAPTER 8

CONCLUSION

Sentimental analysis of data collected from social media such as Twitter, Facebook and
Instagram is beneficial for mankind in providing better health care. The workflow of
analysing healthcare content in social media helps to overcome the limitations of large-scale
data analysis and manual analysis of user-generated textual content in social media.

This work can help users stay updated on the effectiveness of the
medicines and can even suggest a few better medications available. This project
can provide feedback to healthcare organizations and pharmaceutical companies about
the available treatments and medicines. With the help of this project, pharmaceutical
companies and healthcare providers can act on the feedback and try to come up with
improved medicines and treatments for diabetes. Users are provided with the resources of
social media for the corresponding field of healthcare.

Opinion Mining and Sentiment Analysis have a wide area of applications and also
face many research challenges. With the fast growth of the internet and internet-related
applications, Opinion Mining and Sentiment Analysis have become a most interesting research
area in the natural language processing community. More innovative and effective
techniques need to be invented to overcome the current challenges faced by
Opinion Mining and Sentiment Analysis.
CHAPTER 9

FUTURE SCOPE AND ENHANCEMENT

In future, one can collect large healthcare-related data from multiple social networking
sites, which may provide better results by overcoming the limitations of this project.

In future, one can even collect data that includes videos and images for analysis, and
more Deep Learning and Neural Network techniques can be used to implement the project.
CHAPTER 10

BIBLIOGRAPHY

[1] Statistics – YouTube. http://www.youtube.com/yt/press/statistics.html, 2013.

[2] J. Friedman, T. Hastie, and R. Tibshirani. The Elements of Statistical Learning: Data Mining, Inference, and Prediction, Second Edition. Springer Series in Statistics, 2009.

[3] G. Szabo and B. Huberman. Predicting the popularity of online content. Communications of the ACM, 53(8), 2010.

[4] O. Hegazy and M. Abdul Salam. A Machine Learning Model for Stock Market Prediction.

[5] Grady Booch, James Rumbaugh, and Ivar Jacobson. The Unified Modeling Language User Guide, Low Price Edition. ISBN: 81-7808-769-5, 1997.

[6] Scott Ambler. The Elements of UML 2.0. Cambridge University Press, New York, 2005. ISBN: 978-0-07-52616-7-82.

[7] Cem Kaner, James Bach, and Bret Pettichord. Software Testing. ISBN: 978-0-471-120-940, 2001.

[8] Roger S. Pressman. Software Engineering: A Practitioner's Approach, 3rd Edition. ISBN: 978-007-126782-3, 2014.

[9] Boris Beizer. Black-Box Testing: Techniques for Functional Testing of Software and Systems. Wiley Publications, ISBN: 978-0-471-120-940, 1995.
