A System For Health Document Classification Using Machine Learning

ABSTRACT

Due to the massive daily increase in medical documents (including books, journals, blogs, articles, doctors' instructions and prescriptions, emails from patients, etc.), it is becoming very challenging to handle and categorize them manually. Extracting information from unstructured text, including medical document classification, is one of the most challenging tasks in information systems. The discovery of knowledge from medical datasets is important in order to make effective medical diagnoses. The primary aim of this research is to develop a classification algorithm that classifies a medical document by analyzing its content and categorizing it under predefined topics. In this project work we succeeded in applying Natural Language Processing, a branch of Machine Learning, to classifying health-related documents. We made use of the OpenNLP Application Programming Interface, a Java API, for training a model and classifying the documents, and of Materialize, an HTML5, CSS and JavaScript framework, for building the user interface. The software is built using the Model-View-Controller (MVC) architecture. The algorithm classified the test articles correctly under their actual subject headings. This holds promising solutions for the global health arena to index and classify medical documents expeditiously.
CHAPTER ONE

1.0 INTRODUCTION

This chapter introduces the topic of the project work, A System for Health Document Classification Using Machine Learning. In this chapter, we consider the background of the study, the statement of the problem, the aim and objectives, the methodology used to design the system, the scope of the study, its significance and the definition of terms, and we conclude with the layout or organization of the project work.

1.1 BACKGROUND OF THE STUDY

Today, most hospitals, medical laboratories and other health facilities make use of some kind of information system, such as a hospital management system or a pharmacy management system. Among other functions, these systems are mainly used for collecting patient records, which they store in digital format. Numerous patient data are recorded daily, forming the large data sets popularly referred to as "Big Data".

Every day, physicians and other health workers are required to work with this "Big Data" in order to provide solutions. Their everyday tasks include information retrieval and data mining. Retrieving information from big data can be very laborious and time consuming. This has given rise to the study of text or document classification as a way to aid the process of retrieving information from big data. Today, text classification is a necessity due to the very large number of text documents that we have to deal with daily.

Document classification is the task of grouping documents into categories based upon their content. It is a significant learning problem that is at the core of many information management and retrieval tasks, and it plays an essential role in applications that deal with organizing, classifying, searching and concisely representing significant amounts of information. Document classification is a longstanding problem in information retrieval which has been well studied (Russell, 2018).
Usually, machine learning, statistical pattern recognition, or neural

network approaches are used to construct classifiers automatically.

Machine learning approaches to classification suggest the automatic

construction of classifiers using induction over pre-classified sample

documents. In this project work we will employ machine learning in

classifying health documents.

1.2 STATEMENT OF THE PROBLEM

With the explosion of information fuelled by the growth of the World Wide Web, it is no longer feasible for a human observer to understand all the incoming data or even classify it into categories. In the health sector as well, numerous patient records are collected every day and used for analysis. How do we efficiently classify or categorize these health documents to facilitate easy retrieval?

1.3 AIM AND OBJECTIVES OF THE STUDY

The aim of this project is to develop a System for Health Document Classification Using Machine Learning.

Other objectives include:

1. To study the various machine learning classification algorithms.

2. To implement a classification algorithm in Java.

1.4 SCOPE OF THE STUDY

As stated earlier, machine learning, statistical pattern recognition and neural network approaches are all used in classifying documents; this project work will concentrate on using a machine learning algorithm to classify documents.

1.5 SIGNIFICANCE OF THE STUDY

The software delivered from this project work will greatly reduce the time used by doctors, physicians and other health workers in searching for and retrieving documents.

Other benefits of this project work include:

1. It helps students and other interested individuals who want to develop a similar application.

2. It will serve as a source of material for those interested in investigating the processes involved in developing a document classification system using machine learning.

3. It will serve as a source of material for students who are interested in studying machine learning.

1.6 DEFINITION OF TERMS

Document Classification: the task of grouping documents into categories based upon their content.

Health Document: a document containing health-related information, for example a health certificate written by a doctor displaying the official results of a physical examination.

Machine Learning: the study and construction of algorithms that can learn from and make predictions on data.

JSP: Java Server Pages, a Java technology for creating dynamic web pages.

HTML: Hyper Text Markup Language, for creating web pages.

MySQL: a database management system for creating, storing and manipulating databases.

SERVLET: a small pluggable extension to a server that enhances the server's functionality.

BOOTSTRAP: a sleek, intuitive and powerful mobile-first front-end framework for faster and easier web development. It uses HTML, CSS and JavaScript.

1.7 ORGANIZATION OF WORK

Chapter one introduces the background of the project; the statement of the problem, the objectives of the project, its significance, scope and constraints are pointed out.

Chapter two reviews literature on machine learning and document classification, together with related work.

Chapter three discusses system investigation and analysis. It deals with the detailed investigation and analysis of the existing system and problem identification, and it also proposes the new system.

Chapter four covers the system design and implementation.

Chapter five presents the summary and conclusion of the project.


CHAPTER TWO

LITERATURE REVIEW
2.0 DOCUMENT CLASSIFICATION
Classification can be divided into two principal phases. The first phase is document representation, and the second phase is classification proper. The standard document representation used in text classification is the vector space model, and classification systems differ mainly in their document representation models: the more relevant the representation is, the more relevant the classification will be. The second phase includes learning from a training corpus, building a model of the classes and classifying new documents according to the model.

2.1 TEXT CATEGORIZATION

Text categorization, the activity of labeling natural language texts with thematic categories from a set arranged in advance, has acquired an important status in the information systems field, due to the increased availability of documents in digital form and the consequent need to access them in flexible ways. Currently text categorization is applied in many contexts, ranging from document indexing based on a controlled vocabulary, to document filtering, automated metadata creation, word sense disambiguation, population of hierarchical catalogues, and in general any application requiring document organization or selective and adaptive document dispatching. These days text categorization is a discipline at the crossroads of ML and IR, and it shares a number of characteristics with other tasks such as information/knowledge extraction from texts and text mining (Pazienza, 1997). "Text mining" is mostly used to denote all the tasks that, by analyzing large quantities of text and identifying usage patterns, try to extract probably helpful (although only probably correct) information. From this point of view, text categorization is an instance of text mining which includes:

1. the automatic assignment of documents to a predetermined set of categories,

2. the automatic reorganization of such a set of categories, or

3. the automatic identification of such a set of categories.

Text classification is a crucial part of the information management process. As net resources constantly grow, increasing the effectiveness of text classifiers is necessary. Document retrieval, categorization, routing and the aforementioned information filtering are often based on text categorization (Hull, 1996).

2.2 TAXONOMY OF TEXT CLASSIFICATION PROCESS

The task of building a classifier for documents does not differ from other tasks of Machine Learning. The main point is the representation of a document (Leopold, 2002).

One peculiarity of the text categorization problem is that the number of features (unique words or phrases) easily reaches orders of tens of thousands. This creates big hindrances in applying many sophisticated learning algorithms to text categorization, so dimension reduction methods are used: either choosing a subset of the original features (Brank, 2002), or transforming the features into new ones, that is, deriving new features.


2.2.1 TOKENIZATION
The process of breaking a stream of text up into tokens, that is words, phrases, symbols or other meaningful elements, is called tokenization; the list of tokens becomes the input for the next stage of text classification. Generally, tokenization occurs at the word level. Nevertheless, it is not easy to define what is meant by a "word". A tokenizer often relies on simple heuristics, for instance:

All contiguous strings of alphabetic characters are part of one token, and similarly with numbers. Tokens are separated by whitespace characters, like a space or line break, or by punctuation characters. Punctuation and whitespace may or may not be included in the resulting list of tokens. In languages like English (and most programming languages), where words are separated by whitespace, this approach is straightforward. Still, tokenization is difficult for languages with no word boundaries, like Chinese. Simple whitespace-delimited tokenization also presents difficulties with word collocations like "New York", which ought to be considered as a single token. Some ways to address this problem are to develop more complex heuristics, to query a table of common collocations, or to fit the tokens to a language model that identifies collocations in a later processing step.
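As a minimal sketch of this step, the following code tokenizes a sentence with the whitespace tokenizer from the OpenNLP API used in this project. It illustrates the simple heuristic above and, as noted, would split a collocation like "New York" into two tokens.

import opennlp.tools.tokenize.Tokenizer;
import opennlp.tools.tokenize.WhitespaceTokenizer;

public class TokenizeDemo {
    public static void main(String[] args) {
        // Whitespace-based tokenization: tokens are split on spaces and line breaks only.
        Tokenizer tokenizer = WhitespaceTokenizer.INSTANCE;
        String[] tokens = tokenizer.tokenize("Malaria is typically spread by mosquitoes");
        for (String token : tokens) {
            System.out.println(token);
        }
    }
}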

2.2.2 STEMMING
In linguistic morphology and information retrieval, stemming is the process of reducing inflected (or sometimes derived) words to their stem or base form. The stem need not be identical to the morphological root of the word; it is usually sufficient that related words map to the same stem, even if that stem is not a valid root. In computer science, algorithms for stemming have been studied since 1968. Many search engines treat words with the same stem as synonyms, a kind of query expansion in a process called conflation.
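As a brief sketch, and assuming an OpenNLP version that ships the opennlp.tools.stemmer.PorterStemmer class, related word forms can be conflated to a common stem as follows; note that the printed stems need not be valid English words.

import opennlp.tools.stemmer.PorterStemmer;

public class StemDemo {
    public static void main(String[] args) {
        PorterStemmer stemmer = new PorterStemmer();
        // Related inflected forms typically map to the same stem.
        System.out.println(stemmer.stem("infection"));
        System.out.println(stemmer.stem("infected"));
        System.out.println(stemmer.stem("infections"));
    }
}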

2.2.3 STOP WORD REMOVAL

Typically in computing, stop words are filtered out prior to the processing of natural language data (text). There is no single prepared list of stop words that is used by every tool; indeed, some tools deliberately avoid removing stop words in order to support phrase search.

Any group of words can be selected as the stop words for a particular purpose. For some search engines, these are lists of common, short function words, like "the", "is", "at", "which" and "on", that create problems when performing text mining on phrases that contain them. It can therefore also be necessary to eliminate certain lexical words, like "want", from phrases in order to raise performance.
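A minimal sketch of stop word removal is shown below. The tiny stop list here is purely illustrative; real systems use much larger, task-specific lists.

import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

public class StopWordDemo {
    // Illustrative stop list only; production lists are far larger.
    private static final Set<String> STOP_WORDS =
            new HashSet<>(Arrays.asList("the", "is", "at", "which", "on"));

    public static void main(String[] args) {
        String[] tokens = {"Hypertension", "is", "rarely", "accompanied", "by", "symptoms"};
        List<String> filtered = new ArrayList<>();
        for (String token : tokens) {
            if (!STOP_WORDS.contains(token.toLowerCase())) {
                filtered.add(token);
            }
        }
        System.out.println(filtered); // [Hypertension, rarely, accompanied, by, symptoms]
    }
}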

2.2.4 VECTOR REPRESENTATION OF THE DOCUMENTS

Vector representation of documents is an algebraic model for representing text documents (and objects in general) as vectors of identifiers, such as index terms. It is used in information filtering, information retrieval, indexing and relevancy ranking; its first use was in the SMART Information Retrieval System.

A document is a sequence of words (Leopold, 2002). Thus every document is usually represented by an array of words. The set of all the words of a training set is called the vocabulary, or feature set. A document can then be represented by a binary vector, assigning the value 1 if the document contains a feature word or 0 if the word does not appear in the document.
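The sketch below builds such a binary vector over a toy vocabulary; the vocabulary and document here are illustrative only.

import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

public class BinaryVectorDemo {
    public static void main(String[] args) {
        // Vocabulary (feature set) gathered from a training corpus.
        List<String> vocabulary = Arrays.asList("malaria", "mosquito", "fever", "pressure");

        // One document, already tokenized and lower-cased.
        Set<String> docTokens =
                new HashSet<>(Arrays.asList("malaria", "is", "spread", "by", "mosquito"));

        // Binary vector: 1 if the document contains the feature word, else 0.
        int[] vector = new int[vocabulary.size()];
        for (int i = 0; i < vocabulary.size(); i++) {
            vector[i] = docTokens.contains(vocabulary.get(i)) ? 1 : 0;
        }
        System.out.println(Arrays.toString(vector)); // [1, 1, 0, 0]
    }
}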

2.2.5 FEATURE SELECTION AND TRANSFORMATION


The main objective of feature-selection methods is to reduce the dimensionality of the dataset by eliminating features that are not relevant for the classification (Forman, 2003). This presents a number of benefits, including a smaller dataset size, smaller computational requirements for the text categorization algorithms (especially those that do not scale well with the feature set size) and a considerable shrinking of the search space. The goal is to reduce the curse of dimensionality and thereby yield improved classification accuracy. A further advantage of feature selection is its ability to reduce overfitting, i.e. the phenomenon by which a classifier is tuned to the contingent characteristics of the training data rather than the constitutive characteristics of the categories, and therefore to improve generalization.
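As one simple illustration (document frequency thresholding, a common baseline rather than the only method), the sketch below keeps only terms that occur in at least two training documents; the toy corpus is illustrative.

import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashMap;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

public class DocFrequencySelection {
    public static void main(String[] args) {
        // Tokenized training documents (toy corpus).
        List<Set<String>> docs = Arrays.asList(
                new HashSet<>(Arrays.asList("malaria", "mosquito", "fever")),
                new HashSet<>(Arrays.asList("malaria", "parasite")),
                new HashSet<>(Arrays.asList("hypertension", "pressure", "fever")));

        // Count the document frequency of each term.
        Map<String, Integer> df = new HashMap<>();
        for (Set<String> doc : docs) {
            for (String term : doc) {
                df.merge(term, 1, Integer::sum);
            }
        }

        // Keep only terms appearing in at least 2 documents.
        List<String> selected = new ArrayList<>();
        for (Map.Entry<String, Integer> entry : df.entrySet()) {
            if (entry.getValue() >= 2) {
                selected.add(entry.getKey());
            }
        }
        System.out.println(selected); // e.g. [malaria, fever] (order may vary)
    }
}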

Feature transformation differs considerably from feature selection approaches, but like them its aim is to reduce the feature set size. Rather than weighting terms and discarding the lower-weighted ones, this approach compacts the vocabulary based on feature co-occurrences.

Figure 2.x Document classification process


2.3 ASSORTMENT OF MACHINE LEARNING ALGORITHMS

FOR TEXT CLASSIFICATION

After feature selection and transformation, the documents can readily be represented in a form that can be used by a ML algorithm. Most of the text classifiers proposed in the literature use machine learning techniques, probabilistic models, and so on. They typically differ in the approach taken: decision trees, naïve Bayes, rule induction, neural networks, nearest neighbors and, lately, support vector machines. Although many approaches have been proposed, automated text classification is still a major area of research, primarily because the effectiveness of present automated text classifiers is not faultless and still requires improvement.

Naive Bayes is regularly used in text classification applications and experiments because of its simplicity and effectiveness (Kim, 2002). Nevertheless, its performance is often degraded because it does not model text well. Schneider addressed these problems and showed that they can be solved by some simple corrections. Klopotek and Woch presented results of an empirical evaluation of a Bayesian multinet classifier based on a novel method of learning very large tree-like Bayesian networks (Klopotek, 2003). The study suggests that tree-like Bayesian networks are able to handle a text classification task with a hundred thousand variables with sufficient speed and accuracy.


Support vector machines (SVM), when applied to text classification, provide excellent precision but poor recall. One means of customizing SVMs to improve recall is to adjust the threshold associated with an SVM. Shanahan and Roma described an automatic process for adjusting the thresholds of generic SVMs (Shanahan, 2003) for improved results. Johnson et al. described a fast decision tree construction algorithm that takes advantage of the sparsity of text data, and a rule simplification method that converts the decision tree into a logically equivalent rule set.

Lim introduced a method which improves the performance of kNN-based text classification by using well-estimated parameters. Some variants of the kNN method with different decision functions, k values and feature sets were also introduced and evaluated to find suitable parameters.

For fast document classification, the Corner Classification (CC) network, a feed-forward neural network, is used, and a training algorithm for it, TextCC, has been introduced. The complexity of text classification tasks generally varies: as the number of distinct classes increases, so does the complexity, and hence the required training set size. In a multi-class text classification task, some classes are inevitably harder to classify than others. Reasons for this include very few positive training examples for the class and a lack of good predictive features for that class.

When training one binary classifier per category in text categorization, all the documents in the training corpus that belong to the category are used as relevant training data, and all the documents belonging to the other categories as non-relevant training data. It is often the case that there is an overwhelming number of non-relevant training documents, especially when there is a large number of categories with each assigned to only a few documents; this is known as the "imbalanced data problem". This problem poses a particular risk to classification algorithms, which can achieve high accuracy simply by classifying every example as negative. To overcome this problem, cost-sensitive learning is needed.

2.4 REVIEW OF RELATED WORK


Li et al. investigated four different methods for document classification: the naive Bayes classifier, the nearest neighbour classifier, decision trees and a subspace method. These were applied to seven-class Yahoo news groups (business, entertainment, health, international, politics, sports and technology), individually and in combination. They studied three classifier combination approaches: simple voting, dynamic classifier selection and adaptive classifier combination. Their experimental results indicate that the naive Bayes classifier and the subspace method outperform the other two classifiers on their data sets. Combinations of multiple classifiers did not always improve the classification accuracy compared to the best individual classifier. Among the three combination approaches, the adaptive classifier combination method performed best. The best classification accuracy that they were able to achieve on this seven-class problem was approximately 83%, which is comparable to the performance of other similar studies, although the classification problem they considered is more difficult because the pattern classes used in their experiments have a large overlap of words in their corresponding documents (Li, 1998).


Goller et al. thoroughly evaluated a wide variety of methods on a document classification task for German text, covering different feature construction and selection methods and various classifiers. Their main results are: feature selection is necessary not only to reduce learning and classification time, but also to avoid overfitting (even for Support Vector Machines); surprisingly, their morphological analysis does not improve classification quality compared to a letter 5-gram approach; and Support Vector Machines are significantly better than all other classification methods (Goller, 2000).

Basarkar discusses the different types of feature vectors through which a document can be represented and later classified, comparing Binary, Count and TfIdf feature vectors and their impact on document classification. To test how well each of the three feature vectors performs, he used the 20-newsgroup dataset and converted the documents to all three feature vector representations. For each representation, he trained a Naïve Bayes classifier and then tested the generated classifier on test documents. In the results, TfIdf performed 4% better than the Count vectorizer and 6% better than the Binary vectorizer when stop words were removed. When stop words were not removed, TfIdf performed 6% better than the Binary vectorizer and 11% better than the Count vectorizer. Also, the Count vectorizer performs better than the Binary vectorizer by 2% if stop words are removed, but lags behind by 5% if stop words are not removed. The conclusion is that TfIdf should be the preferred vectorizer for document representation and classification (Ankit, 2017).

CHAPTER THREE
SYSTEM ANALYSIS AND DESIGN

3.0 INTRODUCTION

This chapter shows all the modules and components used to design the

system, and how they work together. It also shows us how the users of

the system interact with the system.

3.1 ANALYSIS OF THE EXISTING SYSTEM


The existing system is currently manual: health workers classify health documents by stacking physical files in file cabinets. This makes it difficult to retrieve files when a file of a particular category is required.

3.2 ANALYSIS OF THE PROPOSED SYSTEM


System analysis and design deal with planning the development of

information systems through understanding and specifying in detail what

a system should do and how the components of the system should be

implemented and work together. System analysts solve business

problems through analyzing the requirements of information systems

and designing such systems by applying analysis and design techniques.


3.2.1 REQUIREMENTS OF THE SYSTEM

For the system to serve its intended purpose properly, the system will

have to meet the following requirements.

1. It should be able to accept as input text documents with the following extensions: .txt, .doc, .pdf.

2. It should be able to search for defined text in documents.

3. It should be able to summarize documents.

4. It should be able to categorize and summarize text.

5. It should be able to tokenize text and carry out stemming and lemmatization.

6. It should be able to identify sentences.

7. It should be able to perform coreference resolution, word sense disambiguation and sentence boundary disambiguation.

3.3 TRAINING A MODEL


In machine learning, a model is built by training an algorithm on sample data; the algorithm learns from the training data to the point that it will produce similar results when data similar to the training data is presented to it. In this project work we make use of the OpenNLP API for document classification. The OpenNLP API is a set of Java tools from the Apache Software Foundation for carrying out natural language processing, which is an aspect of machine learning and is the domain of our project work.

In order to carry out the classification, we first train a model. Our model is built to identify diseases such as malaria, hypertension and diarrhea. We opted to start with these three diseases because a quick Google search suggests they are among the most common diseases prevalent in Nigeria. To construct a model in OpenNLP, you need to create a file of training data. The training file consists of a series of lines; the first word of each line is the category, and the category is followed by text separated by whitespace. We used numerous lines of text containing the words malaria, hypertension and diarrhea, sourced online mainly from Wikipedia, to create a training file called "en-diseases.train". The en-diseases.train file is passed to the train method of the DocumentCategorizerME class, which trains on the data and outputs a model file with a .bin file name extension.
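A condensed sketch of this training step, following the fuller servlet code in Appendix B (the file paths here are illustrative):

import java.io.BufferedOutputStream;
import java.io.File;
import java.io.FileOutputStream;
import java.io.IOException;
import java.io.OutputStream;
import java.nio.charset.StandardCharsets;
import opennlp.tools.doccat.DoccatFactory;
import opennlp.tools.doccat.DoccatModel;
import opennlp.tools.doccat.DocumentCategorizerME;
import opennlp.tools.doccat.DocumentSample;
import opennlp.tools.doccat.DocumentSampleStream;
import opennlp.tools.util.InputStreamFactory;
import opennlp.tools.util.MarkableFileInputStreamFactory;
import opennlp.tools.util.ObjectStream;
import opennlp.tools.util.PlainTextByLineStream;
import opennlp.tools.util.TrainingParameters;

public class TrainDiseasesModel {
    public static void main(String[] args) throws IOException {
        // Each line of en-diseases.train starts with a category (e.g. "Malaria")
        // followed by whitespace-separated sample text (see Appendix C).
        InputStreamFactory dataIn =
                new MarkableFileInputStreamFactory(new File("en-diseases.train"));
        ObjectStream<String> lineStream =
                new PlainTextByLineStream(dataIn, StandardCharsets.UTF_8);
        ObjectStream<DocumentSample> sampleStream = new DocumentSampleStream(lineStream);

        DoccatModel model = DocumentCategorizerME.train(
                "en", sampleStream, TrainingParameters.defaultParams(), new DoccatFactory());

        // Serialize the trained model to en-diseases.bin for later classification.
        try (OutputStream modelOut =
                new BufferedOutputStream(new FileOutputStream("en-diseases.bin"))) {
            model.serialize(modelOut);
        }
    }
}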

3.4 CLASSIFYING THE DOCUMENT

After training, the model file produced is used to classify the health documents. The categorize method of the DocumentCategorizerME class is used to classify each document as either Malaria, Diarrhea or Hypertension.
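A condensed sketch of this classification step, again following the servlet code in Appendix B (the model path and sample text are illustrative):

import java.io.FileInputStream;
import java.io.IOException;
import java.io.InputStream;
import opennlp.tools.doccat.DoccatModel;
import opennlp.tools.doccat.DocumentCategorizerME;
import opennlp.tools.tokenize.WhitespaceTokenizer;

public class ClassifyDocument {
    public static void main(String[] args) throws IOException {
        // Load the model produced by the training step.
        try (InputStream modelIn = new FileInputStream("en-diseases.bin")) {
            DoccatModel model = new DoccatModel(modelIn);
            DocumentCategorizerME categorizer = new DocumentCategorizerME(model);

            // Tokenize the document text and score it against each category.
            String[] tokens = WhitespaceTokenizer.INSTANCE.tokenize(
                    "Hypertension also known as high blood pressure");
            double[] outcomes = categorizer.categorize(tokens);
            System.out.println("Best category: " + categorizer.getBestCategory(outcomes));
        }
    }
}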

3.5 USE CASE DIAGRAMS

The use case diagram is used to show the interaction between the system's use cases and its clients without much detail. A use case diagram displays an actor and its use cases; the actors are also the users of the system.

The users or actors of our document classification system include:

Health Worker
Fig: 3.1 Health Worker Use Case

3.6 SEQUENCE DIAGRAM


Sequence diagrams are simple subsets of interaction diagrams. They

map out sequential events in an engineering or business process in order

to streamline activities. Sequence diagrams are used to show how

objects interact in a given situation. An important characteristic of a

sequence diagram is that time passes from top to bottom: the interaction

starts near the top of the diagram and ends at the bottom (i.e. Lower

equals Later).
Fig: 3.2 Sequence Diagram

3.7 CLASS DIAGRAMS


We begin our OOD process by identifying the classes required to build the system. We describe these classes using class diagrams and implement them in Java. In a class diagram, each class is modeled as a rectangle with three compartments: the top one contains the name of the class, centered horizontally in bold face; the middle compartment contains the class attributes; and the bottom compartment contains the class behaviors or operations. Below is the class diagram for the system.

Figure 3.3 Class Diagram

3.8 SYSTEM FLOW CHART


This is a graphical representation of the sequence of operations in an

information system or program. Information system flowcharts show

how data flows from source documents through the computer to final

distribution to users. The following figures are the system flow chart for

our system.
Figure 3.4 System Flow Chart
CHAPTER FOUR

SYSTEM IMPLEMENTATION

4.0 INTRODUCTION

After careful requirements gathering, analysis and design, the system is implemented. Implementation involves testing the system with the required data and observing the results to see whether the system has been properly designed or whether it contains bugs. This is usually done with data whose results are known. In this chapter we implement the system designed.

4.1 SYSTEM REQUIREMENTS

To implement the application, the computer on which it will run has to

meet some hardware and software requirements. Also since it has been

designed as a web enabled application, the server on which the system

will be deployed also has to meet certain hardware and software

requirements. The following section will outline these requirements.

4.1.1 SERVER HARDWARE SPECIFICATION

1. 4GB of RAM and above


2. 500 GB Hard Disk or more

3. 2.0 GHz processor speed or more

4.1.2 SERVER SOFTWARE SPECIFICATION

1. 32-bit or 64-bit Operating System, Windows or Linux

2. Java Runtime Environment, version 7

3. Apache Tomcat Server version 7

4. MySQL version 5

4.1.3 CLIENT HARDWARE SPECIFICATION

1. 1GB of RAM

2. 80 GB Hard Disk

3. 2.0 GHz processor speed.

4. 15 inches Monitor Screen

5. Internet modem

4.1.4 CLIENT SOFTWARE SPECIFICATION

1. 32-bit Operating System, Windows or Linux

2. Web browser
4.2 SYSTEM SAMPLE OUTPUT

This section displays the sample interface, and describes the functions of

each web page in the system.

4.2.1 HOME PAGE

This is the first page that displays to the users of the system. It contains a

brief introduction to the application as well as the login link for the

administrators and the users.

Figure 4.1 Home Page


4.2.2 ADMINISTRATOR LOGIN PAGE

This page contains a login form for the administrator; the form includes two text input fields which capture the user name and password, a switch so the browser can remember the user's details, and a sign up button.

Figure 4.2 Administrator Login Page


4.2.3 ADMINISTRATOR DASHBOARD

This is the dashboard for the administrator; it is the first page the

administrator sees after login. It contains links to upload the training file.

Figure 4.3 Administrator Dashboard

4.2.4 USER LOGIN PAGE

This page contains a login form for the user; the form includes two text input fields which capture the user name and password, a switch so the browser can remember the user's details, and a sign up button.
Figure 4.4 User Login

4.2.5 USER DASHBOARD

This is the dashboard for the user; it is the first page the user sees after

login. It contains links to upload the health document.


Figure 4.5 User Dashboard

4.2.6 UPLOAD DOCUMENT

The upload document page is used by the user to upload the health

document.

Figure 4.6 Upload Document

4.2.7 UPLOAD TRAIN FILE


This page is used by the administrator to upload the training file.

Figure 4.7 Upload Train File

4.3 HOW TO INSTALL THE PROGRAM

The program is installed on a server that meets the above requirements. Below are the steps to take when installing the program on the server.

1. Ensure that the server meets the above software and hardware requirements.

2. The software will be built into a .war file; copy the .war file into the webapps folder of the Apache Tomcat installation.

3. The complementing database will be distributed as a .sql file which contains all the tables. Create a database called webscrap and import the .sql file.

4.4 HOW TO RUN THE PROGRAM

1. Ensure that the client system meets the above software and

hardware requirements.

2. Open a web browser

3. Type the URL localhost:8080/webscrap.

4.5 REASONS FOR CHOOSING JSP

1. Supports tag-based programming.

2. Strong Java programming skill is not required, so it is suitable for non-Java programmers.

3. Provides nine implicit objects that we can use directly without additional code to access them.

4. Allows presentation logic (HTML) to be separated from business logic (Java code).

5. Exception handling is optional.

6. Increases the readability of code because of tags.

7. Modifications are reflected without re-compilation and re-loading.

8. Provides built-in JSP tags and allows the development of custom JSP tags and the use of third-party JSP tags.

9. Easy to learn and easy to apply.

4.6 REASONS FOR CHOOSING OPENNLP

The Apache OpenNLP library is a machine learning based toolkit for the

processing of natural language text. It supports the most common NLP

tasks, such as tokenization, sentence segmentation, part-of-speech

tagging, named entity extraction, chunking, parsing, and coreference

resolution. These tasks are usually required to build more advanced text

processing services. OpenNLP also includes maximum entropy and perceptron based machine learning.


CHAPTER FIVE

SUMMARY AND CONCLUSION

5.0 INTRODUCTION

This chapter summarizes and concludes the project work; it also gives

recommendations and insight to future work.

5.1 SUMMARY

In this project work we succeeded in applying Natural Language Processing, a branch of Machine Learning, to classifying health-related documents. We made use of the OpenNLP Application Programming Interface, a Java API, for training a model and classifying the documents, and of Materialize, an HTML5, CSS and JavaScript framework, for building the user interface. The software is built using the Model-View-Controller (MVC) architecture.

5.2 RECOMMENDATION

To properly use the system we recommend the following:


1. The system can be hosted online on a Tomcat server, so that all users can access it from their respective locations (details of this can be found in chapter four).

2. Medical personnel should be trained on how to use the system.

3. The model should be properly trained to ensure accurate classification by the system; a poorly trained model will lead to erroneous classification.

5.3 FUTURE WORK


Due to the limited time available for developing this project work, some key features could not be integrated. It is my recommendation that in future work the following features be added:

1. A crawler should be implemented so that the model is constantly updated from the internet.

2. When new data is added to the model from the internet, a listener should be implemented that triggers the retraining of the model.


5.4 CONCLUSION
In conclusion, we can see that applying Natural Language Processing to the classification of text and text-based documents is more effective than using other machine learning techniques, such as clustering, which can be regarded as overkill for this task. Natural Language Processing has a lot of potential outside document classification; its relevance has been seen in the area of sentiment analysis. It is my recommendation that further research be carried out in the field of Natural Language Processing.


REFERENCES

Russell Power, Jay Chen, Trishank Karthik and Lakshminarayanan Subramanian (2018), "Document Classification for Focused Topics", https://cs.nyu.edu/~jchen/publications/aaai4d-power.pdf.

Hull, D., J. Pedersen and H. Schutze (1996), "Document routing as statistical classification", in AAAI Spring Symposium on Machine Learning in Information Access Technical Papers, Palo Alto.

Fox, C. (1992), "Lexical analysis and stoplists", in Information Retrieval: Data Structures and Algorithms, W. Frakes and R. Baeza-Yates, Eds., Prentice Hall, pp. 102–130.

Geisser, S. (1992), Predictive Inference. NY: Chapman and Hall.

Liu, H. and Motoda, H. (1998), Feature Extraction, Construction and Selection: A Data Mining Perspective. Boston, Massachusetts: Springer.

Wang, Y. and X. Wang (2005), "A new approach to feature selection in text classification", in Proceedings of the 4th International Conference on Machine Learning and Cybernetics, vol. 6, pp. 3814–3819.

Montanes, E., J. Ferandez, I. Diaz, E. F. Combarro and J. Ranilla (2003), "Measures of rule quality for feature selection in text categorization", in 5th International Symposium on Intelligent Data Analysis. Germany: Springer-Verlag, pp. 589–598.

Aurangzeb, K., B. Baharum, L. H. Lee and K. Khairullah (2010), "A review of machine learning algorithms for text-documents classification", Journal of Advances in Information Technology, vol. 1, no. 1.

Wang, Z.-Q., X. Sun, D.-X. Zhang and X. Li (2006), "An optimal SVM based text classification algorithm", in Fifth International Conference on Machine Learning and Cybernetics, pp. 13–16.

Pazienza, M. T., ed. (1997), Information Extraction. Lecture Notes in Computer Science, Vol. 1299. Springer, Heidelberg, Germany.

Riloff, E. (1995), "Little words can make a big difference for text classification", in Proceedings of SIGIR-95, 18th ACM International Conference on Research and Development in Information Retrieval (Seattle, WA, 1995), pp. 130–136.

Leopold, Edda and Kindermann, Jörg (2002), "Text Categorization with Support Vector Machines. How to Represent Texts in Input Space?", Machine Learning 46, pp. 423–444.

Brank, J., Grobelnik, M., Milic-Frayling, N. and Mladenic, D. (2002), "Interaction of Feature Selection Methods and Linear Classification Models", in Proceedings of the 19th International Conference on Machine Learning, Australia.

Forman, G. (2003), "An Experimental Study of Feature Selection Metrics for Text Categorization", Journal of Machine Learning Research, 3, pp. 1289–1305.

Li, Y. H. and A. K. Jain (1998), "Classification of Text Documents", The Computer Journal, Vol. 41, No. 8.

Goller, C., J. Löning, T. Will and W. Wolff (2000), "Automatic Document Classification: A Thorough Evaluation of Various Methods", International Symposium for Information Science (ISI 2000), Darmstadt, 8–10 November 2000.

Ankit Basarkar (2017), Document Classification Using Machine Learning, San Jose State University, SJSU ScholarWorks, http://scholarworks.sjsu.edu/?utm_source=scholarworks.sjsu.edu%2Fet_projects%2F531&utm_medium=PDF&utm_campaign=PDFCoverPags. Retrieved 5 March 2018.

Kim, S. B., Rim, H. C., Yook, D. S. and Lim, H. S. (2002), "Effective Methods for Improving Naive Bayes Text Classifiers", LNAI 2417, pp. 414–423.

Klopotek, M. and Woch, M. (2003), "Very Large Bayesian Networks in Text Classification", ICCS 2003, LNCS 2657, pp. 397–406.

Shanahan, J. and Roma, N. (2003), "Improving SVM Text Classification Performance through Threshold Adjustment", LNAI 2837, pp. 361–372.
APPENDIX A
APPENDIX B
UserController.java
/*
* To change this license header, choose License Headers in Project Properties.
* To change this template file, choose Tools | Templates
* and open the template in the editor.
*/
package controller;

import dao.DbConnection;
import java.io.File;
import java.io.FileInputStream;
import java.io.FileNotFoundException;
import java.io.IOException;
import java.io.InputStream;
import java.io.PrintWriter;
import java.sql.SQLException;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashMap;
import java.util.Iterator;
import java.util.List;
import java.util.Random;
import javax.crypto.KeyGenerator;
import javax.crypto.SecretKey;
import javax.servlet.RequestDispatcher;
import javax.servlet.ServletContext;
import javax.servlet.ServletException;
import javax.servlet.http.HttpServlet;
import javax.servlet.http.HttpServletRequest;
import javax.servlet.http.HttpServletResponse;
import javax.servlet.http.HttpSession;
import opennlp.tools.doccat.DoccatModel;
import opennlp.tools.doccat.DocumentCategorizerME;
import opennlp.tools.tokenize.Tokenizer;
import opennlp.tools.tokenize.WhitespaceTokenizer;
import org.apache.commons.fileupload.FileItem;
import org.apache.commons.fileupload.FileUploadException;
import org.apache.commons.fileupload.disk.DiskFileItemFactory;
import org.apache.commons.fileupload.servlet.ServletFileUpload;
import org.mindrot.jbcrypt.BCrypt;

/**
*
* @author harmony
*/
public class UserController extends HttpServlet {

private java.util.Map<String, String[]> sessionMap = new HashMap<String, String[]>(); // create HashMap

String health_document = "";

public void createProfile(HttpServletRequest request, HttpServletResponse response)


throws ClassNotFoundException, FileNotFoundException, ServletException, IOException,
FileUploadException, SQLException {

try {

String first_name = "";


String last_name = "";
String phone = "";
String email = "";
String password = "";
String cpassword = "";
String profile_picture = "";

String rootPath = System.getProperty("catalina.home");


ServletContext servletContext = getServletContext();
String relativePath = servletContext.getInitParameter("fileUploads1.dir");

File file = new File(rootPath + File.separator + relativePath);


if (!file.exists()) {
file.mkdirs();
}

// Verify the content type


String contentType = request.getContentType();

if ((contentType.indexOf("multipart/form-data") >= 0)) {

// Create a factory for disk-based file items


DiskFileItemFactory fileFactory = new DiskFileItemFactory();

File filesDir = (File) (file);

fileFactory.setRepository(filesDir);

// Create a new file upload handler


ServletFileUpload upload = new ServletFileUpload(fileFactory);

// Parse the request to get file items.


List<FileItem> fileItemsList = upload.parseRequest(request);

// Process the uploaded items


Iterator<FileItem> fileItemsIterator = fileItemsList.iterator();
while (fileItemsIterator.hasNext()) {

FileItem fileItem = fileItemsIterator.next();

if (fileItem.isFormField()) {

String name = fileItem.getFieldName();


String value = fileItem.getString();

if (name.equals("first_name")) {
first_name = value;
}
if (name.equals("last_name")) {
last_name = value;
}
if (name.equals("phone")) {
phone = value;
}
if (name.equals("email")) {
email = value;
}

if (name.equals("password")) {
password = value;
}
if (name.equals("cpassword")) {
cpassword = value;
}
if (name.equals("email")) {
email = value;
}

} else {
profile_picture = rootPath + File.separator + relativePath + File.separator +
fileItem.getName();
System.out.println("This is what's in profile_picture: " + profile_picture);
File file1 = new File(profile_picture);

System.out.println("This is what's in rootPath: " + rootPath);


System.out.println("This is what's in relativePath: " + relativePath);
System.out.println(fileItem.getName());

try {
fileItem.write(file1);
} catch (Exception ex) {
ex.printStackTrace();
}
}

}
}

if (!cpassword.equals(password)) {

RequestDispatcher rd = request.getRequestDispatcher("/unmatch_password.jsp");

rd.forward(request, response);
} else {

DbConnection createUserAccount = new DbConnection();

// Hash User Data


//String hPassword = BCrypt.hashpw(password.trim(), BCrypt.gensalt(15));
//System.out.println("password.trim() is: " + password.trim());
//System.out.println("hPassword is: " + hPassword);

createUserAccount.createUserAccount(first_name, last_name, phone, email, password,


profile_picture);
createUserAccount.logUserRegistration();
RequestDispatcher rd =
getServletContext().getRequestDispatcher("/user_registration_successful.jsp");
rd.forward(request, response);
}

} catch (ClassNotFoundException | FileNotFoundException | FileUploadException error) {

System.out.print(error);
}
}

protected void userLogin(HttpServletRequest request, HttpServletResponse response)


throws ServletException, IOException {

try {

String username = request.getParameter("username");


String password = request.getParameter("password");

DbConnection user_login = new DbConnection();

String[] user_details = user_login.userLogin(username, password);

String user_password = user_details[0];


String firstName = user_details[1];
String lastName = user_details[2];
String username1 = user_details[3];
String user_phone = user_details[4];

//String generatedOtp = Arrays.toString(generateOTP(request, response));


//String generatedOtpRemoveComma = generatedOtp.replace(",","");
//String generatedOtpTrim = generatedOtpRemoveComma.replace(" ","");
//String generatedOtpRemoveOpenBrace = generatedOtpTrim.replace("[","");
//String generatedOtpRemoveCloseBrace = generatedOtpRemoveOpenBrace.replace("]","");

String[] sessionData = {username1, firstName, lastName};

if (username != null || password != null) {

if (!"".equals(username) || !"".equals(password)) {
if (password.equals(user_password)) {

System.out.println("It matches");

HttpSession session = request.getSession(true);

String sessionId = session.getId();

System.out.println("sessionId is " + sessionId);

sessionMap.put(sessionId, sessionData);

String[] sessionMapValues = sessionMap.get(sessionId);

String sessionFirstName = sessionMapValues[1];


String sessionLastName = sessionMapValues[2];
String sessionUserName = sessionMapValues[0];

request.setAttribute("sessionId", sessionId);
request.setAttribute("sessionFirstName", sessionFirstName);
request.setAttribute("sessionLastName", sessionLastName);
request.setAttribute("sessionUserName", sessionUserName);

RequestDispatcher rd = request.getRequestDispatcher("/user/user_dashboard.jsp");

rd.forward(request, response);
} else {
System.out.println("It does not match");
}
}
}

} catch (ClassNotFoundException | NumberFormatException | ServletException | IOException error)


{

error.printStackTrace();
}
}
public void goToUploadDocument(HttpServletRequest request, HttpServletResponse response)
throws ServletException, IOException {

String sessionId = request.getParameter("sessionId");

String[] sessionMapValues = sessionMap.get(sessionId);

String sessionFirstName = sessionMapValues[1];

String sessionLastName = sessionMapValues[2];
String sessionUserName = sessionMapValues[0];

request.setAttribute("sessionId", sessionId);
request.setAttribute("sessionFirstName", sessionFirstName);
request.setAttribute("sessionLastName", sessionLastName);
request.setAttribute("sessionUserName", sessionUserName);

RequestDispatcher rd = getServletContext().getRequestDispatcher("/user/uploadDocument.jsp");

rd.forward(request, response);
}

public void uploadDocument(HttpServletRequest request, HttpServletResponse response){

String sessionId = request.getParameter("sessionId");

try{

String document_title = "";


String document = "";

String rootPath = System.getProperty("catalina.home");


ServletContext servletContext = getServletContext();
String relativePath = servletContext.getInitParameter("fileUploads1.dir");

File file = new File(rootPath + File.separator + relativePath);


if (!file.exists()) {
file.mkdirs();
}

// Verify the content type


String contentType = request.getContentType();
if ((contentType.indexOf("multipart/form-data") >= 0)) {

// Create a factory for disk-based file items


DiskFileItemFactory fileFactory = new DiskFileItemFactory();

File filesDir = (File) (file);

fileFactory.setRepository(filesDir);

// Create a new file upload handler


ServletFileUpload upload = new ServletFileUpload(fileFactory);

// Parse the request to get file items.


List<FileItem> fileItemsList = upload.parseRequest(request);

// Process the uploaded items


Iterator<FileItem> fileItemsIterator = fileItemsList.iterator();
while (fileItemsIterator.hasNext()) {

FileItem fileItem = fileItemsIterator.next();

if (fileItem.isFormField()) {

String name = fileItem.getFieldName();


String value = fileItem.getString();

if (name.equals("document_title")) {
document_title = value;
}

} else {
health_document = rootPath + File.separator + relativePath + File.separator +
fileItem.getName();
System.out.println("This is what's in document: " + health_document);
File file1 = new File(health_document);

System.out.println("This is what's in rootPath: " + rootPath);


System.out.println("This is what's in relativePath: " + relativePath);
System.out.println(fileItem.getName());
try {
fileItem.write(file1);
} catch (Exception ex) {
ex.printStackTrace();
}
}

}
}

if (document_title != null || health_document != null) {

classifyDocuments(request, response);

String[] sessionMapValues = sessionMap.get(sessionId);

String sessionFirstName = sessionMapValues[1];

String sessionLastName = sessionMapValues[2];
String sessionUserName = sessionMapValues[0];

request.setAttribute("sessionId", sessionId);
request.setAttribute("sessionFirstName", sessionFirstName);
request.setAttribute("sessionLastName", sessionLastName);
request.setAttribute("sessionUserName", sessionUserName);

} else {

RequestDispatcher rd = request.getRequestDispatcher("/error_page.jsp");

rd.forward(request, response);
}

}
catch(Exception e){
e.printStackTrace();
}
}
public void classifyDocuments(HttpServletRequest request, HttpServletResponse response)
throws IOException, FileNotFoundException {

String modelFileName = "en-diseases.bin";

String rootPath = System.getProperty("catalina.home");


ServletContext servletContext = getServletContext();
String relativePath = servletContext.getInitParameter("fileUploads1.dir");

String modelFile = rootPath + File.separator + relativePath + File.separator + modelFileName;

// Set up a byte array to hold the file's content


byte[] content = new byte[0];

Tokenizer tokenizer = WhitespaceTokenizer.INSTANCE;

try{

// Create an input stream for the file


FileInputStream hamletInputStream = new FileInputStream(health_document);

// Figure out how much content the file has


int bytesAvailable = hamletInputStream.available();

// Set the content array to the length of the content


content = new byte[bytesAvailable];

// Load the file's content into our byte array


hamletInputStream.read(content);

String[] inputText = tokenizer.tokenize(new String(content));

InputStream modelIn = new FileInputStream(modelFile);


System.out.println("modelFile value is: " + modelFile);
System.out.println("model FIle assigned to modelIn variable");
System.out.println("modelIn variable value is: " + modelIn);

DoccatModel model = new DoccatModel(modelIn);


DocumentCategorizerME categorizer = new DocumentCategorizerME(model);
double[] outcomes = categorizer.categorize(inputText);

for (int i = 0; i < categorizer.getNumberOfCategories(); i++) {


String category = categorizer.getCategory(i);
System.out.println(category + " - " + outcomes[i]);
}

//String category = categorizer.getBestCategory(outcomes);


System.out.println(categorizer.getBestCategory(outcomes));
System.out.println(categorizer.getAllResults(outcomes));

}catch(Exception e){

e.printStackTrace();
}
}

public void logout(HttpServletRequest request, HttpServletResponse response) throws


ServletException, IOException {

String sessionId = request.getParameter("sessionId");

sessionMap.remove(sessionId);

RequestDispatcher rd = getServletContext().getRequestDispatcher("/user/userLogin.jsp");
rd.forward(request, response);
}

@Override
protected void doGet(HttpServletRequest request, HttpServletResponse response)
throws ServletException, IOException {
doPost(request, response);
}

@Override
protected void doPost(HttpServletRequest request, HttpServletResponse response)
throws ServletException, IOException {
try {

String user_action = request.getParameter("user_action");

switch (user_action) {
case "register_user":
createProfile(request, response);
break;

case "user_login":
userLogin(request, response);
break;

case "go_to_upload_document":
goToUploadDocument(request, response);
break;

case "upload_document":
uploadDocument(request, response);
break;

case "logout":
logout(request, response);
break;

}

} catch (ServletException | IOException | ClassNotFoundException | FileUploadException | SQLException error) {

error.printStackTrace();
}
}

/**
* Returns a short description of the servlet.
*
* @return a String containing servlet description
*/
@Override
public String getServletInfo() {
return "Short description";
}// </editor-fold>

}
AdministratorController.java
/*
* To change this license header, choose License Headers in Project Properties.
* To change this template file, choose Tools | Templates
* and open the template in the editor.
*/
package controller;

import dao.DbConnection;
import java.io.BufferedOutputStream;
import java.io.File;
import java.io.FileNotFoundException;
import java.io.FileOutputStream;
import java.io.IOException;
import java.io.OutputStream;
import java.io.PrintWriter;
import java.nio.charset.StandardCharsets;
import java.sql.SQLException;
import java.util.HashMap;
import java.util.Iterator;
import java.util.List;
import javax.servlet.RequestDispatcher;
import javax.servlet.ServletContext;
import javax.servlet.ServletException;
import javax.servlet.http.HttpServlet;
import javax.servlet.http.HttpServletRequest;
import javax.servlet.http.HttpServletResponse;
import javax.servlet.http.HttpSession;
import opennlp.tools.doccat.DoccatFactory;
import opennlp.tools.doccat.DoccatModel;
import opennlp.tools.doccat.DocumentCategorizerME;
import opennlp.tools.doccat.DocumentSample;
import opennlp.tools.doccat.DocumentSampleStream;
import opennlp.tools.util.InputStreamFactory;
import opennlp.tools.util.MarkableFileInputStreamFactory;
import opennlp.tools.util.ObjectStream;
import opennlp.tools.util.PlainTextByLineStream;
import opennlp.tools.util.TrainingParameters;
import org.apache.commons.fileupload.FileItem;
import org.apache.commons.fileupload.FileUploadException;
import org.apache.commons.fileupload.disk.DiskFileItemFactory;
import org.apache.commons.fileupload.servlet.ServletFileUpload;
/**
*
* @author harmony
*/
public class AdministratorController extends HttpServlet {

private java.util.Map<String, String[]> sessionMap = new HashMap<String, String[]>(); // create HashMap

String train_file = "";


File file1;

protected void administratorLogin(HttpServletRequest request, HttpServletResponse response)


throws ServletException, IOException {

try {

String username = request.getParameter("username");


String password = request.getParameter("password");
String lastLogonForm = request.getParameter("lastLogonForm");

DbConnection admin_login = new DbConnection();

String[] administrator_details = admin_login.administratorLogin(username, password);

String administrator_password = administrator_details[0];


String lastlogon = administrator_details[1];
String firstName = administrator_details[2];
String lastName = administrator_details[3];
String username1 = administrator_details[4];

String[] sessionData = {username1, firstName, lastName};

long longValueOfLastLogon = Long.parseLong(lastlogon);

if (username != null || password != null) {

if (!"".equals(username) || !"".equals(password)) {

if (administrator_password.equals(password)) {
long longValueOfLastLogonForm = Long.parseLong(lastLogonForm);

if (longValueOfLastLogonForm > longValueOfLastLogon) {

HttpSession session = request.getSession(true);

String sessionId = session.getId();

sessionMap.put(sessionId, sessionData);

String[] sessionMapValues = sessionMap.get(sessionId);

String sessionFirstName = sessionMapValues[1];


String sessionLastName = sessionMapValues[2];
String sessionUserName = sessionMapValues[0];

String stringValueOfLastLogonForm = String.valueOf(longValueOfLastLogonForm);

admin_login.updateAdministratorLastLogon(stringValueOfLastLogonForm, username);

request.setAttribute("sessionId", sessionId);
request.setAttribute("sessionFirstName", sessionFirstName);
request.setAttribute("sessionLastName", sessionLastName);
request.setAttribute("sessionUserName", sessionUserName);

RequestDispatcher rd =
request.getRequestDispatcher("/admin/administrator_dashboard.jsp");

rd.forward(request, response);

}
}
}
}

} catch (ClassNotFoundException | NumberFormatException | ServletException | IOException error)


{

error.printStackTrace();
}
}

public void goToUploadTrainingFile(HttpServletRequest request, HttpServletResponse response)


throws ServletException, IOException {

String sessionId = request.getParameter("sessionId");

String[] sessionMapValues = sessionMap.get(sessionId);

String sessionFirstName = sessionMapValues[1];

String sessionLastName = sessionMapValues[2];
String sessionUserName = sessionMapValues[0];

request.setAttribute("sessionId", sessionId);
request.setAttribute("sessionFirstName", sessionFirstName);
request.setAttribute("sessionLastName", sessionLastName);
request.setAttribute("sessionUserName", sessionUserName);

RequestDispatcher rd =
getServletContext().getRequestDispatcher("/admin/upload_training_file.jsp");

rd.forward(request, response);
}

public void uploadTrainFile(HttpServletRequest request, HttpServletResponse response)


throws ServletException, IOException, ClassNotFoundException, SQLException,
FileUploadException {

String sessionId = "";

String rootPath = System.getProperty("catalina.home");


ServletContext servletContext = getServletContext();
String relativePath = servletContext.getInitParameter("fileUploads1.dir");

File file = new File(rootPath + File.separator + relativePath);


if (!file.exists()) {
file.mkdirs();
}
// Verify the content type
String contentType = request.getContentType();

if ((contentType.indexOf("multipart/form-data") >= 0)) {

// Create a factory for disk-based file items


DiskFileItemFactory fileFactory = new DiskFileItemFactory();

File filesDir = (File) (file);

fileFactory.setRepository(filesDir);

// Create a new file upload handler


ServletFileUpload upload = new ServletFileUpload(fileFactory);

// Parse the request to get file items.


List<FileItem> fileItemsList = upload.parseRequest(request);

// Process the uploaded items


Iterator<FileItem> fileItemsIterator = fileItemsList.iterator();
while (fileItemsIterator.hasNext()) {

FileItem fileItem = fileItemsIterator.next();

if (fileItem.isFormField()) {

String name = fileItem.getFieldName();


String value = fileItem.getString();

if (name.equals("sessionId")) {
sessionId = value;
}

} else {

train_file = rootPath + File.separator + relativePath + File.separator + fileItem.getName();


System.out.println("This is what's in train_file: " + train_file);
file1 = new File(train_file);

System.out.println("This is what's in rootPath: " + rootPath);


System.out.println("This is what's in relativePath: " + relativePath);
System.out.println(fileItem.getName());

try {
fileItem.write(file1);
} catch (Exception ex) {
ex.printStackTrace();
}
}

}
}

String[] sessionMapValues = sessionMap.get(sessionId);

String sessionFirstName = sessionMapValues[1];

String sessionLastName = sessionMapValues[2];
String sessionUserName = sessionMapValues[0];

trainModel(request, response);

request.setAttribute("sessionId", sessionId);
request.setAttribute("sessionFirstName", sessionFirstName);
request.setAttribute("sessionLastName", sessionLastName);
request.setAttribute("sessionUserName", sessionUserName);

RequestDispatcher rd =
getServletContext().getRequestDispatcher("/admin/training_successful.jsp");
rd.forward(request, response);
}

public void trainModel(HttpServletRequest request, HttpServletResponse response)


throws IOException, FileNotFoundException {
response.setContentType("text/html;charset=UTF-8");

String modelFileName = "en-diseases.bin";

String rootPath = System.getProperty("catalina.home");


ServletContext servletContext = getServletContext();
String relativePath = servletContext.getInitParameter("fileUploads1.dir");

String modelFile = rootPath + File.separator + relativePath + File.separator + modelFileName;

try {

DoccatModel model = null;


DoccatFactory df = new DoccatFactory();
System.out.println("Model file path is: "+modelFile);

InputStreamFactory dataIn = new MarkableFileInputStreamFactory(file1);


//System.out.println("Train FIle is: " + file1);
//System.out.println("Train FIle assigned to dataIn variable");
//System.out.println("dataIn variable value is: " + dataIn);

ObjectStream<String> lineStream = new PlainTextByLineStream(dataIn, StandardCharsets.UTF_8);

//System.out.println("dataIn variable passed to the PlainTextByLineStream and used to create a new ObjectStream called lineStream");
//System.out.println("lineStream value is: " + lineStream);

ObjectStream<DocumentSample> sampleStream = new DocumentSampleStream(lineStream);

//System.out.println("lineStream variable passed to the DocumentSampleStream and used to create a new ObjectStream called sampleStream");
//System.out.println("sampleStream value is: " + sampleStream);

model = DocumentCategorizerME.train("en", sampleStream, TrainingParameters.defaultParams(), df);

//System.out.println("sampleStream variable passed to the DocumentCategorizerME and the value obtained is assigned to a model");
//System.out.println("model value is: " + model);

OutputStream modelOut = new BufferedOutputStream(new FileOutputStream(modelFile));

model.serialize(modelOut);
modelOut.close();

} catch (Exception e) {
e.printStackTrace();
}

}
public void logout(HttpServletRequest request, HttpServletResponse response) throws
ServletException, IOException {

String sessionId = request.getParameter("sessionId");

sessionMap.remove(sessionId);

RequestDispatcher rd = getServletContext().getRequestDispatcher("/admin/adminLogin.jsp");
rd.forward(request, response);
}

@Override
protected void doGet(HttpServletRequest request, HttpServletResponse response)
throws ServletException, IOException {
doPost(request, response);
}

@Override
protected void doPost(HttpServletRequest request, HttpServletResponse response)
throws ServletException, IOException {
try {

String administrator_action = request.getParameter("administrator_action");

switch (administrator_action) {

case "administrator_login":
administratorLogin(request, response);
break;

case "go_to_upload_training_file":
goToUploadTrainingFile(request, response);
break;

case "upload_train_file":
uploadTrainFile(request, response);
break;
/** case "go_to_add_room":
goToAddRoom(request, response);
break;

case "add_room":
addRoom(request, response);
break;

case "logout":
logout(request, response);
break;**/
}

} catch (ServletException | IOException | ClassNotFoundException | SQLException |


FileUploadException error) {

error.printStackTrace();
}
}

@Override
public String getServletInfo() {
return "Short description";
}// </editor-fold>

}
APPENDIX C
en-diseases.train

Malaria is a life-threatening mosquito-borne blood disease caused by a Plasmodium parasite

Malaria was eliminated from the U.S. in the early 1950s

Malaria is typically spread by mosquitoes

Malaria symptoms can be classified into two categories

Malaria happens when a bite from the female Anopheles mosquito infects the body with
Plasmodium

Malaria is a mosquito-borne infectious disease affecting humans and other animals caused by
parasitic protozoans

Malaria is a mosquito-borne disease caused by a parasite

Malaria occurred worldwide and 445,000 people died

Malaria is caused by parasites from the genus Plasmodium

Malaria parasite in most countries

Malaria is an acute febrile illness

Malaria If not treated within 24 hours

Malaria can progress to severe illness

Malaria frequently develop one or more of the following symptoms

Malaria cases and deaths

Malaria transmission

Malaria control programmes

Malaria infection

Diarrhea can be prevented by improved sanitation

Diarrhea it is recommended that they continue to eat healthy food and babies continue to be
breastfed

Diarrhea and a high fever


Diarrhea on average three times a year

Diarrhea are also a common cause of malnutrition and the most common cause in those younger
than five years of age

Diarrhea is defined by the World Health Organization as having three or more loose or liquid
stools per day

Diarrhea is defined as an abnormally frequent discharge of semisolid or fluid fecal matter from
the bowel

Diarrhea means that there is an increase in the active secretion

Diarrhea is a cholera toxin that stimulates the secretion of anions

Diarrhea intestinal fluid secretion is isotonic with plasma even during fasting

Diarrhea occurs when too much water is drawn into the bowels

Diarrhea can also be the result of maldigestion

Diarrhea and distention of the bowel

Hypertension also known as high blood pressure

Hypertension was believed to have been a factor in

Hypertension is rarely accompanied by symptoms, and its identification is usually through


screening

Hypertension may be associated with the presence of changes in the optic fundus seen by
ophthalmoscopy

Hypertension with certain specific additional signs and symptoms may suggest secondary
hypertension

Hypertension due to an identifiable cause

Hypertension accompanied by headache

Hypertension occurs in approximately

Hypertension in pregnancy

Hypertension during pregnancy without protein in the urine

Hypertension in newborns and young infants. In older infants and children


Hypertension results from a complex interaction of genes and environmental factors

Hypertension results from an identifiable cause

Hypertension can also be caused by endocrine conditions
