A System For Health Document Classification Using Machine Learning
1.0 INTRODUCTION
This chapter introduces the topic of the project work A System for
system. Among other functions that these systems provide, they are
stores patient records in digital format. Large volumes of patient data are
recorded daily, forming a large data set popularly referred
to as “Big Data”.
Every day physicians and other health workers are required to work with
this “Big Data” in order to provide solutions. Some of the everyday tasks
from big data can be very laborious and time-consuming. This has given
.
Usually, machine learning, statistical pattern recognition, or neural
the data coming in or even classify it into categories. Also in the health
sector, numerous patient records are collected every day and are
The software delivered from this project work will greatly reduce the
similar application.
2. It will serve as a source of material for those interested in investigating
JSP: JavaServer Pages is a Java technology for creating dynamic web
pages.
manipulating databases.
Server’s functionality.
framework for faster and easier web development. It uses HTML, CSS
and JavaScript.
Chapter one introduces the background of the project with the statement
LITERATURE REVIEW
2.0 DOCUMENT CLASSIFICATION
Classification can be divided into two principal phases. The first phase is
representation models. The more relevant the representation is, the more
relevant the classification will be. The second phase includes learning
from training corpus, making a model for classes and classifying the
texts and text mining (PAZIENZA, 1997). “Text mining” is mostly used
to represent all the tasks that, by analyzing large quantities of text and
categories,
The task of building a classifier for documents does not differ from other
processing.
2.2.2 STEMMING
In linguistic morphology and information retrieval, stemming is the
stem, original form. The stem need not be identical to the morphological
similar stem, even if this stem is not a valid root. In computer science
algorithms for stemming have been studied since 1968. Many search
not a machine. No single prepared list of stop words exists that can be
used by every tool. Though any stop word list is used by any tool in
Any group of words can be selected as the stop words for a particular
purpose. For some search engines, this is a list of common, short
function words, such as the, is, at, which and on, that create problems in
to eliminate stop words contains lexical words, like "want" from phrases
to raise performance.
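The stop-word filtering described above can be sketched in a few lines of Java. This is a minimal illustration, not the project's implementation: the class name `StopWordFilter`, the method `removeStopWords`, and the whitespace tokenization are assumptions, and the stop-word list contains only the example words named in the text.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Locale;
import java.util.Set;

public class StopWordFilter {
    // Illustrative list built from the examples in the text; real lists are tool-specific.
    private static final Set<String> STOP_WORDS =
            new HashSet<>(Arrays.asList("the", "is", "at", "which", "on"));

    /** Splits the text on whitespace, lower-cases each token and drops stop words. */
    public static List<String> removeStopWords(String text) {
        List<String> kept = new ArrayList<>();
        for (String token : text.trim().split("\\s+")) {
            String word = token.toLowerCase(Locale.ROOT);
            if (!word.isEmpty() && !STOP_WORDS.contains(word)) {
                kept.add(word);
            }
        }
        return kept;
    }

    public static void main(String[] args) {
        // Prints [patient, clinic]
        System.out.println(removeStopWords("The patient is at the clinic"));
    }
}
```

A hash set is used so that each lookup takes constant time on average, which matters when the filter runs over every token of a large corpus.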
System.
document.
(especially those that do not scale well with the feature set size) and
generalization.
approaches, but like them its aim is to decrease the feature set volume.
The approach does not weight terms in order to neglect the lower
approach taken are decision trees, naïve Bayes, rule induction, neural
development.
Schneider addressed the problems and showed that they can be resolved
(Klopotek, 2003). The study advises that tree-like Bayesian networks are
algorithm that receives benefits of the sparse text data, and a rule
kNN method with various decision functions, k values, and feature sets
classification task, some classes are unavoidably a bit harder than others
to classify. Reasons for this include very few positive training examples for
the class and a lack of good predictive features for that class.
use all the documents in the training corpus that have the category as
related training data, and all the documents in the training corpus that
belong to the other categories as non-related training data. It is a common case
the naive Bayes classifier, the nearest neighbor classifier, decision trees
indicate that the naive Bayes classifier and the subspace method
introduced here performed the best. The best classification accuracy that
because the pattern classes used in our experiments have a large overlap
learning and classification time, but also to avoid overfitting (even for
approach. Support Vector Machines are significantly better than all other
the Binary, Count and TfIdf feature vectors and their impact on
converted the documents to all the three feature vectors. For each feature
vector representation, they trained the Naïve Bayes classifier and then
better than the Binary vectorizer if stop words are removed. If stop words
are not removed, TfIdf performed 6% better than the Binary vectorizer
and 11% better than the Count vectorizer. Also, the Count vectorizer performs
2% better than the Binary vectorizer if stop words are removed, but lags
behind by 5% if they are not. Thus, they conclude
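To make the three representations compared in this section concrete, the following sketch builds Binary, Count and TfIdf vectors for a toy corpus. It is an illustration under stated assumptions: the class and method names are hypothetical, and the weighting used is the plain tf × ln(N/df) variant, which may differ from the exact formula used in the cited study.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.TreeSet;

public class FeatureVectors {
    /** Builds an alphabetically sorted vocabulary from tokenized documents. */
    public static List<String> vocabulary(List<String[]> docs) {
        TreeSet<String> vocab = new TreeSet<>();
        for (String[] doc : docs) {
            vocab.addAll(Arrays.asList(doc));
        }
        return new ArrayList<>(vocab);
    }

    /** Count vector: number of occurrences of each vocabulary term in the document. */
    public static double[] count(String[] doc, List<String> vocab) {
        double[] v = new double[vocab.size()];
        for (String token : doc) {
            int i = vocab.indexOf(token);
            if (i >= 0) {
                v[i] += 1.0;
            }
        }
        return v;
    }

    /** Binary vector: 1 if the term occurs at least once in the document, otherwise 0. */
    public static double[] binary(String[] doc, List<String> vocab) {
        double[] v = count(doc, vocab);
        for (int i = 0; i < v.length; i++) {
            v[i] = v[i] > 0 ? 1.0 : 0.0;
        }
        return v;
    }

    /** TfIdf vector (assumed variant): term count multiplied by ln(N / document frequency). */
    public static double[] tfIdf(String[] doc, List<String[]> docs, List<String> vocab) {
        double[] v = count(doc, vocab);
        for (int i = 0; i < v.length; i++) {
            int df = 0;
            for (String[] d : docs) {
                if (Arrays.asList(d).contains(vocab.get(i))) {
                    df++;
                }
            }
            v[i] = df == 0 ? 0.0 : v[i] * Math.log((double) docs.size() / df);
        }
        return v;
    }

    public static void main(String[] args) {
        List<String[]> docs = Arrays.asList(
                "malaria fever fever".split(" "),
                "diarrhea fever".split(" "));
        List<String> vocab = vocabulary(docs); // [diarrhea, fever, malaria]
        System.out.println(vocab);
        System.out.println(Arrays.toString(binary(docs.get(0), vocab)));
        System.out.println(Arrays.toString(count(docs.get(0), vocab)));
        System.out.println(Arrays.toString(tfIdf(docs.get(0), docs, vocab)));
    }
}
```

In this toy corpus the term "fever" appears in every document, so its TfIdf weight is zero even though its count is the highest. This is the down-weighting of uninformative terms that distinguishes TfIdf from the other two representations.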
CHAPTER THREE
SYSTEM ANALYSIS AND DESIGN
3.0 INTRODUCTION
This chapter shows all the modules and components used to design the
system, and how they work together. It also shows us how the users of
For the system to serve its intended purpose properly, the system will
lemmatization.
learns from the model to the point that when it will produce similar
algorithm. In this project work we make use of the OpenNLP API for
In order to carry out the classification, we first train a model. Our model
training data. The training file format consists of a series of lines; the
first word of each line is the category. The category is followed by text
the DocumentCategorizerME class. The train method trains the file and
After training, the model file produced will be used to classify the
The use case diagram is used to show the interaction between the system
use cases and its clients without much detail. A use case diagram
displays an actor and its use cases; the actors are also the users of the
system.
Fig: 3.1 Health Worker Use Case
sequence diagram is that time passes from top to bottom: the interaction
starts near the top of the diagram and ends at the bottom (i.e. Lower
equals Later).
Fig: 3.2 Sequence Diagram
implement them in Java. The class diagram enables us to model via class
face. The middle compartment contains the class attributes, while the
how data flows from source documents through the computer to final
distribution to users. The following figure is the flow chart for
our system.
Figure 3.4 System Flow Chart
CHAPTER FOUR
SYSTEM IMPLEMENTATION
4.0 INTRODUCTION
data and observing the results to see if the system has been properly
designed or if it contains bugs. This is usually done with data which has
meet some hardware and software requirements. Also since it has been
4. MySQL version 5
1. 1GB of RAM
2. 80 GB Hard Disk
5. Internet modem
2. Web browser
4.2 SYSTEM SAMPLE OUTPUT
This section displays the sample interface, and describes the functions of
This is the first page that displays to the users of the system. It contains a
brief introduction to the application as well as the login link for the
This page contains a login form for the administrator. The form
includes two text input fields that capture the user name and
password, a switch so the browser can remember the user details, and a
sign up button.
This is the dashboard for the administrator; it is the first page the
administrator sees after login. It contains links to upload the training file.
This page contains a login form for the user. The form includes
two text input fields that capture the user name and password, a
switch so the browser can remember the user details, and a sign up
button.
Figure 4.4 User Login
This is the dashboard for the user; it is the first page the user sees after
The upload document page is used by the user to upload the health
document.
requirements. Below are a few steps to take when installing the program
on the server.
1. Ensure that the server meets the above software and hardware
requirements.
2. The software will be built into a .war file; copy the .war file into
1. Ensure that the client system meets the above software and
hardware requirements.
Java programmers
logic (Java code)
loading
8. Provides built-in JSP tags and allows custom JSP tags to be developed and
The Apache OpenNLP library is a machine learning based toolkit for the
resolution. These tasks are usually required to build more advanced text
5.0 INTRODUCTION
This chapter summarizes and concludes the project work; it also gives
5.1 SUMMARY
which is an HTML5, CSS and JavaScript framework for building the user
(MVC) architecture.
5.2 RECOMMENDATION
erroneous classification.
2. When there is new data added to the model from the internet, a
which can be regarded as overkill. Natural language processing has a lot
https://cs.nyu.edu/~jchen/publications/aaai4d-power.pdf.
no. 1.
Wang Z.-Q., X. Sun, D.-X. Zhang, and X. Li (2006), “An optimal svm
E. 1995. Little words can make a big difference for text classification. In
1995), 130–136.
Leopold, Edda & Kindermann, Jörg (2002), "Text Categorization with
Learning, Australia.
1289-1305.
http://scholarworks.sjsu.edu/?utm_source=scholarworks.sjsu.edu%2Fet_projects%2F531&utm_medium=PDF&utm_campaign=PDFCoverPags
372
APPENDIX A
APPENDIX B
UserController.java
package controller;
import dao.DbConnection;
import java.io.File;
import java.io.FileInputStream;
import java.io.FileNotFoundException;
import java.io.IOException;
import java.io.InputStream;
import java.io.PrintWriter;
import java.sql.SQLException;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashMap;
import java.util.Iterator;
import java.util.List;
import java.util.Random;
import javax.crypto.KeyGenerator;
import javax.crypto.SecretKey;
import javax.servlet.RequestDispatcher;
import javax.servlet.ServletContext;
import javax.servlet.ServletException;
import javax.servlet.http.HttpServlet;
import javax.servlet.http.HttpServletRequest;
import javax.servlet.http.HttpServletResponse;
import javax.servlet.http.HttpSession;
import opennlp.tools.doccat.DoccatModel;
import opennlp.tools.doccat.DocumentCategorizerME;
import opennlp.tools.tokenize.Tokenizer;
import opennlp.tools.tokenize.WhitespaceTokenizer;
import org.apache.commons.fileupload.FileItem;
import org.apache.commons.fileupload.FileUploadException;
import org.apache.commons.fileupload.disk.DiskFileItemFactory;
import org.apache.commons.fileupload.servlet.ServletFileUpload;
import org.mindrot.jbcrypt.BCrypt;
/**
*
* @author harmony
*/
public class UserController extends HttpServlet {
try {
fileFactory.setRepository(filesDir);
if (fileItem.isFormField()) {
if (name.equals("first_name")) {
first_name = value;
}
if (name.equals("last_name")) {
last_name = value;
}
if (name.equals("phone")) {
phone = value;
}
if (name.equals("email")) {
email = value;
}
if (name.equals("password")) {
password = value;
}
if (name.equals("cpassword")) {
cpassword = value;
}
} else {
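// Keep only the base name of the uploaded file so a crafted name cannot escape the upload directory.
profile_picture = rootPath + File.separator + relativePath + File.separator +
new File(fileItem.getName()).getName();
System.out.println("This is what's in profile_picture: " + profile_picture);
File file1 = new File(profile_picture);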
try {
fileItem.write(file1);
} catch (Exception ex) {
ex.printStackTrace();
}
}
}
}
if (!cpassword.equals(password)) {
RequestDispatcher rd = request.getRequestDispatcher("/unmatch_password.jsp");
rd.forward(request, response);
} else {
try {
// Proceed only when both the user name and the password are non-empty.
if (!"".equals(username) && !"".equals(password)) {
if (password.equals(user_password)) {
System.out.println("It matches");
sessionMap.put(sessionId, sessionData);
request.setAttribute("sessionId", sessionId);
request.setAttribute("sessionFirstName", sessionFirstName);
request.setAttribute("sessionLastName", sessionLastName);
request.setAttribute("sessionUserName", sessionUserName);
RequestDispatcher rd = request.getRequestDispatcher("/user/user_dashboard.jsp");
rd.forward(request, response);
} else {
System.out.println("It does not match");
}
}
}
error.printStackTrace();
}
}
public void goToUploadDocument(HttpServletRequest request, HttpServletResponse response)
throws ServletException, IOException {
request.setAttribute("sessionId", sessionId);
request.setAttribute("sessionFirstName", sessionFirstName);
request.setAttribute("sessionLastName", sessionLastName);
request.setAttribute("sessionUserName", sessionUserName);
RequestDispatcher rd = getServletContext().getRequestDispatcher("/user/uploadDocument.jsp");
rd.forward(request, response);
}
try{
fileFactory.setRepository(filesDir);
if (fileItem.isFormField()) {
if (name.equals("document_title")) {
document_title = value;
}
} else {
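// Keep only the base name of the uploaded file so a crafted name cannot escape the upload directory.
health_document = rootPath + File.separator + relativePath + File.separator +
new File(fileItem.getName()).getName();
System.out.println("This is what's in document: " + health_document);
File file1 = new File(health_document);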
}
}
classifyDocuments(request, response);
request.setAttribute("sessionId", sessionId);
request.setAttribute("sessionFirstName", sessionFirstName);
request.setAttribute("sessionLastName", sessionLastName);
request.setAttribute("sessionUserName", sessionUserName);
else {
RequestDispatcher rd = request.getRequestDispatcher("/error_page.jsp");
rd.forward(request, response);
}
}
catch(Exception e){
e.printStackTrace();
}
}
public void classifyDocuments(HttpServletRequest request, HttpServletResponse response)
throws IOException, FileNotFoundException {
try{
}catch(Exception e){
e.printStackTrace();
}
}
sessionMap.remove(sessionId);
RequestDispatcher rd = getServletContext().getRequestDispatcher("/user/userLogin.jsp");
rd.forward(request, response);
}
@Override
protected void doGet(HttpServletRequest request, HttpServletResponse response)
throws ServletException, IOException {
doPost(request, response);
}
@Override
protected void doPost(HttpServletRequest request, HttpServletResponse response)
throws ServletException, IOException {
try {
switch (user_action) {
case "register_user":
createProfile(request, response);
break;
case "user_login":
userLogin(request, response);
break;
case "go_to_upload_document":
goToUploadDocument(request, response);
break;
case "upload_document":
uploadDocument(request, response);
break;
case "logout":
logout(request, response);
break;
error.printStackTrace();
}
}
/**
* Returns a short description of the servlet.
*
* @return a String containing servlet description
*/
@Override
public String getServletInfo() {
return "Short description";
}
}
AdministratorController.java
package controller;
import dao.DbConnection;
import java.io.BufferedOutputStream;
import java.io.File;
import java.io.FileNotFoundException;
import java.io.FileOutputStream;
import java.io.IOException;
import java.io.OutputStream;
import java.io.PrintWriter;
import java.nio.charset.StandardCharsets;
import java.sql.SQLException;
import java.util.HashMap;
import java.util.Iterator;
import java.util.List;
import javax.servlet.RequestDispatcher;
import javax.servlet.ServletContext;
import javax.servlet.ServletException;
import javax.servlet.http.HttpServlet;
import javax.servlet.http.HttpServletRequest;
import javax.servlet.http.HttpServletResponse;
import javax.servlet.http.HttpSession;
import opennlp.tools.doccat.DoccatFactory;
import opennlp.tools.doccat.DoccatModel;
import opennlp.tools.doccat.DocumentCategorizerME;
import opennlp.tools.doccat.DocumentSample;
import opennlp.tools.doccat.DocumentSampleStream;
import opennlp.tools.util.InputStreamFactory;
import opennlp.tools.util.MarkableFileInputStreamFactory;
import opennlp.tools.util.ObjectStream;
import opennlp.tools.util.PlainTextByLineStream;
import opennlp.tools.util.TrainingParameters;
import org.apache.commons.fileupload.FileItem;
import org.apache.commons.fileupload.FileUploadException;
import org.apache.commons.fileupload.disk.DiskFileItemFactory;
import org.apache.commons.fileupload.servlet.ServletFileUpload;
/**
*
* @author harmony
*/
public class AdministratorController extends HttpServlet {
try {
// Proceed only when both the user name and the password are non-empty.
if (!"".equals(username) && !"".equals(password)) {
if (administrator_password.equals(password)) {
long longValueOfLastLogonForm = Long.parseLong(lastLogonForm);
sessionMap.put(sessionId, sessionData);
admin_login.updateAdministratorLastLogon(stringValueOfLastLogonForm, username);
request.setAttribute("sessionId", sessionId);
request.setAttribute("sessionFirstName", sessionFirstName);
request.setAttribute("sessionLastName", sessionLastName);
request.setAttribute("sessionUserName", sessionUserName);
RequestDispatcher rd =
request.getRequestDispatcher("/admin/administrator_dashboard.jsp");
rd.forward(request, response);
}
}
}
}
error.printStackTrace();
}
}
request.setAttribute("sessionId", sessionId);
request.setAttribute("sessionFirstName", sessionFirstName);
request.setAttribute("sessionLastName", sessionLastName);
request.setAttribute("sessionUserName", sessionUserName);
RequestDispatcher rd =
getServletContext().getRequestDispatcher("/admin/upload_training_file.jsp");
rd.forward(request, response);
}
fileFactory.setRepository(filesDir);
if (fileItem.isFormField()) {
if (name.equals("sessionId")) {
sessionId = value;
}
} else {
try {
fileItem.write(file1);
} catch (Exception ex) {
ex.printStackTrace();
}
}
}
}
trainModel(request, response);
request.setAttribute("sessionId", sessionId);
request.setAttribute("sessionFirstName", sessionFirstName);
request.setAttribute("sessionLastName", sessionLastName);
request.setAttribute("sessionUserName", sessionUserName);
RequestDispatcher rd =
getServletContext().getRequestDispatcher("/admin/training_successful.jsp");
rd.forward(request, response);
try {
model = DocumentCategorizerME.train("en",
sampleStream,TrainingParameters.defaultParams(),df);
//System.out.println("sampleStream variable passed to the DocumentCategorizerME and the value obtained is assigned to a model");
//System.out.println("model value is: " + model);
model.serialize(modelOut);
}catch(Exception e){
// Report training failures instead of silently swallowing them.
e.printStackTrace();
}
}
public void logout(HttpServletRequest request, HttpServletResponse response) throws
ServletException, IOException {
sessionMap.remove(sessionId);
RequestDispatcher rd = getServletContext().getRequestDispatcher("/admin/adminLogin.jsp");
rd.forward(request, response);
}
@Override
protected void doGet(HttpServletRequest request, HttpServletResponse response)
throws ServletException, IOException {
doPost(request, response);
}
@Override
protected void doPost(HttpServletRequest request, HttpServletResponse response)
throws ServletException, IOException {
try {
switch (administrator_action) {
case "administrator_login":
administratorLogin(request, response);
break;
case "go_to_upload_training_file":
goToUploadTrainingFile(request, response);
break;
case "upload_train_file":
uploadTrainFile(request, response);
break;
case "logout":
logout(request, response);
break;
/** case "go_to_add_room":
goToAddRoom(request, response);
break;
case "add_room":
addRoom(request, response);
break;**/
}
error.printStackTrace();
}
}
@Override
public String getServletInfo() {
return "Short description";
}
}
APPENDIX C
en-diseases.train
Malaria happens when a bite from the female Anopheles mosquito infects the body with Plasmodium
Malaria is a mosquito-borne infectious disease affecting humans and other animals caused by parasitic protozoans
Malaria transmission
Malaria infection
Diarrhea it is recommended that they continue to eat healthy food and babies continue to be breastfed
Diarrhea are also a common cause of malnutrition and the most common cause in those younger than five years of age
Diarrhea is defined by the World Health Organization as having three or more loose or liquid stools per day
Diarrhea is defined as an abnormally frequent discharge of semisolid or fluid fecal matter from the bowel
Diarrhea intestinal fluid secretion is isotonic with plasma even during fasting
Diarrhea occurs when too much water is drawn into the bowels
Hypertension may be associated with the presence of changes in the optic fundus seen by ophthalmoscopy
Hypertension with certain specific additional signs and symptoms may suggest secondary hypertension
Hypertension in pregnancy
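As described in Chapter Three, each line of the training file starts with the category, followed by the sample text. The sketch below shows one way such lines could be read and grouped, assuming that every sample occupies a single physical line. The class name `TrainingFileSketch` and the method `groupByCategory` are hypothetical. They are not part of OpenNLP or of the project code, and the sketch uses only the Java standard library.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

public class TrainingFileSketch {
    /**
     * Groups sample texts by category. The first whitespace-delimited word of
     * each non-empty line is taken as the category; the rest is the sample text.
     */
    public static Map<String, List<String>> groupByCategory(List<String> lines) {
        Map<String, List<String>> byCategory = new LinkedHashMap<>();
        for (String line : lines) {
            String trimmed = line.trim();
            if (trimmed.isEmpty()) {
                continue; // skip blank lines
            }
            String[] parts = trimmed.split("\\s+", 2);
            String category = parts[0];
            String text = parts.length > 1 ? parts[1] : "";
            byCategory.computeIfAbsent(category, k -> new ArrayList<>()).add(text);
        }
        return byCategory;
    }

    public static void main(String[] args) {
        List<String> lines = Arrays.asList(
                "Malaria transmission",
                "Malaria infection",
                "Diarrhea occurs when too much water is drawn into the bowels",
                "Hypertension in pregnancy");
        Map<String, List<String>> grouped = groupByCategory(lines);
        // Prints [Malaria, Diarrhea, Hypertension]
        System.out.println(grouped.keySet());
        // Prints [transmission, infection]
        System.out.println(grouped.get("Malaria"));
    }
}
```

Grouping the samples this way gives a quick check of how many training examples each category has before a model is trained, since categories with very few samples are harder to classify.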