Epi Info
Abstract
Background: The Epi-Info software suite, built and maintained by the Centers for Disease Control and Prevention
(CDC), is widely used by epidemiologists and public health researchers to collect and analyze public health data,
especially in the event of outbreaks such as Ebola and Zika. As it exists today, Epi-Info Desktop runs only on the
Windows platform, and the larger Epi-Info Suite of products consists of separate codebases for several different
devices and use-cases. Software portability has become increasingly important over the past few years as it offers a
number of obvious benefits. These include reduced development time, reduced cost, and simplified system
architecture. Thus, there is a clear need for continued research. Specifically, it is critical to fully understand any
underlying negative performance issues which arise from platform-agnostic systems. Such understanding should
allow for improved design, and thus result in substantial mitigation of reduced performance. In this paper, we present
a viable cross-platform architecture for Epi-Info which solves many of these problems.
Results: We have successfully generated executables for Linux, Mac, and Windows from a single code-base, and we
have shown that performance need not be completely sacrificed when building a cross-platform application. This has
been accomplished by using Electron as a wrapper for an AngularJS app, a Python analytics module, and a local,
browser-based NoSQL database.
Conclusions: Promising results warrant future research. Specifically, the design allows for cross-platform
form-design, data-collection, offline/online modes, scalable storage, automatic local-to-remote data sync, and fast
analytics which rival more traditional approaches.
Keywords: Cross-platform, Form-design, Analytics, Public-health, NoSQL, Electron, Data-collection
separate applications, codebases and use-cases including desktop, mobile, web, and cloud. This has resulted in an
unfortunate increase in development complexity. Outbreaks can often spread faster than engineers can keep up. It is
not uncommon for new analytics components or data-collection tools to be requested by public health teams on the
ground during highly active outbreak situations. This on-the-fly requirements specification and engineering can be
difficult to manage together with complicated codebases. Third, the existing interfaces for offline data-collection
and maintenance protocols are not altogether intuitive. The processes for importing or broadcasting between remote
servers and local client machines may present steep learning curves for public health officials.
In this work, we propose and implement a new cross-platform architecture for the Epi-Info software suite, which can
simplify the codebases, expedite the development process, and incorporate open-source techniques for flexible
interfaces. The proposed architecture adopts Electron [7] as the cross-platform framework to achieve a significant
reduction in development time and cost. Open-source techniques in NoSQL and Python are also introduced into
Epi-Info. NoSQL, as a viable database option, can scale extremely well and provide a flexible structure to otherwise
unstructured data. Python [8] has emerged as a very popular language for data analytics and has become the coding
language of choice for many in the science community. Its robust statistical libraries and machine learning
frameworks make it a suitable choice for Epi-Info. In addition, the ease-of-use and platform universality of Python
can greatly reduce the development time of any new modules in the event of an emergency or outbreak.
A Web-Based Form-Designer is not currently part of the existing Epi-Info suite, but such a product has been needed
for some time [2]. One challenge here is finding a balance between flexibility, speed, and ease-of-use with respect
to the form design process. We propose to use AngularJS [9]. Even though AngularJS does not natively support
drag & drop functionality, we develop a back-to-front design methodology to ensure a user-friendly, yet effective
form-designer.

Implementation
We present a cross-platform system architecture which allows for intuitive form-design, data-collection, online and
offline modes, automatic local-remote data sync, fast analytics, and scalable storage. The overall system
architecture is shown in Fig. 1.
The Epi-Info deployment consists of

• A server-side CouchDB database which stores shared form templates, data, and other user information such as
  dashboards etc. This data is exposed as RESTful Web Services to the clients, and
• Multiple clients, each equipped with an AngularJS application that provides all the functionality of Epi-Info
  including form design, deployment of forms, data collection, user dashboards, and on-demand analytics. Each
  client stores its data in PouchDB, which is automatically synchronized with the CouchDB on the server. The client
  also includes a Python/Flask module that provides access to a large set of analytics functionality. All of the
  client is encapsulated with Electron, a platform-independent application framework.

Client Side Architecture
To address the cross-platform system requirement, we use Electron as a wrapper for an AngularJS front-end, a Python
Analytics module, and an embedded NoSQL database called PouchDB.
The database accessibility protocol was an important design consideration. PouchDB is a lightweight, browser-based
NoSQL database which is designed to automatically sync with a remote CouchDB database (Fig. 2). However, the proper
access point was not immediately obvious. As shown in Fig. 1, PouchDB is accessed directly by the Angular front-end.
Importantly, this configuration was chosen because the alternative approach, whereby the database is accessed
directly by the Python Analytics module, would have required the use of a Python-PouchDB wrapper. The documentation
for the wrapper is very light, and it has been much easier to use the original APIs for database interaction. Any
data needed by the Python Analytics module can be requested and sent via a simple HTTP connection.
The Flask framework manages the Python code. When the Electron application is initiated, a child process is spawned
which starts the Flask server, allowing access to the Python analytics module. This has proved successful, and it has
allowed us to seamlessly integrate Angular and Python in a single, local application. When an analytics request is
made, for example, the data is simply re-routed to the appropriate Python function via the HTTP connection.
Python was chosen because of its popularity and platform-agnosticism. It is critical that researchers from around
the world be allowed to contribute to this project in a timely way. This can be facilitated by offering a platform
composed of tools which are popular and universal.

NoSQL and PouchDB
NoSQL databases have been one of our primary areas of research to date. They are understood to scale extremely well
because they are well-suited to provide a flexible structure to otherwise unstructured data. That fact has proven
helpful when storing Epi-Form schemas. However,
Camp et al. BMC Bioinformatics 2018, 19(Suppl 11):359 Page 55 of 67
Fig. 1 System Architecture - Local PouchDB clients sync automatically with a central CouchDB cloud server, allowing
for seamless online-offline transition
Fig. 2 Client Side Architecture - An AngularJS app, Python Analytics Module, and local PouchDB
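As an illustration of the request routing on the client side (Fig. 2), the following is a minimal sketch and not the authors' implementation. It assumes Python's built-in http.server in place of Flask, so that it runs without third-party packages; the ROUTES table, the route names (/mean, /count), the payload shape, and the port number are all hypothetical.

```python
import json
from http.server import BaseHTTPRequestHandler, HTTPServer
from statistics import mean


def mean_route(payload):
    """Hypothetical analytics routine: mean of the posted values."""
    return {"mean": mean(payload["values"])}


def count_route(payload):
    """Hypothetical analytics routine: number of posted values."""
    return {"count": len(payload["values"])}


# Assumed mapping from request path to analytics function.
ROUTES = {"/mean": mean_route, "/count": count_route}


def dispatch(path, payload):
    """Re-route a decoded JSON payload to the function registered for path."""
    if path not in ROUTES:
        raise KeyError(f"unknown analytics route: {path}")
    return ROUTES[path](payload)


class AnalyticsHandler(BaseHTTPRequestHandler):
    """Accepts JSON via HTTP POST and returns the analytics result as JSON."""

    def do_POST(self):
        if self.path not in ROUTES:
            self._send(404, {"error": "unknown analytics route"})
            return
        length = int(self.headers.get("Content-Length", 0))
        payload = json.loads(self.rfile.read(length))
        self._send(200, dispatch(self.path, payload))

    def _send(self, status, body):
        data = json.dumps(body).encode("utf-8")
        self.send_response(status)
        self.send_header("Content-Type", "application/json")
        self.send_header("Content-Length", str(len(data)))
        self.end_headers()
        self.wfile.write(data)


def serve(port=5000):
    """Run the local server until interrupted (the port is an assumed value)."""
    HTTPServer(("127.0.0.1", port), AnalyticsHandler).serve_forever()
```

In such an arrangement the front-end would POST JSON to a path like /mean on the local port and read the JSON reply. A desktop wrapper could start serve() in a child process at launch, as the text describes for the Flask server.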
we have identified several other database-related issues which require careful consideration.
It was necessary to choose an appropriate candidate to be embedded with our Electron application. This is critical
because larger NoSQL databases, like MongoDB, require different installation protocols for different operating
systems. Recall once again that our primary objective is a platform-agnostic application that is extremely
user-friendly and requires very little effort to download and install. Thus, our approach has been to embed a
lightweight NoSQL database within our Electron desktop application.
After research, PouchDB was selected as the NoSQL database, and we consider it to be a viable option going forward.
Its robust documentation, community support, and seamless synchronization with CouchDB make it very attractive.
Furthermore, it is easy to embed, and can be interfaced directly with the Angular frontend. Specifically, PouchDB is
designed to sync automatically with a remote CouchDB database. This allows for seamless transition between online
and offline modes, and guards against the potential for data-loss during transfer.
As a result of this auto-sync, any underlying changes to data on local client machines can be automatically
broadcast to a centralized remote database, and subsequently on to any additional client machines. Furthermore,
PouchDB provides a detailed change-log which identifies and explains any alterations in local data or
data-structure.

Analytics Module
Epi-Info is essentially data-collection and analytics software. Consequently, the analytics module is perhaps the
most crucial component, and the primary objective was to increase speed and efficiency. In the following paragraphs,
we outline an approach which successfully mitigates the negative effects often found in cross-platform and NoSQL
systems.
It was important to precisely identify each point of data-transfer and manipulation in order to pinpoint any
potential bottlenecks (Fig. 3, Table 1).
In order to expedite the analytics cycle, we demonstrate drastic performance improvement by storing two copies of
any particular dataset. We keep one copy in PouchDB, so that it may be available for automatic syncing with the
central CouchDB. We keep another in a compressed format well supported in Python, called HDF5. Even on a slow
machine, the read and write times for HDF5 are extremely fast, better even than SQL or CSV. Additionally, the
excellent compression ratio means that even though we store the data twice, we increase the total storage-size
requirement by less than 10%.
As shown in Table 1, the most costly processes, with respect to time, involve retrieving the data from PouchDB,
sending the data to the analytics module, and converting the data to a usable DataFrame. This problem is exacerbated
if the database is allowed to accumulate a lot of data prior to carrying out these steps. Thus, it is possible to
mitigate such effects by performing the operations iteratively, whenever new data is entered into the database. The
compressed HDF5 DataFrame must be continuously maintained, allowing for immediate analytics requests at all times.
Fortunately, PouchDB comes equipped with a change-log which offers a detailed explanation of any changes to the
underlying data. This can be used to subsequently update the compressed DataFrame. The result is a system that
allows for very fast access to data and analytics which rivals even traditional approaches. Additional performance
metrics are provided in the “Results” section of this paper.

Form Designer
The challenge associated with building a web-based form designer is derived from a need to balance flexibility with
specificity. The current desktop form-designer provided by Epi-Info offers extreme precision, allowing
form-creators the ability to define form elements on a pixel-by-pixel basis. The form-schemas are then stored as
XML, and the exact positions of form elements are subsequently recorded. On the one hand, this is desirable because
health form appearance often requires such acute attention to detail. On the other, this can cause a large increase
in design time. With our web-based AngularJS form designer, we strike a balance between the two characteristics,
offering users an acceptable level of precision while simultaneously expediting the form-design process with a
flexible and intuitive interface (Fig. 4).

Table 1 Approximate time-requirements for critical data-flow processes
   Process                                  Time (Data: 50k x 200) (300MB)
1  Retrieve data from PouchDB               1 minute
2  HTTP POST, send JSON data to Python      30 seconds
3  Convert JSON to Pandas DataFrame         30–45 seconds
4  Compress/Write JSON to HDF               <1 second (compresses to only 21MB)
5  Analytics                                varies, but fast
6  HTTP POST, return results to Angular     varies
Test Data: 50k records, by 200 features

Results
To date, we have successfully generated executables for Windows, Mac, and Linux machines from a single code base,
and we have shown that performance need not be completely sacrificed when building a cross-platform application.
Figure 5 shows a typical Epi-Info desktop workflow. Our product can successfully support: Form Design, Import/Export
of Forms to a centralized database, Data-Entry, Advanced Analytics, Savable and Customizable Advanced Analytics
Dashboards (Fig. 6), and Report Export.
By incorporating the use of a compressed HDF5 DataFrame, we have successfully demonstrated that we can expedite the
analytics cycle, thus mitigating many of the negative effects typically associated with cross-platform or NoSQL
applications. For a dataset with 50,000 records and 200 columns, the software can read the data, perform a
user-defined 10-variable multiple logistic regression, and report the results in under 2 s, even on modest machines.
Additionally, the use of multiple cores can further optimize the analytics module. This allows multiple analytics
requests to be made on-the-fly as needed. Reports are sent back to the user-interface as those jobs are completed.
That is, any single request need not wait for a previous job to finish as long as there is another core available
for use on the machine (Fig. 7).

Additional Considerations
The design presented in this paper should be regarded as one acceptable approach with respect to the aforementioned
requirements. However, numerous other
adopted technologies, and it shows the power and need for additional software standardization.
Importantly, new technologies must first be examined and evaluated based on the quality of community support.
Public health is a critical domain, and it would not be wise to experiment imprudently with untested and
weakly-supported tools. This was discovered first hand by our researchers, particularly when tasked with choosing an
acceptable local NoSQL database. Many newer, lightweight NoSQL databases have very shallow documentation and
community support. ForerunnerDB was one such product which ultimately proved impractical.
Fig. 7 High level view of step-by-step analytics optimization using multiple cores
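The multi-core behaviour summarized in Fig. 7, where independent analytics requests run side by side and each report is returned as soon as its job completes, can be pictured with the following minimal sketch. It is an illustration under stated assumptions rather than the authors' code: summarize is a hypothetical stand-in for a real analytics routine, and run_jobs, its parameters, and the (name, values) job format are invented for this example.

```python
import concurrent.futures as cf
from statistics import mean, pstdev


def summarize(name, values):
    """Hypothetical analytics job: summary statistics for one variable."""
    return {
        "variable": name,
        "n": len(values),
        "mean": mean(values),
        "sd": pstdev(values),
    }


def run_jobs(jobs, executor_cls=cf.ProcessPoolExecutor, max_workers=None):
    """Yield each job's report as soon as that job finishes.

    jobs: iterable of (name, values) pairs, one per analytics request.
    executor_cls: pool type; a process pool spreads jobs across CPU cores.
    max_workers: worker count; None lets the pool choose a default.
    """
    with executor_cls(max_workers=max_workers) as pool:
        futures = [pool.submit(summarize, name, values) for name, values in jobs]
        # as_completed yields futures in finishing order, not submission order,
        # so a short job never waits behind a longer one submitted earlier.
        for future in cf.as_completed(futures):
            yield future.result()
```

In a script, a loop such as `for report in run_jobs(jobs): ...` should sit under an `if __name__ == "__main__":` guard, which process pools need on platforms that start workers by spawning a fresh interpreter. Passing `executor_cls=cf.ThreadPoolExecutor` gives the same interface without separate processes, which can be convenient for testing.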
Abbreviations
CDC: Centers for Disease Control and Prevention; HDF5: Hierarchical Data
Format (HDF) is a set of file formats (HDF4, HDF5) designed to store and
organize large amounts of data; NoSQL: Non-relational database