GPU-accelerated CellProfiler
Abstract—CellProfiler excels at bridging the gap between advanced image analysis algorithms and scientists who lack computational expertise. However, it lacks the high-performance capabilities needed for High Throughput Imaging experiments, where workloads reach hundreds of terabytes of data and are computationally very demanding. In this work, we introduce a GPU-accelerated CellProfiler in which the most time-consuming algorithmic steps are executed on Graphics Processing Units. Experiments on a benchmark dataset showed significant speedups over both single- and multi-core CPU versions. The overall execution time was reduced from 9.83 days to 31.64 hours.

Index Terms—CellProfiler, High Throughput Imaging, Graphics Processing Units, High Performance Computing.
I. INTRODUCTION

High Throughput Imaging (HTI) is a key process in chemistry and biology in which thousands of experimental samples (such as chemical compounds, amino acids, or live cells) are subjected to simultaneous testing under given conditions. HTI is widely used for drug discovery, target validation, and the identification of genes or proteins that modulate a particular biological pathway. Like many other scientific experiments in the biology and pharmaceutical industries, HTI involves acquiring images and analyzing them, either by simple visual inspection or with advanced image analysis software.

With the development of laboratory robotics that automate sample preparation and handling, biologists can easily generate and use very large image sets from HTI. Hence, the demand for less tedious and more quantitative tools to analyze the resulting high-content images keeps growing [1]–[4]. Among the several existing commercial and free software packages, we have used CellProfiler (CP) [6], an open-source software package for quantifying data from biological images, in particular images from high-throughput experiments. CP provides researchers with a variety of biological analyses, including standard assays (such as cell count, size, and per-cell protein levels) and complex morphological assays (such as cell/organelle shape or sub-cellular patterns of DNA or protein staining) [9]. Furthermore, it has a large user base, is cited in more than a thousand papers, and is validated for a wide variety of biological applications. It also won the 2009 Bio-IT World Best Practices Award in IT & Informatics.
With the already increasing rate of possible assays (HTI robots capable of testing up to 100,000 compounds per day currently exist) and the growing size and resolution of the resulting images, identifying objects of interest and measuring their properties is becoming computationally intensive. Performing analysis on a large number of images from independent samples is in theory an embarrassingly parallel problem that can be sped up by allocating more compute nodes, but this may be too costly or unfeasible for some problem sizes. For example, as part of our work on re-purposing HTI assays [23], we had to run a CP pipeline that extracts 1400 features per cell on 7.5 million images; the run lasted 14 days on 16 compute nodes (each with 24 cores) and needed 12.5 terabytes (TB) of storage for the input images plus 15 more TB for the extracted features. To reduce the runtime to 24 h, we would need 224 nodes (or 5376 cores). For 50 TB of data (the size of a second dataset we needed to process), this would increase to 896 nodes (or 21504 cores). This also exceeds the limits of readily available cloud computing solutions; for example, Amazon Elastic Compute Cloud requires a special application for runs with over 20 nodes [17]. These estimates are optimistic, as they assume immediate access to the cluster and ignore the cost of spinning up nodes or loading data.
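These node counts follow from simple work-conservation arithmetic on the figures above: the first run consumed a fixed budget of node-days, and the second dataset is four times larger (50 TB vs. 12.5 TB):

\[
14\,\text{days} \times 16\,\text{nodes} = 224\,\text{node-days}
\;\Rightarrow\;
\frac{224\,\text{node-days}}{1\,\text{day}} = 224\,\text{nodes}\ (= 224 \times 24 = 5376\,\text{cores}),
\]
\[
4 \times 224 = 896\,\text{nodes}\ (= 896 \times 24 = 21504\,\text{cores}).
\]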
Therefore, Graphics Processing Units (GPUs) appear to be a good candidate for reducing the aforementioned cost. The execution pipeline is indeed the same for all input images, which suits Single Instruction Multiple Data (SIMD) programming models such as those of GPUs. Moreover, GPUs are historically known to perform well on image processing workloads.

In this paper, we present the design and implementation of an accelerated version of CellProfiler on Graphics Processing Units (GPUs). The GPU optimizations were incorporated into the existing library transparently to the user, using the same programming language for portability. The focus was put on very time-consuming and commonly used modules of CP. The remainder of the paper is organized as follows: Section II gives an overview of CellProfiler together with an analysis of its major time-consuming components. Section III lists the CPU optimizations introduced to CellProfiler. Section IV describes the architecture of Graphics Processing Units. The GPU-accelerated CellProfiler is detailed in Section V. Experimental settings and results are described in Section VI. In Section VII, we provide a summary of the work and directions for future research.
II. OVERVIEW OF CELLPROFILER

Running an image analysis project in CellProfiler [6] can be done through the graphical user interface (GUI) or in a headless batch mode. In both cases, a pipeline has to be defined, listing the modules to be executed together with their associated settings. Each module performs a specific processing task on the images using the given parameters, and many modules (over 50) are available. As shown in the high-level pseudo code of Algorithm 1, every image travels through the whole pipeline and is processed sequentially by each module before any other image can be processed. Not only must all the modules be executed in order, but the complete pipeline is also executed for one image at a time. All the computed measurements are temporarily saved for each image in a workspace, which is written to a file or displayed on the GUI when the saving module is reached. This execution model is redesigned in Section V in order to expose the data parallelism needed by the GPU.
Data: I, a set of input images; M, a list of modules
for each image i in the input image set I (1) do
    Create workspace W_i
    for each module m in the input pipeline M (2) do
        Execute module m on image i
        Save the measurements in W_i
        if m is the saving module then
            Save the output
            Delete workspace W_i
        end
    end
end
Algorithm 1: General skeleton of a pipeline in CellProfiler.
Table I shows the results of an empirical profiling run on a server machine with an Intel Xeon E5-2690 v2 3.0 GHz CPU (2 sockets, 10 cores per socket) and 32 GB of RAM, running 64-bit Arch Linux (kernel 4.16.12) as operating system. Each instance of CellProfiler runs on a single core. Detailed timings of the most time-consuming modules, namely the MeasureObjectSizeShape (MOSS) and MeasureTexture (MT) modules of CellProfiler, are reported. The main core functions are shown on the left of the table, and their detailed timings on the right. The timings are given for one image and are averaged over 16 images with 3 channels (stainings) per image, at a resolution of 2498 × 2098.
We can observe that, on average, MeasureObjectSizeShape and MeasureTexture take 114.02 s and 289.52 s, respectively, to execute on a single image. Extrapolating this to a full plate with 384 wells and 6 field images per well, it would take approximately 69.18 h for MeasureObjectSizeShape and 185.29 h for MeasureTexture. Our dataset of 744 plates makes this even more impractical, with 2144.58 days and 5744.08 days respectively. Although those are extrapolated timings for a single core, even on a perfectly scalable cluster of 32 nodes with 18 cores per node, it would still take about 3.72 days and 9.97 days to execute. It is clear that both CellProfiler modules do not scale to larger datasets.

TABLE I
AVERAGE PROFILING TIMES FOR THE MeasureObjectSizeShape AND MeasureTexture MODULES OF CELLPROFILER. THE LEFT COLUMN SHOWS THE INDIVIDUAL CORE FUNCTIONS FOR BOTH MODULES. THE MIDDLE AND RIGHT COLUMNS SHOW THEIR RESPECTIVE TIMINGS FOR THE ORIGINAL PYTHON IMPLEMENTATION AND THE CPU-OPTIMIZED ONE.

                                 Original (Python)     CPU Optimized (C++)
Module / core function           Time      %           Time      Speedup
MeasureObjectSizeShape           114.02s   100.00%     67.16s    1.6
  zernike                         79.41s    69.65%     34.07s    2.3
  distance to edge                15.14s    13.28%     14.51s    1.0
  calculate extents                7.01s     6.15%      6.74s    1.0
  other                           12.44s    10.91%     11.84s    1.1
MeasureTexture                   289.52s   100.00%     44.61s    6.5
  run one gabor                  159.11s    54.96%     21.55s    7.4
    gabor                        111.28s    38.44%     18.08s    6.2
    norm per object               30.94s    10.69%      0.98s    31.6
    sum                           15.96s     5.51%      1.83s    8.7
    other                          0.93s     0.32%      0.66s    1.4
  run image gabor                 78.56s    27.13%     11.60s    6.8
    gabor                         76.19s    26.32%      9.48s    8.0
    other                          2.37s     0.82%      2.12s    1.1
  run one haralick                38.06s    13.15%      7.65s    5.0
    haralick                      37.07s    12.80%      6.66s    5.6
    other                          0.99s     0.34%      0.99s    1.0
  run image haralick              13.79s     4.76%      3.81s    3.6
    haralick                      13.59s     4.69%      3.62s    2.4
    other                          0.20s     0.07%      0.19s    1.1
  other                            0.00s     0.00%      0.00s    1.0
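For instance, the MeasureTexture extrapolation above follows directly from the per-image timing in Table I, with 384 × 6 = 2304 images per plate:

\[
289.52\,\text{s} \times 2304 \approx 667{,}054\,\text{s} \approx 185.29\,\text{h},
\qquad
185.29\,\text{h} \times \frac{744\,\text{plates}}{24\,\text{h/day}} \approx 5744\,\text{days}.
\]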
A. The MeasureTexture Module

The MeasureTexture module measures the degree and nature of textures within objects. As shown in the pseudo code of Algorithm 2, it implements the Gabor filter [12], which analyzes whether the image contains frequency (scale) content in a specific direction (angle) within a localized region around the point or region of analysis. The MT module also computes Haralick features [11], which quantify the spatial variation of grey-tone values within an image.

If we closely examine the profiling results of the MT module in Table I, we can clearly observe that the Gabor filter [12] takes a considerable amount of time. Its execution is spread over two core functions and takes in total 187.47 s, or 64.76% of the entire module. The second prominent bottleneck is the Haralick [11] feature extractor, whose cumulated time for the entire module is 50.66 s (17.49%).
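For reference, a commonly used form of the 2-D Gabor kernel evaluated at each pixel is the following (a standard definition; the exact parameterization used by CellProfiler may differ slightly):

\[
g(x, y; \lambda, \theta, \sigma, \gamma) =
\exp\!\left(-\frac{x'^2 + \gamma^2 y'^2}{2\sigma^2}\right)
\exp\!\left(i\,\frac{2\pi x'}{\lambda}\right),
\qquad
x' = x\cos\theta + y\sin\theta,\quad
y' = -x\sin\theta + y\cos\theta,
\]

where θ is the orientation (the angle loop in Algorithm 2), λ the wavelength (related to the scale loop), σ the width of the Gaussian envelope, and γ the spatial aspect ratio.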
B. The MeasureObjectSizeShape Module

Given an image with identified objects (e.g., nuclei or cells), the MeasureObjectSizeShape module extracts area and shape features of each object that lies completely inside the image borders. Extracted features include simple metrics, such as area, volume, and solidity, as well as more complex metrics, such as the Euler number, the min/max Feret diameter, and Zernike shape features. As mentioned before, most of the compute time spent in this module goes to calculating the Zernike moments (79.41 s, or 69.65%, in our example in Table I). This metric describes an object in a basis of Zernike polynomials, using the coefficients as features. Zernike polynomials from order 0 to order 9 are calculated, giving in total 30 measurements per object. The structure of the module is illustrated in the high-level pseudo code of Algorithm 3.
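For reference, the Zernike moment of order n with repetition m for an object image f mapped onto the unit disk is commonly defined as [13] (a standard definition; the paper does not spell out its exact normalization):

\[
Z_{nm} = \frac{n+1}{\pi} \iint_{x^2 + y^2 \le 1} f(x, y)\, V_{nm}^{*}(x, y)\, dx\, dy,
\]

where V_{nm} is the Zernike polynomial of degree n and angular dependence m, and the magnitudes |Z_{nm}| serve as rotation-invariant shape features. Counting the valid (n, m) pairs with n − m even and 0 ≤ m ≤ n for n = 0, …, 9 gives exactly the 30 measurements per object mentioned above.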
Data: G, a list of image groups; S, a list of image scales; A, a list of angles; O, a list of objects; labels, an array of labels
for each group g in G do
    for each scale s in S do
        if Compute Gabor is True then
            Run image Gabor()
        end
        for each angle a in A do
            Run image Haralick()
        end
        for each object o in O do
            for each angle a in A do
                Run one Haralick(labels)
            end
            if Compute Gabor is True then
                Run one Gabor(labels)
            end
        end
    end
end
Algorithm 2: Pseudo code of the MeasureTexture module in CellProfiler.
Data: O, a list of objects; labels, an array of labels
for each object o in O do
    Measure area()
    Measure perimeter()
    Measure solidity()
    Measure extent()
    Measure eulernumber()
    Measure center X Y()
    Measure max radius()
    Measure mean radius()
    Measure min max feretdiameter()
    for each index i in Zernike indexes do
        Measure zernike(labels)
    end
end
Algorithm 3: Pseudo code of the MeasureObjectSizeShape module in CellProfiler.
III. CPU OPTIMIZATIONS

The original CellProfiler [9] code is written in Python 2.x. Although it is very effective and flexible for image processing on cell data, it is not optimized for runtime performance; as shown in the previous section, it contains some major performance bottlenecks. Here we show that by refactoring the core functionality we can drastically decrease the amount of calculation. In addition, while the vectorized nature of Python code allows for clean and fast array operations, it is not cache friendly when it requires multiple iterations over the same data. To illustrate the gains in execution time obtainable with minor CPU optimizations, we have refactored and re-implemented the code of both modules in C++.

The proposed CPU improvements to the Gabor filter are threefold. First, in the original approach, similar Gabor calculations are repeated for distinct angles of the Gabor filter; however, most of the time-consuming calculations are independent of those angles and can be calculated only once. Second, a large portion of the execution time is spent on calculating sines and cosines. To address this, we have constructed a precomputed lookup table: each required sine or cosine is obtained by linearly interpolating values in this table. Using a lookup table comes at the cost of reduced accuracy; adding more precomputed entries yields more accurate approximations. In our experiments, we observed a maximum relative change of about 2.80e-5 compared to the original results. Lastly, we implemented the full Gabor filter in C++. The average speedup we obtain for the optimized Gabor filter ranges between 6 and 8, depending on the number of cells present in the image. In terms of cache efficiency, which is essential for scaling purposes, we obtain 21× fewer loads, 3× fewer stores, 12× fewer load misses, and 4× fewer store misses.
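Listing 1 gives a minimal sketch of such an interpolated trigonometric lookup table (illustrative only; the class name and table size are ours, not those of the actual implementation):

#include <cmath>
#include <vector>

// Precomputed sine table over [0, 2*pi); cosine is read with a
// quarter-period phase offset. The table size trades memory for accuracy.
class TrigTable {
public:
    explicit TrigTable(std::size_t size = 4096)
        : table_(size + 1), step_(2.0 * M_PI / size) {
        for (std::size_t i = 0; i <= size; ++i)
            table_[i] = std::sin(i * step_);
    }

    // Linear interpolation between the two nearest table entries.
    double sin(double x) const {
        double t = std::fmod(x, 2.0 * M_PI);
        if (t < 0) t += 2.0 * M_PI;
        double pos = t / step_;
        std::size_t i = static_cast<std::size_t>(pos);
        double frac = pos - i;
        return table_[i] + frac * (table_[i + 1] - table_[i]);
    }

    double cos(double x) const { return sin(x + M_PI / 2.0); }

private:
    std::vector<double> table_;
    double step_;
};

Listing 1: Sketch of a sine/cosine lookup table with linear interpolation.

With 4096 entries, the worst-case interpolation error is on the order of (2π/4096)²/8 ≈ 3e-7, comfortably below the 2.80e-5 relative change reported above; a smaller table trades accuracy for memory.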
The second main improvement concerns the processing of Haralick features. Here, we observed that most of the processing time is spent in a per-cell normalization step, which uses a generic and complex minimum/maximum operator in the original Python code. By implementing only what the Haralick feature detector specifically needs, we gain a speedup of about 2.4 to 2.7.
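As an illustration of this kind of specialization, Listing 2 computes per-object minima and maxima over the label image in a single pass (our own simplified stand-in for the generic operator; all names are hypothetical):

#include <cstdint>
#include <limits>
#include <vector>

// One pass over the image: for every labelled object, track the
// minimum and maximum pixel intensity. Label 0 is background.
struct MinMax {
    float min = std::numeric_limits<float>::max();
    float max = std::numeric_limits<float>::lowest();
};

std::vector<MinMax> per_object_min_max(const std::vector<float>& pixels,
                                       const std::vector<std::int32_t>& labels,
                                       std::int32_t num_objects) {
    std::vector<MinMax> out(num_objects + 1);  // indexed by label id
    for (std::size_t i = 0; i < pixels.size(); ++i) {
        std::int32_t l = labels[i];
        if (l == 0) continue;                  // skip background
        MinMax& mm = out[l];
        if (pixels[i] < mm.min) mm.min = pixels[i];
        if (pixels[i] > mm.max) mm.max = pixels[i];
    }
    return out;
}

Listing 2: Sketch of a specialized per-object min/max pass.

A single fused pass like this touches each pixel exactly once, which is also where the cache-efficiency gains come from.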
A last improvement is the optimization of the Zernike moments, a key part of the MeasureObjectSizeShape module. The Zernike polynomials are constructed from similar exponentiation operations. To reduce the amount of calculation, we implemented an on-the-fly caching mechanism for these exponents, resulting in a speedup factor of about 2.2. The cache efficiency is improved by 16× fewer loads, 10× fewer stores, 10× fewer load misses, and 33× fewer store misses.
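Listing 3 sketches such on-the-fly memoization of radial powers (our own simplification; an actual implementation may cache whole power arrays per object, but the idea is the same):

#include <vector>

// Caches rho^k for k = 0..max_order at one sample point, so that every
// Zernike polynomial evaluated at that point reuses the cached powers
// instead of recomputing them from scratch.
class PowerCache {
public:
    explicit PowerCache(int max_order) : powers_(max_order + 1) {}

    // Refill the cache for a new radius rho: one multiplication per order.
    void reset(double rho) {
        powers_[0] = 1.0;
        for (std::size_t k = 1; k < powers_.size(); ++k)
            powers_[k] = powers_[k - 1] * rho;
    }

    double pow(int k) const { return powers_[k]; }

private:
    std::vector<double> powers_;
};

Listing 3: Sketch of an on-the-fly cache for the radial exponents.

When evaluating all Zernike polynomials up to order 9 at a pixel, reset() is called once and every radial term is then assembled from cached powers.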
The joint speedup of the optimized approach combining the three improvements above is shown on the right of Table I: the MeasureObjectSizeShape module reaches a speedup factor of approximately 1.6 and the MeasureTexture module a factor of approximately 6.5.
IV. GRAPHICS PROCESSING UNITS

Graphics Processing Units are at the leading edge of many-core parallel computational platforms in several research fields. GPUs are especially well suited for fine-grained, data-parallel computations consisting of thousands of independent threads executing the same program concurrently. They excel at running programs on many data elements in parallel with a high ratio of arithmetic operations to memory operations.

Fig. 1. Illustration of the GPU programming model.

A GPU pipeline starts from the CPU by calling a function, named a kernel, that is executed on the GPU device N times in parallel by N threads organized into thread blocks and grids of thread blocks. Each thread within a thread block executes an instance of the kernel and has a thread identifier within its block. Threads are partitioned into groups called warps, whose execution is scheduled following a time-sharing strategy. A thread block is a set of concurrently executing threads that can cooperate through barrier synchronization and shared memory; it has a block identifier within its grid. A grid is an array of thread blocks that execute the same kernel, read inputs from global memory, write results to global memory, and synchronize between dependent kernel calls. Active GPU threads have access to several memory spaces with different characteristics that reflect their distinct usages; these include global, local, shared, and texture memory, and registers. For further information about GPU programming and the GPU memory hierarchy, please refer to [15] as an entry point to the literature.
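To make this terminology concrete, Listing 4 shows a minimal CUDA kernel and its launch (a generic example, not code from the GPU-accelerated CellProfiler itself):

#include <cuda_runtime.h>

// Kernel: each of the N threads processes one pixel.
__global__ void scale_pixels(const float* in, float* out, int n, float s) {
    // Global thread identifier from block id, block size and thread id.
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) out[i] = s * in[i];  // guard against overshooting n
}

void scale_on_gpu(const float* host_in, float* host_out, int n, float s) {
    float *d_in, *d_out;
    cudaMalloc(&d_in, n * sizeof(float));
    cudaMalloc(&d_out, n * sizeof(float));
    cudaMemcpy(d_in, host_in, n * sizeof(float), cudaMemcpyHostToDevice);

    // Launch a grid of thread blocks with 256 threads per block.
    int threads = 256;
    int blocks = (n + threads - 1) / threads;
    scale_pixels<<<blocks, threads>>>(d_in, d_out, n, s);

    cudaMemcpy(host_out, d_out, n * sizeof(float), cudaMemcpyDeviceToHost);
    cudaFree(d_in);
    cudaFree(d_out);
}

Listing 4: A minimal CUDA kernel and host-side launch.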
V. GPU-ACCELERATED CELLPROFILER

Data: I, a set of input images; M, a list of modules
for each module M in the input pipeline (2) do
    for each image I in the input image set (1) do
        if M is MT or M is MOSS then
            if the GPU limit is not reached then
                Push I into a pool P
            else
                Run M on the GPU for all images in pool P
                Save the measurements of each pooled image I in its workspace W_I
            end
        else
            Run module M on the CPU for I
            Save the measurements in W_I
        end
    end
    if M is the saving module then
        Write all measurements to file
        Delete all workspaces
    end
end
Algorithm 4: High-level main pipeline of CP as rewritten for the GPU version.

For each input image I, a workspace W_I is created where all its measurements are saved. If the current module M is the MT module or the MOSS module, execution happens on the GPU side. The program checks whether the limits of the GPU would be reached, i.e., whether the number of threads to be launched and the amount of consumed memory would exceed the hardware characteristics of the underlying device. This assessment is done automatically, without any user intervention. If the limits of the GPU are not exceeded, the image is pushed into a pool of images to be processed in the next iterations. If the GPU limits are reached, or all images have been analyzed, module M is executed on the GPU and the resulting measurements are saved into the workspace W_I of each image I. If M is neither MT nor MOSS, it is executed on the CPU, and here as well the resulting measurements are saved into W_I. Finally, if M is the saving module, all measurements are saved to a file or displayed on the GUI.
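The pooling test of Algorithm 4 can be sketched as in Listing 5 (illustrative only; the paper does not detail the exact bookkeeping, and cudaMemGetInfo is one standard way to query free device memory):

#include <cuda_runtime.h>
#include <cstddef>

// Illustrative only: decide whether one more image fits in the current
// GPU batch, based on free device memory and a per-launch thread budget.
struct BatchState {
    std::size_t bytes = 0;     // memory the pooled images will need
    std::size_t threads = 0;   // threads the pooled images will launch
};

bool fits_on_gpu(const BatchState& batch,
                 std::size_t img_bytes, std::size_t img_threads,
                 std::size_t thread_budget) {
    std::size_t free_bytes = 0, total_bytes = 0;
    cudaMemGetInfo(&free_bytes, &total_bytes);  // query free device memory
    return batch.bytes + img_bytes <= free_bytes &&
           batch.threads + img_threads <= thread_budget;
}

Listing 5: Sketch of the automatic GPU-limit check that controls image pooling.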
VI. EXPERIMENTAL SETTINGS AND RESULTS

A. Experimental settings

Benchmarks: The dataset used in this paper consists of high-content images taken from plates with 384 wells (16 rows × 24 columns). Every well has 6 fields and is stained with 3 chemicals (referenced as image groups in Algorithms 1–5). The resulting images are sized 2498 × 2098. The proprietary pipeline used extracts 1400 features per cell.
B. Experimental results

In Table II, the results of executing the MeasureTexture module on the GPU are listed for a whole plate (2304 images). The reported times for One Gabor GPU, Image Gabor GPU, One Haralick GPU, and Image Haralick GPU include the transfer of the data from the CPU, the execution of the kernel, and bringing the data back to the CPU. The pre-processing time refers to the time elapsed in constructing the data for the GPU; the post-processing time refers to the time elapsed in assigning the results to the associated images. These steps are executed on the CPU side but are compulsory for the GPU, hence we include them in the GPU evaluation. These two latter timings cover both the Gabor and Haralick filters: indeed, as mentioned in Section V-A, part of the optimization consisted of fusing loops on the CPU and reusing the data structures that already live in GPU memory. The overall improvement for the whole plate between the single-core original version of the Python code and the GPU one is about 8.21×. The original code ran for 185.29 hours (∼7.72 days) for the whole plate; using GPU acceleration, this has been reduced to 22.56 hours.

TABLE II
COMPARING EXECUTION TIME (IN SECONDS) OF THE MeasureTexture MODULE ON CPU AND GPU.

                      GPU            CPU
One Gabor             38180.352      366589.44
One Haralick           2769.6         87690.24
Image Gabor           17030.4        181002.24
Image Haralick         1230.432       31772.16
Pre-processing         3082.56        –
Post-processing       18944.448       –
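The 8.21× figure follows from the column totals of Table II:

\[
\frac{366589.44 + 87690.24 + 181002.24 + 31772.16}
     {38180.35 + 2769.60 + 17030.40 + 1230.43 + 3082.56 + 18944.45}
= \frac{667054.1\,\text{s}}{81237.8\,\text{s}} \approx 8.21,
\]

i.e., 185.29 hours on the CPU versus 22.56 hours on the GPU.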
In Table III, the results of the MeasureObjectSizeShape module on the GPU are listed. Here as well, the pre-processing and post-processing times refer, respectively, to the time spent in data preparation and in assigning data to the related workspace objects. The execution time of the Zernike filter for the whole plate has been reduced from 50.82 hours (∼2.11 days) with the original Python code to 9.08 hours with the GPU accelerator.

TABLE III
COMPARING EXECUTION TIME OF THE MeasureObjectSizeShape MODULE ON CPU AND GPU FOR A FULL PLATE.
MeasureObjectSizeShape GPU (seconds)    MeasureObjectSizeShape CPU (seconds)
The same experimental protocol has been applied to smaller images (1056 × 1256) from a plate of 384 wells with two fields each (768 images). The objective here was to evaluate the weak scaling of the proposed GPU-CP software. In Table IV, we show the results for MeasureTexture. The speedup between the original single-core CPU version and the GPU-accelerated version is 5.47 for the smaller dataset and 8.21 for the bigger one. The second dataset is roughly 4× bigger than the first one, and the execution time of the GPU version is 13.09× bigger for the second dataset than for the first, smaller one. The performance gain thus scales with the size of the dataset. The same tendency has been observed with MeasureObjectSizeShape.

TABLE IV
RESULTS FOR DIFFERENT SIZES OF IMAGES.
VII. CONCLUSION AND FUTURE WORK

In this paper, we introduced a GPU-accelerated version of the CellProfiler application, an open-source software package that excels at image processing pipelines for identifying individual cells and extracting sub-cellular morphological features, but lacked high-performance capabilities. Indeed, even if in theory performing image analysis on a large number of images from independent samples is an embarrassingly parallel problem that can be sped up by allocating more compute nodes, an empirical estimate of the compute cost showed that using CellProfiler to analyze 50 TB of data within 24 hours would need 896 nodes of 24 cores each (21504 cores in total). This is too costly and exceeds the limits of readily available cloud computing solutions. Major time-consuming modules of CellProfiler have been identified and re-implemented for the GPU. Speedups of 7.5× have been reached, which means that the processing of a particular dataset was reduced from 9.83 days to 31.64 hours.

For future work, we are looking into other types of hardware acceleration, such as FPGAs, to optimize other CellProfiler steps that are less amenable to GPU acceleration.

ACKNOWLEDGMENT

This work was supported by the VLAIO industrial R&D project ImmCyte. The NVIDIA Corporation generously donated a GPU.
REFERENCES

[1] Malo N, Hanley JA, Cerquozzi S, Pelletier J, Nadon R: Statistical practice in high-throughput screening data analysis. Nat Biotechnol 2006, 24(2):167–175.
[2] Boutros M, Bras LP, Huber W: Analysis of cell-based RNAi screens. Genome Biol 2006, 7(7):R66. 10.1186/gb-2006-7-7-r66
[3] Cook D, Swayne DF: Interactive and Dynamic Graphics for Data Analysis: With Examples Using R and GGobi. Springer, 2007.
[4] Gee AG, Li H, Yu M, Smrtic MB, Cvek U, Goodell H, Gupta V, Lawrence C, Zhou J, Chiang C, et al.: The Universal Visualization Platform. SPIE Visualization and Data Analysis 2005, 5669:274–283.
[5] Selinummi J, Seppala J, Yli-Harja O, Puhakka JA: Software for quantification of labeled bacteria from digital microscope images by automated image analysis. BioTechniques 39:859–863, 2005.
[6] Lamprecht MR, Sabatini DM, Carpenter AE: CellProfiler: free, versatile software for automated biological image analysis. BioTechniques 42:71–75, January 2007.
[7] Abramoff MD, Magalhaes PJ, Ram SJ: Image processing with ImageJ. Biophotonics Int. 11:36–42, 2004.
[8] http://cellprofiler.org/impact/#impact Accessed August 2018.
[9] Carpenter AE, et al.: CellProfiler: image analysis software for identifying and quantifying cell phenotypes. Genome Biology 2006, 7(10):R100.
[10] https://www.singerinstruments.com/resource/what-is-high-throughput-screening/ Accessed August 2018.
[11] Haralick RM, Shanmugam K, Dinstein I: Textural Features for Image Classification. IEEE Transactions on Systems, Man, and Cybernetics 1973, SMC-3(6):610–621.
[12] Gabor D: Theory of communication. Part 1: The analysis of information. Journal of the Institution of Electrical Engineers - Part III: Radio and Communication Engineering, vol. 93, no. 26, pp. 429–441, November 1946.
[13] Bhaskara Rao P, Vara Prasad D, Pavan Kumar Ch: Feature extraction using Zernike moments. International Journal of Latest Trends in Engineering and Technology (IJLTET), Volume 2, Issue 2, March 2013.
[14] https://documen.tician.de/pycuda/ Accessed August 2018.
[15] https://developer.download.nvidia.com/GPU_Programming_Guide/GPU_Programming_Guide.pdf Accessed August 2018.
[16] https://www.docker.com Accessed August 2018.
[17] Amazon Web Services FAQ: How many instances can I run in Amazon EC2? https://aws.amazon.com/ec2/faqs/ Accessed August 2018.
[18] https://www.docker.com Accessed August 2018.
[19] https://github.com/CellProfiler/docker Accessed August 2018.
[20] https://www.nvidia.com/en-us/geforce/products/10series/geforce-gtx-1080-ti/#specs Accessed August 2018.
[21] https://www.nvidia.com/object/pascal-architecture-whitepaper.html Accessed August 2018.
[22] https://aws.amazon.com/ec2/purchasing-options/dedicated-instances/ Accessed August 2018.
[23] Simm J, Klambauer G, Arany A, et al.: Repurposing High-Throughput Image Assays Enables Biological Activity Prediction for Drug Discovery. Cell Chemical Biology 2018. https://doi.org/10.1016/j.chembiol.2018.01.015