GPU-accelerated CellProfiler
Abstract—CellProfiler excels at bridging the gap between advanced image analysis algorithms and scientists who lack computational expertise. However, it lacks the high-performance capabilities needed for High Throughput Imaging experiments, where workloads reach hundreds of terabytes of data and are computationally very demanding. In this work, we introduce a GPU-accelerated CellProfiler in which the most time-consuming algorithmic steps are executed on Graphics Processing Units. Experiments on a benchmark dataset showed significant speedups over both single- and multi-core CPU versions. The overall execution time was reduced from 9.83 days to 31.64 hours.

Index Terms—CellProfiler, High Throughput Imaging, Graphics Processing Units, High Performance Computing.
I. INTRODUCTION

High Throughput Imaging (HTI) is a key process in chemistry and biology in which thousands of experimental samples (such as chemical compounds, amino acids, or live cells) are subjected to simultaneous testing under given conditions. HTI is widely used for drug discovery, target validation, and the identification of genes or proteins that modulate a particular biological pathway. Like many other scientific experiments in the biology and pharmaceutical industries, HTI involves acquiring images and analyzing them, either by simple visual inspection or with advanced image analysis software.

With the development of laboratory robotics that automate sample preparation and handling, biologists can easily generate and use very large image sets from HTI. Hence, the demand for less tedious and more quantitative tools to analyze the resulting high-content images keeps growing [1]–[4]. Among the several existing commercial and free software packages, we have used CellProfiler (CP) [6], an open-source software package for quantifying data from biological images, in particular images from high-throughput experiments. CP provides researchers with a variety of biological analyses, including standard assays (such as cell count, size, and per-cell protein levels) and complex morphological assays (such as cell/organelle shape or sub-cellular patterns of DNA or protein staining) [9]. Furthermore, it has a large user base, is cited in more than a thousand papers, and is validated for a wide variety of biological applications. It also won the 2009 Bio-IT World Best Practices Award in IT & Informatics.
With the already increasing rate of possible assays (HTI robots capable of testing up to 100,000 compounds per day currently exist) and the growing size and resolution of the resulting images, identifying objects of interest and measuring their properties is becoming computationally intensive. Performing analysis on a large number of images from independent samples is in theory an embarrassingly parallel problem that can be sped up by allocating more compute nodes, but this may be too costly or unfeasible for some problem sizes. For example, as part of our work on re-purposing HTI assays [23], we had to run a CP pipeline that extracts 1400 features per cell on 7.5 million images; the run lasted 14 days on 16 compute nodes (each with 24 cores) and needed 12.5 terabytes (TB) of storage for the input images plus 15 more TB for the extracted features. To reduce the runtime to 24 h, we would need 224 nodes (or 5376 cores). For 50 TB of data (the size of a second dataset we needed to process), this would increase to 896 nodes (or 21504 cores). This also exceeds the limits of readily available cloud computing solutions; for example, Amazon Elastic Compute Cloud requires a special application for runs with over 20 nodes [17]. These estimates are optimistic, as they assume immediate access to the cluster and ignore the cost of spinning up nodes or loading data.
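These node counts follow from simple work-conservation arithmetic on the figures above: the first run consumed a fixed budget of node-days, and the second dataset is four times larger (50 TB vs. 12.5 TB):

\[
14\,\text{days} \times 16\,\text{nodes} = 224\,\text{node-days}
\;\Rightarrow\;
\frac{224\,\text{node-days}}{1\,\text{day}} = 224\,\text{nodes}\ (= 224 \times 24 = 5376\,\text{cores}),
\]
\[
4 \times 224 = 896\,\text{nodes}\ (= 896 \times 24 = 21504\,\text{cores}).
\]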
Therefore, Graphics Processing Units (GPUs) appear to be a good candidate for reducing the aforementioned cost. The execution pipeline is indeed the same for all input images, which suits Single Instruction Multiple Data (SIMD) programming models such as those of GPUs. Moreover, GPUs are historically known to perform well on image processing workloads.

In this paper, we present the design and implementation of an accelerated version of CellProfiler on Graphics Processing Units (GPUs). The GPU optimizations were incorporated into the existing library transparently to the user, using the same programming language for portability. The focus was put on very time-consuming and commonly used modules of CP. The remainder of the paper is organized as follows: Section II gives an overview of CellProfiler together with an analysis of its major time-consuming components. Section III lists the CPU optimizations introduced to CellProfiler. Section IV describes the architecture of Graphics Processing Units. The GPU-accelerated CellProfiler is detailed in Section V. Experimental settings and results are described in Section VI. In Section VII, we provide a summary of the work and directions for future research.
II. OVERVIEW OF CELLPROFILER

Running an image analysis project in CellProfiler [6] can be done through the graphical user interface (GUI) or in a headless batch mode. In both cases, a pipeline has to be defined, listing the modules to be executed together with their associated settings. Each module performs a specific processing task on the images using the given parameters, and many modules (over 50) are available. As shown in the high-level pseudo code of Algorithm 1, every image travels through the whole pipeline and is processed sequentially by each module before any other image can be processed. Not only must all the modules be executed in order, but the complete pipeline is also executed for one image at a time. All the computed measurements are temporarily saved for each image in a workspace, which is written to a file or displayed on the GUI when the saving module is reached. This execution model is redesigned in Section V in order to expose the data parallelism needed by the GPU.
Data: I, a set of input images; M, a list of modules
for each image i in the input image set I (1) do
    Create workspace W_i
    for each module m in the input pipeline M (2) do
        Execute module m on image i
        Save the measurements in W_i
        if m is the saving module then
            Save the output
            Delete workspace W_i
        end
    end
end
Algorithm 1: General skeleton of a pipeline in CellProfiler.
Table I shows the results of an empirical profiling run on a server machine with an Intel Xeon E5-2690 v2 3.0 GHz CPU (2 sockets, 10 cores per socket) and 32 GB of RAM, running 64-bit Arch Linux (kernel 4.16.12) as operating system. Each instance of CellProfiler runs on a single core. Detailed timings of the most time-consuming modules, namely the MeasureObjectSizeShape (MOSS) and MeasureTexture (MT) modules of CellProfiler, are reported. The main core functions are shown on the left of the table, and their detailed timings on the right. The timings are given for one image and are averaged over 16 images with 3 channels (stainings) per image, at a resolution of 2498 × 2098.
We can observe that, on average, MeasureObjectSizeShape and MeasureTexture take 114.02 s and 289.52 s, respectively, to execute on a single image. Extrapolating this to a full plate with 384 wells and 6 field images per well, it would take approximately 69.18 h for MeasureObjectSizeShape and 185.29 h for MeasureTexture. Our dataset of 744 plates makes this even more impractical, with 2144.58 days and 5744.08 days respectively. Although those are extrapolated timings for a single core, even on a perfectly scalable cluster of 32 nodes with 18 cores per node, it would still take about 3.72 days and 9.97 days to execute. It is clear that both CellProfiler modules do not scale to larger datasets.

TABLE I
AVERAGE PROFILING TIMES FOR THE MeasureObjectSizeShape AND MeasureTexture MODULES OF CELLPROFILER. THE LEFT COLUMN SHOWS THE INDIVIDUAL CORE FUNCTIONS FOR BOTH MODULES. THE MIDDLE AND RIGHT COLUMNS SHOW THEIR RESPECTIVE TIMINGS FOR THE ORIGINAL PYTHON IMPLEMENTATION AND THE CPU-OPTIMIZED ONE.

                                 Original (Python)     CPU Optimized (C++)
Module / core function           Time      %           Time      Speedup
MeasureObjectSizeShape           114.02s   100.00%     67.16s    1.6
  zernike                         79.41s    69.65%     34.07s    2.3
  distance to edge                15.14s    13.28%     14.51s    1.0
  calculate extents                7.01s     6.15%      6.74s    1.0
  other                           12.44s    10.91%     11.84s    1.1
MeasureTexture                   289.52s   100.00%     44.61s    6.5
  run one gabor                  159.11s    54.96%     21.55s    7.4
    gabor                        111.28s    38.44%     18.08s    6.2
    norm per object               30.94s    10.69%      0.98s    31.6
    sum                           15.96s     5.51%      1.83s    8.7
    other                          0.93s     0.32%      0.66s    1.4
  run image gabor                 78.56s    27.13%     11.60s    6.8
    gabor                         76.19s    26.32%      9.48s    8.0
    other                          2.37s     0.82%      2.12s    1.1
  run one haralick                38.06s    13.15%      7.65s    5.0
    haralick                      37.07s    12.80%      6.66s    5.6
    other                          0.99s     0.34%      0.99s    1.0
  run image haralick              13.79s     4.76%      3.81s    3.6
    haralick                      13.59s     4.69%      3.62s    2.4
    other                          0.20s     0.07%      0.19s    1.1
  other                            0.00s     0.00%      0.00s    1.0
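For instance, the MeasureTexture extrapolation above follows directly from the per-image timing in Table I, with 384 × 6 = 2304 images per plate:

\[
289.52\,\text{s} \times 2304 \approx 667{,}054\,\text{s} \approx 185.29\,\text{h},
\qquad
185.29\,\text{h} \times \frac{744\,\text{plates}}{24\,\text{h/day}} \approx 5744\,\text{days}.
\]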
A. The MeasureTexture Module

The MeasureTexture module measures the degree and nature of textures within objects. As shown in the pseudo code of Algorithm 2, it implements the Gabor filter [12], which analyzes whether the image contains frequency (scale) content in a specific direction (angle) within a localized region around the point or region of analysis. The MT module also computes Haralick features [11], which quantify the spatial variation of grey-tone values within an image.

If we closely examine the profiling results of the MT module in Table I, we can clearly observe that the Gabor filter [12] takes a considerable amount of time. Its execution is spread over two core functions and takes in total 187.47 s, or 64.76% of the entire module. The second prominent bottleneck is the Haralick [11] feature extractor, whose cumulated time for the entire module is 50.66 s (17.49%).
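For reference, a commonly used form of the 2-D Gabor kernel evaluated at each pixel is the following (a standard definition; the exact parameterization used by CellProfiler may differ slightly):

\[
g(x, y; \lambda, \theta, \sigma, \gamma) =
\exp\!\left(-\frac{x'^2 + \gamma^2 y'^2}{2\sigma^2}\right)
\exp\!\left(i\,\frac{2\pi x'}{\lambda}\right),
\qquad
x' = x\cos\theta + y\sin\theta,\quad
y' = -x\sin\theta + y\cos\theta,
\]

where θ is the orientation (the angle loop in Algorithm 2), λ the wavelength (related to the scale loop), σ the width of the Gaussian envelope, and γ the spatial aspect ratio.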
B. The MeasureObjectSizeShape Module

Given an image with identified objects (e.g., nuclei or cells), the MeasureObjectSizeShape module extracts area and shape features of each object that lies completely inside the image borders. Extracted features include simple metrics, such as area, volume, and solidity, as well as more complex metrics, such as the Euler number, the min/max Feret diameter, and Zernike shape features. As mentioned before, most of the compute time spent in this module goes to calculating the Zernike moments (79.41 s, or 69.65%, in our example in Table I). This metric describes an object in a basis of Zernike polynomials, using the coefficients as features. Zernike polynomials from order 0 to order 9 are calculated, giving in total 30 measurements per object. The structure of the module is illustrated in the high-level pseudo code of Algorithm 3.
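For reference, the Zernike moment of order n with repetition m for an object image f mapped onto the unit disk is commonly defined as [13] (a standard definition; the paper does not spell out its exact normalization):

\[
Z_{nm} = \frac{n+1}{\pi} \iint_{x^2 + y^2 \le 1} f(x, y)\, V_{nm}^{*}(x, y)\, dx\, dy,
\]

where V_{nm} is the Zernike polynomial of degree n and angular dependence m, and the magnitudes |Z_{nm}| serve as rotation-invariant shape features. Counting the valid (n, m) pairs with n − m even and 0 ≤ m ≤ n for n = 0, …, 9 gives exactly the 30 measurements per object mentioned above.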
Data: G, a list of image groups; S, a list of image scales; A, a list of angles; O, a list of objects; labels, an array of labels
for each group g in G do
    for each scale s in S do
        if Compute Gabor is True then
            Run image Gabor()
        end
        for each angle a in A do
            Run image Haralick()
        end
        for each object o in O do
            for each angle a in A do
                Run one Haralick(labels)
            end
            if Compute Gabor is True then
                Run one Gabor(labels)
            end
        end
    end
end
Algorithm 2: Pseudo code of the MeasureTexture module in CellProfiler.
Data: O, a list of objects; labels, an array of labels
for each object o in O do
    Measure area()
    Measure perimeter()
    Measure solidity()
    Measure extent()
    Measure eulernumber()
    Measure center X Y()
    Measure max radius()
    Measure mean radius()
    Measure min max feretdiameter()
    for each index i in Zernike indexes do
        Measure zernike(labels)
    end
end
Algorithm 3: Pseudo code of the MeasureObjectSizeShape module in CellProfiler.
III. CPU OPTIMIZATIONS

The original CellProfiler [9] code is written in Python 2.x. Although it is very effective and flexible for image processing on cell data, it is not optimized for runtime performance; as shown in the previous section, it contains some major performance bottlenecks. Here we show that by refactoring the core functionality we can drastically decrease the amount of calculation. In addition, while the vectorized nature of Python code allows for clean and fast array operations, it is not cache friendly when it requires multiple iterations over the same data. To illustrate the gains in execution time obtainable with minor CPU optimizations, we have refactored and re-implemented the code of both modules in C++.

The proposed CPU improvements to the Gabor filter are threefold. First, in the original approach, similar Gabor calculations are repeated for distinct angles of the Gabor filter; however, most of the time-consuming calculations are independent of those angles and can be calculated only once. Second, a large portion of the execution time is spent on calculating sines and cosines. To address this, we have constructed a precomputed lookup table: each required sine or cosine is obtained by linearly interpolating values in this table. Using a lookup table comes at the cost of reduced accuracy; adding more precomputed entries yields more accurate approximations. In our experiments, we observed a maximum relative change of about 2.80e-5 compared to the original results. Lastly, we implemented the full Gabor filter in C++. The average speedup we obtain for the optimized Gabor filter ranges between 6 and 8, depending on the number of cells present in the image. In terms of cache efficiency, which is essential for scaling purposes, we obtain 21× fewer loads, 3× fewer stores, 12× fewer load misses, and 4× fewer store misses.
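Listing 1 gives a minimal sketch of such an interpolated trigonometric lookup table (illustrative only; the class name and table size are ours, not those of the actual implementation):

#include <cmath>
#include <vector>

// Precomputed sine table over [0, 2*pi); cosine is read with a
// quarter-period phase offset. The table size trades memory for accuracy.
class TrigTable {
public:
    explicit TrigTable(std::size_t size = 4096)
        : table_(size + 1), step_(2.0 * M_PI / size) {
        for (std::size_t i = 0; i <= size; ++i)
            table_[i] = std::sin(i * step_);
    }

    // Linear interpolation between the two nearest table entries.
    double sin(double x) const {
        double t = std::fmod(x, 2.0 * M_PI);
        if (t < 0) t += 2.0 * M_PI;
        double pos = t / step_;
        std::size_t i = static_cast<std::size_t>(pos);
        double frac = pos - i;
        return table_[i] + frac * (table_[i + 1] - table_[i]);
    }

    double cos(double x) const { return sin(x + M_PI / 2.0); }

private:
    std::vector<double> table_;
    double step_;
};

Listing 1: Sketch of a sine/cosine lookup table with linear interpolation.

With 4096 entries, the worst-case interpolation error is on the order of (2π/4096)²/8 ≈ 3e-7, comfortably below the 2.80e-5 relative change reported above; a smaller table trades accuracy for memory.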
The second main improvement concerns the processing of Haralick features. Here, we observed that most of the processing time is spent in a per-cell normalization step, which uses a generic and complex minimum/maximum operator in the original Python code. By implementing only what the Haralick feature detector specifically needs, we gain a speedup of about 2.4 to 2.7.
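As an illustration of this kind of specialization, Listing 2 computes per-object minima and maxima over the label image in a single pass (our own simplified stand-in for the generic operator; all names are hypothetical):

#include <cstdint>
#include <limits>
#include <vector>

// One pass over the image: for every labelled object, track the
// minimum and maximum pixel intensity. Label 0 is background.
struct MinMax {
    float min = std::numeric_limits<float>::max();
    float max = std::numeric_limits<float>::lowest();
};

std::vector<MinMax> per_object_min_max(const std::vector<float>& pixels,
                                       const std::vector<std::int32_t>& labels,
                                       std::int32_t num_objects) {
    std::vector<MinMax> out(num_objects + 1);  // indexed by label id
    for (std::size_t i = 0; i < pixels.size(); ++i) {
        std::int32_t l = labels[i];
        if (l == 0) continue;                  // skip background
        MinMax& mm = out[l];
        if (pixels[i] < mm.min) mm.min = pixels[i];
        if (pixels[i] > mm.max) mm.max = pixels[i];
    }
    return out;
}

Listing 2: Sketch of a specialized per-object min/max pass.

A single fused pass like this touches each pixel exactly once, which is also where the cache-efficiency gains come from.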
A last improvement is the optimization of the Zernike moments, a key part of the MeasureObjectSizeShape module. The Zernike polynomials are constructed from similar exponentiation operations. To reduce the amount of calculation, we implemented an on-the-fly caching mechanism for these exponents, resulting in a speedup factor of about 2.2. The cache efficiency is improved by 16× fewer loads, 10× fewer stores, 10× fewer load misses, and 33× fewer store misses.
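Listing 3 sketches such on-the-fly memoization of radial powers (our own simplification; an actual implementation may cache whole power arrays per object, but the idea is the same):

#include <vector>

// Caches rho^k for k = 0..max_order at one sample point, so that every
// Zernike polynomial evaluated at that point reuses the cached powers
// instead of recomputing them from scratch.
class PowerCache {
public:
    explicit PowerCache(int max_order) : powers_(max_order + 1) {}

    // Refill the cache for a new radius rho: one multiplication per order.
    void reset(double rho) {
        powers_[0] = 1.0;
        for (std::size_t k = 1; k < powers_.size(); ++k)
            powers_[k] = powers_[k - 1] * rho;
    }

    double pow(int k) const { return powers_[k]; }

private:
    std::vector<double> powers_;
};

Listing 3: Sketch of an on-the-fly cache for the radial exponents.

When evaluating all Zernike polynomials up to order 9 at a pixel, reset() is called once and every radial term is then assembled from cached powers.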
The joint speedup of the optimized approach combining the three improvements above is shown on the right of Table I: the MeasureObjectSizeShape module reaches a speedup factor of approximately 1.6 and the MeasureTexture module a factor of approximately 6.5.
IV. GRAPHICS PROCESSING UNITS

Graphics Processing Units are at the leading edge of many-core parallel computational platforms in several research fields. GPUs are especially well suited for fine-grained, data-parallel computations consisting of thousands of independent threads executing the same program concurrently. They excel at running programs on many data elements in parallel with a high ratio of arithmetic operations to memory operations.

Fig. 1. Illustration of the GPU programming model.

A GPU pipeline starts from the CPU by calling a function, named a kernel, that is executed on the GPU device N times in parallel by N threads organized into thread blocks and grids of thread blocks. Each thread within a thread block executes an instance of the kernel and has a thread identifier within its block. Threads are partitioned into groups called warps, whose execution is scheduled following a time-sharing strategy. A thread block is a set of concurrently executing threads that can cooperate through barrier synchronization and shared memory; it has a block identifier within its grid. A grid is an array of thread blocks that execute the same kernel, read inputs from global memory, write results to global memory, and synchronize between dependent kernel calls. Active GPU threads have access to several memory spaces with different characteristics that reflect their distinct usages; these include global, local, shared, and texture memory, and registers. For further information about GPU programming and the GPU memory hierarchy, please refer to [15] as an entry point to the literature.
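To make this terminology concrete, Listing 4 shows a minimal CUDA kernel and its launch (a generic example, not code from the GPU-accelerated CellProfiler itself):

#include <cuda_runtime.h>

// Kernel: each of the N threads processes one pixel.
__global__ void scale_pixels(const float* in, float* out, int n, float s) {
    // Global thread identifier from block id, block size and thread id.
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) out[i] = s * in[i];  // guard against overshooting n
}

void scale_on_gpu(const float* host_in, float* host_out, int n, float s) {
    float *d_in, *d_out;
    cudaMalloc(&d_in, n * sizeof(float));
    cudaMalloc(&d_out, n * sizeof(float));
    cudaMemcpy(d_in, host_in, n * sizeof(float), cudaMemcpyHostToDevice);

    // Launch a grid of thread blocks with 256 threads per block.
    int threads = 256;
    int blocks = (n + threads - 1) / threads;
    scale_pixels<<<blocks, threads>>>(d_in, d_out, n, s);

    cudaMemcpy(host_out, d_out, n * sizeof(float), cudaMemcpyDeviceToHost);
    cudaFree(d_in);
    cudaFree(d_out);
}

Listing 4: A minimal CUDA kernel and host-side launch.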
V. GPU-ACCELERATED CELLPROFILER

Data: I, a set of input images; M, a list of modules
for each module M in the input pipeline (2) do
    for each image I in the input image set (1) do
        if M is MT or M is MOSS then
            if the GPU limit is not reached then
                Push I into a pool P
            else
                Run M on the GPU for all images in pool P
                Save the measurements of each pooled image I in its workspace W_I
            end
        else
            Run module M on the CPU for I
            Save the measurements in W_I
        end
    end
    if M is the saving module then
        Write all measurements to file
        Delete all workspaces
    end
end
Algorithm 4: High-level main pipeline of CP as rewritten for the GPU version.

For each input image I, a workspace W_I is created where all its measurements are saved. If the current module M is the MT module or the MOSS module, execution happens on the GPU side. The program checks whether the limits of the GPU would be reached, i.e., whether the number of threads to be launched and the amount of consumed memory would exceed the hardware characteristics of the underlying device. This assessment is done automatically, without any user intervention. If the limits of the GPU are not exceeded, the image is pushed into a pool of images to be processed in the next iterations. If the GPU limits are reached, or all images have been analyzed, module M is executed on the GPU and the resulting measurements are saved into the workspace W_I of each image I. If M is neither MT nor MOSS, it is executed on the CPU, and here as well the resulting measurements are saved into W_I. Finally, if M is the saving module, all measurements are saved to a file or displayed on the GUI.
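The pooling test of Algorithm 4 can be sketched as in Listing 5 (illustrative only; the paper does not detail the exact bookkeeping, and cudaMemGetInfo is one standard way to query free device memory):

#include <cuda_runtime.h>
#include <cstddef>

// Illustrative only: decide whether one more image fits in the current
// GPU batch, based on free device memory and a per-launch thread budget.
struct BatchState {
    std::size_t bytes = 0;     // memory the pooled images will need
    std::size_t threads = 0;   // threads the pooled images will launch
};

bool fits_on_gpu(const BatchState& batch,
                 std::size_t img_bytes, std::size_t img_threads,
                 std::size_t thread_budget) {
    std::size_t free_bytes = 0, total_bytes = 0;
    cudaMemGetInfo(&free_bytes, &total_bytes);  // query free device memory
    return batch.bytes + img_bytes <= free_bytes &&
           batch.threads + img_threads <= thread_budget;
}

Listing 5: Sketch of the automatic GPU-limit check that controls image pooling.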
VI. EXPERIMENTAL SETTINGS AND RESULTS

A. Experimental settings

Benchmarks: The dataset used in this paper consists of high-content images taken from plates with 384 wells (16 rows × 24 columns). Every well has 6 fields and is stained with 3 chemicals (referenced as image groups in Algorithms 1–5). The resulting images are sized 2498 × 2098. The proprietary pipeline used extracts 1400 features per cell.
B. Experimental results

In Table II, the results of executing the MeasureTexture module on the GPU are listed for a whole plate (2304 images). The reported times for One Gabor GPU, Image Gabor GPU, One Haralick GPU, and Image Haralick GPU include the transfer of the data from the CPU, the execution of the kernel, and bringing the data back to the CPU. The pre-processing time refers to the time elapsed in constructing the data for the GPU; the post-processing time refers to the time elapsed in assigning the results to the associated images. These steps are executed on the CPU side but are compulsory for the GPU, hence we include them in the GPU evaluation. These two latter timings cover both the Gabor and Haralick filters: indeed, as mentioned in Section V-A, part of the optimization consisted of fusing loops on the CPU and reusing the data structures that already live in GPU memory. The overall improvement for the whole plate between the single-core original version of the Python code and the GPU one is about 8.21×. The original code ran for 185.29 hours (∼7.72 days) for the whole plate; using GPU acceleration, this has been reduced to 22.56 hours.

TABLE II
COMPARING EXECUTION TIME (IN SECONDS) OF THE MeasureTexture MODULE ON CPU AND GPU.

                      GPU            CPU
One Gabor             38180.352      366589.44
One Haralick           2769.6         87690.24
Image Gabor           17030.4        181002.24
Image Haralick         1230.432       31772.16
Pre-processing         3082.56        –
Post-processing       18944.448       –
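The 8.21× figure follows from the column totals of Table II:

\[
\frac{366589.44 + 87690.24 + 181002.24 + 31772.16}
     {38180.35 + 2769.60 + 17030.40 + 1230.43 + 3082.56 + 18944.45}
= \frac{667054.1\,\text{s}}{81237.8\,\text{s}} \approx 8.21,
\]

i.e., 185.29 hours on the CPU versus 22.56 hours on the GPU.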
In Table III, the results of the MeasureObjectSizeShape module on the GPU are listed. Here as well, the pre-processing and post-processing times refer, respectively, to the time spent in data preparation and in assigning data to the related workspace objects. The execution time of the Zernike filter for the whole plate has been reduced from 50.82 hours (∼2.11 days) with the original Python code to 9.08 hours with the GPU accelerator.

TABLE III
COMPARING EXECUTION TIME OF THE MeasureObjectSizeShape MODULE ON CPU AND GPU FOR A FULL PLATE.
MeasureObjectSizeShape GPU (seconds)    MeasureObjectSizeShape CPU (seconds)
The same experimental protocol has been applied to smaller images (1056 × 1256) from a plate of 384 wells with two fields each (768 images). The objective here was to evaluate the weak scaling of the proposed GPU-CP software. In Table IV, we show the results for MeasureTexture. The speedup between the original single-core CPU version and the GPU-accelerated version is 5.47 for the smaller dataset and 8.21 for the bigger one. The second dataset is roughly 4× bigger than the first one, and the execution time of the GPU version is 13.09× bigger for the second dataset than for the first, smaller one. The performance gain thus scales with the size of the dataset. The same tendency has been observed with MeasureObjectSizeShape.

TABLE IV
RESULTS FOR DIFFERENT SIZES OF IMAGES.
VII. CONCLUSION AND FUTURE WORK

In this paper, we introduced a GPU-accelerated version of the CellProfiler application, an open-source software package that excels at image processing pipelines for identifying individual cells and extracting sub-cellular morphological features, but lacked high-performance capabilities. Indeed, even if in theory performing image analysis on a large number of images from independent samples is an embarrassingly parallel problem that can be sped up by allocating more compute nodes, an empirical estimate of the compute cost showed that using CellProfiler to analyze 50 TB of data within 24 hours would need 896 nodes of 24 cores each (21504 cores in total). This is too costly and exceeds the limits of readily available cloud computing solutions. Major time-consuming modules of CellProfiler have been identified and re-implemented for the GPU. Speedups of 7.5× have been reached, which means that the processing of a particular dataset was reduced from 9.83 days to 31.64 hours.

For future work, we are looking into other types of hardware acceleration, such as FPGAs, to optimize other CellProfiler steps that are less amenable to GPU acceleration.

ACKNOWLEDGMENT

This work was supported by the VLAIO industrial R&D project ImmCyte. The NVIDIA Corporation generously donated a GPU.
REFERENCES

[1] Malo N, Hanley JA, Cerquozzi S, Pelletier J, Nadon R: Statistical practice in high-throughput screening data analysis. Nat Biotechnol 2006, 24(2):167–175.
[2] Boutros M, Bras LP, Huber W: Analysis of cell-based RNAi screens. Genome Biol 2006, 7(7):R66. 10.1186/gb-2006-7-7-r66
[3] Cook D, Swayne DF: Interactive and Dynamic Graphics for Data Analysis: With Examples Using R and GGobi. Springer, 2007.
[4] Gee AG, Li H, Yu M, Smrtic MB, Cvek U, Goodell H, Gupta V, Lawrence C, Zhou J, Chiang C, et al.: The Universal Visualization Platform. SPIE Visualization and Data Analysis 2005, 5669:274–283.
[5] Selinummi J, Seppala J, Yli-Harja O, Puhakka JA: Software for quantification of labeled bacteria from digital microscope images by automated image analysis. BioTechniques 39:859–863, 2005.
[6] Lamprecht MR, Sabatini DM, Carpenter AE: CellProfiler: free, versatile software for automated biological image analysis. BioTechniques 42:71–75, January 2007.
[7] Abramoff MD, Magalhaes PJ, Ram SJ: Image processing with ImageJ. Biophotonics Int. 11:36–42, 2004.
[8] http://cellprofiler.org/impact/#impact Accessed August 2018.
[9] Carpenter AE, et al.: CellProfiler: image analysis software for identifying and quantifying cell phenotypes. Genome Biology 2006, 7(10):R100.
[10] https://www.singerinstruments.com/resource/what-is-high-throughput-screening/ Accessed August 2018.
[11] Haralick RM, Shanmugam K, Dinstein I: Textural Features for Image Classification. IEEE Transactions on Systems, Man, and Cybernetics 1973, SMC-3(6):610–621.
[12] Gabor D: Theory of communication. Part 1: The analysis of information. Journal of the Institution of Electrical Engineers - Part III: Radio and Communication Engineering, vol. 93, no. 26, pp. 429–441, November 1946.
[13] Bhaskara Rao P, Vara Prasad D, Pavan Kumar Ch: Feature extraction using Zernike moments. International Journal of Latest Trends in Engineering and Technology (IJLTET), Volume 2, Issue 2, March 2013.
[14] https://documen.tician.de/pycuda/ Accessed August 2018.
[15] https://developer.download.nvidia.com/GPU_Programming_Guide/GPU_Programming_Guide.pdf Accessed August 2018.
[16] https://www.docker.com Accessed August 2018.
[17] Amazon Web Services FAQ: How many instances can I run in Amazon EC2? https://aws.amazon.com/ec2/faqs/ Accessed August 2018.
[18] https://www.docker.com Accessed August 2018.
[19] https://github.com/CellProfiler/docker Accessed August 2018.
[20] https://www.nvidia.com/en-us/geforce/products/10series/geforce-gtx-1080-ti/#specs Accessed August 2018.
[21] https://www.nvidia.com/object/pascal-architecture-whitepaper.html Accessed August 2018.
[22] https://aws.amazon.com/ec2/purchasing-options/dedicated-instances/ Accessed August 2018.
[23] Simm J, Klambauer G, Arany A, et al.: Repurposing High-Throughput Image Assays Enables Biological Activity Prediction for Drug Discovery. Cell Chemical Biology 2018. https://doi.org/10.1016/j.chembiol.2018.01.015