Image Processing Using FPGAs Donald G Bailey Editor PDF Download

The document discusses a special issue on image processing using FPGAs, edited by Donald G. Bailey, which compiles nine articles addressing various topics related to FPGA technology for image processing applications. It highlights the challenges and advancements in using FPGAs for real-time embedded image processing, emphasizing the need for novel algorithms and hardware architectures to exploit the parallelism of FPGAs. The document serves as a resource for researchers to present their findings and innovations in this rapidly evolving field.



Journal of
Imaging

Image Processing
Using FPGAs
Edited by
Donald G. Bailey
Printed Edition of the Special Issue Published in Journal of Imaging

www.mdpi.com/journal/jimaging
Image Processing Using FPGAs

Special Issue Editor


Donald G. Bailey

MDPI • Basel • Beijing • Wuhan • Barcelona • Belgrade


Special Issue Editor
Donald G. Bailey
Massey University
New Zealand

Editorial Office
MDPI
St. Alban-Anlage 66
4052 Basel, Switzerland

This is a reprint of articles from the Special Issue published online in the open access journal
Journal of Imaging (ISSN 2313-433X) from 2018 to 2019 (available at: https://www.mdpi.com/journal/jimaging/special_issues/Image_FPGAs).

For citation purposes, cite each article independently as indicated on the article page online and as
indicated below:

LastName, A.A.; LastName, B.B.; LastName, C.C. Article Title. Journal Name Year, Article Number,
Page Range.

ISBN 978-3-03897-918-0 (Pbk)


ISBN 978-3-03897-919-7 (PDF)


© 2019 by the authors. Articles in this book are Open Access and distributed under the Creative
Commons Attribution (CC BY) license, which allows users to download, copy and build upon
published articles, as long as the author and publisher are properly credited, which ensures maximum
dissemination and a wider impact of our publications.
The book as a whole is distributed by MDPI under the terms and conditions of the Creative Commons
license CC BY-NC-ND.
Contents

About the Special Issue Editor . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . vii

Preface to “Image Processing Using FPGAs” . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . ix

Donald Bailey
Image Processing Using FPGAs
Reprinted from: Journal of Imaging 2019, 5, 53, doi:10.3390/jimaging5050053 . . . . . . . . . . . . 1

Fahad Siddiqui, Sam Amiri, Umar Ibrahim Minhas, Tiantai Deng, Roger Woods,
Karen Rafferty and Daniel Crookes
FPGA-Based Processor Acceleration for Image Processing Applications
Reprinted from: Journal of Imaging 2019, 5, 16, doi:10.3390/jimaging5010016 . . . . . . . . . . . . 5

Paulo Garcia, Deepayan Bhowmik, Robert Stewart, Greg Michaelson and Andrew Wallace
Optimized Memory Allocation and Power Minimization for FPGA-Based Image Processing
Reprinted from: Journal of Imaging 2019, 5, 7, doi:10.3390/jimaging5010007 . . . . . . . . . . . . . 27

Runbin Shi, Justin S.J. Wong and Hayden K.-H. So


High-Throughput Line Buffer Microarchitecture for Arbitrary Sized Streaming Image
Processing
Reprinted from: Journal of Imaging 2019, 5, 34, doi:10.3390/jimaging5030034 . . . . . . . . . . . . 50

Donald Bailey and Anoop Ambikumar


Border Handling for 2D Transpose Filter Structures on an FPGA
Reprinted from: Journal of Imaging 2018, 4, 138, doi:10.3390/jimaging4120138 . . . . . . . . . . . . 70

Andrew Tzer-Yeu Chen, Rohaan Gupta, Anton Borzenko, Kevin I-Kai Wang and
Morteza Biglari-Abhari
Accelerating SuperBE with Hardware/Software Co-Design
Reprinted from: Journal of Imaging 2018, 4, 122, doi:10.3390/jimaging4100122 . . . . . . . . . . . . 91

Aiman Badawi and Muhammad Bilal


High-Level Synthesis of Online K-Means Clustering Hardware for a Real-Time Image
Processing Pipeline
Reprinted from: Journal of Imaging 2019, 5, 38, doi:10.3390/jimaging5030038 . . . . . . . . . . . . 108

Haonan Zhou, Raju Machupalli and Mrinal Mandal


Efficient FPGA Implementation of Automatic Nuclei Detection in Histopathology Images
Reprinted from: Journal of Imaging 2019, 5, 21, doi:10.3390/jimaging5010021 . . . . . . . . . . . . 125

Donald Bailey and Michael Klaiber


Zig-Zag Based Single-Pass Connected Components Analysis
Reprinted from: Journal of Imaging 2019, 5, 45, doi:10.3390/jimaging5040045 . . . . . . . . . . . . 138

Zhe Wang, Trung-Hieu Tran, Ponnanna Kelettira Muthappa and Sven Simon
A JND-Based Pixel-Domain Algorithm and Hardware Architecture for Perceptual Image
Coding
Reprinted from: Journal of Imaging 2019, 5, 50, doi:10.3390/jimaging5050050 . . . . . . . . . . . . 164

About the Special Issue Editor
Donald G. Bailey received his Bachelor of Engineering (Honours) degree in Electrical Engineering
in 1982, and Ph.D. degree in Electrical and Electronic Engineering from the University of Canterbury,
New Zealand, in 1985. From 1985 to 1987, he applied image analysis to the wool and paper industries
of New Zealand. From 1987 to 1989, he was a Visiting Research Engineer at University of California,
Santa Barbara. Prof. Bailey joined Massey University in Palmerston North, New Zealand, as Director
of the Image Analysis Unit in November 1989. He was a Visiting Researcher at the University of
Wales, Cardiff, in 1996; University of California, Santa Barbara, in 2001–2002; and Imperial College
London in 2008. He is currently Professor of Imaging Systems in the Department of Mechanical
and Electrical Engineering in the School of Food and Advanced Technology at Massey University,
where he is Leader of the Centre for Research in Image and Signal Processing. Prof. Bailey has spent
over 35 years applying image processing to a range of industrial, machine vision, and robot vision
applications. For the last 18 years, one area of particular focus has been exploring different aspects
of using FPGAs for implementing and accelerating image processing algorithms. He is the author
of many publications in this field, including the book “Design for Embedded Image Processing on
FPGAs”, published by Wiley/IEEE Press. He is a Senior Member of the IEEE, and is active in the
New Zealand Central Section.

Preface to ”Image Processing Using FPGAs”
Over the last 20 years, FPGAs have moved from glue logic through to computing platforms.
They effectively provide a reconfigurable hardware platform for implementing logic and algorithms.
Being fine-grained hardware, FPGAs are able to exploit the parallelism inherent within a hardware
design while at the same time maintaining the reconfigurability and programmability of software.
This has led to FPGAs being used as a platform for accelerating computationally intensive tasks. This
is particularly seen in the field of image processing, where the FPGA-based acceleration of imaging
algorithms has become mainstream. This is even more so within an embedded environment, where
the power and computational resources of conventional processors are not up to the task of managing
the data throughput and computational requirements of real-time imaging applications.
Unfortunately, the fine-grained nature of FPGAs also makes them difficult to program
effectively. Conventional processors have a fixed computational architecture, which is able to provide
a high level of abstraction. By contrast, on an FPGA, it is necessary to design not only the algorithm
but also the computational architecture, which leads to an explosion in the design space complexity.
This, coupled with the complexities of managing the concurrency of a highly parallel design and the
bandwidth issues associated with the high volume of data associated with images and video, has
led to a wide range of approaches and architectures used for realising FPGA-based image processing
systems. This Special Issue provides an opportunity for researchers in this area to present some of
their latest results and designs. The diversity of presented techniques and applications reflects the
nature and current state of FPGA-based design for image processing.

Donald G. Bailey
Special Issue Editor

Editorial
Image Processing Using FPGAs
Donald G. Bailey
Department of Mechanical and Electrical Engineering, School of Food and Advanced Technology,
Massey University, Palmerston North 4442, New Zealand; [email protected]

Received: 6 May 2019; Accepted: 7 May 2019; Published: 10 May 2019

Abstract: Nine articles have been published in this Special Issue on image processing using
field programmable gate arrays (FPGAs). The papers address a diverse range of topics relating
to the application of FPGA technology to accelerate image processing tasks. The range includes:
Custom processor design to reduce the programming burden; memory management for full frames,
line buffers, and image border management; image segmentation through background modelling,
online K-means clustering, and generalised Laplacian of Gaussian filtering; connected components
analysis; and visually lossless image compression.

Keywords: field programmable gate arrays (FPGA); image processing; hardware/software co-design;
memory management; segmentation; image analysis; compression

1. Introduction to This Special Issue


Field programmable gate arrays (FPGAs) are increasingly being used for the implementation
of image processing applications. This is especially the case for real-time embedded applications,
where latency and power are important considerations. An FPGA embedded in a smart camera is able
to perform much of the image processing directly as the image is streamed from the sensor, with the
camera providing a processed output data stream, rather than a sequence of images. The parallelism of
hardware is able to exploit the spatial (data level) and temporal (task level) parallelism implicit within
many image processing tasks. Unfortunately, simply porting a software algorithm onto an FPGA often
gives disappointing results, because many image processing algorithms have been optimised for a
serial processor. It is usually necessary to transform the algorithm to efficiently exploit the parallelism
and resources available on an FPGA. This can lead to novel algorithms and hardware computational
architectures, both at the image processing operation level and also at the application level.
The aim of this Special Issue is to present and highlight novel algorithms, architectures, techniques,
and applications of FPGAs for image processing. A total of 20 submissions were received for the
Special Issue, with nine papers being selected for final publication.

2. Contributions
Programming an FPGA to accelerate complex algorithms is difficult, with one of four approaches
commonly used [1]:

• Custom hardware design of the algorithm using a hardware description language, optimised for
performance and resources;
• implementing the algorithm by instantiating a set of application-specific intellectual property
cores (from a library);
• using high-level synthesis to convert a C-based representation of the algorithm to
synthesisable hardware; or
• mapping the algorithm onto a parallel set of programmable soft-core processors.

J. Imaging 2019, 5, 53; doi:10.3390/jimaging5050053; www.mdpi.com/journal/jimaging



The article by Siddiqui et al. [1] takes this last approach, describing the design of an efficient
16-bit integer soft-core processor, IPPro, capable of operating at 337 MHz, specifically targeting the
dataflow seen in complex image processing algorithms. The presented architecture uses dedicated
stream access instructions on the input and output, with a 32-element local memory for storing pixels
and intermediate results, and a separate 32-element kernel memory for storing filter coefficients
and other parameters and constants. The exploitation of both data-level parallelism and task-level
parallelism is demonstrated through the mapping of a K-means clustering algorithm onto the
architecture, showing good scalability of processing speed with multiple cores. A second case study of
traffic sign recognition is partitioned between the IPPro cores and an ARM processor, with the colour
conversion and morphological filtering stages mapped to the IPPro. Again, the use of parallel IPPro
cores can significantly accelerate these tasks, compared to conventional software, without having to
resort to the tedious effort of custom hardware design.
Garcia et al. [2] worked on the thesis that image processing operations which require random
access to the whole frame (including iterative algorithms) are particularly difficult to realise in FPGAs.
They investigate the mapping of a frame buffer onto the memory resources of an FPGA, and explore
the optimal mapping onto combinations of configurable on-chip memory blocks. They demonstrate
that, for many image sizes, the default mapping by the synthesis tools results in poor utilisation, and is
also inefficient in terms of power requirements. A procedure is described that determines the best
memory configuration, based on balancing resource utilisation and power requirements. The mapping
scheme is demonstrated with optical flow and mean shift tracking algorithms.
On the other hand, local operations (such as filters) only need part of the image to produce an
output, and operate efficiently in stream processing mode, using line buffers to cache data for scanning
a local window through the image. This works well when the image size is fixed, and is known in
advance. Two situations where this approach is less effective [3] are region-of-interest processing,
where only a small region of the image is processed (usually determined from the image contents at
run-time), and cloud processing of user-uploaded images (which may be of arbitrary size). This is
complicated further in high-speed systems, where the real-time requirements demand processing
multiple pixels in every clock cycle, because, if the line width is not a multiple of the number of pixels
processed each cycle, then it is necessary to assemble the output window pixels from more than one
memory block. Shi et al. [3], in their paper, extend their earlier work on assembling the output window
to allow arbitrary image widths. The resulting line buffer must be configurable at run-time, which is
achieved through a series of “instructions”, which control the assembly of the output processing
window when the required data spans two memory blocks. Re-configuration only takes a few clock
cycles (to load the instructions), rather than the conventional approach of reconfiguring the FPGA each
time the image width changes. The results demonstrate better resource utilisation, higher throughput,
and lower power than their earlier approach.
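To illustrate the underlying line-buffer concept, the following is a minimal software model of streamed 3×3 window formation: pixels arrive one per "clock cycle" in raster order, and two line buffers cache the previous rows so the window can be formed without re-reading the frame. This is an illustrative sketch only; it assumes a fixed image width and one pixel per cycle, not the run-time-configurable, multi-pixel design of Shi et al.

```python
def stream_3x3_windows(pixels, width):
    # Two line buffers cache the previous two image rows; on an FPGA
    # these would map to on-chip block RAM.
    lb1 = [0] * width                      # holds pixels from row y-1
    lb2 = [0] * width                      # holds pixels from row y-2
    win = [[0, 0, 0] for _ in range(3)]    # 3x3 window shift registers
    for i, p in enumerate(pixels):
        x, y = i % width, i // width
        a, b = lb2[x], lb1[x]              # pixels above the current one
        lb2[x], lb1[x] = b, p              # rotate the column through the buffers
        for row, v in zip(win, (a, b, p)):
            row.pop(0)                     # shift the window one column left
            row.append(v)                  # insert the new right-hand column
        if y >= 2 and x >= 2:              # window fully covers valid data
            yield (x - 1, y - 1), [row[:] for row in win]
```

For a 4×4 test image with pixel values 0–15, the first valid window (centred at (1, 1)) contains rows `[0, 1, 2]`, `[4, 5, 6]`, `[8, 9, 10]`, matching the top-left 3×3 region of the image.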
When applying window operations to an image, the size of the output image is smaller than
the input because data is not valid when the window extends beyond the image border. If necessary,
this may be mitigated by extending the input image to provide data to allow such border pixels to be
calculated. Prior work only considered border management using direct form filter structures, because
the window formation and filter function can be kept independent. However, in some applications,
transpose-form filter structures are desirable because the corresponding filter function is automatically
pipelined, leading to fewer resources and faster clock frequencies. Bailey and Ambikumar [4] provide
a design methodology for border management using transpose filter structures, and show that the
resource requirements are similar to those for direct-form border management.
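As a concrete illustration of border extension, replicating the nearest edge pixel is one common policy (among several, such as mirroring or constant padding); the paper's contribution is how such extension is realised within transpose filter structures, which this sketch does not attempt to model:

```python
def extend_borders(img, pad):
    # Replicate-edge extension: out-of-range coordinates are clamped to
    # the nearest valid pixel, so a window filter can produce an output
    # for every input pixel, keeping the output image the same size.
    h, w = len(img), len(img[0])
    clamp = lambda v, hi: max(0, min(v, hi - 1))
    return [[img[clamp(y - pad, h)][clamp(x - pad, w)]
             for x in range(w + 2 * pad)]
            for y in range(h + 2 * pad)]
```

Extending a 2×2 image by one pixel on each side yields a 4×4 image whose border rows and columns repeat the nearest edge values.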
An important task in computer vision is segmenting objects from a complex background. While
there are many background modelling algorithms, the complexity of robust algorithms makes them
difficult to realise on an FPGA, especially for larger image sizes. Chen et al. [5] address scalability issues
with increasing image size by using super-pixels—small blocks of adjacent pixels that are treated as a
single unit. As each super-pixel is considered to be either object or background, this means that fewer


models need to be maintained (less memory) and fewer elements need to be classified (reduced
computation time). Using hardware/software co-design, they accelerated the computationally
expensive steps of Gaussian filtering and calculating the mean and variance within each super-pixel
with hardware, with the rest of the algorithm being realised on the on-chip CPU. The resulting system
gave close to state-of-the-art classification accuracy.
A related paper, by Badawi and Bilal [6], used K-means clustering to segment objects within video
sequences. Rather than taking the conventional iterative approach to K-means clustering, they rely
on the temporal coherence of video streams and use the cluster centres from the previous frame as
initialisation for the current frame. Additionally, rather than waiting until the complete frame has
been accumulated before updating the cluster centres, an online algorithm is used, with the clusters
updated for each pixel. To reduce the computational requirements, the centres are updated using a
weighted average. They demonstrate that, for typical video streams, this gives similar performance to
conventional K-means algorithms, but with far less computation and power.
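The per-pixel update can be sketched as follows. This is a simplified one-dimensional (greyscale) version for illustration: the fixed learning rate `alpha` and the scalar absolute-difference distance are assumptions of this sketch, not details taken from the paper, which uses a weighted average over the accumulated cluster membership.

```python
def online_kmeans_frame(pixels, centres, alpha=0.05):
    # One frame of online K-means: centres are seeded from the previous
    # frame, and each pixel nudges its nearest centre immediately rather
    # than waiting for a full-frame accumulation pass.
    centres = list(centres)                # don't mutate the caller's centres
    labels = []
    for p in pixels:
        k = min(range(len(centres)), key=lambda i: abs(p - centres[i]))
        centres[k] += alpha * (p - centres[k])   # weighted-average update
        labels.append(k)
    return centres, labels
```

Seeding the next frame with the returned `centres` exploits the temporal coherence of video: for a static scene the centres converge once and subsequent frames need only small corrections.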
In another segmentation paper, Zhou et al. [7] describe the use of a generalised Laplacian of
Gaussian (LoG) filter for detecting cell nuclei for a histopathology application. The LoG filters detect
elliptical blobs at a range of scales and orientations. Local maxima of the responses are used as
candidate seeds for cell centres, and mean-shift clustering is used to combine multiple detections
from different scales and orientations. Their FPGA design gave modest acceleration over a software
implementation on a high-end computer.
Given a segmented image, a common task is to measure feature vectors of each connected
component for analysis. Bailey and Klaiber [8] present a new single-pass connected components
analysis algorithm, which does this with minimum latency and relatively few resources. The key novelty
of this paper is the use of a zig-zag based scan, rather than a conventional raster scan. This eliminates the
end-of-row processing for label resolution by integrating it directly within the reverse scan. The result is
true single-pixel-per-clock-cycle processing, with no overheads at the end of each row or frame.
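The scan pattern itself is straightforward to express; the sketch below generates only the pixel visiting order, while the label-merging logic that exploits it (the substance of the paper) is not shown:

```python
def zigzag_order(width, height):
    # Zig-zag (boustrophedon) scan: even rows are visited left-to-right,
    # odd rows right-to-left, so consecutive pixels are always adjacent
    # and row-end label resolution can be folded into the reversed scan.
    for y in range(height):
        cols = range(width) if y % 2 == 0 else range(width - 1, -1, -1)
        for x in cols:
            yield x, y
```

For a 3×2 image the order is (0,0), (1,0), (2,0), (2,1), (1,1), (0,1): the scan turns around at the row end instead of jumping back to column 0.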
An important real-time application of image processing is embedded online image compression
for reducing the data bandwidth for image transmission. In the final paper within this Special Issue,
Wang et al. [9] defined a new image compression codec which works efficiently with a streamed image,
and minimises the perceptual distortion within the reconstructed images. Through small local filters,
each pixel is classified as either an edge, a smooth region, or a textured region. These relate to a
perceptual model of contrast masking, allowing just noticeable distortion (JND) thresholds to be
defined. The image is compressed by downsampling; however, if the error in any of the contributing
pixels exceeds the visibility thresholds, the 2 × 2 block is considered a region of interest, with the
4 pixels coded separately. In both cases, the pixel values are predicted using a 2-dimensional predictor,
and the prediction residuals are quantised and entropy-encoded. Results typically give a visually
lossless 4:1 compression, which is significantly better than other visually lossless codecs.
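The block-level region-of-interest decision can be sketched as follows. Greyscale values and a precomputed per-pixel threshold map are assumed here for illustration; the actual codec derives the JND thresholds from its edge/smooth/texture classification, and additionally predicts and entropy-codes the outputs, which this sketch omits.

```python
def classify_blocks(img, jnd):
    # Each 2x2 block is represented by its mean (downsampling) unless any
    # pixel's error versus that mean exceeds its just-noticeable-distortion
    # threshold, in which case the block is flagged as a region of interest
    # and its 4 pixels are kept for separate coding.
    out = []
    for y in range(0, len(img), 2):
        for x in range(0, len(img[0]), 2):
            block = [img[y + dy][x + dx] for dy in (0, 1) for dx in (0, 1)]
            thr   = [jnd[y + dy][x + dx] for dy in (0, 1) for dx in (0, 1)]
            mean = sum(block) / 4.0
            roi = any(abs(p - mean) > t for p, t in zip(block, thr))
            out.append(('roi', block) if roi else ('mean', mean))
    return out
```

A flat block stays within its thresholds and is reduced to a single mean value (4:1), while a high-contrast block exceeds them and is flagged for per-pixel coding.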

3. Conclusions
Overall, this collection of papers reflects the diversity of approaches taken to applying FPGAs to
image processing applications. These range from using the programmable logic to design lightweight
custom processors that enable parallelism, through overcoming some of the limitations of current
high-level synthesis tools, to designing custom hardware at the register-transfer level.
The range of image processing techniques includes filtering, segmentation, clustering, and
compression. Applications include traffic sign recognition for autonomous driving, histopathology,
and video compression.


Funding: This research received no external funding.


Acknowledgments: The Guest Editor would like to acknowledge the time and contributions of the authors
(both successful and unsuccessful) who prepared papers for this Special Issue. Special thanks go to all the
reviewers who provided constructive reviews of the papers in a timely manner; your analysis and feedback has
ensured the quality of the papers selected. It is also necessary to acknowledge the assistance given by the MDPI
editorial team, in particular Managing Editors Alicia Wang and Veronica Wang, who made my task as Guest
Editor much easier.
Conflicts of Interest: The author declares no conflict of interest.

References
1. Siddiqui, F.; Amiri, S.; Minhas, U.I.; Deng, T.; Woods, R.; Rafferty, K.; Crookes, D. FPGA-based processor
acceleration for image processing applications. J. Imaging 2019, 5, 16. [CrossRef]
2. Garcia, P.; Bhowmik, D.; Stewart, R.; Michaelson, G.; Wallace, A. Optimized memory allocation and power
minimization for FPGA-based image processing. J. Imaging 2019, 5, 7. [CrossRef]
3. Shi, R.; Wong, J.S.; So, H.K.H. High-throughput line buffer microarchitecture for arbitrary sized streaming
image processing. J. Imaging 2019, 5, 34. [CrossRef]
4. Bailey, D.G.; Ambikumar, A.S. Border handling for 2D transpose filter structures on an FPGA. J. Imaging
2018, 4, 138. [CrossRef]
5. Chen, A.T.Y.; Gupta, R.; Borzenko, A.; Wang, K.I.K.; Biglari-Abhari, M. Accelerating SuperBE with
hardware/software co-design. J. Imaging 2018, 4, 122. [CrossRef]
6. Badawi, A.; Bilal, M. High-level synthesis of online K-Means clustering hardware for a real-time image
processing pipeline. J. Imaging 2019, 5, 38. [CrossRef]
7. Zhou, H.; Machupalli, R.; Mandal, M. Efficient FPGA implementation of automatic nuclei detection in
histopathology images. J. Imaging 2019, 5, 21. [CrossRef]
8. Bailey, D.G.; Klaiber, M.J. Zig-zag based single pass connected components analysis. J. Imaging 2019, 5, 45.
[CrossRef]
9. Wang, Z.; Tran, T.H.; Muthappa, P.K.; Simon, S. A JND-based pixel-domain algorithm and hardware
architecture for perceptual image coding. J. Imaging 2019, 5, 50. [CrossRef]

© 2019 by the author. Licensee MDPI, Basel, Switzerland. This article is an open access
article distributed under the terms and conditions of the Creative Commons Attribution
(CC BY) license (http://creativecommons.org/licenses/by/4.0/).

Article
FPGA-Based Processor Acceleration for Image
Processing Applications
Fahad Siddiqui 1,† , Sam Amiri 2,† , Umar Ibrahim Minhas 1 , Tiantai Deng 1 , Roger Woods 1, * ,
Karen Rafferty 1 and Daniel Crookes 1
1 School of Electronics, Electrical Engineering and Computer Science, Queen’s University Belfast,
Belfast BT7 1NN, UK; [email protected] (F.S.); [email protected] (U.I.M.); [email protected] (T.D.);
[email protected] (K.R.); [email protected] (D.C.)
2 School of Computing, Electronics and Maths, Coventry University, Coventry CV1 5FB, UK;
[email protected]
* Correspondence: [email protected]; Tel.: +44-289-097-4081
† These authors contributed equally to this work.

Received: 27 November 2018; Accepted: 7 January 2019; Published: 13 January 2019

Abstract: FPGA-based embedded image processing systems offer considerable computing resources
but present programming challenges when compared to software systems. The paper describes an
approach based on an FPGA-based soft processor called the Image Processing Processor (IPPro), which
can operate at up to 337 MHz on a high-end Xilinx FPGA family, and gives details of the dataflow-based
programming environment. The approach is demonstrated for a k-means clustering operation and
a traffic sign recognition application, both of which have been prototyped on an Avnet Zedboard
that has a Xilinx Zynq-7000 system-on-chip (SoC). A number of parallel dataflow mapping options
were explored giving a speed-up of 8 times for the k-means clustering using 16 IPPro cores, and a
speed-up of 9.6 times for the morphology filter operation of the traffic sign recognition using
16 IPPro cores compared to their equivalent ARM-based software implementations. We show that for
k-means clustering, the 16 IPPro cores implementation is 57, 28 and 1.7 times more power efficient
(fps/W) than ARM Cortex-A7 CPU, nVIDIA GeForce GTX980 GPU and ARM Mali-T628 embedded
GPU respectively.

Keywords: FPGA; hardware acceleration; processor architectures; image processing; heterogeneous computing

1. Introduction
With improved sensor technology, there has been considerable growth in the amount of data
being generated by security cameras. In many remote environments with limited communication
bandwidth, there is a clear need to overcome this by deploying functionality remotely in the system,
such as performing motion estimation in smart cameras [1]. As security requirements grow,
the processing needs will only increase.
New forms of computing architectures are needed. In the late 1970s, Lamport [2] laid the foundations
of parallel architectures exploiting data-level parallelism (DLP) through workload vectorisation and
shared-memory parallelisation, techniques used extensively in graphics processing units (GPUs).
Current energy requirements and the limitations of Dennard scaling have acted to limit clock scaling
and thus reduce the future processing capabilities of GPUs and multi-core architectures [3]. Recent field
programmable gate array (FPGA) architectures represent an attractive alternative for acceleration,
as they comprise ARM processors alongside programmable logic for accelerating compute-intensive operations.
FPGAs are proven computing platforms that offer reconfigurability, concurrency and pipelining,
but have not been accepted as a mainstream computing platform. The primary inhibitor is the need to

J. Imaging 2019, 5, 16; doi:10.3390/jimaging5010016; www.mdpi.com/journal/jimaging



use specialist programming tools and describe algorithms in a hardware description language (HDL),
although this has been alleviated by the introduction of high-level programming tools such as Xilinx’s
Vivado High-Level Synthesis (HLS) and Intel’s (Altera’s) compiler for OpenCL. While the level of
abstraction has been raised, a gap still exists between adaptability, performance and efficient utilisation
of FPGA resources. Moreover, the FPGA design flow still requires design synthesis and place-and-route,
which can be time-consuming depending on the complexity and size of the design [4,5]. This design
flow is alien to software/algorithm developers and inhibits wider use of the technology.
One way to approach this research problem is to develop an adaptable FPGA hardware architecture
that enables the edit-compile-run flow familiar to software and algorithm developers, instead of
hardware synthesis and place-and-route. This can be achieved by populating the FPGA logic with a
number of efficient soft-core processors used for programmable hardware acceleration. This underlying
architecture is adaptable and can be programmed using conventional software development approaches.
However, the challenge is to build an FPGA solution that is more easily programmed whilst still providing
high performance. Whilst FPGA-based processor architectures exist, such as Xilinx’s MicroBlaze, Altera’s
NIOS and others [6–9], we propose an Image Processing Processor (IPPro) [10] tailored to accelerate
image processing operations, thereby providing an excellent mapping between FPGA resources, speed
and programming efficiency. The main purpose of the paper is to give insights into the multi-core
processor architecture built using the IPPro, its programming environment, and to outline its application
to two image processing applications. Our main contributions are:

• Creation of an efficient, FPGA-based multicore processor which advances previous work [10,11],
and an associated dataflow-based compiler environment for programming a heterogeneous
FPGA resource comprising IPPro cores and ARM processors.
• Exploration of mapping the functionality for a k-means clustering function, resulting in a possible
speedup of up to 8 times that is 57, 28 and 1.7 times more power efficient (fps/W) than ARM
Cortex-A7 CPU, nVIDIA GeForce GTX980 GPU and ARM Mali-T628 embedded GPU.
• Acceleration of colour and morphology operations of traffic sign recognition application, resulting
in a speedup of 4.5 and 9.6 times respectively on a Zedboard.

The rest of the paper is organized as follows: Section 2 outlines the various image processing
requirements, describes how these can be matched to FPGAs, and reviews relevant research.
System requirements are outlined in Section 3 and the soft core processor architecture is briefly
reviewed in Section 4. The system architecture is outlined in Section 5. Experiments to accelerate a
k-means clustering algorithm and a traffic sign recognition example are presented in Sections 6 and 7
respectively. Conclusions and future work are described in Section 8.

2. Background
Traditionally, vision systems have been created in a centralized manner where video from
multiple cameras is sent to a central back-end computing unit to extract significant features. However,
with an increasing number of nodes and wireless communications, this approach becomes increasingly
limited, particularly with higher resolution cameras [12]. A distributed processing approach can be
employed where data-intensive, front-end preprocessing such as sharpening, object detection etc. can
be deployed remotely, thus avoiding the need to transmit high data-rate video streams back to the server.

2.1. Accelerating Image Processing Algorithms


Nugteren et al. have characterized image processing operations based on their computation and
communication patterns [13], as highlighted in Table 1. The vision processing architecture can be
composed of general and special purpose processors, FPGAs or combinations thereof. FPGAs offer
opportunities to exploit the fine/coarse grained parallelism that most of the image processing
applications exhibit at front-end processing. Heterogeneous architectures comprising CPUs and FPGA
fabrics thus offer a good balance in terms of performance, cost and energy efficiency.

J. Imaging 2019, 5, 16

Brodtkorb et al. have compared the architectural and programming language properties of
heterogeneous architectures comprising CPU, GPU and FPGA [14], showing that FPGAs deliver
a better performance/W ratio for fixed-point operations; however, they are difficult to program.
Different design approaches have been adopted by the research community to build FPGA-based
hardware accelerators. These include:

• Customised hardware accelerator designs in HDLs which require long development times but
can be optimised in terms of performance and area.
• Application specific hardware accelerators which are generally optimized for a single function,
non-programmable and created using IP cores.
• Designs created using high-level synthesis tools, such as Xilinx’s Vivado HLS tool and
Altera’s OpenCL compiler, which convert a C-based specification into synthesizable RTL
code [15], allowing pipelining and parallelization to be explored.
• Programmable hardware accelerators in the form of vendor-specific soft processors, such as
Xilinx’s MicroBlaze and Altera’s NIOS II, and customized hard/soft processors.

Table 1. Categorisation of image processing operations based on their memory and execution
patterns [13], allowing features of compute and memory patterns to be highlighted and therefore
identifying what can be mapped into FPGA.

Operation Type   Domain    Output Depends on     Memory Pattern            Execution Pattern      Examples
Point and Line   Spatial   Single input pixel    Pipelined                 One-to-one             Intensity change by factor, Negative image-inversion
Area/Local       Spatial   Neighbouring pixels   Coalesced                 Tree                   Convolution functions: Sobel, Sharpen, Emboss
Geometric        Spatial   Whole frame           Recursive non-coalesced   Large reduction tree   Rotate, Scale, Translate, Reflect, Perspective and Affine
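As an illustration of the first two rows of Table 1, the sketch below (our own Python illustration, not part of any FPGA toolchain) contrasts a point operation, where each output pixel depends on a single input pixel, with an area operation, where it depends on a 3 × 3 neighbourhood:

```python
def negative(img, max_val=255):
    """Point operation: each output pixel depends on one input pixel."""
    return [[max_val - p for p in row] for row in img]

def convolve3x3(img, kernel):
    """Area operation: each output pixel depends on a 3x3 neighbourhood
    (borders skipped for brevity)."""
    h, w = len(img), len(img[0])
    out = [[0] * w for _ in range(h)]
    for y in range(1, h - 1):
        for x in range(1, w - 1):
            out[y][x] = sum(img[y + dy][x + dx] * kernel[dy + 1][dx + 1]
                            for dy in (-1, 0, 1) for dx in (-1, 0, 1))
    return out

# One of the convolution kernels named in Table 1:
sharpen = [[0, -1, 0], [-1, 5, -1], [0, -1, 0]]
```

The one-to-one pattern of `negative` is trivially pipelined, while `convolve3x3` needs the coalesced neighbourhood access that row buffering provides in an FPGA implementation.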

2.2. Soft Processor Architectures


Numerous FPGA multiprocessor architectures have been created to accelerate applications.
Strik et al. used a heterogeneous multiprocessor system with a reconfigurable network-on-chip to
process multiple video streams concurrently in real-time [16]. VectorBlox MXP [7] is the latest of a
series of vector-based soft core processor architectures designed to exploit data-level parallelism (DLP) by processing vectors.
Optimizations employed include replacing a vector register file with a scratchpad memory to allow for
arbitrary data packing and access, removing vector length limits, enabling sub-word single-instruction,
multiple-data (SIMD) within each lane and a DMA-based memory interface.
Zhang et al. have created composable vector units [17], allowing the dataflow graph (DFG) of a
vector program to be statically compiled and clusters of operations to be composed together to create a
new streaming instruction that uses multiple operators and operands. This is similar to traditional
vector chaining but is not easily extended to support wide SIMD-style parallelism, and the reported
speed-ups were less than a factor of two. Further optimizations have been employed in a custom
SVP implemented in Bluespec [18], where a custom pipeline was compared to the SVP implementation and
performance was found to be within a factor of two for similar resource usage. Kapre et al. have proposed
GraphSoC, a custom soft processor for accelerating graph algorithms [19]. It is a three-stage pipelined
processor that supports graph semantics (node and edge operations). The processor was designed with
Vivado HLS. Each core uses nine BRAMs and runs at 200 MHz.
Octavo [20] is a multi-threaded, ten-cycle processor that runs at 550 MHz on a Stratix IV, equivalent
to the maximum frequency supported by memory blocks. A deep pipeline is necessary to support this
high operating frequency, but suffers from the need to pad dependent instructions to overcome data
hazards. The authors sidestep this issue by designing Octavo as a multi-processor, thus dependent
instructions are always sufficiently far apart and NOP padding is not needed. Andryc et al. presented


a GPGPU architecture called FlexGrip [8] which, like vector processors, supports wide, data-parallel,
SIMD-style computation using multiple parallel compute lanes, provides support for conditional
operations, and requires optimized interfaces to on- and off-chip memory. FlexGrip maps pre-compiled
CUDA kernels onto soft core processors which are programmable and operate at 100 MHz.

3. System Implementation
Whilst earlier versions of FPGAs just comprised multiple Lookup Tables (LUT) connected to
registers and accelerated by fast adders, FPGAs now comprise more coarse-grained functions such as
dedicated, full-custom, low-power DSP slices. For example, the Xilinx DSP48E1 block comprises a
25-bit pre-adder, a 25 × 18-bit multiplier and a 48-bit adder/subtracter/logic unit; this is complemented
by multiple distributed RAM blocks offering high bandwidth capability (Figure 1) and a plethora of
registers supporting high levels of pipelining.


Figure 1. Bandwidth/memory distribution in the Xilinx Virtex-7 FPGA, highlighting how bandwidth
and computation improve as we near the datapath parts of the FPGA.
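As a rough behavioural sketch of the DSP48E1 arithmetic chain described above (pre-adder, multiplier, then 48-bit add/subtract); the unsigned arithmetic and simple bit masking here are simplifying assumptions, not a cycle-accurate model of the slice:

```python
def mask(v, bits):
    """Truncate a value to the given bit width (unsigned model)."""
    return v & ((1 << bits) - 1)

def dsp48e1(a, d, b, c, subtract=False):
    """Behavioural sketch of one pass through the DSP48E1 arithmetic chain."""
    pre = mask(a + d, 25)       # 25-bit pre-adder
    prod = mask(pre * b, 43)    # 25 x 18-bit multiplier
    alu = c - prod if subtract else c + prod
    return mask(alu, 48)        # 48-bit adder/subtracter/logic unit
```

A multiply-accumulate, the core of the filtering operations discussed later, is then repeated application of this chain with `c` holding the running sum.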

Whilst FPGAs have been successfully applied in embedded systems and communications,
they have struggled as a mainstream computational platform. Addressing the following considerations
would make FPGAs a major platform rival for “data-intensive” applications:

• Programmability: there is a need for a design methodology which includes a flexible data
communication interface to exchange data. Intellectual Property (IP) cores and HLS tools [15]/
OpenCL design routes increase programming abstraction but do not provide the flexible system
infrastructure for image processing systems.
• Dataflow support: the dataflow model of computation is a recognized model for data-intensive
applications. Algorithms are represented as a directed graph composed of nodes (actors) as
computational units and edges as communication channels [21]. While the actors run explicitly in
parallel decided by the user, actor functionality can either be sequential or concurrent. Current
FPGA realizations use the concurrency of the whole design at a higher level but eliminate
reprogrammability. A better approach is to keep reprogrammability while still maximizing
parallelism by running actors on simple “pipelined” processors; the actors still run their code
explicitly in parallel (user-specified).
• Heterogeneity: the processing features of FPGAs should be integrated with CPUs. Since dataflow
supports both sequential and concurrent platforms, the challenge is then to allow effective
mapping onto CPUs with parallelizable code onto FPGA.
• Toolset availability: design tools created to specifically compile user-defined dataflow programs at
higher levels to fully reprogrammable heterogeneous platform should be available.

High-Level Programming Environment


The proposed methodology employs a reprogrammable model comprising multi-core processors
supporting SIMD operation and an associated inter-processor communication methodology.
A dataflow design methodology has been chosen as the high-level programming approach, as it offers
concurrency, scalability, modularity and data-driven properties, all of which match the design
requirements associated with image processing systems. A dataflow model allows algorithms to be
realized as actors with specific firing rules that are mapped into directed graphs where the nodes
represent computations and arcs represent the movement of data. The term data-driven is used to
express the execution control of dataflow with the availability of the data itself. In this context, an actor

8
J. Imaging 2019, 5, 16

is a standalone entity, which defines an execution procedure and can be implemented in the IPPro
processor. Actors communicate by passing data tokens through First-In-First-Out (FIFO) units.
The combination of a set of actors with a set of connections between them constructs a network,
which maps well to the system-level architecture of the IPPro processors. An earlier version of the
programming environment is detailed in [11], allowing the user to explore parallel implementations
and providing the necessary back-end compilation support.
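This token-passing execution model can be mimicked in software. In the toy model below (names such as `Actor` and `fire` are illustrative, not part of the IPPro tools), actors fire only when a token is available in their input FIFO, which is the data-driven condition described above:

```python
from collections import deque

class Actor:
    def __init__(self, func, inp, out):
        self.func, self.inp, self.out = func, inp, out

    def fire(self):
        # Data-driven firing rule: run only when an input token is present.
        if self.inp:
            self.out.append(self.func(self.inp.popleft()))
            return True
        return False

a, b, c = deque([1, 2, 3]), deque(), deque()
stage1 = Actor(lambda t: t * 2, a, b)   # e.g. scale pixels
stage2 = Actor(lambda t: t + 1, b, c)   # e.g. offset pixels
# Non-short-circuit '|' gives every actor a chance to fire each round.
while stage1.fire() | stage2.fire():
    pass
```

Running the two stages to completion drains the input FIFO through both actors, mirroring how pipelined IPPro cores pass tokens through blocking FIFOs.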
In our flow, every processor can be thought of as an actor and data is fired through the FIFO
structures but the approach needs to be sensitive to FPGA-based limitations such as restricted memory.
Cal Actor Language (CAL) [22] is a dataflow programming language that has been targeted at image
processing and FPGAs; it offers the necessary constructs for expressing parallel or sequential
coding, bitwise types, a consistent memory model, and communication between parallel tasks
through queues. RVC-CAL is supported by an open source dataflow development environment and
compiler framework, Orcc, that allows the trans-compilation of actors and generates equivalent code
depending on the chosen back-ends [23]. An RVC-CAL based design is composed of a dataflow
network file (.xdf file) that supports task and data-level parallelism.
Figure 2 illustrates the possible pipelined decomposition of dataflow actors. These dataflow
actors need to be balanced as the worst-case execution time of the actor determines the overall
achievable performance. Data-level parallelism is achieved by making multiple instances of an actor
and requires SIMD operations that must be supported by the underlying processor architecture.
In addition, it requires a software-configurable, system-level infrastructure that manages control and data
distribution/collection tasks. This involves the initialisation of the soft core processors (programming the
decomposed dataflow actor description), receiving data from the host processor, distributing it to
first-level actors, gathering processed data from the final-level actors and sending it back to the host processor.
Data-level parallelism directly impacts the system performance; the major limiting factor is the
number of resources available on FPGA. An example pipeline structure with an algorithm composed
of four actors each having different execution times, and multiple instances of the algorithm realised
in SIMD fashion is shown in Figure 2. The performance metric, frames per second (fps), can be
approximated using N(total_pixels), the number of pixels in a frame, N(pixel_consumption), the number of
pixels consumed by an actor in each iteration, and f(processor), the operating frequency of the processor:

    fps ≈ ( f(processor) × N(pixel_consumption) ) / N(total_pixels)        (1)
To improve the fps, the following options are possible:

• Efficient FPGA-based processor design that operates at a higher operating frequency f(processor).
• Reducing the actor’s execution time by decomposing it into multiple pipelined stages, thus reducing
t(actor) to improve the fps. Shorter actors can be merged sequentially to minimise the data transfer
overhead by localising data into FIFOs between processing stages.
• Vertical scaling to exploit data parallelism by mapping an actor onto multiple processor cores, thus
reducing N(total_pixels) / (n × N(pixel_consumption)) at the cost of additional system-level data
distribution, control, and collection mechanisms.
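Equation (1) and the vertical-scaling option can be sketched numerically; the frame size and per-iteration consumption below are hypothetical examples, not measurements from the paper:

```python
def fps(f_processor_hz, pixels_per_iteration, total_pixels, n_cores=1):
    """Approximate frames per second from Equation (1); with n_cores > 1
    the per-core pixel load drops to total_pixels / n_cores."""
    return f_processor_hz * pixels_per_iteration * n_cores / total_pixels

# A 337 MHz core consuming one pixel per iteration on a 640 x 480 frame:
single = fps(337e6, 1, 640 * 480)
# The same actor replicated over 8 cores (vertical scaling):
scaled = fps(337e6, 1, 640 * 480, n_cores=8)
```

The model assumes one iteration per clock cycle, so it is an upper bound; real actors take several instructions per token, which is why actor decomposition and merging matter.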

Figure 2. Illustration of possible data and task parallel decomposition of a dataflow algorithm found in
image processing designs, where the number of rows indicates the level of parallelism.


The developed tool flow (Figure 3) starts with a user-defined RVC-CAL description composed
of actors selected to execute in FPGA-based soft cores with the rest to be run in the host CPUs.
By analyzing behaviour, software/hardware partitioning is decided by two main factors: the actors
with the worst execution time (determined exactly by the number of instructions and the average waiting
time to receive input tokens and send produced tokens), and the overheads incurred in
transferring the image data to/from the accelerator.
be coded in different formats:

• No explicitly balanced actors or actions are provided by the user.
• The actors include actions which are balanced without depending on each other, e.g., no global
variable in an actor is updated by one action and then used by the others; otherwise,
these would need to be decomposed into separate actors.
• The actors are explicitly balanced and only require hardware/software partitioning.

[Figure 3 flow: Behavioural Description in RVC-CAL → Software/Hardware Partitioning →
Redesign of CPU-Targeted Actors in RVC-CAL and Redesign of FPGA-Targeted Actors in RVC-CAL →
SIMD Application → Compiler Infrastructure (RVC-CAL–C Compilation, XDF Analysis, Interface
Settings, Actor Code Generation) → System Implementation with Control-Register Value/Parameter
Generation.]

Figure 3. A brief description of the design flow of a hardware and software heterogeneous system,
highlighting key features. More detail of the flow is contained in reference [11].

There are two types of decomposition, “row-” and “column-wise”. The newly generated data-
independent actors can be placed row-wise at the same pipeline stage; otherwise they can be placed
column-wise as consecutive pipeline stages. Row-wise is preferred as the overhead incurred in token
transmission can be a limiting factor but typically a combination is employed.
If the actors or actions are not balanced, then they need to be decomposed. This is done by
detecting a sequence of instructions without branches (unless this occurs at the end) and then breaking
the program into basic blocks. The “balance points”, at which the actor is divided into
multiple sets of basic blocks such that each set can be placed in a new actor, then need to be found;
this ensures that the overhead of transferring tokens among the sets does not create a bottleneck,
and the partitioning with the lowest overhead is selected (see Ref. [11]). Once the graph is
partitioned, the original xdf file no longer represents the network topology, so each set of actors must
be redesigned separately, their input/output ports fixed, and a new set of xdf dataflow network
description files generated. The actors to run on the host CPU are compiled from RVC-CAL to C
using the C backend of Orcc development environment, whereas the FPGA-based functionality is then
created using the proposed compiler framework.
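The search for balance points can be illustrated by a brute-force sketch that partitions a straight-line sequence of basic-block costs into k contiguous actors so that the worst actor cost, which bounds throughput, is minimised. Using instruction count as the cost measure is a simplifying assumption; the actual compiler flow is described in [11]:

```python
from itertools import combinations

def balance(costs, k):
    """Split contiguous basic-block costs into k sets minimising the
    maximum set cost; brute force is fine for small actor programs."""
    best = None
    for cuts in combinations(range(1, len(costs)), k - 1):
        bounds = [0, *cuts, len(costs)]
        sets = [costs[bounds[i]:bounds[i + 1]] for i in range(k)]
        worst = max(sum(s) for s in sets)
        if best is None or worst < best[0]:
            best = (worst, sets)
    return best

# Five basic blocks split into two actors:
worst, actors = balance([4, 9, 3, 8, 6], 2)
```

In a fuller model the cost of each candidate split would also include the token-transfer overhead between the resulting actors, as the text above requires.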
The degree of SIMD applied will affect the controller interface settings. For a target board,
the design will have a fixed number of IPPro cores realized and interconnected with each other and


controllers, determined by the FPGA resources and fan-out delay; for the Zedboard considered here,
32 cores are selected. The compilation infrastructure is composed of three distinctive steps:

• Examination of the xdf dataflow network file and assignment and recording of the actor mapping
to the processors on the network.
• Compilation of each actor’s RVC-CAL code to IPPro assembly code.
• Generation of control register values, mainly for AXI Lite registers, and parameters required by
the developed C-APIs running on the host CPU.

While FPGA-targeted actor interaction is handled by the compiler, the processes for receiving
the image data and storing the output in the edge actors need to be developed. Multiple controllers
(programmable by the host CPU) are designed to provide the interface to transfer the data to the
accelerators, gather the results and transfer them back to the host. With the host CPU running part
of the design and setting control registers, and the IPPro binary codes of the other actors loaded to
the proper cores on the accelerator, and the interface between the software/hardware sections set
accordingly, the system implementation is in place and ready to run.

4. Exploration of Efficient FPGA-Based Processor Design


Image processing applications extensively use multiply and accumulate operations for image
segmentation and filtering which can be efficiently mapped to FPGA. On the FPGA, the dedicated
memory blocks are located next to the DSP blocks to minimise any timing delays and it is this that
determines the maximum operating frequency ( f max ) of the processor. It is one of the reasons that
many-core and multi-core architectures use simple, light-weight processing datapaths over complex
and large out-of-order processors. However, maintaining the balance among soft processor functionality,
scalability, performance and efficient utilisation of FPGA resources remains an open challenge.
Figure 4 presents the impact of different configurations of the DSP48E1 and BRAM on fmax
using different FPGAs. The DSP48E1 has five configurations that offer different functionalities
(multiplier, accumulator, pre-adder and pattern detector) based on different internal pipeline
configurations that directly impact fmax. It varies by 15–52% for the same speed grade and reduces by
12–20% when the same design is ported from a −3 to a −1 speed grade. Configuring the BRAM as single
and true-dual port RAM, Figure 4b shows that a true-dual port RAM configuration gives a reduction of
25% in fmax. However, an improvement of 16% is possible by migrating the design from Artix-7 to
Kintex-7 FPGA technology.
Table 2 shows the distribution of compute (DSP48E1) and memory (BRAM) resources, and
highlights the raw performance in GMAC/s (giga multiply-accumulates per second) across the largest
FPGA devices covering both standalone and Zynq SoC chips. A BRAM/DSP ratio metric is reported
to quantify the balance between compute and memory resources. In Zynq SoC devices, it is higher
than standalone devices because more memory is required to implement substantial data buffers to
exchange data between FPGA fabric and the host processor, while it is close to unity for standalone
devices. This suggests that BRAM/DSP ratio can be used to quantify area efficiency of FPGA designs.


Figure 4. (a) Impact of DSP48E1 configurations on maximum achievable clock frequency using
different speed grades using Kintex-7 FPGAs for fully pipelined with no (NOPATDET) and with
(PATDET) PATtern DETector, then multiply with no MREG (MULT_NOMREG) and pattern detector
(MULT_NOMREG_PATDET) and a Multiply, pre-adder, no ADREG (PREADD_MULT_NOADREG)
(b) Impact of BRAM configurations on the maximum achievable clock frequency of Artix-7, Kintex-7
and Virtex-7 FPGAs for single and true-dual port RAM configurations.

Table 2. Computing resources (DSP48E1) and BRAM memory resources for a range of Xilinx Artix-7,
Kintex-7, Virtex-7 FPGA families implemented using 28 nm CMOS technology.

Product      Family     Part Number   BRAM (18 Kb Each)   DSP48E1   GMAC/s   BRAM/DSP
Standalone   Artix-7    XC7A200T      730                 740       929      0.99
Standalone   Kintex-7   XC7K480T      1910                1920      2845     0.99
Standalone   Virtex-7   XC7VX980T     3000                3600      5335     0.83
Zynq SoC     Artix-7    XC7Z020       280                 220       276      1.27
Zynq SoC     Kintex-7   XC7Z045       1090                900       1334     1.21

4.1. Exploration of FPGA Fabric for Soft Core Processor Architecture


A system composed of light-weight and high-performance soft core processors that supports
modular computation with fine and coarse-grained functional granularity is more attractive than fixed
dedicated hardware accelerators. A lightweight, soft core processor allows more programmable
hardware accelerators to be accommodated onto a single SoC chip which would lead to better
acceleration possibilities by exploiting data and task-level parallelism.
Gupta et al. [24,25] have reported different dataflow graph models where the functionality
corresponds to the soft core datapath models 1, 2 and 3 shown in Figure 5. These dataflow models
are used to find a trade-off between the functionality of the soft core processor and fmax, and laid the
foundation for finding a suitable soft core datapath onto which to map and execute the dataflow
specification. The input/output interfaces are marked in red, while the grey boxes represent the
functionality mapped onto the soft core datapath models shown in Figure 6.
The first model (model 1) exhibits the datapath of a programmable ALU as shown in Figure 6a. It has an
instruction register (IR) that defines a DFG node (OP1) programmed at system initialisation. On each
clock cycle, the datapath explicitly reads a token from the input FIFO, processes it based on the
programmed operation and stores the result into the output FIFO that is then consumed by the
following dataflow node (OP3). This model only allows the mapping of data independent, fine-grained
dataflow nodes as shown in Figure 5a which limits its applicability due to lack of control and data
dependent execution, commonly found in image processing applications where the output pixel


depends on the input or neighbouring pixels. This model is only suitable for mapping a single
dataflow node.
The second model (model 2) increases the datapath functionality to a fine-grained processor by including
BRAM-based instruction memory (IM), a program counter (PC) and kernel memory (KM) to store constants,
as shown in Figure 6b. Conversely, model 2 can support the mapping of multiple data independent
dataflow nodes as shown in Figure 5b. The node (OP2) requires memory storage for a variable (t1) to
compute the output token (C), which feeds back from the output of the ALU for the next instruction in
the following clock cycle. This model supports improved dataflow mapping functionality over model 1 by
introducing an IM, which comes at the cost of variable execution time and throughput proportional
to the number of instructions required to implement the dataflow actor. This model is suitable for
accelerating combinational logic computations.
The third model (model 3) increases the datapath functionality to map and execute a data dependent
dataflow actor as shown in Figure 5c. The datapath has memory in the form of a register file (RF),
representing the coarse-grained processor shown in Figure 6c. The RF stores intermediate results to execute
data dependent operations, implements (feed-forward, split, merge and feedback) dataflow execution
patterns and facilitates dataflow transformations (actor fusion/fission, pipelining etc.), constrained
by the size of the RF. It can implement modular computations which are not possible in models 1 and 2.
In contrast to models 1 and 2, the token production/consumption (P/C) rate of model 3 can be controlled
through program code, which allows software-controlled scheduling and load balancing possibilities.

Figure 5. A range of dataflow models taken from [24,25]. (a) DFG node without internal storage,
called configuration 1; (b) DFG actor with internal storage t1 and constant i, called configuration 2;
(c) Programmable DFG actor with internal storage t1, t2 and t3 and constants i and j, called
configuration 3.

Figure 6. FPGA datapath models resulting from Figure 5. (a) Programmable ALU corresponding to
configuration 1; (b) Fine-grained processor corresponding to configuration 2; (c) Coarse-grained
processor corresponding to configuration 3.

4.2. Functionality vs. Performance Trade-Off Analysis


The presented models show that the processor datapath functionality significantly impacts the
dataflow decomposition, mapping and optimisation possibilities, but also increases the processor
critical path length and affects fmax by incorporating more memory elements and control logic.
Figure 6 shows the datapath models and their memory elements, where the memory resources
(IM, KM, RF) have been incrementally allocated to each model. Each presented model has been coded


in Verilog HDL, synthesised, and placed and routed using the Xilinx Vivado Design Suite v2015.2 on
Xilinx chips found on widely available development kits: Artix-7 (Zedboard), Kintex-7
(ZC706) and Virtex-7 (VC707). The obtained fmax results are reported in Figure 7.
In this analysis, fmax is considered as the performance metric for each processor datapath model;
it reduces by 8% and 23% for models 2 and 3 respectively, compared to model 1 using the same FPGA
technology. For model 2, the addition of memory elements, specifically the IM realised using dedicated
BRAM, affects fmax by ≈8% compared to model 1. Nevertheless, the instruction decoder (ID), which is a
combinational part of the datapath, significantly increases the critical path length of the design.
A further 15% fmax degradation from model 2 to model 3 results from adding the memory elements KM
and RF to support control and data dependent execution, which requires additional control logic and
data multiplexers. Comparing different FPGA fabrics, an fmax reduction of 14% and 23% is observed for
Kintex-7 and Artix-7. When model 3 is ported from Virtex-7 to Kintex-7 and Artix-7, a maximum fmax
reduction of 5% and 33% is observed.
This analysis has laid firm foundations by comparing different processor datapath and dataflow
models and how they impact the f max of the resultant soft-core processor. The trade-off analysis
shows that an area-efficient, high-performance soft core processor architecture can be realised that
supports requirements to accelerate image pre-processing applications. Among the presented models,
3 provides the best balance among functionality, flexibility, dataflow mapping and optimisation
possibilities, and performance. This model is used to develop a novel FPGA-based soft core IPPro
architecture in Section 4.3.

Figure 7. Impact of the various datapath models 1, 2 and 3 on fmax across Xilinx Artix-7, Kintex-7 and
Virtex-7 FPGA families.

4.3. Image Processing Processor (IPPro)


The IPPro is a 16-bit signed fixed-point, five-stage balanced pipelined RISC architecture that
exploits the DSP48E1 features and provides balance among performance, latency and efficient resource
utilization [10]. The architecture here is modified to support the mapping of dataflow graphs by
replacing the previously memory-mapped data memory with stream-driven blocking input/output FIFOs,
as shown in Figure 8. The IPPro is designed as an in-order pipeline because: (1) it consumes fewer area
resources and can achieve better timing closure, leading to a higher processor operating frequency
fmax; (2) in-order pipeline execution is predictable and simplifies scheduling and compiler
development. The datapath supports the identified execution and memory access patterns (Table 1),
and can be used as a coarse-grained processing core. The IPPro has an IM of size 512 × 32, an RF of size
32 × 16 to store pixels and intermediate results, a KM of size 32 × 16 to store kernel coefficients and
constant values, and blocking input/output FIFOs to buffer data tokens between a producer and a
consumer to realise pipelined processing stages.


Figure 8. Block diagram of the FPGA-based soft core Image Processing Processor (IPPro) datapath,
highlighting, where relevant, the fixed Xilinx FPGA resources utilised by the approach.

Table 3 outlines the relationship between data abstraction and the addressing modes, along with
some supported instructions for the IPPro architecture, facilitating programmable implementation of
point and area image processing algorithms. Stream access reads a stream of tokens/pixels from
the input FIFO using the GET instruction and allows processing either with constant values (Kernel
Memory–FIFO) or neighbouring pixel values (Register File–FIFO or Register File–Register File).
The processed stream is then written to the output FIFO using the PUSH instruction. The IPPro supports
arithmetic, logical, branch and data handling instructions. The presented instruction set is optimized
after profiling use cases presented in [10,26].

Table 3. IPPro supported addressing modes highlighting the relation to the data processing
requirements and the instruction set.

Addressing Mode               Data Abstraction                    Supported Instructions
FIFO handling                 Stream access                       get, push
Register File–FIFO            Stream and randomly accessed data   addrf, subrf, mulrf, orrf, minrf, maxrf etc.
Register File–Register File   Randomly accessed data              str, add, mul, mulacc, and, min, max etc.
Kernel Memory–FIFO            Stream and fixed values             addkm, mulkm, minkm, maxkm etc.
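To make the addressing modes concrete, the toy interpreter below models how GET/PUSH move tokens between FIFOs and the register file, and how a Kernel Memory–FIFO style instruction combines a token with a stored constant. The instruction names follow Table 3, but the encoding and semantics here are simplified assumptions, not the real IPPro ISA:

```python
from collections import deque

def run(program, in_fifo, km):
    rf = [0] * 32                # register file (32 x 16-bit in IPPro)
    out_fifo = deque()
    for op, *args in program:
        if op == "get":          # FIFO -> register
            rf[args[0]] = in_fifo.popleft()
        elif op == "push":       # register -> output FIFO
            out_fifo.append(rf[args[0]])
        elif op == "mul":        # Register File-Register File
            rf[args[0]] = rf[args[1]] * rf[args[2]]
        elif op == "addkm":      # Kernel Memory style: add a stored constant
            rf[args[0]] = rf[args[1]] + km[args[2]]
    return out_fifo

# Multiply two stream tokens, add kernel constant km[0], push the result:
prog = [("get", 0), ("get", 1), ("mul", 2, 0, 1), ("addkm", 2, 2, 0), ("push", 2)]
result = run(prog, deque([3, 4]), km=[10])
```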

The IPPro supports branch instructions to handle control flow graphs to implement commonly
known constructs such as if-else and case statements. The DSP48E1 block has a pattern detector that
compares the input operands or the generated output results depending on the configuration and
sets/resets the PATTERNDETECT (PD) bit. The IPPro datapath uses the PD bit along with some
additional control logic to generate four flags zero (ZF), equal (EQF), greater than (GTF) and sign (SF)
bits. When the IPPro encounters a branch instruction, the branch controller (BC) compares the flag
status and branch handler (BH) updates the PC as shown in Figure 8.
The IPPro architecture has been coded in Verilog HDL and synthesized using Xilinx Vivado
v2015.4 design suite on Kintex-7 FPGA fabric giving a f max of 337 MHz. Table 4 shows that the
IPPro architecture has achieved 1.6–3.3× times higher operating frequency ( f max ) than the relevant
processors highlighted in Section 2.2 by adopting the approach presented in Section 4. Comparing
the FPGA resource usage of Table 4, the flip-flop utilisation (FF) is relatively similar except for the
FlexGrip which uses 30× more flip-flops. Considering LUTs, the IPPro uses 50% less LUT resources
compared to MicroBlaze and GraphSoC. Analysing design area efficiency, a significant difference
(0.76–9.00) in BRAM/DSP ratio can be observed among the processors, which makes the IPPro an
area-efficient design based on the proposed metric.

15
J. Imaging 2019, 5, 16

Table 4. Comparison of IPPro against other FPGA-based processor architectures in terms of FPGA
resources used and timing results achieved.

Resource IPPro Graph-SoC [19] FlexGrip 8 SP * [8] MicroBlaze


FFs 422 551 (103,776/8 =) 12,972 518
LUTs 478 974 (71,323/8 =) 8916 897
BRAMs 1 9 (120/8 =) 15 4
DSP48E1 1 1 (156/8 =) 19.5 3
Stages 5 3 5 5
Freq. (MHz) 337 200 100 211
* Scaled to a single streaming processor.

4.4. Processor Micro-Benchmarks


A commonly used performance metric for a processor is the time required to accomplish a defined
task. Therefore, a set of commonly used micro-benchmarks [9,27] has been chosen and implemented
on the IPPro and compared against a well-established MicroBlaze soft core processor as shown in
Table 5a. Each of the chosen micro-benchmarks is a fundamental kernel of larger algorithms and
often the core computation of more extensive practical applications. The micro-benchmarks were
written in standard C and implemented using Xilinx Vivado SDK v2015.1 (Xilinx, San Jose, CA, USA).
MicroBlaze has been configured for performance with no debug module, instruction/data cache and
single AXI-Stream link enabled to stream data into the MicroBlaze using getfsl and putfsl instructions
in C, equivalent to (GET and PUT) in assembly.
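For illustration, a plain-C 5-tap FIR of the kind benchmarked here might look as follows; the coefficients are placeholders, and array accesses stand in for the getfsl/putfsl (GET/PUSH) streaming so the sketch runs standalone:

```c
#include <stddef.h>
#include <stdint.h>

#define TAPS 5

/* 5-tap FIR over a sample stream: out[i] = sum_k coeff[k] * in[i+k].
 * On the target, GET/PUSH (or getfsl/putfsl on MicroBlaze) would
 * stream the samples instead of indexing arrays; the inner loop is
 * the multiply-accumulate chain the IPPro executes in single cycles. */
void fir5(const int16_t *in, int32_t *out, size_t n,
          const int16_t coeff[TAPS]) {
    for (size_t i = 0; i + TAPS <= n; i++) {
        int32_t acc = 0;
        for (int k = 0; k < TAPS; k++)
            acc += (int32_t)coeff[k] * in[i + k];
        out[i] = acc;
    }
}
```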
Table 5a reports the performance results of the micro-benchmarks and Table 5b shows the area
utilisation comparison of the IPPro and the MicroBlaze both implemented on the same Xilinx Kintex-7
FPGA. It shows that the IPPro consumes 1.7 and 2.3 times fewer FFs and LUTs respectively than
the MicroBlaze. It can be observed that for streaming functions (3 × 3 filter convolution, Degree-2
Polynomial and 5-tap FIR), the IPPro achieved 4.41, 1.80 and 8.94 times better performance than
MicroBlaze, due to its support of single-cycle multiply-accumulate with data forwarding and its
get/push instructions. However, the IPPro datapath does not support branch prediction, which impacts
its performance on data-dependent or conditional functions (Fibonacci and sum of absolute differences);
thus, the SAD implementation on the IPPro resulted in a 5% performance degradation compared to
MicroBlaze. On the other hand, for memory-bounded functions such as matrix multiplication, the IPPro
performed 6.7 times better than MicroBlaze due to its higher operating frequency.
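The SAD kernel referred to above can be sketched in plain C; the per-pixel conditional is exactly the kind of data-dependent branch that penalises a pipeline without branch prediction (this is a sketch, not the benchmarked source):

```c
#include <stddef.h>
#include <stdint.h>

/* Sum of absolute differences between two pixel blocks. Each pixel
 * incurs a data-dependent branch (or a select), which is what makes
 * this kernel unfavourable for the prediction-free IPPro pipeline. */
uint32_t sad(const uint8_t *a, const uint8_t *b, size_t n) {
    uint32_t sum = 0;
    for (size_t i = 0; i < n; i++)
        sum += (a[i] > b[i]) ? (uint32_t)(a[i] - b[i])
                             : (uint32_t)(b[i] - a[i]);
    return sum;
}
```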

Table 5. Performance comparison of IPPro and MicroBlaze implementations (a) Comparison of


micro-benchmarks. (b) Area comparison.

a
Processor MicroBlaze IPPro
FPGA Fabric Kintex-7
Freq (MHz) 287 337
Micro-benchmarks Exec. Time (us) Speed-up
Convolution 0.60 0.14 4.41
Degree-2 Polynomial 5.92 3.29 1.80
5-tap FIR 47.73 5.34 8.94
Matrix Multiplication 0.67 0.10 6.7
Sum of Abs. Diff. 0.73 0.77 0.95
Fibonacci 4.70 3.56 1.32
b
Processor MicroBlaze IPPro Ratio
FFs 746 422 1.77
LUTs 1114 478 2.33
BRAMs 4 2 2.67
DSP48E1 0 1 0.00


5. System Architecture
The k-means clustering and Traffic Sign Recognition algorithms have been used to explore and
analyse the impact of both data and task parallelism using a multi-core IPPro implemented on a
ZedBoard. The platform has a Xilinx Zynq XC7Z020 SoC device interfaced to a 256 MB flash memory
and 512 MB DDR3 memory. The SoC is composed of a host processor known as programmable system
(PS) which configures and controls the system architecture, and the FPGA programmable logic (PL)
on which the IPPro hardware accelerator is implemented, as illustrated in Figure 9. The SoC data
communication bus (ARM AMBA-AXI) transfers the data between PS and PL using the AXI-DMA
protocol and the Xillybus IP core is deployed as a bridge between PS and PL to feed data into the
image processing pipeline. The IPPro hardware accelerator is interfaced with the Xillybus IP core
via FIFOs. The Linux application running on PS streams data between the FIFO and the file handler
opened by the host application. The Xillybus-Lite interface allows a user-space program running on
Linux to access the control registers that manage the underlying hardware architecture.
Figure 9 shows the implemented system architecture which consists of the necessary control
and data infrastructure. The data interfaces involve stream (Xillybus-Send and Xillybus-Read);
uni-directional memory mapped (Xillybus-Write) to program the IPPro cores; and Xillybus-Lite
to manage Line buffer, scatter, gather, IPPro cores and the FSM. Xillybus Linux device drivers are used
to access each of these data and control interfaces. An additional layer of C functions is developed
using Xillybus device drivers to configure and manage the system architecture, program IPPro cores
and exchange pixels between PS and PL.

Figure 9. System architecture of IPPro-based hardware acceleration highlighting data distribution and
control infrastructure, FIFO configuration and Finite-State-Machine control.

Control Infrastructure
To exploit parallelism, a configurable control infrastructure has been implemented using the PL
resources of the Zynq SoC. It statically decomposes the data into many equal-sized parts, where each
part can be processed by a separate processing core. A row-cyclic data distribution [28] has been used
because it allows buffering of data/pixels in a pattern suitable for point and area image processing


operations after storing them into the line buffers. The system-level architecture (Figure 9) is composed
of line buffers, a scatter module to distribute the buffered pixels, a gather module to collect the
processed pixels and a finite-state-machine (FSM) to manage and synchronise these modules.
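The row-cyclic mapping can be modelled minimally as follows (a sketch: rows are dealt to cores modulo the core count, which is what lets line-buffered point and window operations line up per core):

```c
#include <stddef.h>

/* Row-cyclic distribution: image row r is dispatched to core
 * (r mod n_cores), so each core sees a regular stride of rows. */
static size_t core_for_row(size_t row, size_t n_cores) {
    return row % n_cores;
}

/* Count how many rows of an h-row image land on each core; with a
 * cyclic deal the parts differ by at most one row, i.e. the static
 * decomposition is (near-)equal-sized. */
static void rows_per_core(size_t h, size_t n_cores, size_t *count) {
    for (size_t c = 0; c < n_cores; c++)
        count[c] = 0;
    for (size_t r = 0; r < h; r++)
        count[r % n_cores]++;
}
```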

6. Case Study 1: k-Means Clustering Algorithm


k-means clustering classifies a data set into k clusters based on a measure, e.g., the distance
between each data item and the k centroid values. It involves two stages: Distance Calculation,
where the distance from each data point to every centroid is computed and the minimum of the
k distances determines the point's cluster; and Averaging, where the pixels assigned to each cluster
are summed and divided by their count, giving an updated centroid value for
the following frame. Here we accelerate a functional core of the k-means clustering algorithm with
4 centroids to be applied to a 512 × 512 image.
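The two stages can be sketched in plain C for the 4-centroid, greyscale-pixel case (a behavioural sketch using a squared-difference measure, as in the CAL actor's squaring; not the generated IPPro code):

```c
#include <stddef.h>
#include <stdint.h>

#define K 4  /* number of centroids, as in the case study */

/* Distance Calculation: classify one pixel to its nearest centroid. */
static int nearest_centroid(uint8_t pix, const uint8_t cent[K]) {
    int best = 0;
    int32_t best_d = (pix - cent[0]) * (pix - cent[0]);
    for (int k = 1; k < K; k++) {
        int32_t d = (pix - cent[k]) * (pix - cent[k]);
        if (d < best_d) { best_d = d; best = k; }
    }
    return best;
}

/* Averaging: accumulate per-cluster sums and counts over a frame,
 * then emit updated centroids for the following frame. */
static void update_centroids(const uint8_t *pix, size_t n,
                             uint8_t cent[K]) {
    uint32_t sum[K] = {0}, cnt[K] = {0};
    for (size_t i = 0; i < n; i++) {
        int k = nearest_centroid(pix[i], cent);
        sum[k] += pix[i];
        cnt[k]++;
    }
    for (int k = 0; k < K; k++)
        if (cnt[k])
            cent[k] = (uint8_t)(sum[k] / cnt[k]);
}
```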

6.1. High-Level System Description


The behavioural description is captured in RVC-CAL using Orcc and includes mainly the actor
CAL files and the xdf network, derived from .xml format. A dataflow network is constructed with
FIFO channels between actors to allow high-throughput passage of tokens from one actor’s output
port to another’s input port. The size of the FIFO channels can be set. Whilst the execution time
is the key factor for FPGA acceleration, overheads incurred in transferring the data to/from the PL
and accelerators are also important. The SIMD degree was explored by redesigning the FPGA-targeted
actors in RVC-CAL and using the compiler to generate the IPPro assembly code. This is done by
analysing the xdf file to decide the allocation of actors to the processors and then compiling the function
and interconnections.
Every IPPro core sets the hardware units around input/output port connections for the proper
flow of tokens, and the compiler is designed to provide the proper signals required by each core.
The compiler also generates the setup registers settings and C-APIs parameters, in order to help the
controllers distribute the tokens among the cores and gather the produced results. Figure 10 shows the
two stages of the k-means clustering algorithm to be accelerated, together with the cores' port
connections, sample distance calculation code in RVC-CAL and its compiled IPPro assembly code.
As the Xillybus IP has been used in the system architecture (Section 5), it restricts the clock rate to
100 MHz on the Zedboard.
To evaluate the IPPro architecture and different dataflow mapping possibilities by exploiting data and
task-level parallelism, the k-means clustering is accelerated using four acceleration designs listed in
Table 6 and illustrated in Figure 11.

Table 6. Dataflow actor mapping and supported parallelism of IPPro hardware accelerator design
presented in Figure 11.

Design   Acceleration Paradigm         Mapping        Parallelism
                                                      Data    Task
(1)      Single core IPPro             Single actor   No      No
(2)      8-way SIMD IPPro              Single actor   Yes     No
(3)      Dual core IPPro               Dual actor     No      Yes
(4)      Dual core 8-way SIMD IPPro    Dual actor     Yes     Yes


[Figure 10: (a) graphical view of the Orcc dataflow network; (b) excerpt of TopKMeansOrcc.xdf with the Distance and Averaging actor instances and their FIFO connections; (c) excerpt of Distance.cal computing Manhattan-style distances of two pixels from the four initial centroids (31, 40, 50, 76) with absolute values estimated by squaring; (d) the compiled IPPro assembly: a DISTCAL loop of GET/STR instructions followed by SUB and MUL sequences over the register file.]

Figure 10. High-level implementation of k-means clustering algorithm: (a) Graphical view of Orcc
dataflow network; (b) Part of dataflow network including the connections; (c) Part of Distance.cal file
showing distance calculation in RVC-CAL where two pixels are received through an input FIFO channel,
processed and sent to an output FIFO channel; (d) Compiled IPPro assembly code of Distance.cal.

Figure 11. IPPro-based hardware accelerator designs to explore and analyse the impact of parallelism
on area and performance: (1) single-core IPPro, (2) eight-way parallel SIMD IPPro, (3) parallel
dual-core IPPro and (4) combined dual-core 8-way SIMD IPPro.

6.2. IPPro-Based Hardware Acceleration Designs


Table 6 shows the dataflow actor mapping and the exploited parallelism for each design. The block
diagram of each IPPro hardware acceleration design is illustrated in Figure 11. Designs (1) and (2)
are used to accelerate the Distance Calculation and Averaging stages, where each stage is mapped
separately onto individual IPPro cores. To investigate the impact of data and task parallelism,
designs (3) and (4) are used to accelerate both the Distance Calculation and Averaging stages together,
as shown in Figure 11. The detailed area and performance results are reported in Tables 7 and 8.
The execution time depends on the number of IPPro instructions required to compute the operation
and the time required to execute an instruction, which corresponds to the operating frequency
( f max ) of the IPPro.
Table 7 reports the results obtained by individually accelerating the stages of k-means clustering
using (1) and (2). In each iteration, the distance calculation takes two pixels and classifies them into


one of the four clusters, which takes an average of 45 cycles/pixel. To classify the whole image takes
118.2 ms, which corresponds to 8.45 fps. On the other hand, the averaging takes four tokens and
produces four new cluster values, taking an average of 55 clock cycles/pixel, resulting in 145 ms or
6.88 fps. Both stages involve point-based pixel processing; therefore, design (2) was developed
and used to exploit data-level parallelism. As a result, the execution time is reduced to 23.37 ms
and 27.02 ms for distance calculation and averaging respectively. This is an improvement of 5.1 and
5.4 times over (1), rather than the expected 8 times of an 8-way SIMD implementation, as the
overhead of data transfer time to/from the accelerator restricts the performance improvement.
This came at the cost of 4.1, 2.3 and 8.0 times more BRAMs, LUTs and DSP blocks respectively,
as reported in Table 8. The major contributor to the increased area utilisation is the data distribution
and control infrastructure.
Table 8 reports the execution time and performance (fps) of both stages together to
exploit task and data parallelism using designs (3) and (4). The reported results of (1) and (2) were
obtained by combining the execution times of both stages previously reported in Table 7. Using design
(3), the effect of task parallelism implemented via an intermediate FIFO results in an average of 63 clock
cycles/pixel, which is 163 ms (6 fps). By pipelining both actors, (3) achieved 1.6 times better
performance compared to (1) at the cost of 1.6 and 2.0 times more BRAM and DSP blocks, using the
same Xillybus IP infrastructure as (1). The reason for the improvement is the localisation of intermediate
data within the FPGA fabric using an intermediate FIFO, which hides the data transfer overhead to and
from the host processor, as shown in Figure 11. Investigating the reported area utilisation numbers in
Table 8 shows that the area utilisation of designs (3) and (4) is not twice as big as that of (1) and (2)
respectively, due to the FPGA resources utilised by the input and output data ports of the Xillybus IP.
Designs (1) and (3) require a single input and output data port, while (2) and (4) require eight input
and output data ports. Therefore, a part of the FPGA logic used by the Xillybus IP is constant/fixed
across (1), (3) and (2), (4).
Analysing the impact of exploiting both task and data-level parallelism, design (4) results on average
in 14 clock cycles/pixel and an execution time of 35.9 ms (28 fps). It is 1.4, 4.5 and 7.3 times better than
(2), (3) and (1) respectively. For comparison, both stages were coded in C and executed on
an embedded ARM Cortex-A7 processor, which achieved an execution time of 286 ms (3.5 fps),
8 times slower than the performance achieved by (4).

Table 7. Performance measurements for designs (1) and (2) highlighted in Figure 11.

Single Actor             (1) Single Core IPPro      (2) 8-Way SIMD IPPro
                         Exec. (ms)     fps         Exec. (ms)     fps
Distance Calculation     118.21         8.45        23.37          42.78
Averaging                145.17         6.88        27.02          37.00

Table 8. Area utilisation and performance results of IPPro-based hardware accelerator designs in
Figure 11 exploiting data and task parallelism, namely (1), (2), (3) and (4).

k-Means Acceleration                            Area                             Performance
                                                LUT      FF      BRAM   DSP     Exec. (ms)   fps
(1) Combined stages using Single-core IPPro     4736     5197    4.5    1       263.38       3.8
(2) Combined stages using 8-way SIMD IPPro      10,941   12,279  18.5   8       50.39        19.8
(3) Dual-core IPPro                             4987     5519    4.5    2       163.2        6
(4) Dual 8-way SIMD IPPro                       13,864   16,106  18.5   16      35.9         28
Software implementation on ARM Cortex-A7        -        -       -      -       286          3.5

6.3. Power Measurement


This section presents the details of adopted power measurement methods and compares the
IPPro-based implementation to the equivalent k-means implementation on GPU and CPU. The IPPro


power measurements were obtained by running post-implementation timing simulations. A switching
activity interchange format (SAIF) file is used to record the switching activity of the data and control
signals of each presented IPPro design. The Xilinx Power Estimator (XPE) takes the SAIF file and reports
the power consumption. An equivalent version of k-means in CUDA and OpenCL was implemented
and profiled on an nVIDIA GeForce GTX980 (desktop GPU), ODROID-XU3 (embedded GPU) and ARM
Cortex-A7 (CPU) due to in-house availability of both GPU platforms. The nVIDIA desktop GPU card
supports 2048 CUDA cores running at a base frequency of 1126 MHz. OpenCL and CUDA were
used for programming the GPU, and both stages merged into the single kernel. For performance
measurement, OpenCL’s profiling function clGetEventProfilingInfo is used which returns the execution
time of kernel in nanoseconds. The power consumption during kernel execution was logged using
nVIDIA System Management Interface (nvidia-smi), which allows the power consumed by the GPU
and the host processor to be measured separately. It is a command-line utility, built on top of the nVIDIA
Management Library (NVML), intended to aid the management and monitoring of nVIDIA GPUs.
To set the base line figures and for fair comparison of the FPGA against the GPU technology,
an embedded CPU (ARM Cortex-A7) and an embedded GPU (ARM Mali-T628) implementation were
carried out on a ODROID-XU3 platform. This is a heterogeneous multi-processing platform that hosts
28 nm Samsung Exynos 5422 application processor which has on-chip ARM Cortex-A7 CPUs and an
ARM Mali-T628 embedded GPU. The platform is suitable for power constraint application use cases
where the ARM Cortex-A7 CPU and the mid-range ARM Mali-T628 GPU run at 1.2 GHz and 600 MHz
respectively. The platform has separate current sensors to measure the power consumption of the ARM
Cortex-A7 and ARM Mali-T628, thus allowing component-level power measurement.
Table 9 shows the results of IPPro-based accelerator designs running on Zedboard where both data
and task parallel implementation achieved 4.6 times better performance over task only implementation
at the cost of 1.57 times higher power consumption. Table 10 shows the performance results of the
k-means implementation on Kintex-7 FPGA and compares them against equivalent embedded CPU
(ARM Cortex-A7), embedded GPU (ARM Mali-T628) and desktop GPU (nVIDIA GeForce GTX980)
implementations in terms of speed (MHz), power (W) and transistors utilised (TU). The presented
embedded CPU results have been considered as the baseline for the comparison.

Table 9. Power, resource and combined efficiency comparisons of IPPro-based k-means clustering
implementations on Zedboard (Xilinx Zynq XC7Z020 Artix-7).

Impl.  Power (mW)             Freq.   Exec.          Power Effic.  TU           fps/TU     fps/W/TU
       Static  Dyn.   Tot.    (MHz)   (ms)    fps    (fps/W)       (×10^6)      (×10^-8)   (×10^-9)
(3)    118     18     136     100     163.2   6      44.1          591 (9%)     1.0        74.6
(4)    122     92     214     100     35.9    28     130.8         1564 (23%)   1.8        83.6

Table 10. Power, resource and combined efficiency comparisons for k-means clustering for Xilinx Zynq
XC7Z045 Kintex-7 FPGA, nVIDIA GPU GTX980, embedded ARM Mali-T628 GPU and embedded ARM
Cortex-A7 CPU.

Plat.  Impl.    Power (W)              Freq.   Exec.           Power Effic.  TU           fps/TU     fps/W/TU
                Static  Dyn.   Tot.    (MHz)   (ms)     fps    (fps/W)       (×10^9)      (×10^-8)   (×10^-9)
FPGA   (3)      0.15    0.03   0.19    337     48.43    21     114.1         0.6 (9%)     3.6        193.1
FPGA   (4)      0.16    0.15   0.31    337     10.65    94     300.3         1.0 (6%)     6.0        192.0
GPU    OpenCL   37      27     64      1127    1.19     840    13.1          1.3 (26%)    63.1       9.8
GPU    CUDA     37      22     59      1127    1.58     632    10.7          1.2 (24%)    51.5       8.7
eGPU   Mali     0.12    -      1.56    600     3.69     271    173           -            -          -
eCPU   Cortex   0.25    -      0.67    1200    286      3.49   5.2           -            -          -


Both FPGA implementations achieved 6 and 27 times better fps performance than the embedded
CPU, whilst the embedded GPU delivered 6.7 times better performance over the FPGA by exploiting
parallelism and higher operating frequency. Focusing on the power consumption results, the FPGA
consumed 2.1 and 4.9 times less power than both the embedded CPU and embedded GPU respectively.
It shows that the FPGA technology delivers a power-optimised solution while the GPU approach
provides a performance-optimised solution. Considering the performance and power together,
the power efficiency (fps/W) numbers show that the FPGA and embedded GPU implementations
are 57 and 33 times more power efficient than the embedded CPU, and that the FPGA implementation
is 1.7 times more power efficient than the embedded GPU. Nevertheless, this power efficiency edge can
be further improved by applying dataflow transformations and increasing the number of IPPro cores.
Table 10 compares the FPGA results against desktop GPU and reports resource efficiency as a
metric due to significant difference in the power consumption numbers. The resource efficiency has
been presented in terms of frames-per-second-per-Transistor-Utilisation (fps/TU) which is 6 and 63 for
the 28 nm FPGA and GPU technologies. For the embedded CPU and GPU, these results are not
reported due to the unavailability of transistor count numbers for the ARM devices. The reported
resource efficiency results show that the GPU utilises area resources more efficiently than the FPGA
when power is kept out of the equation. Combining all three metrics (fps/W/TU) shows that the
advantage gained from FPGA designs is significant, i.e., 22 times more efficient than the GPU.
This advantage becomes more
valuable when it is acknowledged that the FPGA-based SoC design is adaptable and allows exploration,
profiling and implementation of different dataflow transformation possibilities over dedicated FPGA
approaches to accelerate image processing applications for low energy applications.

7. Case Study 2: Traffic Sign Recognition


Traffic sign recognition is applied in driver assistance systems [29]. In the detection stage, sign
candidate areas are extracted from the original image and matched against a list of known templates
in the recognition stage. The processing stages along with their execution time and percentage
contribution to the overall execution time for 600 × 400 image sign recognition implemented on
ARM Cortex-A9 are shown in Figure 12. It involves a colour filter to convert RGB to HSV, morphology
filters (erosion and dilation) using 3 × 3 and 5 × 5 circular kernels, edge detection, circles detection to
guide the matching process and reduce the number of shapes, bounding box detection to transform
the remaining objects into their convex hulls, classification by shape and then template matching.
The colour and morphology filters have been chosen for hardware acceleration as they are dominant
processing components as shown in Figure 12.

[Figure 12: processing pipeline and per-stage execution times — Pixel Stream → Colour Filter (88.865 ms, 9.28%, PL) → Morphology (399.793 ms, 41.75%, PL) → Edge Detection/Contour Filling (151.768 ms, 15.85%, PS) → Circles Detection (126.823 ms, 13.24%, PS) → Bounding Boxes Detection (90.421 ms, 9.44%, PS) in the detection stage, then Classification by Shape (5.620 ms, 0.59%, PS) → Template Matching (94.270 ms, 9.85%, PS) in the recognition stage; the pixel stream enters via PS/PL.]

Figure 12. Section execution times and ratios for each stage of the traffic sign recognition algorithm.

7.1. Acceleration of Colour and Morphology Filter


The IPPro-based hardware accelerators for colour and morphology filter were implemented on
Zedboard using the system architecture presented in Section 5, which allows pixels to be distributed
for point and window image processing operations. The program code produced from the high-level
RVC-CAL description of the colour filter actor consists of 160 IPPro assembly instructions. A 3 × 3 circular
mask has been used for morphology filter as shown in Figure 13a, to find the maximum (dilation) or
minimum (erosion) value in a set of pixels contained within a masked region around the input pixel.


The simplified code generated by the RVC-CAL-to-IPPro compilation is shown in Figure 13a. GET and
PUSH instructions set the input or output port numbers through which the tokens are received or sent.
GET instructions read nine pixel values and store them into the register file, from R1 to R9. The corner
pixels are then ignored to impose the 3 × 3 circular mask, and the maximum value among the remaining
pixels, max(R2, R4, R5, R6, R8), is computed and stored in R27 to apply the dilation operation. This value
is then pushed to the output using the PUSH instruction. The output result of the implemented design is
shown in Figure 13b.

DILATION3X3:
GET R1, 1
GET R2, 1
GET R3, 1
GET R4, 1
GET R5, 1
GET R6, 1
GET R7, 1
GET R8, 1
GET R9, 1
MAX R30, R2, R5
MAX R29, R4, R6
MAX R28, R30, R8
MAX R27, R28, R29
PUSH R27, 1
JMP DILATION3X3

Figure 13. (a) The simplified IPPro assembly code of 3 × 3 dilation operation. (b) The output result of
implemented design.
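For reference, the same cross-shaped dilation can be expressed in plain C; here w holds the 3 × 3 window in row-major order, and the corners are skipped exactly as in the MAX tree of the assembly above (a sketch, not generated code):

```c
#include <stdint.h>

static uint8_t max2(uint8_t a, uint8_t b) { return a > b ? a : b; }

/* 3x3 dilation with a circular (cross-shaped) mask: the four corner
 * pixels of the window are ignored, mirroring the MAX tree over
 * R2, R4, R5, R6 and R8 in the assembly. w[4] is the centre pixel. */
static uint8_t dilate3x3_cross(const uint8_t w[9]) {
    uint8_t m = max2(w[1], w[4]);    /* top, centre   */
    m = max2(m, max2(w[3], w[5]));   /* left, right   */
    return max2(m, w[7]);            /* bottom        */
}
```

On the IPPro the window itself arrives as a token stream via GET rather than as an array.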

Table 11 presents the results from the Zedboard implementation that has been tested with a set
of real images. The hardware-accelerated implementation of the colour filter stage using 32 IPPro cores
reduces the execution time from 88.87 ms down to 19.71 ms compared to the software implementation
on the on-chip ARM Cortex-A9. Similarly, the morphology filter stage using 16 IPPro cores reduced the
execution time from 399 ms down to 41.3 ms. The presented IPPro-based hardware acceleration design
has achieved a speed-up of 4.5 and 9.6 times over ARM for colour and morphology filters respectively.
The achieved speed-up for the colour filter stage using 32 cores is lower than that of the morphology
stage using 16 cores because of the higher number of clock cycles spent on every pixel in the colour
filter stage; this is due to the larger execution time of the division coprocessor used for colour filtering.
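The divisions that make the colour filter stage rely on the division coprocessor arise in its RGB-to-HSV conversion, which can be sketched in integer C as follows (OpenCV-style hue scaling to [0, 180); the exact conversion and thresholds used in the paper are not reproduced here):

```c
#include <stdint.h>

static uint8_t max3(uint8_t a, uint8_t b, uint8_t c) {
    uint8_t m = a > b ? a : b;
    return m > c ? m : c;
}
static uint8_t min3(uint8_t a, uint8_t b, uint8_t c) {
    uint8_t m = a < b ? a : b;
    return m < c ? m : c;
}

/* Integer RGB -> HSV of the kind used by a colour filter stage; the
 * divisions by mx and d are what a per-core division coprocessor
 * would service. Hue is scaled to [0, 180), saturation/value to 255. */
static void rgb_to_hsv(uint8_t r, uint8_t g, uint8_t b,
                       uint8_t *h, uint8_t *s, uint8_t *v) {
    uint8_t mx = max3(r, g, b), mn = min3(r, g, b);
    int d = mx - mn;
    *v = mx;
    *s = mx ? (uint8_t)(255 * d / mx) : 0;
    if (d == 0)        *h = 0;
    else if (mx == r)  *h = (uint8_t)((30 * (g - b) / d + 180) % 180);
    else if (mx == g)  *h = (uint8_t)(60 + 30 * (b - r) / d);
    else               *h = (uint8_t)(120 + 30 * (r - g) / d);
}
```

A thresholding pass over (h, s, v) would then select the sign colours of interest.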

Table 11. IPPro-based acceleration of colour and morphology operations implemented on Zedboard.

Description Colour Morphology


No. of cores 32 16
FF 41,624 (39%) 43,588 (41%)
LUT 29,945 (56%) 33,545 (63%)
DSP48E1 32 (15%) 48 (22%)
BRAM 60 (42%) 112 (80%)
Cycles/Pixel 160 26
Exec. (ms) 19.7 (8.7 *) 41.3 (18.3 *)
Speed-up 4.5× (10.3× *) 9.6× (21.75× *)
* The achievable performance using Zynq XC7Z045 Kintex-7.

Figure 14 shows the stage-wise acceleration of traffic sign recognition by accelerating colour and
morphology filters. Edge/contours detection and bounding boxes stages were improved partially by
accelerating the morphology operations. The edge detection is based on the morphology operations
by taking the difference between erosion and dilation. Therefore the morphology results obtained by
acceleration are further exploited in the host to factor out some operations when doing edge detection.



[Figure 14: bar chart of per-stage execution time (ms) for the traffic sign recognition stages — Colour, Morphology, Edge Detection, Circle Detection, Bounding Boxes, Shape Classification and Template Matching — comparing the ARM implementation against the IPPro-accelerated one.]

Figure 14. Stage-wise comparison of traffic sign recognition acceleration using ARM and IPPro
based approach.

8. Conclusions and Future Work


The paper has presented an FPGA-based hardware acceleration approach for image processing
applications using soft core processors which maps efficiently to FPGA resources thereby maintaining
performance. By using the DFG model of computations, a design flow has been created which allows
the user to partition the design based on processing needs and allows programming of each function.
The work has been demonstrated for a k-means clustering function and a traffic sign recognition
example, where maximum speed-ups of 8 and 9.6 times, respectively, were achieved when compared
to software implementations on an ARM CPU. For k-means clustering, the 16-core IPPro
implementation is 57, 28 and 1.7 times more power efficient (fps/W) than the ARM Cortex-A7 CPU,
nVIDIA GeForce GTX980 GPU and ARM Mali-T628 embedded GPU respectively. Future work will
investigate further dataflow decomposition/mapping optimisations and software-controlled power
optimisation techniques, such as on-demand enabling/disabling of the IPPro cores.

Author Contributions: The project was conceptualized by R.W., F.S. and S.A. and R.W. and K.R. provided both
project supervision and project administration. The implementation was carried out primarily by F.S. and S.A.
and validation was undertaken by F.S., U.I.M. and T.D. The writing of the original draft was carried out by F.S.,
R.W., S.A. and D.C. and edits done by R.W. and F.S.
Funding: This work has been undertaken in collaboration with Heriot-Watt University in a project funded by the
Engineering and Physical Science Research Council (EPSRC) through the EP/K009583/1 grant.
Conflicts of Interest: The authors declare no conflict of interest.

References
1. Conti, F.; Rossi, D.; Pullini, A.; Loi, I.; Benini, L. PULP: A Ultra-Low Power Parallel Accelerator for
Energy-Efficient and Flexible Embedded Vision. J. Signal Process. Syst. 2016, 84, 339–354. [CrossRef]
2. Lamport, L. The Parallel Execution of DO Loops. Commun. ACM 1974, 17, 83–93. [CrossRef]
3. Markov, I.L. Limits on Fundamental Limits to Computation. Nature 2014, 512, 147–154. [CrossRef] [PubMed]
4. Bacon, D.F.; Rabbah, R.; Shukla, S. FPGA Programming for the Masses. ACM Queue Mag. 2013, 11, 40–52.
[CrossRef]
5. Gort, M.; Anderson, J. Design re-use for compile time reduction in FPGA high-level synthesis flows.
In Proceedings of the IEEE International Conference on Field-Programmable Technology (FPT), Shanghai,
China, 10–12 December 2014; pp. 4–11.


6. Yiannacouras, P.; Steffan, J.G.; Rose, J. VESPA: Portable, scalable, and flexible FPGA-based vector processors.
In Proceedings of the 2008 International Conference on Compilers, Architectures and Synthesis for Embedded
Systems, Atlanta, GA, USA, 19–24 October 2008; pp. 61–70.
7. Severance, A.; Lemieux, G.G. Embedded supercomputing in FPGAs with the VectorBlox MXP matrix
processor. In Proceedings of the Ninth IEEE/ACM/IFIP International Conference on Hardware/Software
Codesign and System Synthesis, Montreal, QC, Canada, 29 September–4 October 2013; pp. 1–10.
8. Andryc, K.; Merchant, M.; Tessier, R. FlexGrip: A soft GPGPU for FPGAs. In Proceedings of the 23rd
Journal of Imaging
Article

Optimized Memory Allocation and Power Minimization for FPGA-Based Image Processing

Paulo Garcia 1,*, Deepayan Bhowmik 2, Robert Stewart 3, Greg Michaelson 3 and Andrew Wallace 4
1 Department of Systems and Computer Engineering, Carleton University, Ottawa, ON K1S 5B6, Canada
2 Div. of Computing Science and Mathematics, University of Stirling, Stirling FK9 4LA, UK; [email protected]
3 School of Mathematical and Computer Sciences, Heriot Watt University, Edinburgh EH14 4AS, UK; [email protected] (R.S.); [email protected] (G.M.)
4 School of Engineering and Physical Sciences, Heriot Watt University, Edinburgh EH14 4AS, UK; [email protected]
* Correspondence: [email protected]

Received: 19 November 2018; Accepted: 27 December 2018; Published: 1 January 2019

Abstract: Memory is the biggest limiting factor to the widespread use of FPGAs for high-level
image processing, which requires complete frame(s) to be stored in situ. Since FPGAs have limited
on-chip memory capabilities, efficient use of such resources is essential to meet performance, size and
power constraints. In this paper, we investigate allocation of on-chip memory resources in order to
minimize resource usage and power consumption, contributing to the realization of power-efficient
high-level image processing fully contained on FPGAs. We propose methods for generating memory
architectures, from both Hardware Description Languages and High Level Synthesis designs, which
minimize memory usage and power consumption. Based on a formalization of on-chip memory
configuration options and a power model, we demonstrate how our partitioning algorithms can
outperform traditional strategies. Compared to commercial FPGA synthesis and High Level Synthesis
tools, our results show that the proposed algorithms can result in up to 60% higher utilization
efficiency, increasing the sizes and/or number of frames that can be accommodated, and reduce
frame buffers’ dynamic power consumption by up to approximately 70%. In our experiments
using Optical Flow and MeanShift Tracking, representative high-level algorithms, data show that
partitioning algorithms can reduce total power by up to 25% and 30%, respectively, without impacting
performance.

Keywords: field programmable gate array (FPGA); memory; power; image processing; design

1. Introduction
Advances in Field Programmable Gate Array (FPGA) technology [1] have made them the de facto
implementation platform for a variety of computer vision applications [2]. Several algorithms, e.g.,
stereo-matching [3], are not feasibly processed in real-time on conventional general purpose processors
and are best suited to hardware implementation [4,5]. The absence of a sufficiently comprehensive,
one size fits all hardware pipeline for the computer vision domain [6] motivates the use of FPGAs
in a myriad of computer vision scenarios, especially in applications where processing should be
performed in situ, such as in smart cameras [7], where FPGAs embed data acquisition, processing and
communication subsystems. Adoption of FPGA technology by the computer vision community has
accelerated during recent years thanks to the availability of High Level Synthesis (HLS) tools which
enable FPGA design within established software design contexts.

J. Imaging 2019, 5, 7; doi:10.3390/jimaging5010007; www.mdpi.com/journal/jimaging



However, since FPGAs have limited on-chip memory capabilities (e.g., approx. 6MB of on-chip
memory on high end Virtex 7 FPGAs), external memory (i.e., DDR-RAM chips connected to the FPGA)
is often used to accommodate frames [8,9]. This incurs penalties in performance (latency is much higher
for off-chip memory access) and, perhaps more importantly, in size (two chips, FPGA and DDR, rather
than just the FPGA), power (DDR memories are power hungry [10]) and monetary cost,
hindering the adoption of FPGAs.
In this paper, we research allocation of on-chip memory resources in order to minimize resource
usage and power consumption, contributing to the realization of power-efficient high-level image
processing systems fully contained on FPGAs. We propose methods for generating on-chip memory
architectures, applicable from both HLS and Hardware Description Languages (HDL) designs, which
minimize FPGA memory resource usage and power consumption for image processing applications.
Our approach does not exclude external memory access: rather, it is orthogonal to any memory
hierarchy, and applicable to any instances of on-chip memory. Specifically, this paper offers the
following contributions:
• A formal analysis of on-chip memory allocation schemes and associated memory usage for given
frame sizes and possible on-chip memory configurations.
• Methods for selecting a memory configuration for optimized on-chip memory resource usage and
balanced usage/power for a given frame size.
• A theoretical analysis of the effects on resource usage and power consumption of our partitioning
methods.
• Empirical validation of resource usage, power and performance of the proposed methods,
compared to a commercial HLS tool.
Our experiments show that on-chip memory dynamic power consumption can be reduced by up
to approximately 70%; using representative high-level algorithms, this corresponds to a reduction of
total power by up to 25% and 30%, respectively, without impacting performance. The remainder of
this paper is organized as follows: Section 2 describes related work within FPGA memory systems
architecture and design for image processing. In Section 3, we formally describe the research problem
of power-size optimization, present a motivational example that highlights the limitations of standard
HLS approaches, and present alternative partitioning methods. Section 4 describes our experimental
methodology and experimental results, and Section 5 presents a thorough discussion of said results.
Finally, Section 6 presents our concluding remarks.
Throughout this paper, we use the term BRAM (Block Random Access Memory), a Xilinx
nomenclature for on-chip memories, to refer to on-chip FPGA memories in general.

2. Background and Related Work


Within FPGA processing sub-systems, algorithms evolve from typical software-suitable
representations into more hardware-friendly ones [6,11] which can fully exploit data parallelism [11]
through application-specific hardware architectures [3], often substantially different from the
traditional Von Neumann model, such as dataflow [12,13] or biologically inspired processing [14].
These heterogeneous architectures are customized for FPGA implementation not just for performance
(e.g., by exploiting binary logarithmic arithmetic for efficient multiplication/division [15]), but also
for power efficiency (e.g., by static/dynamic frequency scaling across parallel datapaths for reduced
power consumption [16]).
More often than not, computer vision applications deployed on FPGAs are constrained
by performance, power and real-time requirements [3]. Real time streaming applications (i.e.,
performing image processing on real-time video feeds [6]) require bounded acquisition, processing and
communication times [16] which can only be achieved, while maintaining the required computational
power, through exploitation of data parallelism [11] by dedicated functional blocks [7].
However, the greatest limiting factor to the widespread use of FPGAs for complex image
processing applications is memory [9]. Algorithms that perform only point or local region operators


(e.g., sliding window filters) [15] are relatively simple to implement using hardware structures such as
line buffers [3]. However, complex algorithms based on global operations require complete frame(s) to
be stored in situ [11]; examples of contemporary applications that require global operations are object
detection, identification and tracking, critical to security. Note that we use the term “global operations” to
simultaneously refer to two characteristics: the use of global operators (atomic operations which require
the whole image, such as transposition or rotation) and undetermined (unpredictable) access patterns
(e.g., a person identification system might only need a subset of a frame, but which subset cannot be
decided at design time, as it depends on person location at runtime).
A possible approach is to refine image processing algorithms so they can perform on smaller frame
sizes that can be contained on an FPGA [2]. Several algorithms maintain robustness for downscaled
images [17] (e.g., the Face Certainty Map [18]), or employ intelligent on-chip memory allocation
schemes [8] to accommodate complete frames that take into account power profiles. The latter requires
methods to optimize on-chip memory configurations in order to maximize valuable usage; often at
odds with performance-oriented allocation schemes standard in HLS code generators. Other possible
approaches include stream-processing algorithm refactoring to minimize memory requirements [19]
or programming-language abstractions for efficient hardware pipeline generation [20]; these are
orthogonal to our approach, and outside the scope of this work.
In our context, the most significant related work on the use of FPGA on-chip memory for image
processing applications has focused on four aspects: processing-specific memory architectures, caching
systems for off-chip memory access, partitioning algorithms for performance and on chip memory
power reduction.

2.1. Processing-Specific Memory Architectures


Memory architectures specialized for specific processing pipelines typically exhibit poor BRAM
utilization. Torres-Huitzil and Nuno-Maganda [9] presented a mirrored memory system: in order to
cope with dual access required by computational datapaths; data is replicated in two parallel memories
and a third one is used for intermediate computations. The need for data replication to support
parallelism inhibits scaling to higher frame sizes. Mori et al. [21] described the use of a neighbourhood
loader: input pixels are fed to shift registers which de-serialize the input stream into a neighbourhood
region. Their approach supports only one output port, and sequential region read (no random access).
This approach does not exploit datapath parallelism, nor does it support classes of algorithms which
require disparate region access. Chen et al. [22] use distributed data buffers for expediting Fast Fourier
computations; they partially exploit spatial parallelism, focusing on time-multiplexing as a means
for reducing resource-usage and power consumption. Although time-multiplexing is a convenient
technique for certain classes of applications, it cannot be used in real-time streaming where input pixels
arrive at steady rates (without discarding frames). Klaiber et al. [23] have developed a distributed
memory that divides input frames into vertical regions stored in separate memories. Their approach
allows fine grained parallelism, but is only capable of handling single-pass algorithms, i.e., which do
not require storage of intermediate values. While this suffices for simple computations, it does not
satisfy the requirements of sophisticated computer vision algorithms which process data iteratively
(e.g., MeanShift Tracking [24]).

2.2. Caching Systems


Delegating frame storage to off-chip memory solves the capacity problem, at the cost of
performance and monetary expense. Caching techniques are used to minimize the performance
implications: e.g., Sahlbach et al. [25] use parallel matching arrays for accelerating computation;
however, each array is only capable of holding one row of interest (the complete frame is stored in
off-chip memory) and their results do not discriminate resource usage across modules, making it
hard to estimate the precise array costs. This approach can only support a limited class of algorithms:
column-wise operations, for instance, require off-chip memory re-ordering for data to be loaded


on-chip as rows, consuming precious processing time. Similarly, Chou et al. [26] have shown the
use of vector scratchpad memories for accelerating vector processing on FPGAs, but still rely on
random-access external memories; a similar approach is followed by Naylor et al. [27] in the context of
FPGAs as accelerators. The use of external memories solves the storage limitation: however, it greatly
limits parallelism (only one access per external memory chip can be performed at once) or greatly
exacerbates financial and power costs, if several external memories are used.

2.3. Partitioning Algorithms


For HLS-based designs, computer vision algorithms are naturally expressed by assuming frames
are stored in unbounded address spaces [28]. This software approach to FPGA design not only easily
exceeds FPGA memory capabilities but is also not easily integrated in streaming designs without
significant refactoring. This has led to the development of custom hardware blocks and APIs for
software integration [29]: “naive” C-based HLS results in several on-chip memory structures, whose
sizes and interfaces are dependent on variables’ types, often sub-utilizing available on-chip memory.
Most HLS tools offer compiler directives—pragmas—which guide the synthesis tool according to
the designer’s intention: optimizing for performance through loop unrolling, or selecting different
implementations (on-chip memories or LUTs). We advocate that more directives, invoking different
synthesis strategies, are required in order to tackle design constraints such as space and power.
The majority of research into partitioning algorithms has mainly focused on performance: namely,
throughput. Gallo et al. [30] have shown how to construct efficient parallel memory architectures
through High-Level Synthesis: however, their approach is predicated on re-organizing memory
placement at algorithm level, by examining computational behavior and placing data accordingly
through lattice-based partitioning, which is not feasible on streaming applications where pixels
are inputted sequentially. Although possible, it would require a complex memory addressing
mechanism between pixel input and memory structure. The authors then expanded their work
to incorporate information about loop unrolling [31], providing new partitioning algorithms for
maximizing parallelism; however, they did not tackle the utilization problem. Similarly, Wang et al. [32]
have demonstrated an extremely efficient algorithm for improving throughput, by creating memory
structures that facilitate loop pipelining in high level synthesis. Their approach saves up to 21%
of BRAMs compared to previous work [33]; still, since their objective is maximizing throughput,
supporting loop pipelining, their approach does not achieve optimized memory allocation in terms of
utilization efficiency.

2.4. Memory Power Reduction


The impact of memory partitioning on power consumption has been researched by
Kadric et al. [34]. Their approach investigates the impact of parallelism, i.e., how data placement
can be leveraged for parallel access, minimizing communication power. A similar approach is taken
in [35]. Tessier et al. [36] show on chip memory power reduction through partitioning, similar to our
approach and previous work by the same authors [37], and more recently in [38]. However, none of
these investigations assume constraints on memory availability. In contrast, we investigate tradeoffs
between power and scarce memory availability, inherent to the image processing domain, a future-work
need clearly identified by Tessier et al.: “an investigation to determine the optimal size and availability of
different-sized embedded memory blocks is needed”.

3. Memory Partitioning on FPGA


In this paper we describe how to partition image frames into BRAMs in order to maximize
utilization (i.e., minimize the number of required on-chip memories), subject to minimization of
power consumption. We begin by formulating the utilization efficiency problem, without paying
any consideration to power aspects; the following section integrates power consumption in our


problem formulation. We assume that only one possible BRAM configuration is used for each image
frame buffer.

3.1. Problem Formulation: Utilization Efficiency


Definition 1. Given a BRAM storage capacity C, and a number of possible configurations i, the configurations
set Cfg is a vector of i elements:
$$\mathrm{Cfg} = \begin{pmatrix} (M_1, N_1) \\ (M_2, N_2) \\ \vdots \\ (M_i, N_i) \end{pmatrix} = \begin{pmatrix} \mathrm{Cfg}_1 \\ \mathrm{Cfg}_2 \\ \vdots \\ \mathrm{Cfg}_i \end{pmatrix} \qquad (1)$$
where the first component of each element depicts BRAM width M and the second component depicts BRAM
height N, such that:

$$M_x \times N_x \leq C, \quad \forall x \in [0, i - 1] \qquad (2)$$

For any given frame size, several BRAM topologies are possible (different BRAM
configurations do not always provide the same logical bit capacity: whilst the total physical capacity is the
same, in some configurations parity bits can be used as additional data bits. E.g., configuration (1,16384)
can store 16384 bits, whilst configuration (9,2048) can store 18432 bits). A frame is a 3-dimensional array,
of dimensions width W, height H, and pixel bit width Bw (typically defined as a 2-dimensional array
where the type defines the bit width dimension). BRAM topologies are defined based on a mapping of
3-D to 2-D arrays and a partitioning of a 2-D array to a particular memory structure (Figure 1).

Figure 1. Mapping a 3-D array into row-major and column-major order 2-D arrays.
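As a quick sanity check, Definition 1 and the capacity constraint of Equation (2) can be written down directly. The sketch below (Python, illustrative only) uses the Virtex-7-style 18 Kbit configuration set that appears later in Equation (9); note how the parity-capable configurations expose 18432 usable bits, more than the 16384 bits of the narrow modes:

```python
# Configuration set of Definition 1 for an 18 Kbit BRAM (cf. Equation (9)).
CONFIGS = [(1, 16384), (2, 8192), (4, 4096), (9, 2048), (18, 1024), (36, 512)]
C = 36 * 512  # total physical capacity of one BRAM in bits (18432)

def capacity_bits(cfg):
    """Logical bit capacity M x N of a configuration (width M, height N)."""
    m, n = cfg
    return m * n

# Every configuration satisfies the constraint M_x x N_x <= C of Equation (2):
assert all(capacity_bits(cfg) <= C for cfg in CONFIGS)
```

The parity point from the footnote above is visible immediately: `capacity_bits((1, 16384))` is 16384, while `capacity_bits((9, 2048))` is 18432.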


Throughout the remainder of this paper, we assume the use of a mapping scheme which assigns
Bw to the x dimension and H and W to the y dimension, in both row-major and column-major
order (where x and y are 2-D array width and height, respectively). This is the default approach in
software implementations, where the type/bit width dimension is considered implicit, and a sensible
approach for hardware implementations. Mapping bit width Bw across the y dimension would result
in implementations where different bits of the same array element (pixel) would be scattered among
different memory positions of the same BRAM. This would require sequential logic to read/write a
pixel, accessing several memory positions, creating performance, power and size overheads. It should
be noted that this approach might offer performance advantages for certain classes of algorithms
which might want to compare individual bits of different elements; however, we delegate this aspect
to future work. Hence, we define only the default mapping scheme:

Definition 2. A mapping scheme m transforms a 3-D array A3 into a 2-D array A2 of dimensions x and y by
assigning Bw to the x dimension and ordered combinations of W and H to the y dimension, for a total of two
possible configurations, as depicted in Figure 1. Mapping schemes are defined as:

$$(x, y) = m(W, H, B_w) \qquad (3)$$
$$A2_{x,y} = A3_{y \backslash W,\ y \% W,\ x}, \quad x = B_w, \ y = W \times H \qquad (4)$$
$$A2_{x,y} = A3_{y \% H,\ y \backslash H,\ x}, \quad x = B_w, \ y = W \times H \qquad (5)$$

where \ and % represent integer division and modulo, respectively.
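The two mapping schemes of Equations (4) and (5) can be sketched as index transformations (Python, assuming a frame indexed as `frame[row][col][bit]`; the function names are illustrative, not from the paper):

```python
# Index transformations for the mapping schemes of Definition 2. Both assign
# the pixel bit width Bw to x and W*H to y; Equation (4) walks the frame in
# row-major order, Equation (5) in column-major order.
def map_row_major(y, x, W, H):
    """Equation (4): A2[x][y] = A3[y \\ W][y % W][x] -> (row, col, bit)."""
    return (y // W, y % W, x)

def map_col_major(y, x, W, H):
    """Equation (5): A2[x][y] = A3[y % H][y \\ H][x] -> (row, col, bit)."""
    return (y % H, y // H, x)

# Example: for a 320x240 frame, 2-D row y = 322 in row-major order is
# pixel (row 1, col 2) of the 3-D frame, bit x = 0.
```

Keeping the bit-width dimension intact in x is what guarantees that all bits of a pixel land at the same (possibly multi-BRAM) memory address, avoiding the sequential read/write overhead discussed above.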

Definition 3. Given a 2-D mapped image frame of dimensions x and y, a partitioning scheme p which assigns
pixels across a × b BRAMs, depicted in Figure 2, is defined as the linear combination:

$$p(x, y) = \mathrm{Cfg} * \left((a_1, b_1), (a_2, b_2), \ldots, (a_i, b_i)\right) \qquad (6)$$

where $*$ stands for linear combination, such that only one $(a_x, b_x), \forall x \in [0, i - 1]$ pair has non-zero components
(such a pair is generated as a function of x and y), selecting $M_p$ and $N_p$ subject to:

$$((a \times M_p) \geq x) \cap ((b \times N_p) \geq y) \qquad (7)$$

Figure 2. Mapping a 2-D array of dimensions $x = B_w$ and $y = W \times H$ to $a \times b$ BRAMs configured for width M and height N.

Different partitioning schemes p, implementing different functions of x and y, result in different
addressing, input and output logic requirements, each with a particular impact on performance
and resource usage. As this is the greatest bottleneck in implementing high-level image processing
pipelines on an FPGA, it is paramount to define BRAM usage efficiency, i.e., the ratio between the total
data capacity of the assigned BRAMs and the amount of data which is actually used.

Definition 4. Given a partitioning scheme p and maximum BRAM capacity C, the utilization efficiency E is
defined as the ratio:

$$E = \frac{x \times y}{a_p \times b_p \times C} \qquad (8)$$
The default mapping and partitioning schemes in state of the art HLS tools are geared towards
minimizing addressing logic (abundant in contemporary FPGAs), resulting in sub-par efficiency in
BRAMs usage (still scarce for the requirements of high-level image processing systems). Alternative
schemes must be used in order to ensure memory availability within HLS design flows. We define the
problem as:

Problem 1 (Utilization Efficiency). Given an image frame of width W, height H and pixel width Bw , select a
partitioning scheme, in order to:
Maximize $E = \dfrac{x \times y}{a_p \times b_p \times C}$

Subject to $((a \times M_p) \geq x) \cap ((b \times N_p) \geq y)$

3.2. Utilization Example


Consider an image frame of width W = 320 and height H = 240, where each pixel is 8 bits
(monochrome), and BRAMs which can be configured according to:
$$\mathrm{Cfg} = \begin{pmatrix} (1, 16384) \\ (2, 8192) \\ (4, 4096) \\ (9, 2048) \\ (18, 1024) \\ (36, 512) \end{pmatrix} \qquad (9)$$
which is representative of state-of-the-art FPGAs (Xilinx Virtex 7 family 18 Kbit BRAMs), where total
BRAM capacity C is given by C = 36 × 512. Using a partitioning scheme
$$p(m(320, 240, 8)) = \mathrm{Cfg} * \begin{pmatrix} (8, 8) \\ (0, 0) \\ (0, 0) \\ (0, 0) \\ (0, 0) \\ (0, 0) \end{pmatrix}^{T} \qquad (10)$$
where m(320, 240, 8) = (8, 76800) (Equation (3)), yields a BRAM usage count of 64 (8 × 8 BRAMs
configured for width 1 and height 16384), with storage efficiency:

$$E = \frac{8 \times (320 \times 240)}{8 \times 8 \times (36 \times 512)} = 0.520833333 \qquad (11)$$
We have observed that this is the default behaviour for Xilinx Vivado HLS synthesis tools:
empirical results show that configuration $(M_1, N_1) = (1, 16384)$ is selected through a partitioning
scheme where $a_1 = B_w$ and

$$b_1 = \frac{W \times H}{N_1} \qquad (12)$$


rounded up to the nearest power of 2. Our experiments show that for any frame size, the synthesis
tools’ default partitioning scheme can be given by:
$$p(m(W, H, B_w)) = \mathrm{Cfg} * \begin{pmatrix} \left(B_w,\ 2^{\left\lceil \log_2\left(\frac{W \times H}{N_1}\right) \right\rceil}\right) \\ (0, 0) \\ (0, 0) \\ (0, 0) \\ (0, 0) \\ (0, 0) \end{pmatrix}^{T} \qquad (13)$$

where $2^{\lceil \log_2((W \times H)/N_1) \rceil}$ should be read as 2 raised to the rounded-up (ceiled) result of the
logarithm operation (i.e., 2 to an integer power).
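This observed default behaviour can be reproduced as a small sketch (Python; this reconstructs Equation (13) as reported empirically in the text, and should be read as illustrative of the formula rather than as a specification of the Vivado HLS tool):

```python
import math

# Reconstruction of the default HLS partitioning of Equation (13):
# configuration (M1, N1) = (1, 16384), with a = Bw BRAMs across the bit
# width and b rounded up to the nearest power of two of (W * H) / N1.
def default_hls_partition(W, H, Bw, N1=16384):
    a = Bw
    b = 2 ** math.ceil(math.log2((W * H) / N1))
    return a, b, a * b  # width-wise count, height-wise count, total BRAMs
```

For the 320x240 monochrome example this yields 8 x 8 = 64 BRAMs, matching Equations (10) and (11).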
Now consider the same mapping (x = Bw , y = W × H), but with a partitioning scheme:
$$p(m(320, 240, 8)) = \mathrm{Cfg} * \begin{pmatrix} (8, 5) \\ (0, 0) \\ (0, 0) \\ (0, 0) \\ (0, 0) \\ (0, 0) \end{pmatrix}^{T} \qquad (14)$$
which partitions data unevenly across BRAMs, rather than evenly. This scheme yields a BRAM usage
count of 40, with storage efficiency:

$$E = \frac{320 \times 240 \times 8}{8 \times 5 \times (36 \times 512)} = 0.833333333 \qquad (15)$$
Yet a better partitioning scheme for the same mapping would be:
$$p(m(320, 240, 8)) = \mathrm{Cfg} * \begin{pmatrix} (0, 0) \\ (0, 0) \\ (2, 19) \\ (0, 0) \\ (0, 0) \\ (0, 0) \end{pmatrix}^{T} \qquad (16)$$
yielding a BRAM count of 38 and efficiency:

$$E = \frac{320 \times 240 \times 8}{2 \times 19 \times (36 \times 512)} = 0.877192982 \qquad (17)$$
Clearly, partitioning schemes depend on the frame dimensions, width, height, and bit width, to
enable efficient use of on-chip memory blocks.
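The whole worked example can be reproduced by exhaustively scoring every configuration in Equation (9) against Equation (8). The sketch below (Python, illustrative) does exactly that; note that ties such as (4, 4096) versus (9, 2048), which both need 38 BRAMs here, are broken by list order:

```python
import math

# Exhaustive search for the configuration maximising utilization efficiency
# E (Equation (8)), subject to the covering constraint of Equation (7).
CONFIGS = [(1, 16384), (2, 8192), (4, 4096), (9, 2048), (18, 1024), (36, 512)]
C = 36 * 512  # physical BRAM capacity in bits

def best_partition(W, H, Bw, configs=CONFIGS):
    x, y = Bw, W * H                  # mapping of Definition 2
    best = None
    for (M, N) in configs:
        a = math.ceil(x / M)          # BRAMs needed across the bit width
        b = math.ceil(y / N)          # BRAMs needed across W * H
        E = (x * y) / (a * b * C)     # Equation (8)
        if best is None or E > best[3]:
            best = ((M, N), a, b, E)
    return best
```

For the 320x240 monochrome frame, `best_partition(320, 240, 8)` selects configuration (4, 4096) with 2 x 19 = 38 BRAMs and E of roughly 0.877, matching Equations (16) and (17), against 40 BRAMs for the even (8, 5) split and 64 for the default HLS scheme.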

3.3. Power Considerations


Having formalized the utilization problem, we may proceed to analyse the power implications
of each configuration. We model BRAM dynamic power consumption using the model described by
Tessier et al. [37]: a power quantum is consumed per read and/or write. BRAM static power is directly
proportional to utilization, hence addressed in the utilization problem.
For any given BRAM cell, the read power is consumed by a sequence of operations: the clock
signal is strobed; the read address is decoded; the read data is strobed into a column multiplexer; the
read data passes to BRAM external port. Write power is consumed by the following sequence: the
clock signal is strobed; the write enable signal transfers write data to the write buffers; a line is selected
by address decoding; data is stored in the RAM cell.
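As a back-of-the-envelope illustration of why partitioning affects dynamic power, consider a toy model (our assumption for illustration, not the paper's full analysis): following the cited quantum-per-access idea, suppose each pixel access strobes every BRAM stacked across the bit-width dimension, costing one quantum per BRAM strobed:

```python
# Toy dynamic-power comparison between partitioning schemes. Assumption
# (illustrative only): each pixel access strobes all `a` BRAMs spanning the
# pixel bit width, consuming one power quantum per strobed BRAM.
def access_quanta(a):
    """Power quanta consumed by one pixel read/write under this assumption."""
    return a

def saving(a_default, a_optimised):
    """Fractional dynamic-power saving of one scheme over another."""
    return 1.0 - access_quanta(a_optimised) / access_quanta(a_default)

# The (8, 8) scheme strobes 8 width-1 BRAMs per access; the (2, 19) scheme
# of Equation (16) strobes only 2 width-4 BRAMs.
```

Under this crude assumption, the (2, 19) scheme would save 75% of per-access power over the 8-wide default, in the same ballpark as the up-to-70% dynamic-power reduction reported in the abstract.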

But Wintry Wind is the most exasperating and prying of nature’s
messengers, whose mission is the uncovering of weaknesses in all
things animate and inanimate. It soon discovers if your eyes are
sensitive, your hat a size too small, that you are subject to
rheumatism, that your breath is short when you walk uphill, and that
your knees bend as you go down, and so turns your cloak over your
head like an extinguisher. It knows precisely which shingle lacks a
nail, and will lay bare spots calculated to make obstinate leaks. It
also spies out the blind whose catch is loose, the gate with one
hinge, the elm that is split in the crotch, and the particular chimney
flue that leads to the room where your most important relation (who
suffers from bronchitis) is being entertained at tea, and it gauges
accurately which article on the clothes-line you value the most.
It was this sort of weather, combined with his daughter
Margaret’s delicate health, that made Ezra Tolford, living at the Glen
Mill, for which the village was named, resolve to have a hired man.
Now Ezra Tolford had many titles to local distinction. He was
Deacon of the First Church, and his parents had been zealous before
him, his grandfather having had the hardihood to fly to the woods
with the church plate on the approach of the British in 1779, thereby
risking his life via wild beasts, Hessians, and exposure,—a fact that
is brought up in every local historical discourse to this day.
Incidentally it might be mentioned that the plucky ancestor (owing
to fright and darkness) was never able afterwards to locate the
marshy spot where the precious metal was buried; this fact,
however, is usually omitted.
Ezra was also Judge of Probate, thanks to a fragmentary law
course taken in days when a fond mother had pinched and saved
that her only boy might “make his mark.” Thirdly, he was the owner
of the best mill on the Pequotuck. A mill that, in spite of the sale of
flour and meal at the village store, kept its wheel going five days out
of seven during nine months of the year, sawing wood when no one
wished flour, and turning out middlings for the cattle when the
stacks grew low. So swift was the river that ice very seldom silenced
the song the old wheel hummed as it worked.
Lastly, by wise drainage the deacon had turned a dozen acres of
protected meadow-land, heretofore regarded as next to useless, into
one of the thriftiest fruit farms in southern New England.
All these things made Ezra’s daughter Margaret of special
importance in many eyes besides his own, and it was for her sake
that he resolved to have a man to hook up the team for her, when
he was busy in the mill or away in the village, and do a thousand
and one little errands that the sturdier daughters of his neighbours
accomplished for themselves.
The Mill House, as it was called, stood on a hill between the
Pequotuck and a little brook that, curving, joined the river below the
dam. It was a placid-looking white house of a style of architecture
that might be called New England Restored. It had been Colonial,
but a modern bay-window, a piazza, and a lean-to in the rear had
hybridized it; yet it still possessed a dignity never seen in the rural
interpretations of the Queen Anne villa.
This particular house had a very attractive outlook. Raised well
above them, it was bounded on the western side by the river and
the mill-pond that always held the sunset reflections until the
twilight absorbed them, while the old red mill with its moss-mottled
roof focussed the view. Toward the north and east the meadows ran
slantwise up a hillside, where, dotted here and there like grazing
sheep, you could see the stones of the burying-ground, where the
inhabitants of the glen took their final rest, as if their friends had left
them as near heaven as possible, and safe from the floods that used
once to sweep the valley. To the south the road ran tolerably straight
for three miles down to Glen Village itself.
The interior of the house differed but slightly from others of its
class, and that difference consisted in the greater genuineness of its
fittings. Evidently the woman who presided over it appreciated
relative values, for the sitting-room had glowing crimson curtains
and a fire of logs in place of the usual “air-tight,” while in one corner,
in the location usually chosen for the inevitable asthmatic parlour
organ, stood an upright piano. On the table was a comfortable litter
of books and papers.
By the window, looking down the road, stood Margaret Tolford.
At the first glance there was nothing striking about her personality.
Medium in height and colouring, her slight frame was wrapped in a
soft white shawl that gave her a fragile air. At a second glance the
deep gray eyes, that looked from under a brow narrowed by a
quantity of smooth, coal-black hair, were magnetic in their intelligent
wonder. Her eyes said, “There is much that I would understand, but
I cannot;” whereas a shallower nature would have thought, “I am
misunderstood!”
The wind whistled in the chimney, and the pud, pud, of a heavy
flatiron came from the kitchen, with snatches of inharmonious song,
as the thick-lipped Polack who was the “help” pummelled the towels
and folded them at angles that would have distracted a
mathematician. In fact, this very Polack was one of Margaret’s lesser
problems, a sort of necessary evil who, in summer, bareheaded and
barefooted, pervaded the premises, but having with her gay
neckerchief a certain sort of picturesque fitness, which, when
brought nearer, booted and confined to the winter kitchen, became
an eyesore. Other farmers’ daughters did the cooking and the lighter
work, and only had a woman to help with the washing.
Margaret had never done manual labour; her mother, dead now
two years, had stood between this only child and all hardship, and
coaxed the Deacon to send her to a collegiate school when her
playdays were over. In the summer holidays she was petted and
caressed and kept from soiling her hands, and when at eighteen she
was coming home for good to mingle as an equal with her parents
and learn her part in life, her mother died, and her father closed the
one tender spot in his stern heart around his daughter. So she lived
shut up within herself, craving a more intellectual companionship
than the neighbourhood furnished, and starving unconsciously for
demonstrative affection.
Tolford was a silent sort of man, who had been so thoroughly
understood by his wife that she seemed to know his unvoiced
wishes. Because he showed so few signs of an affection that would
have won a hearty response from Margaret, he failed to comprehend
the difference between a deeply reserved nature and physical
weakness, to which cause he laid her abstraction. His love for her,
therefore, took the schooltime form of shielding her from work. He
liked to hear her play hymns on Sunday evenings, and was very
proud to have her train the children of the Sunday School in their
carols, but it never occurred to him to ask her advice in any of his
plans, or expect aid from her. She stood apart, not understanding
the love her mother had drawn from the stern, lonely man, and
while he excused her reserve, and told the neighbours she was
delicate and peaky, her only ailment lay in lack of motive.

It grew dark, and points of light appeared here and there in the
landscape; an icy slip of a moon pierced the driving clouds. Margaret
drew the curtains and sat down by the fire, its light sending a glow
to her usually colourless face. A brisk, though heavy, footstep came
along the entry from the kitchen, and Ezra Tolford opened the door,
and, stopping a moment to adjust his eyes to the fitful light, went
toward the fire, rubbing his hands. Margaret immediately arose and,
pushing a rocking-chair towards him, prepared to light the lamp.
“Never mind that now, daughter,” he said; “sit down, I want to
talk a bit. You know I said I’d get a hired man to ‘piece out’ with the
work? Well, he’s come!”
The Deacon was, in reality, fairly well educated, but since his
wife’s death (she had kept him to her standard, for she had been a
schoolmistress) his English had relapsed into localisms, and, besides
this, at the present moment he seemed ill at ease. Margaret merely
understood the announcement as a roundabout question as to
whether any accommodations were prepared for the man, and said:
“The shed bedroom is just as Hans Schmidt left it last fall; I suppose
a bed could be made up now, and Zella can clean the room to-
morrow, but it will be very cold unless you give him a stove.”
“Well—er—you see,” said the Deacon, “I don’t suppose that room
will do,—em!—hem! You see in the beginning he is to live with me
without wages, and—” here the Deacon came to an embarrassed
standstill, and Margaret broke in,—“Without wages! If he is as poor
as that, he will scarcely object to the shed room without a fire for
the night!” She did not say this because she was at all mean or hard-
hearted, but from her experience of the servant question, any one
who was willing to work for nothing must either be utterly worthless
or bereft of reason.
“Not at all, not at all, daughter! You see, the man is not a
common workman, but may buy the Hill Farm some day as a home
for his sister, and wants me to teach him to grow small fruits, and
learn the way of things here while he gets it to rights. I’ve
contracted with him for a year—” and as Margaret did not reply, he
continued, “You know Peter Svenson, the carpenter, who went home
to Denmark last summer to see his folks? Well, he brought this
young man back with him. Peter knows all about him, and says he is
perfectly honest and speaks good English, but is close-mouthed, and
doesn’t like to talk of his affairs, because his family used to be well
fixed, but now they are all dead but one sister. He has a few
thousand dollars and is going to make a home and bring her over in
a year.
“Peter says he can play a fiddle, but isn’t used to hard work, and
advised me not to pay him money, but to offer to show him how I
work my farm and give him his board for his services.” Then the
Deacon continued, giving the account of Gurth that the garrulous
carpenter had pieced together to cover his lack of real knowledge.
As Margaret still said nothing, he added:—
“Now I think the attic east room might be straightened up,—it
won’t take long, and it can be bettered to-morrow.”
Instantly Margaret was divided between extreme wonderment at
this strange arrangement on her father’s part, and fierce resentment
at the intrusion of a stranger in the house,—a man who was and
was not a servant, who must necessarily eat with them, who would
not perhaps leave the room when the meal was finished.
If Margaret had a decided eccentricity, it was her positive
resentment of male society, and she bore the reputation of being
proud, because, when the village swains drove up in their newly
washed buggies with bows of ribbon tied to the whip handles, and
with self-satisfied glances asked her to take a drive, the usual rural
compliment, she invariably declined, and their irate mothers settled
that she either must be in a decline, experiencing religion, or else,
woful thought, “engaged to some fellow Northampton way,” where
she had been to school.
The truth was that she had, through a wide range of reading and
no experience, built up a well-nigh impossible ideal, half mediæval
heroism, half modern, intellectual refinement, that was irreconcilable
with the type of men with whom she came in contact.
Margaret was thoroughly accustomed to her father’s silent mood
and considered him by far (as he was) the best-informed man she
knew. He was also fond of reading, not only subscribed to a daily
paper, but several weeklies and magazines, and always allowed her
to buy any book she fancied, so that their winter evenings, when
Margaret read aloud, were comfortably sociable, and sympathetic. It
was no wonder, therefore, that she resented the presence of a
stranger, and it was with rather a lowering brow that she followed
her father to the kitchen.
Deacon Tolford went in first, and said abruptly, but in a tone that
Margaret knew was meant to be cordial: “Daughter, this is Gurth
Waldsen, who is going to help me out this year; we want to make
him feel so much at home that he’ll settle in Glen Village. You’d
better tell Zella to hurry supper; I guess we are both of us hungry.”
Margaret added some ordinary words of greeting before she
looked at the figure who rose from the settle back of the stove and
bowed, without offering to shake hands, as a native would have
done. Then she raised her eyes and saw the tall, easy figure with
the golden-tipped hair and beard, his dreamy gray eyes looking at
her with a directness that was not curious, but almost as of pleading
for mercy, while the mouse-coloured corduroy suit that Waldsen
wore brought out the clearness of his skin in a degree that was
almost startling.
“I hope that I put you not to great trouble,” he said in his soft
baritone. “If you will tell me where I may place my things, I can
arrange all myself.” The English was musical, and doubly so from the
slight hesitation and accent.
What passed through Margaret’s brain she never clearly realized,
but she heard her voice as from a long distance asking him to follow
her upstairs, and found herself lighting a lamp, and leading the way.
It was strange that she had never noticed before how dreary the
attic was. She merely indicated the room, saying that he might leave
his things there, and to-morrow he could bring up firewood, while
to-night she would give him an extra supply of bedding. As she left,
Gurth looked after her and at the bare room, and shivered, but the
room seemed less cold to him than the woman. There was no
reason that he should expect her to be cordial; doubtless she would
have preferred a field hand to whom she need not speak.
He realized that his very disappointment grew from the lack of
proper comprehension of his present position. “Oh, Andrea! Andrea!
for one sight of her sweet, sympathetic face, one touch only!” A
harsh, clanging bell from below waked him to the fact that if he
wanted water to wash his hands, he must bring it up himself; he
looked at them dubiously, smoothed his hair, flipped off his clothes
with his handkerchief, and went down.
He hoped that he might be allowed to eat his meals in the
kitchen; it would indicate his position more clearly, and he should be
less lonely than with constrained companionship. This was not to be.
As he passed the dining room door, he saw a table laid for three, at
which Ezra Tolford was already sitting, wrapped in a gaily figured
dressing-gown, and collarless, as was his habit when either at ease
or at work. He was reading a paper which was propped against a
pitcher, and he barely raised his eyes as he asked Gurth to be
seated.
Margaret came in with a coffee-pot and a plate of biscuits. She
had thrown off her shawl, and her crimson cashmere waist
accentuated the depth of her eyes. Gurth unconsciously arose and
drew out her chair, waited until she was seated, and pushed it in
again. It was a very simple and ordinary act of courtesy, and done
as a matter of course without the slightest manner of conferring a
favour. Margaret coloured at this hitherto unknown civility, but said
“Thank you” as if she were quite accustomed to it, while the Deacon
did not notice it at all.
The meal began in silence, but the Deacon finished his paper
with the first cup of coffee, and began to discuss the affairs of the
farm in a businesslike manner. The ice-cutting must begin to-
morrow, it was quite clear, for the last snowstorm had been dry and
had drifted away from the pond.
Had Waldsen ever cut ice? No! Well, he could superintend the
weighing of it, then. Could he milk? No! The hay must be transferred
from the left side of the great barn to the right, as the supports
were giving way, and Peter Svenson, the carpenter, must come and
straighten them, as well as do some tinkering at the mill. Squire
Black at the village needed two tons of hay, so that much could be
carted in next morning.
Waldsen fortunately was thoroughly familiar with horses, and
was a good deal of a carpenter, having always had a fancy for such
work, and, when a boy, he had for amusement built an arbour for
his mother in the garden of her country-house. He was able to
volunteer to repair the barn and mill, if the Deacon had the
necessary tools. The Deacon was too keen to show his surprise, but
accepted the offer, and said it would come handy to have some
patching up done before it came time to clear the land. He could
manage the cows and the mill, if Gurth took charge of the horses
and the chores.
The Deacon, having finished his meal, shook the crumbs from a
fold of the tablecloth of which he made a sort of apron in his lap,
and left the table. Margaret followed him, and Waldsen, hesitating a
moment, went to the back entry and began to collect his
possessions, taking his violin case and a small box first. When he
returned for his trunk, the Deacon appeared, and, as a matter of
course, helped him carry it upstairs. The trunk was very heavy, being
half full of books. Then the two men went out to feed the horses;
the sharp, dry snow blew in like powdered flint when they opened
the door, and made rainbows about the lantern as they went down
the path.
After the table was clear, Margaret took up the paper, read for a
few moments, then dropped it suddenly and went into the kitchen.
Zella, who was knitting a skirt of scarlet yarn, seemed very sulky and
angry when Margaret bade her take some wood to the attic
bedroom. “I no carry for hired man,” was her rejoinder. “You will
take the wood up to-night,” said Margaret, in the quiet, decided tone
that was habitual to her; “to-morrow he will carry it himself.” In a
short time a fire was started in the old, open-fronted wood stove,
that sent a welcome glow across the long, low room with its deeply
recessed dormer windows. The furniture consisted of an old-
fashioned four-posted bedstead and some spindle-backed chairs,
discarded long ago from the lower rooms, an old chest of drawers
and a table, while a row of wooden pegs behind the chimney did
duty as a closet.
Going to the adjoining lumber room, Margaret pulled open a long
trunk and took a chintz quilt, some curtains that had originally
belonged to the old bed, and three or four carpet rugs. These she
dragged into the attic, and then brought from a downstairs room a
large rocking-chair, covered with Turkey red, and a blue china bowl
and pitcher. The last man who had slept in the attic had washed at
the pump. In a few minutes the bare room looked quite habitable,
and Margaret returned to her newspaper.
In perhaps half an hour her father returned, and she heard
Waldsen’s steps going up the creaking back stairs.
“Well, daughter, quite a figure of a man, isn’t he? I know you
don’t like to have men folks about, but you see this arrangement will
advantage me greatly. If I can sell him the Hill Farm, it will be so
much clear gain, besides being a bargain for him, for it’s running
down and needs lots of tinkering. And if we get a good neighbour
there, it won’t be so lonesome for you when I go over town. I can
arrange with him for half-time work in the growing season, so he
can get his fruit running. I’ll sell that place for three thousand dollars
—and three thousand dollars in hand,—why, Margaret, you might go
to Europe next summer with Judge Martin’s folks! He told me
yesterday they expected to take a tour, and that if I’d let you go,
you’d be good company for Elizabeth. What do you say to that,
daughter?”
Going to him and sitting on the arm of his chair, she hid her face
on his shoulder, a childish habit of hers, and said: “Dear old dad, I
should want you to go with me, and then, besides, it is all so
uncertain. This man may not really want to buy a place, or he may
have no money, or—or, a great many things may not be true!”
“No, no, child! the man is all right, he wants to have a home of
his own by next Christmas. There is some reason why his sister
cannot come until then. I like to keep you with me, but my little girl
is too lonely; she must see more company, and if she’s too wise and
too proud for the folks about home, why, this place isn’t the whole
world.”
Meanwhile Waldsen was sitting on his trunk in the attic room in
an attitude of dejection. Then, as the fire flickered, he saw the
change that had been wrought. Not great in fact, but in the
womanly touch, and he was comforted. Taking from his pocket the
little case containing Andrea’s portrait, he placed it on the chest of
drawers, and, after closing the door, took out his violin.
Margaret and her father were playing their nightly game of
backgammon when she started, dropped her checkers with a rattle,
and grasped his arm. The Deacon looked up in surprise, and then,
as he heard a far-away strain of music that seemed to come from
the chimney, said, “Don’t be scared, daughter, it’s only the young
man playing his fiddle!” But somehow neither father nor daughter
cared to continue their game, and a moment later Margaret opened
the door of the sitting-room and one at the foot of the stairs, and
stood there listening, in spite of the cold air that swept down.
Accustomed at most to the trick playing of travelling concert troupes,
who visited the next town, this expressive legato music was a
revelation to Margaret, and stirred her silent nature to untested
depths. The first theme was pleading and wholly unknown to her,
but presently the air changed to the song she had taught the
children during the last Christmas season; through it she heard two
voices singing,—the violin and the man.
“Brightly shone the moon that night,
Though the frost was cruel,
When a poor man came in sight,
Gathering winter fuel.

“Hither, page, and stand by me,
If thou know’st it, telling,
Yonder peasant, who is he?
Where and what his dwelling?”
“Hymn tunes,” said Deacon Tolford, pursing his mouth in a
satisfied way. “I forgot to ask him if he is a church member. Perhaps
he might help out at the Endeavour Concert next month.” But
Margaret, shaking her head impatiently, stood with her finger on her
lips.
The Tolford household was more cheerful after Waldsen’s
coming. Not that he intruded upon the Deacon and his daughter,
merely talking a few minutes after meals, perhaps, and then going
to his attic, but little by little the mutual strangeness wore off.
Though Waldsen fulfilled to the letter the work that he had engaged
to do, he found that it was impossible to keep up the illusion of
being a mere labourer, and reconciled himself with the fact that in
other farming families the steady male “help” stands on a
different footing with the household from the transient field hands
who come and go with the crops and seasons. Farmer Elliott’s “help”
was his brother-in-law, and Farmer Bryce’s, his wife’s cousin.
The Deacon looked at the whole matter from a commercial
standpoint. Here was a likely young man who, though he was
unused to many kinds of manual labour, eked out his lack of
knowledge with extreme willingness, and asked no wages other than
instruction. At the same time he was a prospective purchaser of a
house that had been difficult to sell. That was the beginning and end
of the matter. That Waldsen was rarely intelligent, and added to their
home life, was also an advantage, but secondary.
Every day Gurth held Margaret’s chair, and placed it at the table;
there was no longer any restraint between them. He saw in her a
sweet, womanly nature, whose best part was evidently held in
check, owing to the peculiarities of the community in which she
lived, which he could not fathom in spite of freedom from all
prejudice. He admitted the beauty of purpose with which she clung
to her ideals, but could not help contrasting her reserve with
Andrea’s spontaneous cheerfulness, her love of everything that grew
from the ground and every bird that flew, while Margaret seemed
but half conscious of the natural beauties that surrounded her.
Waldsen was most contented when employed at the mill. Birds
that braved the winter gathered about it for scraps of grain.
Nuthatches pried under the mossy shingles, meadow-larks stalked
solemnly in the stubbly grass for sweepings, and robins fed upon the
berries of many bushes that hedged the pond. Wild geese rested
there, and for days at a time flocks of ducks would pass and pause
for shelter, and owls roosted nightly in the mill loft, making hearty
meals of mice. Many a time he saw the quail coveys far up on the
hill running about among the gravestones, and he put a sheaf of rye
there for them, and it waved its shadowy pinions above the snow, as
if saying to the silent community, “I, too, have slept in the ground;
have courage!”
Another sheaf he fastened over the mill door, and, seeing it, the
Deacon lectured him upon the folly of gathering a lot of birds that
must be shot or scared away in berry season, saying, “It’s all very
well now, but if you encourage them, where will the profit be when
all the biggest berries are bird marked?”
Gurth felt like answering, “I will let the birds have them all, so
long as they come to me.” But then, where would be the bread for
Andrea? He felt beauty so keenly that he could not bear to harness
Nature and drive her like a cart-horse for his profit. His needs and
his desires were almost irreconcilable, and the consciousness of it
well-nigh appalled him. He could not change his temperament in the
least degree; even his experiment of passing for a labourer was
partly frustrated; he might possibly have masqueraded as a
wandering musician, but he began to feel his incapacity for material
toil.
Margaret all this time lived in a waking dream; unknown to
herself, all the pent-up forces of her affection had crystallized about
this stranger. His natural courtesy seemed to her a gentle personal
tribute; the mystery he allowed to surround him (being wholly
unconscious of the version of his story the carpenter had told), and
his poetic personality, made him seem like some one she had met in
an old romance. Then the music, too, for often now in the evening
he brought his violin and accompanied her when she sang or played,
giving her new understanding, while he corrected the hardness of
her method so tactfully that she did not realize it. He lent her new
music, substituting the “Songs without Words” for the hackneyed
“Airs with Variations,” and taught her German and Danish ballads
that lent themselves to her rich contralto voice.
Margaret became a different creature, and rare glints of red
touched her cheeks. The Deacon accounted for this arousing in the
pleasure she anticipated in going abroad if the Hill Farm was sold.
He was so thoroughly convinced of her indifference to men, that he
was blind to the awakening of her heart.
Margaret noticed with pleasure the various details and changes in
Waldsen’s attic, where she went occasionally to dust, and thought
that they betokened contentment. The room was no longer bare,
festoons of ground pine hung from the rafters and canopied the
windows, a half-dozen home-made cages filled the dormer nearest
the stove, and sheltered a collection of wild birds rescued from cold
and hunger, which chirped from them merrily, while a little screech-
owl blinked sleepily from a perch in the corner. Books lay on the
table and filled a rough shelf under the eaves. Writing implements
and paper also lay about, and traces of bold, irregular characters
were on the big sheets of blotting-paper.
It was Andrea’s picture, however, that interested Margaret more
than anything. She looked at it day after day, trying to trace a
resemblance to Gurth. One day she kissed the lips, and then,
suddenly remembering that he might also do this, fled precipitately
to her room, and, locking the door, stayed until dark, when she went
down to supper with her face flushed, and a nervous air. So nervous
was she that her hand trembled until she almost dropped the cup
that she was passing to her father. Gurth grasped it, and thus their
hands met for the first time.

IV
The last of February a southerly rain inaugurated the spring
thaw. Great cakes of ice came down the river, and barricaded the
mill. Then a cold snap followed, and the trees hung thick with
fantastic icicles. In the morning the Deacon, Gurth, and several
neighbours went up the stream to dislodge, with long poles, cakes of
ice that were wedged threateningly between trees, and after dinner,
when the two men had been talking of the caprices of the storm, the
Deacon said: “It’s worth walking up to the Hill Farm, daughter, to
see the ice on those white pines, but you must mind your footing.
Waldsen’s going up there to shovel off the shed roof, and he’ll be
glad to beau you, I know.”
Margaret blushed painfully, but Gurth, totally missing the
significance of the word, said, in his precise language, that he was
about to ask Miss Margaret, but feared she could not walk so far. So
Margaret brought her coat, trimmed with a neck-band and cuffs of
fur, and, drawing a dark red tam-o’-shanter over her black hair, set
off with Waldsen.
As the Deacon watched them go down the road, dark and fair,
slender and tall, both talking with animation, he suddenly gave a
long whistle, for an idea, born of the word he had just used, flashed
across his matter-of-fact mind, and he said aloud,—“Well, I never!
Well, I never! She shan’t find her old dad a spoil sport, anyhow! I’ve
my doubts if he’ll ever make out with farming, but I suspect he
comes of good folks, and there’s a good living at the mill, and
Margaret’s my only one!” Then he smiled contentedly to himself. The
Deacon had loved his wife with a sentiment that was regarded as a
weakness by his neighbours, and he was prepared to enjoy the
courtship of his only daughter and forward it by all the innocent local
ruses. Yes, he would even make errands to town, and at the last
moment send Waldsen to drive Margaret in his stead.
The couple crossed the bridge and climbed the steep river bank
towards the Hill Farm. Waldsen was in high spirits and hummed and
whistled as they struggled and slipped along, steadying Margaret
every few steps. Happiness and the bracing air had given her a clear
colour, and her eyes were sparkling—she was a different being from
the pale, silent girl of two months ago. The mail-carrier, who met
them at the cross-roads and handed Gurth some letters, thought
what a fine couple they made, and immediately started his opinion
as a rumour around the community.
Margaret walked about outside the little brown house, while her
companion freed the roof from its weight of ice. Her own home was
in sight across the river, and at the left was a lovely strip of hill
country that rose and fell until it merged with the horizon. She was
so absorbed in the view that she did not realize when the shovelling
was finished, until Waldsen stood close beside her. “Has your father
told you that I buy this place, and that to-morrow the papers will be
signed? Yes, I have bought it for my home; I shall plant the ground
and work it, as your father says, to win my living. At evening we
shall sit here and look up the river and down to where the sun sets,
and then over to your house, thanking you for your kindness to a
lonely stranger.” The “we” dropped in unawares, but Margaret knew
that he meant Andrea, his sister.
“Next Christmas I shall move here, for my best resolves have
come on Christmas Day; meanwhile, there is much to be done, and I
shall ask your woman’s art how best to make my home attractive.”
Then they talked of the garden and of the house, how it would need
a summer kitchen, until he, through the subtilty of woman’s
sympathy, thought that he could not wait all the long months for
Andrea’s coming.
That night Waldsen sat a long time pondering over a letter that
had that day come from Andrea. At the first, nothing new suggested
itself, except that she perhaps was lonely, but on a second reading a
note of pain was evident. Carelessly feeling in the pocket of his
overcoat before going to bed, he found that he had received two
letters, when he thought he had but one, and, re-lighting his lamp,
he read the second, which was blotted and tear-stained. It ran thus:

“The stamp on the last letter that I wrote you, dear
Gurth, is hardly dried, yet I must write again and tell you
that which for the last month I have tried to conceal. Now it
is useless. My father will bring a new wife to fill my
mother’s place in two months from now. A hateful woman
who has in some strange way gained power over and
fascinated him, but who does not wish me in the house, for
my father is urging, nay, almost commanding, my betrothal
to Hans Kraus, the brewer’s son, whom I have seen hardly
twice, and whose mother is arranging the matter for him.
“In vain I protest and remind him of our betrothal. He
insists that your mother will surely win you back, as she is
making great efforts to discover where you are. He will not
hear of my going out to service. I know that you will say,
‘Come to me, and we will be married,’ but knowing your
plans and your agreement with your employer, this I will
not do until Christmas comes again. One thing is possible, if
you will undertake it. You are, of course, known in your
village as a working-man. There must be some one there
who wishes a young, strong woman to do housework,
sewing, anything, in short,—you know my hands are used
to work of all kinds. Find some lady who will pay my
passage money, to be taken out in service, and I will come.
Thus I, too, shall be independent. I can sometimes see
you, and when we then marry at Christmas, no one will
know that we are not as we seem, and we shall begin on a
sure footing. Do not attempt to stop me, dearest. Let me
also work.
“Your Andrea.”

This letter cut Waldsen to the heart as well as stirred his pride,
and his first impulse was to return at once to Denmark for Andrea.
Then he considered all the threads that must be unravelled, the
dispersal of many plans so nicely made, and he paused, perplexed.
Andrea clearly did not realize that he was not really a servant even
in name, and that he could not allow her to fill a drudge’s place in
some farm-house.
Stop! why should he not consult Margaret? She might suggest
something, and, at least, her advice would be in accord with local
custom, so that neither he nor Andrea would be criticised in future
by those among whom they were to live. He wrote a few comforting
lines to his betrothed, which he prepared to post that night that the
letter might go by the next day’s steamer, for he had the habit, that
a man bred in a large city seldom loses, of noting the coming and
going of the iron monsters that bind the continents.
It was after one o’clock when he went downstairs, shoes in hand,
and nearly three when he returned from his six-mile walk, after
dropping his letter through the well-worn slit in the post-office door.
The stairs creaked provokingly as he made his way up. He heard a
slight noise and saw a light under Margaret’s door, which, as he
passed by, opened, and Margaret herself peered out, shading her
candle with her hand, and looking down the hall. She almost
screamed when she saw Gurth so near, and said quickly, with a
catch in her breath: “I heard a noise and thought the stair door had
blown open. Are you ill? Can I do anything for you?” He looked at
her a moment as she stood there in her loose wool wrapper, her hair
hanging in long braids, and it seemed like an answer to his
perplexity. His heart whispered, Trust her, consult her, and he said
gravely, “I am not ill, I thank you, and you can do something for me,
but not to-night.”
Then Waldsen slept the sleep of deep fatigue, but Margaret,
misunderstanding wholly and wakeful with happiness, threw herself
on her knees by her bed and, falling asleep, stayed in this position
until the sun cast streaks across the room and scattered the mist
that betokened the final breaking up of winter.
The March days flew by rapidly, and it was almost April. The
willows were showing yellow stems, and the river swirled under
them with new fervour. Hepaticas bloomed in the wood edges, while
violets crept along in the sheltered garden border; bluebirds purled
about the mill, while the kingfishers quarrelled over the pond. At
every meal Waldsen brought the account of some new bird or
unknown flower, until the Deacon was almost vexed, and told him in
a sternly parental way that he would never make his salt, but fill his
farm with brakes and briers, growing strawberries for robins and
raspberries for catbirds; but Margaret only smiled, treasuring every
leaf he brought, and spent much time out of doors watching the
messengers of spring that she never before had noticed, feeling that
life was good.
Easter came in middle April, and the little church at Glen Village
was to be decorated with flowers. The day before, Gurth went into
town with a load of feed, stopping on his way at the post-office, and
found a letter from Andrea that made him resolve to act at once.
On his way home he bought two pots of blooming lilies, which he
placed on Margaret’s table in the sitting-room, as an Easter gift to
the home. As she thanked him, bending over the flowers, he said,
“Miss Margaret, a while ago I said that you could do something for
me. I have come to ask it now, but before I speak there is much
that I must tell you, so that you may understand.” Margaret, making
a gesture of assent, stood clinging to the curtain for support, still
bending over the lilies.
Gurth began slowly and hesitatingly with his father’s unhappy
marriage and his loveless childhood, speaking deliberately, and
choosing his words like a lawyer presenting his case. A puzzled
expression gradually spread over Margaret’s face, but as he told her
of his meeting with Andrea and his love for her, she gave the curtain
so sudden a jerk that it tore from its fastenings, and fell in a heap
upon her. Gurth, merely thinking that she had stood too long, lifted
the curtain, gave her a chair, and continued his narrative, with
unconscious egotism. For more than an hour he talked; the Deacon
peeped in and hastily withdrew, thinking that the young folks were
coming to an understanding.
Margaret did not say a word, but so absorbed was Gurth that he
did not notice it. A terrible struggle was rending her, and she could
not trust herself to speak. Not only had her life hinged itself upon an
impossibility, but the mistake that had made such a thing possible
had come from giving credence to the story of the carpenter.
As every detail of the past three months came before her, she
realized how innocent of any deception Waldsen had been, and the
very advice he was now seeking proved his confidence in her. The
secret was her own,—at least she had that comfort. Then a wave of
pain passed over her, almost stopping her breath and seizing her
throat in an iron grasp. She dimly saw that Gurth was showing her
some letters, and gathered herself together only to receive a fresh
blow,—his appeal for Andrea. For though he did not ask it in so
many words, she knew what was in his mind.
When he had finished and stood expectantly before her, she
could no longer contend with herself, and big tears rolled down her
cheeks as she said, “I must think before I answer you, but I will do
all I can.” As she passed him he saw the tears, and, taking her hand,
he stooped and kissed it reverently, saying, “God bless you for your
sympathy.”
The Deacon did not return for tea, having business in town, and
Waldsen, much surprised at Margaret’s absence, ate his meal alone.
Margaret herself sat in her east window looking at the twilight,
and, when it faded, at the stars. The marsh frogs piped
monotonously, and the water rushed over the dam, falling below
with a hollow thud. Soon Waldsen’s violin sounded from his open
window,—to-night he played “The Songs without Words,” one after
another, chancing to end with “Lost Happiness.” As Margaret
listened, now that the first shock was over, she was soothed. At first
she did not think it was possible that she could have Andrea in the
house, and then she knew that only by some such object lesson
would she realize that Waldsen could not belong to her. Andrea
should come, and they would work together. Zella was shiftless and
constantly threatening to go. To tell her father and make him
comprehend the change was her next task. Puritan in education and
temperament, no other thought but to bend to the seemingly
inevitable occurred to her.
On Easter Day no one who heard Margaret sing at church knew
of her struggle, and yet her voice moved those plain people as it
never had before, and they spoke of it among themselves in walking
