Journal of Imaging

Image Processing Using FPGAs

Edited by Donald G. Bailey

Printed Edition of the Special Issue Published in Journal of Imaging
www.mdpi.com/journal/jimaging
Editorial Office
MDPI
St. Alban-Anlage 66
4052 Basel, Switzerland
This is a reprint of articles from the Special Issue published online in the open access journal
Journal of Imaging (ISSN 2313-433X) from 2018 to 2019 (available at:
https://www.mdpi.com/journal/jimaging/special_issues/Image_FPGAs).
For citation purposes, cite each article independently as indicated on the article page online and as
indicated below:
LastName, A.A.; LastName, B.B.; LastName, C.C. Article Title. Journal Name Year, Article Number,
Page Range.
© 2019 by the authors. Articles in this book are Open Access and distributed under the Creative
Commons Attribution (CC BY) license, which allows users to download, copy and build upon
published articles, as long as the author and publisher are properly credited, which ensures maximum
dissemination and a wider impact of our publications.
The book as a whole is distributed by MDPI under the terms and conditions of the Creative Commons
license CC BY-NC-ND.
Contents

Donald Bailey
Image Processing Using FPGAs
Reprinted from: Journal of Imaging 2019, 5, 53, doi:10.3390/jimaging5050053

Fahad Siddiqui, Sam Amiri, Umar Ibrahim Minhas, Tiantai Deng, Roger Woods, Karen Rafferty and Daniel Crookes
FPGA-Based Processor Acceleration for Image Processing Applications
Reprinted from: Journal of Imaging 2019, 5, 16, doi:10.3390/jimaging5010016

Paulo Garcia, Deepayan Bhowmik, Robert Stewart, Greg Michaelson and Andrew Wallace
Optimized Memory Allocation and Power Minimization for FPGA-Based Image Processing
Reprinted from: Journal of Imaging 2019, 5, 7, doi:10.3390/jimaging5010007

Andrew Tzer-Yeu Chen, Rohaan Gupta, Anton Borzenko, Kevin I-Kai Wang and Morteza Biglari-Abhari
Accelerating SuperBE with Hardware/Software Co-Design
Reprinted from: Journal of Imaging 2018, 4, 122, doi:10.3390/jimaging4100122

Zhe Wang, Trung-Hieu Tran, Ponnanna Kelettira Muthappa and Sven Simon
A JND-Based Pixel-Domain Algorithm and Hardware Architecture for Perceptual Image Coding
Reprinted from: Journal of Imaging 2019, 5, 50, doi:10.3390/jimaging5050050
About the Special Issue Editor
Donald G. Bailey received his Bachelor of Engineering (Honours) degree in Electrical Engineering
in 1982, and Ph.D. degree in Electrical and Electronic Engineering from the University of Canterbury,
New Zealand, in 1985. From 1985 to 1987, he applied image analysis to the wool and paper industries
of New Zealand. From 1987 to 1989, he was a Visiting Research Engineer at University of California,
Santa Barbara. Prof. Bailey joined Massey University in Palmerston North, New Zealand, as Director
of the Image Analysis Unit in November 1989. He was a Visiting Researcher at the University of
Wales, Cardiff, in 1996; University of California, Santa Barbara, in 2001–2002; and Imperial College
London in 2008. He is currently Professor of Imaging Systems in the Department of Mechanical
and Electrical Engineering in the School of Food and Advanced Technology at Massey University,
where he is Leader of the Centre for Research in Image and Signal Processing. Prof. Bailey has spent
over 35 years applying image processing to a range of industrial, machine vision, and robot vision
applications. For the last 18 years, one area of particular focus has been exploring different aspects
of using FPGAs for implementing and accelerating image processing algorithms. He is the author
of many publications in this field, including the book “Design for Embedded Image Processing on
FPGAs”, published by Wiley/IEEE Press. He is a Senior Member of the IEEE, and is active in the
New Zealand Central Section.
Preface to ”Image Processing Using FPGAs”
Over the last 20 years, FPGAs have moved from glue logic through to computing platforms.
They effectively provide a reconfigurable hardware platform for implementing logic and algorithms.
Being fine-grained hardware, FPGAs are able to exploit the parallelism inherent within a hardware
design while at the same time maintaining the reconfigurability and programmability of software.
This has led to FPGAs being used as a platform for accelerating computationally intensive tasks. This
is particularly seen in the field of image processing, where the FPGA-based acceleration of imaging
algorithms has become mainstream. This is even more so within an embedded environment, where
the power and computational resources of conventional processors are not up to the task of managing
the data throughput and computational requirements of real-time imaging applications.
Unfortunately, the fine-grained nature of FPGAs also makes them difficult to programme
effectively. Conventional processors have a fixed computational architecture, which is able to provide
a high level of abstraction. By contrast, on an FPGA, it is necessary to design not only the algorithm
but also the computational architecture, which leads to an explosion in the design space complexity.
This, coupled with the complexities of managing the concurrency of a highly parallel design and the
bandwidth issues associated with the high volume of data associated with images and video, has
led to a wide range of approaches and architectures used for realising FPGA-based image processing
systems. This Special Issue provides an opportunity for researchers in this area to present some of
their latest results and designs. The diversity of presented techniques and applications reflects the
nature and current state of FPGA-based design for image processing.
Donald G. Bailey
Special Issue Editor
Journal of Imaging
Editorial
Image Processing Using FPGAs
Donald G. Bailey
Department of Mechanical and Electrical Engineering, School of Food and Advanced Technology,
Massey University, Palmerston North 4442, New Zealand; [email protected]
Abstract: Nine articles have been published in this Special Issue on image processing using
field programmable gate arrays (FPGAs). The papers address a diverse range of topics relating
to the application of FPGA technology to accelerate image processing tasks. The range includes:
Custom processor design to reduce the programming burden; memory management for full frames,
line buffers, and image border management; image segmentation through background modelling,
online K-means clustering, and generalised Laplacian of Gaussian filtering; connected components
analysis; and visually lossless image compression.
Keywords: field programmable gate arrays (FPGA); image processing; hardware/software co-design;
memory management; segmentation; image analysis; compression
2. Contributions
Programming an FPGA to accelerate complex algorithms is difficult, with one of four approaches
commonly used [1]:
• Custom hardware design of the algorithm using a hardware description language, optimised for
performance and resources;
• implementing the algorithm by instantiating a set of application-specific intellectual property
cores (from a library);
• using high-level synthesis to convert a C-based representation of the algorithm to
synthesisable hardware; or
• mapping the algorithm onto a parallel set of programmable soft-core processors.
The article by Siddiqui et al. [1] took this last approach, and describes the design of an efficient
16-bit integer soft-core processor, IPPro, capable of operating at 337 MHz, specifically targeting the
dataflow seen in complex image processing algorithms. The presented architecture uses dedicated
stream access instructions on the input and output, with a 32-element local memory for storing pixels
and intermediate results, and a separate 32-element kernel memory for storing filter coefficients
and other parameters and constants. The exploitation of both data-level parallelism and task-level
parallelism is demonstrated through the mapping of a K-means clustering algorithm onto the
architecture, showing good scalability of processing speed with multiple cores. A second case study of
traffic sign recognition is partitioned between the IPPro cores and an ARM processor, with the colour
conversion and morphological filtering stages mapped to the IPPro. Again, the use of parallel IPPro
cores can significantly accelerate these tasks, compared to conventional software, without having to
resort to the tedious effort of custom hardware design.
Garcia et al. [2] worked on the thesis that the image processing operations which require random
access to the whole frame (including iterative algorithms) are particularly difficult to realise in FPGAs.
They investigate the mapping of a frame buffer onto the memory resources of an FPGA, and explore
the optimal mapping onto combinations of configurable on-chip memory blocks. They demonstrate
that, for many image sizes, the default mapping by the synthesis tools results in poor utilisation, and is
also inefficient in terms of power requirements. A procedure is described that determines the best
memory configuration, based on balancing resource utilisation and power requirements. The mapping
scheme is demonstrated with optical flow and mean shift tracking algorithms.
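To make the storage pressure concrete, a back-of-envelope calculation (not the authors' optimisation procedure; the frame size and block capacity below are assumed purely for illustration) of the minimum number of 18 Kb block RAMs needed to buffer a full frame:

#include <stdio.h>

int main(void)
{
    /* Hypothetical example frame: 640 x 480 pixels at 8 bits per pixel. */
    const long frame_bits = 640L * 480L * 8L;
    const long bram_bits  = 18L * 1024L;          /* one 18 Kb block RAM */

    /* Lower bound: total bits divided by block capacity, rounded up.
     * Real mappings are constrained by the fixed width/depth configurations
     * of each block, which is why default tool mappings can waste capacity. */
    long brams = (frame_bits + bram_bits - 1) / bram_bits;
    printf("8-bit VGA frame needs at least %ld x 18 Kb BRAMs\n", brams);
    return 0;
}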
On the other hand, local operations (such as filters) only need part of the image to produce an
output, and operate efficiently in stream processing mode, using line buffers to cache data for scanning
a local window through the image. This works well when the image size is fixed, and is known in
advance. Two situations where this approach is less effective [3] are in the region of interest processing,
where only a small region of the image is processed (usually determined from the image contents at
run-time), and cloud processing of user-uploaded images (which may be of arbitrary size). This is
complicated further in high-speed systems, where the real-time requirements demand processing
multiple pixels in every clock cycle, because, if the line width is not a multiple of the number of pixels
processed each cycle, then it is necessary to assemble the output window pixels from more than one
memory block. Shi et al. [3], in their paper, extend their earlier work on assembling the output window
to allow arbitrary image widths. The resulting line buffer must be configurable at run-time, which is
achieved through a series of “instructions”, which control the assembly of the output processing
window when the required data spans two memory blocks. Re-configuration only takes a few clock
cycles (to load the instructions), rather than the conventional approach of reconfiguring the FPGA each
time the image width changes. The results demonstrate better resource utilisation, higher throughput,
and lower power than their earlier approach.
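As an illustration of the general line-buffering idea, the following minimal C sketch caches two previous rows so a 3 × 3 window can be formed from a raster pixel stream; the fixed width is an assumption, and it does not reproduce the paper's run-time-configurable, multi-pixel-per-cycle architecture.

#include <stdint.h>

#define WIDTH 640   /* assumed fixed image width, known in advance */

static uint8_t line0[WIDTH];   /* row y-2 */
static uint8_t line1[WIDTH];   /* row y-1 */

/* Feed one pixel of row y at column x; once the stream is far enough ahead,
 * 'win' holds the 3x3 window centred at (x-1, y-1) and 1 is returned. */
int stream_window(int x, int y, uint8_t pixel, uint8_t win[3][3])
{
    static uint8_t cur[WIDTH];          /* pixels of the current row seen so far */
    cur[x] = pixel;

    int ready = (y >= 2 && x >= 2);
    if (ready) {
        for (int c = 0; c < 3; c++) {
            win[0][c] = line0[x - 2 + c];
            win[1][c] = line1[x - 2 + c];
            win[2][c] = cur[x - 2 + c];
        }
    }
    if (x == WIDTH - 1) {               /* end of row: shift the line buffers */
        for (int i = 0; i < WIDTH; i++) { line0[i] = line1[i]; line1[i] = cur[i]; }
    }
    return ready;
}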
When applying window operations to an image, the size of the output image is smaller than
the input because data is not valid when the window extends beyond the image border. If necessary,
this may be mitigated by extending the input image to provide data to allow such border pixels to be
calculated. Prior work only considered border management using direct form filter structures, because
the window formation and filter function can be kept independent. However, in some applications,
transpose-form filter structures are desirable because the corresponding filter function is automatically
pipelined, leading to fewer resources and faster clock frequencies. Bailey and Ambikumar [4] provide
a design methodology for border management using transpose filter structures, and show that the
resource requirements are similar to those for direct-form border management.
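A common way to provide the missing input data is to extend the image by replicating (clamping to) the nearest border pixel before the window is formed; the short C sketch below illustrates that generic idea only, not the transpose-form method of [4].

#include <stdint.h>

/* Clamp a window coordinate to the valid image range so that reads beyond
 * the border return the nearest edge pixel (border replication). */
static inline int clampi(int v, int lo, int hi)
{
    return v < lo ? lo : (v > hi ? hi : v);
}

uint8_t pixel_with_border(const uint8_t *img, int w, int h, int x, int y)
{
    return img[clampi(y, 0, h - 1) * w + clampi(x, 0, w - 1)];
}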
An important task in computer vision is segmenting objects from a complex background. While
there are many background modelling algorithms, the complexity of robust algorithms makes them
difficult to realise on an FPGA, especially for larger image sizes. Chen et al. [5] address scalability issues
with increasing image size by using super-pixels—small blocks of adjacent pixels that are treated as a
single unit. As each super-pixel is considered to be either object or background, this means that fewer
models need to be maintained (less memory) and fewer elements need to be classified (reduced
computation time). Using hardware/software co-design, they accelerated the computationally
expensive steps of Gaussian filtering and calculating the mean and variance within each super-pixel
with hardware, with the rest of the algorithm being realised on the on-chip CPU. The resulting system
gave close to state-of-the-art classification accuracy.
A related paper, by Badawi and Bilal [6], used K-means clustering to segment objects within video
sequences. Rather than taking the conventional iterative approach to K-means clustering, they rely
on the temporal coherence of video streams and use the cluster centres from the previous frame as
initialisation for the current frame. Additionally, rather than waiting until the complete frame has
been accumulated before updating the cluster centres, an online algorithm is used, with the clusters
updated for each pixel. To reduce the computational requirements, the centres are updated using a
weighted average. They demonstrate that, for typical video streams, this gives similar performance to
conventional K-means algorithms, but with far less computation and power.
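A software analogue of that scheme is sketched below; it is a rough illustration only, and the cluster count, data type and exact weighting are assumptions rather than the authors' design.

#include <stdint.h>
#include <stdlib.h>

#define K 4

typedef struct {
    int32_t centre[K];   /* cluster centres, seeded from the previous frame */
    int32_t count[K];    /* pixels assigned to each cluster so far (start at 0) */
} OnlineKMeans;

/* Classify one pixel and update its cluster centre incrementally
 * (weighted running average) instead of iterating over the full frame. */
static int kmeans_update(OnlineKMeans *km, int32_t pixel)
{
    int best = 0;
    int32_t best_dist = abs(pixel - km->centre[0]);
    for (int k = 1; k < K; k++) {
        int32_t d = abs(pixel - km->centre[k]);
        if (d < best_dist) { best_dist = d; best = k; }
    }
    km->count[best]++;
    km->centre[best] += (pixel - km->centre[best]) / km->count[best];
    return best;   /* cluster label for this pixel */
}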
In another segmentation paper, Zhou et al. [7] describe the use of a generalised Laplacian of
Gaussian (LoG) filter for detecting cell nuclei for a histopathology application. The LoG filters detect
elliptical blobs at a range of scales and orientations. Local maxima of the responses are used as
candidate seeds for cell centres, and mean-shift clustering is used to combine multiple detections
from different scales and orientations. Their FPGA design gave modest acceleration over a software
implementation on a high-end computer.
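For reference, the standard isotropic Laplacian of Gaussian kernel can be generated as below; this is only the textbook isotropic form, whereas the generalised LoG used in [7] additionally scales and rotates the Gaussian to respond to elliptical blobs.

#include <math.h>

/* Fill an n x n kernel (n odd) with the standard isotropic Laplacian of
 * Gaussian at scale sigma. */
void log_kernel(double *k, int n, double sigma)
{
    const double PI = 3.14159265358979323846;
    int r = n / 2;
    double s2 = sigma * sigma;
    for (int y = -r; y <= r; y++) {
        for (int x = -r; x <= r; x++) {
            double q = (x * x + y * y) / (2.0 * s2);
            k[(y + r) * n + (x + r)] = -(1.0 / (PI * s2 * s2)) * (1.0 - q) * exp(-q);
        }
    }
}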
Given a segmented image, a common task is to measure feature vectors of each connected
component for analysis. Bailey and Klaiber [8] present a new single-pass connected components
analysis algorithm, which does this with minimum latency and relatively few resources. The key novelty
of this paper is the use of a zig-zag based scan, rather than a conventional raster scan. This eliminates the
end-of-row processing for label resolution by integrating it directly within the reverse scan. The result is
true single-pixel-per-clock-cycle processing, with no overheads at the end of each row or frame.
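For readers unfamiliar with label resolution, the merging needed when two provisional labels meet is conventionally handled with a union-find table, as in the generic sketch below; the zig-zag algorithm of [8] instead folds this resolution into the scan itself rather than deferring it to the end of each row.

/* Generic union-find label merging used during connected components
 * analysis.  New labels must be registered with parent[label] = label. */
#define MAX_LABELS 4096

static int parent[MAX_LABELS];

static int find_root(int a)
{
    while (parent[a] != a)
        a = parent[a] = parent[parent[a]];    /* path halving */
    return a;
}

static void merge_labels(int a, int b)
{
    a = find_root(a);
    b = find_root(b);
    if (a != b)
        parent[b < a ? a : b] = (b < a ? b : a);  /* smaller label becomes the root */
}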
An important real-time application of image processing is embedded online image compression
for reducing the data bandwidth for image transmission. In the final paper within this Special Issue,
Wang et al. [9] defined a new image compression codec which works efficiently with a streamed image,
and minimises the perceptual distortion within the reconstructed images. Through small local filters,
each pixel is classified as either an edge, a smooth region, or a textured region. These relate to a
perceptual model of contrast masking, allowing just noticeable distortion (JND) thresholds to be
defined. The image is compressed by downsampling; however, if the error in any of the contributing
pixels exceeds the visibility thresholds, the 2 × 2 block is considered a region of interest, with the
4 pixels coded separately. In both cases, the pixel values are predicted using a 2-dimensional predictor,
and the prediction residuals are quantised and entropy-encoded. Results typically give a visually
lossless 4:1 compression, which is significantly better than other visually lossless codecs.
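The region-of-interest test can be pictured with the minimal sketch below (assumed interface; the actual JND model and codec details are in [9]): a 2 × 2 block is kept downsampled only if every pixel's error stays within its visibility threshold.

#include <stdint.h>
#include <stdlib.h>

/* Decide whether a 2x2 block can be replaced by its downsampled average or
 * must be treated as a region of interest because some pixel's error would
 * exceed its just-noticeable-distortion (JND) threshold. */
int block_is_roi(const uint8_t px[4], const uint8_t jnd[4])
{
    int avg = (px[0] + px[1] + px[2] + px[3] + 2) / 4;   /* rounded mean */
    for (int i = 0; i < 4; i++)
        if (abs(px[i] - avg) > jnd[i])
            return 1;   /* visible error: code the 4 pixels separately */
    return 0;           /* the average alone is visually lossless for this block */
}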
3. Conclusions
Overall, this collection of papers reflects the diversity of approaches taken to applying FPGAs to
image processing applications. From one end, using the programmable logic to design lightweight
custom processors to enable parallelism, through overcoming some of the limitations of current
high-level synthesis tools, to the other end with the design of custom hardware designs at the
register-transfer level.
The range of image processing techniques include filtering, segmentation, clustering, and
compression. Applications include traffic sign recognition for autonomous driving, histopathology,
and video compression.
References
1. Siddiqui, F.; Amiri, S.; Minhas, U.I.; Deng, T.; Woods, R.; Rafferty, K.; Crookes, D. FPGA-based processor
acceleration for image processing applications. J. Imaging 2019, 5, 16. [CrossRef]
2. Garcia, P.; Bhowmik, D.; Stewart, R.; Michaelson, G.; Wallace, A. Optimized memory allocation and power
minimization for FPGA-based image processing. J. Imaging 2019, 5, 7. [CrossRef]
3. Shi, R.; Wong, J.S.; So, H.K.H. High-throughput line buffer microarchitecture for arbitrary sized streaming
image processing. J. Imaging 2019, 5, 34. [CrossRef]
4. Bailey, D.G.; Ambikumar, A.S. Border handling for 2D transpose filter structures on an FPGA. J. Imaging
2018, 4, 138. [CrossRef]
5. Chen, A.T.Y.; Gupta, R.; Borzenko, A.; Wang, K.I.K.; Biglari-Abhari, M. Accelerating SuperBE with
hardware/software co-design. J. Imaging 2018, 4, 122. [CrossRef]
6. Badawi, A.; Bilal, M. High-level synthesis of online K-Means clustering hardware for a real-time image
processing pipeline. J. Imaging 2019, 5, 38. [CrossRef]
7. Zhou, H.; Machupalli, R.; Mandal, M. Efficient FPGA implementation of automatic nuclei detection in
histopathology images. J. Imaging 2019, 5, 21. [CrossRef]
8. Bailey, D.G.; Klaiber, M.J. Zig-zag based single pass connected components analysis. J. Imaging 2019, 5, 45.
[CrossRef]
9. Wang, Z.; Tran, T.H.; Muthappa, P.K.; Simon, S. A JND-based pixel-domain algorithm and hardware
architecture for perceptual image coding. J. Imaging 2019, 5, 50. [CrossRef]
© 2019 by the author. Licensee MDPI, Basel, Switzerland. This article is an open access
article distributed under the terms and conditions of the Creative Commons Attribution
(CC BY) license (http://creativecommons.org/licenses/by/4.0/).
Journal of Imaging
Article
FPGA-Based Processor Acceleration for Image
Processing Applications
Fahad Siddiqui 1,†, Sam Amiri 2,†, Umar Ibrahim Minhas 1, Tiantai Deng 1, Roger Woods 1,*,
Karen Rafferty 1 and Daniel Crookes 1
1 School of Electronics, Electrical Engineering and Computer Science, Queen’s University Belfast,
Belfast BT7 1NN, UK; [email protected] (F.S.); [email protected] (U.I.M.); [email protected] (T.D.);
[email protected] (K.R.); [email protected] (D.C.)
2 School of Computing, Electronics and Maths, Coventry University, Coventry CV1 5FB, UK;
[email protected]
* Correspondence: [email protected]; Tel.: +44-289-097-4081
† These authors contributed equally to this work.
Abstract: FPGA-based embedded image processing systems offer considerable computing resources
but present programming challenges when compared to software systems. The paper describes an
approach based on an FPGA-based soft processor called Image Processing Processor (IPPro) which can
operate up to 337 MHz on a high-end Xilinx FPGA family and gives details of the dataflow-based
programming environment. The approach is demonstrated for a k-means clustering operation and
a traffic sign recognition application, both of which have been prototyped on an Avnet Zedboard
that has Xilinx Zynq-7000 system-on-chip (SoC). A number of parallel dataflow mapping options
were explored giving a speed-up of 8 times for the k-means clustering using 16 IPPro cores, and a
speed-up of 9.6 times for the morphology filter operation of the traffic sign recognition using
16 IPPro cores compared to their equivalent ARM-based software implementations. We show that for
k-means clustering, the 16 IPPro cores implementation is 57, 28 and 1.7 times more power efficient
(fps/W) than ARM Cortex-A7 CPU, nVIDIA GeForce GTX980 GPU and ARM Mali-T628 embedded
GPU respectively.
1. Introduction
With improved sensor technology, there has been a considerable growth in the amount of data
being generated by security cameras. In many remote environments with limited communication
bandwidth, there is a clear need to overcome this by deploying functionality remotely in the system,
such as motion estimation in smart cameras [1]. As security requirements grow, the processing
needs will only increase.
New forms of computing architectures are needed. In the late 1970s, Lamport [2] laid the foundation
of parallel architectures exploiting data-level parallelism (DLP) using workload vectorisation and
shared-memory parallelisation, techniques used extensively in Graphics Processing Units (GPUs). Current energy
requirements and limitations of Dennard scaling have acted to limit clock scaling and thus reduce
future processing capabilities of GPUs or multi-core architectures [3]. Recent field programmable gate
array (FPGA) architectures represent an attractive alternative for acceleration as they comprise ARM
processors and programmable logic for accelerating compute-intensive operations.
FPGAs are proven computing platforms that offer reconfigurability, concurrency and pipelining,
but have not been accepted as a mainstream computing platform. The primary inhibitor is the need to
use specialist programming tools, describing algorithms in a hardware description language (HDL), although
this has been alleviated by the introduction of high-level programming tools such as Xilinx’s Vivado
High-level Synthesis (HLS) and Intel’s (Altera’s) compiler for OpenCL. While the level of abstraction
has been raised, a gap still exists between adaptability, performance and efficient utilisation of FPGA
resources. Nevertheless, the FPGA design flow still requires design synthesis and place-and-route that
can be time-consuming depending on the complexity and size of the design [4,5]. This FPGA design
flow is alien to software/algorithm developers and inhibits wider use of the technology.
One way to approach this research problem is to develop an adaptable FPGA hardware architecture
that enables the edit-compile-run flow familiar to software and algorithm developers, instead of hardware
synthesis and place-and-route. This can be achieved by populating the FPGA logic with a number of efficient
soft core processors used for programmable hardware acceleration. This underlying architecture will
be adaptable and can be programmed using conventional software development approaches. However,
the challenge is to build an FPGA solution that is more easily programmed whilst still providing high
performance. Whilst FPGA-based processor architectures exist such as Xilinx’s MicroBlaze, Altera’s
NIOS and others [6–9], we propose an Image Processing Processor (IPPro) [10] tailored to
accelerate image processing operations, thereby providing an excellent mapping between FPGA
resources, speed and programming efficiency. The main purpose of the paper is to give insights into
the multi-core processor architecture built using the IPPro architecture, its programming environment
and outline its applications to two image processing applications. Our main contributions are:
• Creation of an efficient, FPGA-based multicore processor which advances previous work [10],
[11] and an associated dataflow-based compiler environment for programming a heterogeneous
FPGA resource comprising it and ARM processors.
• Exploration of mapping the functionality for a k-means clustering function, resulting in a possible
speedup of up to 8 times that is 57, 28 and 1.7 times more power efficient (fps/W) than ARM
Cortex-A7 CPU, nVIDIA GeForce GTX980 GPU and ARM Mali-T628 embedded GPU.
• Acceleration of colour and morphology operations of traffic sign recognition application, resulting
in a speedup of 4.5 and 9.6 times respectively on a Zedboard.
The rest of the paper is organized as follows: Section 2 outlines the various image processing
requirements and describes how these can be matched to the FPGA; relevant research is also reviewed.
System requirements are outlined in Section 3 and the soft core processor architecture is also briefly
reviewed in Section 4. The system architecture is outlined in Section 5. Experiments to accelerate a
k-means clustering algorithm and a traffic sign recognition example, are presented in Sections 6 and 7
respectively. Conclusions and future work are described in Section 8.
2. Background
Traditionally, vision systems have been created in a centralized manner where video from
multiple cameras is sent to a central back-end computing unit to extract significant features. However,
with an increasing number of nodes and wireless communications, this approach becomes increasingly
limited, particularly with higher resolution cameras [12]. A distributed processing approach can be
employed where data-intensive, front-end preprocessing such as sharpening, object detection etc. can
be deployed remotely, thus avoiding the need to transmit high-data-rate video streams back to the server.
• Customised hardware accelerator designs in HDLs which require long development times but
can be optimised in terms of performance and area.
• Application specific hardware accelerators which are generally optimized for a single function,
non-programmable and created using IP cores.
• Designs created using high-level synthesis tools such as Xilinx’s Vivado HLS tool and
Altera’s OpenCL compiler which convert a C-based specification into an RTL implementation
synthesizable code [15] allowing pipelining and parallelization to be explored.
• Programmable hardware accelerator in the form of vendor specific soft processors such as
Xilinx’s Microblaze and Altera’s NIOS II processors and customized hard/soft processors.
Table 1. Categorisation of image processing operations based on their memory and execution
patterns [13], allowing features of compute and memory patterns to be highlighted and thereby
identifying what can be mapped onto the FPGA.
a GPGPU architecture called FlexGrip [8] which, like vector processors, supports wide data-parallel,
SIMD-style computation using multiple parallel compute lanes, provides support for conditional
operations, and requires optimized interfaces to on- and off-chip memory. FlexGrip maps pre-compiled
CUDA kernels on soft core processors which are programmable and operate at 100 MHz.
3. System Implementation
Whilst earlier versions of FPGAs just comprised multiple Lookup Tables (LUT) connected to
registers and accelerated by fast adders, FPGAs now comprise more coarse-grained functions such as
dedicated, full-custom, low-power DSP slices. For example, the Xilinx DSP48E1 block comprises a
25-bit pre-adder, a 25 × 18-bit multiplier and a 48-bit adder/subtracter/logic unit. FPGAs also provide
multiple distributed RAM blocks which offer high bandwidth capability (Figure 1), and a plethora of
registers which support high levels of pipelining.
Figure 1. Bandwidth/memory distribution in the Xilinx Virtex-7 FPGA, which highlights how bandwidth
and computation improve as we move nearer the datapath parts of the FPGA.
Whilst FPGAs have been successfully applied in embedded systems and communications,
they have struggled as a mainstream computational platform. Addressing the following considerations
would make FPGAs a major platform rival for “data-intensive” applications:
• Programmability: there is a need for a design methodology which includes a flexible data
communication interface to exchange data. Intellectual Property (IP) cores and HLS tools [15]/
OpenCL design routes increase programming abstraction but do not provide the flexible system
infrastructure for image processing systems.
• Dataflow support: the dataflow model of computation is a recognized model for data-intensive
applications. Algorithms are represented as a directed graph composed of nodes (actors) as
computational units and edges as communication channels [21]. While the actors run explicitly in
parallel decided by the user, actor functionality can either be sequential or concurrent. Current
FPGA realizations use the concurrency of the whole design at a higher level but eliminate
reprogrammability. A better approach is to keep reprogrammability while still maximizing
parallelism by running actors on simple “pipelined” processors; the actors still run their code
explicitly in parallel (user-specified).
• Heterogeneity: the processing features of FPGAs should be integrated with CPUs. Since dataflow
supports both sequential and concurrent platforms, the challenge is then to allow effective
mapping onto CPUs with parallelizable code onto FPGA.
• Toolset availability: design tools created to specifically compile user-defined dataflow programs at
higher levels to fully reprogrammable heterogeneous platform should be available.
is a standalone entity, which defines an execution procedure and can be implemented in the IPPro
processor. Actors communicate with other actors by passing data tokens, and execution is driven by
tokens passed through First-In-First-Out (FIFO) units. The combination of a set of actors with a set of
connections between them constructs a network, which maps well to the system-level architecture of
the IPPro processors. An earlier version of the programming environment is detailed in [11], allowing
the user to explore parallel implementations and providing the necessary back-end compilation support.
In our flow, every processor can be thought of as an actor and data is fired through the FIFO
structures but the approach needs to be sensitive to FPGA-based limitations such as restricted memory.
Cal Actor Language (CAL) [22] is a dataflow programming language that has been focused on image
processing and FPGAs, and it offers the necessary constructs for expressing parallel or sequential
coding, bitwise types, a consistent memory model, and communication between parallel tasks
through queues. RVC-CAL is supported by an open source dataflow development environment and
compiler framework, Orcc, that allows the trans-compilation of actors and generates equivalent code
depending on the chosen back-ends [23]. An RVC-CAL based design is composed of a dataflow
network file (.xdf file) that supports task and data-level parallelism.
Figure 2 illustrates the possible pipelined decomposition of dataflow actors. These dataflow
actors need to be balanced as the worst-case execution time of the actor determines the overall
achievable performance. Data-level parallelism is achieved by making multiple instances of an actor
and requires SIMD operations that must be supported by the underlying processor architecture.
In addition, it requires a software-configurable system-level infrastructure that manages control and data
distribution/collection tasks. It involves the initialisation of the soft core processors (programming the
decomposed dataflow actor description), receiving data from the host processor, distributing them to
first-level actors, gathering processed data from the final-level actors and sending it back to the host processor.
Data-level parallelism directly impacts the system performance; the major limiting factor is the
number of resources available on FPGA. An example pipeline structure with an algorithm composed
of four actors each having different execution times, and multiple instances of the algorithm realised
in SIMD fashion is shown in Figure 2. The performance metric, frames-per-second (fps), can be
approximated using N_total_pixels, the number of pixels in a frame, N_pixel_consumption, the number of
pixels consumed by an actor in each iteration, and f_processor, the operating frequency of the processor:

    fps ≈ (f_processor × N_pixel_consumption) / N_total_pixels        (1)
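As a quick sanity check of Equation (1) with assumed numbers (one pixel consumed per firing of a 512 × 512 image on a 337 MHz core), and ignoring the cycles each firing actually takes as well as data-transfer overheads:

#include <stdio.h>

int main(void)
{
    double f_processor         = 337e6;          /* Hz, IPPro clock            */
    double n_pixel_consumption = 1.0;            /* pixels consumed per firing */
    double n_total_pixels      = 512.0 * 512.0;  /* assumed frame size         */

    /* Equation (1); an upper bound since instruction counts per pixel and
     * transfer overheads are not modelled here. */
    double fps = f_processor * n_pixel_consumption / n_total_pixels;
    printf("fps upper bound ~ %.0f\n", fps);     /* about 1285 */
    return 0;
}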
To improve the fps, the following options are possible:
• Efficient FPGA-based processor design that operates at a higher operating frequency f_processor.
• Reducing the actor’s execution time by decomposing it into multiple pipelined stages, thus reducing
t_actor to improve the fps. Shorter actors can be merged sequentially to minimise the data transfer
overhead by localising data into FIFOs between processing stages.
• Vertical scaling to exploit data parallelism by mapping an actor onto multiple processor cores, thus
reducing N_total_pixels / (n × N_pixel_consumption) at the cost of additional system-level data
distribution, control, and collection mechanisms.
Figure 2. Illustration of possible data and task parallel decomposition of a dataflow algorithm found in
image processing designs, where the number of rows indicates the level of parallelism.
The developed tool flow (Figure 3) starts with a user-defined RVC-CAL description composed
of actors selected to execute in FPGA-based soft cores with the rest to be run in the host CPUs.
By analyzing behaviour, software/hardware partitioning is decided by two main factors: the actors
with the worst execution time (determined exactly by the number of instructions and the average waiting
time to receive the input tokens and send the produced tokens), and the overheads incurred in
transferring the image data to/from the accelerator. The behavioural description of an algorithm could
be coded in different formats.
[Figure 3 blocks: Behavioural Description in RVC-CAL; Software/Hardware Partitioning; Redesign of CPU-Targeted Actors in RVC-CAL; Redesign of FPGA-Targeted Actors in RVC-CAL; SIMD Application; Compiler Infrastructure (RVC-CAL–C Compilation, XDF Analysis, Actor Code Generation); Interface Settings; System Implementation; Control-Register Value/Parameter Generation.]
Figure 3. A brief description of the design flow of a hardware and software heterogeneous system
highlighting key features. More detail of the flow is contained in reference [11].
There are two types of decomposition, “row-” and “column-wise”. The newly generated data-
independent actors can be placed row-wise at the same pipeline stage; otherwise they can be placed
column-wise as consecutive pipeline stages. Row-wise is preferred as the overhead incurred in token
transmission can be a limiting factor but typically a combination is employed.
If the actors or actions are not balanced, then they need to be decomposed. This is done by
detecting a sequence of instructions without branches (unless this occurs at the end) and then breaking
the program into basic blocks. The “balance points”, at which the actor needs to be divided into
multiple sets of basic blocks such that each set can be placed in a new actor, then need to be found;
this ensures that the overhead of transferring tokens among the sets will not create a bottleneck,
and infers the selection and use of the partitioning with the lowest overhead (see Ref. [11]). Once the graph is
partitioned, the original xdf file no longer represents the network topology, so each set of actors must
be redesigned separately, their input/output ports fixed, and a new set of xdf dataflow network
description files generated. The actors to run on the host CPU are compiled from RVC-CAL to C
using the C backend of the Orcc development environment, whereas the FPGA-based functionality is
created using the proposed compiler framework.
The degree of SIMD applied will affect the controller interface settings. For a target board,
the design will have a fixed number of IPPro cores realized and interconnected with each other and
controllers, determined by the FPGA resources and fan-out delay; for the Zedboard considered here,
32 cores are selected. The compilation infrastructure is composed of three distinctive steps:
• Examination of the xdf dataflow network file and assignment and recording of the actor mapping
to the processors on the network.
• Compilation of each actor’s RVC-CAL code to IPPro assembly code.
• Generation of control register values, mainly for AXI Lite Registers, and parameters required by
the developed C-APIs running on the host CPU.
While FPGA-targeted actor interaction is handled by the compiler, the processes for receiving
the image data and storing the output in the edge actors need to be developed. Multiple controllers
(programmable by the host CPU) are designed to provide the interface to transfer the data to the
accelerators, gather the results and transfer them back to the host. With the host CPU running part
of the design and setting control registers, and the IPPro binary codes of the other actors loaded to
the proper cores on the accelerator, and the interface between the software/hardware sections set
accordingly, the system implementation is in place and ready to run.
Figure 4. (a) Impact of DSP48E1 configurations on maximum achievable clock frequency using
different speed grades using Kintex-7 FPGAs for fully pipelined with no (NOPATDET) and with
(PATDET) PATtern DETector, then multiply with no MREG (MULT_NOMREG) and pattern detector
(MULT_NOMREG_PATDET) and a Multiply, pre-adder, no ADREG (PREADD_MULT_NOADREG)
(b) Impact of BRAM configurations on the maximum achievable clock frequency of Artix-7, Kintex-7
and Virtex-7 FPGAs for single and true-dual port RAM configurations.
Table 2. Computing resources (DSP48E1) and BRAM memory resources for a range of Xilinx Artix-7,
Kintex-7, Virtex-7 FPGA families implemented using 28nm CMOS technology.
depends on the input or neighbouring pixels. This model is only suitable for mapping a single
dataflow node.
The second model 2 increases the datapath functionality to a fine-grained processor by including
BRAM-based instruction memory (IM), program counter PC and kernel memory (KM) to store constants as
shown in Figure 6b. Conversely, 2 can support mapping of multiple data independent dataflow nodes
as shown in Figure 5b. The node (OP2) requires a memory storage to store a variable (t1) to compute
the output token (C) which feeds back from the output of the ALU needed for the next instruction in
the following clock cycle. This model supports improved dataflow mapping functionality over 1 by
introducing an IM which comes at the cost of variable execution time and throughput proportional
to the number of instructions required to implement the dataflow actor. This model is suitable for
accelerating combinational logic computations.
The third model 3 increases the datapath functionality to map and execute a data dependent
dataflow actor as shown in Figure 5c. The datapath has memory in the form of a register file (RF) which
represents a coarse-grained processor shown in Figure 6c. The RF stores intermediate results to execute
data dependent operations, implements (feed-forward, split, merge and feedback) dataflow execution
patterns and facilitates dataflow transformations (actor fusion/fission, pipelining etc.), constrained
by the size of the RF. It can implement modular computations which are not possible in 1 and 2.
In contrast to 1 and 2, the token production/consumption (P/C) rate of 3 can be controlled through
program code, which allows software-controlled scheduling and load balancing possibilities.
Figure 5. A range of dataflow models taken from [24,25]. (a) DFG node without internal storage,
called configuration 1; (b) DFG actor with internal storage t1 and constant i, called configuration 2;
(c) Programmable DFG actor with internal storage t1, t2 and t3 and constants i and j, called
configuration 3.
in Verilog HDL, synthesised and placed and routed using the Xilinx Vivado Design Suite v2015.2 on
Xilinx chips installed on widely available development kits which are Artix-7 (Zedboard), Kintex-7
(ZC706) and Virtex-7 (VC707). The obtained f max results are reported in Figure 7.
In this analysis, f_max is considered as the performance metric for each processor datapath model,
and shows a reduction of 8% and 23% for 2 and 3 compared to 1 using the same FPGA technology.
For 2, the addition of memory elements, specifically the IM realised using dedicated BRAM, affects
f_max by ≈8% compared to 1. Nevertheless, the instruction decoder (ID), which is a combinational part
of the datapath, significantly increases the critical path length of the design. A further 15% f_max
degradation from 2 to 3 results from adding the memory elements KM and RF to support control- and
data-dependent execution, which requires additional control logic and data multiplexers. Comparing
different FPGA fabrics, an f_max reduction of 14% and 23% is observed for Kintex-7 and Artix-7. When 3
is ported from Virtex-7 to Kintex-7 and Artix-7, a maximum f_max reduction of 5% and 33% is observed.
This analysis has laid firm foundations by comparing different processor datapath and dataflow
models and how they impact the f max of the resultant soft-core processor. The trade-off analysis
shows that an area-efficient, high-performance soft core processor architecture can be realised that
supports requirements to accelerate image pre-processing applications. Among the presented models,
3 provides the best balance among functionality, flexibility, dataflow mapping and optimisation
possibilities, and performance. This model is used to develop a novel FPGA-based soft core IPPro
architecture in Section 4.3.
[Figure 7: maximum frequency (MHz) of Models 1, 2 and 3 on the different FPGA fabrics.]
Figure 8. Block diagram of FPGA-based soft core Image Processing Processor (IPPro) datapath
highlighting where relevant the fixed Xilinx FPGA resources utilised by the approach.
Table 3 outlines the relationship between data abstraction and the addressing modes, along with
some supported instructions for the IPPro architecture, facilitating programmable implementation of
point and area image processing algorithms. The stream access reads a stream of tokens/pixels from
the input FIFO using the GET instruction and allows processing either with constant values (Kernel
Memory-FIFO) or neighbouring pixel values (Register File-FIFO or Register File-Register File).
The processed stream is then written to the output FIFO using the PUSH instruction. The IPPro supports
arithmetic, logical, branch and data handling instructions. The presented instruction set is optimized
after profiling use cases presented in [10,26].
Table 3. IPPro supported addressing modes highlighting the relation to the data processing
requirements and the instruction set.
The IPPro supports branch instructions to handle control flow graphs to implement commonly
known constructs such as if-else and case statements. The DSP48E1 block has a pattern detector that
compares the input operands or the generated output results depending on the configuration and
sets/resets the PATTERNDETECT (PD) bit. The IPPro datapath uses the PD bit along with some
additional control logic to generate four flags zero (ZF), equal (EQF), greater than (GTF) and sign (SF)
bits. When the IPPro encounters a branch instruction, the branch controller (BC) compares the flag
status and branch handler (BH) updates the PC as shown in Figure 8.
The IPPro architecture has been coded in Verilog HDL and synthesized using the Xilinx Vivado
v2015.4 design suite on Kintex-7 FPGA fabric, giving an f_max of 337 MHz. Table 4 shows that the
IPPro architecture has achieved a 1.6–3.3 times higher operating frequency (f_max) than the relevant
processors highlighted in Section 2.2 by adopting the approach presented in Section 4. Comparing
the FPGA resource usage of Table 4, the flip-flop utilisation (FF) is relatively similar except for the
FlexGrip which uses 30× more flip-flops. Considering LUTs, the IPPro uses 50% less LUT resources
compared to MicroBlaze and GraphSoC. Analysing design area efficiency, a significant difference
(0.76–9.00) in BRAM/DSP ratio is observed among the processors, which makes IPPro an area-efficient
design based on the proposed metric.
Table 4. Comparison of IPPro against other FPGA-based processor architectures in terms of FPGA
resources used and timing results achieved.
(a) Performance on micro-benchmarks

Processor                  MicroBlaze      IPPro
FPGA Fabric                Kintex-7        Kintex-7
Freq (MHz)                 287             337

Micro-benchmark            Exec. Time (us)              Speed-up
                           MicroBlaze      IPPro
Convolution                0.60            0.14         4.41
Degree-2 Polynomial        5.92            3.29         1.80
5-tap FIR                  47.73           5.34         8.94
Matrix Multiplication      0.67            0.10         6.7
Sum of Abs. Diff.          0.73            0.77         0.95
Fibonacci                  4.70            3.56         1.32

(b) FPGA resources

Resource      MicroBlaze    IPPro    Ratio
FFs           746           422      1.77
LUTs          1114          478      2.33
BRAMs         4             2        2.67
DSP48E1       0             1        0.00
5. System Architecture
The k-means clustering and Traffic Sign Recognition algorithms have been used to explore and
analyse the impact of both data and task parallelism using a multi-core IPPro implemented on a
ZedBoard. The platform has a Xilinx Zynq XC7Z020 SoC device interfaced to a 256 MB flash memory
and 512 MB DDR3 memory. The SoC is composed of a host processor known as programmable system
(PS) which configures and controls the system architecture, and the FPGA programmable logic (PL)
on which the IPPro hardware accelerator is implemented, as illustrated in Figure 9. The SoC data
communication bus (ARM AMBA-AXI) transfers the data between PS and PL using the AXI-DMA
protocol and the Xillybus IP core is deployed as a bridge between PS and PL to feed data into the
image processing pipeline. The IPPro hardware accelerator is interfaced with the Xillybus IP core
via FIFOs. The Linux application running on PS streams data between the FIFO and the file handler
opened by the host application. The Xillybus-Lite interface allows control registers from the user space
program running on Linux to manage the underlying hardware architecture.
Figure 9 shows the implemented system architecture which consists of the necessary control
and data infrastructure. The data interfaces involve stream (Xillybus-Send and Xillybus-Read);
uni-directional memory mapped (Xillybus-Write) to program the IPPro cores; and Xillybus-Lite
to manage Line buffer, scatter, gather, IPPro cores and the FSM. Xillybus Linux device drivers are used
to access each of these data and control interfaces. An additional layer of C functions is developed
using Xillybus device drivers to configure and manage the system architecture, program IPPro cores
and exchange pixels between PS and PL.
Figure 9. System architecture of IPPro-based hardware acceleration highlighting data distribution and
control infrastructure, FIFO configuration and Finite-State-Machine control.
Control Infrastructure
To exploit parallelism, a configurable control infrastructure has been implemented using the PL
resources of the Zynq SoC. It statically decomposes the data into many equal-sized parts, where each
part can be processed by a separate processing core. A row-cyclic data distribution [28] has been used
because it allows buffering of data/pixels in a pattern suitable for point and area image processing
operations after storing them into the line buffers. The system-level architecture (Figure 9) is composed
of line buffers, a scatter module to distribute the buffered pixels, a gather module to collect the
processed pixels and a finite-state-machine (FSM) to manage and synchronise these modules.
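In software terms, the row-cyclic scatter amounts to the sketch below; send_row is a hypothetical placeholder for whatever mechanism actually feeds a core's input FIFO, and is not part of the authors' implementation.

#include <stddef.h>
#include <stdint.h>

/* Distribute image rows over n_cores in a row-cyclic pattern: core k gets
 * rows k, k + n_cores, k + 2*n_cores, ... */
void scatter_rows_cyclic(const uint8_t *image, int width, int height, int n_cores,
                         void (*send_row)(int core, const uint8_t *row, int width))
{
    for (int y = 0; y < height; y++) {
        int core = y % n_cores;                       /* row-cyclic assignment */
        send_row(core, image + (size_t)y * width, width);
    }
}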
Table 6. Dataflow actor mapping and supported parallelism of IPPro hardware accelerator design
presented in Figure 11.
Design   Acceleration Paradigm          Mapping        Data Parallelism   Task Parallelism
1        Single core IPPro              Single actor   No                 No
2        8-way SIMD IPPro               Single actor   Yes                No
3        Dual core IPPro                Dual actor     No                 Yes
4        Dual core 8-way SIMD IPPro     Dual actor     Yes                Yes
[Figure 10 content: the Distance and Averaging actors of the TopKMeansOrcc.xdf network, the Distance.cal RVC-CAL source in which each pair of pixels is compared against the four initial centroids [31, 40, 50, 76] using a Manhattan distance, and the compiled Distance.ippro IPPro assembly (GET/STR/SUB/MUL sequence).]
Figure 10. High-level implementation of k-means clustering algorithm: (a) Graphical view of Orcc
dataflow network; (b) Part of dataflow network including the connections; (c) Part of Distance.cal file
showing distance calculation in RVC-CAL where two pixels are received through an input FIFO channel,
processed and sent to an output FIFO channel; (d) Compiled IPPro assembly code of Distance.cal.
Figure 11. IPPro-based hardware accelerator designs to explore and analyse the impact of parallelism
on area and performance, based on Single core IPPro 1, eight-way parallel SIMD IPPro 2, parallel
Dual core IPPro 3 and combined Dual core 8-way SIMD IPPro 4.
one of the four clusters which take an average of 45 cycles/pixel. To classify the whole image, it takes
118.2 ms, which corresponds to 8.45 fps. On the other hand, the averaging takes four tokens and
produces four new cluster values, which takes an average of 55 clock cycles/pixel, resulting in 145 ms or
6.88 fps. Both stages involve point-based pixel processing. Therefore design 2 was developed
and used to exploit data-level parallelism. As a result, the execution time is reduced to 23.32 ms
and 27.02 ms for distance calculation and averaging respectively. This is an improvement of 5.1 and
5.4 times over 1 (and not the expected 8 times of the 8-way SIMD implementation, 2 over 1), as
the overhead of data transfer time from/to the accelerator restricts the performance improvement.
This came at the cost of 4.1, 2.3 and 8.0 times more BRAMs, LUTs and DSP blocks respectively,
as reported in Table 8. The major contributor to the increased area utilisation is the data distribution
and control infrastructure.
Table 8 reports the execution time and performance (fps) numbers of both stages together to
exploit task and data parallelism using designs 3 and 4. The reported results of 1 and 2 were
obtained by combining the execution times of both stages previously reported in Table 7. Using design
3, the effect of task parallelism implemented via an intermediate FIFO results in an average of 63 clock
cycles/pixel, which is 163 ms (6 fps). By pipelining both actors, 3 has achieved 1.6 times better
performance compared to 1 at the cost of 1.6 and 2.0 times more BRAM and DSP blocks, using the
same Xillybus IP infrastructure as 1. The reason for the improvement is the localisation of intermediate
data within the FPGA fabric using an intermediate FIFO, which hides the data transfer overhead to and
from the host processor, as shown in Figure 11. Investigating the reported area utilisation numbers in
Table 8 shows that the area utilisation for designs 3 and 4 is not twice as big as 1 and 2 respectively,
due to the FPGA resources utilised by the input and output data ports of the Xillybus IP. Designs 1 and
3 require a single input and output data port, while 2 and 4 require eight input and output data
ports. Therefore, a part of the FPGA logic used by the Xillybus IP is constant/fixed for 1 and 3, and for 2 and 4.
Analysing the impact of exploiting both task and data-level parallelism using 4 results on average in
14 clock cycles/pixel and an execution time of 35.9 ms (28 fps). It is 1.4, 4.5 and 7.3 times better than
2, 3 and 1 respectively. For comparison, both stages were coded in C language and executed on
an embedded ARM Cortex-A7 processor, which achieved an execution time of 286 ms (3.5 fps), which is
8 times slower than the performance achieved by 4.
Table 7. Performance of the two k-means stages (single actor) on the Single core IPPro 1 and
8-way SIMD IPPro 2 designs.

Single Actor            1 Single Core IPPro         2 8-Way SIMD IPPro
                        Exec. (ms)     fps          Exec. (ms)     fps
Distance Calculation    118.21         8.45         23.37          42.78
Averaging               145.17         6.88         27.02          37.00
Table 8. Area utilisation and performance results of IPPro-based hardware accelerator designs in
Figure 11 exploiting data and task parallelism, namely 1, 2, 3 and 4.

k-Means Acceleration                          LUT      FF      BRAM   DSP   Exec. (ms)   fps
1 Combined stages using Single-core IPPro    4736     5197    4.5    1     263.38       3.8
2 Combined stages using 8-way SIMD IPPro     10,941   12,279  18.5   8     50.39        19.8
3 Dual-core IPPro                            4987     5519    4.5    2     163.2        6
4 Dual 8-way SIMD IPPro                      13,864   16,106  18.5   16    35.9         28
Software implementation on ARM Cortex-A7      -        -       -      -     286          3.5
Table 9. Power, resource and combined efficiency comparisons of IPPro-based k-means clustering
implementations on Zedboard (Xilinx Zynq XC7Z020 Artix-7).
Table 10. Power, resource and combined efficiency comparisons for k-means clustering for Xilinx Zynq
XC7Z045 Kintex-7 FPGA, nVIDIA GPU GTX980, embedded ARM Mali-T628 GPU and embedded ARM
Cortex-A7 CPU.
Both FPGA implementations achieved 6 and 27 times better fps performance than the embedded
CPU, whilst the embedded GPU delivered 6.7 times better performance over the FPGA by exploiting
parallelism and higher operating frequency. Focusing on the power consumption results, the FPGA
consumed 2.1 and 4.9 times less power than both the embedded CPU and embedded GPU respectively.
It shows that the FPGA technology delivers a power-optimised solution while the GPU approach
provides a performance-optimised solution. Considering the performance and power together,
the power efficiency (fps/W) numbers show that the FPGA and embedded GPU implementations
are 57 and 33 times more power efficient than the embedded CPU, and that the FPGA implementation is
24 times more power efficient than the embedded GPU. Nevertheless, this power efficiency edge can be
further improved by applying dataflow transformations and increasing the number of IPPro cores.
Table 10 compares the FPGA results against desktop GPU and reports resource efficiency as a
metric due to significant difference in the power consumption numbers. The resource efficiency has
been presented in terms of frames-per-second-per-Transistor-Utilisation (fps/TU) which is 6 and 63 for
the 28 nm FPGA and GPU technologies. For embedded CPU and GPU, these results are not reported
due to the unavailability of transistor count numbers for the ARM. The reported resource efficiency
results show that the GPU utilises area resources more efficiently than the FPGA when power is kept
out of the equation. Combining all three metrics (fps/W/TU) shows that the advantage gained from
FPGA designs is significant i.e., 22 times more efficient than GPU. This advantage becomes more
valuable when it is acknowledged that the FPGA-based SoC design is adaptable and allows exploration,
profiling and implementation of different dataflow transformation possibilities over dedicated FPGA
approaches to accelerate image processing applications for low energy applications.
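The three efficiency metrics used in this comparison (fps/W, fps/TU and fps/W/TU) are simple ratios;
the sketch below only illustrates how they compose. The numeric inputs are placeholders, not the
measured values from Tables 9 and 10.

# Sketch of the efficiency metrics discussed above; input figures are placeholders.
def power_efficiency(fps, watts):
    """Frames per second per watt (fps/W)."""
    return fps / watts

def resource_efficiency(fps, tu):
    """Frames per second per transistor utilisation (fps/TU)."""
    return fps / tu

def combined_efficiency(fps, watts, tu):
    """Frames per second per watt per transistor utilisation (fps/W/TU)."""
    return fps / (watts * tu)

fpga = {"fps": 28.0, "watts": 0.8, "tu": 0.05}    # placeholder numbers
gpu  = {"fps": 190.0, "watts": 60.0, "tu": 0.03}  # placeholder numbers
for name, d in (("FPGA", fpga), ("GPU", gpu)):
    print(name,
          power_efficiency(d["fps"], d["watts"]),
          combined_efficiency(d["fps"], d["watts"], d["tu"]))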
[Figure 12 diagram: the traffic sign recognition pipeline, comprising a detection stage (pixel stream,
colour filter, morphology, edge detection/contour filling, circle detection, bounding boxes) and a
recognition stage (classification by shape, template matching).]
Figure 12. Section execution times and ratios for each stage of the traffic sign recognition algorithm.
The simplified code generated by the RVC CAL-IPPro compilation is shown in Figure 13a. The GET and
PUSH instructions set the input or output port numbers through which tokens are received or sent.
The GET instructions read nine pixel values and store them in the register file, in R1 to R9. The
corner pixels are then ignored to impose a 3 × 3 circular mask, and the maximum value among the remaining
pixels, max(R2, R4, R5, R6, R8), is computed and stored in R27 to apply the dilation operation. This value is
then pushed to the output using the PUSH instruction. The output of the implemented design is
shown in Figure 13b.
DILATION3X3:
GET R1, 1
GET R2, 1
GET R3, 1
GET R4, 1
GET R5, 1
GET R6, 1
GET R7, 1
GET R8, 1
GET R9, 1
MAX R30, R2, R5
MAX R29, R4, R6
MAX R28, R30, R8
MAX R27, R28, R29
PUSH R27, 1
JMP DILATION3X3
Figure 13. (a) The simplified IPPro assembly code of the 3 × 3 dilation operation. (b) The output of the
implemented design.
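For reference, the computation performed by the assembly in Figure 13a can be restated in a few lines
of Python. This is only an illustrative restatement of the operation (a grey-scale dilation over the
cross-shaped neighbourhood obtained by discarding the window corners), not generated code.

# Illustrative restatement of the 3 x 3 dilation of Figure 13a: the four corner
# pixels of the window are ignored, and the output is the maximum of the
# remaining cross-shaped ("circular") neighbourhood, mirroring
# max(R2, R4, R5, R6, R8) in the assembly listing.
def dilate3x3_cross(window):
    """window: 3 x 3 list of pixel values [[p1, p2, p3], [p4, p5, p6], [p7, p8, p9]]."""
    (_, p2, _), (p4, p5, p6), (_, p8, _) = window
    return max(p2, p4, p5, p6, p8)

print(dilate3x3_cross([[9, 1, 7],
                       [2, 3, 4],
                       [8, 5, 6]]))   # prints 5; corners 9, 7, 8 and 6 are ignored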
Table 11 presents the results from the Zedboard implementation, tested with a set
of real images. The hardware-accelerated implementation of the colour filter stage using 32 IPPro cores
reduces the execution time from 88.87 ms down to 19.71 ms compared to the software implementation
on the on-chip ARM Cortex-A9. Similarly, the morphology filter stage using 16 IPPro cores reduces the
execution time from 399 ms down to 41.3 ms. The presented IPPro-based hardware acceleration design
achieves a speed-up of 4.5 and 9.6 times over the ARM for the colour and morphology filters respectively.
The speed-up for the colour filter stage using 32 cores is lower than that of the morphology stage
using 16 cores because more clock cycles are spent on every pixel in the colour filter stage;
this is due to the longer execution time of the division coprocessor used for colour filtering.
Table 11. IPPro-based acceleration of colour and morphology operations implemented on Zedboard.
Figure 14 shows the stage-wise acceleration of traffic sign recognition obtained by accelerating the colour
and morphology filters. The edge/contour detection and bounding box stages were partially improved by
accelerating the morphology operations. Edge detection is based on the morphology operations,
taking the difference between erosion and dilation; therefore, the morphology results obtained by
acceleration are further exploited in the host to factor out some operations when computing the edges.
[Figure 14 bar chart: execution time (ms) per stage of traffic sign recognition (Colour, Morphology,
Edge Detection, Circle Detection, Bounding boxes, Shape Classification, Template Matching) for the
ARM-only and IPPro-accelerated implementations.]
Figure 14. Stage-wise comparison of traffic sign recognition acceleration using ARM and IPPro
based approach.
Author Contributions: The project was conceptualized by R.W., F.S. and S.A.; R.W. and K.R. provided project
supervision and project administration. The implementation was carried out primarily by F.S. and S.A.,
and validation was undertaken by F.S., U.I.M. and T.D. The original draft was written by F.S., R.W., S.A.
and D.C., and edited by R.W. and F.S.
Funding: This work has been undertaken in collaboration with Heriot-Watt University in a project funded by the
Engineering and Physical Science Research Council (EPSRC) through the EP/K009583/1 grant.
Conflicts of Interest: The authors declare no conflict of interest.
Journal of
Imaging
Article
Optimized Memory Allocation and Power
Minimization for FPGA-Based Image Processing
Paulo Garcia 1, * , Deepayan Bhowmik 2 , Robert Stewart 3 , Greg Michaelson 3 and
Andrew Wallace 4
1 Department of Systems and Computer Engineering, Carleton University, Ottawa, ON K1S 5B6, Canada
2 Div. of Computing Science and Mathematics, University of Stirling, Stirling FK9 4LA, UK;
[email protected]
3 School of Mathematical and Computer Sciences, Heriot Watt University, Edinburgh EH14 4AS, UK;
[email protected] (R.S.); [email protected] (G.M.)
4 School of Engineering and Physical Sciences, Heriot Watt University, Edinburgh EH14 4AS, UK;
[email protected]
* Correspondence: [email protected]
Abstract: Memory is the biggest limiting factor to the widespread use of FPGAs for high-level
image processing, which requires complete frame(s) to be stored in situ. Since FPGAs have limited
on-chip memory capabilities, efficient use of such resources is essential to meet performance, size and
power constraints. In this paper, we investigate allocation of on-chip memory resources in order to
minimize resource usage and power consumption, contributing to the realization of power-efficient
high-level image processing fully contained on FPGAs. We propose methods for generating memory
architectures, from both Hardware Description Languages and High Level Synthesis designs, which
minimize memory usage and power consumption. Based on a formalization of on-chip memory
configuration options and a power model, we demonstrate how our partitioning algorithms can
outperform traditional strategies. Compared to commercial FPGA synthesis and High Level Synthesis
tools, our results show that the proposed algorithms can result in up to 60% higher utilization
efficiency, increasing the sizes and/or number of frames that can be accommodated, and reduce
frame buffers’ dynamic power consumption by up to approximately 70%. In our experiments
using Optical Flow and MeanShift Tracking, representative high-level algorithms, data show that
partitioning algorithms can reduce total power by up to 25% and 30%, respectively, without impacting
performance.
Keywords: field programmable gate array (FPGA); memory; power; image processing; design
1. Introduction
Advances in Field Programmable Gate Array (FPGA) technology [1] have made them the de facto
implementation platform for a variety of computer vision applications [2]. Several algorithms, e.g.,
stereo-matching [3], are not feasibly processed in real-time on conventional general purpose processors
and are best suited to hardware implementation [4,5]. The absence of a sufficiently comprehensive,
one size fits all hardware pipeline for the computer vision domain [6] motivates the use of FPGAs
in a myriad of computer vision scenarios, especially in applications where processing should be
performed in situ, such as in smart cameras [7], where FPGAs embed data acquisition, processing and
communication subsystems. Adoption of FPGA technology by the computer vision community has
accelerated during recent years thanks to the availability of High Level Synthesis (HLS) tools which
enable FPGA design within established software design contexts.
However, since FPGAs have limited on-chip memory capabilities (e.g., approx. 6 MB of on-chip
memory on high-end Virtex-7 FPGAs), external memory (i.e., DDR-RAM chips connected to the FPGA)
is often used to accommodate frames [8,9]. This incurs penalties in performance (latency is much higher
for off-chip memory access) and, perhaps more importantly, in size (two chips, FPGA and DDR, rather
than just the FPGA), power (DDR memories are power hungry [10]) and monetary cost,
hindering the adoption of FPGAs.
In this paper, we research allocation of on-chip memory resources in order to minimize resource
usage and power consumption, contributing to the realization of power-efficient high-level image
processing systems fully contained on FPGAs. We propose methods for generating on-chip memory
architectures, applicable from both HLS and Hardware Description Languages (HDL) designs, which
minimize FPGA memory resource usage and power consumption for image processing applications.
Our approach does not exclude external memory access: rather, it is orthogonal to any memory
hierarchy, and applicable to any instances of on-chip memory. Specifically, this paper offers the
following contributions:
• A formal analysis of on-chip memory allocation schemes and associated memory usage for given
frame sizes and possible on-chip memory configurations.
• Methods for selecting a memory configuration for optimized on-chip memory resource usage and
balanced usage/power for a given frame size.
• A theoretical analysis of the effects on resource usage and power consumption of our partitioning
methods.
• Empirical validation of resource usage, power and performance of the proposed methods,
compared to a commercial HLS tool.
Our experiments show that on-chip memory dynamic power consumption can be reduced by up
to approximately 70%; using representative high-level algorithms, this corresponds to a reduction of
total power by up to 25% and 30%, respectively, without impacting performance. The remainder of
this paper is organized as follows: Section 2 describes related work within FPGA memory systems
architecture and design for image processing. In Section 3, we formally describe the research problem
of power-size optimization, present a motivational example that highlights the limitations of standard
HLS approaches, and present alternative partitioning methods. Section 4 describes our experimental
methodology and experimental results, and Section 5 presents a thorough discussion of said results.
Finally, Section 6 presents our concluding remarks.
Throughout this paper, we use the term BRAM (Block Random Access Memory), a Xilinx
nomenclature for on-chip memories, to refer to on-chip FPGA memories in general.
(e.g., sliding window filters) [15] are relatively simple to implement using hardware structures such as
line buffers [3]. However, complex algorithms based on global operations require complete frame(s) to
be stored in situ [11]; examples of contemporary applications that require global operations are object
detection, identification and tracking, critical to security. Notice we use the term “global operations” to
simultaneously refer to two characteristics: the use of global operators (atomic operations which require
the whole image, such as transposition or rotation) and undetermined (unpredictable) access patterns
(e.g., a person identification system might only need a subset of a frame, but which subset cannot be
decided at design time, as it depends on person location at runtime).
A possible approach is to refine image processing algorithms so they can perform on smaller frame
sizes that can be contained on an FPGA [2]. Several algorithms maintain robustness for downscaled
images [17] (e.g., the Face Certainty Map [18]), or employ intelligent on-chip memory allocation
schemes [8], which take power profiles into account, to accommodate complete frames. The latter requires
methods to optimize on-chip memory configurations in order to maximize valuable usage; often at
odds with performance-oriented allocation schemes standard in HLS code generators. Other possible
approaches include stream-processing algorithm refactoring to minimize memory requirements [19]
or programming-language abstractions for efficient hardware pipeline generation [20]; these are
orthogonal to our approach, and outside the scope of this work.
In our context, the most significant related work on the use of FPGA on-chip memory for image
processing applications has focused on four aspects: processing-specific memory architectures, caching
systems for off-chip memory access, partitioning algorithms for performance, and on-chip memory
power reduction.
on-chip as rows, consuming precious processing time. Similarly, Chou et al. [26] have shown the
use of vector scratchpad memories for accelerating vector processing on FPGAs, but still rely on
random-access external memories; a similar approach is followed by Naylor et al. [27] in the context of
FPGAs as accelerators. The use of external memories solves the storage limitation: however, it greatly
limits parallelism (only one access per external memory chip can be performed at once) or greatly
exacerbates financial and power costs, if several external memories are used.
problem formulation. We assume that only one possible BRAM configuration is used for each image
frame buffer.
$M_x \times N_x \leq C, \quad \forall x \in [0, i - 1] \qquad (2)$
For any given frame size, several possible BRAM topologies are possible (Different BRAM
configurations do not always equal the same logical bit capacity. Whilst the total physical capacity is the
same, in some configurations parity bits can be used as additional data bits. E.g., configuration (1,16384)
can store 16384 bits, whilst configuration (9,2048) can store 18432 bits). A frame is a 3-dimensional array,
of dimensions width W, height H, and pixel bit width Bw (typically defined as a 2-dimensional array
where the type defines the bit width dimension). BRAM topologies are defined based on a mapping of
3-D to 2-D arrays and a partitioning of a 2-D array to a particular memory structure (Figure 1).
Figure 1. Mapping a 3-D array into row-major and column-major order 2-D arrays.
Throughout the remainder of this paper, we assume the use of a mapping scheme which assigns
Bw to the x dimension and H and W to the y dimension, in both row-major and column-major
order (where x and y are 2-D array width and height, respectively). This is the default approach in
software implementations, where the type/bit width dimension is considered implicit, and a sensible
approach for hardware implementations. Mapping bit width Bw across the y dimension would result
in implementations where different bits of the same array element (pixel) would be scattered among
different memory positions of the same BRAM. This would require sequential logic to read/write a
pixel, accessing several memory positions, creating performance, power and size overheads. It should
be noted that this approach might offer performance advantages for certain classes of algorithms
which might want to compare individual bits of different elements; however, we delegate this aspect
to future work. Hence, we define only the default mapping scheme:
Definition 2. A mapping scheme m transforms a 3-D array A3 into a 2-D array A2 of dimensions x and y by
assigning Bw to the x dimension and ordered combinations of W and H to the y dimension, for a total of two
possible configurations, as depicted in Figure 1. Mapping schemes are defined as:
$(x, y) = m(W, H, B_w) \qquad (3)$
$A2_{x,y} = A3_{y \backslash W,\; y \% W,\; x}, \quad x = B_w, \; y = W \times H \qquad (4)$
$A2_{x,y} = A3_{y \% H,\; y \backslash H,\; x}, \quad x = B_w, \; y = W \times H \qquad (5)$
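As a concrete restatement of Definition 2 (our sketch, not code from the paper), the following Python
fragment flattens a frame stored as A3[row][col][bit] into the two 2-D orderings of Equations (4) and (5),
where \ denotes integer division and % the modulo operator.

# Sketch of the two mapping schemes of Definition 2 (Equations (4) and (5)).
# A3 is indexed as A3[row][col][bit], with row in [0, H), col in [0, W) and
# bit in [0, Bw). The mapped array A2 has x = Bw columns and y = W * H rows,
# so each row of A2 holds the Bw bits of a single pixel.
def map_row_major(A3, W, H, Bw):
    # Equation (4): A2[x, y] = A3[y \ W, y % W, x]
    return [[A3[y // W][y % W][x] for x in range(Bw)] for y in range(W * H)]

def map_column_major(A3, W, H, Bw):
    # Equation (5): A2[x, y] = A3[y % H, y \ H, x]
    return [[A3[y % H][y // H][x] for x in range(Bw)] for y in range(W * H)]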
Definition 3. Given a 2-D mapped image frame of dimensions x and y, a partitioning scheme p which assigns
pixels across a × b BRAMs, depicted in Figure 2, is defined as the linear combination:
$p(x, y) = \mathrm{Cfg} * \left( (a_1, b_1), (a_2, b_2), \ldots, (a_i, b_i) \right) \qquad (6)$
where $*$ stands for linear combination, such that only one $(a_x, b_x), \forall x \in [0, i - 1]$ pair has non-zero components
(such a pair is generated as a function of x and y), selecting $M_p$ and $N_p$ subject to:
$((a \times M_p) \geq x) \cap ((b \times N_p) \geq y) \qquad (7)$
and resource usage. As this is the greatest bottleneck in implementing high-level image processing
pipelines on an FPGA, it is paramount to define BRAM usage efficiency, i.e., the ratio between the total
data capacity of the assigned BRAMs and the amount of data which is actually used.
Definition 4. Given a partitioning scheme p and maximum BRAM capacity C, the utilization efficiency E is
defined as the ratio:
$E = \dfrac{x \times y}{a_p \times b_p \times C} \qquad (8)$
The default mapping and partitioning schemes in state of the art HLS tools are geared towards
minimizing addressing logic (abundant in contemporary FPGAs), resulting in sub-par efficiency in
BRAM usage (BRAMs are still scarce relative to the requirements of high-level image processing systems). Alternative
schemes must be used in order to ensure memory availability within HLS design flows. We define the
problem as:
Problem 1 (Utilization Efficiency). Given an image frame of width W, height H and pixel width Bw , select a
partitioning scheme, in order to:
Maximize $E = \dfrac{x \times y}{a_p \times b_p \times C}$
Subject to $((a \times M_p) \geq x) \cap ((b \times N_p) \geq y)$
$E = \dfrac{8 \times (320 \times 240)}{8 \times 8 \times (36 \times 512)} = 0.520833333 \qquad (11)$
We have observed that this is the default behaviour for Xilinx Vivado HLS synthesis tools:
empirical results show that configuration ( M1 , N1 ) = (1, 16384) is selected through a partitioning
scheme where a1 = Bw and
$b_1 = \dfrac{W \times H}{N_1} \qquad (12)$
rounded up to the nearest power of 2. Our experiments show that for any frame size, the synthesis
tools’ default partitioning scheme can be given by:
$$p(m(W, H, B_w)) = \mathrm{Cfg} * \begin{pmatrix} (B_w,\ 2^{\lceil \log_2(\frac{W \times H}{N_1}) \rceil}) \\ (0, 0) \\ (0, 0) \\ (0, 0) \\ (0, 0) \\ (0, 0) \end{pmatrix}^{T} \qquad (13)$$
where the exponent $\lceil \log_2(\frac{W \times H}{N_1}) \rceil$ should be read as the rounded-up (ceiled) result of
the logarithm operation (i.e., 2 is raised to an integer power).
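A minimal sketch of this default behaviour, reconstructed from Equations (12) and (13) under the
assumption of the (1, 16384) configuration and an 18 Kb (36 × 512-bit) BRAM, reproduces the 0.52
efficiency obtained above for a 320 × 240, 8-bit frame. It is our illustration, not tool source code.

# Reconstruction (not tool source) of the default partitioning behaviour
# described by Equations (12) and (13) for a 320 x 240, 8-bit frame.
import math

C = 36 * 512            # physical BRAM capacity in bits
W, H, Bw = 320, 240, 8
N1 = 16384              # configuration (M1, N1) = (1, 16384)

a1 = Bw
b1 = 2 ** math.ceil(math.log2((W * H) / N1))   # rounded up to a power of two

E = (W * H * Bw) / (a1 * b1 * C)
print(a1 * b1, "BRAMs, E =", round(E, 9))      # 64 BRAMs, E = 0.520833333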
Now consider the same mapping ($x = B_w$, $y = W \times H$), but with a partitioning scheme:
$$p(m(320, 240, 8)) = \mathrm{Cfg} * \begin{pmatrix} (8, 5) \\ (0, 0) \\ (0, 0) \\ (0, 0) \\ (0, 0) \\ (0, 0) \end{pmatrix}^{T} \qquad (14)$$
which partitions data unevenly across BRAMs, rather than evenly. This scheme yields a BRAM usage
count of 40, with storage efficiency:
$E = \dfrac{320 \times 240 \times 8}{8 \times 5 \times (36 \times 512)} = 0.833333333 \qquad (15)$
Yet a better partitioning scheme for the same mapping would be:
$$p(m(320, 240, 8)) = \mathrm{Cfg} * \begin{pmatrix} (0, 0) \\ (0, 0) \\ (2, 19) \\ (0, 0) \\ (0, 0) \\ (0, 0) \end{pmatrix}^{T} \qquad (16)$$
yielding a BRAM count of 38 and efficiency:
$E = \dfrac{320 \times 240 \times 8}{2 \times 19 \times (36 \times 512)} = 0.877192982 \qquad (17)$
Clearly, partitioning schemes depend on the frame dimensions, width, height, and bit width, to
enable efficient use of on-chip memory blocks.
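The alternative schemes of Equations (14) and (16) can be found mechanically. The sketch below is one
possible brute-force approach to Problem 1 (our illustration, not the authors' partitioning algorithm);
the list of BRAM aspect ratios is an assumption based on the (1, 16384) and (9, 2048) configurations
mentioned earlier.

# Brute-force sketch for Problem 1 (illustrative; not the paper's algorithm):
# enumerate candidate BRAM configurations (Mp, Np) and, for each, the minimum
# tiling (a, b) satisfying Equation (7), keeping the one with the highest E.
import math

C = 36 * 512                                    # BRAM capacity in bits
CFG = [(1, 16384), (2, 8192), (4, 4096),        # assumed aspect-ratio list
       (9, 2048), (18, 1024), (36, 512)]

def best_partition(W, H, Bw):
    x, y = Bw, W * H                            # mapping of Definition 2
    best = None
    for Mp, Np in CFG:
        a, b = math.ceil(x / Mp), math.ceil(y / Np)   # Equation (7)
        E = (x * y) / (a * b * C)                     # Equation (8)
        if best is None or E > best[0]:
            best = (E, (Mp, Np), a, b)
    return best

E, cfg, a, b = best_partition(320, 240, 8)
print(cfg, f"{a} x {b} BRAMs, E = {E:.9f}")     # (4, 4096) 2 x 19 BRAMs, E = 0.877192982

With the aspect ratios assumed here, the (1, 16384) entry reproduces the 40-BRAM scheme of
Equation (14) (E ≈ 0.83), while the (4, 4096) entry reproduces the 38-BRAM scheme of Equation (16)
(E ≈ 0.877).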
Exploring the Variety of Random
Documents with Different Content
“Mark my footsteps, good my page,
Tread thou in them boldly:
Thou shalt find the winter’s rage
Freeze thy blood less coldly!”
Gurth paused on the threshold an instant listening to the singing,
then entered without knocking. The little boys rushed to hang about
him and explore his pockets, and the pastor and Theresa welcomed
him warmly. It was Andrea alone who saw a change in his whole
demeanour, and wondered at the bird-cages. The evening meal was
soon eaten, and the boys went to the kitchen with the toys that
Gurth had brought them; the pastor, scenting something, sat erect in
his arm-chair, all forgetful of his pipe and expectant of some news,
while Theresa hung over him. Gurth stood by the stove, nervous and
uncertain how to begin.
Andrea went to him, and, putting her hand through his arm, said
quietly, but with an infinite tenderness in her voice, “You are going
away, dearest, and you have brought your birds for me to keep for
you.”
It was as if her voice smoothed away his fears and perplexities,
so all four together they discussed the situation without reserve.
Gurth, forgetting his prudent plans, begged the pastor to marry
them then, or at the latest in a few days, when the necessary
legalities could be complied with, so that he might leave Andrea as
his wife. Upon this point the pastor was obdurate. His practical
instinct, born partly of the peasant suspicion of another class, and
partly of hard experience, forbade this. Among his parishioners many
a wedding had taken place on the eve of parting, and the husband
had been swallowed up in that vast new world, while the poor girl at
home waited in vain, not knowing whether she was wife or widow.
He liked Gurth in a way, but he was sadly disappointed in his
failure to reconcile his mother to the marriage, and, while he
believed him sincere at present, he did not know how the separation
might affect either of the young people; so he insisted upon delay. If
Gurth had established himself by the next Christmas, he might
return and marry. If not—well, there were other men who, under the
circumstances, would be more suitable for Andrea, though he did
not voice his opinion. In reality he had no romance in his nature,
and he disbelieved in unequal marriages, especially if money was
not coupled with the rank.
If, after a year’s trial, Gurth was in a position to come for Andrea,
—well and good,—but further than that the pastor could be neither
coaxed nor driven.
Moreover, he allowed them little privacy for saying good-by.
“I know how to work, and I like it, but you must learn how,”
Andrea whispered, as she clung to him. “But I will be ready, Gurth,
and, more, if you can’t return, I will go to you!” This understanding
was their farewell.
His mother, when she found that he had gone, laughingly told
her friends that Gurth had a foolish love affair, and, taking her
advice, he had gone away to travel it off.
II
There is nothing that tends so to destroy the conceit of a man
little used to the sea, as an ocean voyage in midwinter, especially if
it is made on board an emigrant ship. On a good liner he may prop
up his flimsy importance in a dozen ways, from feeing stewards to
bring him six meals a day while he lies in his berth, to pulling himself
together and wearing the distinction of being the only cabin
passenger at table during a furious squall. But on an emigrant ship it
is impossible to veil or soften stern reality.
Gurth had chosen this way of travel that he might more quickly
realize his changed circumstances. For two weeks or perhaps three
he must live in this community. Previously he had a theoretical
knowledge of the conditions that surround and make poverty. Now
for the first time he saw the reality. His first thought was of the
wonderful patience of these people; the next conviction was of their
unconquered hope.
A dozen perhaps had settled homes in America and had returned
to their native land merely to visit, but the multitude were going,
they were not quite sure where, to earn their bread, they did not
know how. Doubts did not trouble them, their pink pasteboard
tickets seemed the pledge of landing somewhere, and as for the
rest, they were used to uncertainty.
The fourth day out, a day when a mild streak and a few hours’
sunshine brought all the grotesque animated bundles of clothes from
their berths, Gurth took his violin and, without ado, began to play a
native ballad, and then another. Silently the people grouped about
him, some stealing below to coax up a comrade who was ill.
The intensely earnest look on their faces stimulated him, and he
played on and on, grading his music from grave to gay, to suit each
in turn, until at last, feeling his wrist failing, he made the national
hymn a final effort. Scarcely had the tune taken form than a chorus
rose, at first swaying and uncertain, and then gaining power and
steadiness, until the last word was reached. The men rubbed their
eyes with the backs of horny hands, and women hugged him, and
before he realized the situation, one stolid, square-faced man, who
had virtually declined to talk to him the day before, was passing
around his peaked fur cap to receive a ready shower of small coin,
which Gurth could not refuse. So thus he earned his first money. By
his violin and its speech, which, however exquisite, no man feels
above him, he was admitted to the freemasonry of his companions.
A carpenter who had been home to see his old parents asked
Gurth where he was going to settle, and then he realized that he did
not know, save what his port was, and that he did not wish to locate
far from the sea, nor in a sultry climate. The carpenter drew from
him such scant outline of his schemes as he wished to tell—his plan
of buying a farm, after he had learned the country’s ways. This man
told him about the village where he lived, which was near a New
England town whose railways offered a market for small fruits, and
he advised Gurth to work for his board and lodging with one of the
numerous fruit-growers until he learned the craft, saying that as he
spoke English well, Waldsen might earn a trifle above his board, but
that a man who had never done hard work was not worth much.
III
It was a bitterly cold winter; the wind swept fiercely through the
cut between Sunset and Rocky hills, rushing down the main street at
Glen Village, separating the neighbours on either side more
effectively than drifts of snow could have done. However deep, there
is something cheerful and exhilarating about snow. Children think
that it is sent for their special amusement; the shy young man, who
drives his sweetheart over to the “Social” in the next village, needs
no excuse for putting his arm around her, for light sleighs have been
known to upset suddenly without the slightest warning. The old folks
are cheerful in their reminiscences of just such episodes, and
compare each storm with some long-remembered one in the thirties,
noting always the frail and inferior wearing quality of modern snow.
But Wintry Wind is the most exasperating and prying of nature’s
messengers, whose mission is the uncovering of weaknesses in all
things animate and inanimate. It soon discovers if your eyes are
sensitive, your hat a size too small, that you are subject to
rheumatism, that your breath is short when you walk uphill, and that
your knees bend as you go down, and so turns your cloak over your
head like an extinguisher. It knows precisely which shingle lacks a
nail, and will lay bare spots calculated to make obstinate leaks. It
also spies out the blind whose catch is loose, the gate with one
hinge, the elm that is split in the crotch, and the particular chimney
flue that leads to the room where your most important relation (who
suffers from bronchitis) is being entertained at tea, and it gauges
accurately which article on the clothes-line you value the most.
It was this sort of weather, combined with his daughter
Margaret’s delicate health, that made Ezra Tolford, living at the Glen
Mill, for which the village was named, resolve to have a hired man.
Now Ezra Tolford had many titles to local distinction. He was
Deacon of the First Church, and his parents had been zealous before
him, his grandfather having had the hardihood to fly to the woods
with the church plate on the approach of the British in 1779, thereby
risking his life via wild beasts, Hessians, and exposure,—a fact that
is brought up in every local historical discourse to this day.
Incidentally it might be mentioned that the plucky ancestor (owing
to fright and darkness) was never able afterwards to locate the
marshy spot where the precious metal was buried; this fact,
however, is usually omitted.
Ezra was also Judge of Probate, thanks to a fragmentary law
course taken in days when a fond mother had pinched and saved
that her only boy might “make his mark.” Thirdly, he was the owner
of the best mill on the Pequotuck. A mill that, in spite of the sale of
flour and meal at the village store, kept its wheel going five days out
of seven during nine months of the year, sawing wood when no one
wished flour, and turning out middlings for the cattle when the
stacks grew low. So swift was the river that ice very seldom silenced
the song the old wheel hummed as it worked.
Lastly, by wise drainage the deacon had turned a dozen acres of
protected meadow-land, heretofore regarded as next to useless, into
one of the thriftiest fruit farms in southern New England.
All these things made Ezra’s daughter Margaret of special
importance in many eyes besides his own, and it was for her sake
that he resolved to have a man to hook up the team for her, when
he was busy in the mill or away in the village, and do a thousand
and one little errands that the sturdier daughters of his neighbours
accomplished for themselves.
The Mill House, as it was called, stood on a hill between the
Pequotuck and a little brook that, curving, joined the river below the
dam. It was a placid-looking white house of a style of architecture
that might be called New England Restored. It had been Colonial,
but a modern bay-window, a piazza, and a lean-to in the rear had
hybridized it; yet it still possessed a dignity never seen in the rural
interpretations of the Queen Anne villa.
This particular house had a very attractive outlook. Raised well
above them, it was bounded on the western side by the river and
the mill-pond that always held the sunset reflections until the
twilight absorbed them, while the old red mill with its moss-mottled
roof focussed the view. Toward the north and east the meadows ran
slantwise up a hillside, where, dotted here and there like grazing
sheep, you could see the stones of the burying-ground, where the
inhabitants of the glen took their final rest, as if their friends had left
them as near heaven as possible, and safe from the floods that used
once to sweep the valley. To the south the road ran tolerably straight
for three miles down to Glen Village itself.
The interior of the house differed but slightly from others of its
class, and that difference consisted in the greater genuineness of its
fittings. Evidently the woman who presided over it appreciated
relative values, for the sitting-room had glowing crimson curtains
and a fire of logs in place of the usual “air-tight,” while in one corner,
in the location usually chosen for the inevitable asthmatic parlour
organ, stood an upright piano. On the table was a comfortable litter
of books and papers.
By the window, looking down the road, stood Margaret Tolford.
At the first glance there was nothing striking about her personality.
Medium in height and colouring, her slight frame was wrapped in a
soft white shawl that gave her a fragile air. At a second glance the
deep gray eyes, that looked from under a brow narrowed by a
quantity of smooth, coal-black hair, were magnetic in their intelligent
wonder. Her eyes said, “There is much that I would understand, but
I cannot;” whereas a shallower nature would have thought, “I am
misunderstood!”
The wind whistled in the chimney, and the pud, pud, of a heavy
flatiron came from the kitchen, with snatches of inharmonious song,
as the thick-lipped Polack who was the “help” pummelled the towels
and folded them at angles that would have distracted a
mathematician. In fact, this very Polack was one of Margaret’s lesser
problems, a sort of necessary evil who, in summer, bareheaded and
barefooted, pervaded the premises, but having with her gay
neckerchief a certain sort of picturesque fitness, which, when
brought nearer, booted and confined to the winter kitchen, became
an eyesore. Other farmers’ daughters did the cooking and the lighter
work, and only had a woman to help with the washing.
Margaret had never done manual labour; her mother, dead now
two years, had stood between this only child and all hardship, and
coaxed the Deacon to send her to a collegiate school when her
playdays were over. In the summer holidays she was petted and
caressed and kept from soiling her hands, and when at eighteen she
was coming home for good to mingle as an equal with her parents
and learn her part in life, her mother died, and her father closed the
one tender spot in his stern heart around his daughter. So she lived
shut up within herself, craving a more intellectual companionship
than the neighbourhood furnished, and starving unconsciously for
demonstrative affection.
Tolford was a silent sort of man, who had been so thoroughly
understood by his wife that she seemed to know his unvoiced
wishes. Because he showed so few signs of an affection that would
have won a hearty response from Margaret, he failed to comprehend
the difference between a deeply reserved nature and physical
weakness, to which cause he laid her abstraction. His love for her,
therefore, took the schooltime form of shielding her from work. He
liked to hear her play hymns on Sunday evenings, and was very
proud to have her train the children of the Sunday School in their
carols, but it never occurred to him to ask her advice in any of his
plans, or expect aid from her. She stood apart, not understanding
the love her mother had drawn from the stern, lonely man, and
while he excused her reserve, and told the neighbours she was
delicate and peaky, her only ailment lay in lack of motive.
It grew dark, and points of light appeared here and there in the
landscape; an icy slip of a moon pierced the driving clouds. Margaret
drew the curtains and sat down by the fire, its light sending a glow
to her usually colourless face. A brisk, though heavy, footstep came
along the entry from the kitchen, and Ezra Tolford opened the door,
and, stopping a moment to adjust his eyes to the fitful light, went
toward the fire, rubbing his hands. Margaret immediately arose and,
pushing a rocking-chair towards him, prepared to light the lamp.
“Never mind that now, daughter,” he said; “sit down, I want to
talk a bit. You know I said I’d get a hired man to ‘piece out’ with the
work? Well, he’s come!”
The Deacon was, in reality, fairly well educated, but since his
wife’s death (she had kept him to her standard, for she had been a
schoolmistress) his English had relapsed into localisms, and, besides
this, at the present moment he seemed ill at ease. Margaret merely
understood the announcement as a roundabout question as to
whether any accommodations were prepared for the man, and said:
“The shed bedroom is just as Hans Schmidt left it last fall; I suppose
a bed could be made up now, and Zella can clean the room to-
morrow, but it will be very cold unless you give him a stove.”
“Well—er—you see,” said the Deacon, “I don’t suppose that room
will do,—em!—hem! You see in the beginning he is to live with me
without wages, and—” here the Deacon came to an embarrassed
standstill, and Margaret broke in,—“Without wages! If he is as poor
as that, he will scarcely object to the shed room without a fire for
the night!” She did not say this because she was at all mean or hard-
hearted, but from her experience of the servant question, any one
who was willing to work for nothing must either be utterly worthless
or bereft of reason.
“Not at all, not at all, daughter! You see, the man is not a
common workman, but may buy the Hill Farm some day as a home
for his sister, and wants me to teach him to grow small fruits, and
learn the way of things here while he gets it to rights. I’ve
contracted with him for a year—” and as Margaret did not reply, he
continued, “You know Peter Svenson, the carpenter, who went home
to Denmark last summer to see his folks? Well, he brought this
young man back with him. Peter knows all about him, and says he is
perfectly honest and speaks good English, but is close-mouthed, and
doesn’t like to talk of his affairs, because his family used to be well
fixed, but now they are all dead but one sister. He has a few
thousand dollars and is going to make a home and bring her over in
a year.
“Peter says he can play a fiddle, but isn’t used to hard work, and
advised me not to pay him money, but to offer to show him how I
work my farm and give him his board for his services.” Then the
Deacon continued, giving the account of Gurth that the garrulous
carpenter had pieced together to cover his lack of real knowledge.
As Margaret still said nothing, he added:—
“Now I think the attic east room might be straightened up,—it
won’t take long, and it can be bettered to-morrow.”
Instantly Margaret was divided between extreme wonderment at
this strange arrangement on her father’s part, and fierce resentment
at the intrusion of a stranger in the house,—a man who was and
was not a servant, who must necessarily eat with them, who would
not perhaps leave the room when the meal was finished.
If Margaret had a decided eccentricity, it was her positive
resentment of male society, and she bore the reputation of being
proud, because, when the village swains drove up in their newly
washed buggies with bows of ribbon tied to the whip handles, and
with self-satisfied glances asked her to take a drive, the usual rural
compliment, she invariably declined, and their irate mothers settled
that she either must be in a decline, experiencing religion, or else,
woful thought, “engaged to some fellow Northampton way,” where
she had been to school.
The truth was that she had, through a wide range of reading and
no experience, built up a well-nigh impossible ideal, half mediæval
heroism, half modern, intellectual refinement, that was irreconcilable
with the type of men with whom she came in contact.
Margaret was thoroughly accustomed to her father’s silent mood
and considered him by far (as he was) the best-informed man she
knew. He was also fond of reading, not only subscribed to a daily
paper, but several weeklies and magazines, and always allowed her
to buy any book she fancied, so that their winter evenings, when
Margaret read aloud, were comfortably sociable, and sympathetic. It
was no wonder, therefore, that she resented the presence of a
stranger, and it was with rather a lowering brow that she followed
her father to the kitchen.
Deacon Tolford went in first, and said abruptly, but in a tone that
Margaret knew was meant to be cordial: “Daughter, this is Gurth
Waldsen, who is going to help me out this year; we want to make
him feel so much at home that he’ll settle in Glen Village. You’d
better tell Zella to hurry supper; I guess we are both of us hungry.”
Margaret added some ordinary words of greeting before she
looked at the figure who rose from the settle back of the stove and
bowed, without offering to shake hands, as a native would have
done. Then she raised her eyes and saw the tall, easy figure with
the golden-tipped hair and beard, his dreamy gray eyes looking at
her with a directness that was not curious, but almost as of pleading
for mercy, while the mouse-coloured corduroy suit that Waldsen
wore brought out the clearness of his skin in a degree that was
almost startling.
“I hope that I put you not to great trouble,” he said in his soft
baritone. “If you will tell me where I may place my things, I can
arrange all myself.” The English was musical, and doubly so from the
slight hesitation and accent.
What passed through Margaret’s brain she never clearly realized,
but she heard her voice as from a long distance asking him to follow
her upstairs, and found herself lighting a lamp, and leading the way.
It was strange that she had never noticed before how dreary the
attic was. She merely indicated the room, saying that he might leave
his things there, and to-morrow he could bring up firewood, while
to-night she would give him an extra supply of bedding. As she left,
Gurth looked after her and at the bare room, and shivered, but the
room seemed less cold to him than the woman. There was no
reason that he should expect her to be cordial; doubtless she would
have preferred a field hand to whom she need not speak.
He realized that his very disappointment grew from the lack of
proper comprehension of his present position. “Oh, Andrea! Andrea!
for one sight of her sweet, sympathetic face, one touch only!” A
harsh, clanging bell from below waked him to the fact that if he
wanted water to wash his hands, he must bring it up himself; he
looked at them dubiously, smoothed his hair, flipped off his clothes
with his handkerchief, and went down.
He hoped that he might be allowed to eat his meals in the
kitchen; it would indicate his position more clearly, and he should be
less lonely than with constrained companionship. This was not to be.
As he passed the dining room door, he saw a table laid for three, at
which Ezra Tolford was already sitting, wrapped in a gaily figured
dressing-gown, and collarless, as was his habit when either at ease
or at work. He was reading a paper which was propped against a
pitcher, and he barely raised his eyes as he asked Gurth to be
seated.
Margaret came in with a coffee-pot and a plate of biscuits. She
had thrown off her shawl, and her crimson cashmere waist
accentuated the depth of her eyes. Gurth unconsciously arose and
drew out her chair, waited until she was seated, and pushed it in
again. It was a very simple and ordinary act of courtesy, and done
as a matter of course without the slightest manner of conferring a
favour. Margaret coloured at this hitherto unknown civility, but said
“Thank you” as if she were quite accustomed to it, while the Deacon
did not notice it at all.
The meal began in silence, but the Deacon finished his paper
with the first cup of coffee, and began to discuss the affairs of the
farm in a businesslike manner. The ice-cutting must begin to-
morrow, it was quite clear, for the last snowstorm had been dry and
had drifted away from the pond.
Had Waldsen ever cut ice? No! Well, he could superintend the
weighing of it, then. Could he milk? No! The hay must be transferred
from the left side of the great barn to the right, as the supports
were giving way, and Peter Svenson, the carpenter, must come and
straighten them, as well as do some tinkering at the mill. Squire
Black at the village needed two tons of hay, so that much could be
carted in next morning.
Waldsen fortunately was thoroughly familiar with horses, and
was a good deal of a carpenter, having always had a fancy for such
work, and, when a boy, he had for amusement built an arbour for
his mother in the garden of her country-house. He was able to
volunteer to repair the barn and mill, if the Deacon had the
necessary tools. The Deacon was too keen to show his surprise, but
accepted the offer, and said it would come handy to have some
patching up done before it came time to clear the land. He could
manage the cows and the mill, if Gurth took charge of the horses
and the chores.
The Deacon, having finished his meal, shook the crumbs from a
fold of the tablecloth of which he made a sort of apron in his lap,
and left the table. Margaret followed him, and Waldsen, hesitating a
moment, went to the back entry and began to collect his
possessions, taking his violin case and a small box first. When he
returned for his trunk, the Deacon appeared, and, as a matter of
course, helped him carry it upstairs. The trunk was very heavy, being
half full of books. Then the two men went out to feed the horses;
the sharp, dry snow blew in like powdered flint when they opened
the door, and made rainbows about the lantern as they went down
the path.
After the table was clear, Margaret took up the paper, read for a
few moments, then dropped it suddenly and went into the kitchen.
Zella, who was knitting a skirt of scarlet yarn, seemed very sulky and
angry when Margaret bade her take some wood to the attic
bedroom. “I no carry for hired man,” was her rejoinder. “You will
take the wood up to-night,” said Margaret, in the quiet, decided tone
that was habitual to her; “to-morrow he will carry it himself.” In a
short time a fire was started in the old, open-fronted wood stove,
that sent a welcome glow across the long, low room with its deeply
recessed dormer windows. The furniture consisted of an old-
fashioned four-posted bedstead and some spindle-backed chairs,
discarded long ago from the lower rooms, an old chest of drawers
and a table, while a row of wooden pegs behind the chimney did
duty as a closet.
Going to the adjoining lumber room, Margaret pulled open a long
trunk and took a chintz quilt, some curtains that had originally
belonged to the old bed, and three or four carpet rugs. These she
dragged into the attic, and then brought from a downstairs room a
large rocking-chair, covered with Turkey red, and a blue china bowl
and pitcher. The last man who had slept in the attic had washed at
the pump. In a few minutes the bare room looked quite habitable,
and Margaret returned to her newspaper.
In perhaps half an hour her father returned, and she heard
Waldsen’s steps going up the creaking back stairs.
“Well, daughter, quite a figure of a man, isn’t he? I know you
don’t like to have men folks about, but you see this arrangement will
advantage me greatly. If I can sell him the Hill Farm, it will be so
much clear gain, besides being a bargain for him, for it’s running
down and needs lots of tinkering. And if we get a good neighbour
there, it won’t be so lonesome for you when I go over town. I can
arrange with him for half-time work in the growing season, so he
can get his fruit running. I’ll sell that place for three thousand dollars
—and three thousand dollars in hand,—why, Margaret, you might go
to Europe next summer with Judge Martin’s folks! He told me
yesterday they expected to take a tour, and that if I’d let you go,
you’d be good company for Elizabeth. What do you say to that,
daughter?”
Going to him and sitting on the arm of his chair, she hid her face
on his shoulder, a childish habit of hers, and said: “Dear old dad, I
should want you to go with me, and then, besides, it is all so
uncertain. This man may not really want to buy a place, or he may
have no money, or—or, a great many things may not be true!”
“No, no, child! the man is all right, he wants to have a home of
his own by next Christmas. There is some reason why his sister
cannot come until then. I like to keep you with me, but my little girl
is too lonely; she must see more company, and if she’s too wise and
too proud for the folks about home, why, this place isn’t the whole
world.”
Meanwhile Waldsen was sitting on his trunk in the attic room in
an attitude of dejection. Then, as the fire flickered, he saw the
change that had been wrought. Not great in fact, but in the
womanly touch, and he was comforted. Taking from his pocket the
little case containing Andrea’s portrait, he placed it on the chest of
drawers, and, after closing the door, took out his violin.
Margaret and her father were playing their nightly game of
backgammon when she started, dropped her checkers with a rattle,
and grasped his arm. The Deacon looked up in surprise, and then,
as he heard a far-away strain of music that seemed to come from
the chimney, said, “Don’t be scared, daughter, it’s only the young
man playing his fiddle!” But somehow neither father nor daughter
cared to continue their game, and a moment later Margaret opened
the door of the sitting-room and one at the foot of the stairs, and
stood there listening, in spite of the cold air that swept down.
Accustomed at most to the trick playing of travelling concert troupes,
who visited the next town, this expressive legato music was a
revelation to Margaret, and stirred her silent nature to untested
depths. The first theme was pleading and wholly unknown to her,
but presently the air changed to the song she had taught the
children during the last Christmas season; through it she heard two
voices singing,—the violin and the man.
“Brightly shone the moon that night
Though the frost was cruel
When a poor man came in sight
Gathering winter fuel.
IV
The last of February a southerly rain inaugurated the spring
thaw. Great cakes of ice came down the river, and barricaded the
mill. Then a cold snap followed, and the trees hung thick with
fantastic icicles. In the morning the Deacon, Gurth, and several
neighbours went up the stream to dislodge, with long poles, cakes of
ice that were wedged threateningly between trees, and after dinner,
when the two men had been talking of the caprices of the storm, the
Deacon said: “It’s worth walking up to the Hill Farm, daughter, to
see the ice on those white pines, but you must mind your footing.
Waldsen’s going up there to shovel off the shed roof, and he’ll be
glad to beau you, I know.”
Margaret blushed painfully, but Gurth, totally missing the
significance of the word, said, in his precise language, that he was
about to ask Miss Margaret, but feared she could not walk so far. So
Margaret brought her coat, trimmed with a neck-band and cuffs of
fur, and, drawing a dark red tam-o’-shanter over her black hair, set
off with Waldsen.
As the Deacon watched them go down the road, dark and fair,
slender and tall, both talking with animation, he suddenly gave a
long whistle, for an idea, born of the word he had just used, flashed
across his matter-of-fact mind, and he said aloud,—“Well, I never!
Well, I never! She shan’t find her old dad a spoil sport, anyhow! I’ve
my doubts if he’ll ever make out with farming, but I suspect he
comes of good folks, and there’s a good living at the mill, and
Margaret’s my only one!” Then he smiled contentedly to himself. The
Deacon had loved his wife with a sentiment that was regarded as a
weakness by his neighbours, and he was prepared to enjoy the
courtship of his only daughter and forward it by all the innocent local
ruses. Yes, he would even make errands to town, and at the last
moment send Waldsen to drive Margaret in his stead.
The couple crossed the bridge and climbed the steep river bank
towards the Hill Farm. Waldsen was in high spirits and hummed and
whistled as they struggled and slipped along, steadying Margaret
every few steps. Happiness and the bracing air had given her a clear
colour, and her eyes were sparkling—she was a different being from
the pale, silent girl of two months ago. The mail-carrier, who met
them at the cross-roads and handed Gurth some letters, thought
what a fine couple they made, and immediately started his opinion
as a rumour around the community.
Margaret walked about outside the little brown house, while her
companion freed the roof from its weight of ice. Her own home was
in sight across the river, and at the left was a lovely strip of hill
country that rose and fell until it merged with the horizon. She was
so absorbed in the view that she did not realize when the shovelling
was finished, until Waldsen stood close beside her. “Has your father
told you that I buy this place, and that to-morrow the papers will be
signed? Yes, I have bought it for my home; I shall plant the ground
and work it, as your father says, to win my living. At evening we
shall sit here and look up the river and down to where the sun sets,
and then over to your house, thanking you for your kindness to a
lonely stranger.” The “we” dropped in unawares, but Margaret knew
that he meant Andrea, his sister.
“Next Christmas I shall move here, for my best resolves have
come on Christmas Day; meanwhile, there is much to be done, and I
shall ask your woman’s art how best to make my home attractive.”
Then they talked of the garden and of the house, how it would need
a summer kitchen, until he, through the subtilty of woman’s
sympathy, thought that he could not wait all the long months for
Andrea’s coming.
That night Waldsen sat a long time pondering over a letter that
had that day come from Andrea. At the first, nothing new suggested
itself, except that she perhaps was lonely, but on a second reading a
note of pain was evident. Carelessly feeling in the pocket of his
overcoat before going to bed, he found that he had received two
letters, when he thought he had but one, and, re-lighting his lamp,
he read the second, which was blotted and tear-stained. It ran thus:—
“The stamp on the last letter that I wrote you, dear
Gurth, is hardly dried, yet I must write again and tell you
that which for the last month I have tried to conceal. Now it
is useless. My father will bring a new wife to fill my
mother’s place in two months from now. A hateful woman
who has in some strange way gained power over and
fascinated him, but who does not wish me in the house, for
my father is urging, nay, almost commanding, my betrothal
to Hans Kraus, the brewer’s son, whom I have seen hardly
twice, and whose mother is arranging the matter for him.
“In vain I protest and remind him of our betrothal. He
insists that your mother will surely win you back, as she is
making great efforts to discover where you are. He will not
hear of my going out to service. I know that you will say,
‘Come to me, and we will be married,’ but knowing your
plans and your agreement with your employer, this I will
not do until Christmas comes again. One thing is possible, if
you will undertake it. You are, of course, known in your
village as a working-man. There must be some one there
who wishes a young, strong woman to do housework,
sewing, anything, in short,—you know my hands are used
to work of all kinds. Find some lady who will pay my
passage money, to be taken out in service, and I will come.
Thus I, too, shall be independent. I can sometimes see
you, and when we then marry at Christmas, no one will
know that we are not as we seem, and we shall begin on a
sure footing. Do not attempt to stop me, dearest. Let me
also work.
“Your Andrea.”
This letter cut Waldsen to the heart as well as stirred his pride,
and his first impulse was to return at once to Denmark for Andrea.
Then he considered all the threads that must be unravelled, the
dispersal of many plans so nicely made, and he paused, perplexed.
Andrea clearly did not realize that he was not really a servant even
in name, and that he could not allow her to fill a drudge’s place in
some farm-house.
Stop! why should he not consult Margaret? She might suggest
something, and, at least, her advice would be in accord with local
custom, so that neither he nor Andrea would be criticised in future
by those among whom they were to live. He wrote a few comforting
lines to his betrothed, which he prepared to post that night that the
letter might go by the next day’s steamer, for he had the habit, that
a man bred in a large city seldom loses, of noting the coming and
going of the iron monsters that bind the continents.
It was after one o’clock when he went downstairs, shoes in hand,
and nearly three when he returned from his six-mile walk, after
dropping his letter through the well-worn slit in the post-office door.
The stairs creaked provokingly as he made his way up. He heard a
slight noise and saw a light under Margaret’s door, which, as he
passed by, opened, and Margaret herself peered out, shading her
candle with her hand, and looking down the hall. She almost
screamed when she saw Gurth so near, and said quickly, with a
catch in her breath: “I heard a noise and thought the stair door had
blown open. Are you ill? Can I do anything for you?” He looked at
her a moment as she stood there in her loose wool wrapper, her hair
hanging in long braids, and it seemed like an answer to his
perplexity. His heart whispered, Trust her, consult her, and he said
gravely, “I am not ill, I thank you, and you can do something for me,
but not to-night.”
Then Waldsen slept the sleep of deep fatigue, but Margaret,
misunderstanding wholly and wakeful with happiness, threw herself
on her knees by her bed and, falling asleep, stayed in this position
until the sun cast streaks across the room and scattered the mist
that betokened the final breaking up of winter.
The March days flew by rapidly, and it was almost April. The
willows were showing yellow stems, and the river swirled under
them with new fervour. Hepaticas bloomed in the wood edges, while
violets crept along in the sheltered garden border; bluebirds purled
about the mill, while the kingfishers quarrelled over the pond. At
every meal Waldsen brought the account of some new bird or
unknown flower, until the Deacon was almost vexed, and told him in
a sternly parental way that he would never make his salt, but fill his
farm with brakes and briers, growing strawberries for robins and
raspberries for catbirds; but Margaret only smiled, treasuring every
leaf he brought, and spent much time out of doors watching the
messengers of spring that she never before had noticed, feeling that
life was good.
Easter came in middle April, and the little church at Glen Village
was to be decorated with flowers. The day before, Gurth went into
town with a load of feed, stopping on his way at the post-office, and
found a letter from Andrea that made him resolve to act at once.
On his way home he bought two pots of blooming lilies, which he
placed on Margaret’s table in the sitting-room, as an Easter gift to
the home. As she thanked him, bending over the flowers, he said,
“Miss Margaret, a while ago I said that you could do something for
me. I have come to ask it now, but before I speak there is much
that I must tell you, so that you may understand.” Margaret, making
a gesture of assent, stood clinging to the curtain for support, still
bending over the lilies.
Gurth began slowly and hesitatingly with his father’s unhappy
marriage and his loveless childhood, speaking deliberately, and
choosing his words like a lawyer presenting his case. A puzzled
expression gradually spread over Margaret’s face, but as he told her
of his meeting with Andrea and his love for her, she gave the curtain
so sudden a jerk that it tore from its fastenings, and fell in a heap
upon her. Gurth, merely thinking that she had stood too long, lifted
the curtain, gave her a chair, and continued his narrative, with
unconscious egotism. For more than an hour he talked; the Deacon
peeped in and hastily withdrew, thinking that the young folks were
coming to an understanding.
Margaret did not say a word, but so absorbed was Gurth that he
did not notice it. A terrible struggle was rending her, and she could
not trust herself to speak. Not only had her life hinged itself upon an
impossibility, but the mistake that had made such a thing possible
had come from giving credence to the story of the carpenter.
As every detail of the past three months came before her, she
realized how innocent of any deception Waldsen had been, and the
very advice he was now seeking proved his confidence in her. The
secret was her own,—at least she had that comfort. Then a wave of
pain passed over her, almost stopping her breath and seizing her
throat in an iron grasp. She dimly saw that Gurth was showing her
some letters, and gathered herself together only to receive a fresh
blow,—his appeal for Andrea. For though he did not ask it in so
many words, she knew what was in his mind.
When he had finished and stood expectantly before her, she
could no longer contend with herself, and big tears rolled down her
cheeks as she said, “I must think before I answer you, but I will do
all I can.” As she passed him he saw the tears, and, taking her hand,
he stooped and kissed it reverently, saying, “God bless you for your
sympathy.”
The Deacon did not return for tea, having business in town, and
Waldsen, much surprised at Margaret’s absence, ate his meal alone.
Margaret herself sat in her east window looking at the twilight,
and, when it faded, at the stars. The marsh frogs piped
monotonously, and the water rushed over the dam, falling below
with a hollow thud. Soon Waldsen’s violin sounded from his open
window,—to-night he played “The Songs without Words,” one after
another, chancing to end with “Lost Happiness.” As Margaret
listened, now that the first shock was over, she was soothed. At first
she did not think it was possible that she could have Andrea in the
house, and then she knew that only by some such object lesson
would she realize that Waldsen could not belong to her. Andrea
should come, and they would work together. Zella was shiftless and
constantly threatening to go. To tell her father and make him
comprehend the change was her next task. Puritan in education and
temperament, she had no other thought but to bend to the seemingly
inevitable.
On Easter Day no one who heard Margaret sing at church knew
of her struggle, and yet her voice moved those plain people as it
never had before, and they spoke of it among themselves in walking