Parallel Programming
with Co-arrays

Robert W. Numrich
CRC Press
Taylor & Francis Group
6000 Broken Sound Parkway NW, Suite 300
Boca Raton, FL 33487-2742

© 2019 by Taylor & Francis Group, LLC


CRC Press is an imprint of Taylor & Francis Group, an Informa business

Version Date: 20180716

International Standard Book Number-13: 978-1-4398-4004-7 (Hardback)


International Standard Book Number-13: 978-0-4297-9327-1 (Paperback)

Library of Congress Cataloging-in-Publication Data

Names: Numrich, Robert W., author.


Title: Parallel programming with co-arrays / Robert W. Numrich.
Description: First edition. | Boca Raton, FL : CRC Press/Taylor & Francis
Group, 2018. | Series: Chapman & Hall/CRC computational science ; 33 |
Includes bibliographical references and index.
Identifiers: LCCN 2018024855 | ISBN 9781439840047 (hardback : acid-free paper)
Subjects: LCSH: Parallel processing (Electronic computers)
Classification: LCC QA76.642 .N86 2018 | DDC 004/.35--dc23
LC record available at https://lccn.loc.gov/2018024855

Visit the Taylor & Francis Web site at


http://www.taylorandfrancis.com
and the CRC Press Web site at
http://www.crcpress.com
Contents

Preface ix

1 Prologue 1

2 The Co-array Programming Model 3


2.1 A co-array program . . . . . . . . . . . . . . . . . . . . . . . 3
2.2 Exercises . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 7

3 Partition Operators 9
3.1 Uniform partitions . . . . . . . . . . . . . . . . . . . . . . . . 9
3.2 Non-uniform partitions . . . . . . . . . . . . . . . . . . . . . 12
3.3 Row-partitioned matrix-vector multiplication . . . . . . . . . 14
3.4 Input/output in the co-array model . . . . . . . . . . . . . . 16
3.5 Exercises . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 17

4 Reverse Partition Operators 19


4.1 The partition of unity . . . . . . . . . . . . . . . . . . . . . . 19
4.2 Column-partitioned matrix-vector multiplication . . . . . . . 21
4.3 The dot-product operation . . . . . . . . . . . . . . . . . . . 23
4.4 Extended definition of partition operators . . . . . . . . . . . 24
4.5 Exercises . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 25

5 Collective Operations 27
5.1 Reduction to root . . . . . . . . . . . . . . . . . . . . . . . . 27
5.2 Broadcast from root . . . . . . . . . . . . . . . . . . . . . . . 30
5.3 The sum-to-all operation . . . . . . . . . . . . . . . . . . . . 31
5.4 The max-to-all and min-to-all operations . . . . . . . . . . . 32
5.5 Vector norms . . . . . . . . . . . . . . . . . . . . . . . . . . . 33
5.6 Collectives with array arguments . . . . . . . . . . . . . . . . 33
5.7 The scatter and gather operations . . . . . . . . . . . . . . . 34
5.8 A cautionary note about functions with side effects . . . . . 37
5.9 Exercises . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 37

6 Performance Modeling 39
6.1 Execution time for the sum-to-all operation . . . . . . . . . . 40
6.2 Execution time for the dot-product operation . . . . . . . . . 41
6.3 Speedup and efficiency . . . . . . . . . . . . . . . . . . . . . 43
6.4 Strong scaling under a fixed-size constraint . . . . . . . . . . 43
6.5 Weak scaling under a fixed-time constraint . . . . . . . . . . 47
6.6 Weak scaling under a fixed-work constraint . . . . . . . . . . 49
6.7 Weak scaling under a fixed-efficiency constraint . . . . . . . 50
6.8 Some remarks on computer performance modeling . . . . . . 52
6.9 Exercises . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 54

7 Partitioned Matrix Classes 57


7.1 The abstract matrix class . . . . . . . . . . . . . . . . . . . . 57
7.2 Sparse matrix classes . . . . . . . . . . . . . . . . . . . . . . 59
7.3 The compressed-sparse-row matrix class . . . . . . . . . . . . 61
7.4 Matrix-vector multiplication for a CSR matrix . . . . . . . . 64
7.5 Exercises . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 66

8 Iterative Solvers for Sparse Matrices 67


8.1 The conjugate gradient algorithm . . . . . . . . . . . . . . . 67
8.2 Other Krylov solvers . . . . . . . . . . . . . . . . . . . . . . 70
8.3 Performance analysis for the conjugate gradient algorithm . 71
8.4 Strong scaling . . . . . . . . . . . . . . . . . . . . . . . . . . 74
8.5 Weak scaling . . . . . . . . . . . . . . . . . . . . . . . . . . . 76
8.6 Iso-efficiency . . . . . . . . . . . . . . . . . . . . . . . . . . . 78
8.7 Exercises . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 78

9 Blocked Matrices 79
9.1 Partitioned dense matrices . . . . . . . . . . . . . . . . . . . 80
9.2 An abstract class for dense matrices . . . . . . . . . . . . . . 80
9.3 The dense matrix class . . . . . . . . . . . . . . . . . . . . . 81
9.4 Matrix-matrix multiplication . . . . . . . . . . . . . . . . . . 84
9.5 LU decomposition . . . . . . . . . . . . . . . . . . . . . . . . 87
9.6 Partial pivoting . . . . . . . . . . . . . . . . . . . . . . . . . 92
9.7 Solving triangular systems of equations . . . . . . . . . . . . 94
9.8 Exercises . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 98

10 The Matrix Transpose Operation 99


10.1 The transpose operation . . . . . . . . . . . . . . . . . . . . 99
10.2 A row-partitioned matrix transposed to a row-partitioned matrix . . . . 102
10.3 The Fast Fourier Transform . . . . . . . . . . . . . . . . . . 105
10.4 Performance analysis . . . . . . . . . . . . . . . . . . . . . . 106
10.5 Strong scaling . . . . . . . . . . . . . . . . . . . . . . . . . . 108
10.6 Weak scaling . . . . . . . . . . . . . . . . . . . . . . . . . . . 109
10.7 Exercises . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 110
11 The Halo Exchange Operation 111
11.1 Finite difference methods . . . . . . . . . . . . . . . . . . . . 111
11.2 Partitioned finite difference methods . . . . . . . . . . . . . . 114
11.3 The halo-exchange subroutine . . . . . . . . . . . . . . . . . 116
11.4 Exercises . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 118

12 Subpartition Operators 119


12.1 Subpartition operators . . . . . . . . . . . . . . . . . . . . . 119
12.2 Assigning blocks to images . . . . . . . . . . . . . . . . . . . 121
12.3 Combined effect of the two partition operations . . . . . . . 122
12.4 Permuted distributions . . . . . . . . . . . . . . . . . . . . . 122
12.5 The cyclic distribution . . . . . . . . . . . . . . . . . . . . . 124
12.6 Load balancing . . . . . . . . . . . . . . . . . . . . . . . . . . 125
12.7 Exercises . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 126

13 Blocked Linear Algebra 127


13.1 Blocked matrices . . . . . . . . . . . . . . . . . . . . . . . . . 127
13.2 The block matrix class . . . . . . . . . . . . . . . . . . . . . 130
13.3 Optimization of the LU-decomposition algorithm . . . . . . . 136
13.4 Exercises . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 138

14 The Finite Element Method 141


14.1 Basic ideas from finite element analysis . . . . . . . . . . . . 142
14.2 Nodes, elements and basis functions . . . . . . . . . . . . . . 143
14.3 Mesh partition operators . . . . . . . . . . . . . . . . . . . . 147
14.4 The mesh class . . . . . . . . . . . . . . . . . . . . . . . . . . 150
14.5 Integrating the heat equation . . . . . . . . . . . . . . . . . . 154
14.6 Exercises . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 156

15 Graph Algorithms 159


15.1 Graphs . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 159
15.2 The breadth-first search . . . . . . . . . . . . . . . . . . . . . 161
15.3 The graph class . . . . . . . . . . . . . . . . . . . . . . . . . 164
15.4 A parallel breadth-first-search algorithm . . . . . . . . . . . 165
15.5 The Graph 500 benchmark . . . . . . . . . . . . . . . . . . . 167
15.6 Exercises . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 169

16 Epilogue 171

A A Brief Reference Manual for the Co-array Model 173


A.1 The image index . . . . . . . . . . . . . . . . . . . . . . . . . 174
A.2 Co-arrays and co-dimensions . . . . . . . . . . . . . . . . . . 175
A.3 Relative co-dimension indices . . . . . . . . . . . . . . . . . . 177
A.4 Co-array variables with multiple co-dimensions . . . . . . . . 178
A.5 Co-array variables of derived type . . . . . . . . . . . . . . . 179
A.6 Allocatable co-array variables . . . . . . . . . . . . . . . . . 182
A.7 Pointers . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 182
A.8 Procedure interfaces . . . . . . . . . . . . . . . . . . . . . . . 183
A.9 Execution control . . . . . . . . . . . . . . . . . . . . . . . . 183
A.10 Full barriers . . . . . . . . . . . . . . . . . . . . . . . . . . . 184
A.11 Partial barriers . . . . . . . . . . . . . . . . . . . . . . . . . . 185
A.12 Critical segments and locks . . . . . . . . . . . . . . . . . . . 187
A.13 Input/output . . . . . . . . . . . . . . . . . . . . . . . . . . . 189
A.14 Command line arguments . . . . . . . . . . . . . . . . . . . . 190
A.15 Program termination . . . . . . . . . . . . . . . . . . . . . . 191
A.16 Inquiry functions . . . . . . . . . . . . . . . . . . . . . . . . . 191
A.17 Image index functions . . . . . . . . . . . . . . . . . . . . . . 192
A.18 Execution control statements . . . . . . . . . . . . . . . . . . 192

Bibliography 197

Index 207
Preface

This book describes basic parallel algorithms encountered in scientific and
engineering applications. It implements them in modern Fortran using the
co-array programming model and it analyzes their performance.
The book’s intended audience includes graduate and advanced undergrad-
uate students who want an introduction to basic techniques. It also includes
experienced researchers who want to understand the relationships among
seemingly disparate techniques. The book assumes that the reader knows the
modern Fortran language, basic linear algebra, and basic techniques for solv-
ing partial differential equations.
The book contains many code examples. Each example has been tested,
but there is no warranty that the code is free of errors. Nor is there any warranty
that a particular example represents the best implementation. Reading code
without writing code, however, is not enough. The only way to learn parallel
algorithms is to write code and to test it. Modifying the sample codes is a
good way to experiment with alternative implementations.
The entries in the Bibliography provide a guide to the development of
parallel programming techniques over the last few decades. The list is not
exhaustive, and it does not provide an accurate historical record of how the
techniques evolved or who first formulated them. The entries simply provide
a guide to more detailed discussions of topics covered in the book.
Many thanks to John Reid whose encouragement and support helped make
this book possible. He worked through many of the technical details of the
co-array model and shepherded it through the rigorous standardization pro-
cess to make it part of the Fortran language. He read several versions of the
manuscript, and I greatly appreciate his suggestions for improvement.

Robert Numrich
Minneapolis, Minnesota
Chapter 1
Prologue

This book describes a set of fundamental techniques for writing parallel appli-
cation codes. These techniques form the basis for parallel algorithms frequently
used in scientific and engineering applications. Parallel computing by itself is
a very large topic as is scientific computing by itself. The book makes no claim
of comprehensive coverage of either topic, just a basic outline of how to write
parallel code for scientific applications.
All the examples in the book employ two fundamental techniques that are
part of every parallel programming model in one form or another:
• data decomposition
• execution control
The programmer must master these two techniques and may find them the
hardest part of designing a parallel application. The book applies these two
fundamental techniques to five fundamental algorithms:
• matrix-vector multiplication
• matrix factorization
• matrix transposition
• collective operations
• halo exchanges
It is not a complete list, but it is a list that every parallel code developer must
understand.
The book describes these techniques in terms of partition operators. The
programmer frequently encounters partition operators, either explicitly or im-
plicitly, in scientific application codes, and the techniques needed for new codes
are often variations of techniques encountered in previous codes. The specific
form of the partition operators becomes progressively more complicated as
the examples become more complicated. The book’s goal is to show that all
its examples fit into a single unified framework.
The book encourages the use of Fortran as a modern object-oriented lan-
guage. Parallel programming is an exercise in transcribing mathematical def-
initions for partition operators into small functions associated with Fortran
objects. The exchange of one data distribution for another is the exchange
of one set of functions with another set. This technique follows one of the
fundamental principles of object-oriented design.
This book demonstrates that the programmer needs to learn just a handful
of techniques that are variations on a common theme. As always, however, the
only way to learn to write parallel code is to write parallel code. And to test
it.
The interplay between two sets of indices makes the programmer’s job
difficult. One set describes the decomposition of a data structure. The other
set describes how the first set is distributed across a parallel computer. Perhaps
the statement attributed to Kronecker,

“God created the integers, all else is the work of man.”

applies just as well to the whole of parallel programming as it does to the
whole of mathematics [47].
Chapter 2
The Co-array Programming Model

2.1 A co-array program . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 3


2.2 Exercises . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 7

This chapter explains the essential features of the co-array programming
model. Programmers use co-array syntax to move data objects from one lo-
cal memory to another local memory. To ensure that the values moved from
one memory to another are the correct values, programmers insert explicit
synchronization statements to control execution order.
The co-array programming model follows the Single-Program-Multiple-
Data (SPMD) execution model. The run-time system creates multiple copies
of the same program and executes the statements in each copy asynchronously
in parallel. The co-array model calls each copy of the program an image. The
run-time system assigns a unique image index to each copy, and the number
of images remains fixed throughout the execution of a co-array program [56, 70].
The book uses imprecise expressions such as, “the image executes a state-
ment” to mean that “the run-time system executes a statement within a copy
of the program assigned to the image.” The second expression is more precise
than the first because an image is just a sequence of statements that can do
nothing on their own. The minor imprecision in the first expression, however,
is worth the gain in readability.

2.1 A co-array program


Before looking at specific parallel algorithms, the book first describes the
basic ideas behind the co-array programming model by examining the program
shown in Listing 2.1. The run-time system creates multiple copies of this
program, called images, assigns each image to physical hardware, and allocates
local memory for each image with affinity to the hardware. The run-time
system executes the statements independently for each image according to
the normal rules of the Fortran language.
Because each image is a copy of the same program, variables declared
in the program have the same names across all images. One variable in the
program, the integer variable x[:], is an allocatable co-array variable declared
with one unspecified co-dimension in square brackets. In addition to integer
variables, complex, character, logical or user-defined variables can be declared
as co-array variables. A co-array variable has the special property that it is
visible across images. At any time during execution, an image can reference
both the value of its own variable and the value of the variable assigned to any
other image. The other four variables, me,p,y,you, declared in the program
are normal variables. Their values are visible only within each image, and they
cannot be referenced by a remote image.

Listing 2.1: A co-array program.

program First
   implicit none
   integer, allocatable :: x[:]          !---- co-array variable ---!
   integer              :: me, p, y, you !---- normal variables ----!

   p  = num_images()                     !---- begin segment 1 ----!
   me = this_image()                     !                         !
   allocate( x[*] )                      !---- end segment 1   ----!

   you = me + 1                          !---- begin segment 2 ----!
   if (me == p) you = 1                  !                         !
   x = me                                !                         !
   sync all                              !---- end segment 2   ----!

   y = x[you]                            !---- begin segment 3 ----!
   write(*,"(' me: ',i5,'  my pal: ',i5)") me, y
   deallocate( x )                       !---- end segment 3   ----!

   :                                     !---- begin segment 4 ----!
   :                                     !                         !
end program First                        !---- end segment 4   ----!

The programmer needs to know the number of images and the value of the
local image index at run-time. Each image obtains these values by invoking
two intrinsic functions added to the language to support the co-array model
as replicated in Listing 2.2.

Listing 2.2: Intrinsic functions.

p  = num_images()
me = this_image()

The first function returns the number of images, and the second function
returns the image index of the image that invokes it. The image index has
a value between one and the number of images and uniquely identifies the
image. In all the examples that follow, the book uses the symbol p for the
number of images and the symbol me for the value of the local image index.

Listing 2.3: Allocating a co-array variable with default co-bounds.

allocate( x[*] )

The allocation statement, replicated in Listing 2.3, sets the upper and lower
co-bounds for the co-dimension. Specifying the co-bounds with an asterisk in
square brackets follows the Fortran convention for declaring a normal variable
with an assumed size. This convention allows the programmer to write code
that can run on different numbers of images without changing the source code,
without re-compiling and without re-linking. The implied lower co-bound is
one and the implied upper co-bound equals the number of images at run-time.
The programmer can override the default co-bound values as described in
Section A.3. If the lower co-bound is zero, for example, the upper co-bound
is the number of images minus one. Whatever the value for the lower co-
bound, the programmer is responsible for using valid co-dimension indices.
Compiler vendors may provide an option to check the run-time value of a co-
dimension index, but an out-of-bound value for a co-dimension index results
in undefined, often disastrous, behavior just as it does for an out-of-bound
value for a normal dimension index.
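As a minimal illustration of this override (the zero lower co-bound and the inquiry calls here are an illustrative sketch, not part of Listing 2.1; lcobound and ucobound are the standard co-bound inquiry functions):

integer, allocatable :: x[:]
integer :: lo, hi
allocate( x[0:*] )     ! lower co-bound 0; implied upper co-bound is num_images()-1
lo = lcobound(x, 1)    ! returns 0
hi = ucobound(x, 1)    ! returns num_images()-1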
In the program shown in Listing 2.1, each image picks a partner with image
index one greater than its own. Since the partner’s index cannot be greater
than the number of images, the last image picks the first image as its partner.
Each image sets the value of its local co-array variable to its own image index
as shown in Listing 2.4.

Listing 2.4: Reference to a local co-array variable.

x = me

A reference to a co-array variable without a co-dimension index is equivalent
to a reference with an explicit co-dimension index equal to the local image
index as shown in Listing 2.5.

Listing 2.5: Alternative reference to a local co-array variable.

x[me] = me

This convention is a fundamental feature of the co-array model. The default
view of memory is a local view that allows the programmer to write code
without redundant syntax. It also eliminates the need for extra analysis, both
at compile-time and at run-time, to recognize a purely local memory reference.
The programmer inserts execution control statements into a program to
impose an execution order across images. The programmer is responsible for
placing execution control statements correctly and may find this requirement
the most difficult aspect of the co-array model.
The program shown in Listing 2.1 consists of four segments determined by
three execution control statements plus the statement that ends the program.
The first segment begins with the first executable statement and ends with
the allocation statement. Every image must execute this statement because
the variable being allocated is a co-array variable. There is an implied barrier
that no image may cross until all have allocated the variable.
The second segment consists of all statements following the allocation
statement up to and including the sync all statement. Without this con-
trol statement, an image might try to obtain a value from its partner before
the partner has defined the value. The sync all statement guarantees that
each image has defined its value before any image references the value. When
an image executes a statement in the third segment, as shown in Listing 2.6,
it obtains the value defined by its partner in the second segment.

Listing 2.6: Accessing the value of a co-array variable assigned to a remote
image.

y = x[you]

The third segment ends with execution of the deallocation statement. Be-
cause the variable is a co-array variable, every image waits until all images
reach this statement before deallocating the variable. Otherwise, an image
might try to reference the value of a variable that no longer exists. The fourth
segment consists of other statements between the deallocation statement and
the end of program statement assuming there are no more control statements
in the program.

Listing 2.7: Output from the program shown in Listing 2.1.

me : 5 my pal : 6
me : 3 my pal : 4
me : 1 my pal : 2
me : 2 my pal : 3
me : 4 my pal : 5
me : 6 my pal : 7
me : 7 my pal : 1

Execution of this program using seven images might result in the output
shown in Listing 2.7. Each image has obtained the correct value from its
partner, but output from different images appears randomly in the standard
output file. The reason for this behavior is that the run-time system executes
the output statement independently for each image. It is required to write a
complete record for each image, with no intermixing of records from different
images, and to merge records from different images into a single file. The order
in which it merges the records, however, is implementation specific, and the
order may vary from execution to execution even on the same system.
To guarantee order in an output file, the programmer may need to restrict
execution of output statements to a single image that gathers data from other
images before writing to the file in a specific order as shown, for example, in
Listing 2.8. The alternative form of the output statement avoids the need for
an additional variable to hold the value obtained from the remote image.

Listing 2.8: Enforcing output order.

if (me == 1) then
   do q = 1, p
      write(*,"(' me: ',i5,'  my pal: ',i5)") q, x[q]
   end do
end if

2.2 Exercises
1. Change the example program shown in Listing 2.1 in the following ways.

• Remove the sync all statement and describe what happens.


• What happens if the variable y, just before the write statement, is
replaced by the co-array variable x?
• Change the code so that each image sends its image index to its
partner.
• Change the code so that one image broadcasts a variable to all
other images.
• Change the code so that each image obtains a variable from image
one.
• Pick one image and have it add together the values of some variable
across all images.
• Have all images add the values at the same time.
• Have one image add the values and broadcast it to the other images.

2. In Listing 2.8, have image p perform output in reverse order.


Chapter 3
Partition Operators

3.1 Uniform partitions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 9


3.2 Non-uniform partitions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 12
3.3 Row-partitioned matrix-vector multiplication . . . . . . . . . . . . . . . . . . . . . . . . . . . . 14
3.4 Input/output in the co-array model . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 16
3.5 Exercises . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 17

The first problem a programmer faces when implementing a parallel appli-
cation code is the data decomposition problem. In the co-array model, the
programmer partitions data structures into pieces and assigns each piece to
an image. The run-time system places each piece in the appropriate local mem-
ory and executes code independently for each image. The programmer uses
co-array syntax to move data between local memories and inserts execution
control statements to order statement execution among images.
Partition operators provide a mathematical description of the data de-
composition process. A partition operator cuts an index set into subsets and
induces a partition of a set of objects labeled by the index set, for example,
a partition of the elements of a vector labeled by the element indices. The
mathematical formulas that define a partition operator transcribe directly
into Fortran code within the co-array model.
Index manipulations make parallel programming hard. A serial algorithm
works with a set of data objects labeled by global indices. A parallel version
of the same algorithm works with subsets of the data objects labeled by local
indices. The map between global and local indices may be simple for simple
algorithms but becomes more complicated for more complicated algorithms.
Errors mapping indices, forward global-to-local and backward local-to-global,
are the source of many difficulties in the design of parallel algorithms. Precise
definitions for partition operators prevent some of these errors.

3.1 Uniform partitions


The fundamental object of interest in scientific computing is the finite set
of ordered integers,
N = {1, . . . , n} . (3.1)
This global index set labels a set of objects,

O = {o1 , . . . , on } . (3.2)

The order relationship is the natural integer order, but for some applications
a permutation may change the order relationship.
By default, the Fortran language assumes the value one for the lower bound
of the index set. It allows the programmer to override this default value replac-
ing it, for example, with zero when interfacing with other languages. Changing
the default lower bound, however, usually adds little benefit for algorithm de-
sign.
Parallel applications require the programmer to partition the global index
set into local index sets. If n, the size of the global index set, is a multiple of
p, the number of partitions, the size of the local index set,

m = n/p , (3.3)

is the same for each partition. The global base index for each partition has
the value,
k_0^\alpha = (\alpha - 1)m ,   \alpha = 1, \dots, p ,   (3.4)
and the local index set is a contiguous set of integers added to the base,

N^\alpha = \{ k_0^\alpha + 1, \dots, k_0^\alpha + m \} ,   \alpha = 1, \dots, p .   (3.5)

In scientific computing, the set of objects (3.2) is often a vector of real or
complex numbers,

x = \begin{pmatrix} x_1 \\ \vdots \\ x_n \end{pmatrix} .   (3.6)
The partition of the global index set induces a partition of the vector elements,

\begin{pmatrix} x^\alpha_1 \\ \vdots \\ x^\alpha_m \end{pmatrix} = \begin{pmatrix} x_{k_0^\alpha+1} \\ \vdots \\ x_{k_0^\alpha+m} \end{pmatrix} .   (3.7)

The numerical values of the vector elements are the same, but they are labeled
by the local index set,
L = {1, . . . , m} , (3.8)
rather than by the global index set.
Partitioning a vector is an operation in linear algebra,

x^\alpha = P^\alpha x ,   \alpha = 1, \dots, p .   (3.9)

The partition operator is represented by the rectangular matrix,

P^\alpha = \begin{pmatrix} 0 & \cdots & I^\alpha & \cdots & 0 \end{pmatrix} ,   (3.10)
with m rows and n columns. Each zero symbol represents a zero matrix of
size m × m, and the symbol I^\alpha represents the identity matrix of the same size.
The partition operation defined by formula (3.9), then, is the matrix-vector
multiplication,

\begin{pmatrix} x_{k_0^\alpha+1} \\ \vdots \\ x_{k_0^\alpha+m} \end{pmatrix} = \begin{pmatrix} 0 & \cdots & I^\alpha & \cdots & 0 \end{pmatrix} \begin{pmatrix} x_1 \\ \vdots \\ x_n \end{pmatrix} .   (3.11)

Partition operators define relationships among three sets of indices: the
global index set, the local index set, and the image index set. When the
number of partitions equals the number of images, the partition index and
the image index are the same. A vector element x_k, with global index k, maps
to image index α according to the formula,

\alpha = \left\lfloor (k-1)/m \right\rfloor + 1 ,   (3.12)

where the notation \lfloor\cdot\rfloor denotes the floor function. The same element considered
as a local element assigned to image α is labeled with a local index i according
to the formula,

i = k - k_0^\alpha .   (3.13)

On the other hand, a local index i assigned to partition α maps back to the
global index k according to the formula,

k = k_0^\alpha + i ,   i = 1, \dots, m .   (3.14)

The partition operator, therefore, defines a forward map from one global
index to two local indices, a partition index and a local index,

k → (α, i) , (3.15)

according to formulas (3.12) and (3.13), and a backward map from the two
local indices to the global index,

(α, i) → k , (3.16)

according to formula (3.14). Implementing a parallel algorithm requires precise
application of these maps.
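These maps transcribe directly into Fortran. The following is a minimal sketch for the uniform case only; the procedure names globalToLocal and localToGlobal are illustrative, not taken from the book:

! Forward map (3.12)-(3.13): global index k -> (alpha, i) for uniform size m.
subroutine globalToLocal(k, m, alpha, i)
   integer, intent(in)  :: k, m
   integer, intent(out) :: alpha, i
   alpha = (k-1)/m + 1        ! integer division supplies the floor in (3.12)
   i     = k - (alpha-1)*m    ! local index (3.13), with base k0 = (alpha-1)*m
end subroutine globalToLocal

! Backward map (3.14): (alpha, i) -> global index k.
pure integer function localToGlobal(alpha, i, m) result(k)
   integer, intent(in) :: alpha, i, m
   k = (alpha-1)*m + i
end function localToGlobal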
3.2 Non-uniform partitions


When the size of the global index set is not a multiple of the number of
partitions, the partition sizes are non-uniform and the index maps change.
There is more than one way to define these maps, but partition sizes should
be as close to uniform as possible, and the maps between indices should be
natural extensions of the maps for a uniform partition.
The formulas for the new maps depend on the remainder,

r = n mod p , (3.17)

and on the quantity,

m = \lceil n/p \rceil ,   (3.18)

where \lceil\cdot\rceil is the ceiling function. The size assigned to partition α obeys the
formula,

m^\alpha = \begin{cases} m & r = 0 , \; \alpha = 1, \dots, p \\ m & r > 0 , \; \alpha = 1, \dots, r \\ m-1 & r > 0 , \; \alpha = r+1, \dots, p \end{cases}   (3.19)

When the remainder is zero, the partition is uniform and the size is the same
for all images. When the remainder is not zero, the partition size for images
with indices less than or equal to the remainder equals the ceiling value. The
partition size for images with indices greater than the remainder is one less.
A global index 1 ≤ k ≤ n maps to partition index α by the formula,

\alpha = \begin{cases} \lfloor (k-1)/m \rfloor + 1 & r = 0 \\ \lfloor (k-1)/m \rfloor + 1 & k \le mr \\ \lfloor (k-r-1)/(m-1) \rfloor + 1 & k > mr \end{cases}   (3.20)

The global base index for partition α has the value,

k_0^\alpha = \begin{cases} (\alpha-1)m^\alpha & r = 0 , \; \alpha = 1, \dots, p \\ (\alpha-1)m^\alpha & r > 0 , \; \alpha = 1, \dots, r \\ (\alpha-1)m^\alpha + r & r > 0 , \; \alpha = r+1, \dots, p \end{cases}   (3.21)

and the local index has the value,

i = k - k_0^\alpha .   (3.22)

The local index maps back to the global index according to the rule,

k = k_0^\alpha + i ,   i = 1, \dots, m^\alpha .   (3.23)
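As a companion to the localSize function of Listing 3.2, a sketch of formula (3.21) might look as follows; the name globalBase is illustrative, not from the book:

! Global base index k0 for the partition owned by image alpha, formula (3.21).
pure integer function globalBase(n, p, alpha) result(k0)
   integer, intent(in) :: n, p, alpha
   integer :: m, r
   r = mod(n, p)              ! remainder (3.17)
   m = (n-1)/p + 1            ! ceiling(n/p), formula (3.18)
   k0 = (alpha-1)*m           ! the first r partitions have size m
   if (r /= 0 .and. alpha > r) k0 = (alpha-1)*(m-1) + r  ! the rest have size m-1
end function globalBase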
An alternative convention is based on the floor function,

m = \lfloor n/p \rfloor .   (3.24)

The partition sizes are the same for all but the last partition according to the
formula,

m^\alpha = \begin{cases} m & \alpha < p \\ n - (p-1)m & \alpha = p \end{cases}   (3.25)

With this alternative convention, the formulas for mapping indices have the
values,

\alpha = \lfloor (k-1)/m \rfloor + 1   (3.26)
k_0^\alpha = (\alpha-1)m   (3.27)
i = k - k_0^\alpha   (3.28)
k = k_0^\alpha + i   (3.29)
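A sketch of this alternative partition size, for comparison with Listing 3.2 (the name localSizeFloor is illustrative):

! Alternative partition size based on the floor function, formulas (3.24)-(3.25).
integer function localSizeFloor(n) result(m)
   integer, intent(in) :: n
   integer :: me, p
   p  = num_images()
   me = this_image()
   m  = n/p                            ! floor(n/p), formula (3.24)
   if (me == p) m = n - (p-1)*(n/p)    ! the last image takes the remainder
end function localSizeFloor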

[Figure 3.1 here: a log-log plot of partition size m^p on the vertical axis (10^0 to 10^3) versus number of images p on the horizontal axis (10^0 to 10^3).]

FIGURE 3.1: Partition size based on the floor function as a function of the
number of images for a fixed index set n = 1000. The solid line is a step
function representing the partition size assigned to images not equal to the
last image. The bullets represent the partition size assigned to the last image.

For the case n = 41, p = 9, the original definition of the partition oper-
ator yields sizes (5, 5, 5, 5, 5, 4, 4, 4, 4). The alternative definition yields sizes
(4, 4, 4, 4, 4, 4, 4, 4, 9). The disadvantage of the alternative definition is that all
images must allocate co-array variables with size 9 to accommodate the last
image even though all the other images need only size 4. In addition, since the
last image owns more data than the others, a workload imbalance may develop
as images wait for the last image. This problem becomes more important as
the number of images increases as shown in Figure 3.1.
It might be tempting to use the ceiling function in place of the floor func-
tion to define the partition size for all but the last image. Formulas for the
partition size, however, become more complicated. Cases exist where the par-
tition size is zero for the last image and something smaller than the ceiling
function for some images below the last one. This rule can still be used with
caution, but it may lead to errors and it may waste one image with nothing
to do.

3.3 Row-partitioned matrix-vector multiplication


Matrix-vector multiplication,

y = Ax , (3.30)

is an important operation that occurs frequently in parallel application codes.
It provides a good example of how partition operators describe parallel algo-
rithms. Application of a partition operator on both sides of Equation (3.30),

P^\alpha y = P^\alpha A x ,   \alpha = 1, \dots, p ,   (3.31)

yields the partitioned equation,

y^\alpha = A^\alpha x ,   (3.32)

for the result vector y^\alpha assigned to image α where the partitioned matrix,

A^\alpha = P^\alpha A ,   (3.33)

consists of the subset of rows assigned to image α. Figure 3.2 shows the matrix-
vector multiplication operation partitioned by rows with a one-to-one corre-
spondence between the partition index α and the image index.

[Figure 3.2 here: schematic of y^\alpha = A^\alpha x with the row block A^\alpha highlighted.]

FIGURE 3.2: Row-partitioned matrix-vector multiplication. Image α uses
the rows of the partitioned matrix A^\alpha assigned to it, ignoring the other rows
of the matrix. It generates a result vector in its local memory.

Listing 3.1 shows code for row-partitioned matrix-vector multiplication.
Once an image has initialized its piece of the matrix and the vector on the
right side, it computes its matrix-vector multiplication using only local data.
Each image executes the code independently for its own piece of the problem
with no interaction between images.

Listing 3.1: Row-partitioned matrix-vector multiplication.

integer, parameter :: n = globalSize
integer            :: m
real, allocatable  :: x(:), y(:), A(:,:)
m = localSize(n)
allocate( A(m,n) )
allocate( x(n) )
:                   ! - code to initialize A and x -!
y = matmul(A, x)    ! - automatic allocation of result vector -!

Listing 3.2 shows the function that computes the size of the partition
assigned to the image that invokes it. It is a direct transcription of formula
(3.19). Placing this calculation in a function of its own may seem superfluous
for such a straightforward formula. But this calculation occurs frequently in a code and is subject
to errors that may be hard to find. Furthermore, the programmer can change
this function to use an alternative definition for the partition size without
changing the rest of the code. In later applications, this kind of function
becomes a procedure associated with a class of objects.

Listing 3.2: Local partition size.

integer function localSize(n) result(m)
   integer, intent(in) :: n
   integer :: me, p, r
   p  = num_images()
   me = this_image()
   r  = mod(n, p)
   m  = (n-1)/p + 1
   if (r /= 0 .and. me > r) m = m - 1
end function localSize

The statement,

m = (n-1)/p + 1   ! -- ceiling(n/p) --!

computes the ceiling function for the ratio of two positive integers. For example,
with n = 10 and p = 3, it yields (10-1)/3 + 1 = 4 = ⌈10/3⌉.


Partitioned matrices are the basis for optimized versions of serial algo-
rithms with the partition size set equal to the size of a vector register, for
example, or to the size of a local cache. These algorithms typically use arrays
with one extra dimension corresponding to the partition index with a loop
through partitions such that a single processor computes each partition in
turn. Indeed, one of the motivations for the co-array model is that the SPMD
execution model removes the need for the extra dimension. Each image works
on its own piece of the problem, and co-array syntax appears only in specific
places to initiate communication between images. The serial version of the
code often becomes the basis for the parallel version with the full problem
size replaced by the partitioned problem size.

3.4 Input/output in the co-array model


The programmer is still faced with the problem of initializing the data. If
the full matrix is small enough to fit in one local memory, the programmer
might initialize the matrix by letting one image read the whole matrix into
its memory and then let the other images pull their parts into their own local
memory. It is more likely, however, that the matrix is too big to fit into a
single local memory. Indeed, a common reason for writing parallel code is
that the data objects have become too large for the available local memory.

Listing 3.3: Reading a dense matrix from a file.

function readA(fileName, n) result(A)
   character(len=*), intent(in) :: fileName
   integer, intent(in)          :: n
   real, allocatable            :: A(:,:)
   integer                      :: k, k0, m, me, p, r
   real, allocatable            :: temp(:)[:]   !.. co-array buffer ..!
   p  = num_images()
   me = this_image()
   r  = mod(n, p)               !.. remainder ..!
   m  = localSize(n)
   k0 = (me-1)*m                !.. global base ..!
   if (r /= 0 .and. me > r) k0 = k0 + r
   allocate( A(m,n) )
   if (me == 1) open(unit=10, file=fileName)
   allocate( temp(n)[*] )       !.. hidden barrier ..!
   do k = 1, n
      if (me == 1) read(10,*) temp(1:n)
      sync all
      A(1:m,k) = temp(k0+1:k0+m)[1]
      sync all
   end do
   if (me == 1) close(unit=10)
   deallocate( temp )           !.. hidden barrier ..!
end function readA

Listing 3.3 shows code that uses a temporary co-array buffer to hold each
column of the matrix as image one reads one column at a time from a file.
Each image executes the sync all statement waiting for image one to finish
reading the data, and then each image pulls its piece of the matrix into its
own local memory. The second sync all statement guarantees that image
one does not read new data into the buffer before all images have obtained
their data. Notice the calculation of the global base index defined by formula
(3.21) as each image determines its piece of the current column.
The function shown in Listing 3.3 assigns image one to open a file, read the
data, and make data available to the other images. This technique is very com-
mon in parallel applications, but serialized input may become a performance
bottleneck. To avoid this problem, the programmer may want to associate
a procedure pointer to a library procedure that performs input in parallel
perhaps from a library using another programming model.
The current version of the co-array model allows only one image at a time
to open a particular file. It need not be image one, but only one at a time. Some
future version of the language may allow the programmer to use direct access
files with each row, or column, of the matrix held in a separate record. Each
image could then position itself at the records corresponding to its partition
and read from the file independently in parallel. Section A.13 contains a more
detailed description of input/output for the co-array model.

3.5 Exercises
1. For the case n = 1000 with p = 37, use the ceiling function ⌈n/p⌉ to
define a base partition size. Use that size for as many images as possible.
What is the size on the last image? On the second to last image?

2. Modify the code sample in Listing 3.3 so that image one sends the data
to the other images.

3. Modify the code in Listing 3.3 such that image p reads the data from
the file.

4. Each image obtains only its own piece of the result vector from the row-
partitioned matrix-vector multiplication code. How could each image
obtain the full result?
Chapter 4
Reverse Partition Operators

4.1 The partition of unity . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 19


4.2 Column-partitioned matrix-vector multiplication . . . . . . . . . . . . . . . . . . . . . . . . 21
4.3 The dot-product operation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 23
4.4 Extended definition of partition operators . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 24
4.5 Exercises . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 25

In addition to partitioning data structures in the forward direction, from the
global index to the local index, the programmer needs to recombine data
structures in the reverse direction, from the local index back to the global
index. The partition operators defined in Chapter 3 provided a recipe in the
forward direction. Reverse partition operators defined in this chapter provide a
recipe in the reverse direction. Almost all operations on data structures found
in parallel applications are based on some mixture of forward and reverse
partition operations.

4.1 The partition of unity


The set of forward partition operators defined in Chapter 3,

P^\alpha ,   \alpha = 1, \dots, p ,   (4.1)

labeled with superscript indices, induces a corresponding set of reverse oper-
ators,

P_\alpha ,   \alpha = 1, \dots, p ,   (4.2)

labeled with subscript indices, under the constraint,

\sum_{\alpha=1}^{p} P_\alpha P^\alpha = I ,   (4.3)

where I is the n × n identity operator and n is the size of the global index set.
This constraint is called a partition of unity. It describes the recovery of
a global data structure from the local data structures. Indeed, application of
the identity operator (4.3) to a vector,

\left( \sum_{\alpha=1}^{p} P_\alpha P^\alpha \right) x = x ,   (4.4)

changes nothing. Define the local vector,

x^\alpha = P^\alpha x ,   (4.5)

as before and use the associative property of multiplication to obtain the
result,

\sum_{\alpha=1}^{p} P_\alpha x^\alpha = x .   (4.6)

To recover the global vector, therefore, each image must apply the reverse
operator to its local piece of the vector and then it must sum together pieces
from the other images.
The partition of unity is a fundamental tool for the development of parallel
algorithms. Whenever a formula, like formula (4.6), involves partition indices
not equal to the local partition index, the formula implies communication
between images. The mathematical expression for a partitioned algorithm,
therefore, explicitly contains the interactions between processors within itself.
The definitions for these operators are not unique, but given the forward
operator as in Section 3.1,

P^\alpha = \begin{pmatrix} 0 & \cdots & I^\alpha & \cdots & 0 \end{pmatrix} ,   (4.7)

the reverse operator must be the transposed operator,

P_\alpha = \begin{pmatrix} 0 \\ \vdots \\ I_\alpha \\ \vdots \\ 0 \end{pmatrix} ,   (4.8)

under the constraint imposed by the partition of unity. This matrix repre-
sentation of the reverse operator has n rows and m^\alpha columns. The symbol 0
represents a zero matrix and the matrix I_\alpha is the identity matrix for partition
α. Applied to a vector x^\alpha of length m^\alpha, the reverse operator produces a vector
of length n,

\begin{pmatrix} 0 \\ \vdots \\ x^\alpha \\ \vdots \\ 0 \end{pmatrix} = \begin{pmatrix} 0 \\ \vdots \\ I_\alpha \\ \vdots \\ 0 \end{pmatrix} x^\alpha ,   (4.9)
with all zeros except in the part corresponding to partition α. The reader
may readily verify, by direct matrix multiplication and summation, that the
partition of unity (4.3) is satisfied for these definitions of the forward and
reverse operators.
Partition operators are not projection operators. Nor are they inverses of
each other. The forward operator maps a vector of length n to a vector of
length m^\alpha,

P^\alpha : V_n \to V_{m^\alpha} .   (4.10)

The reverse operator maps a vector of length m^\alpha to a vector of length n,

P_\alpha : V_{m^\alpha} \to V_n .   (4.11)

The important property of these operators is that the product operators,

P_\alpha P^\alpha : V_n \to V_n ,   (4.12)

form a set of orthogonal projection operators [79]. Summed over the index α,
these projection operators yield the identity operator. Nothing is lost during
the forward and reverse partition operations.
The forward partition operation is an example of the scatter operation,
and the reverse partition operation is an example of the gather operation.
Chapter 5 discusses these operations in more detail implemented as collective
operations.

4.2 Column-partitioned matrix-vector multiplication


Many parallel algorithms follow directly from the constraint imposed by
the partition of unity. Its insertion between a matrix and a vector,

y = A \left( \sum_{\alpha=1}^{p} P_\alpha P^\alpha \right) x ,   (4.13)

yields a formula for a column-partitioned algorithm for matrix-vector multi-
plication. The associative rule for multiplication yields the formula,

y = \sum_{\alpha=1}^{p} A_\alpha x^\alpha ,   (4.14)

where the matrix,

A_\alpha = A P_\alpha ,   (4.15)

holds the columns assigned to image α, and the vector,

x^\alpha = P^\alpha x ,   (4.16)
holds the piece assigned to image α. To verify the meaning of the symbol A P_\alpha,
observe that,

P_\alpha = (P^\alpha)^T ,   (4.17)

implies the identity,

A P_\alpha = (P^\alpha A^T)^T .   (4.18)

The matrix P^\alpha A^T on the right consists of the rows of A^T assigned to the
partition α. These rows of the transposed matrix are just the columns of the
original matrix A assigned to image α.

[Figure 4.1 here: schematic of the partial result y_\alpha = A_\alpha x^\alpha with the column block A_\alpha highlighted.]

FIGURE 4.1: Column-partitioned matrix-vector multiplication. Image α ap-
plies the column-partitioned matrix A_\alpha to the partitioned vector x^\alpha to produce
a partial result. The images work independently, but to obtain the full result,
each image must sum the partial results from the other images.

Figure 4.1 shows how each image computes a partial result vector,

y_\alpha = A_\alpha x^\alpha ,   (4.19)

with length equal to the full dimension. The images do not interact, but to
obtain the full result, they must sum together the partial results. Listing
4.1 shows code for the summation. The code employs staggered references to
remote images to reduce pressure on local memory should every image try to
access the same memory at the same time.

Listing 4.1: Summing the full result for column-partitioned matrix-vector
multiplication.

integer :: alpha, me, n, p, q
real, allocatable :: buffer(:)[:]
real, allocatable :: y(:)
:
p  = num_images()
me = this_image()
n  = ...
:
allocate( y(n) )
allocate( buffer(n)[*] )
:
buffer = matmul(A, x)
y(1:n) = buffer(1:n)
sync all
alpha = me
do q = 1, p-1
   alpha = alpha + 1
   if (alpha > p) alpha = alpha - p
   y(1:n) = y(1:n) + buffer(1:n)[alpha]
end do

4.3 The dot-product operation


Another important application of the partition of unity is calculation of
the dot product,

d = u^T v ,   (4.20)

between two vectors u and v of size n. To obtain a parallel version of this
operation, insert the partition of unity (4.3) between the two vectors to obtain

d = u^T \left( \sum_{\alpha=1}^{p} P_\alpha P^\alpha \right) v ,   (4.21)

and use the associative property of multiplication to write the dot product in
partitioned form,

d = \sum_{\alpha=1}^{p} (u^T P_\alpha)(P^\alpha v) .   (4.22)

From definition (4.17) for the reverse partition operator, the dot product be-
comes

d = \sum_{\alpha=1}^{p} (P^\alpha u)^T (P^\alpha v) .   (4.23)

Define the partitioned vectors, u^\alpha = P^\alpha u and v^\alpha = P^\alpha v, and write

d = \sum_{\alpha=1}^{p} (u^\alpha)^T (v^\alpha) .   (4.24)

This formula says that each image first computes its local dot product,

d^\alpha = (u^\alpha)^T v^\alpha ,   (4.25)

and then participates in a sum,

d = \sum_{\alpha=1}^{p} d^\alpha ,   (4.26)

across all images such that they all obtain the same value of the dot product.
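Formulas (4.25) and (4.26) transcribe into a few lines of co-array code. The following is a minimal sketch, assuming every image calls the function with its local pieces; the name dotProduct is illustrative, not from the book:

! Parallel dot product, formulas (4.25)-(4.26); all images obtain the same d.
real function dotProduct(u, v) result(d)
   real, intent(in) :: u(:), v(:)   ! local pieces u^alpha and v^alpha
   real, save       :: dLocal[*]    ! co-array scalar holding the partial sum
   integer          :: alpha
   dLocal = dot_product(u, v)       ! local dot product (4.25)
   sync all                         ! all partial sums must be defined before summing
   d = 0.0
   do alpha = 1, num_images()       ! sum across all images (4.26)
      d = d + dLocal[alpha]
   end do
   sync all                         ! do not redefine dLocal until every image has read it
end function dotProduct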

4.4 Extended definition of partition operators


The previous analysis of parallel algorithms depended only on the fact that
the forward and reverse partition operators define a partition of unity. The
analysis does not change with a change in the definition of the operators as
long as the new operators also define a partition of unity. The formulas for
mapping global indices to local indices and local indices to global indices are
more complicated but still straightforward.
One way to extend the definition of a partition operator is to create an
equivalence class of operators based on a set of orthogonal operators,

Q^T Q = Q Q^T = I .   (4.27)

Since the original operators define a partition of unity, this orthogonality
condition is unchanged by the insertion of the identity,

Q^T \left( \sum_{\alpha=1}^{p} P_\alpha P^\alpha \right) Q = I .   (4.28)

The extended forward operator,

Q^\alpha = P^\alpha Q ,   (4.29)

and the corresponding reverse operator,

Q_\alpha = Q^T P_\alpha = (P^\alpha Q)^T ,   (4.30)

still form a partition of unity,

\sum_{\alpha=1}^{p} Q_\alpha Q^\alpha = I .   (4.31)

The extended operators mix the elements of a vector before cutting it into
pieces,

v^\alpha = Q^\alpha v = P^\alpha Q v .   (4.32)

The product operators,

Q_\alpha Q^\alpha : V_n \to V_n ,   (4.33)

still define a set of orthogonal projection operators. Summing them together
unmixes the elements of the partitioned vectors and reassembles the original
vector. The analysis of partitioned algorithms does not change; only the spe-
cific implementation changes because the index maps are more complicated.
The full generality of these extended operators is seldom used in parallel
codes. One particular case, however, where the unitary operator is a permu-
tation,

Q = \Pi ,   (4.34)

is quite common. It occurs in Chapter 13 for cyclic distributions and again in
Chapter 14 for finite element meshes.
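For the cyclic case, the permuted forward map deals global indices out to images like cards. A minimal sketch of its index maps follows; the dealing convention and the names cyclicForward and cyclicBackward are illustrative assumptions, not necessarily the convention used in Chapter 13:

! Cyclic distribution: image alpha owns global indices alpha, alpha+p, alpha+2p, ...
subroutine cyclicForward(k, p, alpha, i)
   integer, intent(in)  :: k, p
   integer, intent(out) :: alpha, i
   alpha = mod(k-1, p) + 1      ! owning image index
   i     = (k-1)/p + 1          ! local index on that image
end subroutine cyclicForward

pure integer function cyclicBackward(alpha, i, p) result(k)
   integer, intent(in) :: alpha, i, p
   k = (i-1)*p + alpha          ! global index recovered from (alpha, i)
end function cyclicBackward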