
Efficient Shared Memory Parallelization of FDTD

Alec Hammond

1 Introduction

Large scale photonic integration shows potential to transform fields like quantum computing, ma-
chine learning, and telecommunications. In order to design compact and efficient integrated photonic
devices, however, current design methodologies must be revamped to encourage faster design cycles.
Topology optimization, for example, requires iteratively solving Maxwell’s equations, a numerically
expensive task that is clearly the bottleneck toward faster device design.
To overcome this challenge, I accelerated an open source finite difference time domain (FDTD)
code that I co-developed (1). The codebase already supports MPI for extremely large simulations.
That being said, load balancing issues often prevent efficient scaling for even modest simulations and
design cycles (2).
While OpenMP is not a substitute for MPI, it offers several advantages for the classes of problems
I intend to solve. MPI can scale the simulation domain to essentially any compute configuration,
which is essential for problems that do not tractably fit inside a single node. The majority of
inverse-design problems, however, are compact and fit comfortably within single-node resources.
Naively leveraging MPI in this case wastes a great deal of computation, and the waste compounds
linearly because the optimization algorithm calls the solver every iteration. OpenMP enables dynamic
load balancing and eliminates the costly communication paradigm required by MPI, thereby accelerating
the overall design process.
In summary, I revamped the existing codebase to allow for OpenMP parallelism. Specifically, I
rewrote 11 loop macros, refactored over 106 different parallel loops, adopted a "first-touch"
initialization scheme, coalesced memory accesses to leverage vectorization (basic SIMD operations),
and profiled the performance across four different examples. Unfortunately, the final
OpenMP build still underperformed the previous MPI build, despite several weeks of refactoring
the original codebase. I performed several numerical experiments on quantities like the OpenMP
scheduler and thread chunk size to characterize the drawbacks of the OpenMP approach.

2 Previous Work & Relevant Background

The most common nanophotonic design tools leverage the finite-difference time-domain (FDTD)
method. Since the FDTD algorithm is easy to implement, it is straightforward to build codes that
support multiple physics (heterogeneous material types, boundary layers, dispersion, nonlinearities,
adjoint methods, etc.). Despite its simplicity, the FDTD algorithm is a fullwave solver that produces
accurate solutions to Maxwell’s equations, provided the simulation resolution is sufficiently high.
The simulation time itself increases as O(n^4) with resolution. Consequently, accurate simulations
require intense computational resources to resolve the requested geometry and physics. Luckily, since
the algorithm simply loops over the simulation grid applying various finite difference stencils,
parallelization is also rather straightforward. In fact, several parallel FDTD codes, commercial and
open-source, already exist. For example, gprMax is a popular FDTD code geared toward ground
penetrating radar that supports both CUDA (3) and OpenMP (4) acceleration. Similarly, B-CALM is
a GPU-enabled FDTD code designed for plasmonic device simulations (which require extremely high
simulation resolutions) (5). Other commercial codes, like the xFDTD solver (6), the CST solver (7), and
SEMCAD (8), all support both OpenMP and GPU parallelism.
While several parallel codes already exist, they all rely on simple timestepping kernels and lack
the flexibility needed to model photonic integrated circuits. For example, none of these codes simul-
taneously support nonlinear materials, dispersion, and adjoint variable methods. Consequently, I
opted to parallelize a code that I co-developed and that supports the features I need.
MIT Electromagnetic Equation Propagation (MEEP) is an open-source FDTD code that supports
everything I need to efficiently perform topology optimization on nanophotonic devices. As previ-
ously mentioned, the codebase already supports MPI parallelism for distributed computation. Each
of MEEP’s kernels iterates over the simulation domain using a series of abstracted loop macros that
are defined in a uniform header file. This is largely possible because even if the kernels perform
wildly different computations, they all operate over the same fundamental data arrays (electric fields,
magnetic fields, etc). The MPI engine then chunks the arrays and allocates these subdomains to par-
ticular processes. While all processors will perform time-stepping (the most expensive kernel) on
their chunk, not all processors will be required to perform other kernels on their chunk. Even in the
simple case where only time-stepping is required for all chunks, it’s important to note that the cost
of time-stepping heavily depends on the contents of the respective chunk.
Consequently, the current MPI implementation suffers significantly from load-balancing issues.
Deciding how to divide the work among the available processors before the simulation starts is close
to impossible. Figure 1 compares the relative time it takes for each processor to complete various
kernels. The computation time for each worker varies wildly. Some processors even sit idle simply
because of the uniform chunk splitting required by the MPI implementation.

Figure 1: The differences between static and dynamic load balancing. The designer first prescribes
a simulation domain with various physics involved. The complexity of the domain is difficult to
determine before running the simulation. Consequently, most static domain-decomposition
implementations (e.g., using MPI) split the computational cell at the start of the simulation. A better
scheme would allow threads to dynamically resize the domains as some regions require more work
than others. This essentially equates to splitting the domain by computational cost.

Alternatively, OpenMP supports various “dispatch” paradigms where several smaller chunks are
passed to threads as they become available. In effect, the domain is split by cost, rather than split
evenly. This ensures that no processor has to idle and wait while its neighbors finish their respective
tasks. In a sense, OpenMP is able to dynamically load balance the simulation domain. Even more
importantly, the chunk division can change between kernels (e.g., time-stepping vs. Fourier
transforming) and from iteration to iteration.
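As a concrete illustration (not MEEP's actual kernel), the following sketch shows the difference; update_cell is a hypothetical per-point stencil whose cost varies across the grid, and with schedule(dynamic) free threads pick up remaining chunks instead of waiting:

    #include <cmath>
    #include <cstddef>
    #include <vector>

    // Hypothetical per-point update whose cost varies across the grid
    // (standing in for regions with PML, dispersion, nonlinearities, etc.).
    static double update_cell(double v, std::ptrdiff_t i) {
      int extra_work = (i % 7 == 0) ? 50 : 1;  // deliberately uneven cost
      for (int k = 0; k < extra_work; ++k) v = std::cos(v) + 1e-3;
      return v;
    }

    void step_fields(std::vector<double> &field) {
      const std::ptrdiff_t n = static_cast<std::ptrdiff_t>(field.size());
      // schedule(static) splits the index range evenly up front, much like the
      // fixed chunk assignment of the MPI build; schedule(dynamic, 1024) hands
      // out 1024-iteration chunks to threads as they become free, so expensive
      // regions no longer leave the other threads idle.
    #pragma omp parallel for schedule(dynamic, 1024)
      for (std::ptrdiff_t i = 0; i < n; ++i)
        field[i] = update_cell(field[i], i);
    }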
Also note that almost as much time is spent on MPI communication as on time-stepping itself. While
OpenMP will also incur some overhead, it should be significantly less than in the current
implementation.

2.1 Computational Complexity and Kernel Analysis

The core time-stepping algorithm is the most expensive kernel and deserves the most attention. We
can quantify the total time of the current MPI implementation using

\[
T(\vec{p}) \;=\; \underbrace{\frac{W(\vec{p}) \times N}{P}}_{\text{computation}} \;+\; \underbrace{C(\vec{p}) \times S}_{\text{communication}} \tag{1}
\]

where N is the total number of grid points, P is the number of processors, and S is the maximum
surface area of any processor’s subdomain. (Since time-stepping is synchronous, the C × S term is
determined by the process that requires the most communication.) The surface area S can be minimized
to some extent, but the communication cost relative to the computation always grows as the number of
processes increases. The actual kernel cost W depends on the simulation domain $\vec{p}$ but is
predictable to first order. As with the communication cost, the slowest processor dominates the
computation time since all other processors must wait.
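To make the trade-off concrete, suppose (purely for illustration) that the P subdomains are roughly cubic, so each holds N/P grid points and exposes a surface of order (N/P)^{2/3}. The ratio of the communication term to the computation term in Eq. (1) then grows with the process count,

\[
\frac{C(\vec{p})\,S}{W(\vec{p})\,N/P} \;\sim\; \frac{C}{W}\,\frac{(N/P)^{2/3}}{N/P} \;=\; \frac{C}{W}\left(\frac{P}{N}\right)^{1/3},
\]

so even a perfectly balanced MPI decomposition pays a relatively larger communication penalty as more processes are added. This is precisely the term a shared-memory loop avoids.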
By looping over our domain using OpenMP, we significantly reduce the overhead caused by the
C × S communication cost (since threads share a single address space and no explicit message passing
is required). Furthermore, we can attempt to perform more FLOPs per data transfer during the main
kernel, which further increases the efficiency of the computation term.
That being said, there is only so much we can do to maximize our computation/communication
ratio. It is well known that the FDTD algorithm is heavily bound by memory bandwidth. Moving
entirely “off the roof” of the roofline model is not feasible without completely rewriting the Yee grid
scheme (which is akin to starting from scratch). Instead, we can still enjoy significant performance
gains simply by parallelizing with OpenMP and caching where necessary.

3 Parallelization Overview

The code’s fundamental looping routines are rather complicated. The FDTD algorithm relies on a
staggered-grid mechanism known as the Yee grid (9). Different fields and their respective compo-
nents are stored at different points in space and time. Since the grid is still fundamentally Cartesian,
however, we can use the same loop mechanism for all fields and materials, provided it is general
enough to handle all of these edge cases.
This flexibility comes with a price, however. Parallelizing these loop macros requires some sophis-
ticated "trickery" to ensure all of the existing stencils behave correctly. For example, we can collapse
some of our nested loops (since we are simply looping over three dimensions of a row-major array)
using an OpenMP clause. However, we must first abstract out the initial array starting-point calcu-
lations by unwrapping a single loop iteration. Tricks like these seem unnecessary, but they let us
preserve the variable scope of the existing macro calls without having to change the entire codebase
(over 300,000 lines of C++ code). Listing 1 illustrates the refactoring of one particular loop macro
using multiple collapsed loops, unwrapped loops, and SIMD clauses.

#define PS1LOOP_OVER_IVECS(gv, is, ie, idx)                                             \
  _Pragma("unroll(1)")                                                                  \
  for (ptrdiff_t loop_is1 = (is).yucky_val(0), loop_is2 = (is).yucky_val(1),            \
       loop_is3 = (is).yucky_val(2), loop_n1 = ((ie).yucky_val(0) - loop_is1) / 2 + 1,  \
       loop_n2 = ((ie).yucky_val(1) - loop_is2) / 2 + 1,                                \
       loop_n3 = ((ie).yucky_val(2) - loop_is3) / 2 + 1,                                \
       loop_d1 = (gv).yucky_direction(0), loop_d2 = (gv).yucky_direction(1),            \
       loop_s1 = (gv).stride((meep::direction)loop_d1),                                 \
       loop_s2 = (gv).stride((meep::direction)loop_d2), loop_s3 = 1,                    \
       idx0 = (is - (gv).little_corner()).yucky_val(0) / 2 * loop_s1 +                  \
              (is - (gv).little_corner()).yucky_val(1) / 2 * loop_s2 +                  \
              (is - (gv).little_corner()).yucky_val(2) / 2 * loop_s3,                   \
       dummy_first = 0; dummy_first < 1; dummy_first++)                                 \
  _Pragma("omp parallel for collapse(2)")                                               \
  for (ptrdiff_t loop_i1 = 0; loop_i1 < loop_n1; loop_i1++)                             \
    for (ptrdiff_t loop_i2 = 0; loop_i2 < loop_n2; loop_i2++)                           \
      _Pragma("omp simd")                                                               \
      for (ptrdiff_t loop_i3 = 0; loop_i3 < loop_n3; loop_i3++)                         \
        _Pragma("unroll(1)")                                                            \
        for (ptrdiff_t idx = idx0 + loop_i1 * loop_s1 + loop_i2 * loop_s2 + loop_i3,    \
             dummy_last = 0; dummy_last < 1; dummy_last++)

Listing 1: Sample loop parallelization using complicated macros and folding.


In addition to completely rewriting the main loop macros, I had to modify over 100 other loops
that did not directly call these loop macros. This was primarily to ensure the code follows a "first-
touch" memory placement policy.

4 Results

4.1 Hardware and Build Details

I ran all simulations on an Intel(R) Xeon(R) Gold 6248 CPU @ 2.50 GHz with 40 cores (2 sockets,
20 cores/socket). The node has 190 GB of RAM @ 2.933 GHz (6 slots, 2 controllers per NUMA domain
at 64 bits each). Hyperthreading is disabled. The L1 cache is 32 KB, the L2 cache is 1 MB, and the L3
cache is 28 MB (shared across the socket).
All software was built using GCC 8.3.0, which ships with OpenMP 4.5 support. The MPI versions of the
code were built using mpicc and mpicxx from MVAPICH 2.3.2. The main FDTD engine was interfaced
through SWIG bindings and Python scripts to streamline the example batching.

4.2 Examples

In order to thoroughly benchmark any OpenMP performance gains, I constructed four different
example problems that each involve heterogeneous physics stencils. Figure 2 illustrates each example.
The first example involves a nanophotonic grating coupler, which is primarily a scattering problem.
The second example is a photonic crystal cavity problem. The third example is a liquid crystal display
diffraction efficiency problem. The final example is an OLED light extraction efficiency problem. Some
examples focus on absorption while others focus on emission. All examples contain complicated
materials, boundary layers, and sources, which require multiple stencils throughout the simulation
grid.

Figure 2: The four different example problems used to thoroughly test the new OpenMP implemen-
tation. (a) A nanophotonic grating coupler. (b) A nanophotonic crystal cavity. (c) A liquid crystal
display diffraction grating. (d) An organic light emitting diode (OLED) stack.

Each example was run on the Xeon 6248 node at various thread counts. Figure 3 compares the
raw execution time, speedup, and efficiency for each example. As expected, each example performs
rather differently. The nanophotonic grating coupler example experiences very little maximum
speedup (2.5x), whereas the liquid crystal display example (liq) experiences almost 25x maximum
speedup. The other two examples topped out at about 5x speedup.

Figure 3: Timing results, speedup and efficiency for all four examples running with the OpenMP
acceleration. The liquid crystal display example (liq) experienced the most speedup, followed by the
nanobeam cavity, the OLED example, and then the grating coupler.

The differences in speedup and efficiency between the example problems are somewhat expected, as
they involve different physics stencils and call each stencil a different number of times. For example,
the grating coupler has several complicated material profiles that require calling the corresponding
stencil several times. The liquid crystal display, however, has much simpler material profiles and con-
sequently calls each stencil fewer times. The memory footprint of each example is relatively uniform.
That being said, it is important to investigate the impact of memory footprint and other OpenMP
parameters that could affect the performance of the code.

4.3 OpenMP Parameter Tuning

In order to better understand the limitations of the current OpenMP implementation, I ran several
more simulations by sweeping through various OpenMP hyperparameters. In particular, I analyzed
the impact of the simulation footprint in memory, the efficiency dependency on various schedulers,
and the effect of thread chunking. To be consistent, I used the same OLED example for all test cases
since it neither performed the best nor the worst compared to all four examples.
With regard to memory footprint, we would expect the time per timestep to depend directly
on the size of the data arrays in memory, but we would not expect any change in efficiency or speedup.
Figure 4 illustrates the numerical experiment I performed to quantify this relationship. As expected,
the time per step increased linearly with memory size, but the speedup was consistent across the
different sizes. We can therefore rule out the possibility that our initial example simulations were
simply too small to benefit from parallelism.

Figure 4: Effect of simulation size on OpenMP speedup. The OLED example with different-sized
domains (600 MB, 1.4 GB, 3.8 GB, 18.5 GB, and 67.4 GB in memory) was run at different thread
counts. While the actual runtime depends directly on the memory footprint (left plots), the
speedup and efficiency are practically independent of it (right two plots).

Next, I ran the OLED example (18.5 GB) using different runtime schedulers. In practice, we
would expect significant differences in performance from scheduler to scheduler. As shown in
Figure 5, however, there is practically no difference between the dynamic, static, guided, and auto
schedulers. This is rather troublesome for two reasons. First, we would expect significant changes
between them because they operate very differently. Second, the whole point of our OpenMP
implementation is to leverage the dynamic load balancing facilitated by the scheduler. Consequently,
we can conclude that there must be significant computational overhead somewhere else.
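For reference, this kind of sweep requires no recompilation if the loops use the runtime schedule; the sketch below shows one way it can be driven (the kernel body and chunk sizes are illustrative, not MEEP's, and sweep_kernel is a hypothetical stand-in):

    #include <cstddef>
    #include <vector>

    // Illustrative kernel only.  With schedule(runtime), the scheduler kind and
    // chunk size are read from the OMP_SCHEDULE environment variable, e.g.
    //   OMP_SCHEDULE="static"       ./oled_example
    //   OMP_SCHEDULE="dynamic,512"  ./oled_example
    //   OMP_SCHEDULE="guided,64"    ./oled_example
    // so a single binary covers both the scheduler and chunk-size experiments.
    void sweep_kernel(std::vector<double> &field) {
      const std::ptrdiff_t n = static_cast<std::ptrdiff_t>(field.size());
    #pragma omp parallel for schedule(runtime)
      for (std::ptrdiff_t i = 0; i < n; ++i)
        field[i] *= 0.5;  // stand-in for a stencil update
    }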

Figure 5: Effect of OpenMP scheduler on overall speedup. The OLED example was run with the four
basic OpenMP schedulers (dynamic, static, guided, and auto). Interestingly, there is no performance
dependence on any of the schedulers.

Finally, I ran an experiment to determine the influence of the chunk size assigned to each thread’s
workload. Since our build supports AVX-512 SIMD instructions, and our code explicitly leverages that
instruction set, we would expect some speedup for chunk sizes that fit in cache and can exploit this
form of local parallelism. Figure 6, however, once again demonstrates a relative independence of
chunk size. Once again we can conclude that there is significant overhead somewhere else in the
code impeding the gains we would hope to achieve.

Figure 6: Effect of thread chunk size on speedup. There is no net performance dependence on chunk
size for the OLED example.

4.4 Comparison with MPI

While our speedup isn’t as high as we would hope, it still might be better than the current state of
the MPI build. To test this, I ran each example on multiple threads/processes for the MPI build and
the OpenMP build separately. Figure 7 illustrates the comparison.

Figure 7: Comparison of the OpenMP builds and the MPI builds for the grating coupler example
(row 1), the liquid crystal display example (row 2), the OLED example (row 3), and the nanobeam
example (row 4). For each test case, I tracked the net runtime in seconds (column 1), the relative
speedup vs. threads/processes (column 2), and the corresponding efficiency (column 3).

This last experiment revealed many important facts about both the MPI build and our new
OpenMP build. First, some of the MPI builds experienced superlinear speedup (efficiency greater
than 100%) at smaller processor counts. This implies that these particular configurations are better
leveraging the caches and SIMD instruction sets. The MPI communication cost quickly drowns out
these performance gains as the processor count increases, however. That being said, the communication
cost never fully surpasses the computational cost, as there is still strong speedup when using all of the
processor cores.
We also note that, for one process/thread, the MPI build is always significantly faster than the
corresponding OpenMP build. While this is surprising at first, it helps explain why the scheduler and
chunk size do not seem to impact the performance of our OpenMP build. As suspected, there is
significant overhead in initializing the OpenMP thread pool. Unlike most codes that leverage
OpenMP, our code reopens a thread pool between 40 and 120 times each timestep, depending on the
physics of the example problem. This is largely because of the way the heterogeneous stencils
must be solved: they depend on earlier states of themselves at different grid points, so we must loop
over the entire grid several times. This is also why we see somewhat better speedup with the liquid
crystal display example, where fewer stencils require spawning a new thread pool. That being said,
even this example performs worse than the raw MPI build.
Ideally, we would spawn one thread pool at the start of our simulation and keep that thread pool
alive, handing work to each thread as it became available. However, that sort of paradigm would
require a significant refactoring of the current codebase.
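A rough sketch of that hypothetical restructuring (not the current code; the kernels and 1-D stencil are toy stand-ins) is a single enclosing parallel region whose threads pick up orphaned worksharing loops for each kernel, so the pool is created once instead of once per kernel:

    #include <cstddef>

    // Hypothetical restructuring: one parallel region per run, with each kernel
    // expressed as an orphaned "omp for" worksharing loop.  The implicit barrier
    // at the end of each loop preserves the ordering the staggered stencils
    // require, but the thread team itself is reused.
    static void update_e(double *e, const double *h, std::ptrdiff_t n) {
    #pragma omp for schedule(static)
      for (std::ptrdiff_t i = 1; i < n; ++i)
        e[i] += 0.5 * (h[i] - h[i - 1]);  // toy 1-D curl stencil
    }

    static void update_h(double *h, const double *e, std::ptrdiff_t n) {
    #pragma omp for schedule(static)
      for (std::ptrdiff_t i = 0; i < n - 1; ++i)
        h[i] += 0.5 * (e[i + 1] - e[i]);  // toy 1-D curl stencil
    }

    void timestep(double *e, double *h, std::ptrdiff_t n, int steps) {
    #pragma omp parallel  // thread pool created once, reused for every kernel
      for (int t = 0; t < steps; ++t) {
        update_e(e, h, n);  // orphaned omp for: work is shared across the team
        update_h(h, e, n);
      }
    }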
The MPI implementation, by contrast, follows a true single-program, multiple-data (SPMD) model.
From the beginning, each process is handed a subset of the grid and only loops over that subset.
Communication between processes only occurs at the boundary (ghost) pixels. While there is still
overhead in spawning these processes and exchanging the respective boundary data, it is significantly
less than the overhead of spawning multiple thread workers dozens of times each timestep.
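For contrast, a minimal sketch of the kind of boundary exchange the MPI build relies on (a toy 1-D halo swap, not MEEP's actual chunk code; exchange_ghosts is a hypothetical helper) shows how little data crosses process boundaries relative to the interior work:

    #include <cstddef>
    #include <mpi.h>
    #include <vector>

    // Toy 1-D halo exchange: each rank owns n interior cells plus one ghost cell
    // on each side.  Only two boundary values are communicated per step; the
    // O(n) interior stencil update is purely local to the rank.
    void exchange_ghosts(std::vector<double> &f, MPI_Comm comm) {
      int rank, size;
      MPI_Comm_rank(comm, &rank);
      MPI_Comm_size(comm, &size);
      const int left  = (rank > 0)        ? rank - 1 : MPI_PROC_NULL;
      const int right = (rank < size - 1) ? rank + 1 : MPI_PROC_NULL;
      const std::size_t n = f.size() - 2;  // f[0] and f[n+1] are ghost cells

      // Send my leftmost interior cell to the left neighbor; receive the right
      // neighbor's leftmost interior cell into my right ghost (and vice versa
      // in the second call).
      MPI_Sendrecv(&f[1], 1, MPI_DOUBLE, left, 0,
                   &f[n + 1], 1, MPI_DOUBLE, right, 0, comm, MPI_STATUS_IGNORE);
      MPI_Sendrecv(&f[n], 1, MPI_DOUBLE, right, 1,
                   &f[0], 1, MPI_DOUBLE, left, 1, comm, MPI_STATUS_IGNORE);
    }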

5 Conclusion

I’ve refactored an existing FDTD code to leverage OpenMP acceleration. In all, I rewrote 11 loop
macros, refactored over 106 different parallel loops, adopted a "first-touch" initialization scheme,
and coalesced memory accesses to leverage vectorization. I generated four different examples that
exercise various aspects of the available physics to test the robustness of the solution. I found
surprisingly different results for each example; however, every example underperformed its MPI
equivalent. This is largely due to the number of thread pools that must be spawned over and over again.
I tried refactoring the code even further to reduce the number of thread spawns, but realized
this would not sufficiently limit the threading overhead. In fact, the multi-stencil paradigm itself
inhibits the benefits of dynamic load balancing, since so many different stencils must be called. Future
efforts should focus on other data-parallel forms of acceleration, e.g., a GPU. While a GPU still implies
a static load distribution determined before the simulation starts, it allows many more workers to split
the load on a hardware platform with significantly more memory bandwidth (which is important
since the algorithm is heavily memory bound). The stenciling algorithm itself could also be refactored
to allow caching of intermediate values, making the routine less memory bound and more compute
bound. However, the diversity of the stencils once again makes this a difficult problem to solve.
Regardless of the somewhat disappointing outcome, I am satisfied that my current codebase is
performing exceptionally well. I can continue to launch parallel jobs knowing that my implementa-
tion is efficient. My OpenMP branch is documented in (10) and all of my experimental results are
found in (11).

References

[1] A. F. Oskooi, D. Roundy, M. Ibanescu, P. Bermel, J. D. Joannopoulos, and S. G. Johnson, “Meep:
A flexible free-software package for electromagnetic simulations by the FDTD method,” Computer
Physics Communications, vol. 181, no. 3, pp. 687–702, 2010.

[2] A. Oskooi, C. Hogan, A. M. Hammond, M. Reid, and S. G. Johnson, “Factorized machine learn-
ing for performance modeling of massively parallel heterogeneous physical simulations,” arXiv
preprint arXiv:2003.04287, 2020.

[3] C. Warren, A. Giannopoulos, A. Gray, I. Giannakis, A. Patterson, L. Wetter, and A. Hamrah, “A
CUDA-based GPU engine for gprMax: Open source FDTD electromagnetic simulation software,”
Computer Physics Communications, vol. 237, pp. 208–218, 2019.

[4] C. Warren, A. Giannopoulos, and I. Giannakis, “gprMax: Open source software to simulate
electromagnetic wave propagation for ground penetrating radar,” Computer Physics Communi-
cations, vol. 209, pp. 163–170, 2016.

[5] P. Wahl, D.-S. Ly-Gagnon, C. Debaes, D. A. Miller, and H. Thienpont, “B-CALM: An open-source
GPU-based 3D-FDTD with multi-pole dispersion for plasmonics,” Optical and Quantum Electronics,
vol. 44, no. 3–5, pp. 285–290, 2012.

[6] xFDTD solver. https://www.remcom.com/xfdtd-3d-em-simulation-software.

[7] CST solver. https://www.3ds.com/products-services/simulia/products/cst-studio-suite/.

[8] SEMCAD solver. https://speag.swiss/products/semcad/solutions/.

[9] K. Yee, “Numerical solution of initial boundary value problems involving Maxwell’s equations in
isotropic media,” IEEE Transactions on Antennas and Propagation, vol. 14, no. 3, pp. 302–307,
1966.

[10] A. Hammond, Multithreading branch, 2020. https://github.com/smartalecH/meep/tree/multithreading.

[11] A. Hammond, Experimental data, 2020. https://github.com/smartalecH/meep_openmp.
