pyC2Ray: A flexible and GPU-accelerated Radiative Transfer Framework
for Simulating the Cosmic Epoch of Reionization

Patrick Hirling, Michele Bianco, Sambit K. Giri, Ilian T. Iliev, Garrelt Mellema, Jean-Paul Kneib
Abstract

Detailed modeling of the evolution of neutral hydrogen in the intergalactic medium during the Epoch of Reionization, $5 \leq z \leq 20$, is critical in interpreting the cosmological signals from current and upcoming 21-cm experiments such as the Low-Frequency Array (LOFAR) and the Square Kilometre Array (SKA). Numerical radiative transfer codes provide the most physically accurate models of the reionization process. However, they are computationally expensive as they must encompass enormous cosmological volumes while accurately capturing astrophysical processes occurring at small scales ($\lesssim \rm Mpc$). Here, we present pyC2Ray, an updated version of the massively parallel ray-tracing and chemistry code, C2-Ray, which has been extensively employed in reionization simulations. The most time-consuming part of the code is calculating the hydrogen column density along the path of the ionizing photons. Here, we present the Accelerated Short-characteristics Octahedral ray-tracing (ASORA) method, a ray-tracing algorithm specifically designed to run on graphical processing units (GPUs). We include a modern Python interface, allowing easy and customized use of the code without compromising computational efficiency. We test pyC2Ray on a series of standard ray-tracing tests and a complete cosmological simulation with volume size $(349\,\rm Mpc)^{3}$, mesh size of $250^{3}$ and approximately $10^{6}$ sources. Compared to the original code, pyC2Ray achieves the same results with negligible fractional differences, $\sim 10^{-5}$, and a speedup factor of two orders of magnitude.
Benchmark analysis shows that ASORA takes a few nanoseconds per source per voxel and scales linearly for an increasing number of sources and voxels within the ray-tracing radii.

keywords: Radiative Transfer, Epoch of Reionization, ray-tracing, GPU methods, 21-cm, Cosmology, Intergalactic medium
journal: Astronomy & Computing
Affiliations:

[first] Institute of Physics, Laboratory of Astrophysics, Ecole Polytechnique Fédérale de Lausanne (EPFL), Observatoire de Sauverny, 1290 Versoix, Switzerland

[second] Nordita, KTH Royal Institute of Technology and Stockholm University, Hannes Alfvéns väg 12, SE-106 91 Stockholm, Sweden

[third] Astronomy Centre, Department of Physics & Astronomy, Pevensey III Building, University of Sussex, Falmer, Brighton BN1 9QH, United Kingdom

[fourth] The Oskar Klein Centre, Department of Astronomy, Stockholm University, AlbaNova, SE-10691 Stockholm, Sweden

1 Introduction

The Epoch of Reionization (EoR) is a period of significant interest in the history of the Universe, as it marks the appearance of the very first sources of radiation that drove the transition of the intergalactic medium (IGM) from its primordial cold and neutral state to the present-day hot and highly ionized one (see e.g. Furlanetto et al., 2006; Gorbunov and Rubakov, 2011; Dayal and Ferrara, 2018, for reviews about this era). While indirect observational evidence, such as high-redshift quasar spectra (e.g. Bosman et al., 2022) and the cosmic microwave background (CMB) radiation (e.g. Planck Collaboration et al., 2020), situates the EoR at redshifts between about 5 and 30, its main characteristics are still unknown (Pritchard and Loeb, 2012; Barkana, 2016). Current and upcoming interferometric radio telescopes, such as the Low-Frequency Array (LOFAR; van Haarlem et al., 2013), Hydrogen Epoch of Reionization Array (HERA; DeBoer et al., 2017), Murchison Widefield Array (MWA; Wayth et al., 2018) and Square Kilometre Array (SKA; Mellema et al., 2013), are expected to uncover the details of this key event in cosmic history by detecting the distribution of the redshifted 21-cm signal in the IGM, produced by the spin-flip transitions in neutral hydrogen (Pritchard and Loeb, 2012; Zaroubi, 2013). Accurate modeling of the EoR, which is needed to interpret the observational constraints provided by these experiments, will require performing detailed numerical radiative transfer (RT) and radiation hydrodynamics (RHD) studies on large cosmological scales ($\gtrsim 100$ Mpc). These simulations are challenging because the EoR is a non-local process, and the underlying RT equation contains angular, spatial, and frequency dimensions. Various modeling methods exist, a review of which may be found in, e.g., Gnedin and Madau (2022).

Today, most fully numerical RT codes can be divided into two main classes: moment-based and ray-tracing methods. The former works by considering the hierarchy of angular moments of the RT equation, with some ‘closure relation’ to limit the number of equations to be solved, and treats the radiation as a fluid (e.g. Aubert and Teyssier, 2008). This makes coupling to hydrodynamics natural and, from a computational perspective, has the huge benefit of being independent of the number of ionizing sources in the simulation. On the other hand, moment methods suffer from increased diffusion and unrealistic shadows on optically thick objects. A few examples of codes using moment-based methods are OTVET (Gnedin and Abel, 2001), RAMSES-RT (Rosdahl et al., 2013) and AREPO-RT (Kannan et al., 2019). Although they combine N-body, hydrodynamic and radiative feedback, they tend to be computationally expensive and cannot simulate the required large volumes. Iliev et al. (2014) show that we require simulations with a minimum volume size of $\sim 100$ cMpc to model the 21-cm signal and cover the large field of view expected by SKA and its precursors (e.g. Mertens et al., 2020; Trott et al., 2020; HERA Collaboration, 2023). Moreover, the small mass sources ($\sim 10^{8}\,M_{\odot}$), which are expected to drive reionization (e.g. Nebrin et al., 2023; Gelli et al., 2023; Atek et al., 2024), are often not resolved by these simulations.

Ray-tracing methods take a more physical approach by casting rays around each source and modeling how the radiation propagates, i.e., is absorbed and scattered, along those rays. The photo-ionization rate occurring at any point in space is then determined by the number of absorptions between the source and said point, normally expressed as the optical depth between the source and that point. This approach can potentially be more accurate and less diffusive than moment methods but is quite expensive, as the cost of ray-tracing generally scales linearly with the number of radiating sources. Thus, in practice, the number of sources that can be considered has been, until recent years, severely limited by the available computational power. C2-Ray (Mellema et al., 2006), ZEUS-MP (Whalen and Norman, 2006), CRASH (Ciardi et al., 2001), SPHRAY (Altay et al., 2008), LICORNE (Semelin et al., 2007), ART (Nakamoto et al., 2001), FLASH-HC (Rijkhorst et al., 2006) are a few notable examples of ray-tracing-based codes. Moment and ray-tracing methods have been compared extensively (Iliev et al., 2006, 2009). The main differences are due to numerical diffusion for the different treatments of the energy equation in moment-based methods and how the multi-frequency radiation is implemented. The advantages of using one over the other have been shown to depend greatly on the problem and context. That being said, because EoR simulations require huge volumes and large numbers of ionizing sources (Kaur et al., 2020; Giri et al., 2023), developing more efficient RT methods, especially ray-tracing-based ones, is highly desirable.

In recent years, there has been a significant surge in the use of general-purpose GPUs for numerical scientific research. These devices have enabled remarkable performance improvements when used to develop applications for problems that can be divided into numerous simple and independent tasks suitable for parallel processing. Consequently, GPU acceleration has been integrated in various astrophysics and cosmological-related tools (e.g. Ocvirk et al., 2016; Potter et al., 2016; Rácz et al., 2019; Cavelan et al., 2020; Wang and Meng, 2021). The ATON (Aubert and Teyssier, 2010) and EMMA (Aubert et al., 2015) codes are the first applications of GPU-accelerated algorithms for radiative transfer codes in the context of extra-galactic astrophysics. To our knowledge, this technology has not yet been imported to short-characteristic ray-tracing methods, thus making the current work a first.

Given the success of GPUs in accelerating ray-tracing tasks in computer graphics (Owens et al., 2008; Nickolls and Dally, 2010; Navarro et al., 2014), it is reasonable to explore their application to ray-tracing problems in astrophysics. This motivates our work, where we introduce an Accelerated Short-characteristics Octahedral ray-tracing (ASORA) method designed specifically for C2-Ray. By incorporating GPU methods, we anticipate significant performance enhancements and more efficient simulations, thus opening up new possibilities for research and analysis. Our work aims to bridge the gap between the potential of GPU acceleration and the requirements of ray-tracing tasks in astrophysics, providing a promising avenue for further advancements in this domain.

C2-Ray is a 3D ray-tracing radiative transfer code designed for simulating the EoR and was initially developed by Mellema et al. (2006) (hereafter: M06). It conserves photons at a voxel-by-voxel level, allowing for large, optically thick grid voxels while maintaining accuracy. Furthermore, the method allows for long time steps, even surpassing the voxel-crossing time of ionization fronts. It has been extensively used in EoR simulations and updated to include photoheating, X-ray radiation, and helium chemistry (Friedrich et al., 2012; Ross et al., 2017, 2019). C2-Ray is written in Fortran90 and designed for massively parallel systems, utilizing a hybrid MPI and OpenMP approach for efficient radiation propagation. The ionizing sources are distributed over MPI processes, and each of these processes further employs OpenMP threading to propagate radiation in a domain-decomposed manner.

As a stand-alone code, C2-Ray is a post-processing code (the algorithm can, however, be used in conjunction with a hydrodynamics code, as was, for example, done in Arthur et al. (2011) and Medina et al. (2014)). It acts on the output snapshots of a previous (cosmological) hydrodynamical simulation and propagates radiation on the gas fields of these snapshots. As is detailed in the following sections, it is also a grid code, meaning that the gas fields must be projected onto this grid through some smoothing method. Sources are identified in the initial simulation via a variety of models. Currently, the typical approach is to run a halo finder on each snapshot and use a physical model to translate a halo into a radiating source. The update to C2-Ray in this work comprises two main aspects:

  1. GPU-Accelerated ray-tracing Method: The original ray-tracing method used by C2-Ray is not well-suited for GPU parallelization. A new algorithm based on the same short-characteristics scheme has been developed to address this limitation. This new method is specifically designed for running on GPUs, enabling efficient computation of column densities, which is the most computationally intensive task in the radiative transfer (RT) method. The GPU implementation leverages massive multi-threading capabilities, resulting in significantly faster performance than the CPU method. This new algorithm is written as a C++/CUDA (e.g. Garland et al., 2008) library with Python bindings for ease of use and integration.

  2. Python Wrapper and Interface: The highly-optimized Fortran90 implementation of C2-Ray excels at computationally intensive and time-consuming tasks, such as the solving of chemistry equations and, until now, ray-tracing. However, due to its compiled and statically typed nature, Fortran is less suited for all the parts of the code that require frequent tweaking, such as the radiation source implementation, interfacing, I/O operations, cosmological model, and more generally the setup of each particular simulation. These tasks contain most of the conceptual baggage of future simulations but only represent a negligible fraction of the computational workload. Thus, to enhance usability and flexibility, we decided to wrap the time-critical core Fortran subroutines of C2-Ray and rewrite the non-time-consuming parts of the code in Python, making frequent use of standard libraries.

As a result, users can now write an entire C2-Ray simulation as a Python script, making it easier to tweak parameters and add new features without frequently recompiling the core Fortran subroutines. These updates enable more efficient GPU utilization for critical computations and improve the overall accessibility and versatility of the C2-Ray code through Python scripting and interface enhancements.

This paper is structured as follows. In §2, we describe how reionization is modeled and summarize how the C2-Ray method works. In §3, we describe the ray-tracing method used, present our newly developed ASORA algorithm, and briefly discuss the new Python wrapping and interface to the code. Then, in §4, the updated code is tested on standard idealized situations and benchmarked to determine how much performance improvement is achieved. The source code of pyC2Ray is publicly available at https://github.com/cosmic-reionization/pyC2Ray.

2 Simulating Cosmic Reionization

To study the EoR, we need to model the time evolution of the ionization state of the intergalactic medium (IGM) within a cosmological framework. This involves solving a system of chemistry equations that track the evolution of the ionization state of primordial species, such as hydrogen and helium. These equations take into account various physical processes, including photoionization, collisional excitation, recombination, heating, and cooling (e.g. Furlanetto et al., 2006).

In this paper, we will focus on the simplest case, considering only hydrogen. This choice is justified because hydrogen constitutes the major part of the IGM. The original C2-Ray code includes extensions also to consider helium ionization and multi-frequency photo-heating (Friedrich et al., 2012), and we plan to incorporate these extensions into pyC2Ray gradually. The primary objective of this paper is to present an update to the general ray-tracing method.

The ionization state of the hydrogen gas is described by the following chemistry equation (e.g. Choudhury and Ferrara, 2006; Choudhury, 2009),

$$\frac{dx_{\mathrm{HII}}}{dt} = (1 - x_{\mathrm{HII}})\left(\Gamma + n_{e}\,C_{\rm H}(T)\right) - x_{\mathrm{HII}}\,n_{e}\,\alpha_{\rm H}(T), \tag{1}$$

where $x_{\mathrm{HII}}$ is the fraction of ionized hydrogen, $n_{e}$ is the electron number density, $\Gamma$ is the photo-ionization rate per unit time, and $C_{\rm H}(T)$ and $\alpha_{\rm H}(T)$ are the collisional ionization and recombination coefficients for ionized hydrogen and free electrons, at temperature $T$. C2-Ray uses the on-the-spot (OTS) approximation, which assumes that the diffused photons resulting from recombination to the ground state are reabsorbed locally and, thus, solely accounted for by using a different value for $\alpha_{\rm H}$ (e.g., Ritzerveld, 2005).

The photo-ionization rate ΓΓ\Gammaroman_Γ quantifies the effect of ionizing UV radiation on the gas and is determined by the distribution of radiation sources. To illustrate this point, consider the simple situation of a single isotropic ionizing source in a homogeneous medium. As photons propagate away from the source in all directions, they form a spherical "shell" of ionizing radiation. The photons are absorbed by gas particles, which subsequently become ionized. These photo-ionizations also attenuate the strength of the radiation further away from the source, in addition to the attenuation occurring due to geometrical effects alone. Photo-ionization is also countered by recombinations. Together, these phenomena result in the formation of a spherical ionized bubble around the source, also known as a Strömgren sphere.
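As a rough illustration of the balance described above, the classical Strömgren radius follows from equating the photon emission rate of the source to the total recombination rate inside a fully ionized bubble, $r_S = \left(3\dot{N}_\gamma / 4\pi\alpha_B n_{\rm H}^2\right)^{1/3}$. The short sketch below evaluates this expression. The function name and the numerical inputs (photon rate, gas density, and a case-B recombination coefficient appropriate for gas at roughly $10^4$ K) are illustrative assumptions and are not taken from pyC2Ray.

```python
import math

KPC_IN_CM = 3.0857e21  # one kiloparsec in centimetres


def stromgren_radius_cm(ndot_gamma, n_h, alpha_b=2.59e-13):
    """Stromgren radius [cm] of a fully ionized bubble in uniform hydrogen gas.

    ndot_gamma : ionizing photons emitted per second (assumed constant)
    n_h        : hydrogen number density [cm^-3]
    alpha_b    : recombination coefficient [cm^3 s^-1] (assumed value, ~1e4 K gas)
    """
    # Balance: ndot_gamma = (4/3) * pi * r^3 * alpha_b * n_h^2
    return (3.0 * ndot_gamma / (4.0 * math.pi * alpha_b * n_h**2)) ** (1.0 / 3.0)


# Illustrative inputs only: a single source in a static, homogeneous medium.
r_s = stromgren_radius_cm(ndot_gamma=5e48, n_h=1e-3)
print(f"Stromgren radius: {r_s / KPC_IN_CM:.2f} kpc")
```

The radius grows only as the cube root of the photon rate, which is why a large number of faint sources can matter as much as a few bright ones for the overall ionized volume.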

In EoR simulations, more than one source is typically present, and the medium is distinctly inhomogeneous, leading to a much more complicated situation. The number of these sources during the EoR depends critically on the size of the volume and minimum mass of source haloes. We typically start with fewer than a few hundred sources at high redshift ($z \gtrsim 20$) and reach a few tens of millions at low redshift ($z \approx 6$). Below, we first summarize the method used in C2-Ray to solve Equation 1, and then discuss in detail the computation of the photoionization rates.

2.1 Summary of the C2-Ray Method

To solve the chemistry equation (Equation 1), one could, in principle, use a finite-differencing scheme and assume all rates to be constant over a reasonably short timestep. This approach is used by, e.g., Grackle (Smith et al., 2017) to solve very complex chemistry networks. The problem here lies in the photoionization rate $\Gamma$. It is determined by the amount of ionizing photons arriving at the target point where Equation 1 is considered. This amount depends directly on how radiation is absorbed along the path from its source to the target point, which in turn depends on the density $n_{\mathrm{H}}$ and ionization state $x_{\mathrm{HII}}$ of the medium along this path. This means that $\Gamma$ is strongly dependent on the solution variables of the problem and that this dependence is also highly non-local. For the finite-differencing scheme to be accurate, this implies very stringent constraints on the timestep size, especially in the presence of fast-moving ionization fronts (I-fronts).

C2-Ray overcomes this problem using an alternative approach, illustrated schematically in Figure 1. As is argued in M06, when recombinations and collisional ionizations are neglected, the solution of Equation 1 over any timestep $\Delta t$ depends only on the time-averaged photoionization rate within that timestep, denoted by $\langle\Gamma\rangle$. Furthermore, only small deviations arise when collisions and recombinations are included, as is tested in M06. The idea behind the C2-Ray algorithm is to converge to the correct $\langle\Gamma\rangle$ within a given $\Delta t$ by iterating between a Ray-tracing Step, which computes $\langle\Gamma\rangle$ based on the currently assumed solution for the time-averaged ionization state $\langle x_{\mathrm{HII}}\rangle$ of the whole medium, and a Chemistry Step, which computes an updated $\langle x_{\mathrm{HII}}\rangle$ based on the new $\langle\Gamma\rangle$. This is illustrated in Figure 1 by the long vertical black arrow, which goes through a convergence test to determine whether the iteration needs to be repeated.
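The alternation between the two steps can be sketched as a simple fixed-point loop. The example below is a toy under stated assumptions, not the pyC2Ray implementation: the function names, the convergence criterion (maximum absolute change of the time-averaged ionized fraction), the plane-parallel 1D geometry without geometric dilution, and the chemistry without recombinations or collisions are all hypothetical simplifications chosen to keep the example self-contained.

```python
import numpy as np


def evolve_timestep(x0, raytrace, chemistry, tol=1e-4, max_iter=100):
    """Alternate ray-tracing and chemistry until <x_HII> stops changing."""
    x_avg = np.array(x0, dtype=float)
    for iteration in range(1, max_iter + 1):
        gamma = raytrace(x_avg)                  # rates from the current guess of <x_HII>
        x_avg_new, x_end = chemistry(x0, gamma)  # updated <x_HII> and end-of-step x_HII
        change = np.max(np.abs(x_avg_new - x_avg))
        x_avg = x_avg_new
        if change < tol:                         # assumed convergence test
            break
    return x_end, x_avg, iteration


# Toy 1D problem: a row of cells lit from the left (illustrative values only).
N_CELLS = 16
SIGMA = 6.3e-18   # cross-section [cm^2], treated as frequency independent here
N_H = 1e-3        # hydrogen density [cm^-3]
DS = 3.0857e20    # cell size [cm]
GAMMA_0 = 1e-11   # unattenuated rate at the first cell [1/s]
DT = 3.15e13      # timestep [s]


def toy_raytrace(x_avg):
    """Attenuate the rate by the neutral column in front of each cell."""
    n_col_cell = (1.0 - x_avg) * N_H * DS
    n_col_in = np.concatenate(([0.0], np.cumsum(n_col_cell)[:-1]))
    return GAMMA_0 * np.exp(-SIGMA * n_col_in)


def toy_chemistry(x0, gamma):
    """Photo-ionization only: x(t) = 1 - (1 - x0) * exp(-gamma * t)."""
    g = np.maximum(gamma * DT, 1e-300)  # avoid 0/0 in fully shielded cells
    x_end = 1.0 - (1.0 - x0) * np.exp(-g)
    x_avg = 1.0 - (1.0 - x0) * (-np.expm1(-g)) / g
    return x_avg, x_end


x_start = np.full(N_CELLS, 1e-4)
x_end, x_avg, n_iter = evolve_timestep(x_start, toy_raytrace, toy_chemistry)
print(f"converged after {n_iter} iterations")
```

In this toy the rate in a cell depends only on the cells in front of it, so the loop settles after at most one pass per cell plus one. The timestep is many ionization times long in the cells near the source, which is the regime the time-averaged formulation is meant to handle.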

The chemistry step itself is not entirely trivial, as it still relies on being able to solve the differential equation. The method used in C2-Ray is based on Schmidt-Voigt and Koeppen (1987), who argue that when $n_{e}$, $\Gamma$, $C_{\rm H}$ and $\alpha_{\rm H}$ are assumed to be constant, an analytical solution exists for Equation 1. Using this solution, the time-averaged ionization state can be expressed as

$$\langle x\rangle = x_{eq} + (x_{0} - x_{eq})\left(1 - e^{-\Delta t/t_{i}}\right)\frac{t_{i}}{\Delta t} \tag{2}$$

$$x_{eq} = \frac{\Gamma + n_{e}\,C_{\rm H}}{\Gamma + n_{e}\left(C_{\rm H} + \alpha_{\rm H}\right)} \tag{3}$$

$$t_{i} = \left[\Gamma + n_{e}\left(C_{\rm H} + \alpha_{\rm H}\right)\right]^{-1}. \tag{4}$$

Here, $x_{0}$ is the ionization state at the beginning of the timestep, and $x_{eq}$ is the equilibrium solution, i.e., for $dx_{\mathrm{HII}}/dt = 0$, while $t_{i}$ is a constant time scale employed for the time-averaged inhomogeneous solution. Note also that $\Gamma$ has been used instead of $\langle\Gamma\rangle$ to represent the time-averaged photoionization rate to simplify the notation. Since the non-time-averaged rate is never used in the algorithm, this notation is used from now on. C2-Ray uses this solution and iterates for the electron density $n_{e}$ (which depends on $x_{\mathrm{HII}}$ through roughly $n_{e} \sim x_{\mathrm{HII}}\,n_{\mathrm{H}}$) until $\langle x_{\mathrm{HII}}\rangle$ converges. The thick horizontal arrow within the chemistry step in Figure 1 illustrates this second iterative process. Again, a convergence criterion is implicitly used to determine when to end the iteration.
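Equations 2 to 4, together with the inner iteration on the electron density, can be sketched for a single voxel as follows. This is a minimal illustration under stated assumptions, not the Fortran routine wrapped by pyC2Ray: the function names, the hydrogen-only closure $n_e = \langle x_{\mathrm{HII}}\rangle n_{\rm H}$, the relative-change convergence test, and the numerical inputs are all hypothetical.

```python
import math


def analytic_solution(x0, gamma, n_e, c_h, alpha_h, dt):
    """Equations 2-4 for fixed rates; returns (<x>, x at the end of the step)."""
    rate = gamma + n_e * (c_h + alpha_h)
    if rate == 0.0:                  # nothing ionizes or recombines
        return x0, x0
    x_eq = (gamma + n_e * c_h) / rate            # Equation 3
    t_i = 1.0 / rate                             # Equation 4
    decay = -math.expm1(-dt / t_i)               # 1 - exp(-dt / t_i), accurate for small dt
    x_avg = x_eq + (x0 - x_eq) * decay * t_i / dt     # Equation 2
    x_end = x_eq + (x0 - x_eq) * math.exp(-dt / t_i)  # solution evaluated at t = dt
    return x_avg, x_end


def chemistry_step(x0, gamma, n_h, c_h, alpha_h, dt, tol=1e-8, max_iter=100):
    """Iterate on n_e = <x> * n_h (hydrogen only, assumed) until <x> converges."""
    x_avg, x_end = x0, x0
    for _ in range(max_iter):
        n_e = x_avg * n_h
        new_avg, x_end = analytic_solution(x0, gamma, n_e, c_h, alpha_h, dt)
        converged = abs(new_avg - x_avg) <= tol * max(abs(new_avg), 1e-30)
        x_avg = new_avg
        if converged:
            break
    return x_avg, x_end


# Illustrative inputs: strong ionizing flux and a timestep of many ionization times.
x_avg, x_end = chemistry_step(x0=1e-4, gamma=1e-12, n_h=1e-3,
                              c_h=0.0, alpha_h=2.59e-13, dt=1e15)
print(f"<x_HII> = {x_avg:.5f}, x_HII(end) = {x_end:.5f}")
```

With these inputs the timestep spans about a thousand ionization times, so the end-of-step value sits at the equilibrium of Equation 3 while the time average stays slightly below it, reflecting the short interval at the start of the step when the voxel was still neutral.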

The ray-tracing step requires further consideration, as it is the focus of the present work. It is discussed in detail in §2.2 and §3.1. The main takeaway from this section is that the C2-Ray method makes it possible to use very long timesteps while still remaining accurate even in the presence of fast-moving I-fronts.

Figure 1: Flowchart representation of the method used by C2-Ray. The figure shows the procedure for a single time-step in which the ionized fraction of hydrogen $x_{\rm HII}$ is evolved for the whole 3D grid. The method can be divided into a "ray-tracing" and a "chemistry" step, and multiple iterations of either typically occur in a single time step.

2.2 Computing Rates

In the "ray-tracing step" introduced above, the properties of ionizing sources are used together with the knowledge of the ionization state of the medium inside a simulation volume to compute the photo-ionization rate $\Gamma$ occurring at any point in space. An ionizing source of specific luminosity $L_{\nu}$ located at a point $\vec{p}_{\star}$ produces at a target point $\vec{p}$ a (time-averaged) photo-ionization rate given by

$$\Gamma = \frac{1}{4\pi r^{2}}\int_{\nu_{th}}^{\infty}\frac{L_{\nu}\,\sigma_{\nu}\,e^{-\langle\tau_{\nu}\rangle}}{h\nu}\,d\nu, \tag{5}$$

where $\nu_{th}$ is the threshold frequency for photoionization ($h\nu_{th} = 13.6$ eV), $\langle\tau_{\nu}\rangle \equiv \tau_{\nu}(r)$ is the time-averaged optical depth between source and target and $r = |\vec{p}_{\star} - \vec{p}|$ is the distance between $\vec{p}_{\star}$ and $\vec{p}$. The optical depth is proportional to the column density of neutral hydrogen $N_{\rm HI}$, and the proportionality factor is its frequency-dependent photoionization cross-section, $\tau_{\nu} = \sigma_{\nu} N_{\rm HI}$. The frequency-dependence of $\sigma_{\nu}$ approximately follows a power law whose index depends on the frequency band considered (see, e.g. Friedrich et al., 2012, for further details).

C2-Ray in its current form is a Cartesian grid code, which discretizes space using $N^{3}$ cubic voxels, where $N$ is the number of voxels in each dimension. Using Equation 5 directly on a grid is problematic when the voxels are optically thick, i.e. the optical depth of a single voxel is non-negligible. The photoionization rate will then vary appreciably from one side of a voxel to the other. Using Equation 5 computed at an arbitrary point in the voxel as a representative rate for the whole voxel will, therefore, lead to an error in photon conservation: the number of ionizations occurring in the voxel will not be equal to the number of absorptions. To avoid this problem without being forced to use impractically small voxels, C2-Ray works by imposing that the number of ionizations is equal to the number of absorptions used to attenuate the radiation. As is detailed in M06, using this condition leads to an alternative expression for the photoionization rate,

$$\Gamma = \int_{\nu_{th}}^{\infty}\frac{L_{\nu}}{h\nu}\,\frac{e^{-\langle\tau_{\nu}\rangle}\left(1 - e^{-\Delta\tau_{\nu}}\right)}{n_{\rm HI}\,V_{shell}}\,d\nu, \tag{6}$$

where $\Delta\tau_{\nu}$ is the optical depth through the voxel, which is proportional to the light travel path length $ds$ through the voxel, and $\langle\tau_{\nu}\rangle$ is the optical depth up to the voxel. $n_{\rm HI}$ is the number density of neutral hydrogen inside the voxel, while the factor $V_{shell} = 4\pi r^{2}\,ds$ accounts for both geometrical diffusion of radiation and the finite size of the cell. Note that, in the optically thin limit ($\Delta\tau_{\nu} \rightarrow 0$), the above expression reduces to Equation 5. By defining the function

$$\gamma(N_{\rm HI})\equiv\int_{\nu_{th}}^{\infty}\frac{L_{\nu}e^{-\sigma_{\nu}N_{\rm HI}}}{h\nu}\,d\nu,$$ (7)

Equation 6 can be written in a more suggestive way

$$\Gamma=\frac{1}{n_{\rm HI}V_{shell}}\left[\gamma(N_{\rm HI})-\gamma(N_{\rm HI}+\Delta N_{\rm HI})\right].$$ (8)

This means that rather than numerically solving the integral in Equation 6 each time it is required, the function $\gamma(N_{\rm HI})$ can be pre-calculated and tabulated for a range of column densities and a simple interpolation used to evaluate it for any given value of $N_{\rm HI}$. Note that the individual properties of the voxel where $\Gamma$ is computed are not part of the tabulation and are explicitly accounted for in Equation 8.
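As an illustration of this tabulate-then-interpolate strategy, the following minimal Python sketch builds a table of $\gamma(N_{\rm HI})$ and evaluates Equation 8 from it. It is not the pyC2Ray implementation: the toy spectrum (photon rate per unit $x=\nu/\nu_{th}$ equal to $2\dot{N}x^{-3}$), the cross-section scaling $\sigma_{\nu}=\sigma_{0}x^{-3}$, the source strength, the midpoint-rule integration, the table bounds and all function names are assumptions chosen for brevity.

```python
import math

SIGMA0 = 6.3e-18  # cm^2; H I photoionization cross-section at threshold
NDOT = 1.0e54     # photons/s; illustrative source strength (assumption)

def gamma_direct(n_col, n_steps=1000):
    """Eq. 7 for the assumed toy spectrum, integrated with the midpoint rule.

    With u = nu_th / nu the integral becomes
    NDOT * int_0^1 2 u exp(-SIGMA0 * n_col * u^3) du.
    """
    u_step = 1.0 / n_steps
    total = 0.0
    for m in range(n_steps):
        u = (m + 0.5) * u_step
        total += 2.0 * u * math.exp(-SIGMA0 * n_col * u ** 3)
    return NDOT * total * u_step

# Table in log10(N_HI); bounds and resolution are assumptions.
LOG_MIN, LOG_MAX, N_TAB = 14.0, 22.0, 401
LOG_N = [LOG_MIN + (LOG_MAX - LOG_MIN) * m / (N_TAB - 1) for m in range(N_TAB)]
GAMMA_TAB = [gamma_direct(10.0 ** x) for x in LOG_N]

def gamma_interp(n_col):
    """Linear interpolation in log10(N_HI), clamped to the table range."""
    x = math.log10(max(n_col, 10.0 ** LOG_MIN))
    x = min(x, LOG_MAX)
    pos = (x - LOG_MIN) / (LOG_MAX - LOG_MIN) * (N_TAB - 1)
    m = min(int(pos), N_TAB - 2)
    t = pos - m
    return (1.0 - t) * GAMMA_TAB[m] + t * GAMMA_TAB[m + 1]

def photo_rate(n_col_in, n_hi, ds_cm, r_cm):
    """Eq. 8: rate per atom from two table look-ups (all lengths in cm)."""
    delta_n = n_hi * ds_cm                    # column density through the voxel
    v_shell = 4.0 * math.pi * r_cm ** 2 * ds_cm
    return (gamma_interp(n_col_in) - gamma_interp(n_col_in + delta_n)) / (n_hi * v_shell)
```

Only the two look-ups in `photo_rate` are needed per voxel; the voxel-specific quantities ($n_{\rm HI}$, $ds$, $r$) enter outside the table, as noted above. Because this sketch clamps column densities below the table range, it is not suitable for the optically thin limit, which would need separate treatment.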

When more than one source is present, the situation becomes slightly more complicated. The approach described in the original C2-Ray paper (see Figure 4 in M06) involves randomizing the order of sources and performing the chemistry step for each source individually before testing global convergence. However, this approach was modified in subsequent updates to the code, and it is this updated algorithm that we use here. The idea is simply to compute the 3D rate array (one rate per voxel) for each source and sum these arrays to obtain a global rate array. This global rate is then used for the chemistry step, and the process is repeated until convergence. Note that this is the process as illustrated in Figure 1, where we use the notation $\Gamma_{ki}$ to signify that this quantity applies to a given source indexed by $k$ and a given voxel, indexed by $i$.

3 Novel ray-tracing Method: ASORA

As shown by Equation 6, the problem of finding ionization rates boils down to computing the column density $N_{\rm HI}$ of neutral hydrogen between a source and grid voxels. This is the process we refer to as ray-tracing in this context. In principle, given a cubic grid with $N$ voxels in each dimension, it is possible to compute $N_{\rm HI}$ directly for all voxels of the grid, an approach known as “long characteristics” (LC). For a single source, it scales as $\mathcal{O}(N^{4})$. This is because, for each of the $N^{3}$ voxels to treat, the number of other voxels that lie along the ray coming from the source is on the order $\mathcal{O}(N)$. LC has the advantage of being easy to parallelize as all rays are treated independently. However, given that radiation propagates causally outward from the source and that column density is an additive quantity along a given line of sight, this algorithm also contains a lot of redundancy. A variety of methods have been proposed to make ray-tracing more efficient (see, e.g. Rosdahl et al., 2013, for an overview). C2-Ray uses a version of the “short-characteristics” (SC) ray-tracing method (Raga et al., 1999), which reduces the redundancy of the problem by using interpolation from inner-lying voxels relative to the source to compute the column density to outer-lying ones. This method reduces the complexity to $\mathcal{O}(N^{3})$ but is harder to parallelize as it introduces voxel dependency.

Since the effect of each source is independent, the total cost of the ray-tracing step is the number of sources $N_{\mathrm{src}}$ times whatever the cost for a single source is, e.g., $\mathcal{O}(N^{4})$ for LC or $\mathcal{O}(N^{3})$ for SC. On the other hand, the total cost of a chemistry step only scales with the number of voxels in the grid, i.e., $N^{3}$. This clarifies why the ray-tracing step is the primary target for optimization in an EoR code like C2-Ray, where typically $N_{\mathrm{src}}\gg 1$. In fact, at low redshift, it is common to have a source in almost every voxel, so that $N_{\mathrm{src}}\sim N^{3}$, and that, in turn, the complexity of the ray-tracing step is $\sim\mathcal{O}(N^{6})$ (using SC) versus $\mathcal{O}(N^{3})$ for the chemistry step.
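To make the scaling argument concrete, the short sketch below counts voxel visits for the mesh size ($250^{3}$) and approximate source count ($10^{6}$) quoted in the abstract. Prefactors are ignored, so the numbers are order-of-magnitude operation counts only, and the function names are illustrative.

```python
def raytracing_cost(n_mesh, n_src, method="SC"):
    """Voxel visits for ray-tracing: N_src * N^3 (SC) or N_src * N^4 (LC)."""
    per_source = n_mesh ** 3 if method == "SC" else n_mesh ** 4
    return n_src * per_source

def chemistry_cost(n_mesh):
    """Voxel visits for one chemistry step: N^3, independent of N_src."""
    return n_mesh ** 3

N_MESH, N_SRC = 250, 10 ** 6
ratio = raytracing_cost(N_MESH, N_SRC) // chemistry_cost(N_MESH)
print(ratio)  # 1000000: with SC, ray-tracing costs N_src times the chemistry step
```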

We should also mention that we never separately treat more than one source per voxel and instead simply add the luminosity of all sources whenever more than one is present, implying $N_{\mathrm{src}}\leq N^{3}$. The result is identical to treating them separately and adding up the resulting rates because a source is always assumed to be at the center of a voxel. Note, however, that this approach is possible only because the spectra of all sources are identical, and in the future, when different source types may be considered by the code, this procedure will have to be adapted by only summing up the sources that belong to the same type.
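A minimal sketch of this merging step is given below. The function name and the input layout (a list of voxel index triples and a matching list of luminosities) are assumptions made for illustration, not the pyC2Ray data structures.

```python
from collections import defaultdict

def merge_sources(positions, luminosities):
    """Sum the luminosities of all sources that fall in the same voxel.

    Valid only when all sources share one spectrum, as stated in the text.
    """
    merged = defaultdict(float)
    for pos, lum in zip(positions, luminosities):
        merged[tuple(pos)] += lum
    return dict(merged)

# Two sources share voxel (1, 2, 3) and are treated as a single source.
merged = merge_sources([(1, 2, 3), (0, 0, 0), (1, 2, 3)], [1.0, 5.0, 2.0])
```

If several source types were supported, the dictionary key would have to include the type as well as the voxel position, so that only sources of the same type are summed.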

Below, we first discuss in detail the short-characteristics ray-tracing method used in C2-Ray (§ 3.1). Then, we give an overview of the CPU-parallelization strategy the code has used so far (§ 3.2). Next, we introduce the adaptation of the method for GPUs (§ 3.3), and finally, in § 3.4, we discuss the structure of the new Python wrapper built around C2-Ray.

3.1 Ray-tracing in C2-Ray

Here, we closely follow the discussion of Appendix A in M06, and in particular, refer the reader to Figure A1, which provides a good visual description of the geometric arguments detailed below. For a voxel located at mesh position $p=(i,j,k)$ and a source at $s=(i_{s},j_{s},k_{s})$, the full column density $N_{\rm HI}=\tilde{N}_{\rm HI}+\Delta N_{\rm HI}$ along the ray from $s$ to $p$ can be decomposed into a part up to the voxel $\tilde{N}_{\rm HI}$ and a part within the voxel $\Delta N_{\rm HI}$. The latter is proportional to the physical path length $dl=\alpha\,ds$ through the voxel at $p$, where $ds$ is the path length in mesh units and $\alpha$ is the physical length of a grid voxel,

$$\Delta N_{\rm HI}=n_{\rm HI}\,\alpha\,ds.$$ (9)

Defining $\Delta i=i-i_{s}$ (and similarly $\Delta j$ and $\Delta k$), one can determine through which plane the ray coming from $s$ enters the voxel at $p$. For example, if $|\Delta k|>|\Delta i|$ and $|\Delta k|>|\Delta j|$, the ray enters through one of the constant-$z$ planes, with the sign of $\Delta k$ indicating which one. In this particular case, the path length through the voxel is

$$ds=\sqrt{1+\frac{\Delta i^{2}+\Delta j^{2}}{\Delta k^{2}}},$$ (10)

and the analogous expressions apply if the ray enters through a constant-$x$ or constant-$y$ plane. The main assumption of the short-characteristics method is that $\tilde{N}_{\rm HI}$ can be computed by interpolation with neighboring voxels of $p$ that are closer to $s$. The particular scheme used by C2-Ray (Raga et al., 1999) uses 4 neighbours, whose positions are given by

Figure 2: Parallelization strategy used by the original C2-Ray code. In the first step (A), 6 grid domains can be treated independently, corresponding to axes around the source voxel. In (B), the 12 planes joining them form independent domains, while in the third one (C), the 8 octants between the planes do.
$$\begin{aligned}e_{1}&=(i,j,k-\sigma_{k}),\quad\quad e_{2}=(i,j-\sigma_{j},k),\\ e_{3}&=(i-\sigma_{i},j,k),\quad\quad e_{4}=(i-\sigma_{i},j-\sigma_{j},k-\sigma_{k}),\end{aligned}$$ (11)

where $\sigma_{i,j,k}=\frac{|\Delta i,j,k|}{\Delta i,j,k}$, i.e., the sign of the corresponding offset. The interpolated column density up to $p$ then reads

$$\tilde{N}_{\rm HI}=w_{1}N_{e_{1}}+w_{2}N_{e_{2}}+w_{3}N_{e_{3}}+w_{4}N_{e_{4}}.$$ (12)

The interpolation weights $w_{n}$ are a simple geometric weighting based on the $xy$-distance from the corner to the point of intersection between the ray and the surface of the cell. They are chosen such that when the ray is parallel to an axis or lies on a grid diagonal, in which case $\tilde{N}_{\rm HI}$ is exactly equal to the column density of only one of the neighbors, all but the weight of that neighbor vanish. The interested reader is referred to Appendix A in M06 for the details of this weighting choice.
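The geometric ingredients of Equations 10 and 11 can be sketched in a few lines of Python. This is an illustration only: the function names are hypothetical, the weights $w_{n}$ of M06 are not reproduced, and the sign $\sigma$ is taken to be 0 when the corresponding offset vanishes, which is an assumption since Equation 11 leaves that case undefined (the affected neighbours then coincide with $p$ and must carry zero weight).

```python
import math

def sign(x):
    """Sign of an integer offset; returns 0 for x == 0 (assumption, see lead-in)."""
    return (x > 0) - (x < 0)

def path_length(di, dj, dk):
    """Eq. 10 and its analogues: path length through the voxel in mesh units.

    The ray enters through the plane normal to the axis with the largest
    absolute offset from the source.
    """
    a, b, c = sorted((abs(di), abs(dj), abs(dk)))  # c is the dominant offset
    if c == 0:
        raise ValueError("the source voxel itself has no entry plane")
    return math.sqrt(1.0 + (a * a + b * b) / (c * c))

def interpolation_neighbours(p, s):
    """Eq. 11: the four neighbours e1..e4 of voxel p that lie closer to source s."""
    (i, j, k), (i_s, j_s, k_s) = p, s
    si, sj, sk = sign(i - i_s), sign(j - j_s), sign(k - k_s)
    return [
        (i, j, k - sk),            # e1
        (i, j - sj, k),            # e2
        (i - si, j, k),            # e3
        (i - si, j - sj, k - sk),  # e4
    ]
```

For a ray along an axis the path length is one mesh unit, and along a grid diagonal it is $\sqrt{3}$, the two limiting cases of Equation 10.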

The above scheme describes how the column density up to a given voxel can be approximated using the knowledge of the equivalent quantity corresponding to 4 other voxels that lie closer to the source. This inter-voxel dependency naturally implies that for the scheme to be applied correctly, one must treat the voxels in a particular order, starting at the source voxel and moving outward from there. This ensures that the interpolation step does not attempt to use information that does not exist yet, so we say that SC is a causal algorithm. In fact, the simplest way to traverse the grid is to simply perform a triple loop over the $x\rightarrow y\rightarrow z$ indices of all voxels by starting the loop at the source voxel indices. This is a fully sequential approach. The next two sections deal with the problem of finding parallel alternatives to the latter.
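The sequential traversal can be illustrated with the following sketch, which generates a visiting order by looping outward from the source along each axis and checks that every voxel is visited after the neighbours that lie one step closer to the source. It assumes a non-periodic grid and takes $x$ as the innermost loop; both choices, and the function names, are assumptions made for the example.

```python
def outward(start, n):
    """Indices start, start+1, ..., n-1, followed by start-1, start-2, ..., 0."""
    return list(range(start, n)) + list(range(start - 1, -1, -1))

def sequential_order(n, source):
    """Triple loop over the grid, starting each loop at the source index."""
    i_s, j_s, k_s = source
    return [(i, j, k)
            for k in outward(k_s, n)
            for j in outward(j_s, n)
            for i in outward(i_s, n)]

def is_causal(order, source):
    """True if stepping one cell toward the source always hits an earlier voxel."""
    rank = {cell: n for n, cell in enumerate(order)}
    for cell in order:
        for axis in range(3):
            d = cell[axis] - source[axis]
            if d != 0:
                nb = list(cell)
                nb[axis] -= (d > 0) - (d < 0)
                if rank[tuple(nb)] >= rank[cell]:
                    return False
    return True

order = sequential_order(5, (1, 3, 2))
```

Since each single step toward the source leads to an earlier voxel, the diagonal neighbour $e_{4}$, reached by up to three such steps, is also visited earlier.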

3.2 Existing CPU Parallelization and Optimizations

The current version of C2-Ray uses various methods to optimize the cost of ray-tracing and make the procedure scalable to massively parallel CPU systems. A key feature of the code is that the treatment of each source is completely independent. C2-Ray harnesses this independence by distributing the full list of sources between MPI ranks. Each rank receives a copy of the full grid data and works on a subset of the sources. It performs ray-tracing for each source in this subset and sums together their respective ionization rate arrays (see §2.2). Then, an MPI reduction operation is used to sum the rate arrays of all ranks and obtain a global $\Gamma$ that includes the contributions of all sources. This allows the full ray-tracing workload to be distributed over many processors in shared and distributed memory setups. The main limitation of this setup is memory since each rank carries a full copy of the 3D grid.
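The logic of this source-level decomposition can be emulated without MPI in plain Python, as sketched below. The ranks are simulated by a loop, the round-robin split is only one possible way of distributing the sources, and `toy_rate` is a hypothetical stand-in for the ray-tracing of one source on a flattened 8-voxel grid.

```python
def split_sources(sources, n_ranks):
    """Round-robin split: rank r receives sources r, r + n_ranks, ..."""
    return [sources[r::n_ranks] for r in range(n_ranks)]

def rank_rates(my_sources, n_voxels, rate_of):
    """Work of one rank: sum the per-source rate arrays of its subset."""
    local = [0.0] * n_voxels
    for src in my_sources:
        for v, g in enumerate(rate_of(src)):
            local[v] += g
    return local

def sum_reduce(per_rank):
    """Emulates a sum reduction: element-wise sum of all rank arrays."""
    return [sum(vals) for vals in zip(*per_rank)]

def toy_rate(src, n_voxels=8):
    """Hypothetical per-source rate array, peaked at voxel src % n_voxels."""
    return [float(max(0, 3 - abs(v - src % n_voxels))) for v in range(n_voxels)]

sources = list(range(10))
serial = rank_rates(sources, 8, toy_rate)
parallel = sum_reduce([rank_rates(s, 8, toy_rate) for s in split_sources(sources, 4)])
```

Because the rates are additive, `parallel` matches `serial` for any number of ranks. Each emulated rank allocates a full-size `local` array, which mirrors the memory limitation mentioned above.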

Figure 3: Sequence of octahedral shells $S_{q}$ used in the ASORA ray-tracing method. All voxels belonging to a shell $S_{q}$, with $q>0$, depend strictly on voxels from previous shells $\left\{S_{r}\,|\,r<q\right\}$. The shells $q=1$, $q=2$ and $q=9$ are shown, with the source voxel ($q=0$) at the origin of the axes.

As was explained in §3.1, the ray-tracing work for a single source is more challenging to do in parallel due to the inter-voxel dependency of the SC method. However, it is possible to find independent subdomains of the grid and use an approach similar to domain decomposition. This approach performs the following steps in order, which are illustrated in Figure 2:

  1. Do the 6 axes outward from the source voxel (A) in parallel

  2. Do the 12 planes joining these axes (B) in parallel

  3. Do the 8 octants between the planes (C) in $x\rightarrow y\rightarrow z$ order, in parallel,

where the labels A, B and C correspond to the three sketches in Figure 2. C2-Ray uses OpenMP tasks to do the independent domains following this approach, which can yield a speedup of $S\lesssim 8$.

Figure 4: Implementation of the ASORA method. $N_{\mathrm{src}}$ sources, labeled by $\star_{i}$, are treated in batches of a given size $M$, and one block is dispatched for each source in the batch. Threads within a block are synchronized between each shell $S_{q}$ (see Figure 3) but are independent across different sources. Each block atomically contributes to the global ionization rate array. In MPI mode, each rank independently follows the above framework and the $\Gamma$ arrays of all ranks are sum-reduced to the root.

Finally, the ray-tracing procedure itself is optimized using the following technique: rather than ray-tracing the whole grid, the program first treats only a cubic sub-region around the source, namely a "sub-box", and then calculates the total amount of radiation that leaves this sub-box (i.e. a photon loss). If this loss is above a given threshold, the program increases the size of the sub-box, treats the additional voxels and repeats this procedure until the photon loss is low enough. This allows C2-Ray to avoid expensively ray-tracing all voxels when, in fact, almost no radiation reaches the ones far away from the source. The threshold value should be chosen based on convergence studies of the type of problem being simulated. The sub-box technique has been found to work well in EoR settings, where the density field is almost Gaussian. In some more specific situations, where narrow, optically thin tunnels exist in otherwise optically thick regions, the technique might produce inaccurate results. In these cases, using a very small threshold value or, in the worst case, ray-tracing the whole grid may be desirable. Additionally, the user can impose a hard limit on the maximum distance any photon can reach relative to the source.
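The control flow of the sub-box technique can be sketched as follows. The callable `photon_loss`, the starting half-width and the growth step are hypothetical placeholders; in the real code the loss would be obtained from the ray-tracing of the current sub-box rather than from a closed-form function.

```python
import math

def grow_subbox(photon_loss, n_mesh, loss_threshold, start_half_width=5, growth=5):
    """Enlarge the sub-box until the photon loss drops below the threshold.

    photon_loss(b) must return the fraction of emitted photons that leave a
    cube of half-width b (in voxels) around the source.
    Returns the final half-width and the corresponding loss.
    """
    max_half_width = n_mesh // 2          # the sub-box cannot exceed the grid
    half_width = min(start_half_width, max_half_width)
    while True:
        loss = photon_loss(half_width)
        if loss <= loss_threshold or half_width >= max_half_width:
            return half_width, loss
        half_width = min(half_width + growth, max_half_width)

# Toy loss model (assumption): exponential attenuation with an 8-voxel scale.
def toy_loss(b):
    return math.exp(-b / 8.0)

half_width, loss = grow_subbox(toy_loss, n_mesh=250, loss_threshold=1e-2)
```

With this toy model the loop stops at a half-width of 40 voxels, far below the 125 voxels needed to cover the whole grid. Setting the threshold to zero recovers full-grid ray-tracing, the fallback mentioned above for difficult geometries.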

3.3 GPU Implementation

GPUs are designed to execute numerous concurrent operations, organized into units referred to as blocks in CUDA and workgroups in AMD terminology. Given that ASORA has been implemented using CUDA, we will continue to use CUDA terminology. We are also planning a future port of the library for AMD platforms. Threads can be synchronized within a block, while blocks run asynchronously (Nickolls et al., 2008). It is possible to perform a synchronization between blocks only globally. To fully harness the resources of a GPU, one aims to ensure that the number of threads active at any given time is as close as possible to the theoretical maximum of the used device. While no universal prescription exists to achieve this, it is generally desirable that blocks have a similar workload and that their number is of the same order as the number of streaming multiprocessors (SMs) available on the device. This suggests a natural implementation for the ray-tracing problem: dispatch one block for each source and use intra-block synchronization to respect the causality of the short characteristics algorithm.

For this approach to be efficient, however, the work for a single source cannot be simply parallelized following the domain decomposition approach described in § 3.2 as this would allow at most 8 threads to be active within a block. To parallelize the work for a single source in a way more suited for the capabilities of a GPU, we recall that radiation would propagate as a spherical wavefront around a point source in a continuous medium. This translates to a series of shells around a source voxel in the discretized setting. It turns out that there is a particular sequence of disjoint shells $S_{q}$, illustrated in Figure 3, which are causally ordered with respect to the SC scheme used by C2-Ray. Here, $q$ indexes the "distance" of the shell to the source; the $q=0$ shell is simply the source voxel itself, and $q=1$ contains the 6 directly adjacent voxels to the source. The causal ordering can be summarized by the following conditions:

  1. The first shell $q=0$ contains only the source voxel, which can be treated directly without interpolation.

  2. For any voxel $p\in S_{q}$ with $q>0$, all 4 interpolation neighbors appearing in Equation 11 belong strictly to shells $S_{r}$ with $r<q$, in other words, only to shells "below" the current one.

  3. In particular, all voxels $p\in S_{q}$ are independent of one another with respect to the interpolation scheme.

This means that the full ray-tracing work for a single source can be divided into the sequence of tasks $\{S_{q}\}_{q=0}^{Q}$, where $Q$ is the size of the largest shell. These $Q$ tasks must be done sequentially by definition, but each task comprises subtasks (one subtask for each voxel in the shell) that can be performed in parallel. Note that the number of voxels inside a shell $S_{q}$, and hence the number of independent subtasks per task, is $n_{q}=4q^{2}+2$. Going back to the discussion above, when, for instance, 100 threads are assigned per source for each task with $q\geq 5$, it is theoretically possible for all threads to be actively engaged in performing work. This effectively resolves the challenge of parallelizing the computation on a per-source basis.
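The shell structure can be explored with the short sketch below. It assumes that the shell $S_{q}$ consists of all voxels whose offsets from the source satisfy $|\Delta i|+|\Delta j|+|\Delta k|=q$; this definition is not spelled out in the text but reproduces both the 6 voxels of $S_{1}$ and the count $n_{q}=4q^{2}+2$. The causality check is restricted to voxels with three nonzero offsets, since the sign in Equation 11 is undefined on the axes and planes. All function names are illustrative.

```python
def shell(q):
    """Offsets (di, dj, dk) from the source with |di| + |dj| + |dk| == q."""
    cells = []
    for di in range(-q, q + 1):
        rem = q - abs(di)
        for dj in range(-rem, rem + 1):
            ak = rem - abs(dj)
            for dk in ((ak, -ak) if ak > 0 else (0,)):
                cells.append((di, dj, dk))
    return cells

def shell_index(cell):
    """Index q of the shell that contains the given offset."""
    return sum(abs(c) for c in cell)

def sign(x):
    return (x > 0) - (x < 0)

def neighbours(di, dj, dk):
    """Eq. 11 written in offsets relative to the source."""
    si, sj, sk = sign(di), sign(dj), sign(dk)
    return [(di, dj, dk - sk), (di, dj - sj, dk),
            (di - si, dj, dk), (di - si, dj - sj, dk - sk)]

def shell_is_causal(q):
    """Condition 2 for voxels of S_q with all three offsets nonzero."""
    return all(shell_index(nb) < q
               for cell in shell(q) if 0 not in cell
               for nb in neighbours(*cell))

sizes = [len(shell(q)) for q in range(1, 6)]
```

Under this definition the neighbours $e_{1}$, $e_{2}$ and $e_{3}$ lie in $S_{q-1}$ and $e_{4}$ lies in $S_{q-3}$, so a thread working on a voxel of $S_{q}$ never reads a value produced by another thread in the same shell. The last entry of `sizes` is 102, in line with the statement that 100 threads can all be busy once $q\geq 5$.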

Rather than giving a maximal shell size $Q$, it is more convenient to set a maximum physical radius $R_{\gamma}$ any photon can travel from the source. If the physical size of a grid voxel is, as previously, denoted by $\alpha$, the (dimensionless) size index of the largest shell required to cover the chosen radius fully is given by $Q=\lceil\frac{R_{\gamma}}{\alpha\sqrt{3}}\rceil$. Any cell inside $S_{Q}$ whose distance to the source exceeds $R_{\gamma}$ can simply be excluded from the computation to yield a spherical region in which $\Gamma$ is nonzero.

The full implementation, illustrated in Figure 4, goes as follows. We dispatch one CUDA thread block for each source that works through the sequence of shells. The result, i.e. the photo-ionization rate produced by that source in each voxel, is atomically added to the global rate array $\Gamma$. In practice, there is a small additional caveat to consider, namely that by the nature of the algorithm, each source requires a temporary memory space to store the values of the previously interpolated voxels needed for the next interpolation. The required space can typically be a good fraction of the whole grid, so the number of blocks that can be dispatched together is limited by GPU memory. In fact, rather than directly dispatching one block for each of the $N_{\mathrm{src}}$ sources in the simulation, we group the sources into batches of size $M$, and work on these batches one after the other. $M$ is determined by available GPU memory and grid size $N$. As long as $M$ is large enough to saturate the GPU, this approach should not result in a significant performance loss compared to the ideal scenario of immediately deploying one block per source, without any batching, since the workload for each source is the same.
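A rough way to size the batches is sketched below. It assumes, as an upper bound, that every source in a batch needs a scratch buffer covering the full $N^{3}$ grid in double precision, and it uses 16 GiB of free device memory as an example figure; neither number is taken from the ASORA implementation, and the function names are illustrative.

```python
def max_batch_size(free_bytes, n_mesh, bytes_per_value=8):
    """Largest M such that M full-grid scratch buffers fit in free memory."""
    per_source = n_mesh ** 3 * bytes_per_value
    return max(1, free_bytes // per_source)

def n_batches(n_src, batch_size):
    """Number of batches needed to process all sources (ceiling division)."""
    return (n_src + batch_size - 1) // batch_size

def iter_batches(sources, batch_size):
    """Yield consecutive groups of at most batch_size sources."""
    for start in range(0, len(sources), batch_size):
        yield sources[start:start + batch_size]

M = max_batch_size(16 * 2 ** 30, 250)   # 16 GiB free, 250^3 mesh (assumptions)
batches_needed = n_batches(10 ** 6, M)
```

Under these assumptions a $250^{3}$ mesh allows $M=137$ concurrent sources, so about $10^{6}$ sources would be processed in 7300 batches. A tighter estimate of the per-source scratch space would raise $M$ accordingly.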

Finally, ASORA is also MPI-enabled, using mpi4py (Dalcin and Fang, 2021) in the same way as it was intended for the original C2-Ray. Namely, the sources are evenly distributed to multiple MPI processes. Each MPI rank maps to one GPU, which then uses the model laid out above to process its subset of sources. The resulting $\Gamma$ arrays of all ranks are then summed onto the root rank using MPI_REDUCE with a sum operation. This allows using ASORA on a multi-GPU setup across multiple nodes to further speed up ray-tracing on very heavy workloads.

To conclude this section, we note that the ASORA method, as presented here, only applies to uniform grids, as the octahedral shell approach builds on this assumption. We acknowledge that this is a strong limitation of our method, and we plan to explore its adaptability to non-uniform grids. Further technical details on ASORA can be found in 1.

3.4 Python-wrapping of C2-Ray

Figure 5: Structure of the pyC2Ray code. The main python package, pyc2ray, sets up the simulation and acts as the front end to the user. Internally, the time-evolution method of this package executes functions from two compiled extension modules. One is ASORA, the new GPU ray-tracing module written in CUDA C++, while the other contains a set of wrapped Fortran subroutines taken and adapted from the original C2-Ray code.

Here, we provide a brief overview of the pyC2Ray interface and architecture, summarized visually in Figure 5. The package amalgamates key components from the original Fortran90 code, the new ray-tracing library as discussed above, and elements of pure Python. This integration is facilitated through f2py, a tool developed as part of the NumPy project (Harris et al., 2020). This tool streamlines the creation of extension modules from Fortran90 source files.

The incorporated Fortran90 subroutines primarily encompass the chemistry solver and retain the original CPU-based ray-tracing module as a contingency. The novel ASORA method is written in C++/CUDA and compiled as a Python extension module natively compatible with NumPy. The principal time-evolution function within pyC2Ray is implemented in Python. It invokes the ray-tracing method (choosing between the CPU and GPU versions) and the chemistry method, both sourced from these extension modules. The precalculation of the photoionization rate tables, introduced earlier, is now implemented directly in Python. This is achieved using numerical integration techniques from the SciPy library (Virtanen et al., 2020), which relies on the underlying QUADPACK library for lower-level computations. It is worth noting that these integration methods differ from the custom Romberg integration subroutines utilized by the original C2-Ray framework. The commonly needed cosmological equations and physical quantities are now provided by Astropy (Astropy Collaboration, 2022).

Beyond these technical aspects, the underlying method of pyC2Ray, apart from the ray-tracing component, has undergone minimal alteration. Key features of C2-Ray, including photoionization and hydrogen chemistry, have been migrated to the Python version without compromising computational efficiency. Our strategy involves a gradual integration of additional extensions over time.

4 Validation Testing & Benchmarking

In § 4.1, we validate our new code using a series of well-established tests, comparing our results to analytical solutions and to the results of our original C2-Ray code. In § 4.2, we investigate how the updated ray-tracing method scales relative to the main problem parameters. In all tests, the temperature conditions of the gas are assumed to be isothermal, i.e., no heating effects are modeled.

4.1 Accuracy Tests

We begin by conducting Tests 1 and 4 from M06, labeled as Test 1 and 2 here, to evaluate the precision of our code in monitoring I-fronts in single-source mode. This evaluation encompasses scenarios both with and without cosmological background expansion. Following this, we investigate the interplay among multiple sources and the occurrence of shadow formation behind an opaque object, Test 3 and Test 4.

Figure 6: Result for Test 1 (Single-source H II region expansion in uniform gas). The test is conducted with a "coarse" time step $\Delta t_{c}=t_{evo}/10$ and a "fine" one, $\Delta t_{f}=t_{evo}/100$. The time evolution of the ionization front radius (middle) and velocity (bottom) are shown. The error between the numerical and analytical results can be seen in the top panel.

4.1.1 Test 1: Single-Source HII Region Expansion

Figure 7: Result for Test 2 (Single-source HII region expansion in cosmological expanding background). Notation is the same as in Figure 6. The source turns on at $z_{i}=9$, and the I-front radius is given in comoving kpc, with the scale factor $a(t_{i})=1$, normalized to the instantaneous Strömgren radius at $z_{i}$, $r_{S,i}$. The green dotted line shows the analytical result without cosmological expansion for reference.

Consider the classical scenario of a single ionizing source within an initially-neutral, uniformly dense field at a constant temperature. In this case, any cosmological effects are disregarded. Assuming the photoionization cross section remains frequency-independent, $\sigma_{\nu}=\sigma_{0}$, known as grey opacity, this system has a well-established analytical solution for the velocity and radius of the ensuing ionization front with respect to time. The solution is given by

$$r_{\mathrm{I}}(t) = r_{\mathrm{S}}\left[1-\exp(-t/t_{\rm rec})\right]^{1/3} \tag{13}$$
$$v_{\mathrm{I}}(t) = \frac{r_{\mathrm{S}}}{3\,t_{\rm rec}}\,\frac{\exp(-t/t_{\rm rec})}{\left[1-\exp(-t/t_{\rm rec})\right]^{2/3}}\ . \tag{14}$$

The above expressions depend on the Strömgren sphere radius $r_{\mathrm{S}}$, the recombination time $t_{\mathrm{rec}}$ and the luminosity emitted by the source (or the number of photons per unit time). These quantities are defined as

$$r_{\mathrm{S}} = \left(\frac{3\,\dot{N}_{\gamma}}{4\,\pi\,\alpha_{\mathrm{H}}(T)\,n_{\mathrm{H}}^{2}}\right)^{1/3}\ , \tag{15}$$
$$t_{\rm rec} = \frac{1}{\alpha_{\mathrm{H}}(T)\,n_{\rm H}}\ , \tag{16}$$
$$\dot{N}_{\gamma} = \int_{\nu_{\rm th}}^{\infty}\frac{L_{\nu}}{h\nu}\,d\nu\ . \tag{17}$$
Figure 8: Result for Test 3 (Expansion of Overlapping H II regions around Multiple Black-Body Sources). The top and middle rows show slices through the simulation domain at the $z$-coordinate of the 5 sources, for C2-Ray and pyC2Ray respectively. The leftmost column corresponds to the case with grey opacity, and the remaining 3 columns to those where black body spectra with different temperatures $T_{bb}$ were used. Colors are normalized across each row. The bottom row shows the distribution of relative per-voxel errors between the 2 codes for the whole 3D grid in all 4 cases.

Here, $L_{\nu}$ is again the specific luminosity of the source (power per unit frequency), which is related to the luminosity $\dot{N}_{\gamma}$ (number of ionizing photons per unit time) through Equation 17. We conduct our first test using the following numerical parameters: the luminosity of the source is $\dot{N}_{\gamma}=10^{48}$ s$^{-1}$, the number density of hydrogen is $n_{\mathrm{H}}=10^{-3}$ cm$^{-3}$, its temperature is $T=10^{4}$ K and the simulation box size is $10\,{\rm kpc}$. As stated above, we use the case B recombination coefficient for hydrogen, $\alpha_{\mathrm{H}}(T=10^{4}\,{\rm K})=2.59\times 10^{-13}\,\rm cm^{3}\,s^{-1}$. Using these parameters, the recombination time is $t_{\mathrm{rec}}\simeq 122.35$ Myr and the Strömgren radius is $r_{S}=3.15\,{\rm kpc}$.
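The quoted values of $t_{\rm rec}$ and $r_{\rm S}$ can be reproduced directly from Equations 15 and 16. The following is a minimal sketch that also evaluates the analytical solution of Equations 13 and 14; the function names and the unit-conversion constants are illustrative assumptions and are not part of the pyC2Ray interface.

```python
import numpy as np

# Unit conversions in CGS (standard values; the names are illustrative)
KPC_CM = 3.0857e21    # 1 kpc in cm
MYR_S = 3.15576e13    # 1 Myr in s (Julian years)

def recombination_time(alpha_H, n_H):
    """Eq. 16: recombination time in seconds."""
    return 1.0 / (alpha_H * n_H)

def stromgren_radius(N_dot, alpha_H, n_H):
    """Eq. 15: Stroemgren radius in cm."""
    return (3.0 * N_dot / (4.0 * np.pi * alpha_H * n_H**2)) ** (1.0 / 3.0)

def ifront_radius(t, r_S, t_rec):
    """Eq. 13: analytical I-front radius at time t."""
    return r_S * (1.0 - np.exp(-t / t_rec)) ** (1.0 / 3.0)

def ifront_velocity(t, r_S, t_rec):
    """Eq. 14: analytical I-front velocity at time t > 0."""
    e = np.exp(-t / t_rec)
    return r_S / (3.0 * t_rec) * e / (1.0 - e) ** (2.0 / 3.0)

# Test 1 parameters as given in the text
N_dot, n_H, alpha_H = 1e48, 1e-3, 2.59e-13   # s^-1, cm^-3, cm^3 s^-1
t_rec = recombination_time(alpha_H, n_H)
r_S = stromgren_radius(N_dot, alpha_H, n_H)
t_rec_myr = t_rec / MYR_S
r_S_kpc = r_S / KPC_CM
print(f"t_rec = {t_rec_myr:.2f} Myr, r_S = {r_S_kpc:.2f} kpc")
```

Evaluating `ifront_radius` on the output times of a run gives the reference curve against which a numerical I-front can be compared.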
The simulation is run with mesh size $256^{3}$ for $t_{\rm evo}=500\,\mathrm{Myr}\approx 4\,t_{\mathrm{rec}}$, following the prescription of Test 1 in Iliev et al. (2006). As in M06, the simulation is repeated once with a coarse time step $\Delta t=50\,\rm Myr$ and once with a fine one, $\Delta t=5\,\rm Myr$. We track the position of the I-front along the $x$-axis and define $r_{\mathrm{I}}(t_{k})$ as the radius where $x_{\rm HI}=0.5$. The precise location within a voxel is found by linear interpolation. The numerical I-front velocity, $v_{\mathrm{I}}$, is found by finite-differencing $r_{\mathrm{I}}$, using the same approach as in M06.
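The I-front tracking described above can be sketched as follows. This is only an illustration: the cell-centre convention, the function names and the forward-difference stencil are assumptions, since the exact differencing scheme of M06 is not reproduced here.

```python
import numpy as np

def ifront_position(x_HI, dr, threshold=0.5):
    """Radius at which the neutral fraction first reaches `threshold` along a ray.

    x_HI holds neutral fractions at cell centres r_i = (i + 0.5) * dr, ordered
    outward from the source (assumed convention). Returns None if the crossing
    cannot be bracketed by two cells.
    """
    x_HI = np.asarray(x_HI, dtype=float)
    r = (np.arange(x_HI.size) + 0.5) * dr
    above = np.nonzero(x_HI >= threshold)[0]
    if above.size == 0 or above[0] == 0:
        return None
    i = above[0]
    # Linear interpolation between the two cell centres bracketing the crossing
    frac = (threshold - x_HI[i - 1]) / (x_HI[i] - x_HI[i - 1])
    return r[i - 1] + frac * (r[i] - r[i - 1])

def ifront_velocity_fd(r_I, t):
    """Forward finite difference of the I-front radius between successive outputs."""
    return np.diff(r_I) / np.diff(t)

# Synthetic profile for illustration: a linear ramp with x_HI = 0.5 at r = 2
dr = 0.1
r_centres = (np.arange(50) + 0.5) * dr
profile = np.clip((r_centres - 1.0) / 2.0, 0.0, 1.0)
r_front = ifront_position(profile, dr)
```

Applying `ifront_position` to the $x$-axis profile of each output and passing the resulting radii to `ifront_velocity_fd` yields the numerical counterparts of Equations 13 and 14.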

The results are shown in Figure 6, where the three panels contain the time evolution of the ratio of numerical to analytical results (top), the I-front radius (middle) and its velocity (bottom). At times $t\lesssim t_{\mathrm{rec}}$, pyC2Ray is in excellent agreement with the analytical prediction, both with a coarse and a fine time step choice. At $t\gtrsim t_{\mathrm{rec}}$, the numerical I-front overestimates the analytical prediction by as much as 6%. This is consistent with the findings of, e.g., Iliev et al. (2006), where all tested codes predict such an overestimate. Pawlik and Schaye (2008) have demonstrated that this is because, in reality, the ionized fraction varies smoothly within the ionized bubble, whereas the Strömgren argument assumes a sharp transition from fully ionized to fully neutral.

4.1.2 Test 2: Single-Source HII Region in expanding background

We next test if pyC2Ray correctly models the propagation of I-fronts in an expanding universe. Test 2 uses the same source parameters as Test 1, with the source turning on at $z=9$ and then shining for 500 Myr, while the background density starts with the same value as before and evolves with the expansion of the universe. Shapiro and Giroux (1987) showed that a generalized analytical solution exists in this case. The comoving I-front radius is given by $r_{\mathrm{I}}(t)=r_{\mathrm{S},i}\,y(t)^{1/3}$, where $r_{\mathrm{S},i}=\left(3\dot{N}_{\gamma}/4\pi\alpha_{\mathrm{H}}(T)\,n_{\mathrm{H},i}^{2}\right)^{1/3}$ is the instantaneous Strömgren radius at the ignition time, $t_{i}$, of the source (with the scale factor set to unity at $t_{i}$, $a_{i}=1$), and

$$y(t)=\lambda\,e^{\lambda t_{i}/t}\left[\frac{t}{t_{i}}E_{2}(\lambda t_{i}/t)-E_{2}(\lambda)\right], \tag{18}$$

where $E_{2}(x)=\int_{1}^{\infty}t^{-2}e^{-xt}\,dt$ is the second-order exponential integral and $\lambda=t_{i}/t_{\mathrm{rec},i}$ is the ratio of the age of the universe at source ignition to the recombination time at that age. We set up the test with $n_{H,i}=1.87\times 10^{-4}\,\mathrm{cm}^{-3}$ and $L_{i}=7\times 10^{24}$ cm, and otherwise use the same parameters as before. The result is shown in Figure 7, where $r_{I}(t)$ represents the comoving I-front radius, keeping in mind that $a(t_{i})=1$. For this test, we used the same cosmology as in M06, namely $h=0.7$, $\Omega_{M}=0.27$ and $\Omega_{b}=0.043$.
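Equation 18 is straightforward to evaluate with the generalized exponential integral `scipy.special.expn`. The sketch below is illustrative only: the function names are assumptions, and the values of $t_i$ and $t_{\mathrm{rec},i}$ in the usage lines are arbitrary placeholders rather than the Test 2 values.

```python
import numpy as np
from scipy.special import expn

def y_expanding(t, t_i, t_rec_i):
    """Eq. 18: comoving ionized-volume ratio y(t) for a source switched on at t_i."""
    lam = t_i / t_rec_i                      # lambda = t_i / t_rec,i
    tau = np.asarray(t, dtype=float) / t_i   # t / t_i
    return lam * np.exp(lam / tau) * (tau * expn(2, lam / tau) - expn(2, lam))

def comoving_ifront_radius(t, t_i, t_rec_i, r_S_i):
    """r_I(t) = r_S,i * y(t)^(1/3), in the same units as r_S_i."""
    return r_S_i * y_expanding(t, t_i, t_rec_i) ** (1.0 / 3.0)

# Placeholder values in arbitrary time units (NOT the Test 2 parameters)
t_i, t_rec_i = 1.0, 2.0
times = np.linspace(t_i, 5.0 * t_i, 50)
y_vals = y_expanding(times, t_i, t_rec_i)
```

Two quick sanity checks follow from Equation 18: $y(t_i)=0$, and the initial slope is $dy/dt=1/t_{\mathrm{rec},i}$, which matches the early-time behaviour of the static solution in Equation 13.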

pyC2Ray again shows excellent agreement with the analytical result. While the effect of cosmic expansion is not evident at first sight, the analytical prediction without cosmology, Equation 13, is also plotted for reference in the figure (green dotted line), and the difference is clearly visible. Again, results are almost as accurate when using a coarse time step.

Figure 9: Result for Test 4 (I-Front Trapping in a Dense Clump and Formation of a Shadow). Shown are slices through the $z$-plane containing one ionizing source at the center and a dense clump of hydrogen diagonally offset from the source. The top row shows the ionized hydrogen fraction for C2-Ray (left) and pyC2Ray (middle), as well as the relative error between the two (right). The bottom row shows the same comparison for the photoionization rate.

4.1.3 Test 3: Expansion of Overlapping HII regions around Multiple Black-Body Sources

Now we turn to the more realistic case of non-grey opacity and parameterize the cross section as $\sigma_{\nu}=\sigma_{0}(\nu/\nu_{0})^{-\alpha}$, where $\nu_{0}$ is the ionization threshold frequency. The parameters of the power law are as in M06, $\sigma_{0}=6.3\times 10^{-18}$ cm$^{2}$ and $\alpha=2.8$. We test how the ionization front is affected by the spectral characteristics of the sources. For harder spectra, where the energy peak is well above the ionization threshold, we expect wider ionization fronts, as the hard photons can penetrate deeper into the medium (Spitzer, 1998). To test this and at the same time visualize how different HII regions overlap, we place 5 black-body sources, each with total ionizing flux $\dot{N}_{\gamma}=5\times 10^{48}$ s$^{-1}$ but with different temperatures $T_{bb}$, in a dice-like pattern on the same $z$-plane. The box size is $L=14$ kpc, the mesh $128^{3}$ and the constant hydrogen density is $n_{H}=10^{-3}$ cm$^{-3}$.
We simulate for $t_{evo}=10\,\rm Myr$, with time step $\Delta t=1\,\rm Myr$. Figure 8 shows cuts through the source plane of the final ionized hydrogen fraction $x_{\mathrm{HII}}$, for pyC2Ray (top) and C2-Ray (middle), along with the distribution of the absolute relative error, $\left|(x_{\mathrm{HII}}^{\rm pyC2Ray}-x_{\mathrm{HII}}^{\rm C2\text{-}Ray})/x_{\mathrm{HII}}^{\rm C2\text{-}Ray}\right|$, between the two coeval cubes (bottom panels). The leftmost column is the grey-opacity case as in the two previous tests, while the three remaining columns contain the results for $T_{bb}=\{5\times 10^{3},\,5\times 10^{4},\,1\times 10^{5}\}\,\rm K$.
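The per-voxel error statistic defined above can be computed in a few lines of NumPy. In this sketch the two cubes are synthetic stand-ins generated on the spot, not simulation outputs, and the function name is an assumption.

```python
import numpy as np

def relative_error(x_new, x_ref):
    """Absolute relative error |(x_new - x_ref) / x_ref| for every voxel."""
    return np.abs((x_new - x_ref) / x_ref)

# Synthetic stand-ins for the two coeval ionized-fraction cubes
rng = np.random.default_rng(0)
x_ref = rng.uniform(1e-3, 1.0, size=(16, 16, 16))
x_new = x_ref * (1.0 + 1e-5 * rng.standard_normal(x_ref.shape))

err = relative_error(x_new, x_ref)
nonzero = err[err > 0]
# Histogram in log10 of the error, as is natural for errors spanning decades
counts, edges = np.histogram(np.log10(nonzero), bins=20)
```

Binning in $\log_{10}$ of the error is a presentation choice made here so that values spanning several orders of magnitude remain distinguishable.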

Qualitatively, both C2-Ray and pyC2Ray reproduce the expected softness of ionization fronts for hot spectra, and the overlap of individual H II regions is also correctly modeled. The largest value for the relative error is on the order of $10^{-4}$ in all cases, while the mean increases for harder spectra. Although relatively small, this error requires an explanation, as both codes should, in principle, produce equal results in the absence of unit conversion or floating point errors. In fact, an important technical difference between the two is the choice of numerical integration method used to pre-compute Equation 7 as described in §2.2. pyC2Ray uses the standard quad wrapper of the SciPy package (Virtanen et al., 2020), which uses the adaptive quadrature method from the QUADPACK Fortran library. On the other hand, C2-Ray uses a custom-written Romberg integration scheme. Both methods are valid choices, but they will inevitably yield slightly different results depending on the chosen resolution. We tested this by varying the number of frequency bins used by the Romberg method in C2-Ray and found that the relative error between the two codes drops significantly as this number increases. We thus conclude that this technical difference is the most likely explanation for this result.
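The dependence on the integration scheme can be illustrated on a toy problem. The sketch below compares `scipy.integrate.quad` with a hand-written Romberg rule on a dimensionless black-body photon-number integrand. It is not the integrand of Equation 7 and not the Romberg routine of C2-Ray; the integrand, the upper cutoff and all names are assumptions chosen for illustration.

```python
import numpy as np
from scipy.integrate import quad

def photon_integrand(x):
    """Dimensionless black-body photon-number integrand, x = h nu / (k_B T_bb)."""
    return x**2 / np.expm1(x)

def romberg(f, a, b, levels):
    """Romberg integration of f on [a, b] with 2**levels trapezoid intervals."""
    R = np.zeros((levels + 1, levels + 1))
    h = b - a
    R[0, 0] = 0.5 * h * (f(a) + f(b))
    for k in range(1, levels + 1):
        h *= 0.5
        # Only the new midpoints are evaluated at each refinement
        x_new = a + h * np.arange(1, 2**k, 2)
        R[k, 0] = 0.5 * R[k - 1, 0] + h * np.sum(f(x_new))
        # Richardson extrapolation along the row
        for j in range(1, k + 1):
            R[k, j] = R[k, j - 1] + (R[k, j - 1] - R[k - 1, j - 1]) / (4**j - 1)
    return R[levels, levels]

K_B_EV = 8.617333e-5    # Boltzmann constant in eV/K
E_TH_EV = 13.598        # hydrogen ionization threshold in eV
T_bb = 5e4              # black-body temperature in K
x_lo = E_TH_EV / (K_B_EV * T_bb)
x_hi = 40.0             # arbitrary upper cutoff for this illustration

reference, _ = quad(photon_integrand, x_lo, x_hi)
rel_diff = {k: abs(romberg(photon_integrand, x_lo, x_hi, k) - reference) / reference
            for k in (3, 6, 10)}
for k, d in rel_diff.items():
    print(f"levels={k:2d}  intervals={2**k:5d}  |Romberg - quad| / quad = {d:.2e}")
```

The difference between the two schemes shrinks as the Romberg resolution grows, which is the same qualitative trend as the one reported above for the number of frequency bins.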

Figure 10: Scaling of the ASORA ray-tracing library. Left: Computation time per source per voxel for an increasing number of sources $N_{\mathrm{src}}$ and different ray-tracing radii $R$. This time approaches a constant value as more sources are added, and does so faster for larger $R$. Right: Speedup in terms of the number of blocks $M$, given by $t_{1}/t_{M}$, where $t_{1}$ is the timing when a single block is used. The vertical black line marks $M=56$, corresponding to the number of SMs on the NVIDIA® P100 GPU used in this benchmark.

4.1.4 Test 4: I-Front Trapping in a Dense Clump and Formation of a Shadow

Finally, to probe the ray-tracing method more specifically, we test for the formation of a shadow behind an overdense region. Correct modeling of shadows is one of the key advantages of ray-tracing over other techniques, making this an important check. In this test, the box size is $L=14$ kpc with mesh $128^{3}$, and a source with total ionizing flux $\dot{N}_{\gamma}=10^{49}$ s$^{-1}$ is placed at its center. The hydrogen has a mean density $\bar{n}_{\rm H}=10^{-3}$ cm$^{-3}$, and a spherical overdense region of radius $r=8.75$ pc is placed on the same $z$-plane as the source, at a distance $d=2.01$ kpc diagonally from it. Within this region the density is $n_{\mathrm{H}}^{\star}=6\,\bar{n}_{\rm H}$. The source has a black body temperature $T_{bb}=5\times 10^{4}$ K, and $t_{\rm evo}$ and $\Delta t$ are as in Test 3.
The result is visualized in Figure 9, where a cut through the source plane of the final ionized hydrogen fraction $x_{\mathrm{HII}}$ is shown on top and the photoionization rate $\Gamma$ below, for both codes along with the relative error as before. We want to point out that the fuzziness of the shadow is a feature of short-characteristics ray-tracing. The relative error is small again, and we believe it to be due to the choice of integration method, as discussed for the previous test. Interestingly, this error is larger by an order of magnitude at the edge of the overdense region. This is not so surprising, given that the overdensity is very optically thick and thus contains a large density gradient at its boundary. We noticed that the relative error is negative closer to the source, then positive, and then close to 0, reflecting the net photon flux conservation.

Figure 11: Results of the $349\,\rm cMpc$ EoR test simulation. The left and middle columns show slices through the simulation domain for C2-Ray and pyC2Ray, respectively. The right column shows the distribution of the relative per-voxel error for the $250^{3}$ grid. The simulation includes only dark matter halo masses with an efficiency factor $f_{\gamma}=30$ and a maximal comoving photon radius $R=15\,\rm cMpc$.
Figure 12: Left: Comparison of the reionization history from the 349 Mpc EoR test simulation, performed with C2-Ray and pyC2Ray. The top panel shows the evolution of the volume-averaged $\langle x^{\mathrm{v}}_{\mathrm{HII}}\rangle$ and mass-averaged $\langle x^{\mathrm{m}}_{\mathrm{HII}}\rangle$ fraction of ionized hydrogen over the redshift range $z\in[21.06,\,8.636]$, and the bottom panel the relative error between the two codes. Right: Comparison of the 21-cm power spectra from the same simulation at different redshifts (indicated in the legend) reveals a consistent match between C2-Ray and pyC2Ray.

4.2 Performance Benchmark

We now examine the performance of the new ray-tracing library more closely. All benchmarks in this section are performed on a grid of size $N=250$ and run on one node of the Piz Daint computer at CSCS (https://www.cscs.ch/computers/piz-daint/), containing in particular a single NVIDIA® Tesla P100 GPU. First, we determine how the ray-tracing performance scales as more sources are added or the radius of ray-tracing per source increases. We expect the code to scale linearly with the number of sources $N_{\mathrm{src}}$ and as $\mathcal{O}(R^{3})$ with the ray-tracing radius, $R=N_{mesh}\cdot R_{\rm max}/L_{\rm B}$, where $R_{\rm max}$ is the maximum radius for ray-tracing and $L_{\rm B}$ the box size, both in $\rm cMpc$ units. The benchmark is set up as follows. For $R=[10,30,50,100]$, the ray-tracing routine is called (on its own, without solving the chemistry afterward) on $N_{\mathrm{src}}=10^{a},\ a=0,\dots,6$ sources, and its run time is averaged over 10 executions. The left panel of Figure 10 shows the computation time per source per voxel,

$$\Delta t(N_{\mathrm{src}},R)=\frac{t(N_{\mathrm{src}},R)}{\frac{4}{3}\pi R^{3}\,N_{\mathrm{src}}}, \tag{19}$$

where $t(N_{\mathrm{src}},R)$ is the run time of the function running on $N_{\mathrm{src}}$ sources and computing $\Gamma$ in a spherical volume of radius $R$ (in voxel units) for each of them. With increasing $N_{\mathrm{src}}$, $\Delta t(N_{\mathrm{src}},R)$ approaches a constant value of about $3.156\,\rm ns$ on our system. Furthermore, this convergence is faster when the radius $R$ is larger. This implies that when few sources are present, overheads represent a non-negligible fraction of the execution time, even more so when the work per source (determined by $R$) is low. However, we can see that above $\sim 1000$ sources, the execution time is very close to its minimum, even for a relatively small RT radius. With few sources, the total amount of work is low, so the calculation is not expensive in any case. Typical EoR simulations, however, require $N_{\mathrm{src}}\gg 1000$, so our code runs in a regime where the work, and not the overheads, dominates the performance.
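The normalization of Equation 19 and the conversion of the ray-tracing radius to voxel units can be written as small helpers. The sketch below uses hypothetical function names; the last helper inverts Equation 19 under the assumption that the large-$N_{\mathrm{src}}$ plateau applies, so it gives a rough cost estimate only and not a measured timing.

```python
import math

def raytracing_radius_voxels(n_mesh, r_max_cmpc, box_cmpc):
    """R = N_mesh * R_max / L_B: ray-tracing radius in voxel units."""
    return n_mesh * r_max_cmpc / box_cmpc

def time_per_source_per_voxel(t_total, n_src, radius_vox):
    """Eq. 19: run time divided by (voxels in the sphere) x (number of sources)."""
    return t_total / (4.0 / 3.0 * math.pi * radius_vox**3 * n_src)

def asymptotic_runtime(dt_asym, n_src, radius_vox):
    """Invert Eq. 19, assuming the plateau value dt_asym (in s) is reached."""
    return dt_asym * 4.0 / 3.0 * math.pi * radius_vox**3 * n_src

# Example: N_mesh = 250, R_max = 15 cMpc and L_B = 349 cMpc, as in Figure 11
R = raytracing_radius_voxels(250, 15.0, 349.0)
t_est = asymptotic_runtime(3.156e-9, 10**6, R)
print(f"R = {R:.2f} voxels")
```

Because the plateau value was measured on one specific GPU and grid size, `t_est` should be read as an order-of-magnitude guide for that setup rather than a prediction for other hardware.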

Next, we test how the code scales as the source batch size $M$ increases, corresponding to increasing the number of CUDA blocks dispatched to the device between global synchronizations. The right panel of Figure 10 presents the speedup $t_{1}/t_{M}$ (where $t_{M}$ is the execution time using $M$ blocks) achieved in 3 cases: $(R=10,N_{\mathrm{src}}=10^{4})$, $(R=10,N_{\mathrm{src}}=10^{5})$ and $(R=30,N_{\mathrm{src}}=10^{4})$, to see the impact of both the radius and the total number of sources. This test is an analog of the "strong scaling" measurement typically performed on CPU cores. We observe that on our system, in all 3 cases, the code scales well up to $M\sim 32$ and does not gain any performance above $M\sim 50$, which seems to indicate that the sequential portion of the code prevents further scaling (analogously to Amdahl’s law in CPU computing). This test, however, only gives a picture of the speedup achieved relative to the single-block case for the whole program and hence does not indicate how good the occupancy of the GPU itself is.
Detailed profiling using standard NVIDIA software has revealed that the number of registers per thread required by the ray-tracing kernel is likely a limiting factor that prevents the code from ever reaching maximum occupancy in its current state, even on GPUs with higher compute capability than the P100. Overcoming this limitation should be one of the main targets for future performance updates.
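For reference, the Amdahl's-law analogy invoked above corresponds to the speedup model $S(M)=1/[(1-p)+p/M]$, where $p$ is the fraction of the run time that parallelizes over blocks. The sketch below evaluates this model; the value of $p$ is a hypothetical placeholder and has not been fitted to the measurements in Figure 10.

```python
def amdahl_speedup(m_blocks, parallel_fraction):
    """Amdahl's law: speedup with m_blocks workers when a fraction p is parallel."""
    return 1.0 / ((1.0 - parallel_fraction) + parallel_fraction / m_blocks)

def amdahl_limit(parallel_fraction):
    """Asymptotic speedup 1 / (1 - p) for an unlimited number of blocks."""
    return 1.0 / (1.0 - parallel_fraction)

p = 0.97  # hypothetical parallel fraction, for illustration only
speedups = {m: amdahl_speedup(m, p) for m in (1, 8, 32, 56, 128)}
for m, s in speedups.items():
    print(f"M = {m:3d}  ->  speedup = {s:5.2f}  (limit {amdahl_limit(p):.1f})")
```

In this simple model the speedup flattens smoothly toward $1/(1-p)$. A hard plateau near the number of SMs, as seen in the benchmark, would additionally reflect the hardware limit on concurrently resident blocks, which the model does not include.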

Two conclusions arise from this section: (1) The library is most optimized for use cases where many sources are present in the simulation, as is the case in EoR modeling. However, in cases where few sources are present, it will run optimally if the number of raytraced voxels is large. This may be the case when performing high-resolution radiative transfer simulations of smaller volumes, thus expanding the possible usage scenarios for pyC2Ray. (2) A good value for the batch size $M$ will depend strongly on the system on which the code is run while simultaneously being limited by the available memory. This is because each block needs a cache space for the ray-tracing, the size of which scales with the grid, i.e., $\mathcal{O}(N^{3})$.
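The memory constraint in point (2) can be turned into a back-of-the-envelope bound on $M$. The sketch below assumes one double-precision value per voxel and one such buffer per block; the actual buffer layout of ASORA is not specified here, so `bytes_per_voxel`, `arrays_per_block` and the memory budget in the usage lines are all hypothetical.

```python
def block_cache_bytes(n_mesh, bytes_per_voxel=8, arrays_per_block=1):
    """Memory needed by one block if its cache holds full N^3 arrays (assumed layout)."""
    return arrays_per_block * bytes_per_voxel * n_mesh**3

def max_batch_size(n_mesh, mem_budget_bytes, bytes_per_voxel=8, arrays_per_block=1):
    """Largest M whose per-block caches fit in the given memory budget."""
    per_block = block_cache_bytes(n_mesh, bytes_per_voxel, arrays_per_block)
    return mem_budget_bytes // per_block

# Example: N = 250 grid and a hypothetical budget of 16e9 bytes for the caches
per_block = block_cache_bytes(250)
m_max = max_batch_size(250, 16_000_000_000)
print(f"{per_block / 1e6:.0f} MB per block, at most M = {m_max}")
```

Since the per-block cost grows as $N^{3}$, doubling the grid resolution reduces the admissible batch size by a factor of eight under these assumptions.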

5 Running a Cosmological Reionization Simulation

The ultimate test for the updated code is to see whether it can reproduce the results of a simulation performed with the original C2-Ray while at the same time achieving a gain in performance. Here, we post-process a $(349\,\mathrm{Mpc})^{3}$ volume $N$-body simulation run with $4000^{3}$ dark matter particles, which models the formation of high-redshift structures. This $N$-body simulation used the code CUBEP$^{3}$M (Harnois-Déraps et al., 2013; https://github.com/jharno/cubep3m), which has an on-the-fly halo finder, providing halo catalogs at each redshift snapshot using the spherical overdensity method (see Watson et al., 2013, for more detail). The $N$-body dark matter particles and the halo catalog are then gridded, with an SPH-like smoothing technique, onto a regular grid of size $N_{\rm mesh}=250^{3}$ that is later used as input for the RT simulation. This simulation resolves dark matter haloes with mass $M_{\rm halo}\geq 10^{9}\,{\rm M_{\odot}}$ and contains approximately $10^{7}$ sources toward the end of reionization. See Dixon et al. (2016) and Giri et al. (2018) for more detailed descriptions.

We follow the same source model presented in previous work (e.g. Iliev et al., 2014; Bianco et al., 2021), which assumes a linear relation between the emissivity and the mass of the hosting dark matter halo. In this model, the rate of ionizing photons, $\dot{N}_{\gamma}$, produced by a source residing in a dark matter halo of mass $M_{\rm halo}$ is

$$\dot{N}_{\gamma}=f_{\gamma}\,\frac{M_{\rm halo}\,\Omega_{\rm b}}{\Omega_{\rm M}\,m_{p}\,t_{s}}, \tag{20}$$

where the efficiency factor is $f_{\gamma}=30$ and the source lifetime $t_{s}\approx 10\,{\rm Myr}$ is taken to be the time difference between the simulation snapshots. Two time steps are performed for each redshift interval. Here, we choose an extreme value for the efficiency factor to speed up the reionization process so that we could run C2-Ray in a reasonable amount of time and with reasonable computational resources. We should note that reionization ends quite early compared to the more realistic models in Dixon et al. (2016) and Giri et al. (2019) produced using C2-Ray. However, the outcomes of the comparison hold for any source model.
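As a minimal illustration of Equation 20, the sketch below evaluates the photon production rate for a single halo. The function name, the CGS constants, and in particular the values of $\Omega_{\rm b}$ and $\Omega_{\rm M}$ are placeholders chosen for this example and are not taken from the simulation setup; only $f_{\gamma}=30$ and $t_{s}=10\,{\rm Myr}$ follow the text.

```python
# Physical constants in CGS units (rounded).
M_SUN_G = 1.989e33        # solar mass [g]
M_PROTON_G = 1.6726e-24   # proton mass [g]
MYR_S = 3.156e13          # one megayear [s]


def ionizing_photon_rate(m_halo_msun, f_gamma=30.0, t_s_myr=10.0,
                         omega_b=0.045, omega_m=0.27):
    """Evaluate Equation 20: ionizing photons per second for one halo.

    omega_b and omega_m are illustrative placeholder values (assumption),
    not the cosmology used in the paper.
    """
    # Number of baryons (protons) associated with the halo.
    n_baryons = m_halo_msun * M_SUN_G * omega_b / (omega_m * M_PROTON_G)
    # f_gamma photons per baryon, released over the source lifetime t_s.
    return f_gamma * n_baryons / (t_s_myr * MYR_S)


rate = ionizing_photon_rate(1e9)  # smallest resolved halo mass in the text
print(f"N_gamma_dot = {rate:.3e} photons/s")
```

Because the relation is linear in $M_{\rm halo}$, doubling the halo mass doubles the emitted rate, which is why the source list can be built directly from the gridded halo catalog.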

In Figure 11, we show slices of the simulated ionized fraction, $x_{\mathrm{HII}}$, comparing C2-Ray (left column) and pyC2Ray (middle column) at redshifts $z=11.090,\,10.110,\,9.457$ and $8.636$, corresponding to volume-averaged ionized fractions $\left<x_{\mathrm{HII}}\right>=0.045,\,0.180,\,0.420$ and $0.837$. We show the relative error in the right column of the same figure for each redshift. At high redshift, the error distribution is mostly centered at $10^{-6}$, similar to what we show in § 4.1.3 and 4.1.4. From $z\sim 10$ onward, it instead shows two peaks, with the distribution transitioning from $10^{-4.5}$ to $10^{-10}$. The double-peaked feature of the error distribution is visible from the moment the source contribution becomes substantial. This indicates that the error is initially associated with the precision error in the vast neutral field and later with the growing ionized regions. In the left panels of Figure 12, we show the volume- and mass-averaged ionized fractions, $\left<x_{\mathrm{HII}}\right>_{\rm v}$ and $\left<x_{\mathrm{HII}}\right>_{\rm m}$, against redshift. Solid lines indicate the results obtained with C2-Ray and dashed lines those obtained with pyC2Ray. As with the slices discussed above, the relative error is on average $\sim 10^{-5}$, at least five orders of magnitude smaller than the dynamic range of the ionized field, making the difference indiscernible. Note that we show the results only down to $z=8.575$, when the IGM is about $86\%$ ionized, because at this stage of reionization the simulation has approximately $1.5\times 10^{6}$ sources and C2-Ray starts to become computationally demanding.

Radio experiments, such as HERA, LOFAR, and MWA, aim to observe the spatial distribution of the differential brightness temperature $\delta T_{\mathrm{b}}(\boldsymbol{r},z)$ corresponding to the 21-cm signal, where $\boldsymbol{r}$ denotes the position. This quantity can be written as (e.g. Pritchard and Loeb, 2012)

$$\delta T_{\mathrm{b}}(\boldsymbol{r},z)\approx 27\,\mathrm{mK}\,\left(\frac{0.15}{\Omega_{\rm M}h^{2}}\frac{1+z}{10}\right)^{\frac{1}{2}}\left(\frac{\Omega_{\rm b}h^{2}}{0.023}\right)\,[1-x_{\mathrm{HII}}(\boldsymbol{r},z)]\,[1+\delta_{\rm b}(\boldsymbol{r},z)], \tag{21}$$

where $x_{\mathrm{HII}}$ and $\delta_{\mathrm{b}}$ are the ionized hydrogen fraction and the baryon overdensity, respectively. We should note that we have assumed the spin temperature to be saturated and ignored the impact of redshift-space distortions. We refer interested readers to Ross et al. (2021) for an exploration of both these aspects in simulations with C2-Ray. We compute $\delta T_{\mathrm{b}}(\boldsymbol{r},z)$ and subsequently the power spectrum from the reionization simulation snapshots with our data analysis software, Tools21cm (Giri et al., 2020; https://github.com/sambit-giri/tools21cm). In the top-right panel of Figure 12, we present the 21-cm power spectrum at various redshifts. We observe a precise agreement between the results obtained from pyC2Ray and C2-Ray, also evident from the relative error in the bottom-right panel, demonstrating that these upgrades can accurately replicate the spatial distribution of the 21-cm signal.
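To make Equation 21 concrete, the following NumPy sketch evaluates the differential brightness temperature on a gridded snapshot. It is an illustrative stand-alone function, not the Tools21cm implementation; the function name and the default cosmological parameters are assumptions made for this example.

```python
import numpy as np


def delta_tb_mk(x_hii, delta_b, z, omega_m=0.27, omega_b=0.045, h=0.7):
    """Evaluate Equation 21 in mK (saturated spin temperature, no
    redshift-space distortions).

    x_hii   : ionized hydrogen fraction (scalar or array)
    delta_b : baryon overdensity, same shape as x_hii
    The default cosmology is a placeholder (assumption).
    """
    prefactor = (27.0
                 * np.sqrt(0.15 / (omega_m * h**2) * (1.0 + z) / 10.0)
                 * (omega_b * h**2 / 0.023))
    return prefactor * (1.0 - np.asarray(x_hii)) * (1.0 + np.asarray(delta_b))


# Toy 2x2 "snapshot": neutral/ionized cells with mild density contrasts.
x_hii = np.array([[0.0, 1.0], [0.5, 0.0]])
delta_b = np.array([[0.0, 0.3], [-0.2, 0.5]])
print(delta_tb_mk(x_hii, delta_b, z=9.0))
```

Fully ionized cells give a vanishing signal regardless of density, while neutral overdense cells give the strongest signal, which is why the 21-cm field traces the ionization morphology.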

The simulation with pyC2Ray cost $2.5$ GPU-hours on our single-GPU system, while the comparison run, with the Fortran90 CPU version of C2-Ray, was computed on 128 cores for a total of $13{,}824$ core-hours. While GPU-hours are, in general, more expensive than core-hours, the observed speedup is so large that pyC2Ray is significantly cheaper to run than the original code, by a factor of $\sim 100$ depending on the computing center, which was part of the motivation behind this update. In Appendix B, we further illustrate the computational advantage of porting algorithms to GPU.

6 Summary and Conclusions

The main challenge in simulating the cosmic Epoch of Reionization is that we must concurrently simulate a large volume, of the order of the $\rm Gpc$ scale, while resolving compact and dense cosmic structures. These requirements make Radiative Transfer (RT) simulations extremely computationally expensive and demanding. For this reason, most RT codes are implemented with programming languages suited for scientific computing, such as Fortran90 or C/C++. However, this makes any changes or regular updates to the code cumbersome for new users, as any slight modification requires frequent recompilation and debugging. Moreover, relatively little effort has been made to make ray-tracing algorithms for reionization simulations computationally efficient and functional on general-purpose graphics processing units (GPUs).

Therefore, this paper introduces pyC2Ray, a Python-wrapped, updated version of the extensively used C2-Ray RT code for cosmic reionization simulations. In particular, we present the newly developed Accelerated Short-characteristics Octahedral RAy-tracing algorithm, ASORA, which utilizes GPU architectures to achieve a drastic speedup in fully numerical RT simulations.

In § 2, we recap the differential equation solved during a cosmological reionization simulation. In § 2.1, we summarize the well-established time-averaged method that solves the chemistry equation in C2-Ray, Equation 1, allowing the solution to be integrated on a larger time-step compared to the reionization time scales, otherwise required by a more direct approach. In § 2.2, we explain in detail the necessity for an efficient ray-tracing method for our code. With Equations 6 and 8, we highlight the core and most computationally expensive operation in RT algorithms, which consists of computing the column density and, thus, the optical depth for each voxel, which ultimately quantifies the number of ionizing photons that are absorbed by a cell along the ray. The combination of the time-averaged and short-characteristics methods is the distinguishing feature of the C2-Ray code. In Figure 1, we summarize the algorithm for both the C2-Ray and pyC2Ray methods presented here.

In § 3.1, we remind the reader of the short-characteristics approach of C2-Ray inherited by pyC2Ray. In § 3.2, we describe the existing CPU parallelization of the current version of C2-Ray, which consists of splitting the source input list into equal parts for each MPI processor. For each rank, 8 OpenMP threads, corresponding to the number of independent domains around each source, compute the $\rm HI$ column density. This parallelization strategy is not optimal for GPU architectures. Therefore, in § 3.3 we propose a new interpolation approach for the C2-Ray RT algorithm specifically designed for GPUs. The ASORA interpolation scheme comes from the physical intuition that the radiation propagates as an outward wavefront around a source. This new approach changes the domain decomposition to an interpolation between concentric surfaces of an octahedron centered around the source, as illustrated by Figure 3. From a technical perspective, in pyC2Ray, we keep the same MPI source distribution, as presented in § 3.2, and instead replace the OpenMP domain decomposition with the ASORA method.
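The source distribution over MPI ranks recalled above can be mimicked in a few lines. The sketch below is only an illustration of the idea of splitting the source list into (nearly) equal parts: it uses NumPy alone and a hypothetical helper name, and it does not reproduce the actual C2-Ray or pyC2Ray MPI code.

```python
import numpy as np


def split_sources(source_positions, n_ranks):
    """Split an (N_src, 3) array of source grid positions into n_ranks
    nearly equal chunks; chunk r would be handled by MPI rank r."""
    return np.array_split(source_positions, n_ranks, axis=0)


# Ten toy sources on a 250^3 mesh, distributed over four ranks.
rng = np.random.default_rng(seed=0)
sources = rng.integers(0, 250, size=(10, 3))
chunks = split_sources(sources, n_ranks=4)
print([len(c) for c in chunks])
```

In an actual MPI run, each rank would select its own chunk by its rank index and the resulting photo-ionization rates would be summed across ranks; that reduction step is omitted here.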

The update also includes the conversion to Python of the non-time-consuming subroutines of C2-Ray. In § 3.4, we mention how commonly used libraries, such as NumPy, SciPy and Astropy, can be easily included according to the user’s needs. Moreover, the pyC2Ray user interface makes it easier to employ other codes that have also been Python-wrapped. For instance, we can easily incorporate in pyC2Ray photo-ionization rates from other spectral energy distributions calculated with a population synthesis code such as PEGASE-2 (Fioc et al., 2011) or a different chemistry solver such as GRACKLE (Smith et al., 2017).

In § 4, we show pyC2Ray results on a series of standard RT tests. In § 4.1.1 and 4.1.2, we demonstrate that pyC2Ray agrees with the analytical solutions for the ionization front size, $r_{\rm I}$, for single sources in a static and in an expanding lattice. To test that the conversion to Python of the non-time-critical subroutines was successful and does not introduce substantial differences, in § 4.1.3, we test the results on overlapping $\rm HII$ regions for sources with different black body spectra. In § 4.1.4, we probe the formation of a shadow behind an overdense region, a standard test for ray-tracing methods.

In § 4.2, we examine the performance of the new ray-tracing method, measured on the Piz Daint cluster at the Swiss National Supercomputing Centre (CSCS), equipped with an NVIDIA® Tesla P100 GPU. Our main finding is that the ASORA RT computing time grows linearly with the number of sources, $N_{\mathrm{src}}$, and cubically with the maximum radius for ray tracing, i.e., as $R^{3}$, where the distance $R$ is given in number of voxels. In the case of the Tesla P100 GPU, the computing time per source per voxel within the ray-tracing distance saturates at $3.156\,\rm ns$ when $N_{\mathrm{src}}>10^{5}$. This study allows the user to quantify the computing time and cost of a future simulation run with pyC2Ray. If we consider a cosmological simulation with 68 redshift steps, each with two time steps, a ray-tracing radius $R=11$ (grid units) and approximately $N_{\mathrm{src}}\approx 4\times 10^{6}$ sources, we can run the entire simulation from $z=21$ to $8.5$ with a total of $\sim 2.75$ GPU-h, corresponding to the cost obtained in the cosmological example presented in § 5. Secondly, the method scales strongly with the batch size up to $\sim 32$ on our system, suggesting that the GPU occupancy is not yet optimal, an issue that may be addressed in future updates. We estimate that running a reionization simulation on the same volume down to $z\sim 6$, where $N_{\mathrm{src}}=1.5\times 10^{7}$, would cost approximately 10.3 GPU-h.

Finally, in § 5, we compare pyC2Ray and C2-Ray on an actual cosmological simulation. We demonstrate that the differences within the same simulation are negligible, with an absolute relative error between $10^{-4}$ and $10^{-12}$ on the $\rm HII$ field, while both the mass- and volume-averaged ionized fractions and the power spectra accumulate an error that stays below $10^{-5}$. As mentioned in the previous paragraph, the computational cost for this simulation was $2.5$ GPU-hours, while the same simulation run on $128$ cores with C2-Ray took $\sim 14\rm k$ core-hours. Another way to describe the gain in performance is to consider the monetary cost of running these simulations. The cost of running a code on a GPU or CPU cluster varies based on the electricity consumption and other indirect expenses assessed by the high-performance computing facility. Currently, one GPU-hour costs on average about $0.8\,\rm Euros$, while one core-hour costs about $0.01\,\rm Euros$. Therefore, with these reference fees, the simulation presented in § 5 costs $2\,\rm Euros$ when run with pyC2Ray, compared to $138.25\,\rm Euros$ with C2-Ray.
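The monetary comparison is plain arithmetic on the quoted usage and reference fees, and can be checked with the short sketch below. The variable names are illustrative, and the core-hour total of 13,824 is the value reported in § 5; the result agrees with the quoted figures up to rounding.

```python
# Reference fees quoted in the text (Euros per unit of compute time).
GPU_HOUR_EUR = 0.8
CORE_HOUR_EUR = 0.01

# Measured usage for the cosmological run of Section 5.
gpu_hours = 2.5        # pyC2Ray, single GPU
core_hours = 13_824    # C2-Ray, 128 CPU cores

cost_gpu = gpu_hours * GPU_HOUR_EUR
cost_cpu = core_hours * CORE_HOUR_EUR

print(f"pyC2Ray: {cost_gpu:.2f} EUR, C2-Ray: {cost_cpu:.2f} EUR, "
      f"ratio: {cost_cpu / cost_gpu:.0f}x")
```

The ratio depends directly on the fees charged by a given computing center, so it should be read as an order-of-magnitude indication rather than a fixed number.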

With this work, we demonstrate that pyC2Ray achieves the same result as C2-Ray for a cosmological EoR simulation, but with a computing cost and time two orders of magnitude lower than the original code, confirming the motivation behind this modernization of C2-Ray. In principle, pyC2Ray is not limited by the volume size or the mass resolution but rather by the spatial resolution, $N$, and the number of sources, $N_{\rm src}$. The ASORA ray-tracing algorithm needs to store $M$ copies of the entire double-precision grid data directly on the GPU, where $M$ is the source batch size. Therefore, the current limiting factor is the available memory on the GPU, as it is generally desirable to have $M\gtrsim 20$ to achieve optimal GPU occupancy. For instance, the NVIDIA® P100 has 64 GB of memory; we can, in principle, simulate a $1024^{3}$ mesh grid but are then limited to $M<8$, which is below the optimal regime. We plan to address this issue by reducing the per-source memory requirement in those cases where the ray-tracing radius is significantly smaller than the whole box and the current implementation is needlessly memory-hungry. In this update, we focused on the simplest simulation setup, namely, no photo-heating and only photo-ionization for hydrogen chemistry. As mentioned, C2-Ray has been extended to also include helium (Friedrich et al., 2012) and X-ray heating (Ross et al., 2017), and has also been used as a module in a hydrodynamic simulation to follow the evolution of an HII region in the interstellar medium (ISM), see Arthur et al. (2011) and Medina et al. (2014). We aim to gradually include these features and extensions in pyC2Ray now that the groundwork has been laid.
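The memory constraint on the batch size can be estimated with a back-of-the-envelope calculation. The sketch below counts only the $M$ double-precision grid copies mentioned above and ignores every other allocation on the device, so it gives an upper bound on $M$; the helper names are hypothetical and the 64 GB figure is interpreted as $64\times10^{9}$ bytes (assumption).

```python
def grid_copy_bytes(n_mesh, bytes_per_value=8):
    """Memory of one N^3 grid copy; 8 bytes per value for double precision."""
    return n_mesh**3 * bytes_per_value


def max_batch_size(n_mesh, gpu_memory_bytes, bytes_per_value=8):
    """Largest number of grid copies (batch size M) that fits in memory,
    ignoring all other device allocations (assumption)."""
    return gpu_memory_bytes // grid_copy_bytes(n_mesh, bytes_per_value)


GPU_MEMORY_BYTES = 64 * 10**9  # 64 GB, decimal convention (assumption)

for n_mesh in (250, 1024):
    per_copy_gb = grid_copy_bytes(n_mesh) / 1e9
    m_max = max_batch_size(n_mesh, GPU_MEMORY_BYTES)
    print(f"N = {n_mesh}: {per_copy_gb:.3f} GB per copy, M <= {m_max}")
```

For the $250^{3}$ mesh used in § 5, a single copy is small enough that the batch size is not memory-bound, whereas for a $1024^{3}$ mesh each copy takes several GB and the bound falls below the $M\gtrsim 20$ regime, in line with the limit stated in the text.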

Acknowledgements

The authors would like to thank Emma Tolley, Shreyam Krishna and Chris Finlay for their feedback and useful discussions, as well as Hannah Ross, Jean-Guillaume Piccinali, Andreas Fink and Dmitry Alexeev for their help on the technical aspects of the GPU implementation. MB acknowledges the financial support from the Swiss National Science Foundation (SNSF) under the Sinergia Astrosignals grant (CRSII5_193826). PH acknowledges access to Piz Daint at the Swiss National Supercomputing Centre, Switzerland, under the SKA’s share with the project ID sk015. This work has been done as part of the SKACH consortium through funding from SERI. GM’s research is supported by the Swedish Research Council project grant 2020-04691_VR. We also acknowledge the allocation of computing resources provided by the National Academic Infrastructure for Supercomputing in Sweden (NAISS) at the PDC Center for High-Performance Computing, KTH Royal Institute of Technology, partially funded by the Swedish Research Council through grant agreement no. 2022-06725.

The image processing of our data was performed with the help of the NumPy and SciPy packages. All plots were created with matplotlib (Hunter, 2007), and the illustration in Figure 3 was made using Blender.

Appendix A ASORA Implementation Details

Table 1: Summary of the carbon footprint consumption of the cosmological simulation presented in this paper if both runs were performed in Switzerland.

| Model | CO2 emission [kg] | Energy consumption [kWh] | Car drive [km] | CO2 absorption [yr] |
|---|---|---|---|---|
| NVIDIA Tesla P100 | 0.02 | 1.07 | 0.07 | 1.12 |
| AMD EPYC Zen 2 | 1.22 | 105.88 | 6.97 | 110.5 |

Here, we briefly discuss how the ASORA method is implemented in C++/CUDA. As detailed in the paper, each block is assigned to a single source and owns a dedicated memory space to store the values of the column densities of voxels to be used as interpolants in upcoming tasks. Each task $S_{q}$ comprises the $|S_{q}|=4q^{2}+2$ grid voxels belonging to an octahedral shell, as illustrated in Figure 3. Threads within a block are labeled by 1D indices $x=0,\dots,N-1$, where $N$ is the block size. Labeling the voxels in the shell by $s=0,\dots,|S_{q}|-1$, all voxels can be treated if the threads iterate $\sim|S_{q}|/N$ times. It then remains to map the 1D indices $s$ to the actual 3D grid positions $(i,j,k)$ of the voxels within the shell. We use the following mapping: separate the octahedron into a "top" part containing all $k\geq k_{s}$ planes, where $k_{s}$ is the source plane, and a "bottom" part containing the rest. For the top part, which contains $2q(q+1)+1$ voxels in total, the $k$ index of any voxel can be found from its $i,j$ indices through $k=k_{s}+q-(|i-i_{s}|+|j-j_{s}|)$. To find $i,j$, we follow the procedure illustrated in Figure 13: map $s=1,\dots,2q(q+1)$ to Cartesian 2D coordinates $(a,b)$ as in (A) and apply a shear matrix $(a,b)\rightarrow(a^{\prime},b^{\prime})$ to obtain (B). Apply a translation on the subset of those with $a+2b>2q$ (C) and finally map the remaining voxel $s=0$ to $(i,j)=(i_{s}+q,j_{s})$ to obtain the full squashed top part of the octahedron (D). The same procedure is applied to the lower part, with some slight modifications, as this does not include the source plane and so contains fewer voxels in total ($2q(q-1)+1$, so that the two parts add up to $|S_{q}|$). For further details, we refer the reader to the source code.
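The voxel counts and the $k$-index relation above can be checked with a brute-force enumeration of an octahedral shell. The sketch below is deliberately not the shear-and-translation index mapping used in the CUDA kernel; it simply loops over $(i,j)$ offsets, with hypothetical function names, to confirm the shell size $4q^{2}+2$ and the top-part size $2q(q+1)+1$, and to show how many strided iterations a block of a given size would need.

```python
from math import ceil


def octahedral_shell(q, source=(0, 0, 0)):
    """Enumerate voxels at L1 (Manhattan) distance q >= 1 from the source.

    Returns (top, bottom): voxels with k >= k_s and voxels with k < k_s.
    Brute-force enumeration for illustration only.
    """
    i_s, j_s, k_s = source
    top, bottom = [], []
    for di in range(-q, q + 1):
        span = q - abs(di)
        for dj in range(-span, span + 1):
            dk = q - (abs(di) + abs(dj))  # k = k_s + q - (|i-i_s| + |j-j_s|)
            top.append((i_s + di, j_s + dj, k_s + dk))
            if dk > 0:  # mirror image below the source plane
                bottom.append((i_s + di, j_s + dj, k_s - dk))
    return top, bottom


def iterations_per_thread(q, block_size):
    """Strided passes needed for block_size threads to cover shell q."""
    return ceil((4 * q * q + 2) / block_size)


top, bottom = octahedral_shell(q=3, source=(10, 10, 10))
print(len(top), len(top) + len(bottom), iterations_per_thread(3, 16))
```

In a kernel, each thread with index $x$ would process the shell voxels $s = x, x+N, x+2N, \dots$, which is the strided pattern that `iterations_per_thread` counts.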

Figure 13: Schematic representation of the mapping of 1D indices $0,\dots,2q(q+1)$ to the 3D grid positions $(i,j,k)$ of voxels in the top part of the $q=3$ shell. The $(i,j)$ mapping is a combination of a shear (B) and a translation (C), and the $k$ coordinate is determined directly from $(i,j)$ as described in the text.

A last key point to address is that, since C2-Ray uses periodic boundary conditions, it is important to impose a further constraint on the indices $(i,j,k)$ of the voxels that are allowed, in order to avoid race conditions on coordinates that map to the same voxel under periodicity. The simulation domain is cubic, so this constraint is satisfied if we impose that no voxel can be farther away from the source than the edges of the grid, translated under periodicity. On an odd mesh ($N$ odd), this means only considering voxels at most a grid distance $N/2$ away from the source on either side. On an even mesh, a convention must be chosen, and in line with the original C2-Ray code, we impose that the maximum distance in each dimension is $N/2$ on the negative and $N/2-1$ on the positive side of the source.
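The periodicity constraint can be expressed as an allowed range of per-axis offsets from the source. The sketch below uses hypothetical helper names and assumes that $N/2$ is meant as integer division for odd meshes; it shows that with these ranges every grid index along an axis is reached exactly once, so no two offsets wrap onto the same voxel.

```python
def allowed_offsets(n_mesh):
    """Inclusive (lowest, highest) per-axis offset from the source.

    Odd mesh : symmetric range, using integer division N // 2 (assumption).
    Even mesh: N/2 on the negative side, N/2 - 1 on the positive side.
    """
    half = n_mesh // 2
    if n_mesh % 2 == 1:
        return -half, half
    return -half, half - 1


def wrapped_indices(i_source, n_mesh):
    """Grid indices reached along one axis after periodic wrapping."""
    lo, hi = allowed_offsets(n_mesh)
    return [(i_source + d) % n_mesh for d in range(lo, hi + 1)]


for n_mesh in (5, 6, 250):
    lo, hi = allowed_offsets(n_mesh)
    idx = wrapped_indices(i_source=3, n_mesh=n_mesh)
    print(n_mesh, (lo, hi), len(idx) == len(set(idx)) == n_mesh)
```

Because each axis covers exactly $N$ distinct indices, different threads never write to the same periodic image of a voxel, which is the race condition the constraint is meant to prevent.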

Appendix B Carbon footprint of cosmological simulations

Numerical simulations for cosmological and astrophysics applications often require immense computational power and extensive data processing, and therefore, their energy demands can be substantial. The environmental impact is often underappreciated and sometimes disregarded. Given the escalating concern over climate change, we want to present the ecological advantage of moving to GPU-based algorithms.

We employed Green Algorithms (www.green-algorithms.org) to estimate the carbon footprint of the cosmological simulation presented in § 5 and to compare the runs with pyC2Ray and C2-Ray. As mentioned in § 4.2, the cosmological EoR simulation presented in this paper was run with pyC2Ray in 2 hours and 30 min on one NVIDIA® Tesla P100 GPU, drawing 1.07 kWh. Based in Switzerland, this run has a carbon emission (CO2e) of 12.31 g. This corresponds to the CO2 emitted by driving a car for 70 meters, or $0.02\%$ of that of a Paris-London flight. The same simulation was run with C2-Ray in Sweden on 128 AMD EPYC Zen 2 CPU cores. The cluster drew 105.88 kWh, with emissions of 600.33 g CO2e, corresponding to a car drive of 3.43 km or $1\%$ of a Paris-London flight. A mature tree sequesters on average 0.92 g of CO2 per month (Lannelongue et al., 2021). Based on this estimate, the cosmological run performed with C2-Ray emitted what a single tree sequesters from the atmosphere in approximately 54 years, while the same simulation run with pyC2Ray corresponds to about one year. In Table 1, we compare the CO2 footprint of the simulations if both runs were performed in Switzerland.

While this analysis highlights the environmental footprint of cosmological simulations, its purpose is not to evoke shame or guilt. Rather, it serves as a reminder of the tangible costs of these essential scientific endeavors. Moreover, we did not consider using renewable energy sources and the potential impact reduction of HPC clusters using renewable energy. We aim to highlight the differences in energy consumption between simulation approaches.

References

  • Altay et al. (2008) Altay, G., Croft, R.A.C., Pelupessy, I., 2008. sphray: a smoothed particle hydrodynamics ray tracer for radiative transfer. MNRAS 386, 1931–1946. URL: http://dx.doi.org/10.1111/j.1365-2966.2008.13212.x, doi:10.1111/j.1365-2966.2008.13212.x.
  • Arthur et al. (2011) Arthur, S.J., Henney, W.J., Mellema, G., de Colle, F., Vázquez-Semadeni, E., 2011. Radiation-magnetohydrodynamic simulations of H II regions and their associated PDRs in turbulent molecular clouds. MNRAS 414, 1747–1768. doi:10.1111/j.1365-2966.2011.18507.x, arXiv:1101.5510.
  • Astropy Collaboration (2022) Astropy Collaboration, 2022. The Astropy Project: Sustaining and Growing a Community-oriented Open-source Project and the Latest Major Release (v5.0) of the Core Package. ApJ 935, 167. doi:10.3847/1538-4357/ac7c74, arXiv:2206.14220.
  • Atek et al. (2024) Atek, H., Labbé, I., Furtak, L.J., Chemerynska, I., Fujimoto, S., Setton, D.J., Miller, T.B., Oesch, P., Bezanson, R., Price, S.H., et al., 2024. Most of the photons that reionized the universe came from dwarf galaxies. Nature 626, 975–978.
  • Aubert et al. (2015) Aubert, D., Deparis, N., Ocvirk, P., 2015. Emma: an adaptive mesh refinement cosmological simulation code with radiative transfer. MNRAS 454, 1012–1037. URL: http://dx.doi.org/10.1093/mnras/stv1896, doi:10.1093/mnras/stv1896.
  • Aubert and Teyssier (2008) Aubert, D., Teyssier, R., 2008. A radiative transfer scheme for cosmological reionization based on a local eddington tensor. MNRAS 387, 295–307.
  • Aubert and Teyssier (2010) Aubert, D., Teyssier, R., 2010. Reionization simulations powered by graphics processing units. i. on the structure of the ultraviolet radiation field. The Astrophysical Journal 724, 244–266. URL: http://dx.doi.org/10.1088/0004-637X/724/1/244, doi:10.1088/0004-637x/724/1/244.
  • Barkana (2016) Barkana, R., 2016. The rise of the first stars: Supersonic streaming, radiative feedback, and 21-cm cosmology. Physics Reports 645, 1–59.
  • Bianco et al. (2021) Bianco, M., Iliev, I.T., Ahn, K., Giri, S.K., Mao, Y., Park, H., Shapiro, P.R., 2021. The impact of inhomogeneous subgrid clumping on cosmic reionization – II. Modelling stochasticity. MNRAS 504, 2443–2460. URL: https://doi.org/10.1093/mnras/stab787, doi:10.1093/mnras/stab787.
  • Bosman et al. (2022) Bosman, S.E., Davies, F.B., Becker, G.D., Keating, L.C., Davies, R.L., Zhu, Y., Eilers, A.C., D’Odorico, V., Bian, F., Bischetti, M., et al., 2022. Hydrogen reionization ends by z = 5.3: Lyman-$\alpha$ optical depth measured by the XQR-30 sample. MNRAS 514, 55–76.
  • Cavelan et al. (2020) Cavelan, A., Cabezón, R.M., Grabarczyk, M., Ciorba, F.M., 2020. A Smoothed Particle Hydrodynamics Mini-App for Exascale, in: PASC ’20: Proceedings of the Platform for Advanced Scientific Computing Conference, June 2020, p. 11. doi:10.1145/3394277.3401855, arXiv:2005.02656.
  • Choudhury (2009) Choudhury, T.R., 2009. Analytical models of the intergalactic medium and reionization. arXiv:0904.4596.
  • Choudhury and Ferrara (2006) Choudhury, T.R., Ferrara, A., 2006. Physics of cosmic reionization. arXiv:astro-ph/0603149.
  • Ciardi et al. (2001) Ciardi, B., Ferrara, A., Marri, S., Raimondo, G., 2001. Cosmological reionization around the first stars: Monte Carlo radiative transfer. MNRAS 324, 381–388. URL: https://doi.org/10.1046/j.1365-8711.2001.04316.x, doi:10.1046/j.1365-8711.2001.04316.x.
  • Dalcin and Fang (2021) Dalcin, L., Fang, Y.L.L., 2021. mpi4py: Status Update After 12 Years of Development. Computing in Science and Engineering 23, 47–54. doi:10.1109/MCSE.2021.3083216.
  • Dayal and Ferrara (2018) Dayal, P., Ferrara, A., 2018. Early galaxy formation and its large-scale effects. Physics Reports 780-782, 1–64. URL: https://www.sciencedirect.com/science/article/pii/S0370157318302266, doi:10.1016/j.physrep.2018.10.002.
  • DeBoer et al. (2017) DeBoer, D.R., Parsons, A.R., Aguirre, J.E., Alexander, P., Ali, Z.S., Beardsley, A.P., Bernardi, G., Bowman, J.D., Bradley, R.F., Carilli, C.L., et al., 2017. Hydrogen epoch of reionization array (hera). Publications of the Astronomical Society of the Pacific 129, 045001.
  • Dixon et al. (2016) Dixon, K.L., Iliev, I.T., Mellema, G., Ahn, K., Shapiro, P.R., 2016. The large-scale observational signatures of low-mass galaxies during reionization. MNRAS 456, 3011–3029. doi:10.1093/mnras/stv2887, arXiv:1512.03836.
  • Fioc et al. (2011) Fioc, M., Le Borgne, D., Rocca-Volmerange, B., 2011. PÉGASE: Metallicity-consistent Spectral Evolution Model of Galaxies. Astrophysics Source Code Library, record ascl:1108.007.
  • Friedrich et al. (2012) Friedrich, M.M., Mellema, G., Iliev, I.T., Shapiro, P.R., 2012. Radiative transfer of energetic photons: X-rays and helium ionization in C2-RAY. MNRAS 421, 2232–2250. doi:10.1111/j.1365-2966.2012.20449.x, arXiv:1201.0602.
  • Furlanetto et al. (2006) Furlanetto, S.R., Oh, S.P., Briggs, F.H., 2006. Cosmology at low frequencies: The 21cm transition and the high-redshift universe. Physics Reports 433, 181–301. URL: https://doi.org/10.1016/j.physrep.2006.08.002, doi:10.1016/j.physrep.2006.08.002.
  • Garland et al. (2008) Garland, M., Le Grand, S., Nickolls, J., Anderson, J., Hardwick, J., Morton, S., Phillips, E., Zhang, Y., Volkov, V., 2008. Parallel computing experiences with CUDA. IEEE Micro 28, 13–27.
  • Gelli et al. (2023) Gelli, V., Salvadori, S., Ferrara, A., Pallottini, A., Carniani, S., 2023. Quiescent low-mass galaxies observed by JWST in the epoch of reionization. The Astrophysical Journal Letters 954, L11.
  • Giri et al. (2019) Giri, S.K., Mellema, G., Aldheimer, T., Dixon, K.L., Iliev, I.T., 2019. Neutral island statistics during reionization from 21-cm tomography. MNRAS 489, 1590–1605.
  • Giri et al. (2018) Giri, S.K., Mellema, G., Dixon, K.L., Iliev, I.T., 2018. Bubble size statistics during reionization from 21-cm tomography. MNRAS 473, 2949–2964.
  • Giri et al. (2020) Giri, S.K., Mellema, G., Jensen, H., 2020. Tools21cm: A Python package to analyse the large-scale 21-cm signal from the epoch of reionization and cosmic dawn. Journal of Open Source Software 5, 2363.
  • Giri et al. (2023) Giri, S.K., Schneider, A., Maion, F., Angulo, R.E., 2023. Suppressing variance in 21 cm signal simulations during reionization. A&A 669, A6.
  • Gnedin and Abel (2001) Gnedin, N.Y., Abel, T., 2001. Multi-dimensional cosmological radiative transfer with a Variable Eddington Tensor formalism. New A 6, 437–455. doi:10.1016/S1384-1076(01)00068-9, arXiv:astro-ph/0106278.
  • Gnedin and Madau (2022) Gnedin, N.Y., Madau, P., 2022. Modeling cosmic reionization. Living Reviews in Computational Astrophysics 8, 3.
  • Gorbunov and Rubakov (2011) Gorbunov, D.S., Rubakov, V.A., 2011. Introduction to the Theory of the Early Universe: Hot Big Bang Theory. 2 ed., World Scientific Publishing Company. doi:10.1142/7874.
  • van Haarlem et al. (2013) van Haarlem, M.P., Wise, M.W., Gunst, A., Heald, G., McKean, J.P., Hessels, J.W., de Bruyn, A.G., Nijboer, R., Swinbank, J., Fallows, R., et al., 2013. LOFAR: The Low-Frequency Array. A&A 556, A2.
  • Harnois-Déraps et al. (2013) Harnois-Déraps, J., Pen, U.L., Iliev, I.T., Merz, H., Emberson, J.D., Desjacques, V., 2013. High-performance P3M N-body code: CUBEP3M. MNRAS 436, 540–559. doi:10.1093/mnras/stt1591, arXiv:1208.5098.
  • Harris et al. (2020) Harris, C.R., Millman, K.J., van der Walt, S.J., Gommers, R., Virtanen, P., Cournapeau, D., Wieser, E., Taylor, J., Berg, S., Smith, N.J., Kern, R., Picus, M., Hoyer, S., van Kerkwijk, M.H., Brett, M., Haldane, A., Fernández del Río, J., Wiebe, M., Peterson, P., Gérard-Marchant, P., Sheppard, K., Reddy, T., Weckesser, W., Abbasi, H., Gohlke, C., Oliphant, T.E., 2020. Array programming with NumPy. Nature 585, 357–362. doi:10.1038/s41586-020-2649-2.
  • HERA Collaboration (2023) HERA Collaboration, 2023. Improved Constraints on the 21 cm EoR Power Spectrum and the X-Ray Heating of the IGM with HERA Phase I Observations. ApJ 945, 124. doi:10.3847/1538-4357/acaf50.
  • Hunter (2007) Hunter, J.D., 2007. Matplotlib: A 2D graphics environment. CiSE 9, 90–95. doi:10.1109/MCSE.2007.55.
  • Iliev et al. (2006) Iliev, I.T., Ciardi, B., Alvarez, M.A., Maselli, A., Ferrara, A., Gnedin, N.Y., Mellema, G., Nakamoto, T., Norman, M.L., Razoumov, A.O., Rijkhorst, E.J., Ritzerveld, J., Shapiro, P.R., Susa, H., Umemura, M., Whalen, D.J., 2006. Cosmological radiative transfer codes comparison project - I. The static density field tests. MNRAS 371, 1057–1086. doi:10.1111/j.1365-2966.2006.10775.x, arXiv:astro-ph/0603199.
  • Iliev et al. (2014) Iliev, I.T., Mellema, G., Ahn, K., Shapiro, P.R., Mao, Y., Pen, U.L., 2014. Simulating cosmic reionization: how large a volume is large enough? MNRAS 439, 725–743. doi:10.1093/mnras/stt2497, arXiv:1310.7463.
  • Iliev et al. (2009) Iliev, I.T., Whalen, D., Mellema, G., Ahn, K., Baek, S., Gnedin, N.Y., Kravtsov, A.V., Norman, M., Raicevic, M., Reynolds, D.R., Sato, D., Shapiro, P.R., Semelin, B., Smidt, J., Susa, H., Theuns, T., Umemura, M., 2009. Cosmological radiative transfer comparison project - II. The radiation-hydrodynamic tests. MNRAS 400, 1283–1316. doi:10.1111/j.1365-2966.2009.15558.x, arXiv:0905.2920.
  • Kannan et al. (2019) Kannan, R., Vogelsberger, M., Marinacci, F., McKinnon, R., Pakmor, R., Springel, V., 2019. AREPO-RT: radiation hydrodynamics on a moving mesh. MNRAS 485, 117–149.
  • Kaur et al. (2020) Kaur, H.D., Gillet, N., Mesinger, A., 2020. Minimum size of 21-cm simulations. MNRAS 495, 2354–2362.
  • Lannelongue et al. (2021) Lannelongue, L., Grealey, J., Inouye, M., 2021. Green algorithms: Quantifying the carbon footprint of computation. Advanced Science 8, 2100707. URL: https://onlinelibrary.wiley.com/doi/abs/10.1002/advs.202100707, doi:10.1002/advs.202100707.
  • Medina et al. (2014) Medina, S.N.X., Arthur, S.J., Henney, W.J., Mellema, G., Gazol, A., 2014. Turbulence in simulated H II regions. MNRAS 445, 1797–1819. doi:10.1093/mnras/stu1862, arXiv:1409.5838.
  • Mellema et al. (2006) Mellema, G., Iliev, I.T., Alvarez, M.A., Shapiro, P.R., 2006. C2-Ray: A new method for photon-conserving transport of ionizing radiation. New A 11, 374–395. doi:10.1016/j.newast.2005.09.004, arXiv:astro-ph/0508416.
  • Mellema et al. (2013) Mellema, G., Koopmans, L.V., Abdalla, F.A., Bernardi, G., Ciardi, B., Daiboo, S., de Bruyn, A., Datta, K.K., Falcke, H., Ferrara, A., et al., 2013. Reionization and the cosmic dawn with the Square Kilometre Array. Experimental Astronomy 36, 235–318.
  • Mertens et al. (2020) Mertens, F.G., Mevius, M., Koopmans, L.V., Offringa, A., Mellema, G., Zaroubi, S., Brentjens, M., Gan, H., Gehlot, B.K., Pandey, V., et al., 2020. Improved upper limits on the 21 cm signal power spectrum of neutral hydrogen at z ≈ 9.1 from LOFAR. MNRAS 493, 1662–1685.
  • Nakamoto et al. (2001) Nakamoto, T., Umemura, M., Susa, H., 2001. 3D Radiative Transfer Effects on the Cosmic Reionization, in: Umemura, M., Susa, H. (Eds.), The Physics of Galaxy Formation, p. 143.
  • Navarro et al. (2014) Navarro, C.A., Hitschfeld-Kahler, N., Mateu, L., 2014. A survey on parallel computing and its applications in data-parallel problems using GPU architectures. CiCP 15, 285–329. doi:10.4208/cicp.110113.010813a.
  • Nebrin et al. (2023) Nebrin, O., Giri, S.K., Mellema, G., 2023. Starbursts in low-mass haloes at cosmic dawn. I. The critical halo mass for star formation. MNRAS, stad1852.
  • Nickolls et al. (2008) Nickolls, J., Buck, I., Garland, M., Skadron, K., 2008. Scalable parallel programming with CUDA: Is CUDA the parallel programming model that application developers have been waiting for? Queue 6, 40–53. URL: https://doi.org/10.1145/1365490.1365500, doi:10.1145/1365490.1365500.
  • Nickolls and Dally (2010) Nickolls, J., Dally, W.J., 2010. The GPU computing era. IEEE Micro 30, 56–69. doi:10.1109/MM.2010.41.
  • Ocvirk et al. (2016) Ocvirk, P., Gillet, N., Shapiro, P.R., Aubert, D., Iliev, I.T., Teyssier, R., Yepes, G., Choi, J.H., Sullivan, D., Knebe, A., Gottlöber, S., D’Aloisio, A., Park, H., Hoffman, Y., Stranex, T., 2016. Cosmic Dawn (CoDa): the first radiation-hydrodynamics simulation of reionization and galaxy formation in the Local Universe. MNRAS 463, 1462–1485. URL: https://doi.org/10.1093/mnras/stw2036, doi:10.1093/mnras/stw2036.
  • Owens et al. (2008) Owens, J.D., Houston, M., Luebke, D., Green, S., Stone, J.E., Phillips, J.C., 2008. GPU computing. Proceedings of the IEEE 96, 879–899. doi:10.1109/JPROC.2008.917757.
  • Pawlik and Schaye (2008) Pawlik, A.H., Schaye, J., 2008. TRAPHIC - radiative transfer for smoothed particle hydrodynamics simulations. MNRAS 389, 651–677. doi:10.1111/j.1365-2966.2008.13601.x, arXiv:0802.1715.
  • Planck Collaboration et al. (2020) Planck Collaboration, Aghanim, N., Akrami, Y., Ashdown, M., Aumont, J., Baccigalupi, C., Ballardini, M., Banday, A., Barreiro, R., Bartolo, N., Basak, S., et al., 2020. Planck 2018 results. VI. Cosmological parameters. A&A 641, A6.
  • Potter et al. (2016) Potter, D., Stadel, J., Teyssier, R., 2016. PKDGRAV3: Beyond trillion particle cosmological simulations for the next era of galaxy surveys. arXiv:1609.08621.
  • Pritchard and Loeb (2012) Pritchard, J.R., Loeb, A., 2012. 21 cm cosmology in the 21st century. Reports on Progress in Physics 75, 086901.
  • Raga et al. (1999) Raga, A.C., Mellema, G., Arthur, S.J., Binette, L., Ferruit, P., Steffen, W., 1999. 3D Transfer of the Diffuse Ionizing Radiation in ISM Flows and the Preionization of a Herbig-Haro Working Surface. Rev. Mexicana Astron. Astrofis. 35, 123.
  • Rijkhorst et al. (2006) Rijkhorst, E.J., Plewa, T., Dubey, A., Mellema, G., 2006. Hybrid characteristics: 3D radiative transfer for parallel adaptive mesh refinement hydrodynamics. A&A 452, 907–920. doi:10.1051/0004-6361:20053401, arXiv:astro-ph/0505213.
  • Ritzerveld (2005) Ritzerveld, J., 2005. The diffuse nature of Strömgren spheres. A&A 439, L23–L26. doi:10.1051/0004-6361:200500150, arXiv:astro-ph/0506637.
  • Rosdahl et al. (2013) Rosdahl, J., Blaizot, J., Aubert, D., Stranex, T., Teyssier, R., 2013. RAMSES-RT: radiation hydrodynamics in the cosmological context. MNRAS 436, 2188–2231. doi:10.1093/mnras/stt1722, arXiv:1304.7126.
  • Ross et al. (2019) Ross, H.E., Dixon, K.L., Ghara, R., Iliev, I.T., Mellema, G., 2019. Evaluating the QSO contribution to the 21-cm signal from the cosmic dawn. MNRAS 487, 1101–1119.
  • Ross et al. (2017) Ross, H.E., Dixon, K.L., Iliev, I.T., Mellema, G., 2017. Simulating the impact of X-ray heating during the cosmic dawn. MNRAS 468, 3785–3797.
  • Ross et al. (2021) Ross, H.E., Giri, S.K., Mellema, G., Dixon, K.L., Ghara, R., Iliev, I.T., 2021. Redshift-space distortions in simulations of the 21-cm signal from the cosmic dawn. MNRAS 506, 3717–3733.
  • Rácz et al. (2019) Rácz, G., Szapudi, I., Dobos, L., Csabai, I., Szalay, A.S., 2019. StePS: A multi-GPU cosmological N-body code for compactified simulations. arXiv:1811.05903.
  • Schmidt-Voigt and Koeppen (1987) Schmidt-Voigt, M., Koeppen, J., 1987. Influence of stellar evolution on the evolution of planetary nebulae. I - Numerical method and hydrodynamical structures. A&A 174, 211–222.
  • Semelin et al. (2007) Semelin, B., Combes, F., Baek, S., 2007. Lyman-alpha radiative transfer during the epoch of reionization: contribution to 21-cm signal fluctuations. A&A 474, 365–374. URL: http://dx.doi.org/10.1051/0004-6361:20077965, doi:10.1051/0004-6361:20077965.
  • Shapiro and Giroux (1987) Shapiro, P.R., Giroux, M.L., 1987. Cosmological H II Regions and the Photoionization of the Intergalactic Medium. ApJ 321, L107. doi:10.1086/185015.
  • Smith et al. (2017) Smith, B.D., Bryan, G.L., Glover, S.C.O., Goldbaum, N.J., Turk, M.J., Regan, J., Wise, J.H., Schive, H.Y., Abel, T., Emerick, A., O’Shea, B.W., Anninos, P., Hummels, C.B., Khochfar, S., 2017. GRACKLE: a chemistry and cooling library for astrophysics. MNRAS 466, 2217–2234. doi:10.1093/mnras/stw3291, arXiv:1610.09591.
  • Spitzer (1998) Spitzer, L., 1998. Physical Processes in the Interstellar Medium.
  • Trott et al. (2020) Trott, C.M., Jordan, C., Midgley, S., Barry, N., Greig, B., Pindor, B., Cook, J., Sleap, G., Tingay, S., Ung, D., et al., 2020. Deep multiredshift limits on Epoch of Reionization 21 cm power spectra from four seasons of Murchison Widefield Array observations. MNRAS 493, 4711–4727.
  • Virtanen et al. (2020) Virtanen, P., Gommers, R., Oliphant, T.E., Haberland, M., Reddy, T., Cournapeau, D., Burovski, E., Peterson, P., Weckesser, W., Bright, J., van der Walt, S.J., Brett, M., Wilson, J., Millman, K.J., Mayorov, N., Nelson, A.R.J., Jones, E., Kern, R., Larson, E., Carey, C.J., Polat, İ., Feng, Y., Moore, E.W., VanderPlas, J., Laxalde, D., Perktold, J., Cimrman, R., Henriksen, I., Quintero, E.A., Harris, C.R., Archibald, A.M., Ribeiro, A.H., Pedregosa, F., van Mulbregt, P., SciPy 1.0 Contributors, 2020. SciPy 1.0: Fundamental Algorithms for Scientific Computing in Python. Nature Methods 17, 261–272. doi:10.1038/s41592-019-0686-2.
  • Wang and Meng (2021) Wang, Q., Meng, C., 2021. PhotoNs-GPU: A GPU accelerated cosmological simulation code. RAA 21, 281. URL: https://dx.doi.org/10.1088/1674-4527/21/11/281, doi:10.1088/1674-4527/21/11/281.
  • Watson et al. (2013) Watson, W.A., Iliev, I.T., D’Aloisio, A., Knebe, A., Shapiro, P.R., Yepes, G., 2013. The halo mass function through the cosmic ages. MNRAS 433, 1230–1245.
  • Wayth et al. (2018) Wayth, R.B., Tingay, S.J., Trott, C.M., Emrich, D., Johnston-Hollitt, M., McKinley, B., Gaensler, B.M., Beardsley, A.P., Booler, T., Crosse, B., et al., 2018. The Phase II Murchison Widefield Array: design overview. Publications of the Astronomical Society of Australia 35, e033.
  • Whalen and Norman (2006) Whalen, D., Norman, M.L., 2006. A Multistep Algorithm for the Radiation Hydrodynamical Transport of Cosmological Ionization Fronts and Ionized Flows. ApJS 162, 281–303. doi:10.1086/499072, arXiv:astro-ph/0508214.
  • Zaroubi (2013) Zaroubi, S., 2013. The Epoch of Reionization, in: Wiklind, T., Mobasher, B., Bromm, V. (Eds.), The First Galaxies. Astrophysics and Space Science Library, volume 396. Springer Berlin Heidelberg, Berlin, Heidelberg, pp. 45–101. doi:10.1007/978-3-642-32362-1_2.