Optimising the Processing and Storage of Radio Astronomy Data
Abstract
The next generation of radio telescopes is challenging existing data analysis paradigms, as these instruments have an order of magnitude larger collecting area and bandwidth. The two primary problems encountered when processing their data are the storage required and the fact that processing is primarily I/O limited. An example is the data deluge expected from the SKA-Low telescope, of about 300 PB per year. To address these issues, we have demonstrated lossy and lossless compression of data from an existing precursor telescope, the Australian Square Kilometre Array Pathfinder (ASKAP), using the MGARD and ADIOS2 libraries. We find that data processing is faster by a factor of 7 and obtain compression ratios from a factor of 7 (lossless) up to 37 (lossy with an absolute error bound of ). We discuss the effectiveness of lossy MGARD compression and its adherence to the designated error bounds, the trade-off between these error bounds and the corresponding compression ratios, and the potential consequences of these I/O and storage improvements for the science quality of the data products.
1 Introduction
1.1 The Square Kilometre Array
Radio astronomy is undergoing a paradigm shift with the planning of a number of next-generation instruments, such as the Square Kilometre Array (SKA), the next-generation Very Large Array (ngVLA) and the next-generation Event Horizon Telescope (ngEHT). All of these provide an order of magnitude increase in bandwidth and a few orders of magnitude increase in collecting area (and sensitivity) over current radio telescopes. This enhancement will provide opportunities to survey the radio sky in exquisite detail, detect the signal from the epoch of reionization when the first stars were born, and measure the spectral signal from millions of galaxies. To realise this potentially massive improvement in our understanding, the radio astronomy community must manage and process unprecedented volumes of data. We focus on the SKA, as Australia is a founding member of the collaboration. The SKA will be built in Australia for frequencies spanning 50 to 350 MHz (SKA-Low) and in South Africa for frequencies from 350 MHz to 15 GHz (SKA-Mid). Phase 1 will have about 500 40 m aperture array elements in SKA-Low and about 200 15 m parabolic dishes in SKA-Mid. Construction has commenced and preliminary observations will be made from 2025. The data rates out of the correlators will be 1 and 2.5 TB/s respectively; these data will need to be captured into a local buffer and then processed the same day, as storage will be limited.
1.2 Australian SKA Pathfinder (ASKAP)/Yandasoft
The Australian Square Kilometre Array Pathfinder (ASKAP) (Johnston et al., 2007; Hotan et al., 2021) radio telescope is one of the SKA precursors and is opening up a new window for large extragalactic H i surveys beyond the local Universe due to its wide spectral bandwidth and large instantaneous field-of-view (FoV). ASKAP consists of 36 dishes, each of diameter 12 m and equipped with phased array feeds (PAFs) forming multiple receiving beams electronically (DeBoer et al., 2009; Hampson et al., 2012). The baseline lengths of the full array are from 22 m to 6.3 km. The phased array feed technology allows ASKAP to have a large FoV (Bunton & Hay, 2010) and a wide bandwidth of with a channel resolution of 18.5-0.58 kHz in the observing frequency between 0.7 and 1.8 GHz, which makes it an optimal survey instrument, enabling it to conduct both wide and deep surveys in a comparatively short period of time (Hotan et al., 2021).
1.3 Deep Investigation of Neutral Gas Origins (DINGO)
Deep Investigation of Neutral Gas Origins (DINGO) (Meyer, 2009; Rhee et al., 2023) is an ASKAP deep H i survey project aiming to provide a cosmologically representative dataset for H i emission, enabling studies of the H i gas content of galaxies over the past 4 billion years out to distances of 5 billion light-years, due to the accelerated expansion of the universe. The sky coverage of the DINGO survey is wider than that of the deep H i surveys previously conducted and of the ongoing deep H i surveys being carried out with other telescopes such as the JVLA (the Karl G. Jansky Very Large Array), MeerKAT (the Meer-Karoo Array Telescope) and FAST (the Five-hundred-meter Aperture Spherical Telescope), namely CHILES (the COSMOS H i Large Extra-galactic Survey; Fernández et al., 2013), LADUMA (Looking At the Distant Universe with the MeerKAT Array; Holwerda et al., 2012; Blyth et al., 2016)/MIGHTEE-HI (the H i emission project of the MeerKAT International GigaHertz Tiered Extragalactic Exploration survey; Maddox et al., 2021), and FUDS (the FAST Ultra Deep Survey; Xi et al., 2021), respectively. Due to its large volume coverage, the DINGO survey will reduce cosmic variance on H i measurements, thereby providing a unique legacy H i dataset.
DINGO pilot observations were made over the Galaxy and Mass Assembly (GAMA) (Driver et al., 2022) 23 h (G23) field, centred at (J2000) = , in 2020 and 2022, respectively. The pilot observations used the full array of ASKAP’s 36 antennas with the 288 MHz bandwidth (15552 channels) in the observing frequency ranges of 859.5-1147.5 MHz (band 1) and 1151.5-1439.5 MHz (band 2), and a channel width of 18.5 kHz, equivalent to a velocity resolution of in cosmologically nearby galaxies. The DINGO pilot survey obtained 100 hr of data in both the band 1 and band 2 frequency ranges from one of the G23 tiles, in order to develop a DINGO processing pipeline for deep imaging and long-term data storage, which this paper presents.
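As a rough check of the figures quoted above, the channel width follows from dividing the bandwidth by the number of channels, and the corresponding velocity width for a nearby galaxy follows from the Doppler relation Δv ≈ c Δν / ν0. The sketch below is illustrative arithmetic only: it assumes the H i rest frequency ν0 = 1420.40575 MHz, which is not stated in the text, and the function names are hypothetical.

```python
# Illustrative arithmetic only. The H i rest frequency is an assumption
# that does not appear in the text above.
C_KM_S = 299_792.458        # speed of light in km/s
HI_REST_MHZ = 1420.40575    # assumed H i rest frequency in MHz


def channel_width_khz(bandwidth_mhz: float, n_channels: int) -> float:
    """Channel width in kHz for a band split into equal channels."""
    return bandwidth_mhz * 1e3 / n_channels


def velocity_width_km_s(channel_khz: float, freq_mhz: float = HI_REST_MHZ) -> float:
    """Doppler velocity width of one channel at the given frequency."""
    return C_KM_S * (channel_khz * 1e-3) / freq_mhz


width_khz = channel_width_khz(288.0, 15552)
dv_km_s = velocity_width_km_s(width_khz)
print(f"channel width = {width_khz:.2f} kHz, velocity width = {dv_km_s:.2f} km/s")
```

At lower observing frequencies (higher redshift) the same channel width corresponds to a larger velocity width, since Δv scales as 1/ν.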
The DINGO full survey allocation is 3200 hr for the G23 field, split into two 1600 hr allocations for the frequency ranges of 859.5-1147.5 MHz and 1151.5-1439.5 MHz, respectively. So far, 16 hr of observations have been conducted in the higher frequency band.
1.4 Adaptable I/O System version 2 (ADIOS2)
The Adaptable Input Output System version 2 (ADIOS2; Godoy et al., 2020) is a software framework with a simple input/output abstraction and a self-describing data model centred around distributed data arrays, allowing multiple applications to publish and subscribe to data at large levels of concurrency. It also introduces a larger organizing concept, the “step”, for driving data production and consumption within applications. ADIOS2 recently gained a new mechanism that allows applications to use state-of-the-art lossless and lossy compression algorithms. This mechanism relies on tight integration between I/O and data reduction, and allows applications to take full advantage of both the self-describing format and lossy compression techniques.
1.5 MultiGrid Adaptive Reduction of Data (MGARD)
MGARD (Gong et al., 2023) offers error-controlled lossy compression rooted in multi-grid theories. It transforms floating-point scientific data into a multilevel representation, followed by quantization, lossless encoding, and ultimately generating a self-describing compressed buffer. One of MGARD’s notable features is its array of error control options, including , , point-wise relative , and options to define varied error bounds across regions or different frequency components. This flexibility is valuable for preserving Quantities-of-Interest (QoI) (Gong et al., 2022) derived from the reconstructed data. For region-adaptive compression, MGARD accommodates Regions-of-Interest (RoI) specified through either bounding boxes or masks, with the latter especially useful for irregular shaped RoIs. In cases where RoI information is not provided, MGARD employs internal functions to identify regions rich in detail, leveraging data turbulence measured across multiple scales.
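To make the idea of an error-controlled pipeline more concrete, the following minimal sketch shows only the last two stages named above, quantization followed by lossless encoding, under an absolute point-wise error bound. It is not MGARD's algorithm: the multilevel decomposition is omitted entirely, zlib stands in for the lossless encoder, and the function names compress_abs and decompress_abs are hypothetical.

```python
import random
import struct
import zlib


def compress_abs(values, abs_bound):
    """Quantise to multiples of 2*abs_bound, then encode losslessly.

    Rounding to the nearest multiple of the step keeps every value
    within step/2 = abs_bound of its original.
    """
    step = 2.0 * abs_bound
    codes = [round(v / step) for v in values]
    raw = struct.pack(f"<{len(codes)}i", *codes)  # 32-bit integer codes
    return zlib.compress(raw, 9)


def decompress_abs(blob, abs_bound):
    """Invert the lossless stage and rescale the integer codes."""
    raw = zlib.decompress(blob)
    codes = struct.unpack(f"<{len(raw) // 4}i", raw)
    step = 2.0 * abs_bound
    return [c * step for c in codes]


random.seed(0)
data = [random.gauss(0.0, 1.0) for _ in range(10_000)]  # synthetic noise
bound = 1e-3
blob = compress_abs(data, bound)
restored = decompress_abs(blob, bound)

max_err = max(abs(a - b) for a, b in zip(data, restored))
ratio = 8 * len(data) / len(blob)  # relative to 64-bit floats
print(f"max error = {max_err:.2e}, compression ratio = {ratio:.1f}")
```

Loosening the bound reduces the number of distinct integer codes, which is what lets the lossless stage achieve a higher ratio; this is the trade-off between error bound and compression ratio examined later in the paper.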
1.6 Radio Astronomy Data
Radio data presents a unique challenge. Much of the data is noise (see, for example, Fig. 2). In this figure, only a small fraction of the pixels in the image contain emission from a galaxy, which appears as a spatially concentrated region of high radio emission. Not all astronomical sources are spatially concentrated, and with ever-improving resolution, what was once a single source can be resolved into spatially extended, diffuse emission. Moreover, some signals, such as the sought-after signal of reionization from the first stars, will be distributed across the entire image and hidden in the noise.
This data analysis challenge is combined with a data volume challenge. Radio astronomy data volumes from current-generation telescopes are at the PB scale. This data is also often stored as a MeasurementSet (Kemball & Wieringa, 2000), a format in which visibility and single-dish data are stored to accommodate synthesis. Although this format has historically been very useful, it does not scale particularly well, and science processing often requires non-optimal access patterns, giving rise to additional I/O load.
This data challenge will only increase once next-generation telescopes become operational and is the motivation for this study.
2 Methods
2.1 ASKAPSoft
ASKAPSoft is a package that contains the software necessary for processing data from the ASKAP telescope. Its primary purpose is for the full-scale processing of ASKAP data, from the observed visibility data to spectral-line and continuum images.
ASKAPSoft provides the well-established routines needed to image a spectral-line dataset. These are to: read the data; apply weighting kernels to set the image parameters (field of view, sensitivity to low surface brightness or compact objects, etc.); resample the data onto a regularly sampled grid for inversion using the Fast Fourier Transform; iteratively deconvolve to account for the limited sampling of Fourier terms; and finally, for deep images, stack multiple epochs of observing to form the final image. The DINGO pipeline reorganises these tasks so that the grids, rather than the images, are preserved for stacking. This reduces the number of inversions required and improves the quality of the deconvolution.
Imaging radio interferometric data at SKA-scales is expected to be I/O bound due to the massive size of the datasets. The computational costs are dominated by the gridding and the inversion steps, and these two are expected to have approximately similar requirements. Thus any reduction in the size of the datasets would have a significant impact on the total processing time and any reduction in the gridding or inversion would have a significant (albeit smaller) impact on the compute costs.
ASKAPSoft contains a wide variety of programs and scripts that are useful in the analysis, manipulation, and processing of radio astronomy data. We primarily use two applications present in ASKAPSoft: imager and cdeconvolver. imager creates spectral-line image cubes, which use frequency as an analogue for distance, allowing for a 3-dimensional view of the sky. This single program therefore includes the read, weighting, gridding and inversion steps. In our case, imager is used to produce visibility grids, an intermediate product that we can manipulate before producing the final image. That is, the inversion step is not performed and the normally intermediate data products (i.e. the grids) are saved for later processing. Because these grids are sparse, compression is particularly efficient, and because they are produced at a sufficiently early point in the pipeline, processing parameters can still be changed as needed. The deep imaging mentioned in section 1.3 requires stacking these grids over 3200 hours of observed data to improve sensitivity to sources within the final image. cdeconvolver is a bespoke application designed specifically to perform this stacking and complete the imaging process. Further information on these applications can be found at https://www.atnf.csiro.au/computing/software/askapsoft/sdp/docs/current/index.html
2.2 DINGO Pipeline
We start with a MeasurementSet data format (Kemball & Wieringa, 2000) that contains the calibrated, continuum-subtracted visibilities (the continuum here refers to the components of the data that are independent of frequency). This is the input to imager, described above, which produces a visibility grid, a PSF (Point Spread Function) grid, and a PCF (Point Convolution Function) grid. The visibility grid is a grid representation of the 3-dimensional visibilities projected onto a 2-dimensional grid for each frequency channel (producing a 3-dimensional grid). The PSF grid represents the inherent smearing of point sources due to the baseline sampling of the system. The PCF grid represents the size, location and weighting of the convolutional kernels applied during the gridding process. These are used in the final imaging to apply weighting to individual visibility cells. These grids are passed to the cdeconvolver application which, if more than one observation is provided, will sum the grids together as they are read in. These summed grids are then imaged for analysis. Here we validate that the resulting image is free of detrimental RFI and that the sensitivity of the image is better than that of the non-stacked image.
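The reason grids can be summed before imaging is that the Fourier inversion is linear: inverting the sum of the grids gives the same result as summing the individually inverted images, but needs only one inversion. The sketch below illustrates this for a single frequency channel with synthetic data. It assumes NumPy is available and does not use the cdeconvolver code path; the grid size and number of epochs are arbitrary.

```python
import numpy as np

rng = np.random.default_rng(0)
n_epochs, n_pix = 4, 64

# Hypothetical complex visibility grids, one per observing epoch.
grids = (rng.normal(size=(n_epochs, n_pix, n_pix))
         + 1j * rng.normal(size=(n_epochs, n_pix, n_pix)))

# Option A: invert every epoch separately, then sum the images
# (n_epochs inverse FFTs).
image_sum = sum(np.fft.ifft2(g) for g in grids)

# Option B: sum the grids as they are read in, then invert once
# (a single inverse FFT).
stacked_image = np.fft.ifft2(grids.sum(axis=0))

max_diff = float(np.abs(image_sum - stacked_image).max())
print(f"largest difference between the two routes: {max_diff:.2e}")
```

The two routes agree to floating-point precision, so the saving in option B comes purely from performing one inversion instead of one per epoch.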
2.3 Compression
The focus of the compression here is that of the grids. Due to the nature of the PCF grid, lossless compression provided a compression ratio of 100, with lossy compression providing ratios similar to that of the visibility and PSF grids. For this reason, we are only comparing the compression of visibility and PSF grids. We use error bounds of , , , and both as relative and absolute error bounds. The cdeconvolver application failed to complete the imaging of the data for the relative case, the reasons for which are still under investigation. The lossless compression uses the zstd algorithm to compress the grids, although bzip2 also provides a similar level of compression. This provided a consistent compression ratio and (as expected) did not alter the decompressed data in any way.
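The effect of grid sparsity on lossless compression can be illustrated with a synthetic example. The sketch below builds a hypothetical complex grid in which only cells near the centre are populated, compresses it losslessly, and confirms that decompression returns the data bit for bit. It assumes NumPy is available and uses zlib from the Python standard library as a stand-in for zstd, so the ratio it prints is not comparable with the values reported in this paper.

```python
import zlib

import numpy as np

rng = np.random.default_rng(1)
n_pix, radius = 256, 20  # arbitrary grid size and populated radius

# Hypothetical sparse grid: zeros everywhere except a central disc.
grid = np.zeros((n_pix, n_pix), dtype=np.complex64)
y, x = np.indices(grid.shape)
filled = np.hypot(y - n_pix // 2, x - n_pix // 2) <= radius
n_filled = int(filled.sum())
grid[filled] = (rng.normal(size=n_filled)
                + 1j * rng.normal(size=n_filled)).astype(np.complex64)

raw = grid.tobytes()
packed = zlib.compress(raw, 9)
ratio = len(raw) / len(packed)

# Lossless: the decompressed bytes reproduce the grid exactly.
restored = np.frombuffer(zlib.decompress(packed),
                         dtype=np.complex64).reshape(grid.shape)

print(f"filled fraction = {n_filled / grid.size:.3f}, ratio = {ratio:.1f}")
```

Most of the ratio comes from the long runs of zero-valued cells; the populated cells contain noise-like values that compress poorly without loss.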
2.4 Parallel I/O
The integration of ADIOS2 into the I/O stage of ASKAPSoft leveraged the use of casacore and the implementation of existing ADIOS2 storage manager (Wang et al., 2016). An ADIOS2 Image module was developed to bridge the gap between imager and cdeconvolver’s use of image inputs and outputs, and the storage manager’s table interface. These applications use MPI to implement parallel processing which, when passing the communicators to the storage manager (and by extension ADIOS2), enables the I/O to occur in parallel.
3 Results
3.1 Compression Comparison
Figure 1 shows the compression ratios of the visibility and PSF grids as the error bound is increased from zero (lossless) to for both absolute and relative error measurements. Lossless compression is consistent at a ratio of 7.5, and lossy compression appears to be better in the relative case. The compression ratios of the real and imaginary parts of the visibility grid are consistent with each other, whereas the imaginary part of the PSF grid compresses better than the real part in the absolute case and worse in the relative case. This is because the real and imaginary parts of the PSF (unlike those of the visibilities) represent different properties and cover different ranges of values.
Of the eight compression tests, seven completed the imaging and produced qualitatively acceptable images. The reason for the failure of the relative- result is still under investigation; however, this is likely after compression. These images are shown in Figure 2, which shows a collapsed view over the 60 channels that contain a radio source. The only images in this panel that show any qualitative deviation from the original (or lossless) image are those for the relative error bound and the absolute error bound.
This is further reinforced by Figure 3, which shows that the majority of the deviation is at the corners of the image, with the images for the relative and absolute error bounds showing an increased residual near the centre of the image.
The spectral profile shown in the left panel of Figure 4 is the typical double-peaked profile of a galaxy. This is the brightest source within this image and provides a good test of the quality of the reconstruction of bright sources after MGARD compression. The panels on the right show the residuals of this profile between the original image and the compressed image for each error bound. The residuals are uniform and consistent with the set error bounds, noting that the compression was performed on the visibility and PSF grids and that the error bounds are set on the values of these grids.
3.2 Time comparison
Figure 5 shows the comparison between the pre-existing I/O method (labelled CASA here) and the same processing done while using ADIOS2 without engaging any parallel I/O. ADIOS2 appears to perform similarly to CASA for small amounts of data (640 MB to 6.25 GB), but improves significantly as more data (60 GB and above) is written during processing.
Figure 6 shows the comparison of the same processing as above, but using parallel-enabled ADIOS2 for writing. The same trend as before can be seen here, where ADIOS2 performs significantly better as more data is written to disk.
Figure 7 shows this comparison for the 100-channel case, but with MGARD compression during the writing stage of the processing. Lossless compression appears to perform significantly better than the same processing with lossy compression. In the current implementation, MGARD is run entirely on CPUs, although it is also written to run on GPUs. We plan to implement and test this compression using GPUs in the future, as we expect that this will be more time-efficient. However, if the processing is I/O bound, the computational costs will not be highly significant.
4 Discussion
4.1 Compression ratios vs distortion
The histograms of the image residuals (Figure 8) indicate the quality of the reconstructed data after imaging. MGARD produces residual distributions of consistent shape, and the maximum residuals are consistent with the error bounds specified during compression. The residuals produced from the images compressed with relative error bounds surpass one standard deviation for the and cases, and those for the absolute error bounds surpass for the case. That is, the distortions introduced by the compression will start to be detectable against the noise levels in the images. The images in Figure 3 show that the majority of these high residuals are situated in the corners of the image, which are cut off or normalised out during regular imaging.
An alternative distortion measure is the 2-point correlation of the residuals. This can be produced by performing a Discrete Fourier Transform (DFT) on the residuals, binning the pixels in radius, and calculating the product of the value of these pixels with their conjugate. The result is shown in Figure 9 and describes the prominence of patterns of certain scales within the residuals. The higher error bounds all show a sharp cutoff at 289, whereas the residuals for absolute error and relative error show a negative logarithmic trend between 289 and 410. The value of 289 corresponds to the largest radial distance of a non-zero cell on the visibility grid, and hence to the smallest resolvable scale. The excess of values above this spatial frequency shows a leakage of values within the PSF grid, adding small-scale residuals below the smallest resolvable scale.
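The procedure described above (transform, multiply by the conjugate, bin in radius) can be sketched as follows. The squared modulus of the DFT is the Fourier transform of the autocorrelation, which is why it serves as a 2-point measure. This is a minimal sketch assuming NumPy; the function name radial_power, the number of bins and the synthetic test image are illustrative choices and not the analysis code used for Figure 9.

```python
import numpy as np


def radial_power(residual, n_bins=32):
    """Radially binned mean of |DFT|^2 for a 2-D residual image."""
    f = np.fft.fftshift(np.fft.fft2(residual))  # zero frequency at centre
    power = (f * np.conj(f)).real               # value times its conjugate

    ny, nx = residual.shape
    y, x = np.indices((ny, nx))
    r = np.hypot(y - ny // 2, x - nx // 2)      # radius in grid cells

    edges = np.linspace(0.0, r.max(), n_bins + 1)
    which = np.clip(np.digitize(r.ravel(), edges) - 1, 0, n_bins - 1)
    sums = np.bincount(which, weights=power.ravel(), minlength=n_bins)
    counts = np.bincount(which, minlength=n_bins)
    centres = 0.5 * (edges[:-1] + edges[1:])
    return centres, sums / np.maximum(counts, 1)


# Synthetic residual: a single ripple with 10 cycles across the image.
n = 128
col = np.indices((n, n))[1]
ripple = np.cos(2 * np.pi * 10 * col / n)
centres, power = radial_power(ripple)
peak_radius = float(centres[np.argmax(power)])
print(f"power peaks in the bin centred at radius {peak_radius:.1f} cells")
```

A residual dominated by one spatial scale produces a single peak at the matching radius, whereas a sharp cutoff at a given radius indicates that no structure is present on scales smaller than the one corresponding to that radius.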
Figure 10 shows the distribution of the residuals for the visibility and PSF grids. Each set of residuals appears to meet the specified error bounds in a consistent manner: each distribution falls off toward zero at the specified error bound and exceeds that bound by only a factor of two. The exception to this is for the absolute error bound, where the visibilities exceeded the error bound by a factor of three and the PSF exceeded the bound by a factor of ten. This is likely due either to a limit in the internal error calculation within MGARD, or to the error bound of pushing the limit on the compressibility of the data, in which case lossless compression would be a better fit for this choice of error bound.
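A check of this kind reduces to comparing the largest absolute residual with the requested bound. The sketch below shows one way to express that as a single number, the factor by which the bound is exceeded. It assumes NumPy and synthetic data; the function name bound_excess is hypothetical, and for complex grids it would be applied to the real and imaginary parts separately.

```python
import numpy as np


def bound_excess(original, reconstructed, abs_bound):
    """Largest |residual| divided by the requested absolute bound.

    Values at or below 1 mean the bound was respected everywhere.
    """
    residual = np.asarray(reconstructed) - np.asarray(original)
    return float(np.abs(residual).max() / abs_bound)


rng = np.random.default_rng(2)
bound = 1e-3
original = rng.normal(size=10_000)

# Synthetic reconstruction whose errors stay inside the bound.
within = original + rng.uniform(-bound, bound, size=original.size)
excess_within = bound_excess(original, within, bound)

# Synthetic reconstruction with one value off by three times the bound.
outlier = original.copy()
outlier[0] += 3 * bound
excess_outlier = bound_excess(original, outlier, bound)

print(f"within: {excess_within:.2f}, with outlier: {excess_outlier:.2f}")
```

Because the measure is driven by the single worst value, it complements the residual histograms, which show how the bulk of the residuals is distributed relative to the bound.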
4.2 Consequences for SKA data management
The SKA has invested, and will continue to invest, a significant amount of capital in the storage, transmission and computing infrastructure for both nodes of the SKA. The tests described in the preceding sections have been specifically designed around the bespoke nature of deep imaging with ASKAP. Preliminary tests have shown that applying MGARD to simulated raw SKA data yields similar compression factors; including it would considerably simplify the implementation of storage and transmission solutions for the SKA. The processing done for SKA and ASKAP to produce images is almost identical, meaning that implementing ADIOS2 parallel I/O in the SKA pipelines would improve the efficiency and speed of processing.
5 Conclusion
We have demonstrated the application of MGARD compression, via the ADIOS2/CASACore framework, within standard radio astronomy software. The impact is improved I/O performance, and thus shorter pipeline run times, due to the reduced dataset sizes.
Comparing the images made with various compression approaches allows us to quantify the impact. We show that we can apply lossy compression to the data files and achieve compression ratios up to 15, using well-defined error bounds, without impacting the results.
In addition, the parallel reading and writing provided by ADIOS2 offers a further improvement in I/O, and this is readily integrated with most common radio astronomy applications via the CASACore libraries.
Compression of the visibilities offers an attractive solution to the SKA I/O challenge, and we have demonstrated that the MGARD approach to compression can guarantee that the data is not degraded.
References
- Blyth et al. (2016) Blyth, S., Baker, A. J., Holwerda, B., et al. 2016, in MeerKAT Science: On the Pathway to the SKA, 4
- Bunton & Hay (2010) Bunton, J. D., & Hay, S. G. 2010, in 2010 International Conference on Electromagnetics in Advanced Applications, 728–730, doi: 10.1109/ICEAA.2010.5651120
- DeBoer et al. (2009) DeBoer, D. R., Gough, R. G., Bunton, J. D., et al. 2009, IEEE Proceedings, 97, 1507, doi: 10.1109/JPROC.2009.2016516
- Driver et al. (2022) Driver, S. P., Bellstedt, S., Robotham, A. S. G., et al. 2022, MNRAS, 513, 439, doi: 10.1093/mnras/stac472
- Fernández et al. (2013) Fernández, X., van Gorkom, J. H., Hess, K. M., et al. 2013, ApJ, 770, L29, doi: 10.1088/2041-8205/770/2/L29
- Godoy et al. (2020) Godoy, W., Podhorszki, N., Wang, R., et al. 2020, SoftwareX, 12, 100561, doi: 10.1016/j.softx.2020.100561
- Gong et al. (2022) Gong, Q., Whitney, B., Zhang, C., et al. 2022, in Proceedings of the 34th International Conference on Scientific and Statistical Database Management, SSDBM ’22 (New York, NY, USA: Association for Computing Machinery), doi: 10.1145/3538712.3538717
- Gong et al. (2023) Gong, Q., Chen, J., Whitney, B., et al. 2023, SoftwareX, 24, 101590, doi: 10.1016/j.softx.2023.101590
- Hampson et al. (2012) Hampson, G., Macleod, A., Beresford, R., et al. 2012, in 2012 International Conference on Electromagnetics in Advanced Applications, 807–809, doi: 10.1109/ICEAA.2012.6328742
- Holwerda et al. (2012) Holwerda, B. W., Blyth, S. L., & Baker, A. J. 2012, in IAU Symposium, Vol. 284, The Spectral Energy Distribution of Galaxies - SED 2011, ed. R. J. Tuffs & C. C. Popescu, 496–499, doi: 10.1017/S1743921312009702
- Hotan et al. (2021) Hotan, A. W., Bunton, J. D., Chippendale, A. P., et al. 2021, PASA, 38, e009, doi: 10.1017/pasa.2021.1
- Johnston et al. (2007) Johnston, S., Bailes, M., Bartel, N., et al. 2007, PASA, 24, 174, doi: 10.1071/as07033
- Kemball & Wieringa (2000) Kemball, A., & Wieringa, M. 2000, URL: http://casa.nrao.edu/Memos/229.html, 20
- Maddox et al. (2021) Maddox, N., Frank, B. S., Ponomareva, A. A., et al. 2021, A&A, 646, A35, doi: 10.1051/0004-6361/202039655
- Meyer (2009) Meyer, M. J. 2009, ASKAP Survey Science Proposal
- Rhee et al. (2023) Rhee, J., Meyer, M., Popping, A., et al. 2023, MNRAS, 518, 4646, doi: 10.1093/mnras/stac3065
- Wang et al. (2016) Wang, R., Harris, C., & Wicenec, A. 2016, Astronomy and Computing, 16, 146, doi: 10.1016/j.ascom.2016.05.003
- Xi et al. (2021) Xi, H., Staveley-Smith, L., For, B.-Q., et al. 2021, MNRAS, 501, 4550, doi: 10.1093/mnras/staa3931