Improving Photometric Redshift Estimates with Training Sample Augmentation

Irene Moskowitz; Eric Gawiser; John Franklin Crenshaw; Brett H. Andrews; Alex I. Malz; Samuel Schmidt; The LSST Dark Energy Science Collaboration

doi:10.3847/2041-8213/ad4039

1. Introduction

Understanding the nature of dark energy is a major open question in cosmology. Stage-IV dark energy experiments, such as the Vera C. Rubin Observatory's Legacy Survey of Space and Time (LSST, Ivezić et al. 2019), Euclid (Euclid Collaboration et al. 2022), and Roman (Akeson et al. 2019), are scheduled to come online in the coming years.

Imaging surveys will need to obtain redshifts to galaxies, but there will be too many for spectroscopic redshifts to be feasible. LSST alone is expected to observe billions of galaxies and will therefore rely on photometric redshifts (photo-z's). Photo-z's can be estimated through machine-learning algorithms, which learn to associate photometric quantities, such as colors and magnitudes, with a redshift estimate.

Previous Stage-III dark energy surveys have also used machine learning for estimating photometric redshifts. The Hyper Suprime-Cam Subaru Strategic Program (HSC-SSP; Aihara et al. 2017) used DNNz and DEMPz (Hsieh & Yee 2014) for the Year 3 cosmology results (Miyatake et al. 2023; Rau et al. 2023; Sugiyama et al. 2023). Both DNNz and DEMPz are conditional density estimators. The Dark Energy Survey (DES; Abbott et al. 2018) has used a self-organizing map (SOMPZ; Myles et al. 2021) for estimating photo-z's. The Kilo-Degree Survey (KiDS; Heymans et al. 2021) has also used self-organizing maps for photo-z estimation (Hildebrandt et al. 2021).

Machine-learning methods require a training sample of galaxies with both photometry and spectroscopic redshifts. It is well known that machine-learning methods trained on nonrepresentative training data perform worse than when trained on representative training sets (see, e.g., Beck et al. 2017 for a general evaluation of photo-z quality when nonrepresentative training samples are used). Stylianou et al. (2022) also demonstrate the effect of some simplistic forms of training sample incompleteness on specific machine-learning methods. However, existing spectroscopic samples are biased toward brighter, redder galaxies than LSST will observe in general, and these also tend to be at lower redshift than the typical LSST galaxy. This means that training samples for photo-z estimation will not be representative of LSST data, leading to poor photo-z estimation for galaxies with photometry not represented in the training sample. The Dark Energy Spectroscopic Instrument (DESI; Flaugher & Bebek 2014), along with spectroscopic redshifts from Euclid, Roman, and 4MOST (de Jong et al. 2019), will alleviate this issue to an extent, but the DESI survey will not be as deep as LSST, so additional spectroscopic redshifts from DESI cannot solve the problem alone. We will need methods to improve the redshift estimation that do not involve obtaining more spectroscopic redshifts.

One method for improving training samples without obtaining more spectroscopic redshifts is through data augmentation, which is the process of modifying a training sample in some way to increase the generality of a machine-learning model (Shorten & Khoshgoftaar 2019). Data augmentation can be done by transforming existing training sample data in some way, such as through rotations or deformations in the case of image recognition (Bloice et al. 2017), or by generating synthetic data for the training sample (Bird et al. 2021). Broussard & Gawiser (2021) used this synthetic data generation method for augmentation to estimate photo-z's.

In this Letter, we investigate a slightly different method of augmenting the training sample by adding galaxies from simulated catalogs to our training sample. By selecting simulated galaxies with photometry and/or redshifts not otherwise represented in the training sample, this training sample augmentation can expand the range of feature space capable of producing good photo-z estimates, provided the simulated catalog used for augmentation has reasonable colors. If the simulated catalog is too unrealistic, this will only create confusion in our model.

Section 2 describes our simulated data, including our stand-in for real LSST data and the simulated catalog used for augmenting the training sample. Section 3 describes our methodology, including how a realistically nonrepresentative training sample is created, how we estimate photo-z's, and the process for augmenting the training sample. Section 4 discusses our results, and Section 5 concludes.

2. Simulated Data

2.1. DC2

The LSST Dark Energy Science Collaboration (DESC) Data Challenge 2 catalog (DC2; Abolfathi et al. 2021) is a 300 deg² area of simulated LSST observations. The base input for DC2 is the CosmoDC2 galaxy catalog (Korytov et al. 2019), which is derived from the Outer Rim N-body simulation (Heitmann et al. 2019). Galaxies were assigned to halos using UniverseMachine (Behroozi et al. 2019) and GalSampler (Hearin et al. 2020). Complete galaxy properties are generated with Galacticus (Benson 2012). Galaxy spectra are constructed from stellar population spectra computed with fsps (Conroy et al. 2009).

To generate the DC2 catalog, stars, supernovae, strong lenses, and active galactic nuclei are added to the CosmoDC2 catalog. The object catalogs are passed as inputs to the image simulation software imSim to generate LSST-like images, which are then processed by the LSST Science Pipelines.

From the DC2 catalog, we select objects with magnitude i > 17. We require signal-to-noise ratio (S/N) >6 in the i band, as well as S/N >3 in at least one other band. To minimize contamination from stars, we select only extended objects. This extendedness cut does not entirely eliminate stars, but the stellar contamination is low, and the training sample selection process (described in Section 3.1) ends up placing all stars in the application sample.
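The selection cuts above can be sketched as a boolean mask. This is a minimal illustration, not the pipeline code used in this work: the function name, the dictionary of per-band S/N arrays, and the convention that an extendedness flag of 1 marks an extended object are all assumptions made for the example.

```python
import numpy as np

def select_objects(mag_i, snr, extendedness):
    """Boolean mask for the cuts described in the text.

    mag_i        : i-band magnitudes
    snr          : dict mapping band name -> S/N array (hypothetical layout)
    extendedness : flag array, assumed to be 1 for extended objects
    """
    mag_i = np.asarray(mag_i, dtype=float)
    snr_i = np.asarray(snr["i"], dtype=float)
    # S/N in every band other than i, stacked as (n_bands - 1, n_objects)
    others = np.vstack([np.asarray(snr[b], dtype=float) for b in snr if b != "i"])
    has_other_detection = (others > 3).sum(axis=0) >= 1
    return (
        (mag_i > 17)
        & (snr_i > 6)
        & has_other_detection
        & (np.asarray(extendedness) == 1)
    )
```

Missing measurements stored as NaN fail every comparison, so such objects are dropped by this mask.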

The base, unaugmented training sample and the application sample of galaxies, for which we estimate redshifts, are formed from our selected DC2 objects. In this work, DC2 is a stand-in for real data. We use the term "application sample" to refer to the DC2 stand-in for what would be unlabeled LSST data. While we do have true redshifts for this sample, and it functions as a testing set in this analysis, we keep the terminology to identify our application sample with eventual LSST data.

2.2. Buzzard

The Buzzard simulation (DeRose et al. 2019) was built from the L-GADGET2 dark matter simulations (Springel 2005). Galaxies are assigned to halos with AddGals (Wechsler et al. 2022) via the abundance matching technique. Galaxy spectral energy distributions (SEDs) were assigned to match the measured SED–luminosity–density relationship in Sloan Digital Sky Survey data.

We use the Buzzard catalog selected for the DESC Tomographic Challenge (Zuntz et al. 2021). Details on selection cuts, postprocessing and uncertainty generation can be found in that paper.

The Buzzard method of assigning spectra is completely independent from fsps, so Buzzard SEDs should be sufficiently different from DC2 SEDs to simulate adding simulated galaxies to training samples of real galaxies. In this work, we use the Buzzard catalog as a simulated catalog with which to augment the DC2 training sample.

3. Methodology

3.1. Nonrepresentative Training Sample

Existing spectroscopic galaxy samples are brighter and redder than expected LSST observations, and also tend to be at lower redshifts. To partition our DC2 catalog into a realistically nonrepresentative training sample and application sample, we use the GridSelection degrader in the DESC RAIL software (LSST-DESC RAIL developer team et al. 2023). We briefly summarize the GridSelection degrader below. A more detailed discussion can be found in Moskowitz et al. (2023).

The GridSelection degrader is based on the second data release of HSC-SSP (Aihara et al. 2019). Galaxies with similar photometry to early LSST observations are selected from the HSC Wide catalog; some of these galaxies have photometry only, and some have matched spectroscopic redshifts. The ranges in i-band magnitude and (g-z) color are divided into 100 × 100 pixels. Within each pixel, the ratio of the number of galaxies with spectroscopic redshifts to the total number of galaxies is computed, along with the 99th percentile in spectroscopic redshift, denoted z_max. The GridSelection degrader divides our DC2 galaxies into the same set of pixels in i versus (g-z) and automatically assigns DC2 objects with z_true > z_max to the application sample. From the remaining DC2 objects, the GridSelection degrader randomly selects objects for the training sample such that the ratio of DC2 training objects to total objects in a pixel matches the ratio from HSC.
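The pixel-based partition summarized above can be sketched in a few lines of NumPy. This is an illustrative reimplementation under stated assumptions, not the RAIL GridSelection code: the function name and argument layout are hypothetical, objects falling outside the grid are sent to the application sample, and the per-pixel training count is capped at the number of objects that pass the z_max cut.

```python
import numpy as np

def grid_select(i_mag, gz, z_true, ratio, z_max, i_edges, gz_edges, seed=0):
    """Return a boolean mask that is True for training-sample objects.

    ratio, z_max : 2D arrays indexed as [i-magnitude pixel, (g-z) pixel]
    i_edges, gz_edges : increasing pixel edges along each axis
    """
    rng = np.random.default_rng(seed)
    i_mag, gz, z_true = (np.asarray(a, dtype=float) for a in (i_mag, gz, z_true))
    n_i, n_gz = ratio.shape
    ix = np.digitize(i_mag, i_edges) - 1
    iy = np.digitize(gz, gz_edges) - 1
    inside = (ix >= 0) & (ix < n_i) & (iy >= 0) & (iy < n_gz)
    pix = np.where(inside, ix * n_gz + iy, -1)  # flat pixel id, -1 if off-grid
    train = np.zeros(i_mag.size, dtype=bool)
    for p in np.unique(pix[inside]):
        members = np.flatnonzero(pix == p)
        px, py = divmod(int(p), n_gz)
        # objects above the pixel's z_max always go to the application sample
        eligible = members[z_true[members] <= z_max[px, py]]
        # target: training/total in this pixel matches the reference ratio
        n_target = int(round(float(ratio[px, py]) * members.size))
        n_pick = min(n_target, eligible.size)
        if n_pick > 0:
            train[rng.choice(eligible, size=n_pick, replace=False)] = True
    return train
```

The application sample is then simply the complement of the returned mask.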

After partitioning the full DC2 sample into training and application samples, the training sample contains 186,837 galaxies, while the application sample contains 5,520,458 objects. The left and center panels of Figure 1 show the resulting DC2 training and application samples, where it is clear that the training sample is redder and brighter than the majority of the application sample. Figure 2 shows the (normalized) redshift distributions of both samples, as well as the distributions of the Buzzard sample and the best-performing augmentation choice. The training sample is biased toward lower redshifts than the application sample as a whole. The right panel of Figure 1 shows the Buzzard sample that will be used for augmentation. See Section 3.3 for more details.

**Figure 1.** The results of partitioning our DC2 catalog into training (left) and application (center) samples. The training sample is redder and brighter than the bulk of the application sample. The right panel shows the Buzzard sample used for augmenting the training sample. The horizontal dotted line shows the i-band selection criterion, while the vertical dotted line shows the (g-z) color criterion. The dashed line indicates the selection criterion for color+magnitude augmentation, which generally matches the shape of the DC2 training sample in the left panel. Open arrows indicate which regions of color–magnitude space are used for single-feature augmentation, while solid arrows indicate regions used for color+magnitude augmentation. See Section 3.3 for more details on augmentation criteria.

**Figure 2.** Normalized redshift distributions of the DC2 training sample (blue solid line), DC2 application sample (orange dashed line), and Buzzard sample (green dotted line). The DC2 training sample is biased to lower redshifts than the application sample. The vertical dashed line indicates the selection criterion for redshift augmentation, with the arrow indicating the region of redshift space used for augmentation. The black dotted–dashed line shows the redshift distribution of the best-performing, postaugmentation training sample shown in the top right panel of Figure 4.

3.2. Photo-z Estimation

Schmidt et al. (2020) tested 12 photo-z estimation codes, albeit using representative training data, and recommended FlexZBoost (Izbicki & Lee 2017; Dalmasso et al. 2020) as an appropriate estimator. Therefore, to estimate photo-z's, we use FlexZBoost as implemented in RAIL. FlexZBoost is a nonparametric conditional density estimator for redshifts. It takes as inputs the magnitudes and errors in each of the bands ugrizy and outputs a photo-z probability density function (pdf) for each object in the application sample.

To evaluate the quality of a set of photo-z estimates, we use the outlier fraction, catastrophic outlier fraction, normalized median absolute deviation (NMAD), and bias. We define an outlier as ∣z_true − z_phot∣/(1 + z_true) > 0.15, while a catastrophic outlier is defined as ∣z_true − z_phot∣ > 1.0. The NMAD is given by

$$1.4826\times \mathrm{Med}\left[\left|\frac{\Delta z}{1+z_{\mathrm{true}}}-\mathrm{Med}\left(\frac{\Delta z}{1+z_{\mathrm{true}}}\right)\right|\right], \tag{1}$$

where the bias is given by the median of Δz/(1 + z_true). Although the pdf contains a wealth of useful information that can be used to quantify photo-z quality, such as the 3σ outlier fraction (see, e.g., Jones et al. 2024), cosmological analyses typically involve assigning galaxies to tomographic redshift bins. Since galaxies can only be assigned to one redshift bin, little information is lost by compressing the pdf into a single photo-z point estimate used to assign the galaxy to a bin. We use the mean of the pdf as a point estimate for each photo-z.
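As a concrete illustration of these definitions, the sketch below computes the four metrics from point estimates and compresses a gridded pdf to its mean. The function names are hypothetical, the sign convention Δz = z_phot − z_true is an assumption (the text does not state it), and the pdf is assumed to be tabulated on a uniformly spaced redshift grid.

```python
import numpy as np

def pdf_mean(z_grid, pdf):
    """Mean of each pdf tabulated on a uniform grid; pdf has shape (n_obj, n_grid)."""
    z_grid = np.asarray(z_grid, dtype=float)
    pdf = np.atleast_2d(np.asarray(pdf, dtype=float))
    # On a uniform grid the bin width cancels between numerator and normalization.
    return (pdf * z_grid).sum(axis=1) / pdf.sum(axis=1)

def photoz_metrics(z_true, z_phot):
    """Outlier fraction, catastrophic outlier fraction, NMAD, and bias."""
    z_true = np.asarray(z_true, dtype=float)
    z_phot = np.asarray(z_phot, dtype=float)
    dz = z_phot - z_true            # assumed sign convention
    scaled = dz / (1.0 + z_true)
    bias = np.median(scaled)
    return {
        "outlier_fraction": np.mean(np.abs(scaled) > 0.15),
        "catastrophic_fraction": np.mean(np.abs(dz) > 1.0),
        "nmad": 1.4826 * np.median(np.abs(scaled - bias)),
        "bias": bias,
    }
```

The outlier fractions and NMAD do not depend on the assumed sign of Δz; only the sign of the bias does.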

The photo-z's estimated from the base, nonrepresentative DC2 training sample are shown in the top panel of Figure 3. The outlier fraction is quite high at nearly 50%. In particular, the majority of galaxies with z_true ≳ 1.0 have outlier z_phot estimates. This is due to the fact that our DC2 training sample has very few objects with z > 1.0. For comparison, the bottom panel of Figure 3 shows the results for a fully representative DC2 training sample, which obtains an outlier fraction of 0.14. This represents the best we can expect to do using FlexZBoost. As shown in Figure 3, the unaugmented, nonrepresentative training sample produces much worse photo-z's than the representative training sample, particularly at z_true > 1.0.

**Figure 3.** Top: photo-z's estimated from the realistic, nonrepresentative DC2 training sample shown in Figure 1. Solid black lines indicate the boundary for outliers. Bottom: photo-z's estimated from a fully representative training sample drawn randomly from the DC2 application sample in Figure 1.

3.3. Augmentation

We augment the training sample by adding 10,000 Buzzard galaxies according to a set of criteria, taking care to only use knowledge about the application sample that would be available for real data. The simplest criterion for augmentation is to select Buzzard galaxies with higher redshifts than those present in the DC2 training sample. We refer to this as redshift augmentation, and make the selection z_buzzard > 1.0. This region is indicated by the vertical dashed line and arrow in Figure 2.

Since DC2 application galaxies are also dimmer and bluer than the training sample, we also choose magnitude and color selection criteria, which we call magnitude augmentation and color augmentation, respectively. For magnitude augmentation we make the selection i_buzzard > 23, and for color augmentation we choose (g − z)_buzzard < 1.75. These boundaries were chosen to match where the magnitude and (g-z) color distributions in the training sample start to decline. They are indicated by the dotted lines and open arrows in the right panel of Figure 1. We test augmentation with each of the features individually, as well as in combination with each other.
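The threshold selections above can be sketched as boolean masks followed by a random draw of 10,000 galaxies. This is a minimal sketch with hypothetical function names; it covers only the simple threshold cuts, and the `combine` switch between intersection and union of the masks is an illustrative parameter rather than a description of the exact procedure used here.

```python
import numpy as np

def augmentation_mask(z, i_mag, g_mag, z_mag, features=("z",), combine="and"):
    """Mask of simulated galaxies passing the chosen threshold cuts.

    features : any of "z" (z > 1.0), "mag" (i > 23), "col" ((g - z) < 1.75)
    combine  : "and" for the intersection of the cuts, "or" for the union
    """
    z, i_mag, g_mag, z_mag = (np.asarray(a, dtype=float) for a in (z, i_mag, g_mag, z_mag))
    masks = {
        "z": z > 1.0,
        "mag": i_mag > 23.0,
        "col": (g_mag - z_mag) < 1.75,
    }
    if combine not in ("and", "or"):
        raise ValueError("combine must be 'and' or 'or'")
    op = np.logical_and if combine == "and" else np.logical_or
    return op.reduce([masks[f] for f in features])

def draw_augmentation(mask, n_draw=10_000, seed=0):
    """Indices of up to n_draw randomly chosen galaxies that pass the mask."""
    rng = np.random.default_rng(seed)
    pool = np.flatnonzero(mask)
    return rng.choice(pool, size=min(n_draw, pool.size), replace=False)
```

The returned indices would then be used to append the selected simulated galaxies, with their photometry and redshifts, to the training sample.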

3.4. Photometry Shifts

Since DC2 and Buzzard rely on different methods for determining galaxy SEDs, the colors as a function of redshift are different between the two simulations. If this augmentation method were used for real data, it would be advantageous to shift the simulated photometry to look like the real photometry in the application sample. Therefore, we also attempt to match the Buzzard photometry to the DC2 application sample. This modifies the color–redshift relationship in Buzzard to potentially more closely resemble the color–redshift relationship of DC2. Since we do not use the true redshifts of the application sample, this represents something we could do with real data.

3.4.1. Magnitude Shifts

The simplest way to transform Buzzard colors is to apply a single shift to the Buzzard magnitudes to make their median match the median of the DC2 application sample magnitudes in each band. We will refer to this sample as the "magnitude-shifted Buzzard" sample.

In addition to the medians, we can also rescale the NMADs to match in each band. This is a proxy for matching the first and second moments of the photometry distributions. To shift the NMADs, we apply the following transformation:

$$\mathrm{mag}_{j,\mathrm{new}}=\frac{\mathrm{NMAD}_{j,\mathrm{DC2}}}{\mathrm{NMAD}_{j}}\times \left[\mathrm{mag}_{j}-\mathrm{med}_{j}\right]+\mathrm{med}_{j}, \tag{2}$$

where NMAD_j refers to the NMAD in band j, med_j is the median magnitude in band j, and all quantities are for Buzzard unless indicated by the DC2 subscript. We will refer to this sample as the "NMAD-shifted Buzzard" sample.
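Both shifts can be sketched per band as follows. This is an illustrative sketch with hypothetical function names; it assumes the NMAD of a magnitude distribution means 1.4826 times the median absolute deviation of the magnitudes about their median, which the text does not spell out.

```python
import numpy as np

def mag_nmad(mag):
    """1.4826 x median absolute deviation of a magnitude distribution (assumed definition)."""
    mag = np.asarray(mag, dtype=float)
    return 1.4826 * np.median(np.abs(mag - np.median(mag)))

def shift_to_median(mag, mag_ref):
    """Add a single offset so the median of mag matches the median of mag_ref."""
    mag = np.asarray(mag, dtype=float)
    return mag + (np.median(mag_ref) - np.median(mag))

def rescale_nmad(mag, mag_ref):
    """Equation (2): rescale the spread about the sample's own median."""
    mag = np.asarray(mag, dtype=float)
    med = np.median(mag)
    return mag_nmad(mag_ref) / mag_nmad(mag) * (mag - med) + med
```

Because Equation (2) leaves the median unchanged, applying `shift_to_median` first and `rescale_nmad` second matches both the median and the NMAD of the reference band.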

3.4.2. Normalizing Flows

The simple shift method is able to match the medians and NMADs of the Buzzard and DC2 color distributions, but not the shapes of the distributions. To attempt to more fully match the color distributions, we use normalizing flows to produce a catalog of DC2-like photometry with Buzzard-like redshifts.

We use the PZFlow package (Crenshaw et al. 2023) as implemented in RAIL for training the normalizing flows. We train two flows: one on DC2 photometry, and one on Buzzard photometry. The DC2 flow learns the probability density function of the DC2 photometry, p(photometry), while the Buzzard flow is a conditional flow that learns the probability density function of the redshift given the photometry, p(z∣ photometry). The features used for training are i-band magnitudes and (u-g), (g-r), (r-i), (i-z), and (z-y) colors. We train 100 epochs for the DC2 flow, and 150 epochs for the Buzzard flow.

Once the flows are trained, we sample from the DC2 flow to make a new catalog of galaxies with DC2-like photometry. We then sample from the Buzzard flow, using the new DC2-like photometry as conditions, to generate Buzzard-like redshifts for our DC2-like photometry. Finally, we use the RAIL LSSTErrorModel degrader to generate LSST-like errors on the magnitudes. This set of DC2-like photometry and Buzzard-like redshifts constitutes our flowed catalog from which we draw galaxies for augmentation. We will refer to this sample as the "flowed Buzzard" sample.
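The two-stage sampling logic can be illustrated without a normalizing flow. The sketch below deliberately swaps in much simpler stand-ins, which are assumptions for illustration only and not the PZFlow models used in this work: a multivariate Gaussian plays the role of p(photometry), and a linear regression with Gaussian scatter plays the role of p(z∣photometry). All class and function names are hypothetical, and the photometric error step is omitted.

```python
import numpy as np

class GaussianDensity:
    """Stand-in for the unconditional flow: a multivariate Gaussian p(photometry)."""
    def fit(self, phot):
        self.mean = phot.mean(axis=0)
        self.cov = np.cov(phot, rowvar=False)
        return self

    def sample(self, n, rng):
        return rng.multivariate_normal(self.mean, self.cov, size=n)

class LinearGaussianConditional:
    """Stand-in for the conditional flow: z = linear(photometry) + Gaussian noise."""
    def fit(self, phot, z):
        design = np.column_stack([phot, np.ones(len(phot))])
        self.coef, *_ = np.linalg.lstsq(design, z, rcond=None)
        self.sigma = np.std(z - design @ self.coef)
        return self

    def sample(self, phot, rng):
        design = np.column_stack([phot, np.ones(len(phot))])
        return design @ self.coef + rng.normal(0.0, self.sigma, size=len(phot))

def make_flowed_catalog(dc2_phot, buz_phot, buz_z, n, seed=0):
    """Draw DC2-like photometry, then Buzzard-like redshifts conditioned on it."""
    rng = np.random.default_rng(seed)
    phot = GaussianDensity().fit(np.asarray(dc2_phot, float)).sample(n, rng)
    cond = LinearGaussianConditional().fit(np.asarray(buz_phot, float), np.asarray(buz_z, float))
    return phot, cond.sample(phot, rng)
```

The structure mirrors the procedure in the text: one model is fit to the target photometry, a second is fit to the redshift given photometry in the simulation, and the second is sampled using draws from the first as conditions. These simple stand-ins cannot capture non-Gaussian shapes and can produce unphysical (negative) redshifts, which is part of why a flow is the more appropriate tool.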

4. Results

Table 1 summarizes the outlier fractions and NMADs achieved for each combination of augmentation features and each Buzzard sample (unshifted, magnitude-shifted, and flowed), as well as those achieved for the unaugmented training sample and a fully representative training sample. Every kind of augmentation we tested improved the outlier fraction and NMAD of the resulting photo-z's. Augmentations involving redshift selections performed better than those without. The magnitude-shifted Buzzard sample generally produced better results than the unshifted or flowed Buzzard samples; however, in the case of selecting galaxies for augmentation using only a single feature, the unshifted Buzzard catalog produced the best results.

Table 1. Summary of Outlier Fractions, NMADs, and Bias Achieved for All Color-shifting and Augmentation Cases

Unaugmented Samples	Outlier Fraction	NMAD	Bias
Representative	0.141(0.014)	0.057	0.0001
Nonrepresentative	0.480(0.21)	0.190	−0.12

Augmented Samples	Unshifted Buzzard			Magnitude-shifted Buzzard			Flowed Buzzard
	Outlier Fraction	NMAD	Bias	Outlier Fraction	NMAD	Bias	Outlier Fraction	NMAD	Bias
z (z_buz > 1.0)	0.261(0.040)	0.087	−0.004	0.263(0.046)	0.088	−0.014	0.292(0.045)	0.094	0.003
Mag (i_buz > 23)	0.318(0.12)	0.097	−0.030	0.407(0.17)	0.138	−0.069	0.327(0.045)	0.107	−0.039
Col ((g − z)_buz < 1.75)	0.324(0.12)	0.099	−0.031	0.401(0.17)	0.134	−0.066	0.319(0.069)	0.102	−0.035
Mag+z	0.268(0.037)	0.090	−0.001	0.259(0.047)	0.086	−0.015	0.293(0.047)	0.096	0.001
Col+z	0.271(0.037)	0.090	−0.005	0.258(0.045)	0.086	−0.016	0.286(0.046)	0.093	0.0004
Col+Mag	0.311(0.12)	0.093	−0.028	0.400(0.16)	0.134	−0.067	0.327(0.066)	0.107	−0.040
Col+Mag+z	0.268(0.039)	0.089	−0.002	**0.245(0.037)**	**0.084**	**−0.014**	0.284(0.043)	0.092	0.002

Note. We have abbreviated redshift augmentation as "z," magnitude augmentation as "Mag," and color augmentation as "Col." Values in parentheses in the outlier fraction columns are the catastrophic outlier fractions. The bolded values correspond to the best-performing augmentation case.


Adding the NMAD shift to the magnitude shift did not hurt, but showed no improvement over the simple magnitude shift, so we only show results for the magnitude-shifted sample. We also tried combining the magnitude-shifted Buzzard sample with the normalizing flow method, but the standard color flow worked better.

Since FlexZBoost also takes in photometric errors, we tested the effect of changing the errors. We tested multiplying the errors by factors of 0.1, 2.0, and 2.0 × (1 + z). Variations in the outlier fractions and NMADs were smaller than the variations between augmentation cases.

Each method of augmentation produces an outlier fraction and NMAD for the resulting photo-z's. In addition to the raw metrics, we also report the ratio of the augmented results to the unaugmented results. Since the best-case scenario is the fully representative training case, not an outlier fraction/NMAD of 0, we also report a percent recovery toward the outlier fraction/NMAD achieved in the representative case. The percent recovery is calculated as (X_unaug − X_aug)/(X_unaug − X_rep), where X is either the outlier fraction or NMAD, the subscript "aug" refers to the statistic for the augmented training sample, the subscript "unaug" refers to the unaugmented training sample, and the subscript "rep" refers to the representative training sample.
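As a worked example of these two summary statistics, the sketch below evaluates them for the color+magnitude+redshift case with the magnitude-shifted Buzzard sample, using the values listed in Table 1. The function names are hypothetical.

```python
def ratio_to_unaugmented(x_aug, x_unaug):
    """Ratio of the augmented statistic to the unaugmented one."""
    return x_aug / x_unaug

def percent_recovery(x_aug, x_unaug, x_rep):
    """Percent of the gap between unaugmented and representative cases that is recovered."""
    return 100.0 * (x_unaug - x_aug) / (x_unaug - x_rep)

# Table 1 values: (augmented, unaugmented nonrepresentative, representative)
outlier = (0.245, 0.480, 0.141)
nmad = (0.084, 0.190, 0.057)

print(round(ratio_to_unaugmented(*outlier[:2]), 2))  # 0.51
print(round(percent_recovery(*outlier)))             # 69
print(round(ratio_to_unaugmented(*nmad[:2]), 2))     # 0.44
print(round(percent_recovery(*nmad)))                # 80
```

A recovery of 100% would mean the augmented training sample performs as well as the fully representative one, while 0% would mean no improvement over the unaugmented case.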

The following subsections discuss the results for single-feature, double-feature, and triple-feature augmentations. Table 1 summarizes the results.

4.1. Augmentation with Individual Features

When augmenting with a single feature, we choose Buzzard galaxies with either z_true > 1.0, i-mag >23 or (g − z) < 1.75. When using a single feature for augmentation, the base, unshifted Buzzard catalog produced the lowest outlier fractions and NMADs.

Of the three features, redshift augmentation produces the best results when a single feature is used for augmentation, while color performs the worst. The redshift-augmented training sample, using the unshifted Buzzard catalog, and the resulting photo-z estimates for the redshift augmentation are shown in the left panel of Figure 4. Results for color and magnitude augmentation, as well as for the shifted and flowed Buzzard catalogs, are listed in Table 1.

**Figure 4.** The best-case photo-z estimation for single-, double-, and triple-feature augmentation (bottom row), with the corresponding augmented training samples (top row). Left: when only a single feature is used for augmentation, redshift augmentation produces the best photo-z estimates. The training sample was augmented with the unshifted Buzzard catalog, which in this case produced better photo-z statistics than the magnitude-shifted or flowed Buzzard catalog. Center: when a double-feature combination is used for selecting galaxies for augmentation, the combination of color+redshift produces the best results. The best results for this case came from using the magnitude-shifted Buzzard catalog. Right: the best photo-z estimates were produced when all three features are combined to select galaxies for augmentation. The best results for this case came from using the magnitude-shifted Buzzard catalog.

4.2. Augmentation with Double-feature Combinations

There are three double-feature combinations possible: magnitude+redshift, color+redshift, and color+magnitude. Since the training sample creates a compact shape in color–magnitude space, we fit two lines to mimic the shape of the top of the color–magnitude distribution of the training sample. We then choose objects above and to the left of this region; see the dashed line and solid arrows in the right panel of Figure 1.

The training sample does not have a simple shape in either color–redshift or magnitude–redshift space. For these feature combinations, we use the intersection of the selection requirements for each feature rather than the union; for example, the magnitude+redshift augmentation selects Buzzard galaxies with both i-mag >23 and z_true > 1.0. The intersection worked mildly better than the union.

When using multiple features to select Buzzard galaxies for augmentation, the magnitude-shifted Buzzard sample produced the best results. Results for the unshifted and flowed Buzzard catalogs are also listed in Table 1.

Similar to the single-feature augmentation, the double-feature combinations that include redshift as a selection criterion perform better than combinations without redshift. Magnitude+redshift and color+redshift combinations perform virtually identically for the magnitude-shifted Buzzard catalog, with outlier fractions of 0.259 and 0.258, respectively. This corresponds to a ratio to the unaugmented results of 0.54, and a 65% recovery of the fully representative case. Both produce an NMAD of 0.086, which corresponds to a ratio to the unaugmented results of 0.452 and a 78% recovery toward the fully representative case. The color+redshift-augmented training sample, using the magnitude-shifted Buzzard catalog, and the resulting photo-z estimates are shown in the middle panel of Figure 4.

When compared to the outlier fraction for the redshift augmentation using the magnitude-shifted catalog (see Table 1), adding color or magnitude provides a small improvement. The color+magnitude combination also provides a small improvement over either color or magnitude alone in most cases.

4.3. Color+Magnitude+Redshift Augmentation

Finally, we use all three features to select Buzzard galaxies for augmentation. We use the intersection of the color+magnitude and redshift selection criteria as this provides better results than the union. This is likely because we always augment the training sample with 10,000 galaxies; in this case, the intersection most efficiently probes the feature space not covered by the DC2 training sample. The color+magnitude+redshift-augmented training sample, using the magnitude-shifted Buzzard catalog, is shown in the top right of Figure 4, and the postaugmentation redshift distribution is shown as the black dotted–dashed line in Figure 2.

As with the double-feature augmentation cases, the simple magnitude-shifted Buzzard catalog produced the best results, and we show that case in the right panel of Figure 4.

This augmentation case produces the lowest outlier fraction and NMAD out of all combinations of color-shifting and augmentation features. At an outlier fraction of 0.245, which is a ratio of 0.51 to the unaugmented case, this augmentation case recovers 69% of the degradation resulting from the nonrepresentative training sample. We achieve an NMAD of 0.084, a ratio of 0.44 to the unaugmented case, and an 80% recovery of the degradation in NMAD compared to the fully representative training sample. We also achieve 8 times less bias with this augmentation than in the unaugmented case (see Table 1).

The augmented training samples that use the magnitude-shifted Buzzard catalog have i-band magnitudes that extend fainter than the application sample. We tried imposing an i < 26 cut on the magnitude-shifted Buzzard catalog before selecting galaxies for augmentation, but this produced slightly worse photo-z estimates. We suspect this is because FlexZBoost is a conditional density estimator, and a hard cutoff in the magnitudes makes it difficult for FlexZBoost to estimate the density at the cutoff magnitude. Allowing the magnitudes to drop off naturally makes it easier for FlexZBoost to estimate the density at i = 26, resulting in better photo-z estimates even though the magnitude range extends farther than required to match the application sample.

4.4. Comparison to TPZ

To test whether the results of augmentation depend strongly on the machine-learning method used, we also estimated photo-z's using the RAIL implementation of the code Trees for Photo-Z (TPZ; Carrasco Kind & Brunner 2013), which uses a random forest method. TPZ produces similar results when no training sample augmentation is performed: the photo-z statistics (outlier fraction, catastrophic outlier fraction, NMAD, and bias) for the unaugmented TPZ case are (0.51, 0.20, 0.22, −0.13), compared to (0.48, 0.21, 0.19, −0.12) for the unaugmented FlexZBoost case. The best-performing augmentation case, using the magnitude-shifted Buzzard sample and augmenting with color+magnitude+redshift, produced photo-z statistics of (0.32, 0.04, 0.11, −0.014) for TPZ. This improvement over the unaugmented case is smaller than that obtained with FlexZBoost but is still highly significant (see Table 1).

5. Conclusions

Large imaging surveys, such as LSST, will not have access to representative training samples for estimating photometric redshifts. When estimating photo-z's using a realistically nonrepresentative training sample, the outlier fraction reaches nearly 50%, almost a factor of 3.5 worse than when using a representative training sample, with a similar increase in scatter as probed by the NMAD. Obtaining new spectroscopic samples of dim galaxies cannot solve this problem alone, as it is not feasible to obtain a large enough sample by the time LSST is expected to see first light. Training sample augmentation is an easy way to improve photo-z estimates without requiring additional spectroscopic samples of dim galaxies.

We used the DESC DC2 simulation as a stand-in for eventual LSST data, and investigated how augmenting a realistically nonrepresentative training sample with simulated galaxies from Buzzard can improve the photo-z estimates. Even a relatively simple augmentation process of selecting simulated galaxies with redshifts higher than those present in the training sample can recover 65% of the degradation in the outlier fraction when compared to a fully representative training sample.

Shifting the photometry of the galaxy catalog used for augmenting the training sample can improve the results further. We shifted all Buzzard magnitudes so the median magnitude in each band matches the median magnitudes of the DC2 application sample. We then select galaxies for augmentation in regions of color–magnitude–redshift space not covered by the training sample. The resulting photo-z estimates, shown in the right panel of Figure 4 and Table 1, have an outlier fraction below 25%, NMAD of 0.084, and bias of −0.014, representing a nearly 50% reduction over the unaugmented photo-z outlier fraction, 56% reduction in NMAD, and factor of 8 reduction in bias. Augmentation has recovered 69% of the degradation in outlier fraction compared to the fully representative case, 80% of the degradation in NMAD, and 88% of the degradation in bias. With these results in mind, it is clear that training sample augmentation should be considered in the photo-z pipeline for large galaxy surveys, including LSST.

Since redshift seems to be the most important feature for augmentation, the Buzzard catalog is not the best-case scenario for augmentation given how few galaxies there are at z > 1.0. There are almost no Buzzard objects at z > 1.5, while there are still many application sample objects in that redshift range. When using augmentation for real data, choosing a simulation with sufficient redshift coverage, such as the DC2 simulation, should produce better results than were achieved here.

While we leave extensions of this method to real data to future work, we discuss below a few possible avenues to explore training sample augmentation for real data. One could, for example, use the deep photo-z catalogs from COSMOS2020 (Weaver et al. 2022) as a truth catalog, where the COSMOS2020 photo-z catalog would take the place of the DC2 catalog used in this work. It could also be worthwhile to explore using a deep photo-z catalog such as COSMOS2020 as the augmentation catalog, in place of the Buzzard catalog.

Self-organizing maps (SOMs) have also been used to correct for spectroscopic incompleteness of training samples in DES (see, for example, Buchs et al. 2019; Myles et al. 2021) and KiDS (van den Busch et al. 2022). Assigning training and application samples to SOM cells can be useful for identifying underrepresented regions of photometry in the training sample that can be targeted for spectroscopic follow-up (Masters et al. 2015). Future efforts could use the SOM methodology to identify photometry regions to target with augmented training samples.
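As a rough illustration of how SOM cell assignments could flag photometry regions for augmentation, the sketch below compares the fraction of training and application galaxies in each cell. It assumes the cell indices have already been produced by some SOM implementation; the function `underrepresented_cells`, the `min_ratio` threshold, and the toy assignments are all hypothetical and are not taken from the cited works.

```python
import numpy as np


def underrepresented_cells(train_cells, app_cells, n_cells, min_ratio=0.5):
    """Return indices of SOM cells that the training sample undersamples.

    train_cells, app_cells: integer cell index for each galaxy.
    A cell is flagged if it contains application galaxies and its share of
    the training sample is below min_ratio times its share of the
    application sample.
    """
    train_frac = np.bincount(train_cells, minlength=n_cells) / len(train_cells)
    app_frac = np.bincount(app_cells, minlength=n_cells) / len(app_cells)
    occupied = app_frac > 0
    flagged = occupied & (train_frac < min_ratio * app_frac)
    return np.flatnonzero(flagged)


# Toy assignments on a 4-cell map: cell 2 holds application galaxies
# but no training galaxies, and cell 3 is empty in both samples.
train_cells = np.array([0, 0, 0, 1])
app_cells = np.array([0, 1, 1, 2])
flagged = underrepresented_cells(train_cells, app_cells, n_cells=4)
```

Cells returned by such a comparison would be candidates for drawing augmentation galaxies, in the same spirit as the color–magnitude–redshift selection used in this work.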

We have assumed in this work that the outlier fraction and NMAD are good indicators of photo-z quality. While this is a reasonable assumption, a full quantification of the improvements provided by augmentation would come from using the photo-z estimates to do a full cosmological parameter estimation. For augmentation to truly be useful, it should result in better cosmological parameter estimates than the unaugmented photo-z's. This analysis will be presented in a forthcoming paper.
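For reference, the summary statistics discussed here can be computed as in the sketch below. The definitions are common conventions and are assumptions on our part rather than a restatement of this paper's exact choices: the scaled residual is (z_phot − z_true)/(1 + z_true), the bias is taken as its median, the NMAD uses the usual 1.4826 scaling, and the outlier cut of 0.15 is a placeholder. The helper `recovered_fraction` encodes one plausible reading of "fraction of degradation recovered" and is likewise hypothetical.

```python
import numpy as np


def photoz_metrics(z_phot, z_true, outlier_cut=0.15):
    """Bias, NMAD, and outlier fraction of point photo-z estimates.

    All three are computed on dz = (z_phot - z_true) / (1 + z_true).
    The outlier_cut value is an assumed convention, not from the paper.
    """
    z_phot = np.asarray(z_phot, dtype=float)
    z_true = np.asarray(z_true, dtype=float)
    dz = (z_phot - z_true) / (1.0 + z_true)
    bias = float(np.median(dz))
    nmad = float(1.4826 * np.median(np.abs(dz - np.median(dz))))
    outlier_fraction = float(np.mean(np.abs(dz) > outlier_cut))
    return {"bias": bias, "nmad": nmad, "outlier_fraction": outlier_fraction}


def recovered_fraction(unaugmented, augmented, representative):
    """Share of the gap between the unaugmented and fully representative
    results that is closed by augmentation (assumed definition)."""
    return (unaugmented - augmented) / (unaugmented - representative)


# Toy example: four galaxies, one of which is a catastrophic outlier.
z_true = np.array([0.5, 1.0, 1.5, 2.0])
z_phot = np.array([0.5, 1.0, 1.5, 2.9])
metrics = photoz_metrics(z_phot, z_true)
```

These point-estimate statistics summarize photo-z quality but, as noted above, do not replace propagating the photo-z's through a full cosmological parameter estimation.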

Acknowledgments

This paper has undergone internal review by the LSST Dark Energy Science Collaboration. The internal reviewers were Huan Lin and Eve Kovacs. The authors thank the internal reviewers for their valuable comments.

I.M. and E.G. acknowledge support for this research from the LSST Corporation via grant No. 2021-42. I.M. and E.G. also acknowledge support from the U.S. Department of Energy, Office of Science, Office of High Energy Physics Cosmic Frontier Research program under Award Number DE-SC0010008. J.F.C. acknowledges support from the U.S. Department of Energy, Office of Science, Office of High Energy Physics Cosmic Frontier Research program under Award Number DE-SC0011665. B.H.A. acknowledges support by the National Science Foundation under Award Number AST-2009251. A.I.M. acknowledges the support of Schmidt Sciences.

Author contributions are as follows. I.M. performed the analysis and wrote the majority of the paper. E.G. advised I.M., suggested approaches, and provided feedback on the text. J.F.C. suggested and advised on methodology for the normalizing flow, and provided feedback on the manuscript. B.H.A. provided guidance on bias discussions and feedback on the manuscript. A.I.M. designed and led the development of the RAIL software. S.S. advised on the use of FlexZBoost and provided feedback on the manuscript.

The DESC acknowledges ongoing support from the Institut National de Physique Nucléaire et de Physique des Particules in France; the Science & Technology Facilities Council in the United Kingdom; and the Department of Energy, the National Science Foundation, and the LSST Corporation in the United States. DESC uses resources of the IN2P3 Computing Center (CC-IN2P3–Lyon/Villeurbanne—France) funded by the Centre National de la Recherche Scientifique; the National Energy Research Scientific Computing Center, a DOE Office of Science User Facility supported by the Office of Science of the U.S. Department of Energy under Contract No. DE-AC02-05CH11231; STFC DiRAC HPC Facilities, funded by UK BEIS National E-infrastructure capital grants; and the UK particle physics grid, supported by the GridPP Collaboration. This work was performed in part under DOE Contract DE-AC02-76SF00515.
