

Pangu-Weather: A 3D High-Resolution System for Fast and Accurate Global Weather Forecast

Kaifeng Bi, Lingxi Xie, Hengheng Zhang, Xin Chen, Xiaotao Gu, and Qi Tian, Fellow, IEEE

Abstract—In this paper, we present Pangu-Weather, a deep learning based system for fast and accurate global weather forecast. For
this purpose, we establish a data-driven environment by downloading 43 years of hourly global weather data from the 5th generation of
ECMWF reanalysis (ERA5) data and train a few deep neural networks with about 256 million parameters in total. The spatial resolution
of forecast is 0.25◦ × 0.25◦ , comparable to the ECMWF Integrated Forecast Systems (IFS). More importantly, for the first time, an
AI-based method outperforms state-of-the-art numerical weather prediction (NWP) methods in terms of accuracy (latitude-weighted
RMSE and ACC) of all factors (e.g., geopotential, specific humidity, wind speed, temperature, etc.) and in all time ranges (from one
hour to one week). There are two key strategies to improve the prediction accuracy: (i) designing a 3D Earth Specific Transformer
(3DEST) architecture that formulates the height (pressure level) information into cubic data, and (ii) applying a hierarchical temporal
aggregation algorithm to alleviate cumulative forecast errors. In deterministic forecast, Pangu-Weather shows great advantages for
short to medium-range forecast (i.e., forecast time ranges from one hour to one week). Pangu-Weather supports a wide range of
downstream forecast scenarios, including extreme weather forecast (e.g., tropical cyclone tracking) and large-member ensemble
forecast in real-time. Pangu-Weather not only ends the debate on whether AI-based methods can surpass conventional NWP methods,
but also reveals novel directions for improving deep learning weather forecast systems.

Index Terms—Numerical Weather Prediction, Deep Learning, Medium-range Weather Forecast.

1 INTRODUCTION

Weather forecast is one of the most important scenarios of scientific computing. It offers the ability to predict future weather changes, especially the occurrence of extreme weather events (e.g., floods, droughts, hurricanes, etc.), which is of great value to society (e.g., daily activity, agriculture, energy production, transportation, industry, etc.). In the past decade, with the bloom of high-performance computational devices, the community has witnessed rapid development in the research field of numerical weather prediction (NWP) [1]. Conventional NWP methods mostly follow a simulation-based paradigm that formulates the physical rules of atmospheric states into partial differential equations (PDEs) and solves them using numerical simulations [2], [3], [4]. Due to the high complexity of solving PDEs, these NWP methods are often very slow, e.g., with a spatial resolution of 0.25° × 0.25°, a single simulation procedure for a 10-day forecast can take hours of computation using hundreds of nodes in a supercomputer [5]. This largely reduces the timeliness of daily weather forecasts and the number of ensemble members that can be used for probabilistic weather forecast. In addition, conventional NWP algorithms rely largely on parametric numerical models, but these models, albeit very complex [1], are often considered inadequate [6], [7], e.g., errors are introduced by the parameterization of unresolved processes.

To address the above issues, a promising direction lies in data-driven weather forecast with AI, in particular, deep learning^1. The methodology is to use a deep neural network to capture the relationship between the input (observed data) and output (target data to be predicted). On specialized computational devices (e.g., GPUs), AI-based methods run very fast and easily achieve a tradeoff between model complexity, prediction resolution, and prediction accuracy [9], [10], [11], [12], [13], [14], [15]. As a recent example, FourCastNet [14] increased the spatial resolution to 0.25° × 0.25°, comparable to the ECMWF Integrated Forecast Systems (IFS), yet it takes only 7 seconds on four GPUs to make a 100-member, 24-hour forecast, which is orders of magnitude faster than conventional NWP methods. However, the forecast accuracy of FourCastNet is still unsatisfactory, e.g., the RMSEs of its 5-day Z500 forecast using a single model and a 100-member ensemble are 484.5 and 462.5, respectively, much worse than the 333.7 reported by the operational IFS of ECMWF [16]. In [8], researchers conjectured that 'a number of fundamental breakthroughs are needed' before AI-based methods can beat NWP.

The breakthrough comes much earlier than they thought. In this paper, we present Pangu-Weather, a powerful AI-based weather forecast system that, for the first time, surpasses existing NWP methods (and, of course, AI-based methods) in terms of prediction accuracy for all factors. The test is performed on the 5th generation of ECMWF reanalysis (ERA5) data. We download 43 years (1979–2021) of global weather data, among which we use the 1979–2017 data for training, the 2019 data for validation, and the 2018, 2020, and 2021 data for testing.

1. Throughout this paper, we will use 'conventional NWP' or simply 'NWP' to refer to the numerical simulation methods, and use 'AI-based' or 'deep learning based' to specify data-driven forecast systems. We understand that, verbally, AI-based methods also belong to NWP, but we follow the convention [8] in using these terms.

• All authors are with Huawei Cloud Computing, Shenzhen, Guangdong 518129, China. E-mail: {bikaifeng1,tian.qi1}@huawei.com, [email protected]
• Qi Tian is the corresponding author.

Fig. 1: A showcase of Pangu-Weather's forecast results. Top: Pangu-Weather claims significant advantages over operational IFS (NWP) and FourCastNet (AI-based) in terms of forecast accuracy (i) for different factors (500hPa geopotential, Z500, and 850hPa temperature, T850) and (ii) with respect to different months of the year. Middle: visualization of Pangu-Weather's 3-day forecast of 2m temperature (T2M) and 10m wind speed at 00:00 UTC, September 1st, 2018, compared with the ERA5 ground-truth. Bottom: Pangu-Weather produces more accurate tracks for two tropical cyclones in 2018, i.e., Typhoon Kong-rey (2018-25) and Yutu (2018-26). Specifically, Pangu-Weather predicts the correct path of Yutu (i.e., it goes to the Philippines) 48 hours earlier than the ECMWF-HRES forecast.

We choose 13 pressure levels, each with 5 important variables (i.e., geopotential, specific humidity, temperature, u-component and v-component of wind speed), and the surface level with 4 variables (i.e., 2m temperature, u-component and v-component of 10m wind speed, and mean sea-level pressure).

Some key results are summarized in Figure 1. Quantitatively, Pangu-Weather outperforms all existing weather forecast systems. In particular, with a single-member forecast, Pangu-Weather reports an RMSE of 296.7 for the 5-day Z500 forecast, significantly better than the operational IFS [16] and the previous best AI-based method (i.e., FourCastNet [14]), which reported 333.7 and 462.5, respectively. In addition, the inference cost of Pangu-Weather is merely 1,400ms on a single GPU, more than 10,000× faster than operational IFS and on par with FourCastNet [14]. Qualitatively, Pangu-Weather not only produces high-resolution (0.25° × 0.25°) visualization maps (e.g., for temperature and wind speed), but also offers high-quality extreme weather forecasts (e.g., for tropical cyclone tracking).

The technical contribution of Pangu-Weather is two-fold. First, we integrate height information (offered by different pressure levels) into a new dimension, so that the input and output of the deep neural networks are in 3D form. We further design a 3D Earth-specific transformer (3DEST) architecture to process the 3D data. Our experiments show that, although 3D data incur heavier computational overhead (in particular, the large memory cost prevents us from using the full set of observation elements and very deep network architectures), 3D models can better capture the intrinsic relationship between different pressure levels and thus yield a significant accuracy gain over their 2D counterparts. Second, we apply a hierarchical temporal aggregation algorithm that involves training a series of models with increasing forecast lead times (i.e., 1-hour, 3-hour, 6-hour, and 24-hour forecasts). Hence, in the testing stage, the number of iterations needed for a medium-range (e.g., 5-day) forecast is largely reduced and, consequently, cumulative forecast errors are alleviated. Compared to previous methods (e.g., FourCastNet [14], which applied plain temporal aggregation with recurrent optimization), our strategy is easier to implement, more stable during training, and achieves much higher medium-range forecast accuracy.

The Pangu-Weather system is built upon a GPU cluster of Huawei Cloud with 192 NVIDIA Tesla-V100 GPUs. Each single forecast model is trained for 100 epochs, which takes around 15 days. To maximally support large neural networks, we use a batch size of 1 on each GPU, i.e., the overall batch size is 192. Through diagnostic studies, we notice that the forecast accuracy continuously goes up with a larger amount of training data and/or a longer training procedure – 100 epochs, the maximum budget that we can afford, are actually insufficient for the training procedure to reach full convergence. That said, the community can wait for more data (including increased temporal and/or spatial resolution) or use more powerful computational devices to improve AI-based weather forecast. The trend is similar to establishing large-scale pre-trained models in other AI scopes, e.g., computer vision [17], [18], natural language processing [19], [20], cross-modal understanding [21], and beyond.

Overall, the contributions of this paper are summarized in the following three aspects:

• We end the debate on whether AI-based methods can surpass NWP for global weather forecast. We establish a deep learning framework that, for the first time, surpasses operational IFS in terms of all weather factors and all forecast times from one hour to one week, meanwhile enjoying a very fast inference speed and a high spatial resolution of 0.25° × 0.25°.
• Technically, we reveal several key issues that significantly improve forecast accuracy, namely, (i) using a 3D deep network to integrate height information, and (ii) applying hierarchical temporal aggregation to alleviate cumulative forecast errors. Arguably, these techniques will be even more effective in the future with more powerful computational devices and higher-quality training data.
• We show that Pangu-Weather can easily transfer its deterministic forecast ability to downstream scenarios such as extreme weather forecast and large-member ensemble forecast, where timeliness is guaranteed by its fast inference speed.

The remainder of this paper is organized as follows. Section 2 formulates the problem and briefly reviews previous work, based on which we present our technical insights. Section 3 elaborates the Pangu-Weather system with algorithmic designs and implementation details. Section 4 shows generic forecast results and investigates two specific scenarios, namely, extreme weather forecast and large-member ensemble forecast. Section 5 draws conclusions and points out future directions.

2 PRELIMINARIES AND INSIGHTS

2.1 Problem Setting and Notations

Most weather forecast systems are built upon analysis or reanalysis of observation data. Reanalysis datasets are considered the best known estimation [22], [23] for most atmospheric variables, except for some factors like precipitation. Throughout this paper, we make use of the ERA5 dataset, i.e., the 5th generation of ECMWF reanalysis data [24]. The ERA5 data have four dimensions, namely, latitude, longitude, pressure level (for height), and time. We can choose an arbitrary number of weather factors (e.g., geopotential, temperature, etc.), but we do not count them as a new dimension. With a total size of over 2PB, the dataset is split into 2D (latitude and longitude) slices to ease downloading. That said, given a time point (hourly within the past 60 years), a pressure level (or Earth's surface), and a weather factor, one can download a matrix representing the specified global reanalysis data. We denote the overall ERA5 data as A, and we use superscripts to refer to specific weather factors and pressure levels, and subscripts to indicate spatiotemporal coordinates, e.g., A^{T850}_t stands for the global temperature data (a matrix) at time t and a height of 850hPa, and A^{Z500}_{x,y,t} for the geopotential data at position (x, y), time t, and a height of 500hPa – note that A^{Z500}_{x,y,t} is a single number.

Based on the ERA5 data, the problem of weather forecast is clearly defined: given an initial time t_0, the algorithm shall make use of all historical weather data (i.e., A^∗_t for t ≤ t_0) to predict future weather data (i.e., A^∗_t for t > t_0), where ∗ stands for all factors.
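To make the notation concrete, the following minimal sketch mirrors the indexing convention above in code; the container layout, array shapes, and variable names are illustrative assumptions rather than the paper's actual data pipeline.

```python
import numpy as np

# Hypothetical container mirroring the paper's notation: A[(factor, level)] holds a
# (time, lat, lon) array of global reanalysis values on the 0.25-degree grid.
A = {
    ("T", 850): np.zeros((24, 721, 1440), dtype=np.float32),  # temperature at 850hPa
    ("Z", 500): np.zeros((24, 721, 1440), dtype=np.float32),  # geopotential at 500hPa
}

t, x, y = 12, 360, 720                # a time index and a (latitude, longitude) grid position
A_T850_t = A[("T", 850)][t]           # A^{T850}_t: a 721 x 1440 matrix (all positions at time t)
A_Z500_xyt = A[("Z", 500)][t, x, y]   # A^{Z500}_{x,y,t}: a single number, as noted above
```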
Terms            Definition in this paper
system           The entire algorithm for end-to-end weather forecast
model            A deep network that produces a one-time prediction
initial time     The time point at which the weather forecast is made
forecast time    The time gap between observation and the desired forecast
lead time        The time gap between the input and output of one model
spacing          The minimum forecast time in a forecast system
range            The maximum forecast time in a forecast system
variable         An observed weather factor, e.g., 2m temperature
parameter        A learnable value in deep networks
x, y             Horizontal coordinates (latitude & longitude)
t                Temporal coordinate (time point)
∆t               Lead time added to t
h                Height (as a pressure level, e.g., 500hPa)
A                The overall weather data (e.g., ERA5)
A^{T,h}_{x,y,t}  Temperature at position (x, y), time t, height h
A^∗_t            All variables (all positions and heights) at time point t
Â^∗_{t+∆t}       The forecast results at time point t + ∆t

TABLE 1: A summary of the terminologies and notations used in this paper. In this work, we name the proposed system Pangu-Weather and the proposed model architecture 3D Earth-specific transformers (3DEST).

Before starting a survey on existing methods, we first note that the resolution of weather data is large due to the following facts. First, there are 37 × 21 + 262 observation factors in total (37 pressure levels, each of which has 21 weather variables, and a surface that has 262 variables), and it is believed that different elements can impact each other (e.g., temperature is highly correlated to geopotential). Second, ERA5 provides about 60 years of hourly observation data, i.e., the scale of the time axis is over 10^5. Third, the spatial resolution is 0.25° × 0.25°, implying that each frame of global weather data contains 1440 × 720 numbers (i.e., 'pixels' or 'voxels' if the data is to be processed by deep neural networks). As we shall see later, this high complexity has raised serious concerns about computational costs for both NWP and AI-based methods.

Based on the above definition, a weather forecast system is described as a mathematical function f(·) applied on A^∗_t. There are mainly two lines of research for weather forecast, which we follow the convention [8] to refer to as NWP and AI-based methods.

2.2 NWP Methods

The first line is the conventional numerical weather prediction (NWP) methods, which approximate f(·) using simulation. Starting with initial weather states, a set of partial differential equations (PDEs) is established to simulate different physical processes such as thermodynamic equations, N-S equations, continuity equations, etc. [1], [25], [26]. To solve the PDEs, the atmospheric states are partitioned into discrete grids. Intuitively, reducing the spacing of the grids leads to a larger number of grids and a higher spatial resolution of weather forecast, and also increases the computational costs of simulation. Currently, the spatial resolution is highly limited by the power of supercomputers. To accelerate computation, further approximation approaches were introduced, including (i) interpolation, which first performs low-resolution simulation and then estimates in-grid weather states, and (ii) parameterization [27], which uses an approximate function to solve very complex weather processes – typical examples include the parameterization of clouds [28], [29], [30] and convection [31], [32].

Prior to this work, NWP methods contributed the overall highest prediction accuracy, but they are still troubled by super-linearly increasing computational overhead [1], [5], especially as the amount of observation data keeps growing, and it is difficult to perform efficient parallelization for NWP methods [33]. The slowness of NWP not only weakens the timeliness of operational IFS systems (e.g., most such systems can only update predictions a few times a day), but also restricts the number of ensemble members (i.e., a set of individual prediction results for ensembling), hence weakening the diversity and accuracy of probabilistic weather forecast. In addition, the formulae used by NWP methods inevitably introduce approximation and computational errors [6], [34], which can grow with iteration or with incomplete or inaccurate analysis data [35]. It thus becomes a major challenge to maintain NWP methods with a complex PDE system that takes more and more factors into consideration.

2.3 AI-based Methods

To alleviate the above burden, researchers started the second line, which investigates AI-based methods for weather forecast. The cutting-edge technology of AI lies in deep learning [36], a branch of machine learning, which assumes that the complex function (i.e., f(·)) can be directly learned from abundant training data without knowing the actual physical procedure and/or formulae. Most often, f(·) appears as a deep neural network, written as f(·; θ), where · is a placeholder for the input data and θ denotes the learnable parameters. The network often contains a number of layers. Each of these layers has a large number of learnable parameters, and these parameters are initialized as white noise and optimized by back-propagating the prediction errors of the deep network. The field most similar to weather forecast is computer vision (CV), where image data appear in 2D/3D cubes. In the past decade, the CV community developed many effective network architectures (e.g., [37], [38], etc.), and recently it transplanted a powerful kind of architecture named transformers from natural language processing [39] and developed variants [40], [41] that are capable of dealing with image data.

In the scope of weather forecast, AI-based methods were first applied in scenarios where it is difficult to predict future weather data using NWP methods, e.g., precipitation forecasting based on radar data [42], [43], [44], [45] or satellite data [46], [47]. The powerful expressive ability of deep neural networks led to success in these data-driven environments, which further encouraged researchers to delve into the scenarios where NWP methods are troubled by enormous computational overhead, e.g., direct medium-range weather forecast [10], [12], [13], [14], [15], which consumed most of the computational resources of weather forecast centers in the past decade.

This paper investigates medium-range weather forecast. NWP and AI-based methods have been competing in this scenario, where NWP methods led in forecast accuracy [8] and resolution, while AI-based methods showed their advantages in efficiency (e.g., their inference speed is orders of magnitude faster than that of NWP methods [8], [14]).

Prior to 2022, AI-based methods could not achieve the horizontal resolution of 0.25° × 0.25° that NWP methods can. Recently, FourCastNet [14] improved the resolution to 0.25° × 0.25°, but its forecast accuracy (e.g., in terms of RMSE or ACC) is still inferior to operational IFS even after a large-member ensemble is performed. The disadvantages in forecast accuracy and interpretability, especially in extreme weather events, hinder the application of AI-based methods. Consequently, AI-based methods mostly play the role of fast surrogate models for medium-range weather forecast.

2.4 Insights

We briefly analyze the reasons why AI-based (specifically, deep learning based) methods fell behind NWP methods in terms of prediction accuracy. There are mainly two aspects, summarized as follows.

First, weather forecast shall take high-dimensional (e.g., 3D spatial with 1D time), anisotropic data into consideration, yet existing AI-based methods [10], [12], [13], [14], [15] often worked on 2D (latitude and longitude) data. This brings two-fold disadvantages. On the one hand, the spacing and distribution of atmospheric states and the relationships between atmospheric patches change rapidly across pressure levels, making it difficult for 2D models to adapt to different situations. On the other hand, many weather processes (e.g., radiation, convection, etc.) can only be completely formulated in 3D space, and thus 2D models cannot make use of such important patterns.

Second, medium-range weather forecast can suffer from cumulative forecast errors when the model is called too many times. As an example, FourCastNet [14] trained a base model for 6-hour forecast, so that performing a 7-day forecast required executing the model 28 times iteratively. Compared to the case of NWP methods, such errors can grow rapidly because AI-based methods often do not consider real-world constraints (e.g., those formulated by the PDEs). According to the results in FourCastNet, the forecast error often grows super-linearly with time. Note that FourCastNet applied a specialized method for reducing the iteration error, but the actual gain is somewhat limited.

Summarizing the above factors, we come to the insight that one shall try to increase the dimensionality of the data and reduce the number of iterations for more accurate medium-range weather forecast. However, this encounters difficulties in computational overhead because the weather data are very large (see Section 2.1). In the next part, we will elaborate a method built upon a tradeoff between accuracy and efficiency – in brief, we use 3D (latitude, longitude, and height) data as input and output, and train a few individual models for different prediction time gaps to maximally reduce the number of iterations required for medium-range forecast.

3 METHODOLOGY

3.1 Overview

Based on the above insights, we present our system, termed Pangu-Weather, for fast and accurate global weather forecast. As an AI-based method, it surpasses the accuracy of conventional NWP methods for the first time, meanwhile enjoying a very fast inference speed.

The core part of Pangu-Weather is a set of deep neural networks trained on 39 years of global weather data – we elaborate data preparation and the pre-training task in Section 3.2. The key to reducing the accuracy loss is two-fold, namely (i) using a 3D Earth-specific transformer (3DEST) to model the 3D atmosphere effectively – see Section 3.3, and (ii) applying a hierarchical temporal aggregation strategy (i.e., training a few models with various lead times) to alleviate cumulative forecast errors – see Section 3.4. The Pangu-Weather system can be applied to generic or specific forecast scenarios, as we shall see in Section 4.

3.2 Data Preparation and the Pre-training Task

We download the ERA5 dataset [24], [48], [49] from the official website^2 for training and evaluating Pangu-Weather. It contains global, hourly reanalysis data for the past 60 years. The observation data and the predictions of numerical models are blended into reanalysis data using numerical assimilation methods, providing a high-quality benchmark for global weather forecast. Following the existing methods [10], [13], [14], we train our models on a subset of ERA5 – in particular, we use the 1979–2017 (39 years of) data for training, the 2019 data for validation, and the 2018, 2020, and 2021 data for testing.

We make use of observation data of every single hour so that the algorithm can perform hourly prediction. We keep the highest spatial resolution available in ERA5, namely, 0.25° × 0.25° on Earth's sphere, resulting in an input resolution of 1440 × 721 (1440 for longitude and 721 for latitude – note that the northmost and southmost data do not overlap). The largest difference between our method and the prior works lies in that we formulate height information (represented as pressure levels) as the 3rd spatial dimension. To reduce computational costs, we follow [10] to choose 13 pressure levels (i.e., 50hPa, 100hPa, 150hPa, 200hPa, 250hPa, 300hPa, 400hPa, 500hPa, 600hPa, 700hPa, 850hPa, 925hPa, and 1000hPa) from a total of 37 levels, plus Earth's surface. To fairly compare with the online version of the ECMWF control forecast, we choose to predict the factors published in the TIGGE dataset [16], namely, five upper-air atmospheric variables (i.e., geopotential, specific humidity, temperature, u-component and v-component of wind speed) and four surface weather variables (i.e., 2m temperature, u-component and v-component of 10m wind speed, and mean sea level pressure). In addition, three constant masks (i.e., the topography mask, land-sea mask, and soil-type mask) are added to the input of the surface variables.

The pre-training task is straightforward, i.e., asking the model to predict the future weather given historical observation data. Technically, this involves sampling a time point t (i.e., date and hour) from the dataset and specifying a prediction gap ∆t, so that the model, f(·; θ), takes A^∗_t as input and predicts Â^∗_{t+∆t}, with the goal of approaching A^∗_{t+∆t}. In the context of deep learning, f(·; θ) appears as a differentiable function, so that the difference between A^∗_{t+∆t} and Â^∗_{t+∆t} can be computed and back-propagated to update the parameters θ.

2. https://cds.climate.copernicus.eu/, offered by the Copernicus Climate Data Store (CDS).
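As a minimal illustration of this pre-training task, the sketch below shows one optimization step in PyTorch; the model object, the way the (A_t, A_{t+∆t}) pair is supplied, and the choice of an L1 loss are our own assumptions, not the paper's exact recipe.

```python
import torch

def pretrain_step(model, optimizer, A_t, A_target, criterion=torch.nn.L1Loss()):
    """One step of the pre-training task: predict A_{t+dt} from A_t and back-propagate.

    A_t and A_target hold the (upper-air + surface) fields at the sampled time point t
    and at t + dt, where dt is the fixed lead time of this particular model.
    """
    optimizer.zero_grad()
    A_hat = model(A_t)                  # f(A_t; theta) -> predicted A_{t+dt}
    loss = criterion(A_hat, A_target)   # difference between the forecast and the ERA5 target
    loss.backward()                     # back-propagate the prediction error
    optimizer.step()                    # update the learnable parameters theta
    return loss.item()
```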


Fig. 2: An overview of the 3D Earth-specific transformer (3DEST). Based on the standard encoder-decoder design, we (i) adjust the shifted-window mechanism and (ii) apply an Earth-specific positional bias – see the main text for details.

The technical details, including the design of f(·; θ) and the choice of ∆t values, are elaborated in the following subsections.

3.3 3D Earth-Specific Transformer

This part describes the design of f(·; θ), which we name a 3D Earth-specific transformer (3DEST). The overall architecture of 3DEST is illustrated in Figure 2. It is a variant of the vision transformer [40] whose input and output are 3D weather states at a specified time point. For a single model, the lead time between the input and output states is fixed, e.g., ∆t equals 6 hours. We achieve any-time weather forecast by aggregating multiple models with different lead times, as elaborated in the next subsection.

There are two sources of input and output data, namely, upper-air variables and surface variables. The former involves 13 pressure levels, which combined offer a 13 × 1440 × 721 × 5 data cube. The latter forms a 1440 × 721 × 4 cube. These variables are first embedded from the original space into a C-dimensional latent space. A common technique in computer vision named patch embedding is used for dimensionality reduction. For the upper-air part, the patch size is 2 × 4 × 4, so that the embedded data has a shape of 7 × 360 × 181 × C. For the surface variables, the patch size is 4 × 4, so that the embedded data has a shape of 360 × 181 × C. These two data cubes are then concatenated along the first (height) dimension to yield an 8 × 360 × 181 × C cube. The cube is then propagated through a standard encoder-decoder architecture with 8 encoder layers and 8 decoder layers. The output of the decoder is still an 8 × 360 × 181 × C cube, which is projected back to the original space with patch recovery, producing the desired output. Below, we describe the technical details of each component.

Patch embedding and patch recovery. We follow the standard vision transformer in using a linear layer with GeLU activation for this purpose. In our implementation, a patch has 2 × 4 × 4 pixels for upper-air variables and 4 × 4 for surface variables. The stride of the sliding windows equals the patch size, and the necessary zero-value padding is added when the data size is indivisible by the patch size. The number of parameters for patch embedding is (4 × 4 × 2 × 5) × C for upper-air variables and (4 × 4 × 4) × C for surface variables. Patch recovery performs the opposite operation, but it does not share parameters with patch embedding.
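As a concrete rendering of the patch embedding just described, the sketch below uses strided convolutions (equivalent to per-patch linear maps) and follows the stated shapes; the latent width C, the padding scheme, and placing the surface level first along the height axis are illustrative assumptions, not the released implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class PatchEmbedding3D(nn.Module):
    """Embed upper-air (5 x 13 x 721 x 1440) and surface (4 x 721 x 1440) fields
    into a shared token cube of shape (C, 8, 181, 360), as described in the text."""
    def __init__(self, C=192):                       # C is a placeholder latent width
        super().__init__()
        self.upper = nn.Conv3d(5, C, kernel_size=(2, 4, 4), stride=(2, 4, 4))
        self.surface = nn.Conv2d(4, C, kernel_size=4, stride=4)

    def forward(self, upper_air, surface):
        # upper_air: (B, 5, 13, 721, 1440); surface: (B, 4, 721, 1440)
        upper_air = F.pad(upper_air, (0, 0, 1, 2, 0, 1))  # zero-pad to (B, 5, 14, 724, 1440)
        surface = F.pad(surface, (0, 0, 1, 2))            # zero-pad to (B, 4, 724, 1440)
        u = self.upper(upper_air)                         # (B, C, 7, 181, 360)
        s = self.surface(surface).unsqueeze(2)            # (B, C, 1, 181, 360)
        return torch.cat([s, u], dim=2)                   # (B, C, 8, 181, 360)

# tokens = PatchEmbedding3D()(torch.zeros(1, 5, 13, 721, 1440), torch.zeros(1, 4, 721, 1440))
```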
The encoder-decoder architecture. The data size remains unchanged (8 × 360 × 181 × C) for the first 2 encoder layers, while for the next 6 layers, the horizontal dimensions are reduced by a factor of 2 and the number of channels is doubled, resulting in a data size of 8 × 180 × 91 × 2C. The decoder part is symmetric to the encoder part, with the first 6 decoder layers sized 8 × 180 × 91 × 2C and the next 2 layers sized 8 × 360 × 181 × C. The outputs of the 2nd encoder layer and the 7th decoder layer are concatenated along the channel dimension. Down-sampling and up-sampling operations connect adjacent layers of different resolutions, and we follow the implementation of Swin transformers [41]. For down-sampling, we merge four tokens into one (the feature dimensionality increases from C to 4C) and apply a linear layer to reduce the dimensionality to 2C. For up-sampling, the reverse operations are performed.
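A minimal sketch of the Swin-style down-sampling just described – merging 2 × 2 horizontally neighboring tokens (4C features) and linearly projecting them to 2C; the padding of the odd latitude dimension is our own simplification.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class DownSample(nn.Module):
    """Merge 2 x 2 horizontally neighboring tokens and reduce 4C features to 2C."""
    def __init__(self, C):
        super().__init__()
        self.reduction = nn.Linear(4 * C, 2 * C, bias=False)

    def forward(self, x):
        # x: (B, pl, lat, lon, C), e.g., (B, 8, 181, 360, C)
        x = F.pad(x, (0, 0, 0, 0, 0, 1))                        # pad latitude: 181 -> 182
        B, pl, lat, lon, C = x.shape
        x = x.reshape(B, pl, lat // 2, 2, lon // 2, 2, C)
        x = x.permute(0, 1, 2, 4, 3, 5, 6).reshape(B, pl, lat // 2, lon // 2, 4 * C)
        return self.reduction(x)                                # (B, 8, 91, 180, 2C)
```

The up-sampling path reverses these steps: a linear expansion of the features followed by redistributing them back onto the finer horizontal grid.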
3D Earth-specific transformer blocks. Each encoder and decoder layer is a 3D Earth-specific transformer (3DEST) block. It is similar to the standard vision transformer block [40] but is specifically designed to align with Earth's geometry. To further reduce computational costs, we inherit the window-attention mechanism [41] to partition the feature maps (either 8 × 360 × 181 or 8 × 180 × 91 – the last dimension is omitted) into windows, and each window contains up to 2 × 12 × 6 tokens. The standard self-attention mechanism is applied within each window. The shifted-window attention mechanism is also applied, so that for every layer, the grid partition differs from the previous one by half the window size^3.

Fig. 3: The motivation for using an Earth-specific positional bias. Left: the horizontal map corresponds to an uneven spatial distribution on Earth's sphere. Middle: the geopotential height is closely related to the latitude. Right: the mean wind speed and temperature are closely related to the height (formulated as pressure levels).

We refer the reader to the original paper [41] for more details. The standard self-attention formula is written below:

Attention(Q, K, V) = SoftMax(QK^T / √D + B) V,   (1)

where Q, K, and V are the query, key, and value vectors produced by the transformer block, respectively, D is the feature dimensionality of Q and K (i.e., C or 2C), and B is the positional bias term.

Earth-specific positional bias. The Swin transformer used a relative positional bias to represent the translation-invariant component of attention, where the bias is computed upon the relative coordinates within each window. For global weather forecast, however, the situation is a bit different. Each token corresponds to an absolute position in Earth's coordinate system and, since the map is a projection of Earth's sphere, the spacing between neighboring tokens can differ – see Figure 3. More importantly, some weather states are closely related to the absolute position. Examples of geopotential, wind speed, and temperature are shown in Figure 3. To capture these properties, we modify B into an Earth-specific positional bias, termed B_ESP, which adds a positional bias to each token based on its absolute (rather than relative) coordinate.

Mathematically, let the entire feature map have a spatial resolution of N_pl × N_lat × N_lon, where N_pl, N_lat, and N_lon indicate the size along the axes of height (by pressure levels), latitude, and longitude, respectively. The Swin transformer partitions these neurons into M_pl × M_lat × M_lon windows, and each window has a size of W_pl × W_lat × W_lon. The Earth-specific positional bias matrix contains M_pl × M_lat sub-matrices (M_lon does not appear because different longitudes share the same bias – the longitude indices are cyclic and the spacing is evenly distributed along this axis), each of which contains W_pl² × W_lat² × (2W_lon − 1) learnable parameters. When the attention is computed between two units within the same window (Swin does not compute inter-window attention), we use the window coordinate (m_pl, m_lat, m_lon) to locate the corresponding bias sub-matrix (m_lon is not used), and then use the intra-window coordinates, (h'_1, φ'_1, λ'_1) and (h'_2, φ'_2, λ'_2), to look up the bias value at (h'_1 + h'_2 × W_pl, φ'_1 + φ'_2 × W_lat, λ'_1 − λ'_2 + W_lon − 1) of the sub-matrix.

3. Note that, along the longitude dimension, the leftmost and rightmost indices are actually close to each other. In the shifted-window mechanism, if half windows appear at both the leftmost and rightmost positions, they are directly merged into one window.

Applying the Earth-specific positional bias brings two-fold differences. First, it enables a better formulation of Earth's atmosphere: in every attention block, the Earth-specific positional bias learns different spatial relationships between tokens at different latitudes and heights, hence correcting the non-uniformity brought by the uneven spatial distribution. Second, compared to the original version where all grids share the same bias, the number of learnable parameters in each transformer layer is largely increased, from (2W_pl − 1) × (2W_lat − 1) × (2W_lon − 1) to M_pl × M_lat × W_pl² × W_lat² × (2W_lon − 1). In the first block, the latter quantity is about 527× larger than the former. The large number of bias parameters allows each block to flexibly learn specific patterns for each variable, such as the relationships shown in Figure 3. In practice, we do not observe any difficulties in optimizing this large number of parameters. Instead, the model converges faster in the training process since useful priors have been introduced. In addition, the Earth-specific positional bias does not increase the FLOPs of the model.
the FLOPs of the model.
indicate the size along the axes of height (by pressure lev-
els), latitude, and longitude, respectively. Swin transformer Design choices. We briefly discuss other design
partitions these neurons into Mpl × Mlat × Mlon windows, choices. Due to the large computational overhead, we do
and each window has a size of Wpl × Wlat × Wlon . The not perform exhaustive ablative or diagnostic studies on
Earth-specific position bias matrix contains Mpl × Mlat the hyper-parameters and we believe there exist configura-
sub-matrices (Mlon does not appear because different lon- tions that lead to higher accuracy. First, we use 8 (2 + 6)
gitudes share the same bias – the longitude indices are encoder and decoder layers, which is significantly fewer
cyclic and spacing is evenly distributed along this axis), than the standard Swin transformer. This is to reduce the
2 2
each of which contains Wpl × Wlat × (2Wlon − 1) learn- complexity in both time and memory. If one has a larger
able parameters. When the attention is computed between GPU memory and a more powerful cluster, increasing the
network depth can lead to higher accuracy. Second, it is
3. Note that, along the longitude dimension, the leftmost and right- possible to reduce the number of parameters used in the
most indices are actually close to each other. In the shifted-window
mechanism, if half windows appear at both leftmost and rightmost Earth-specific positional bias by parameter sharing or other
positions, they are directly merged into one window. techniques. However, we do not consider it as a key issue,

Fig. 4: Curves showing the cumulative forecast errors (Z500 RMSE) when one performs up to 7-day forecasts with the base lead time being 1 hour, 3 hours, 6 hours, and 24 hours, respectively. The statistics are computed on the March 2018 subset.

3.4 Hierarchical Temporal Aggregation

When the goal is to make a medium-range weather forecast (e.g., the forecast time is up to 5 days) yet the lead time of the basic forecast model is relatively short (e.g., FourCastNet trained a model with a lead time of 6 hours), the system must execute the model many times iteratively, and the cumulative forecast errors can grow continuously. As shown in Figure 4, we mimic FourCastNet [14] by executing the 6-hour model 28 times to achieve up to 7-day forecasts, and we find that the forecast accuracy rapidly goes down as the iteration goes on. Not surprisingly, the accuracy drop becomes dramatic if the basic lead time is set to be 1 hour (i.e., the model is executed 168 times), yet the drop is largely alleviated if the lead time is 24 hours (i.e., the model is executed 7 times). This implies that, for medium-range and even long-range forecast, the system can benefit much from suppressing the cumulative forecast errors.

For this purpose, we exploit a straightforward yet effective strategy named hierarchical temporal aggregation. We train four individual models for 1-hour, 3-hour, 6-hour, and 24-hour prediction, respectively. We do not continue enlarging the lead time, because doing so largely increases the difficulty of training the base model^4. At the testing stage, given a forecast goal, we use a greedy algorithm to guarantee the minimal number of iterations. For example, for a 7-day forecast, we execute the 24-hour forecast 7 times, while for a 23-hour forecast, we execute the 6-hour forecast 3 times, followed by the 3-hour forecast 1 time and the 1-hour forecast 2 times.

4. We find that, based on the current deep network, it is difficult to perform long-term (say, 28-day) forecasts. We conjecture that, if more powerful methods are used (e.g., using time-aware inputs, increasing the computational complexity, etc.), the model may gain such abilities.
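The greedy scheduling described above can be written in a few lines; the sketch below (our own function names) decomposes the forecast horizon into the fewest model calls and chains the lead-time models accordingly.

```python
def greedy_schedule(forecast_hours, lead_times=(24, 6, 3, 1)):
    """Decompose a forecast horizon into the fewest model calls, largest lead time first."""
    schedule, remaining = [], forecast_hours
    for lt in sorted(lead_times, reverse=True):
        calls, remaining = divmod(remaining, lt)
        schedule += [lt] * calls
    return schedule                        # e.g., greedy_schedule(23) -> [6, 6, 6, 3, 1, 1]

def hierarchical_forecast(models, A_t, forecast_hours):
    """Iteratively apply the lead-time models (a dict keyed by hours) along the schedule."""
    state = A_t
    for lt in greedy_schedule(forecast_hours):
        state = models[lt](state)          # each call advances the weather state by lt hours
    return state
```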
We point out that hierarchical temporal aggregation makes both the training and testing stages more efficient. For training, it avoids performing recursive optimization as many existing works [11], [12], [14] did, e.g., FourCastNet [14] computed both f(A) and f(f(A)) and produced two loss terms – although the iterative errors are indeed suppressed, this requires 2× GPU memory for the same model and thus reduces the affordable model size, which is one of the critical factors of improvement. In addition, it avoids training a recursive neural network, which may be unstable. For testing, especially when the forecast range is large, it reduces the number of forecasts as well as the time complexity.

The four individual models are trained for 100 epochs using the Adam optimizer. Each full training procedure takes 16 days on 192 NVIDIA Tesla-V100 GPUs. We find that all models have not yet arrived at full convergence at the end of 100 epochs, but the limited computational budget prevents us from continuing the training procedure. A weight decay of 3 × 10⁻⁶ and scheduled DropPath with a drop ratio of 0.2 are adopted to avoid over-fitting.

4 RESULTS

We report the forecast results of Pangu-Weather on two datasets. The first one is the held-out part of ERA5, for an overall evaluation of global, deterministic weather forecast. The second one is the 4th version of the International Best Track Archive for Climate Stewardship (IBTrACS) dataset, for evaluating the ability to track tropical cyclones, a special case of extreme weather forecast.

We compare Pangu-Weather to the strongest methods in both worlds of NWP and AI, namely, operational IFS offered by ECMWF (downloaded from the TIGGE archive [16])^5 and FourCastNet [14]. For tropical cyclone tracking, we also download the ECMWF-HRES forecast as a stronger competitor against Pangu-Weather on IBTrACS. To the best of our knowledge, no prior AI-based methods have ever reported quantitative results for tropical cyclone tracking.

5. We failed to download part of the forecast results of the surface variables from TIGGE, due to the unavailability of ECMWF's Data Handling Systems from September to November, so we compare our results to operational IFS by (i) fetching the numbers reported in WeatherBench [10] and (ii) extracting the quantities from the plots in the FourCastNet paper [14].

4.1 Deterministic Forecast

The deterministic forecast of Pangu-Weather is performed on the unperturbed initial states from ERA5. The forecast resolution of Pangu-Weather is determined by the training data (i.e., ERA5), where the spatial resolution is 0.25° × 0.25°, comparable to the control forecast of the ECMWF ENS product [3] and the same as FourCastNet [14], yet the spacing of the forecast (the minimal forecast time) is 1 hour (i.e., Pangu-Weather can provide hour-by-hour forecasts), 6× smaller than that of FourCastNet [14].

Following the prior AI-based methods, the accuracy of deterministic forecast is computed by two quantitative metrics, namely, the latitude-weighted Root Mean Square Error (RMSE) and the latitude-weighted Anomaly Correlation Coefficient (ACC).



Fig. 5: The comparison of forecast accuracy in terms of latitude-weighted RMSE (lower is better) and ACC (higher is better)
of four upper-air variables at the pressure level of 500hPa. Here, T, Q, U and V stand for temperature, specific humidity,
u-component and v -component of wind speed, respectively.

For a specified time point t, the RMSE and ACC of any variable v (e.g., 2m temperature or 500hPa geopotential) are defined as follows:

\mathrm{RMSE}(v, t) = \sqrt{\frac{\sum_{i=1}^{N_{\mathrm{lat}}} \sum_{j=1}^{N_{\mathrm{lon}}} L(i)\,\bigl(\hat{A}^{v}_{i,j,t} - A^{v}_{i,j,t}\bigr)^2}{N_{\mathrm{lat}} \times N_{\mathrm{lon}}}},   (2)

\mathrm{ACC}(v, t) = \frac{\sum_{i,j} L(i)\,\hat{A}'^{v}_{i,j,t}\,A'^{v}_{i,j,t}}{\sqrt{\sum_{i,j} L(i)\,\bigl(\hat{A}'^{v}_{i,j,t}\bigr)^2 \times \sum_{i,j} L(i)\,\bigl(A'^{v}_{i,j,t}\bigr)^2}},   (3)

where L(i) = N_lat × cos φ_i / (Σ_{i'=1}^{N_lat} cos φ_{i'}) stands for the weight at latitude φ_i, and A' denotes the difference between A and the climatology (i.e., the long-term mean of weather states, which is estimated on the training data over 39 years). Note that we omit the range of the summations in Eqn. (3) for simplicity.
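For clarity, a direct NumPy transcription of Eqns. (2) and (3) is given below; the latitude grid (90° to −90°) and the way the climatology field is supplied are illustrative choices.

```python
import numpy as np

def lat_weights(n_lat):
    """L(i) = N_lat * cos(phi_i) / sum_i' cos(phi_i'), for latitudes from 90 to -90 degrees."""
    phi = np.deg2rad(np.linspace(90.0, -90.0, n_lat))
    w = np.cos(phi)
    return n_lat * w / w.sum()

def rmse(pred, target):
    """Latitude-weighted RMSE, Eqn. (2); pred and target are (n_lat, n_lon) fields."""
    L = lat_weights(pred.shape[0])[:, None]
    return np.sqrt((L * (pred - target) ** 2).sum() / pred.size)

def acc(pred, target, climatology):
    """Latitude-weighted ACC, Eqn. (3); anomalies are deviations from the climatology."""
    L = lat_weights(pred.shape[0])[:, None]
    dp, dt = pred - climatology, target - climatology
    return (L * dp * dt).sum() / np.sqrt((L * dp ** 2).sum() * (L * dt ** 2).sum())
```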
In what follows, we report these two metrics on upper-air atmospheric variables and surface weather variables to show the superiority of Pangu-Weather. We also provide extensive visualization and diagnostic results for qualitative studies.

4.1.1 Upper-air Atmospheric Variables

As in the training procedure (see Section 3.2 for data preparation), Pangu-Weather forecasts five important upper-air variables (i.e., geopotential, specific humidity, temperature, u-component and v-component of wind speed) at 13 pressure levels (i.e., 50hPa, 100hPa, 150hPa, 200hPa, 250hPa, 300hPa, 400hPa, 500hPa, 600hPa, 700hPa, 850hPa, 925hPa, and 1000hPa), with a spatial resolution of 0.25° × 0.25°. This is to maximally ease the comparison to operational IFS [16] and FourCastNet [14], the best NWP and AI-based methods.

The testing environment is established on the weather data of 2018^6. Following the protocol of operational IFS, we choose 2 time points (00:00 UTC and 12:00 UTC) each day as the initial time^7 and produce hourly forecasts for the upcoming week, namely, forecast times of 1h, 2h, . . . , 168h = 7d. Quantitative comparisons are mainly made between Pangu-Weather and operational IFS, while the comparisons against other AI-based methods (e.g., [10], [11], [12], [13], [14], [15]) are incomplete due to differences in spatial resolution, test subsets, post-processing methods, etc.

6. We also test our system on the 2020 and 2021 data, but we cannot provide comparative results since no prior works have reported results on these data. The properties of the forecast results are mostly similar to those observed on the 2018 data.

7. The test points on Jan 1st, 2018 are excluded due to the overlap with the training data. All test points in December 2018 are unavailable due to a server error of ECMWF. In addition, for the T850 variable, all test points in October 2018 are not used due to an unexpected error in data download from the TIGGE archive.



Fig. 6: The comparison of forecast accuracy in terms of latitude-weighted RMSE (lower is better) and ACC (higher is better) of three surface variables. Here, T2M, U10, and V10 stand for 2m temperature and the u-component and v-component of 10m wind speed, respectively.

Note that all prior AI-based methods reported inferior forecast accuracy compared to operational IFS, while our method significantly outperforms operational IFS, claiming clear advantages over these candidates. For two specific variables, Z500 and T850, we directly compare Pangu-Weather against FourCastNet by fetching the numerical values from the plots in the paper – this introduces some errors, which are negligible compared to the accuracy gap between Pangu-Weather and FourCastNet.

The comparative results between Pangu-Weather and operational IFS are shown in Figure 1 (top) and Figure 5, where Pangu-Weather enjoys consistent gains (at all forecast times and for all variables) in forecast accuracy compared to operational IFS. The advantage becomes more significant as the forecast time increases, implying that AI-based methods are better at capturing effective (though non-interpretable) patterns for medium-range weather forecast. Specifically, we note that the 'forecast time gain' of Pangu-Weather over operational IFS (i.e., the difference between forecast times at the same forecast accuracy) is more than 12 hours for all variables and more than 24 hours for specific humidity – this implies that AI-based methods are significantly better at forecasting specific variables.

Specifically, we investigate 500hPa geopotential (Z500) and 850hPa temperature (T850), the variables that were widely reported in prior AI-based methods. The quantitative comparison for these two variables is shown in Figure 1. As shown, the forecast accuracy of Pangu-Weather is consistently higher than that of operational IFS and FourCastNet, the previous best AI-based method (yet weaker than operational IFS). Quantitatively, for Z500, the 3-day and 5-day RMSEs (in m²/s²) of operational IFS are 152.8 and 333.7, respectively, and Pangu-Weather reduces them to 134.5 and 296.7 (133.9 and 294 if the December 2018 data are included). For T850, the 3-day and 5-day RMSEs (in K) of operational IFS are 1.37 and 2.06, respectively, and Pangu-Weather reduces them to 1.14 and 1.79 (1.13 and 1.77 if the December 2018 data are included), claiming an over 10% relative error reduction. The relative drop in RMSE is more than 10% in all these scenarios, which is also reflected in a 'forecast time gain' of 10–15 hours. When compared to FourCastNet, we observe even more significant accuracy gains – the relative reduction in RMSE is more than 30% in the above scenarios, and the 'forecast time gain' is enlarged to more than 36 hours.

Fig. 7: Visualization of 3-day weather forecasts produced by Pangu-Weather (top), operational IFS (middle), and the ERA5 ground-truth (bottom). The left and right columns show the maps of 500hPa geopotential (Z500) and 850hPa temperature (T850), respectively. The input time point (i.e., the time at which the forecast is initialized) is 00:00 UTC, September 1st, 2018.

4.1.2 Surface Weather Variables

Pangu-Weather forecasts four important surface variables, i.e., 2m temperature, the u-component and v-component of 10m wind speed, and mean sea level pressure. Compared to the upper-air variables, these surface variables have close and complex relationships with topography and human activities (e.g., the urban heat island effect) and are thus more difficult to forecast. The testing environment is established in a similar way as for the upper-air variables^8. Quantitative comparisons are made between Pangu-Weather and the previous best NWP method (i.e., operational IFS) and AI-based method (i.e., FourCastNet [14]), where the numerical results of FourCastNet and operational IFS are fetched from the plots in the paper^9. The comparative results are shown in Figure 6. Again, Pangu-Weather outperforms both competitors in terms of forecast accuracy at all forecast times and for all variables, and the advantage becomes more significant as the forecast time increases. The 'forecast time gain' for all these variables is about 18 hours, slightly longer than the gain in forecasting upper-air variables.

We investigate the forecast accuracy of the separate variables. For 2m temperature (T2M), the 3-day and 5-day RMSEs (in K) are 1.34 and 1.75 for operational IFS, 1.39 and 2.00 for FourCastNet, and Pangu-Weather reduces them to 1.05 and 1.53 (1.06 and 1.52 with a 6-hour test interval), respectively. For the u-component of 10m wind speed (U10), the 3-day and 5-day RMSEs (in m/s) are 1.94 and 2.90 for operational IFS, 2.24 and 3.41 for FourCastNet, and Pangu-Weather reduces them to 1.61 and 2.53 (1.61 and 2.55 for the test data in 2018 with a 6-hour interval), respectively. We omit the numerical comparison for the v-component of 10m wind speed (V10) since it is almost the same as that of U10. We have the forecast results for the variable of mean sea level pressure (MSLP) but cannot provide a quantitative comparison due to the unavailability of data from the ECMWF server. We believe that our forecast is the best candidate for two reasons. On the one hand, according to prior experience [50], a model that better forecasts the other surface variables (i.e., T2M, U10, V10) also enjoys higher forecast accuracy on MSLP. On the other hand, we make use of our forecast of MSLP for tracking tropical cyclones – as shown in Section 4.2.2, Pangu-Weather achieves much better results, quantitatively and qualitatively, than operational IFS in forecasting 88 named tropical cyclones in the year of 2018.

8. The test points on Jan 1st, 2018 are excluded due to the overlap with the training data.

9. For a fair comparison to FourCastNet, we follow the protocol of setting the test interval (the gap between neighboring test time points) to 9 days for T2M and 2 days for U10 and V10, albeit we can produce forecast results every single hour. We also report the RMSE values using a fixed 6-hour interval in the following part.

clones – as shown in Section 4.2.2, Pangu-Weather achieves and Q500) and two surface variables (T2M and U10) in
much better results, quantitatively and qualitatively, than Figure 8. Note that we have applied a greedy algorithm
operational IFS in forecasting 88 named tropical cyclones in based on hierarchical temporal aggregation as elaborated
the year of 2018. in Section 3.4. For some variables (e.g., Q500 and T2M), we
In addition, we evaluate Pangu-Weather on Weather- observe a clear trend that forecast accuracy drops with the
Bench [10], a benchmark for low-resolution weather fore- number of iterations, e.g., 3 calls are required for a 72-hour
cast. For this purpose, we simply down-sample the forecast forecast, while 8 calls are required for a 71-hour forecast
results of Pangu-Weather by 22.5× into a coarse grid with a (i.e., 71 = 24 + 24 + 6 + 6 + 6 + 3 + 1 + 1), and thus the
spatial resolution of 5.625◦ ×5.625◦ , and compare the results 71-hour accuracy is much lower due to cumulative forecast
to the down-sampled ERA5 ground-truth. Quantitatively, errors. While we can easily improve the accuracy by moving
operational IFS reported 3-day/5-day RMSEs of T2M10 be- the time point back (e.g., performing 72-hour forecast using
ing 1.35/1.77 on WeatherBench, and Pangu-Weather im- 1-hour-earlier weather states for 71-hour forecast), we just
proves the results to 1.04/1.51, respectively. offer the original forecast results here to show this im-
portant phenomenon. This calls for an advanced temporal
4.1.3 Visualization aggregation algorithm in the future – one possibility lies in
In Figure 7, we first visualize the 72-hour forecast of Pangu- integrating the time axis into input data and making use
Weather on two upper-air variables, namely, Z500 and T850, of 4D deep neural networks, but this implies much heavier
and compare the results to operational IFS and the ERA5 computational overheads.
ground-truth. Both forecast results are sufficiently close to 4.1.5 Computational Costs
the ground-truth, yet one can detect the differences between
A clear advantage of AI-based methods lies in the inference
them. Pangu-Weather produce smoother contour lines, im-
speed. FourCastNet [14] claimed a 45,000× speedup over
plying that the model tends to forecast similar values for
the traditional NWP method, and Pangu-Weather is com-
neighboring regions – this is a typical property of deep
parable with FourCastNet (see the next paragraph). Con-
neural networks in learning from large-scale datasets. In
sidering the advantage in forecast accuracy, Pangu-Weather
comparison, operational IFS tends to preserve small-scale
has the potentials of replacing conventional NWP, enabling
structures, yet such predictions are not guaranteed to be
real-time weather forecast to be performed any time (e.g.,
correct. As shown in the previous part, Pangu-Weather
once a second), rather than the current status that weather
enjoys the advantage of overall forecast accuracy.
forecast is performed merely a few times per day. A few side
In Figure 1 (top left), we also visualize the 72-hour
benefits are expected. (i) It largely increases the timeliness of
forecast of Pangu-Weather on two surface variables,√ namely, short-range weather forecast which is important in warning
2m temperature (T2M) and 10m wind speed ( u2 + v 2 ).
about short-term extreme weathers, e.g., cloudbursts. (ii) It
As can be seen, Pangu-Weather produces high-resolution
enables large-member ensemble forecast which is important
forecasts that are (i) very close to the ERA5 ground-truth
for meteorologists to pay attentions to the sensitive weather
(also refer to the previous part for quantitative results) and
factors or variables.
(ii) sufficient to preserve most of small-scale structures of
The inference speed of Pangu-Weather is comparable to
surface variables.
that of FourCastNet [14], implying that using holistic 3D
deep neural networks for inference is slightly more costly
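For concreteness, the greedy decomposition behind these call counts can be written in a few lines. The sketch below assumes the four base lead times used by Pangu-Weather (24, 6, 3, and 1 hours); the function name and interface are our own illustrative choices.

```python
def greedy_schedule(horizon_hours, lead_times=(24, 6, 3, 1)):
    """Decompose a forecast horizon into the fewest model calls by always
    taking the largest base lead time that still fits (greedy strategy)."""
    schedule, remaining = [], horizon_hours
    for lt in sorted(lead_times, reverse=True):
        while remaining >= lt:
            schedule.append(lt)
            remaining -= lt
    assert remaining == 0, "the horizon must be an integer number of hours"
    return schedule

print(greedy_schedule(72))  # [24, 24, 24]: 3 model calls
print(greedy_schedule(71))  # [24, 24, 6, 6, 6, 3, 1, 1]: 8 model calls
```

This decomposition explains the saw-tooth behavior of the hourly curves in Figure 8: horizons that decompose into few calls (e.g., multiples of 24 hours) accumulate less error than neighboring horizons that require many calls.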
4.1.5 Computational Costs

A clear advantage of AI-based methods lies in the inference speed. FourCastNet [14] claimed a 45,000× speedup over the traditional NWP method, and Pangu-Weather is comparable with FourCastNet (see the next paragraph). Considering the advantage in forecast accuracy, Pangu-Weather has the potential of replacing conventional NWP, enabling real-time weather forecast to be performed at any time (e.g., once a second), rather than the current status in which weather forecast is performed merely a few times per day. A few side benefits are expected. (i) It largely increases the timeliness of short-range weather forecast, which is important for warning about short-term extreme weather, e.g., cloudbursts. (ii) It enables large-member ensemble forecast, which is important for meteorologists who need to pay attention to sensitive weather factors or variables.

The inference speed of Pangu-Weather is comparable to that of FourCastNet [14], implying that using holistic 3D deep neural networks for inference is slightly more costly than using 2D counterparts, yet the accuracy is much higher. In a system-level comparison, FourCastNet requires 280ms for inferring a 24-hour forecast on an NVIDIA Tesla-A100 GPU (312 TeraFLOPS), while Pangu-Weather needs 1,400ms on an NVIDIA Tesla-V100 GPU (120 TeraFLOPS). Taking GPU performance into consideration, Pangu-Weather is about 50% slower than FourCastNet, while still being one of the fastest systems for high-resolution, global weather forecast.

To further accelerate Pangu-Weather, we can train models with larger lead times (e.g., 72 hours) so as to reduce the number of temporal aggregations. We expect to explore this direction in the near future.

4.2 Results on Extreme Weather Events

Extreme weather forecast plays a vital role in global weather forecast. Despite their rare occurrence, extreme weather events like hurricanes can bring tremendous casualties and economic loss. Therefore, it is expected that weather forecast systems can warn about upcoming extreme weather events to complement daily weather reports.
Fig. 8: Hourly forecast results for two upper-air variables (500hPa geopotential, Z500, and 500hPa specific humidity, Q500)
and two surface variables (2m temperature, T2M, and u-component of 10m wind speed, U10). The forecast time ranges
from 1 hour to 7 days (168 hours), and both latitude-weighted RMSE (lower is better) and ACC (higher is better) are
reported. The input time points are chosen from the 2018 data – since no comparison is made, we only exclude the data on
January 1st due to the overlaps with the training set.

Fig. 9: Plots of RQE values with respect to forecast time for two upper-air variables (500hPa u-component of wind speed,
U500, and 500hPa specific humidity, Q500) and one surface variable (u-component of 10m wind speed, U10).
In this subsection, we investigate the ability of Pangu-Weather in forecasting extreme weather events and compare this ability to that of conventional NWP methods, i.e., operational IFS. We first introduce a quantitative metric named relative quantile error to measure the overall tendency in Section 4.2.1, and then study a special and important case, i.e., tracking tropical cyclones, in Section 4.2.2.

4.2.1 Overall Tendency in Predictions of Extremes

We use a similar approach to [51] to compare the values of top-level quantiles calculated on the forecast results and the ground-truth. Mathematically, we set D = 50 percentiles, denoted as q_1, ..., q_D. Following FourCastNet [14], we set q_1 = 90%, q_D = 99.99%, and the intermediate ones are linearly distributed between q_1 and q_D in the logarithmic scale. Then, the corresponding quantiles, denoted as Q_1, ..., Q_D, are computed individually for each pair of weather variable and forecast time, e.g., for all 3-day forecasts of U10, pixel-wise values are gathered from all frames for statistics. Finally, the relative quantile error (RQE) is used for measuring the difference between the ground-truth and any weather forecast system:

    RQE = \sum_{d=1}^{D} (\hat{Q}_d - Q_d) / Q_d,    (4)

where Q_d and \hat{Q}_d are the d-th quantiles calculated on the ERA5 ground-truth and on the system being investigated (e.g., Pangu-Weather), respectively. RQE measures the overall tendency: RQE < 0 and RQE > 0 imply that the forecast system tends to underestimate and overestimate the intensity of extremes, respectively.

In Figure 9, we plot the RQE values for two upper-air variables (U500 and Q500) and one surface variable (U10) with respect to forecast time. Pangu-Weather is compared to both operational IFS and FourCastNet (the latter only for U10). As seen, all three methods tend to underestimate extremes, i.e., the RQE values are consistently smaller than 0. The absolute RQE values reported by the AI-based methods generally grow (i.e., heavier underestimation) with forecast time, while that of operational IFS remains mostly unchanged. We attribute this observation to the cumulative forecast errors of AI-based methods – compared to FourCastNet, Pangu-Weather significantly alleviates such errors with hierarchical temporal aggregation (see Section 3.4). Compared to operational IFS, Pangu-Weather shows higher absolute RQE values (i.e., heavier underestimation) for U500 and lower absolute RQE values (i.e., lighter underestimation) for Q500. Regarding U10, Pangu-Weather is much better than operational IFS for up to 3 days (72 hours) and then becomes slightly worse due to cumulative forecast errors.
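For completeness, a minimal sketch of how RQE in Eq. (4) can be computed for one variable and one forecast time is given below. It assumes the forecast and ground-truth fields are flattened into arrays of pixel-wise values gathered over all test frames; we read "linearly distributed in the logarithmic scale" as log-spacing the percentile levels themselves, which is one plausible interpretation of the protocol.

```python
import numpy as np

def relative_quantile_error(forecast, ground_truth, num_quantiles=50):
    """RQE for one variable and one forecast time (Eq. 4).

    `forecast` and `ground_truth` are 1D arrays of pixel-wise values collected
    over all test frames. Percentile levels run from 90% to 99.99%, spaced
    evenly in log scale (one reading of the FourCastNet protocol)."""
    q = np.logspace(np.log10(90.0), np.log10(99.99), num_quantiles)
    q_hat = np.percentile(forecast, q)      # quantiles of the forecast system
    q_ref = np.percentile(ground_truth, q)  # quantiles of the ERA5 ground-truth
    return float(np.sum((q_hat - q_ref) / q_ref))  # negative: extremes underestimated
```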
4.2.2 Tracking Tropical Cyclones

We study a special case of extreme weather forecast, namely, tracking tropical cyclones. Note that we follow the convention of focusing on forecasting the eye of tropical cyclones rather than the intensity¹¹. Hence, in this part, we report the averaged distance between the ground-truth and predicted cyclone eyes.

11. Due to the limited resolution of EDA systems, reanalysis data like ERA5 always underestimate cyclone intensity (e.g., minimum pressure and maximum wind speed) significantly [52], [53], [54]. Trained on ERA5, it is difficult for Pangu-Weather to forecast the intensity accurately (e.g., the predicted minimum pressure is often 50hPa higher than the ground-truth), while the path tracking accuracy is reasonable (see the later results). In the future, if higher-resolution, unbiased weather data (especially tropical cyclone data) are provided, it is very likely that Pangu-Weather can be directly trained or fine-tuned on these data for more accurate intensity forecast.

Fig. 10: The comparison of mean direct position errors of tropical cyclone tracking between Pangu-Weather and ECMWF-HRES, where the results are obtained by averaging over 88 named tropical cyclones in 2018. We show the overall results (top), the results with respect to different basins, and the results with respect to different intensities (bottom).

Fig. 11: Left: the tracking of cyclone eyes for Hurricane Michael (2018-13; track forecast starting from 2018-10-08 00UTC) and Typhoon Ma-on (2022-09; track forecast starting from 2022-08-22 12UTC) by Pangu-Weather and ECMWF-HRES, with a comparison to the ground-truth (by IBTrACS). Right: an illustration of the tracking process, using Pangu-Weather as an example. It locates the cyclone eye by checking four variables (from the forecast results), namely, mean sea level pressure, 10m wind speed, the thickness between 850hPa and 200hPa, and the 850hPa vorticity. The displayed maps correspond to the forecast of these variables at a forecast time of 72 hours, and the forecast of the cyclone eye is indicated by the tail of the arrows.


Fig. 12: The dynamic tracking results of cyclone eyes for Typhoon Kong-rey (2018-25) by Pangu-Weather and ECMWF-HRES, with a comparison to the ground-truth (by IBTrACS). We show six time points, the first being 12:00 UTC, September 29th, 2018, with a gap of 12 hours between neighboring points. The historical (observed) path of cyclone eyes is shown as a dashed line. Note the significant difference between Pangu-Weather and ECMWF-HRES (Pangu-Weather is significantly better) at the middle four time points.
To track the eye of a tropical cyclone, we follow [55] to find the local minimum of mean sea level pressure (MSLP). Specifically, we follow [54] to set the lead time to be 6 hours. Given the starting time point and the corresponding (initial) position of a cyclone eye, we iteratively call for forecasting the weather states 6 hours later and look for a local minimum of MSLP that satisfies the following conditions:

• There is a maximum of 850hPa relative vorticity that is larger than 5 × 10⁻⁵ within a radius of 278km for the Northern Hemisphere, or a minimum that is smaller than −5 × 10⁻⁵ for the Southern Hemisphere.
• There is a maximum of thickness between 850hPa and 200hPa within a radius of 278km when the cyclone is extratropical.
• The maximum 10m wind speed is larger than 8m/s within a radius of 278km when the cyclone is on land.

Once the cyclone eye is located, the tracking algorithm continues to find the next position within a vicinity of 445km. The tracking algorithm terminates when no local minimum of MSLP satisfying the above conditions can be found.
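The sketch below illustrates one step of this search on a latitude-longitude grid. It is a simplified reading of the procedure rather than the exact implementation: the dictionary of derived fields is our own convention, the thickness criterion for extratropical stages is omitted, the qualifying grid point with the lowest MSLP is taken instead of an explicit local-minimum test, and the negative sign on the Southern-Hemisphere vorticity threshold reflects our interpretation of the condition above. The full tracker repeatedly advances the forecast by 6 hours (one call of the 6-hour model), recomputes the derived fields, and applies this step until no eye is returned.

```python
import numpy as np

EARTH_RADIUS_KM = 6371.0

def great_circle_km(lat1, lon1, lat2, lon2):
    """Haversine distance in kilometres; inputs in degrees, broadcasts over arrays."""
    p1, p2 = np.radians(lat1), np.radians(lat2)
    a = (np.sin((p2 - p1) / 2) ** 2
         + np.cos(p1) * np.cos(p2) * np.sin(np.radians(lon2 - lon1) / 2) ** 2)
    return 2 * EARTH_RADIUS_KM * np.arcsin(np.sqrt(a))

def next_eye(fields, lat, lon, prev_eye, northern=True):
    """One 6-hour tracking step (simplified, written for clarity rather than speed).

    `fields` holds 2D arrays on the lat-lon grid derived from one forecast step:
    'mslp', 'vort850', 'wind10' and a boolean 'land' mask.  Returns (lat, lon)
    of the new eye, or None if no qualifying point exists (tracking stops)."""
    lat2d, lon2d = np.meshgrid(lat, lon, indexing="ij")
    in_search = great_circle_km(lat2d, lon2d, prev_eye[0], prev_eye[1]) <= 445.0

    best, best_mslp = None, np.inf
    for i, j in zip(*np.where(in_search)):
        if fields["mslp"][i, j] >= best_mslp:
            continue  # cannot beat the current candidate anyway
        near = great_circle_km(lat2d, lon2d, lat[i], lon[j]) <= 278.0
        vort = fields["vort850"][near]
        ok = vort.max() > 5e-5 if northern else vort.min() < -5e-5
        if ok and fields["land"][i, j]:
            ok = fields["wind10"][near].max() > 8.0
        if ok:
            best, best_mslp = (float(lat[i]), float(lon[j])), fields["mslp"][i, j]
    return best
```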
We refer to the International Best Track Archive for Climate Stewardship (IBTrACS) project [56], [57], which contains the best available estimations of tropical cyclones. We directly apply the above tracking algorithm to the deterministic forecast results of Pangu-Weather. We compare the tracking results to ECMWF-HRES, a strong competitor for cyclone tracking based on high-resolution (9km × 9km) operational weather forecast – clearly, a higher-resolution forecast is more accurate in locating tropical cyclone eyes. The ECMWF-HRES forecasts of cyclone eyes are directly downloaded from the TIGGE archive [16]. For a fair comparison, we choose the tropical cyclones in 2018 (the year of the above quantitative study of deterministic forecasts) that appeared in both the IBTrACS project and the ECMWF-HRES forecasts. This results in a dataset (which we call TC2018) with 88 named tropical cyclones.

We quantitatively compare the forecast accuracy of Pangu-Weather and ECMWF-HRES on TC2018. The 3-day and 5-day mean direct position errors (for cyclone eyes) of Pangu-Weather are 120.29km and 195.65km, respectively, which are significantly smaller than the 162.28km and 272.10km reported by ECMWF-HRES. Figure 10 plots the mean direct position errors with respect to forecast time. One can see a clear advantage of Pangu-Weather over ECMWF-HRES over the entire dataset and within subsets of different basins or different intensities. Inheriting the property of the deterministic forecast, the advantage becomes more significant when the forecast time gets larger.
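For reference, the direct position error at a given forecast time is the great-circle distance between the predicted and best-track eye positions, averaged over all cyclone samples. The small sketch below reuses the great_circle_km helper from the tracking sketch above; the array layout (N samples × [lat, lon]) is our own convention.

```python
import numpy as np

def mean_direct_position_error(pred_eyes, true_eyes):
    """Mean great-circle distance (km) between predicted and best-track eyes."""
    pred, true = np.asarray(pred_eyes, float), np.asarray(true_eyes, float)
    return float(np.mean(great_circle_km(pred[:, 0], pred[:, 1],
                                         true[:, 0], true[:, 1])))
```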
The tracking results of some representative cases are shown in Figures 1 and 11. We study four representative cases, three in 2018 and one in 2022. For Michael and Ma-on, we set the starting time point to be the earliest one in the ECMWF-HRES forecast, while for Kong-rey and Yutu, the starting points are postponed for a few days for better visualization. We use Pangu-Weather to forecast the entire cyclone path (i.e., until the cyclone dissipates), and compare the tracking results to ECMWF-HRES and the ground-truth. Again, Pangu-Weather produces much more accurate tracking results than ECMWF-HRES, and the advantage becomes larger as the forecast time increases. Below, we analyze these cases one by one.

• Typhoon Kong-rey (2018-25)¹² is one of the most powerful tropical cyclones worldwide in 2018. It caused 4 fatalities and $171.5 million in damage. As shown in Figure 1, ECMWF-HRES forecasts that Kong-rey would land on China, but it actually did not. Pangu-Weather, instead, produces accurate tracking results which almost coincide with the ground-truth. Also, Figure 12 shows the tracking results of Pangu-Weather and ECMWF-HRES at different time points – the forecast of Pangu-Weather barely changes with time, yet ECMWF-HRES arrives at the conclusion that Kong-rey would not land on China more than 48 hours later than Pangu-Weather.

• Typhoon Yutu (2018-26)¹³ is an extremely powerful tropical cyclone that caused catastrophic destruction in the Mariana Islands and the Philippines. It also ties Kong-rey as the most powerful tropical cyclone worldwide in 2018, resulting in 30 fatalities and $854.1 million in damage. As shown in Figure 1, Pangu-Weather makes the correct forecast (Yutu goes to the Philippines) as early as 6 days before landing, while the forecast of ECMWF-HRES is dramatically incorrect (Yutu makes a big turn and heads to the northeast). ECMWF-HRES forecasts the correct direction more than 48 hours later than Pangu-Weather.

• Hurricane Michael (2018-13)¹⁴ is the strongest hurricane of the 2018 Atlantic hurricane season. Michael became a Category-5 hurricane and landed on Florida on October 10th, 2018, resulting in 74 fatalities and $25.5 billion in damage. As shown in Figure 11, with a starting time that is more than 3 days earlier than landing, both Pangu-Weather and ECMWF-HRES forecast the landfall on Florida, but the delay of the predicted landing time is only 3 hours for Pangu-Weather versus 18 hours for ECMWF-HRES. In addition, Pangu-Weather shows great advantages in tracking Michael after it landed, while the tracking of ECMWF-HRES is much shorter and obviously shifts to the east.

• Typhoon Ma-on (2022-09)¹⁵ is a severe tropical storm that impacted the Philippines and China. Ma-on landed over Maconacon, Philippines on August 23rd and made a second landfall over Maoming, China on August 25th, resulting in 7 fatalities and $9.13 million in damage. As shown in Figure 11, when the starting time point is about 3 days earlier than landing, ECMWF-HRES produces a wrong forecast that Ma-on would land on Zhuhai, China, while the forecast of Pangu-Weather is about right.

12. https://en.wikipedia.org/wiki/Typhoon_Kong-rey_(2018)
13. https://en.wikipedia.org/wiki/Typhoon_Yutu
14. https://en.wikipedia.org/wiki/Hurricane_Michael
15. https://en.wikipedia.org/wiki/Tropical_Storm_Ma-on_(2022)

Fig. 13: Comparison of control forecast and ensemble forecast results of Pangu-Weather. We display the latitude-weighted
RMSE (lower is better) and ACC (higher is better) for two upper-air variables (500hPa geopotential, Z500, and 500hPa
specific humidity, Q500) and two surface variables (2m temperature, T2M, and u-component of 10m wind speed, U10).

The much better tracking results are directly owed to the higher deterministic forecast accuracy of Pangu-Weather. In the right part of Figure 11, we show how Pangu-Weather tracks Hurricane Michael and Typhoon Ma-on following the specified tracking algorithm. Among the four variables, mean sea level pressure and 10m wind speed are directly produced by the deterministic forecast, and the thickness and vorticity are mainly derived from geopotential and wind speed. This indicates that Pangu-Weather can produce intermediate results that support cyclone tracking, which further assists meteorologists in understanding and exploiting the tracking results.

In summary, the advantages of Pangu-Weather in tracking tropical cyclone eyes are mainly inherited from the good practice of deterministic forecast, in particular, the forecast of mean sea level pressure that is critical for locating cyclone eyes. Arguably, the proposed 3DEST architecture incorporates 3D information to improve the accuracy of important variables such as geopotential, wind speed, and mean sea level pressure. On the other hand, we notice that Pangu-Weather still heavily underestimates the intensity of tropical cyclones, arguably due to the same weakness of the ERA5 data. In the future, we expect that higher-resolution, more accurate training data can be established for fine-tuning Pangu-Weather for these specific scenarios of extreme weather forecast. We also welcome meteorologists to offer expertise to improve the forecast of intensity.

4.3 Ensemble Forecast

A core goal of ensemble weather forecast is to investigate the uncertainty of forecast systems, i.e., the change of forecast results with respect to small perturbations. Researchers have mainly considered two types of uncertainty, introduced by (i) the initial weather states and (ii) the model parameters. The purpose is to diagnose errors in observed or reanalyzed data (e.g., caused by assimilation [35]) and biases of both NWP and AI-based models (e.g., biasing towards false patterns). The methodology of ensemble forecast mainly involves adding noise to either the initial weather states or the model parameters and observing the change of forecast results. Either of them requires performing the inference multiple times. Pangu-Weather, as an AI-based method, enjoys a much faster inference speed than conventional NWP methods, e.g., more than 10,000× faster than operational IFS. This offers an opportunity of performing large-member ensemble forecast at relatively low computational costs. In what follows, we offer a preliminary study of ensemble forecast based on Pangu-Weather, yet we believe that meteorologists can offer professional knowledge to further utilize the ability of ensemble forecast.

In this paper, we mainly investigate the first line, which adds perturbations to the initial weather states. For simplicity, we follow FourCastNet [14] to set the perturbations to be random Perlin noise, while we believe that richer meteorological knowledge can assist us in developing more advanced ensemble methods (e.g., based on singular vectors [58]).
Mathematically, let the initial weather state be A*_t, and we randomly generate S = 99 Perlin noise vectors of the same size as A*_t, denoted as P_1, ..., P_S. The initial states are perturbed into A*_t + ηP_s, where η = 0.2 is the coefficient that controls the noise amplitude and s = 0, 1, ..., S, with s = 0 implying that no noise is added, i.e., P_0 ≡ 0. We feed all the S + 1 initial states to the trained model and average the outputs as the final ensemble forecast result.
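This procedure can be summarized in a few lines. In the sketch below, `model` stands for the trained network rolled out to the target lead time (including the hierarchical temporal aggregation) and `perlin_noise` for any generator of Perlin noise with the same shape as the state; both names are placeholders rather than a released interface, while η = 0.2 and S = 99 follow the setting above.

```python
import numpy as np

def ensemble_forecast(model, initial_state, perlin_noise, S=99, eta=0.2):
    """Average the forecasts started from one unperturbed and S perturbed states."""
    members = []
    for s in range(S + 1):
        # s = 0 keeps the original analysis; s >= 1 adds a scaled Perlin noise field.
        noise = 0.0 if s == 0 else perlin_noise(initial_state.shape, seed=s)
        members.append(model(initial_state + eta * noise))
    return np.mean(members, axis=0)  # the ensemble mean reported in Figure 13
```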
Experiments are still performed on the ERA5 weather data in 2018, where the deterministic forecast is taken as a natural baseline. As shown in Figure 13, the accuracy of the 100-member ensemble forecast is slightly worse than the single-member deterministic forecast in short-range (e.g., 1-day) weather forecast, but is significantly higher than the deterministic forecast when the forecast time is longer than 5 days. This aligns with our intuition and the observations of prior work [14], indicating that large-member ensemble forecast is especially useful when single-model accuracy becomes lower, yet it risks introducing unexpected noise that may cause an accuracy drop when the deterministic forecast is accurate enough. In addition, ensemble forecast brings more benefits to non-smooth variables such as 500hPa specific humidity (Q500) and 10m surface wind speed (U10); e.g., the latitude-weighted RMSEs of the 7-day forecast for Z500 and U10 are reduced from 500.3 and 3.48 to 450.6 and 2.96, with relative drops of 10% and 15%, respectively.

Temporarily, we do not study the other line (i.e., adding perturbations to model parameters) because Pangu-Weather is based on deep neural networks that contain hundreds of millions of parameters, unlike conventional NWP models [6], [7] that contain much fewer yet physically meaningful parameters. In addition, the learned parameters are highly sensitive to a few random factors during the training procedure (e.g., random seeds, data sampling strategies, etc.) and are thus difficult to perturb for specific purposes. In the future, with guidance from meteorologists, we expect to fine-tune the base Pangu-Weather models into a series of 'child models' for manipulating different factors.

5 CONCLUSIONS AND FUTURE REMARKS

In this paper, we present Pangu-Weather, an AI-based system for numerical weather forecast. The technical contribution involves (i) designing the 3D Earth-specific transformer (3DEST) architecture and (ii) applying the hierarchical temporal aggregation strategy. By training deep neural networks on 39 years of global weather data, Pangu-Weather, for the first time, surpasses conventional NWP methods in terms of both accuracy and speed. Being efficient in inference, Pangu-Weather opens a window for meteorologists to integrate their knowledge into AI-based methods for more exciting applications.

Looking into the future, we expect that computational resource is the key to further improving the accuracy of weather forecast. According to our experiments, the training procedure has not yet arrived at full convergence, and there is much room left in terms of (i) incorporating more observation factors, (ii) integrating the time dimension and training 4D deep networks, and (iii) simply using deeper and/or wider networks. All of these require more powerful GPUs with larger memory and higher FLOPs.

ACKNOWLEDGMENTS

We would like to thank ECMWF for offering the ERA5 dataset and the TIGGE archive. Without such selfless dedication, this research would never become possible. We thank the NOAA National Centers for Environmental Information for the IBTrACS dataset. We thank other members of the Pangu team for instructive discussions and support with computational resources. Our appreciation also goes to the Integration Verification team of Huawei Cloud EI, which offers us a platform of high-performance parallel computing.

REFERENCES

[1] P. Bauer, A. Thorpe, and G. Brunet, "The quiet revolution of numerical weather prediction," Nature, vol. 525, no. 7567, pp. 47–55, 2015.
[2] W. C. Skamarock, J. B. Klemp, J. Dudhia, D. O. Gill, D. M. Barker, W. Wang, and J. G. Powers, "A description of the advanced research WRF version 2," National Center for Atmospheric Research, Boulder, CO, Mesoscale and Microscale …, Tech. Rep., 2005.
[3] F. Molteni, R. Buizza, T. N. Palmer, and T. Petroliagis, "The ECMWF ensemble prediction system: Methodology and validation," Quarterly Journal of the Royal Meteorological Society, vol. 122, no. 529, pp. 73–119, 1996.
[4] H. Ritchie, C. Temperton, A. Simmons, M. Hortal, T. Davies, D. Dent, and M. Hamrud, "Implementation of the semi-Lagrangian method in a high-resolution version of the ECMWF forecast model," Monthly Weather Review, vol. 123, no. 2, pp. 489–514, 1995.
[5] P. Bauer, T. Quintino, N. Wedi, A. Bonanni, M. Chrust, W. Deconinck, M. Diamantakis, P. Düben, S. English, J. Flemming et al., The ECMWF Scalability Programme: Progress and Plans. European Centre for Medium-Range Weather Forecasts, 2020.
[6] T. Palmer, G. Shutts, R. Hagedorn, F. Doblas-Reyes, T. Jung, and M. Leutbecher, "Representing model uncertainty in weather and climate prediction," Annual Review of Earth and Planetary Sciences, vol. 33, no. 1, pp. 163–193, 2005.
[7] M. R. Allen, J. Kettleborough, and D. Stainforth, "Model error in weather and climate forecasting," in ECMWF Predictability of Weather and Climate Seminar. European Centre for Medium-Range Weather Forecasts, Reading, UK, 2002, pp. 279–304.
[8] M. G. Schultz, C. Betancourt, B. Gong, F. Kleinert, M. Langguth, L. H. Leufen, A. Mozaffari, and S. Stadtler, "Can deep learning beat numerical weather prediction?" Philosophical Transactions of the Royal Society A, vol. 379, no. 2194, p. 20200097, 2021.
[9] S. Scher and G. Messori, "Weather and climate forecasting with neural networks: using general circulation models (GCMs) with different complexity as a study ground," Geoscientific Model Development, vol. 12, no. 7, pp. 2797–2809, 2019.
[10] S. Rasp, P. D. Dueben, S. Scher, J. A. Weyn, S. Mouatadid, and N. Thuerey, "WeatherBench: a benchmark data set for data-driven weather forecasting," Journal of Advances in Modeling Earth Systems, vol. 12, no. 11, p. e2020MS002203, 2020.
[11] J. A. Weyn, D. R. Durran, and R. Caruana, "Can machines learn to predict weather? Using deep learning to predict gridded 500-hPa geopotential height from historical weather data," Journal of Advances in Modeling Earth Systems, vol. 11, no. 8, pp. 2680–2693, 2019.
[12] J. A. Weyn, D. R. Durran, R. Caruana, and N. Cresswell-Clay, "Sub-seasonal forecasting with a large ensemble of deep-learning weather prediction models," Journal of Advances in Modeling Earth Systems, vol. 13, no. 7, p. e2021MS002502, 2021.
[13] R. Keisler, "Forecasting global weather with graph neural networks," arXiv preprint arXiv:2202.07575, 2022.
[14] J. Pathak, S. Subramanian, P. Harrington, S. Raja, A. Chattopadhyay, M. Mardani, T. Kurth, D. Hall, Z. Li, K. Azizzadenesheli et al., "FourCastNet: A global data-driven high-resolution weather model using adaptive Fourier neural operators," arXiv preprint arXiv:2202.11214, 2022.
[15] Y. Hu, L. Chen, Z. Wang, and H. Li, "SwinVRNN: A data-driven ensemble forecasting model via learned distribution perturbation," arXiv preprint arXiv:2205.13158, 2022.
[16] P. Bougeault, Z. Toth, C. Bishop, B. Brown, D. Burridge, D. H. Chen, B. Ebert, M. Fuentes, T. M. Hamill, K. Mylne et al., "The THORPEX interactive grand global ensemble," Bulletin of the American Meteorological Society, vol. 91, no. 8, pp. 1059–1072, 2010.
[17] K. He, H. Fan, Y. Wu, S. Xie, and R. Girshick, "Momentum contrast for unsupervised visual representation learning," in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2020, pp. 9729–9738.
[18] H. Bao, L. Dong, and F. Wei, "BEiT: BERT pre-training of image transformers," arXiv preprint arXiv:2106.08254, 2021.
[19] J. Devlin, M.-W. Chang, K. Lee, and K. Toutanova, "BERT: Pre-training of deep bidirectional transformers for language understanding," arXiv preprint arXiv:1810.04805, 2018.
[20] T. Brown, B. Mann, N. Ryder, M. Subbiah, J. D. Kaplan, P. Dhariwal, A. Neelakantan, P. Shyam, G. Sastry, A. Askell et al., "Language models are few-shot learners," Advances in Neural Information Processing Systems, vol. 33, pp. 1877–1901, 2020.
[21] A. Radford, J. W. Kim, C. Hallacy, A. Ramesh, G. Goh, S. Agarwal, G. Sastry, A. Askell, P. Mishkin, J. Clark et al., "Learning transferable visual models from natural language supervision," in International Conference on Machine Learning. PMLR, 2021, pp. 8748–8763.
[22] A. K. Betts, D. Z. Chan, and R. L. Desjardins, "Near-surface biases in ERA5 over the Canadian Prairies," Frontiers in Environmental Science, vol. 7, p. 129, 2019.
[23] Q. Jiang, W. Li, Z. Fan, X. He, W. Sun, S. Chen, J. Wen, J. Gao, and J. Wang, "Evaluation of the ERA5 reanalysis precipitation dataset over Chinese mainland," Journal of Hydrology, vol. 595, p. 125660, 2021.
[24] H. Hersbach, B. Bell, P. Berrisford, S. Hirahara, A. Horányi, J. Muñoz-Sabater, J. Nicolas, C. Peubey, R. Radu, D. Schepers et al., "The ERA5 global reanalysis," Quarterly Journal of the Royal Meteorological Society, vol. 146, no. 730, pp. 1999–2049, 2020.
[25] P. Lynch, "The origins of computer weather prediction and climate modeling," Journal of Computational Physics, vol. 227, no. 7, pp. 3431–3444, 2008.
[26] E. Kalnay, Atmospheric Modeling, Data Assimilation and Predictability. Cambridge University Press, 2003.
[27] D. J. Stensrud, Parameterization Schemes: Keys to Understanding Numerical Weather Prediction Models. Cambridge University Press, 2009.
[28] D. Randall, M. Khairoutdinov, A. Arakawa, and W. Grabowski, "Breaking the cloud parameterization deadlock," Bulletin of the American Meteorological Society, vol. 84, no. 11, pp. 1547–1564, 2003.
[29] R. Pincus, H. W. Barker, and J.-J. Morcrette, "A fast, flexible, approximate technique for computing radiative transfer in inhomogeneous cloud fields," Journal of Geophysical Research: Atmospheres, vol. 108, no. D13, 2003.
[30] A. Arakawa, "The cumulus parameterization problem: Past, present, and future," Journal of Climate, vol. 17, no. 13, pp. 2493–2525, 2004.
[31] A. Betts, "Non-precipitating cumulus convection and its parameterization," Quarterly Journal of the Royal Meteorological Society, vol. 99, no. 419, pp. 178–196, 1973.
[32] H.-L. Kuo, "Further studies of the parameterization of the influence of cumulus convection on large-scale flow," Journal of Atmospheric Sciences, vol. 31, no. 5, pp. 1232–1240, 1974.
[33] T. Nakaegawa, "High-performance computing in meteorology under a context of an era of graphical processing units," Computers, vol. 11, no. 7, p. 114, 2022.
[34] H. Olafsson and J.-W. Bao, Uncertainties in Numerical Weather Prediction. Elsevier, 2020.
[35] I. M. Navon, "Data assimilation for numerical weather prediction: a review," Data Assimilation for Atmospheric, Oceanic and Hydrologic Applications, pp. 21–65, 2009.
[36] Y. LeCun, Y. Bengio, and G. Hinton, "Deep learning," Nature, vol. 521, no. 7553, pp. 436–444, 2015.
[37] A. Krizhevsky, I. Sutskever, and G. E. Hinton, "ImageNet classification with deep convolutional neural networks," in Advances in Neural Information Processing Systems, 2012.
[38] K. He, X. Zhang, S. Ren, and J. Sun, "Deep residual learning for image recognition," in Computer Vision and Pattern Recognition, 2016.
[39] A. Vaswani, N. Shazeer, N. Parmar, J. Uszkoreit, L. Jones, A. N. Gomez, Ł. Kaiser, and I. Polosukhin, "Attention is all you need," Advances in Neural Information Processing Systems, vol. 30, 2017.
[40] A. Dosovitskiy, L. Beyer, A. Kolesnikov, D. Weissenborn, X. Zhai, T. Unterthiner, M. Dehghani, M. Minderer, G. Heigold, S. Gelly et al., "An image is worth 16x16 words: Transformers for image recognition at scale," arXiv preprint arXiv:2010.11929, 2020.
[41] Z. Liu, Y. Lin, Y. Cao, H. Hu, Y. Wei, Z. Zhang, S. Lin, and B. Guo, "Swin transformer: Hierarchical vision transformer using shifted windows," in Proceedings of the IEEE/CVF International Conference on Computer Vision, 2021, pp. 10012–10022.
[42] X. Shi, Z. Chen, H. Wang, D.-Y. Yeung, W.-K. Wong, and W.-c. Woo, "Convolutional LSTM network: A machine learning approach for precipitation nowcasting," Advances in Neural Information Processing Systems, vol. 28, 2015.
[43] X. Shi, Z. Gao, L. Lausen, H. Wang, D.-Y. Yeung, W.-K. Wong, and W.-c. Woo, "Deep learning for precipitation nowcasting: A benchmark and a new model," Advances in Neural Information Processing Systems, vol. 30, 2017.
[44] S. Agrawal, L. Barrington, C. Bromberg, J. Burge, C. Gazen, and J. Hickey, "Machine learning for precipitation nowcasting from radar images," arXiv preprint arXiv:1912.12132, 2019.
[45] S. Ravuri, K. Lenc, M. Willson, D. Kangin, R. Lam, P. Mirowski, M. Fitzsimons, M. Athanassiadou, S. Kashem, S. Madge et al., "Skilful precipitation nowcasting using deep generative models of radar," Nature, vol. 597, no. 7878, pp. 672–677, 2021.
[46] V. Lebedev, V. Ivashkin, I. Rudenko, A. Ganshin, A. Molchanov, S. Ovcharenko, R. Grokhovetskiy, I. Bushmarinov, and D. Solomentsev, "Precipitation nowcasting with satellite imagery," in Proceedings of the 25th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining, 2019, pp. 2680–2688.
[47] C. K. Sønderby, L. Espeholt, J. Heek, M. Dehghani, A. Oliver, T. Salimans, S. Agrawal, J. Hickey, and N. Kalchbrenner, "MetNet: A neural weather model for precipitation forecasting," arXiv preprint arXiv:2003.12140, 2020.
[48] H. Hersbach, B. Bell, P. Berrisford, G. Biavati, A. Horányi, J. Muñoz Sabater, J. Nicolas, C. Peubey, R. Radu, I. Rozum et al., "ERA5 hourly data on pressure levels from 1979 to present," Copernicus Climate Change Service (C3S) Climate Data Store (CDS), vol. 10, 2018.
[49] ——, "ERA5 hourly data on single levels from 1979 to present," Copernicus Climate Change Service (C3S) Climate Data Store (CDS), vol. 10, 2018.
[50] D. Richardson, J. Bidlot, L. Ferranti, A. Ghelli, C. Gibert, T. Hewson, M. Janousek, F. Prates, and F. Vitart, Verification Statistics and Evaluations of ECMWF Forecasts in 2008-2009. ECMWF, Reading, UK, 2009.
[51] B. Fildier, W. D. Collins, and C. Muller, "Distortions of the rain distribution with warming, with and without self-aggregation," Journal of Advances in Modeling Earth Systems, vol. 13, no. 2, p. e2020MS002256, 2021.
[52] S. Bourdin, S. Fromang, W. Dulac, J. Cattiaux, and F. Chauvin, "Intercomparison of four tropical cyclones detection algorithms on ERA5," EGUsphere, pp. 1–43, 2022.
[53] P. Malakar, A. Kesarkar, J. Bhate, V. Singh, and A. Deshamukhya, "Comparison of reanalysis data sets to comprehend the evolution of tropical cyclones over North Indian Ocean," Earth and Space Science, vol. 7, no. 2, p. e2019EA000978, 2020.
[54] L. Magnusson, S. Majumdar, R. Emerton, D. Richardson, M. Alonso-Balmaseda, C. Baugh, P. Bechtold, J.-R. Bidlot, A. Bonanni, M. Bonavita, N. Bormann, A. Brown, P. Browne, H. Carr, M. Dahoui, G. D. Chiara, M. Diamantakis, D. Duncan, S. English, R. Forbes, A. J. Geer, T. Haiden, S. Healy, T. Hewson, B. Ingleby, M. Janousek, C. Kuehnlein, S. Lang, S.-J. Lock, T. McNally, K. Mogensen, F. Pappenberger, I. Polichtchouk, F. Prates, C. Prudhomme, F. Rabier, P. de Rosnay, T. Quintino, and M. Rennie, "Tropical cyclone activities at ECMWF," no. 888, 10 2021. [Online]. Available: https://www.ecmwf.int/node/20228
[55] P. White, "Newsletter No. 102 - Winter 2004/05," 01 2005. [Online]. Available: https://www.ecmwf.int/node/14623
[56] K. R. Knapp, M. C. Kruk, D. H. Levinson, H. J. Diamond, and C. J. Neumann, "The International Best Track Archive for Climate Stewardship (IBTrACS): unifying tropical cyclone data," Bulletin of the American Meteorological Society, vol. 91, no. 3, pp. 363–376, 2010.
[57] K. R. Knapp, H. J. Diamond, J. P. Kossin, M. C. Kruk, C. Schreck et al., "International Best Track Archive for Climate Stewardship (IBTrACS) project, version 4," NOAA National Centers for Environmental Information, 2018.
[58] E. P. Diaconescu and R. Laprise, "Singular vectors in atmospheric sciences: A review," Earth-Science Reviews, vol. 113, no. 3-4, pp. 161–175, 2012.
