Tutorial Notes: WRF Software
John Michalakes, Dave Gill
NCAR WRF Software Architecture Working Group
• Introduction
• Computing Overview
• Software Overview
• Examples
• Resources
• Implementation of WRF
  – Hierarchical organization
  – Multiple dynamical cores
  – Plug-compatible physics
  – Abstract interfaces (APIs) to external packages
  – Performance-portable
• Designed from the beginning to be adaptable to today's computing environment for NWP

[Figure: layered software hierarchy. Driver layer: top-level control, memory management, nesting, parallelism, external APIs. Mediation layer: ARW solver, NMM solver, physics interfaces. Model layer: plug-compatible physics.]
WRF Supported Platforms

Vendor     | Hardware                               | OS                    | Compiler
Apple      | G5                                     | MacOS                 | IBM
Cray Inc.  | X1, X1e                                | UNICOS                | Cray
Cray Inc.  | XT3/XT4 (Opteron)                      | Linux                 | PGI
HP/Compaq  | Alpha                                  | Tru64                 | Compaq
HP/Compaq  | Itanium-2                              | Linux                 | Intel
HP/Compaq  | Itanium-2                              | HPUX                  | HP
IBM        | Power-3/4/5/5+                         | AIX                   | IBM
IBM        | Blue Gene/L                            | Linux                 | IBM
IBM        | Opteron                                | Linux                 | Pathscale, PGI
NEC        | SX-series                              | Unix                  | Vendor
SGI        | Itanium-2                              | Linux                 | Intel
SGI        | MIPS                                   | IRIX                  | SGI
Sun        | UltraSPARC                             | Solaris               | Sun
various    | Xeon and Athlon; Itanium-2 and Opteron | Linux and Windows CCS | Intel, PGI
Computing Overview

[Figure: three levels of the computing environment. APPLICATION (WRF): patches, tiles, comms. SYSTEM: processes, threads, messages. HARDWARE: processors, nodes, networks.]
Hardware has not changed much…

A computer in 1960 (IBM 7090):
  – 6-way superscalar
  – 36-bit floating point precision
  – ~144 Kbytes of memory
  – ~50,000 flop/s
  – 48-hr, 12-km WRF CONUS in 600 years

A computer in 2002:
  – 4-way superscalar
  – ~5,000,000,000 flop/s
  – 48-hr, 12-km WRF CONUS in 52 hours

At ~100,000,000,000 flop/s:
  – 48-hr, 12-km WRF CONUS in under 15 minutes
Hardware: Processors, Nodes, Networks
• Processor:
  – A device that reads and executes instructions in sequence, performing operations on data that it gets from a memory device and storing results back onto the memory device
• Node: one memory device connected to one or more processors
  – Multiple processors in a node are said to share memory; this is "shared-memory parallelism"
  – They can work together because they can see each other's memory
  – The latency and bandwidth to memory affect performance
• Cluster: multiple nodes connected by a network
  – The processors attached to the memory in one node cannot see the memory of processors on another node
  – For processors on different nodes to work together they must send messages between the nodes; this is "distributed-memory parallelism"
• Network:
  – Devices and wires for sending messages between nodes
  – Bandwidth: a measure of the number of bytes that can be moved in a second
  – Latency: the amount of time it takes before the first byte of a message arrives at its destination (a ping-pong sketch that measures both follows below)
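To make bandwidth and latency concrete, here is a minimal MPI ping-pong sketch in Fortran. It is not part of WRF; the program name and message size are arbitrary choices. Rank 0 bounces a message off rank 1 and times the round trip: with a tiny message the time is dominated by latency, with a large message by bandwidth.

PROGRAM pingpong
  IMPLICIT NONE
  INCLUDE 'mpif.h'
  INTEGER, PARAMETER :: N = 250000        ! REALs per message (~1 MB, arbitrary)
  REAL, DIMENSION(N) :: buf
  INTEGER :: myrank, ierr, i
  INTEGER :: status(MPI_STATUS_SIZE)
  DOUBLE PRECISION :: t0, t1

  CALL MPI_Init( ierr )                   ! run with at least 2 MPI processes
  CALL MPI_Comm_rank( MPI_COMM_WORLD, myrank, ierr )
  buf = 0.0

  t0 = MPI_Wtime()
  DO i = 1, 100                           ! repeat to get a stable timing
    IF ( myrank == 0 ) THEN
      CALL MPI_Send( buf, N, MPI_REAL, 1, 0, MPI_COMM_WORLD, ierr )
      CALL MPI_Recv( buf, N, MPI_REAL, 1, 0, MPI_COMM_WORLD, status, ierr )
    ELSE IF ( myrank == 1 ) THEN
      CALL MPI_Recv( buf, N, MPI_REAL, 0, 0, MPI_COMM_WORLD, status, ierr )
      CALL MPI_Send( buf, N, MPI_REAL, 0, 0, MPI_COMM_WORLD, ierr )
    ENDIF
  END DO
  t1 = MPI_Wtime()

  ! With N near zero this approximates twice the latency;
  ! with large N it approximates 2*N*4 bytes divided by the bandwidth.
  IF ( myrank == 0 ) PRINT *, 'average round trip (s): ', ( t1 - t0 ) / 100.
  CALL MPI_Finalize( ierr )
END PROGRAM pingpong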
System Software

"The only thing one does directly with hardware is pay for it."
• Process:
– A set of instructions to be executed on a processor
– Enough state information to allow process execution to stop on a processor
and be picked up again later, possibly by another processor
• Processes may be lightweight or heavyweight
– Lightweight processes, e.g. shared-memory threads, store very little state;
just enough to stop and then start the process
– Heavyweight processes, e.g. UNIX processes, store a lot more (basically
the memory image of the job)
MPI and OpenMP
• MPI is used to start up and pass messages between multiple heavyweight processes
– The mpirun command controls the number of processes and how they are mapped onto
nodes of the parallel machine
– Calls to MPI routines send and receive messages and control other interactions between
processes
– http://www.mcs.anl.gov/mpi
• OpenMP is used to start up and control threads within each process
– Directives specify which parts of the program are multi-threaded
– OpenMP environment variables determine the number of threads in each process
– http://www.openmp.org
• The total number of threads (the number of MPI processes times the number of threads in each process) usually corresponds to the number of processors
Examples (assuming a machine with 4 nodes and 4 processors per node, 16 processors total)

setenv OMP_NUM_THREADS 4
mpirun -np 4 wrf.exe

setenv OMP_NUM_THREADS 2
mpirun -np 8 wrf.exe

setenv OMP_NUM_THREADS 1
mpirun -np 16 wrf.exe
Examples (cont.)
• Note: since there are 4 nodes, we can never have fewer than 4 MPI processes, because nodes do not share memory

setenv OMP_NUM_THREADS 4
mpirun -np 32

A minimal hybrid MPI + OpenMP program is sketched below.
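To connect these commands to code, here is a minimal hybrid MPI + OpenMP Fortran sketch. It is not WRF code and the program name is made up: each MPI process started by mpirun reports its rank, and each OpenMP thread created under OMP_NUM_THREADS reports its thread number.

PROGRAM hybrid_hello
  USE omp_lib                        ! omp_get_thread_num / omp_get_num_threads
  IMPLICIT NONE
  INCLUDE 'mpif.h'
  INTEGER :: ierr, myrank, nprocs, tid, nthreads

  CALL MPI_Init( ierr )
  CALL MPI_Comm_rank( MPI_COMM_WORLD, myrank, ierr )
  CALL MPI_Comm_size( MPI_COMM_WORLD, nprocs, ierr )

!$OMP PARALLEL PRIVATE(tid, nthreads)
  tid      = omp_get_thread_num()
  nthreads = omp_get_num_threads()   ! set by OMP_NUM_THREADS
  PRINT *, 'process', myrank, 'of', nprocs, ': thread', tid, 'of', nthreads
!$OMP END PARALLEL

  CALL MPI_Finalize( ierr )
END PROGRAM hybrid_hello

Compiled with an MPI Fortran wrapper and the compiler's OpenMP flag, and run with the settings above, it prints one line per process/thread combination.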
Application: WRF

[Figure: logical domain decomposed into patches; one patch divided into multiple tiles; inter-processor communication between patches]

Model domains are decomposed for parallelism on two levels (a sketch of splitting a patch into tiles follows below):
• Patch: section of model domain allocated to a distributed-memory node
• Tile: section of a patch allocated to a shared-memory processor within a node; this is also the scope of a model-layer subroutine
Distributed-memory parallelism is over patches; shared-memory parallelism is over tiles within patches.
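As an illustration of the second level, a patch's j-range can be split into contiguous strips, one tile per OpenMP thread. This is only a hypothetical sketch of the idea (WRF computes its tile lists internally); the patch bounds ips, ipe, jps, jpe and the tile count numtiles are assumed to be set already.

! Hypothetical sketch (not WRF's actual tiling code): divide a patch's j-range
! jps:jpe into numtiles contiguous strips, one tile per OpenMP thread.
INTEGER :: ij, npoints
INTEGER, ALLOCATABLE, DIMENSION(:) :: i_start, i_end, j_start, j_end

ALLOCATE( i_start(numtiles), i_end(numtiles), j_start(numtiles), j_end(numtiles) )
npoints = jpe - jps + 1
DO ij = 1, numtiles
  i_start(ij) = ips                                     ! each tile spans the patch's full i-range
  i_end(ij)   = ipe
  j_start(ij) = jps + ( (ij-1) * npoints ) / numtiles   ! contiguous strips in j
  j_end(ij)   = jps + (  ij    * npoints ) / numtiles - 1
END DO

These i_start/i_end/j_start/j_end arrays play the role of the tile lists used in the mediation-layer loop shown later in these notes.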
Distributed Memory Communications between patches

Example code fragment that requires communication (module_diffusion.F)

[Figure: stencils that reach into neighboring patches, e.g. a 5-point stencil (the + point and its four * neighbors) and a wider stencil; computing the + point near a patch edge needs * values that live on another processor's patch]

A generic sketch of such a stencil computation follows after this list.
• Halo updates
• Periodic boundary updates
• Parallel transposes
• Nesting scatters/gathers
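The actual fragment from module_diffusion.F is not reproduced in these notes; the loop below is a generic, hypothetical stencil that shows why a halo update is needed. The i-1, i+1, j-1, j+1 references reach one point beyond the tile, and near a patch boundary those points belong to a neighboring patch, so their values must be received via a halo exchange before the loop runs. The array names f and df are made up.

! Hypothetical 5-point stencil over a tile (its:ite, jts:jte); not WRF code.
! Arrays are dimensioned with memory (patch + halo) bounds ims:ime, jms:jme,
! so the off-by-one references are legal even at the tile edge, but the values
! there are only correct after a halo update.
DO j = jts, jte
  DO i = its, ite
    df(i,j) = f(i-1,j) + f(i+1,j) + f(i,j-1) + f(i,j+1) - 4.0 * f(i,j)
  END DO
END DO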
Review

[Figure: the decomposed domain, showing distributed-memory parallelism over patches and shared-memory parallelism over tiles]
Software Overview
• Architecture
• Directory structure
• Model Layer Interface
• Data Structures
• I/O
• Registry
WRF Software Architecture

[Figure: hierarchical software architecture: driver layer, mediation layer, and model layer, with the Registry alongside]
• Driver Layer
  – Allocates, stores, and decomposes model domains, represented abstractly as single data objects
  – Contains the top-level time loop and the algorithms for integration over the nest hierarchy
  – Contains the calls to I/O and to the nest forcing and feedback routines supplied by the Mediation Layer
  – Provides top-level, non-package-specific access to communications, I/O, etc.
  – Provides some utilities, for example module_wrf_error, which is used for diagnostic prints and error stops
  (A toy sketch of the driver's time loop follows below.)
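As a rough illustration only, the driver's job can be pictured as a time loop that advances a domain and recurses over its nests. The module, type, and routine names below are invented for illustration and are not those of frame/module_integrate.F.

! Toy, self-contained sketch of the driver-layer idea (time loop plus recursion
! over a nest hierarchy); names and types are hypothetical.
MODULE driver_sketch
  IMPLICIT NONE
  TYPE :: domain_sketch
    INTEGER :: id = 1
    INTEGER :: current_step = 0
    TYPE(domain_sketch), POINTER :: nest => NULL()   ! at most one nest, for simplicity
  END TYPE domain_sketch
CONTAINS
  SUBROUTINE solve_sketch( grid )                    ! stands in for the mediation-layer solve
    TYPE(domain_sketch), INTENT(INOUT) :: grid
    grid%current_step = grid%current_step + 1
  END SUBROUTINE solve_sketch

  RECURSIVE SUBROUTINE integrate_sketch( grid, nsteps )
    TYPE(domain_sketch), INTENT(INOUT) :: grid
    INTEGER, INTENT(IN) :: nsteps
    INTEGER :: step
    DO step = 1, nsteps
      CALL solve_sketch( grid )                      ! advance this domain one step
      IF ( ASSOCIATED( grid%nest ) ) THEN
        CALL integrate_sketch( grid%nest, 3 )        ! e.g. a 3:1 time-step ratio
      ENDIF
    END DO
  END SUBROUTINE integrate_sketch
END MODULE driver_sketch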
• Mediation Layer
  – Provides to the Driver layer:
    • The solve routine, which takes a domain object and advances it one time step
    • I/O routines that the Driver calls when it is time to do some input or output operation on a domain
    • Nest forcing and feedback routines
  – The Mediation Layer, not the Driver, knows the specifics of what needs to be done
  – The sequence of calls to Model Layer routines for doing a time step is known in the solve routine
  – Responsible for dereferencing driver-layer data objects so that individual fields can be passed to Model Layer subroutines
  – Calls to message passing are contained here as part of the solve routine
• Model Layer
– Contains the information about the model itself, with machine architecture and implementation aspects abstracted out and
moved into layers above
– Contains the actual WRF model routines that are written to perform some computation over an arbitrarily sized/shaped
subdomain
– All state data objects are simple types, passed in through argument list
– Model Layer routines don’t know anything about communication or I/O; and they are designed to be executed safely on
one thread – they never contain a PRINT, WRITE, or STOP statement
– These are written to conform to the Model Layer Subroutine Interface (more later) which makes them “tile-callable”
Call Structure superimposed on Architecture

  wrf (main/wrf.F)                                          driver
    integrate (frame/module_integrate.F)                    driver
      solve_interface (share/solve_interface.F)             mediation
        solve_em (dyn_em/solve_em.F)                        mediation
          advance_uv (dyn_em/module_small_step_em.F)        model
          microphysics_driver (phys/module_microphysics_driver.F)
            WSM5 (phys/module_mp_wsm5.F)                    model

Model-layer physics routines such as WSM5 (phys/module_mp_wsm5.F) and KFCPS (phys/module_cu_kf.F) are reached through the corresponding physics drivers, e.g. microphysics_driver.
WRF Model Layer Interface

[Figure: WRF tile-callable subroutines at the model layer, insulated from external packages for message passing, threads, data formats / parallel I/O, and the config module]

• Restrictions on model layer subroutines
• Mediation layer / Model Layer Interface
• Model layer routines are called from the mediation layer in loops over tiles, which are multi-threaded:

    !$OMP PARALLEL DO
    DO ij = 1, numtiles
      its = i_start(ij) ; ite = i_end(ij)
      jts = j_start(ij) ; jte = j_end(ij)
      CALL model_subroutine( arg1, arg2, . . . ,             &
                             ids, ide, jds, jde, kds, kde,   &
                             ims, ime, jms, jme, kms, kme,   &
                             its, ite, jts, jte, kts, kte )
    END DO

• Domain, memory, and run (tile) dimensions are passed into the model layer subroutine (a template is sketched below)
• Patch dimensions
  – Start and end indices of the local distributed-memory subdomain
  – Available from the mediation layer (solve) and the driver layer; not usually needed or used at the model layer
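The slide's template for a model layer subroutine did not survive in these notes; the code below is a hedged reconstruction of the idea, assuming the argument convention shown in the loop above (the real WRF templates differ in detail). State arrays are declared with memory dimensions, loops run over tile dimensions, and domain dimensions are used only for boundary and staggering tests; the computation itself is a placeholder.

! A minimal sketch of a "tile-callable" model layer subroutine; argument names
! follow the convention above, the computation is made up.
SUBROUTINE model_subroutine ( arg1, arg2,                      &
                              ids, ide, jds, jde, kds, kde,    &   ! domain dims
                              ims, ime, jms, jme, kms, kme,    &   ! memory dims
                              its, ite, jts, jte, kts, kte )       ! tile (run) dims
  IMPLICIT NONE
  INTEGER, INTENT(IN) :: ids, ide, jds, jde, kds, kde, &
                         ims, ime, jms, jme, kms, kme, &
                         its, ite, jts, jte, kts, kte
  REAL, DIMENSION(ims:ime, kms:kme, jms:jme), INTENT(INOUT) :: arg1
  REAL, DIMENSION(ims:ime, kms:kme, jms:jme), INTENT(IN)    :: arg2
  INTEGER :: i, j, k

  ! Loops run over the tile; MIN() keeps unstaggered (mass-point) computation
  ! inside the staggered domain, as discussed on the Data Structures slide below.
  DO j = jts, MIN( jte, jde-1 )
    DO k = kts, kte
      DO i = its, MIN( ite, ide-1 )
        arg1(i,k,j) = arg1(i,k,j) + arg2(i,k,j)    ! placeholder computation
      END DO
    END DO
  END DO
END SUBROUTINE model_subroutine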
Data Structures

[Figure: horizontally staggered grid with mass points (m) and velocity points (u, v); ids = 1, ide = 4, jde = 4; u points carry one more column than mass points, v points one more row]

• Computation over mass points runs only ids..ide-1 and jds..jde-1
• Likewise, vertical computation over unstaggered fields runs kds..kde-1
(A small loop-bounds example follows below.)
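A hypothetical illustration of these bounds (not WRF code; the array names u_tend and t_tend are made up): on the staggered grid, a loop over a u field covers the full i-range, while a loop over mass points stops one short.

! Staggered u points exist at i = ids..ide; mass points only at i = ids..ide-1.
DO j = jts, MIN( jte, jde-1 )
  DO i = its, ite                    ! u is staggered in i: loop up to ite (as far as ide)
    u_tend(i,j) = 0.0
  END DO
  DO i = its, MIN( ite, ide-1 )      ! mass points: stop at ide-1
    t_tend(i,j) = 0.0
  END DO
END DO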
What does data look like in WRF?

[Figure: head_grid pointing to a hierarchy of nested domains, numbered 1 through 4]

(A simplified sketch of the underlying derived data type follows below.)
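Each domain is held in a derived data type, and nests are linked to their parent. The real TYPE(domain) lives in frame/module_domain.F and most of its contents (the state arrays) are generated by the Registry; the sketch below is a simplified, hypothetical picture of the linkage only, not the actual declarations.

! Simplified, hypothetical sketch of the nest-hierarchy linkage.
MODULE domain_sketch2
  IMPLICIT NONE
  TYPE :: domain
    INTEGER :: id
    ! Registry-generated state arrays (u, v, t, moist, ...) would appear here,
    ! dimensioned with this domain's memory (patch + halo) bounds.
    TYPE(domain), POINTER               :: parent => NULL()
    TYPE(domain), DIMENSION(:), POINTER :: nests  => NULL()
  END TYPE domain
  TYPE(domain), POINTER :: head_grid => NULL()   ! root of the nest hierarchy
END MODULE domain_sketch2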
Registry
• Why?
– Automates time consuming, repetitive, error-prone programming
– Insulates programmers and code from package dependencies
– Allow rapid development
– Documents the data
[Figure: Registry mechanics. "compile wrf" runs the registry program (tools/registry) on the Registry/Registry file to generate include files (inc/*.incl); CPP incorporates these into the WRF source (*/*.F), which the Fortran90 compiler builds into wrf.exe.]
Registry Data Base

State Entry: Defining Variable-set for an I/O stream

The 'h' and 'i' specifiers may be followed by an optional integer string consisting of '0', '1', '2', '3', '4', and/or '5'. Zero denotes that the variable is part of the principal input or history I/O stream. The characters '1' through '5' denote one of five auxiliary input or history I/O streams.

irh    -- The state variable will be included in the input, restart, and history I/O streams
irh13  -- The state variable has been added to the first and third auxiliary history output streams; it has been removed from the principal history output stream, because zero is not among the integers in the integer string that follows the character 'h'
rh01   -- The state variable has been added to the first auxiliary history output stream; it is also retained in the principal history output
i205hr -- Now the state variable is included in the principal input stream as well as auxiliary inputs 2 and 5. Note that the order of the integers is unimportant. The variable is also in the principal history output stream
ir12h  -- The digits have no effect here; there is only one restart data stream, and the variable is added to it

(An example state entry using these specifiers is sketched below.)
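As an illustration of these strings, a hypothetical Registry state entry might look like the line below. The variable name, description, and exact column spacing are made up for illustration; see Registry/Registry for real entries. The IO string i2rh01 puts the variable in the principal input stream and auxiliary input 2, in the restart stream, and in the principal history stream and auxiliary history 1.

#      type  symbol  dims  use   ntl  stagger  IO      dname   descrip              units
state  real  tsfc    ij    misc  1    -        i2rh01  "TSFC"  "example surface T"  "K"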
Rconfig entry
• Example (a hypothetical entry is sketched below)
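A hypothetical rconfig entry defining a namelist variable; the variable shown, auxinput3_interval_h, is the one used in the namelist example later in these notes, but the default value, column layout, and trailing columns here are illustrative only.

# Hypothetical rconfig entry (some trailing columns omitted):
#        type     symbol                 how set                nentries     default
rconfig  integer  auxinput3_interval_h   namelist,time_control  max_domains  0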
Example: Input periodic SSTs
• Steps
  – Add a new state variable, and the definition of a new surface-layer package that will use the variable, to the Registry (a sketch of such entries follows the Result below)
  – Add the variable to the variable set for an unused auxiliary input stream
  – Adapt the physics interface to pass the new state variable to the physics
  – Set up the namelist to input the file at the desired interval
• Result:
– 2-D variable named nsst defined and available in solve_em
– Dimensions: ims:ime, jms:jme
– Input and output on the AuxInput #3 stream will include the variable under
the name NEW_SST
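The Registry additions for these steps are not reproduced in these notes; the lines below are a hedged sketch of what they might look like, using the names that appear in this example (nsst, NEW_SST, auxiliary input stream 3, and a package selected by sf_sfclay_physics = 4). The column layout, package name, and exact association syntax are illustrative.

# Hypothetical state entry: 2-D (ij) variable nsst on auxiliary input/output
# stream 3, with the external name NEW_SST.
state    real  nsst  ij  misc  1  -  i3  "NEW_SST"  "time-varying SST"  "K"

# Hypothetical package entry: associates the new surface-layer option
# (sf_sfclay_physics == 4) with the state variable nsst.
package  newsfcscheme  sf_sfclay_physics==4  -  state:nsst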
Example: Input periodic SSTs (cont.): adapting the physics interface
      CASE ( SFCLAYSCHEME )
        . . .
      CASE ( NEWSFCSCHEME )     ! <- This is defined by the Registry "package" entry
        IF ( PRESENT( nsst ) ) THEN
          CALL NEWSFCSCHEME(                                                 &
                   nsst,                                                     &
                   ids,ide, jds,jde, kds,kde,                                &
                   ims,ime, jms,jme, kms,kme,                                &
                   i_start(ij),i_end(ij), j_start(ij),j_end(ij), kts,kte )
        ELSE
          CALL wrf_error_fatal('Missing argument for NEWSCHEME in surface driver')
        ENDIF
        . . .
    END SELECT sfclay_select

• Note the PRESENT test to make sure the new optional variable nsst is available
Example: Input periodic SSTs (cont.): namelist settings
&time_control
  . . .
  auxinput3_inname      = "sst_input"
  auxinput3_interval_mo = 0
  auxinput3_interval_d  = 0
  auxinput3_interval_h  = 12
  auxinput3_interval_m  = 0
  auxinput3_interval_s  = 0
  . . .
/

&physics
  sf_sfclay_physics = 4, 4, 4
  . . .
/

• Run the code with the sst_input file in the run directory
Example: Working with WRF Software
• Compute the local sum, the local max, and the local indices of the local maximum
• Compute the global sum, the global max, and the indices of the global max
• On the process that contains the maximum value, obtain the latitude and longitude of that point; on other processes set them to an artificially low value
• Then use a parallel reduction to store that result on every process (a plain-MPI sketch of the whole pattern follows below)

IF ( <this process holds the global maximum> ) THEN
  glat = grid%xlat(idex,jdex)
  glon = grid%xlong(idex,jdex)
ELSE
  glat = -99999.
  glon = -99999.
ENDIF
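The sketch below shows one way to finish the exercise with plain MPI. It is a hedged illustration, not the tutorial's solution code: it assumes MPI is already initialized (WRF does this), that local_sum, local_max, idex, and jdex have been computed on each patch, and that grid carries the xlat/xlong fields referenced above. WRF's frame/module_dm.F provides wrapper utilities for the same kinds of reductions.

! Hedged sketch: global sum, global max with owner, and broadcast of the
! owner's lat/lon via a MAX reduction. Not the tutorial solution.
SUBROUTINE reduce_sketch( grid, local_sum, local_max, idex, jdex, &
                          global_sum, global_max, glat, glon )
  USE module_domain, ONLY : domain        ! for TYPE(domain); assumed available
  IMPLICIT NONE
  INCLUDE 'mpif.h'
  TYPE(domain), INTENT(IN) :: grid
  REAL,    INTENT(IN)  :: local_sum, local_max
  INTEGER, INTENT(IN)  :: idex, jdex
  REAL,    INTENT(OUT) :: global_sum, global_max, glat, glon
  REAL, DIMENSION(2)   :: inbuf, outbuf   ! (value, owner rank) pairs for MAXLOC
  INTEGER :: myrank, ierr

  CALL MPI_Comm_rank( MPI_COMM_WORLD, myrank, ierr )

  ! Global sum of the per-patch sums
  CALL MPI_Allreduce( local_sum, global_sum, 1, MPI_REAL, MPI_SUM, MPI_COMM_WORLD, ierr )

  ! Global max and the rank that owns it
  inbuf(1) = local_max ; inbuf(2) = REAL( myrank )
  CALL MPI_Allreduce( inbuf, outbuf, 1, MPI_2REAL, MPI_MAXLOC, MPI_COMM_WORLD, ierr )
  global_max = outbuf(1)

  ! Only the owning process knows (idex, jdex); the others contribute an
  ! artificially low value, so a MAX reduction leaves the true lat/lon everywhere.
  IF ( myrank == NINT( outbuf(2) ) ) THEN
    glat = grid%xlat (idex,jdex)
    glon = grid%xlong(idex,jdex)
  ELSE
    glat = -99999.
    glon = -99999.
  ENDIF
  CALL MPI_Allreduce( MPI_IN_PLACE, glat, 1, MPI_REAL, MPI_MAX, MPI_COMM_WORLD, ierr )
  CALL MPI_Allreduce( MPI_IN_PLACE, glon, 1, MPI_REAL, MPI_MAX, MPI_COMM_WORLD, ierr )
END SUBROUTINE reduce_sketch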