
Calculation of Phonon Dispersions on the Grid Using Quantum ESPRESSO

Riccardo di Meo¹, Andrea Dal Corso²,³, Paolo Giannozzi³,⁴ and Stefano Cozzini³

¹ The Abdus Salam International Centre for Theoretical Physics, Trieste, Italy
² SISSA-ISAS, Trieste, Italy
³ CNR-INFM DEMOCRITOS National Simulation Center, Trieste, Italy
⁴ Dipartimento di Fisica, Università di Udine, Udine, Italy

Lecture given at the Joint EU-IndiaGrid/CompChem Grid Tutorial on Chemical and Material Science Applications Trieste, 15-18 September 2008

LNS0924010

[email protected]

Abstract

We describe an application of the Grid infrastructure to realistic first-principles calculations of the phonon dispersions of a relatively complex crystal structure. The phonon package in the Quantum ESPRESSO distribution is used as the computational engine. The calculation is split into many subtasks scheduled on the Grid by an interface, written in Python, that uses a client/server approach. The same interface takes care of collecting the results and of re-scheduling the subtasks that were not successful. This approach allows the calculation of the complete phonon dispersions of materials described by a relatively large unit cell, which would otherwise require a sizable amount of CPU time on a dedicated parallel machine. Our approach decouples the application from the underlying computational platform and can easily be used on different computational infrastructures.

Contents

1 Introduction
2 Phonon calculation
3 Computational analysis of the phonon calculation
4 Grid enabling procedure
5 Results
  5.1 Grid results
  5.2 HPC results
  5.3 Comparison
6 Conclusions


1 Introduction

The efficient and reliable execution of realistic scientific applications on Grid infrastructures is still far from trivial. The Grid computing paradigm defines such infrastructures as collections of geographically distributed computational resources glued together by software (called middleware) that allows users to access and use them in an easy and transparent way. Different types of middleware exist, for instance the BOINC toolkit, which animates many Grids that usually rely on volunteer computing (like Seti@Home or Protein@Home), or other Grids based on Java or on proprietary solutions. The European Union funded EGEE (Enabling Grids for E-sciencE) project [1] nowadays offers one of the largest distributed computing infrastructures (around 100K CPUs) for scientists. EGEE developed its own middleware, which made the use of widely geographically distributed networks more accessible. However, such middleware still does not provide the features required to run parallel simulations reliably, efficiently and with minimum effort on the user's side. Such applications require additional software to deal with as many technical details as possible, thus leaving the researcher free to focus on the scientific aspects of the problem.
In this paper we report our experience with realistic computations of material properties using the Quantum ESPRESSO (Q/E) [2] package. Q/E is an integrated suite of computer codes for electronic-structure calculations and materials modelling [3], implementing Density-Functional Theory in a plane-wave basis set. Typical CPU and memory requirements for Q/E vary by orders of magnitude depending on the type of system and on the calculated physical property, but in general both CPU and memory usage quickly increase with the number of atoms in the unit cell. As a consequence, running Q/E on the Grid is not a trivial task: only tightly-coupled MPI parallelization with memory distribution across processors [4] makes it possible to solve large problems, i.e. systems with a large number of atoms in the unit cell. The resulting MPI program requires fast, low-latency communications and does not fit the standard MPI execution environment provided by the EGEE computational infrastructure. The default MPI implementation is in fact based on MPICH over Ethernet and does not provide enough performance in terms of latency and bandwidth, especially in the case of farm-like Computing Elements. Running Q/E on the Grid is still possible, but limited at present to calculations that fit into the memory of a single Computing Element.


There are, however, some cases in which the loosely-coupled parallelization available on the Grid can be really useful. Q/E currently implements, using MPI, at least one such case [4]: parallelization over images, i.e. points in the configuration space, used for the calculation of transition paths and energy barriers. The number of images is however typically small, on the order of 10 at most. A case that looks more promising for Grid execution is the calculation of the full phonon dispersions in crystals. Medium-size systems (a few tens of atoms) easily fit into a typical Computing Element, but many independent calculations, on the order of hundreds, are needed, thus making the overall computational requirement quite large. In this paper we describe in some detail how to run such a calculation, the porting procedure and the results obtained so far.
The paper is organized as follows: in section 2 we describe in some detail how to compute the full phonon dispersions in crystals using Q/E. Section 3 reports the computational analysis and requirements of the phonon calculations. Section 4 describes the subsequent implementation on the Grid by means of a client/server architecture. Section 5 reports a benchmarking analysis in which the computational efficiency attained on the Grid is compared against High Performance Computing (HPC) facilities. Finally, section 6 draws some conclusions and outlines future perspectives.

2 Phonon calculation

Phonons in crystals [5] are extended vibrational modes, propagating with a wave-vector q. They are characterized by a vibrational frequency ω(q) and by the displacements of the atoms in one unit cell. The q wave-vector is the equivalent of the Bloch vector for the electronic states and lies inside the first Brillouin zone, i.e. the unit cell of the reciprocal lattice. Phonon frequencies form dispersion bands in much the same way as electronic states. For a system with N atoms in the unit cell, there are 3N phonons for a given q. The dynamical matrix contains the information on the vibrational properties of a crystal: phonon frequencies are the square roots of its eigenvalues, while the atomic displacements are related to its eigenvectors. Q/E calculates the dynamical matrix of a solid using Density-Functional Perturbation Theory [6]. In this approach, one calculates the charge response to lattice distortions of definite wave-vector q. The starting point is the electronic structure of the undistorted crystal, obtained from a conventional Density-Functional Theory self-consistent (scf) calculation. A different charge response must be calculated for each of the 3N independent atomic displacements, or for any equivalent combination thereof.


Q/E uses atomic displacements along symmetry-dependent patterns, the irreps (short for irreducible representations). The irreps are sets of displacement patterns that transform into themselves under the small group of q, i.e. the symmetry operations that leave both q and the crystal unchanged. Since the irreps of the small group of q are typically 1- to 3-dimensional, only a few displacement patterns belong to one irrep and only the responses to these patterns need to be calculated simultaneously. This procedure allows one to exploit symmetry [3] in an effective way, while keeping the calculation of the charge response within each irrep independent from the others. Once the charge response to one irrep is self-consistently calculated, the contribution of this irrep to the dynamical matrix is calculated and stored. When all atomic displacements (or all irreps) have been processed, the dynamical matrix for the given q is obtained.
In order to calculate the full phonon dispersions, and thus all quantities depending on integrals over the Brillouin Zone, one needs dynamical matrices for any q-vector. In practice, one can store the needed information in real space in the form of Interatomic Force Constants [7]. These are obtained by inverse Fourier transform of the dynamical matrices, calculated on a finite uniform mesh of q-vectors. The number of needed q-vectors is relatively small, since Interatomic Force Constants are short-ranged quantities, or can be written as the sum of a known long-ranged dipolar term plus a short-ranged part. Once the Interatomic Force Constants in real space are available, the dynamical matrix can be reconstructed at any desired value of q with little effort. Alternatively, one can compute a finite number of q-vectors and plot or interpolate the resulting phonon dispersion branches. We stress here that phonon calculations at different q are independent.
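To make the relation between the dynamical matrix and the phonon frequencies concrete, here is a minimal NumPy sketch (not part of Q/E, with made-up matrix values): it diagonalizes a toy mass-weighted dynamical matrix for one q and extracts the 3N frequencies and displacement patterns.

    import numpy as np

    def phonon_frequencies(dyn):
        """Return phonon frequencies and modes from a dynamical matrix.

        dyn: complex Hermitian array of shape (3N, 3N) for one q-vector
        (mass-weighted). Frequencies are the square roots of the eigenvalues;
        a negative eigenvalue (unstable mode) is returned, by the usual
        convention, as a negative frequency.
        """
        eigvals, eigvecs = np.linalg.eigh(dyn)             # eigh: Hermitian matrices
        freqs = np.sign(eigvals) * np.sqrt(np.abs(eigvals))
        return freqs, eigvecs                              # columns ~ displacement patterns

    # Toy example: N = 2 atoms -> a 6x6 Hermitian matrix with arbitrary values.
    rng = np.random.default_rng(0)
    a = rng.normal(size=(6, 6)) + 1j * rng.normal(size=(6, 6))
    dyn = (a + a.conj().T) / 2                             # enforce Hermiticity
    freqs, modes = phonon_frequencies(dyn)
    print(freqs)                                           # 3N frequencies for this q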

3 Computational analysis of the phonon calculation

Crystals with unit cells containing a few tens of atoms, up to 100, fit into a single modern computing element and require relatively short execution times (minutes to hours) for the scf step (code pw.x of the Q/E distribution). The memory requirement of a phonon calculation is somewhat larger than that of the scf calculation, but of the same order of magnitude. A full-fledged phonon calculation, instead, for a system of N atoms per unit cell on a uniform mesh of nq q-vectors requires a CPU time at least as large as 3N × nq times the CPU time of the scf step.


For systems with a few tens of atoms in the unit cell, this multiplicative factor can be of the order of thousands or more. As a consequence, phonon calculations for crystals of more than a few atoms are considered HPC applications.
In Q/E the nq dynamical matrices for a uniform mesh of q-vectors are calculated by the ph.x code. The ph.x code can split the phonon calculation on the mesh of nq points into nq runs, one for each of the nq symmetry-inequivalent q-vectors of the mesh. In order to better distribute the computation, we have modified the ph.x code to calculate separately also the contribution of each irrep to the dynamical matrix and to save it into a file. Assuming that there are Nirr(qi) irreps for q = qi, the complete phonon calculation can be split into up to Ntot = Σi Nirr(qi) separate calculations, where the sum runs over the nq q-vectors of the mesh; these calculations can be executed simultaneously on different Computing Elements. The CPU time required by each response calculation, i.e. for each irrep at one q-vector, is roughly proportional to (and typically larger than) the dimension of the irrep, times the CPU time required by the starting scf calculation for the undistorted system, times the ratio NG/Nq between the number NG of symmetry operations of the point group of the crystal and the number Nq of those in the small group of q. The latter factor is of no importance in low-symmetry crystals, for which NG = Nq = 1, but it can be quite large (up to 48) for a general q in a highly symmetric solid. In the latter case, once the dynamical matrix at q has been calculated, we can also calculate without effort the dynamical matrices of the star of q, which contains NG/Nq points, so the total amount of time for computing the nq points of the mesh is independent of the symmetry. However, depending on the system, the efficiency of the Grid partitioning can vary from cases in which each ph.x run requires approximately the same CPU time as the scf calculation, to particularly unfortunate cases in which a ph.x run requires up to 50 times the CPU time of a single scf run.
The contribution of each irrep to the dynamical matrix (typically a few Kbytes of data) is written to an .xml file which can easily be transferred among different machines. After collecting all the .xml files on a single machine, a final run of ph.x can collect the dynamical matrices at the nq q-points. This has been implemented with the following approach. The files with the contribution of a given irrep to the dynamical matrix are written in the directory outdir/prefix.phsave and are called data-file.xml.q.irrep, where q identifies the q-point and irrep the irrep.


When the ph.x code is started with the input variable recover set to .true., it checks the directory outdir/prefix.phsave for the existence of files called data-file.xml.q.irrep; when a file is found, the contribution of the corresponding irrep to the dynamical matrix is not recalculated but read from the file. When the files data-file.xml.q.irrep for all q and all irreps are present in the directory, ph.x only collects the dynamical matrices and diagonalizes them. Four new input variables control the phonon run: start_q and last_q choose, among the nq points, those actually calculated in the current run; start_irr and last_irr choose the irreps. In the present version, the ph.x code evaluates the displacement patterns of each irrep by an algorithm based on random numbers. In order to guarantee the consistency of the irreps across different machines, we calculate the displacement patterns in a preparatory run and write them to an .xml file that is then sent to all the computing elements. The file data-file.xml.q contains the displacement patterns for all irreps of a given q, and the file data-file.xml contains a few pieces of information needed to control the phonon run, the most important being the Cartesian coordinates of the nq q-points. If these files are found in the outdir/prefix.phsave directory, the mesh of q-points and the displacement patterns are not recalculated by ph.x.
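As an illustration, the ph.x input for one such chunk might look like the sketch below; the prefix, the directory names and the selected q/irrep range are hypothetical, and the file is meant only to show how the four new variables and recover select a slice of the calculation (the preparatory run described in section 4 would use the same input with start_irr = 0 and last_irr = 0).

    phonon dispersion of Al2O3, one chunk of irreps (illustrative input)
     &inputph
        prefix    = 'al2o3'            ! same prefix as in the scf run (hypothetical)
        outdir    = './tmp/'           ! contains prefix.save and prefix.phsave
        fildyn    = 'al2o3.dyn'        ! dynamical matrix files
        ldisp     = .true.             ! phonons on a uniform q-mesh
        nq1 = 21, nq2 = 1, nq3 = 1     ! the 21x1x1 mesh used in the example of section 5
        recover   = .true.             ! reuse any data-file.xml.* found in prefix.phsave
        start_q   = 3,  last_q   = 3   ! work only on the third q-point of the mesh
        start_irr = 25, last_irr = 28  ! ... and only on irreps 25 to 28 (a group of 4)
     /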

4 Grid enabling procedure

A typical phonon calculation can be performed in three different ways: (i) in a single serial run, (ii) in nq independent calculations, and (iii) in Ntot independent calculations. Figure 1 shows the three different approaches. All of them require a preparatory calculation: a self-consistent calculation performed by the Q/E pw.x tool finds the electronic structure of the undistorted system. The quantities calculated by this run are saved in outdir/prefix.save.
In the first approach ph.x performs all the Ntot calculations in sequence, one after the other, in a single (serial) execution. All files data-file.xml.q.irrep are written in the same directory and a single instance of ph.x is called; the run could be split into several chunks in order to be executed on HPC resources with the limited wall-time assigned by the queue systems, but this recover procedure, which was already implemented in ph.x, will not concern us here.


Figure 1: Flowchart depicting the different approaches to the calculation of the phonon dispersions.

In the second approach the nq points are distributed using the variables start_q and last_q. This means that nq independent jobs are submitted, each one calculating its own dynamical matrix. At this point a plot of the phonons at arbitrary q-points can be obtained using the specific Q/E tools q2r.x and matdyn.x.
In the third approach an additional preparatory run of ph.x, done by setting start_irr=0 and last_irr=0, is required in order to calculate the displacement patterns of all irreps at each q and to write the files data-file.xml.q in the outdir/prefix.phsave directory. The outdir directory is then saved to a storage location accessible, by means of some protocol, from the available computational resources. Up to Ntot ph.x calculations can now be started simultaneously, each with different values of the start_q, last_q, start_irr and last_irr input variables. The output of each ph.x run is one or more files data-file.xml.q.irrep. When all these files become available they can be collected into one outdir/prefix.phsave directory. A final run of ph.x, which does not set the variables start_q, last_q, start_irr or last_irr, collects the results and diagonalizes the dynamical matrices. Note that, in this last step, the ph.x code will try to recalculate an irrep if the corresponding data-file.xml.q.irrep file is not found in the outdir/prefix.phsave directory.


With respect to the sequential run, the two distributed approaches described above require additional steps to collect and recompose the dynamical matrices: the management of input and output in different directories should be done in an automatic way in order to avoid errors in the execution. In the third approach, a final step also has to be performed at the end of the independent calculations to collect all the dynamical matrices. Its computational weight is negligible, but it should be considered in the workflow.
The Grid porting consists of a tool that manages all the procedures sketched above in as automatic a way as possible. This is implemented by means of a client-server architecture: a server is contacted by the different clients and assigns them slices of work. The architecture is designed to be portable to different computational infrastructures. Our goal is to make the tool easily usable on the heterogeneous and geographically distributed resources which users may have at their disposal. For this reason we decoupled as much as possible the management software from the procedures needed to obtain computational resources. We successfully tested this approach on at least two computational infrastructures: a local cluster accessible through a batch queue system and a gLite/EGEE Grid infrastructure.
The server and client software is written in Python. They communicate by means of the XML-RPC protocol: in the case of a Grid infrastructure the server is executed on a resolved host, usually the User Interface (UI: the machine where Grid users submit their jobs to the Grid; it always has both inbound and outbound connectivity). The clients are then executed on the computational nodes, the so-called Worker Nodes (WNs). The user launches the server specifying a set of parameters (the location of the binaries and data files on the Grid, as well as all the needed input) and then submits jobs to the computational infrastructure he wants to use. The only requirement is that the computational nodes must be able to contact the server back: this outbound connectivity requirement is always satisfied in the case of an EGEE/gLite Grid infrastructure. As soon as a job lands on a Worker Node, the client starts executing: it first contacts the server and requests the location of the needed data and of the executable. Different protocols are supported in order to be able to store the data on different kinds of storage. In the case of Grid computing, gsiftp and lfc are both supported, so data can be stored on Grid storage facilities like Storage Elements and the LFC Catalog.
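The snippet below is a minimal sketch, in the spirit of the approach just described, of how such an XML-RPC task server could be set up with Python's standard library. The method names (get_task, put_result, mark_failed), the task bookkeeping and the port are illustrative assumptions, not the actual interface of the tool.

    from xmlrpc.server import SimpleXMLRPCServer

    # Illustrative task list: slices of the phonon calculation, here 120 irreps
    # at one q-point handed out in groups of 4 (all values are made up).
    pending = [{"id": i, "start_q": 3, "last_q": 3,
                "start_irr": 4 * i + 1, "last_irr": 4 * i + 4} for i in range(30)]
    assigned, done = {}, {}

    def get_task(client_id):
        """Hand the next available slice of work to a client."""
        if pending:
            task = pending.pop(0)
            assigned[task["id"]] = (client_id, task)
            return task
        return {}                          # empty dict: nothing left to do

    def put_result(client_id, task_id, xml_payload):
        """Store the data-file.xml.q.irrep content sent back by a client."""
        done[task_id] = xml_payload
        assigned.pop(task_id, None)
        return True

    def mark_failed(task_id):
        """Re-queue a task whose client disappeared or failed."""
        if task_id in assigned:
            pending.append(assigned.pop(task_id)[1])
        return True

    server = SimpleXMLRPCServer(("0.0.0.0", 8000), allow_none=True)
    for f in (get_task, put_result, mark_failed):
        server.register_function(f)
    server.serve_forever()                 # in the real tool the server exits once all results are in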


Once the data and the executable are available, the client asks for a chunk of work to be done. This amounts to assigning suitable values to the four variables start_q, last_q, start_irr and last_irr. Once the calculation is performed, the output (i.e. the files data-file.xml.q.irrep produced in the run and possibly the output of ph.x) is sent back to the server, which assigns another task. Client activity terminates when the client hits the walltime limit imposed by the infrastructure (or if no active server is found). The server stops when all the data needed to compute the required dynamical matrices have been sent back by the clients. At that point, the user finds in the server's directory the data-file.xml.q.irrep files for all nq q-points and for all irreps. The dynamical matrices can then be recomposed locally with a final call to ph.x.
In order to start the execution of our system the user has to provide:
- the ph.x binary, compiled to suit the computational infrastructure, located in a storage location reachable by the WNs;
- the output data files of the self-consistent computation from pw.x and of the preparatory ph.x run, again located in a storage location reachable by the WNs;
- a template of the ph.x input file, which is used by the server to create ad hoc input files for each client requesting tasks to be executed.
The server is then launched on the UI, with the appropriate parameters, where it runs until the end of the simulation. It is left to the user to submit an appropriate number of jobs in order to recruit the computational resources over which to distribute the computations. Such a task is a trivial one: users just have to submit the same job many times, and this can easily be automatized. The server keeps track of the status of the clients connected to it: if a client fails for any reason, the server marks the task assigned to that client as available again and assigns it to another client as soon as one becomes available. This fault-tolerance mechanism is very important in a Grid infrastructure, where no strict control is possible on the computational nodes and the job failure rate is still high. We observed a 60% failure rate out of 2140 clients that successfully connected to the server (as discussed in the next section).
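The client-side loop can be sketched as follows; the server URL, the method names (matching the server sketch above), the template placeholders and the way ph.x is invoked are assumptions made for illustration, not the actual code of the interface.

    import subprocess
    import xmlrpc.client

    SERVER_URL = "http://my-ui.example.org:8000"      # hypothetical User Interface host
    proxy = xmlrpc.client.ServerProxy(SERVER_URL, allow_none=True)
    client_id = "wn-worker-001"                       # illustrative identifier

    while True:
        task = proxy.get_task(client_id)
        if not task:                                  # empty dict: no work left, exit
            break
        # Fill the ph.x input template (assumed to contain {start_q}, {last_q},
        # {start_irr}, {last_irr} placeholders) with the assigned slice of work.
        with open("ph.template") as f:
            ph_input = f.read().format(**task)
        with open("ph.in", "w") as f:
            f.write(ph_input)
        # Run the serial ph.x executable previously downloaded from the storage element.
        with open("ph.in") as inp, open("ph.out", "w") as out:
            subprocess.run(["./ph.x"], stdin=inp, stdout=out, check=True)
        # Send back one per-irrep .xml contribution (the real tool returns every
        # data-file.xml.q.irrep produced in the run); the path is illustrative.
        fname = "tmp/al2o3.phsave/data-file.xml.{}.{}".format(task["start_q"], task["start_irr"])
        with open(fname) as f:
            proxy.put_result(client_id, task["id"], f.read())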


We finally note that the above procedure runs the ph.x code serially on a large number of CPUs: each client (associated with a single job) lands on a different machine. Since SMP/multicore architectures are now very common as Grid computational resources, it can often happen that such serial jobs share the resources with other tasks submitted by other users. This certainly has some drawbacks in terms of performance and efficiency. We will discuss possible alternative solutions to this problem in the conclusions.

5 Results

The porting procedure and the client/server architecture were developed using a few small examples. In this section we discuss a relatively large phonon calculation which allowed us to evaluate the efficiency of the new distributed approach against the standard HPC one.
A realistic phonon calculation on the Grid is given by the Al2O3 system shown below (Fig. 2). This solid can be described as a distorted hexagonal lattice, characterized by primitive lattice vectors a, b, c, whose lengths are 5.579, 5.643 and 13.67 Angstrom respectively, at angles ab = 120°, ac = 90° and bc = 89.5°. The unit cell contains 8 formula units, i.e. 40 atoms. The size of the phonon calculation is therefore 3N = 120 linear-response calculations for each of the nq q-vectors. In this example we choose to sample only a line in reciprocal space, from Γ (q = 1) to the zone boundary along the reciprocal lattice vector proportional to b × c, using a 21×1×1 mesh, so that nq = 11. The point group contains only the identity, so every q-vector requires the same amount of CPU time, with the exception of the Γ point, for which we also calculate the response to an electric field in order to evaluate the dielectric constant and the Born effective charges. The total time needed to complete such a calculation on a single modern workstation is a few weeks. In Fig. 2b we show the output of the calculation performed on both the HPC and the Grid infrastructure. We performed the same calculation using the two different distributed approaches (over q and over irreps) on the HPC and Grid resources available to us. Table 1 summarizes the full list of computational experiments performed, with some details.


[Figure 2: The Al2O3 system. (a) Atomic structure; (b) phonon dispersion curve.]

5.1 Grid results

Let us now first discuss the performance figures obtained using the client/server mechanism presented above on the Grid (provided by the EGEE project through the CompChem Virtual Organization [8]), where the distribution was done over irreps. The experiment was repeated three times, varying the number of irreps assigned as a task to each client: this number was respectively 1, 4 and 6 for grid1, grid2 and grid3.

Table 1: List of phonon calculations on HPC and Grid: q/irreps distribution.

code    kind   cpus   computational nodes             distributed
grid1   GRID   1      heterogeneous                   over irreps
grid2   GRID   1      heterogeneous                   over irreps, in groups of 4
grid3   GRID   1      heterogeneous                   over irreps, in groups of 6
4cpu    HPC    4      opteron dual-core 2.4 GHz       over q
8cpu    HPC    8      2 x opteron dual-core 2.4 GHz   over q
16cpu   HPC    16     4 x opteron dual-core 2.4 GHz   over q


The three experiments were performed one after the other, without overlaps, in order to maximize the resources available to each one. The server was activated on the User Interface and about 3000 independent jobs were submitted to the Grid scheduler in a few shots of 500 jobs each, separated by 12/24 hours. This was done in order to avoid a heavy load on the Workload Management System (WMS)¹ and to allow a better overall scheduling policy.
As reported in the previous section, a high failure rate, 60% of the jobs that contacted the server, was observed². We discovered that between 30% and 40% of them failed to download the data required for the simulation, while most of the others disappeared without signaling back to the server (which happens when the client hangs, or gets killed by the queue system). We are currently searching for the reasons behind such a large number of failures.
Figure 3 gives information about the evolution of the simulations. It reports the number of active processors and the number of irreps completed as a function of the duration of the experiments. It is worth noting the amount of resources collected during the three experiments: all of them were able to recruit more than 130 active clients at the same time³; since access to the Grid is granted on a best-effort basis, we consider this an excellent result, which would hardly be possible on small and medium HPC clusters without privileged policies.
Concerning the overall performance, grid3 completed the simulation in 58 hours, whereas grid2 and grid1 took 67 hours and 127 hours, respectively. grid3 was also far more efficient in terms of the total number of processor hours needed to complete the experiment (see figure 3c): it used slightly fewer resources than grid2 and about half the resources used by grid1 (4064 processor hours for grid3 versus 7100 for grid1). The overall performance of grid1 was affected by a hardware failure of the server that blocked the execution for more than 20 hours; all clients connected to the server at the time of the failure were lost. Once the server was restarted and the jobs resubmitted, the simulation resumed, but a lot of calculations were actually lost. It is however evident that the latter two simulations outperform grid1.
¹ The service handling the scheduling and queuing of jobs on EGEE-type Grids.
² A failure is counted each time a client exits before being able to complete any task.
³ We consider a client active when it is able to complete at least one irrep.


Table 2: Initialization time and scf time for the Grid runs, for q = 1 and averaged over the remaining points.

(a) grid1
q                   init. (h)   phqscf (h)
1                     859.99       157.54
2-11                  252.51       174.89
CPU time (days)          141           79
% of the time             64           36

(b) grid2
q                   init. (h)   phqscf (h)
1                     249.47       182.36
2-11                   70.60       192.40
CPU time (days)           40           88
% of the time             31           69

(c) grid3
q                   init. (h)   phqscf (h)
1                     143.37       157.73
2-11                   46.48       189.55
CPU time (days)           25           86
% of the time             23           77
The reasons for this difference can be understood by looking at the detailed information on the time required to complete the irreps reported in table 2. We report the time required to compute the full 120 irreps. We distinguish the Γ point (q = 1) from all the others, due to its different initialization phase, as already mentioned above; for the other q-points we report averages over their number. Intervals are measured by the internal timing mechanism of the ph.x program. The initialization time reported here is given by the init_phq routine plus, for q = 1 only, the e_solve routine. The linear-response calculation is then carried out by the phqscf routine, which accounts for all the self-consistent iterations for each irrep.
It is evident from figure 3a that in the client/server approach a considerable amount of time is spent in initialization procedures. This step is performed, for each q, 120 times, 30 times and 20 times for grid1, grid2 and grid3 respectively. This slows down the simulations considerably: the initialization phase alone accounts for a major part of the CPU time spent by grid1 (141 days, i.e. 64% of the walltime). Grouping the irreps assigned to each client alleviates the problem, but initialization still plays a prominent role for grid2 and grid3 (contributing, respectively, 40 and 25 days, or 31% and 23% of the total time).


[Figure 3: Number of irreducible representations computed and clients present over time for the Grid simulations. Panels: (a) grid1, (b) grid2, (c) grid3; horizontal axis: time (hours); vertical axes: clients present and irreps completed.]

Grouping the irreps is not, however, the only possible way of reducing the gap between our Grid-oriented approach and the HPC one: early studies suggest that the initialization time when distributing over the irreps might be further reduced, which would greatly improve the performance. We finally note that the overhead caused by data transfer between storage elements and worker nodes (which is non-existent in an HPC environment) is almost negligible in this context and does not affect the overall performance at all, despite the large amount of data downloaded by each client. This is because Grid storage allows multiple copies of the same data (replicas), managed by a catalog (the LFC central service). Such replicas can be stored close to the computing elements used by the clients in the course of the simulations, which means that in many cases the data are simply downloaded from a storage element on the same local area network.
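The impact of the repeated initialization can be summarized with a back-of-the-envelope estimate. The sketch below only reproduces the arithmetic discussed above (120 irreps per q-point, tasks of 1, 4 or 6 irreps); the per-task initialization time is an assumed value, not a measurement.

    import math

    def init_overhead(group_size, n_irreps_per_q=120, n_q=11, t_init_hours=2.0):
        """Estimate the total initialization time when each task handles
        `group_size` irreps: every task repeats the per-q initialization once,
        so grouping reduces the count from 120 tasks per q (group_size=1)
        to 30 (group_size=4) or 20 (group_size=6). t_init_hours is an
        assumed average cost per task."""
        tasks_per_q = math.ceil(n_irreps_per_q / group_size)
        return tasks_per_q * n_q * t_init_hours

    for g in (1, 4, 6):                    # groupings used by grid1, grid2, grid3
        print(g, init_overhead(g), "hours of initialization (illustrative)")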


5.2 HPC results

In the case of the q-distribution approach performed on HPC resources, we simply submitted nq = 11 independent calculations through the resource manager (the queue system of our HPC cluster). The code was compiled enabling MPI communications and we repeated the calculation using different numbers of CPUs: 4 (a full dual-core Opteron node), 8 and 16 (respectively two and four nodes connected via an InfiniBand network).

Table 3: CPU time spent on the initialization and scf steps for the simulations distributed over q.

(a) 4cpu
q                   jobs   init. (h)   phqscf (h)
1                      8        8.16        62.02
2-11                   6        5.19        60.06
CPU time (days)                   10          110
% of the time                      8           92

(b) 8cpu
q                   jobs   init. (h)   phqscf (h)
1                      4        2.41        32.13
2-11                   4        1.36        30.28
CPU time (days)                    5          109
% of the time                      4           96

(c) 16cpu
q                   jobs   init. (h)   phqscf (h)
1                      2        0.81        16.01
2-11                   2        0.28        15.39
CPU time (days)                    2          113
% of the time                      2           98
Once the jobs terminated, we resubmitted them until all 120 irreps were calculated. As for the Grid, table 3 gives a detailed report of the total amount of time needed to complete the 4-, 8- and 16-cpu experiments. Data are presented as in table 2, with an additional column reporting the number of jobs needed to complete the full 120 irreps. We note that at every restart (i.e. for every new job) the initialization phase has to be repeated: splitting the calculation into chunks of about 10 hours therefore causes some overhead, although considerably less than in the Grid approach.


Tables 3a, 3b and 3c give indications about the overall scalability of the ph.x code, which is acceptable: there is moreover a super-linear speed-up in the initialization phase of the calculation when increasing the number of CPUs; this is simply because fewer jobs are needed and therefore the initialization phase is executed fewer times. The performance of the code when moving from 4 to 16 CPUs is also very good, with an overhead of only a few percent, which implies almost perfect scalability in our range of testing.

Table 4: Walltime estimate for the q distribution on the HPC platform.

# cpu   total # of jobs   max # of concurrent jobs   # of submissions   average time waiting (m)   max. running time avail.   estimated time
4       88                10                         9                  22                         12h                        110h
8       39                10                         4                  46                         12h                        50h
16      22                8                          3                  136                        12h                        40h

From the times reported above we can estimate the average duration of a complete computational experiment, taking into account the time spent waiting in the queue (based on the scheduling system information) and the user policies implemented. Since the maximum wall-time allowed in our queues is 12 hours and the maximum number of running jobs is 10 (with the additional constraint that no more than 128 CPUs can be used at once), we obtained the estimates reported in table 4. These numbers make clear that it is far more convenient to run on 8 or 16 CPUs than on 4. An unexpected side effect of the scheduling policy is that the time required to complete the simulation on 16 CPUs is only slightly smaller than on 8, since the super-linear speed-up observed in the initialization phase is in this case counterbalanced by the time spent waiting in the queue.
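The estimates in table 4 can be rationalized with a simple model: the sketch below assumes that jobs are submitted in rounds of at most the allowed number of concurrent jobs, and that each round waits the average queue time and then runs up to the 12-hour limit. This is our reading of the procedure, not necessarily the exact formula used to build the table.

    import math

    def estimate_walltime(total_jobs, max_concurrent, avg_wait_min, chunk_hours=12.0):
        """Rough time-to-result for the q-distributed HPC runs (see text)."""
        rounds = math.ceil(total_jobs / max_concurrent)
        return rounds * (avg_wait_min / 60.0 + chunk_hours)

    # Numbers taken from table 4: total jobs, max concurrent jobs, average wait (minutes).
    for label, jobs, conc, wait in [("4cpu", 88, 10, 22), ("8cpu", 39, 10, 46), ("16cpu", 22, 8, 136)]:
        print(label, round(estimate_walltime(jobs, conc, wait)), "hours (rough estimate)")

This reproduces, to within a few hours, the estimated times of 110h, 50h and 40h reported in table 4.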

5.3 Comparison

We can now compare the computational efficiency of the two distributed approaches presented so far. We base this comparison on two important parameters. The first is the total wall time required to complete the simulation, which gives the actual duration of the experiment, i.e. the time to result. We stress here that the duration of the experiment depends on many different parameters, not all of them easy to keep under control, especially in a distributed Grid environment.


It therefore turns out that this value can be quite volatile and only a rough estimate can be given. For this reason we couple it with the second parameter, the total amount of global CPU time needed (processor hours), which gives a better defined measure; we also note that this is the value that supercomputer centres use to charge for the use of their computational infrastructures. Concerning the duration of the experiment, we observe that a Grid infrastructure can easily compete with the HPC infrastructure for this kind of computational experiment: for instance, the HPC 8cpu simulation is estimated to complete just 8 hours before grid3. Looking at the global CPU time needed, we see that grid3 is even better than the HPC resources: its global amount of computing time is slightly less than that of the 8cpu and 16cpu runs. A complete comparison of all the experiments performed is collected in Fig. 3. We note here that while the q-distributed simulations took about 115 days of CPU time, grid1 took about 220 days⁴.

⁴ Without counting the CPU time spent by extra clients working on the same task at once, which accounts for the discrepancy between table 2 and figure 3.

[Figure 4: Resources spent on the Grid. Wall time (days) as a function of elapsed time (hours) for grid1, grid2 and grid3.]


The reason for this is to be sought in the overhead caused by the initialization when splitting the computation over the irreducible representations, which is also the reason behind the performance differences between the various Grid experiments.

6 Conclusions

As discussed in the previous section, our approach to distributing the phonon computation proved to be interesting when compared with the standard HPC one. Our solution, by decoupling as much as possible the scientific workflow from the recruitment of computational resources, is easily portable and fully interoperable with different computational platforms. By relieving the simulation of the requirement of a very fast and reliable network connection between nodes, we have been able to use decoupled and unspecialized resources, such as different Worker Nodes connected through the internet, to perform a task that previously could be tackled only with high-performance infrastructures. The detailed performance comparison carried out in the previous section shows that the time needed to complete a realistic phonon calculation on a Grid infrastructure is comparable with the time needed on an HPC platform. We stress that computational resources are made available with a completely different policy in a Grid infrastructure with respect to an HPC environment: from this perspective our results are even more interesting and promising.
There is, however, room for making the procedure even more efficient, with a twofold approach. On one side, we can reduce the required initialization time by directly modifying the scientific code. On the other side, we can also pack more irreps together, to better fit the available queues.
We finally note that, since multicore/SMP architectures are widespread as computational resources in Grid infrastructures, it would be desirable to run each process on an SMP node, thus enabling the parallel execution of ph.x using the shared-memory approach (in the same way as we performed the client/server experiment on the HPC platform). We therefore integrated into our setup another tool, developed by ICTP: reserve_smp_nodes, which allows, through a mechanism known as job reservation, codes parallelized through shared memory to be run on the Grid, thus obtaining a speed-up in the execution and, more importantly, allowing larger simulations to be tackled (since a larger share of the memory on the destination machine can be reserved for the execution).


Employing this utility we have been able to run the ph.x executable in parallel on Grid nodes with MPI over shared memory, for instance using two cores per node, effectively doubling the memory available to the application (as well as the computing resources); the results obtained with other scientific applications suggest that a larger number of CPUs can easily be recruited as well. Research on the combined use of reserve_smp_nodes and ph.x is still ongoing.

Acknowledgments. We thank Eduardo Ariel Menendez Proupin for suggesting the physical problem considered in this paper.


References

[1] Enabling Grids for E-sciencE, http://www.eu-egee.org
[2] Quantum ESPRESSO home page, http://www.quantum-espresso.org
[3] P. Giannozzi et al., http://arxiv.org/abs/0906.2569
[4] P. Giannozzi and C. Cavazzoni, Nuovo Cimento C 32 (2009), in press.
[5] See for instance: C. Kittel, Introduction to Solid State Physics, 8th edition, Wiley, New York (2005).
[6] P. Giannozzi, S. de Gironcoli, P. Pavone and S. Baroni, Phys. Rev. B 43, 7231 (1991).
[7] S. Baroni, S. de Gironcoli, A. Dal Corso and P. Giannozzi, Rev. Mod. Phys. 73, 515 (2001).
[8] CompChem Virtual Organization, http://compchem.unipg.it/start.php
