MPI Under Condor
This information pertains mainly to local (UW astro department) Condor usage.

You can take advantage of the gigabit network in the undergraduate computer lab to run your simulations in parallel, and you can use the existing Condor system to do it. So, if you ran pkdgrav/gasoline through Condor before, you can now do it faster. There are a few things you need to do first.

The current stable release of Condor (6.6) does not use LAM (the next version will). Instead, it uses mpich. So, if you compiled pkdgrav to run in parallel before, you likely compiled it with the LAM MPI compilers.
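You will need to recompile with the mpich compilers instead. Put the mpich bin directory at the front of your path:
setenv PATH /astro/users/roskar/nbody/mpich-1.2.4/bin:$PATH
for cshell or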
export PATH=/astro/users/roskar/nbody/mpich-1.2.4/bin:$PATH
for bash shell. Make sure that the above directory is the first thing in your path by typing echo $PATH. Now, to make sure that everything is built cleanly, first type make spotless to remove any object files made with other compilers. Also delete the file ../mdl/mpi/mdl.o.

Pkdgrav compiled with the Intel compiler icc seems to run about 1/3 faster than binaries compiled with standard compilers. You need to do a few things to use the Intel compiler. First, run the script that sets your path correctly: source /net/intel/compiler81/bin/iccvars.csh for the cshell, or iccvars.sh for the bash shell. Next, you need to alter some lines in the "SP1/2 defines" section of your Makefile. Make sure they read as follows:
SPX_CFLAGS = -O3 -xK -ipo -I$(SPX_MDL) $(CODEDEF)
SPX_LD_FLAGS = -L/astro/users/roskar/nbody/mpich-1.2.4/lib
SPX_LIBMDL = $(SPX_MDL)/mdl.o -lm
SPX_MDL_CFLAGS = -O3 -xK -ipo
You will use the mpicc compiler. This compiler makes sure the proper mpi libraries are used, but invokes another C compiler and linker to do the actual work. So, you need to set two more environment variables:
setenv MPICH_CC icc
setenv MPICH_CLINKER icc
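The setenv lines above are csh syntax. For bash users, the equivalent would presumably look like the following. This is a minimal sketch that only translates the two variables named above into bash's export form; nothing else about the build is assumed.

```shell
# bash equivalent of the two csh setenv lines above:
# tell mpicc to use icc for both compiling and linking
export MPICH_CC=icc
export MPICH_CLINKER=icc
```

Run these in the same shell session in which you later type make mpi, since exported variables are only visible to that shell and its children.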
Now you're ready to go - type make mpi. The new pkdgrav/gasoline executable is condor-mpi-compatible and built with the Intel compiler.
In your Condor submit file, the universe keyword (universe = MPI) just tells Condor that you want to run in parallel, and machine_count (gasp!) tells it how many computers you want. Don't worry about telling it which machines to run on: only the astrolab computers are configured to accept MPI jobs. Set up the rest of your Condor job configuration just like you would for a regular Condor job, except make sure that the binary you specify is the newly compiled MPI binary.

The last thing you need to know is that you can only submit MPI jobs from the server carrion.astro.washington.edu. So, you'll need to log in to carrion and then submit your job.
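As a rough illustration of the keywords described above, the snippet below writes a minimal MPI-universe submit file from the shell. It is a sketch, not the author's actual file: the executable name pkdgrav.mpi, the parameter file run.param, and the log/output file names are all hypothetical placeholders, and the $(NODE) suffix (intended to give each node its own output file) should be checked against the Condor 6.6 manual.

```shell
# Write a minimal MPI-universe submit file.
# All file names here are hypothetical placeholders.
# The quoted 'EOF' keeps the shell from expanding $(NODE).
cat > pkdgrav_mpi.submit <<'EOF'
universe      = MPI
executable    = pkdgrav.mpi
arguments     = run.param
machine_count = 4
log           = run.log
output        = run.out.$(NODE)
error         = run.err.$(NODE)
queue
EOF

# Then, after logging in to carrion.astro.washington.edu:
#   condor_submit pkdgrav_mpi.submit
```

Only universe and machine_count differ from an ordinary submit file; everything else is set up as for a regular job.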
A few caveats
Your jobs will get kicked off if someone starts using any of the machines you are running on. So, if you try running on all 18 computers on a weekday, your job likely won't run for very long, because someone is bound to sit down at one of the keyboards very soon. Your jobs can also get preempted by other users with a better priority, just as if they were serial jobs. In any case, checkpoint often.

You may also notice that it takes a while for your jobs to start. This is because the Condor negotiator needs to reserve several machines for you to use, and this process can take a few negotiation cycles.

During the day, jobs running on 4 machines sometimes get kicked off within an hour. I've found that setting my timesteps such that checkpoints are written once every 30 minutes works OK. I'm still a bit skeptical about the usefulness of running condor-mpi during the weekdays...
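To turn the 30-minute rule of thumb above into a step count, a little shell arithmetic is enough. This is only a sketch: the value of sec_per_step is a made-up example that you would replace with the wall-clock time per step measured from your own run, and how you pass the resulting interval to your code depends on your parameter file.

```shell
# Hypothetical measurement: wall-clock seconds per simulation step
sec_per_step=240
# Target wall-clock time between checkpoints, in minutes (from the text above)
target_min=30

# Integer division rounds down, so checkpoints land at or inside the target
steps=$(( target_min * 60 / sec_per_step ))
if [ "$steps" -lt 1 ]; then
  steps=1   # never go below one step per checkpoint
fi

echo "checkpoint every $steps steps"
```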
Another problem, which to me appears to be a fundamental bug in the way Condor handles MPI resource allocation, is that if you run multiple MPI jobs they eventually appear to start competing with one another. The DedicatedScheduler user is the "user" that actually gets taxed for allocating resources. This means that if your priority is worse than that of the DedicatedScheduler, then it could kick your job off - to run your other job! I've seen it get caught in somewhat of a stalemate a few times, which is an enormous waste of compute resources, since the DedicatedScheduler just hangs on to resources without ever having jobs run on them for very long. I "solve" this by only running one job at a time. This certainly destroys some of the Condor distributed computing model, which makes you dream of running many, many jobs at once, but I have not been able to come up with a better alternative.

It doesn't make sense to simply run your job on the maximum possible number of machines. Eventually, the cost of communication between the nodes will allow for only a marginal improvement in performance. So, run your simulation for just a few steps using different numbers of processors to see what number of machines is sensible to use. This way you also won't hog the entire undergrad lab with just one job.

It also appears that Condor-MPI jobs do not copy the environment properly, even if getenv = true is set. You may need to set some environment variables by hand, especially if you are using pkdgrav, since the PKDGRAV_CHECKPOINT_FDL variable is required for proper checkpointing. This is done in the submit file with environment = variable1=value1;variable2=value2;... and so on. Remember that Condor has a limit of 10240 characters for the environment variables, so make sure that you are not accidentally removing old ones by adding new ones.

Enjoy the improved performance of your Condor jobs!
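The short scaling test and the hand-set environment line described above could be prepared with a loop like the one below. It is a sketch under stated assumptions: pkdgrav.mpi, scaling.param, the machine counts 2, 4 and 8, and the value /path/to/checkpoint.fdl are all hypothetical placeholders, and the $(NODE) output suffix should be checked against the Condor 6.6 manual.

```shell
# Generate one submit file per machine count for a short scaling test.
# Hypothetical names throughout: pkdgrav.mpi, scaling.param, and the
# checkpoint path are placeholders, not values from this page.
for n in 2 4 8; do
  # Unquoted EOF so ${n} expands; \$ keeps $(NODE) literal for Condor.
  cat > "scaling_${n}.submit" <<EOF
universe      = MPI
executable    = pkdgrav.mpi
arguments     = scaling.param
machine_count = ${n}
environment   = PKDGRAV_CHECKPOINT_FDL=/path/to/checkpoint.fdl
log           = scaling_${n}.log
output        = scaling_${n}.out.\$(NODE)
error         = scaling_${n}.err.\$(NODE)
queue
EOF
done

# Submit these one at a time from carrion, waiting for each to finish,
# so the jobs do not compete with one another:
#   condor_submit scaling_2.submit
```

Comparing the wall-clock time per step across the runs should show where adding machines stops paying off.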