Tight-LAM-Integration-Notes
From GridWiki
Background
Author --Chris Dagdigian 17:34, 7 June 2006 (EDT)
This page documents the experience of following the process outlined here:
"Loose and tight integration of the LAM/MPI library into SGE" ... with the goal of trying to achieve tight LAM-MPI integration on a 12-node Linux cluster running SGE-6.0u8
People completely new to "Loose" vs "Tight" Parallel Environment integration may want to read this article: "Parallel Environments (PEs) - Loose vs. Tight Integration"
Get, patch, build and install the LAM-MPI software
The wget utility is used to download the LAM-MPI source code:
# wget http://www.lam-mpi.org/download/files/lam-7.1.2.tar.gz
Uncompress and open the tar archive:
# zcat lam-7.1.2.tar.gz | tar xvf -
Discover that the LAM-MPI 7.1.2 source code has been updated so that it does not need the patch that Reuti recommends for avoiding a race condition within hboot. The existing HOWTO recommends manually patching the hboot.c code to add the "setsid()" line but this has already been done within the lam-7.1.2 distribution.
The code in lam-7.1.2/tools/hboot/hboot.c now looks like this with no changes necessary:
else if (pid == 0) { /* child */
/* Put this setsid() here mainly for SGE --
their tight integration with LAM/MPI does
something like this:
lamboot -> qrsh (to a remote node) -> hboot
-> qrsh -> lamd
Without having a setsid() here, there is a
race condition between when hboot quits and
SGE thinks the job is over (and therefore
starts killing things) and when the second
qrsh is able to establish itself and/or the
lamd and tell SGE that the job is, in fact,
*not* over. So putting a setsid() here in
the child, then the hboot child (and
therefore the vulnerable period of the 2nd
qrsh) escape being killed by SGE while
still making progress on the overall
lamboot.
*/
setsid();
This is excellent news and means that we can build LAM-MPI from source with no patching or file editing required for tight Grid Engine integration.
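If you are working from a different LAM-MPI release, it may be worth confirming that the call is really present before skipping the patch. The following is only a minimal sketch: the function name has_setsid is made up for this example, and the path assumes the tarball was unpacked into the current directory.

```shell
# has_setsid FILE: succeed if FILE contains setsid() as a statement.
# Anchoring at the start of the line avoids matching the mentions of
# setsid() inside the explanatory comment.
# (Hypothetical helper, not part of LAM-MPI or the HOWTO.)
has_setsid() {
    grep -q '^[[:space:]]*setsid();' "$1" 2>/dev/null
}

# Assumed location of the unpacked source tree.
src=./lam-7.1.2/tools/hboot/hboot.c
if has_setsid "$src"; then
    echo "setsid() call present in $src"
else
    echo "setsid() call not found in $src (or file missing)"
fi
```

If the call is not found, fall back to the manual hboot.c patch described in the HOWTO.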
Decide how you want to build LAM-MPI. In my case there are a few things about my environment and preferred cluster setup that will affect how I'm going to build the LAM-MPI binaries for tight integration:
- My network interconnect is nothing fancy, just plain Gigabit Ethernet. No exotic communication required.
- I don't have a Fortran compiler installed and don't care about Fortran support in my LAM-MPI environment.
- I want to build and install these files into a dedicated location so that they do not interfere with existing installations of MPICH2 and a loosely integrated LAM-MPI installation. The path I've chosen to install into is "/opt/class/tight-lammpi", which is shared on all cluster nodes.
Time to build lam-7.1.2 from source and install it into its dedicated home:
cd ./lam-7.1.2
./configure --prefix=/opt/class/tight-lammpi --without-fc
make
make install
Note: Make sure you have not overridden the RSH binary command (by substituting passwordless SSH, for instance). Leave the RSH method alone both when compiling and when setting LAM-related environment variables. This allows us to invoke remote commands via SGE's 'qrsh' program instead. This is one reason why I have multiple LAM-MPI installations on this cluster: a different installation is for non-SGE use and is hard-coded to use "ssh -q" as the underlying transport mechanism. I made this mistake while writing this Wiki entry. The first time I compiled LAM I added the "--with-rsh=ssh" option to the configure command. The end result was a LAM installation that used direct SSH calls instead of the SGE 'qrsh' based methods required for tight integration with Grid Engine.
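On the environment-variable side, a quick check before submitting jobs can catch a leftover override. This sketch assumes that LAMRSH is the variable LAM consults for its remote shell agent (that is my understanding of LAM 7.x, not something stated in the HOWTO); the function name check_lamrsh is hypothetical.

```shell
# check_lamrsh: warn when LAMRSH is set in the current environment.
# Assumption: LAMRSH overrides LAM's remote shell agent, which would
# bypass the rsh wrapper that the SGE start script puts in place.
check_lamrsh() {
    if [ -n "${LAMRSH:-}" ]; then
        echo "warning: LAMRSH is set to '$LAMRSH'; unset it for tight integration"
        return 1
    fi
    echo "LAMRSH is not set"
}

check_lamrsh || true
```

This only inspects the current shell environment. It says nothing about what was passed to configure at build time.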
After building and installing, we now have a newly populated LAM-MPI folder:
[root@galaxy-demo lam-7.1.2]# ls -l /opt/class/tight-lammpi/
total 24
drwxr-xr-x 2 root root 4096 Jun 7 14:42 bin
drwxr-xr-x 2 root root 4096 Jun 7 14:42 etc
drwxr-xr-x 3 root root 4096 Jun 7 14:42 include
drwxr-xr-x 3 root root 4096 Jun 7 14:42 lib
drwxr-xr-x 8 root root 4096 Jun 7 14:42 man
drwxr-xr-x 3 root root 4096 Jun 7 14:42 share
Get and install the helper scripts mentioned in the HOWTO
The online HOWTO linked at the top of this page includes a download link for all of the scripts mentioned. We want to download those scripts and make them available. A sensible location would be somewhere inside the custom location where the newly built LAM-MPI binaries and files were installed. In this example, that directory was "/opt/class/tight-lammpi/":
# cd /opt/class/tight-lammpi
# mkdir helper-scripts
# cd ./helper-scripts
# wget http://gridengine.sunsource.net/howto/lam-integration/sge-lam-integration-scripts.tgz
# zcat sge-lam-integration-scripts.tgz | tar xvf -
# ls -l
total 24
-rwxr-xr-x 1 502 users 577 Feb 22 2005 lamd_wrapper
drwxr-xr-x 2 502 users 4096 Feb 24 2005 lam_loose_qrsh
drwxr-xr-x 2 502 users 4096 Feb 20 2005 lam_loose_rsh
drwxr-xr-x 2 502 users 4096 Feb 24 2005 lam_tight_qrsh
-rw-r--r-- 1 502 users 358 Feb 22 2005 mpihello.c
-rw-r--r-- 1 root root 3059 Mar 26 2005 sge-lam-integration-scripts.tgz
Following the HOWTO's suggestions under "Additional changes to LAM/MPI", we install the wrapper script in place of our newly built lamd binary:
# cd /opt/class/tight-lammpi/helper-scripts
# mv /opt/class/tight-lammpi/bin/lamd /opt/class/tight-lammpi/bin/lamd_binary
# cp ./lamd_wrapper /opt/class/tight-lammpi/bin/
# chmod +x /opt/class/tight-lammpi/bin/lamd_wrapper
# cd /opt/class/tight-lammpi/bin
# ln -s lamd_wrapper lamd
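Since a mistake in this rename-and-symlink step is easy to miss, a small sanity check may help. This is a sketch only; verify_lamd_wrapper is a made-up name and the directory is the install prefix used on this page.

```shell
# verify_lamd_wrapper BINDIR: check that BINDIR holds an executable
# lamd_binary and lamd_wrapper, and that lamd is a symbolic link.
# (Hypothetical helper, not shipped with the HOWTO scripts.)
verify_lamd_wrapper() {
    bindir=$1
    [ -x "$bindir/lamd_binary" ] || { echo "missing or not executable: $bindir/lamd_binary"; return 1; }
    [ -x "$bindir/lamd_wrapper" ] || { echo "missing or not executable: $bindir/lamd_wrapper"; return 1; }
    [ -L "$bindir/lamd" ] || { echo "not a symlink: $bindir/lamd"; return 1; }
    echo "ok: $bindir/lamd -> $(readlink "$bindir/lamd")"
}

verify_lamd_wrapper /opt/class/tight-lammpi/bin || true
```

On a correctly prepared installation the last line of output should show lamd pointing at lamd_wrapper.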
Customize the startlam.sh and stoplam.sh scripts
The startlam.sh helper script that was downloaded needs to be adjusted to match where the helper scripts reside on the system. In startlam.sh, look at the lines of code around line 111:
rsh_wrapper=$SGE_ROOT/lam_tight_qrsh/rsh
rsh_wrapper=/home/reuti/lam_tight_qrsh/rsh
if [ ! -x $rsh_wrapper ]; then
echo "$me: can't execute $rsh_wrapper" >&2
echo " maybe it resides at a file system not available at this machine" >&2
exit 1
fi
Obviously you want to set "rsh_wrapper=" to point to the correct location on your system. If you look at the HOWTO scripts, all of them try to default to a location of "$SGE_ROOT/lam_tight_qrsh" so you may be able to get away with copying your files into this path.
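If you would rather script the edit than open the file by hand, something like the following could work. It is a sketch, not part of the HOWTO: set_rsh_wrapper is a hypothetical helper, and the example path in the comment assumes the rsh wrapper sits next to startlam.sh in the lam_tight_qrsh directory.

```shell
# set_rsh_wrapper SCRIPT NEWPATH: point every uncommented
# "rsh_wrapper=" assignment in SCRIPT at NEWPATH.
# A copy of the original is kept as SCRIPT.orig. Writing back through
# a redirect keeps the script's existing permissions.
# Limitation: NEWPATH must not contain "|" or "&" (sed metacharacters here).
set_rsh_wrapper() {
    script=$1
    newpath=$2
    cp "$script" "$script.orig" &&
    sed "s|^\([[:space:]]*\)rsh_wrapper=.*|\1rsh_wrapper=$newpath|" \
        "$script.orig" > "$script"
}

# Example (assumed layout, adjust to your system):
# set_rsh_wrapper \
#     /opt/class/tight-lammpi/helper-scripts/lam_tight_qrsh/startlam.sh \
#     /opt/class/tight-lammpi/helper-scripts/lam_tight_qrsh/rsh
```

Compare the result with the .orig copy before relying on it, since the stock script may contain more than one assignment.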
One other change may be required, right at the end of startlam.sh:
if [ -z "`which lamboot 2>/dev/null`" ] ; then
export PATH=/home/reuti/local/lam-7.1.1/bin:$PATH
fi
lamboot $machines
The PATH line will need to be edited to reflect where LAM-MPI was installed on the local system.
A similar minor check/edit is needed on stoplam.sh:
export PATH=/home/reuti/local/lam-7.1.1/bin:$PATH
The PATH setting will need to be altered to reflect the local installation of LAM-MPI.
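To make sure no reference to the HOWTO author's home directory was missed in either script, a recursive grep over the helper-scripts directory is a cheap check. A minimal sketch, with list_stale_paths as a made-up name:

```shell
# list_stale_paths DIR: print file:line:text for every remaining
# reference to /home/reuti under DIR. Exit status is non-zero when
# nothing is found. (Hypothetical helper.)
list_stale_paths() {
    grep -rn '/home/reuti' "$1" 2>/dev/null
}

scripts=/opt/class/tight-lammpi/helper-scripts
if [ -d "$scripts" ]; then
    if list_stale_paths "$scripts"; then
        echo "edit the lines listed above"
    else
        echo "no /home/reuti references found"
    fi
fi
```

Hits inside the lam_loose_* directories can be ignored if only the tight integration scripts are used.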
Set up the Grid Engine Parallel Environment
We need to make a few changes from the PE config shown in the HOWTO in order to create a PE suitable for use with Grid Engine 6. Slot count will be set to 12 since the cluster used in this example consists of 12 single-CPU server systems running Linux.
Using the command qconf -ap lam_tight_qrsh, a PE with the following configuration was created. Note how the PE explicitly calls the helper scripts that were downloaded and installed in the previous steps:
pe_name            lam_tight_qrsh
slots              12
user_lists         NONE
xuser_lists        NONE
start_proc_args    /opt/class/tight-lammpi/helper-scripts/lam_tight_qrsh/startlam.sh -catch_rsh $pe_hostfile
stop_proc_args     /opt/class/tight-lammpi/helper-scripts/lam_tight_qrsh/stoplam.sh
allocation_rule    $round_robin
control_slaves     TRUE
job_is_first_task  FALSE
urgency_slots      min
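The interactive qconf -ap opens an editor. As an alternative, the same definition can be written to a file and loaded with qconf -Ap, which is handy when the setup has to be repeated. The sketch below reuses the values from this page; write_pe_config and the temporary file location are my own choices for illustration.

```shell
# write_pe_config FILE: write the PE definition used on this page to FILE.
# The quoted 'EOF' keeps $pe_hostfile and $round_robin literal, which is
# what Grid Engine expects to see in the PE configuration.
write_pe_config() {
    cat > "$1" <<'EOF'
pe_name            lam_tight_qrsh
slots              12
user_lists         NONE
xuser_lists        NONE
start_proc_args    /opt/class/tight-lammpi/helper-scripts/lam_tight_qrsh/startlam.sh -catch_rsh $pe_hostfile
stop_proc_args     /opt/class/tight-lammpi/helper-scripts/lam_tight_qrsh/stoplam.sh
allocation_rule    $round_robin
control_slaves     TRUE
job_is_first_task  FALSE
urgency_slots      min
EOF
}

pe_file=${TMPDIR:-/tmp}/lam_tight_qrsh.pe   # arbitrary scratch location
write_pe_config "$pe_file"
if command -v qconf >/dev/null 2>&1; then
    qconf -Ap "$pe_file"                     # add the PE from the file
else
    echo "qconf not found; PE definition left in $pe_file"
fi
```

Run this as a Grid Engine manager on a host where qconf is available, and adjust the slot count to your own cluster.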
To confirm the new PE exists, list the available Grid Engine parallel environments. The new PE should join the "make" PE which is configured by default:
# qconf -spl
lam_tight_qrsh
make
Associate the newly created Parallel Environment with a Grid Engine 6.x Cluster Queue
This is another change from the HOWTO. In SGE 5.x, the PE configuration itself listed the queues in which the PE was available. Grid Engine 6.x reverses this: the queue configuration now lists the PEs it supports, so we have to edit the queue to tell SGE that it can run jobs in the PE that was just created.
Issue the command qconf -mq all.q and add the "lam_tight_qrsh" PE to the existing pe_list parameter. This is how Grid Engine is told that the "all.q" cluster queue is capable of supporting jobs that request our newly created PE.
Verify the configuration change via the qconf -sq command:
# qconf -sq all.q | grep pe_list
pe_list make lam_tight_qrsh
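The same change can be made without an editor by using qconf -aattr, which appends a value to a list attribute. The sketch below only adds the PE when it is not already listed. The function queue_has_pe is a made-up helper, and it assumes the pe_list entry fits on one line of the qconf -sq output (long lists may be wrapped with a trailing backslash, which this does not handle).

```shell
# queue_has_pe PE: read "qconf -sq QUEUE" output on stdin and succeed
# if PE is one of the entries on the pe_list line.
queue_has_pe() {
    awk -v pe="$1" '
        $1 == "pe_list" { for (i = 2; i <= NF; i++) if ($i == pe) found = 1 }
        END { exit(found ? 0 : 1) }'
}

if command -v qconf >/dev/null 2>&1; then
    qconf -sq all.q | queue_has_pe lam_tight_qrsh ||
        qconf -aattr queue pe_list lam_tight_qrsh all.q
fi
```

Either way, rerun qconf -sq all.q afterwards and confirm that pe_list shows both PEs.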
Build a test MPI application
The helper scripts downloaded in a previous step contain a simple test program called "mpihello.c". The compiler script is the "mpiCC" script located in the bin/ directory of the new LAM-MPI installation:
$ cp /opt/class/tight-lammpi/helper-scripts/mpihello.c ./mpihello.c
$ /opt/class/tight-lammpi/bin/mpiCC ./mpihello.c
$ mv ./a.out ./mpihello
Do this as a non-root user as we don't want to run LAM programs as root. Also remember to build or at least copy the binary to a location that is shared cluster-wide. All the nodes need to be able to access the binary.
Just for kicks, try to run the program. You should see a standard "Hey! there is no LAM environment set up yet!" error message:
$ ./mpihello
-----------------------------------------------------------------------------
It seems that there is no lamd running on the host galaxy-demo.

This indicates that the LAM/MPI runtime environment is not operating.
The LAM/MPI runtime environment is necessary for MPI programs to run
(the MPI program tired to invoke the "MPI_Init" function).

Please run the "lamboot" command the start the LAM/MPI runtime
environment. See the LAM/MPI documentation for how to invoke
"lamboot" across multiple machines.
-----------------------------------------------------------------------------
Create a simple SGE job script
Note: Do not run jobs as root. We want to test and run parallel applications as normal, non-special users.
If we submit a script to our new PE, we know that all of the "LAM stuff" such as lamboot and lamhalt are going to be taken care of automatically by the configured PE start and stop scripts. This means our actual job script just has to call the correct mpirun program along with the parallel binary we are trying to launch:
#!/bin/sh
#$ -cwd
#$ -N MPIHELLO
/opt/class/tight-lammpi/bin/mpirun C ./mpihello
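As a variation, the PE request can be embedded in the job script itself with a "#$ -pe" line, so that a plain qsub is enough. The generator below is only a sketch: write_job_script is a hypothetical helper, and the PE name and mpirun path are the ones used on this page.

```shell
# write_job_script FILE SLOTS: write a job script that carries its own
# PE request for SLOTS slots and make it executable.
# The backslash in "#\$" keeps the SGE directive prefix literal inside
# the unquoted here-document, while $2 is expanded.
write_job_script() {
    cat > "$1" <<EOF
#!/bin/sh
#\$ -cwd
#\$ -N MPIHELLO
#\$ -pe lam_tight_qrsh $2
/opt/class/tight-lammpi/bin/mpirun C ./mpihello
EOF
    chmod +x "$1"
}

# Example: write_job_script ./mpi-tester.sh 10
```

A script generated this way would be submitted with "qsub ./mpi-tester.sh". A -pe option given on the qsub command line should still take precedence over the embedded request.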
Submit the job
[bioadmin@galaxy-demo ~/tight-mpi-test]$ qsub -pe lam_tight_qrsh 10 ./mpi-tester.sh
Your job 10 ("MPIHELLO") has been submitted
[bioadmin@galaxy-demo ~/tight-mpi-test]$
[bioadmin@galaxy-demo ~/tight-mpi-test]$
[bioadmin@galaxy-demo ~/tight-mpi-test]$ qstat -f
queuename qtype used/tot. load_avg arch states
----------------------------------------------------------------------------
all.q@n0000. BIP 0/1 5.08 lx24-x86 a
----------------------------------------------------------------------------
all.q@n0001 BIP 0/1 0.11 lx24-x86
----------------------------------------------------------------------------
all.q@n0002 BIP 1/1 0.06 lx24-x86
10 0.55500 MPIHELLO bioadmin r 06/07/2006 17:31:06 1
----------------------------------------------------------------------------
all.q@n0003 BIP 1/1 0.04 lx24-x86
10 0.55500 MPIHELLO bioadmin r 06/07/2006 17:31:06 1
----------------------------------------------------------------------------
all.q@n0004 BIP 1/1 0.03 lx24-x86
10 0.55500 MPIHELLO bioadmin r 06/07/2006 17:31:06 1
----------------------------------------------------------------------------
all.q@n0005 BIP 1/1 0.02 lx24-x86
10 0.55500 MPIHELLO bioadmin r 06/07/2006 17:31:06 1
----------------------------------------------------------------------------
all.q@n0006 BIP 1/1 0.02 lx24-x86
10 0.55500 MPIHELLO bioadmin r 06/07/2006 17:31:06 1
----------------------------------------------------------------------------
all.q@n0007 BIP 1/1 0.01 lx24-x86
10 0.55500 MPIHELLO bioadmin r 06/07/2006 17:31:06 1
----------------------------------------------------------------------------
all.q@n0008 BIP 1/1 0.03 lx24-x86
10 0.55500 MPIHELLO bioadmin r 06/07/2006 17:31:06 1
----------------------------------------------------------------------------
all.q@n0009 BIP 1/1 0.02 lx24-x86
10 0.55500 MPIHELLO bioadmin r 06/07/2006 17:31:06 1
----------------------------------------------------------------------------
all.q@n0010 BIP 1/1 0.02 lx24-x86
10 0.55500 MPIHELLO bioadmin r 06/07/2006 17:31:06 1
----------------------------------------------------------------------------
all.q@n0011 BIP 1/1 0.09 lx24-x86
10 0.55500 MPIHELLO bioadmin r 06/07/2006 17:31:06 1
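Rather than counting rows by eye, the slots granted to a job can be totalled from the qstat -f output. This is a rough sketch: count_job_slots is a made-up name, and it assumes the job lines look like the ones above, with the job ID in the first column and the slot count in the last (array jobs add a task-ID column and would need different handling).

```shell
# count_job_slots JOBID: read "qstat -f" output on stdin and print the
# total number of slots held by JOBID across all queue instances.
count_job_slots() {
    awk -v id="$1" '$1 == id && NF >= 8 { total += $NF } END { print total + 0 }'
}

if command -v qstat >/dev/null 2>&1; then
    qstat -f | count_job_slots 10    # 10 is the job ID from the qsub output
fi
```

For the listing above the total should equal the 10 slots requested with "-pe lam_tight_qrsh 10".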
Signs of success
Use the command "ps -e f -o pid,ppid,pgrp,command --cols=80" to check the process table on a node running a parallel task. With tight integration working, you should see something like the following:
  789     1   789 /export/home/bigcluster/sge/bin/lx24-x86/sge_execd
 1206   789  1206  \_ sge_shepherd-8 -bg
 1207  1206  1207      \_ /export/home/bigcluster/sge/utilbin/lx24-x86/rshd -l
 1209  1207  1209          \_ /export/home/bigcluster/sge/utilbin/lx24-x86/qrsh_start
 1252  1209  1252              \_ tcsh -c lamd_binary -H 198.18.0.11 -P 32789 -n 5 -o
 1295  1252  1252                  \_ lamd_binary -H 198.18.0.11 -P 32789 -n 5 -o 0 -
 1296  1295  1252                      \_ ./mpihello
 1204     1  1204 /export/home/bigcluster/sge/bin/lx24-x86/qrsh -V -inherit -nostdin
 1208  1204  1204  \_ /export/home/bigcluster/sge/utilbin/lx24-x86/rsh -n -p 32791 n0
The above output shows successful tight integration. All of the LAM related programs including the lamd_binary and our "mpihello" program are running under the control of the Grid Engine sge_shepherd daemon.
Signs of failure
LAM seems to be pretty chatty when installed as outlined in this document, and the SGE start and stop scripts are also pretty verbose. Chances are that if the parallel program fails to run at all, there will be a pretty clear indication as to what happened. Just check all of the ".o", ".e", ".po" and ".pe" files that Grid Engine creates by default.
Subtle signs of failure
The following example shows the test program running OK to completion, but not under the complete control of Grid Engine.
Use the command "ps -e f -o pid,ppid,pgrp,command --cols=80" to check the process table on a node running a parallel task. If you see something like the following, tight integration is not working:
  795     1   795 /export/home/bigcluster/sge/bin/lx24-x86/sge_execd
  892   795   892  \_ sge_shepherd-6 -bg
  963   892   963      \_ -sh /export/home/bigcluster/sge/default/spool/n0008/jo
  964   963   963          \_ mpirun C ./mpihello
  952     1   952 /opt/class/tight-lammpi/bin/lamd -H 198.18.0.9 -P 32779 -n 0 -
  965   952   952  \_ ./mpihello
The above example shows that the LAM daemon binary ("lamd") is not a child process of the Grid Engine sge_shepherd daemon. This means that tight integration has failed, as it is clear that LAM is running in its own little world. What we want to see is "lamd" running under the control and supervision of an SGE daemon. This may be similar to output seen in a loose LAM-MPI environment where the lamboot program launches the LAM daemons independently of Grid Engine.
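The difference between the two process listings comes down to whether every lamd process has an sge_shepherd somewhere among its ancestors. That test can be automated by walking the PID/PPID chain. The following is a sketch under a few assumptions: lamd_integration_status is a made-up name, the input is ps output whose first two columns are PID and PPID, and the daemon shows up as "lamd" or "lamd_binary" in the command column.

```shell
# lamd_integration_status: read ps output (PID, PPID, ..., command) on
# stdin and print one of:
#   tight  every lamd process descends from an sge_shepherd
#   loose  at least one lamd process does not
#   none   nothing that looks like lamd was seen
lamd_integration_status() {
    awk '
        $1 !~ /^[0-9]+$/ { next }
        { parent[$1] = $2; cmd[$1] = $0 }
        END {
            daemon = "[/ ]lamd(_binary)?( |$)"
            result = "none"
            for (p in cmd) {
                if (cmd[p] !~ daemon) continue
                if (result == "none") result = "tight"
                q = p; hops = 0; ok = 0
                while ((q in cmd) && hops++ < 64) {
                    if (cmd[q] ~ /sge_shepherd/) { ok = 1; break }
                    q = parent[q]
                }
                if (!ok) result = "loose"
            }
            print result
        }'
}

# Example, to be run on a compute node while the job is active:
# ps -e -o pid,ppid,pgrp,command | lamd_integration_status
```

Fed the first listing on this page it should report tight, and fed the second it should report loose. The hop limit only guards against a malformed table with a parent loop.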
