Author --Chris Dagdigian 17:34, 7 June 2006 (EDT)

This page documents the experience of following the process outlined here:

"Loose and tight integration of the LAM/MPI library into SGE" ... with the goal of trying to achieve tight LAM-MPI integration on a 12-node Linux cluster running SGE-6.0u8

People completely new to "Loose" vs "Tight" Parallel Environment integration may want to read this article: "Parallel Environments (PEs) - Loose vs. Tight Integration"

Get, patch, build and install the LAM-MPI software

The wget utility is used to download the LAM-MPI source code:

# wget http://www.lam-mpi.org/download/files/lam-7.1.2.tar.gz

Uncompress and open the tar archive:

# zcat lam-7.1.2.tar.gz | tar xvf -

It turns out that the LAM-MPI 7.1.2 source code has already been updated, so it does not need the patch that Reuti recommends for avoiding a race condition within hboot. The existing HOWTO recommends manually patching hboot.c to add the "setsid()" line, but this has already been done within the lam-7.1.2 distribution.

The code in lam-7.1.2/tools/hboot/hboot.c now looks like this with no changes necessary:

                else if (pid == 0) {            /* child */
                        /* Put this setsid() here mainly for SGE --
                           their tight integration with LAM/MPI does
                           something like this:

                           lamboot -> qrsh (to a remote node) -> hboot
                           -> qrsh -> lamd

                           Without having a setsid() here, there is a
                           race condition between when hboot quits and
                           SGE thinks the job is over (and therefore
                           starts killing things) and when the second
                           qrsh is able to establish itself and/or the
                           lamd and tell SGE that the job is, in fact,
                           *not* over.  So putting a setsid() here in
                           the child, then the hboot child (and
                           therefore the vulnerable period of the 2nd
                           qrsh) escape being killed by SGE while
                           still making progress on the overall

This is excellent news and means that we can build LAM-MPI from source with no patching or file editing required for tight Grid Engine integration.
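
For other LAM-MPI releases, a quick textual check can show whether the fix is already in the source before building. This is only a minimal sketch: the function name hboot_has_setsid is made up for this page, and the check merely looks for "setsid(" anywhere in the file (a comment mentioning it would also match), so it is worth reading the surrounding code as well.

```shell
#!/bin/sh
# Hypothetical helper: succeed if the given source file mentions setsid().
# This is a textual check only; it does not prove the call sits in the
# child branch of hboot.c.
hboot_has_setsid() {
    grep -q 'setsid[[:space:]]*(' "$1"
}

# Example usage (path assumes the tarball was unpacked in the current directory):
src=lam-7.1.2/tools/hboot/hboot.c
if [ -f "$src" ]; then
    if hboot_has_setsid "$src"; then
        echo "setsid() already present, no patch needed"
    else
        echo "setsid() not found, apply the patch from the HOWTO"
    fi
fi
```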

Decide how you want to build LAM-MPI. In my case there are a few things about my environment and preferred cluster setup that will affect how I'm going to build the LAM-MPI binaries for tight integration:

  1. My network interconnect is nothing fancy, just plain Gigabit Ethernet. No exotic communication required.
  2. I don't have a Fortran compiler installed and don't care about Fortran support in my LAM-MPI environment
  3. I want to build and install these files into a dedicated location so that they do not interfere with existing installations of MPICH2 and a loosely integrated LAM-MPI installation. The path I've chosen to install into will be called "/opt/class/tight-lammpi" - this path is shared on all cluster nodes.

Time to build lam-7.1.2 from source and install it into its dedicated home:

cd ./lam-7.1.2
./configure --prefix=/opt/class/tight-lammpi --without-fc
make install

Note: Make sure you have not overridden the RSH binary command (by substituting passwordless SSH, for instance). Leave the RSH method alone both when compiling and when setting LAM-related environment variables. This allows remote commands to be invoked via SGE's 'qrsh' program instead. This is one reason why I have multiple LAM-MPI installations on this cluster: a different installation is for non-SGE use and is hard-coded to use "ssh -q" as the underlying transport mechanism. I made this mistake while writing this Wiki entry: the first time I compiled LAM, I added the "--with-rsh=ssh" option to the configure command. The end result was a LAM installation that used direct SSH calls instead of the SGE 'qrsh' based methods required for tight integration with Grid Engine.
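
On the environment side, LAM reads the LAMRSH variable to override its remote-start command, so a leftover setting from another installation is worth checking for. The sketch below is only an illustration: the function name check_lamrsh is invented here, and it does not detect a "--with-rsh" value compiled into the binaries.

```shell
#!/bin/sh
# Hypothetical pre-flight check: report whether LAMRSH would override
# the remote-start command that LAM uses by default.
check_lamrsh() {
    if [ -n "${LAMRSH:-}" ]; then
        echo "LAMRSH is set to '$LAMRSH'; unset it for tight integration"
        return 1
    fi
    echo "LAMRSH is not set"
    return 0
}

# Example usage, e.g. in a login shell before submitting jobs:
#   check_lamrsh || unset LAMRSH
```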

After building and installing, we now have a newly populated LAM-MPI folder:

[root@galaxy-demo lam-7.1.2]# ls -l /opt/class/tight-lammpi/
total 24
drwxr-xr-x  2 root root 4096 Jun  7 14:42 bin
drwxr-xr-x  2 root root 4096 Jun  7 14:42 etc
drwxr-xr-x  3 root root 4096 Jun  7 14:42 include
drwxr-xr-x  3 root root 4096 Jun  7 14:42 lib
drwxr-xr-x  8 root root 4096 Jun  7 14:42 man
drwxr-xr-x  3 root root 4096 Jun  7 14:42 share
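
A directory listing does not show whether the individual tools made it into bin/. The following sketch checks for the programs used later on this page; the function name check_lam_install and the exact list of tools are my own choices, not something the HOWTO prescribes.

```shell
#!/bin/sh
# Hypothetical helper: verify that the tools used later on this page exist
# and are executable under PREFIX/bin. Prints each missing tool and
# returns non-zero if any is missing.
check_lam_install() {
    prefix=$1
    missing=0
    for tool in lamboot lamhalt lamd mpirun mpicc; do
        if [ ! -x "$prefix/bin/$tool" ]; then
            echo "missing: $prefix/bin/$tool"
            missing=1
        fi
    done
    return $missing
}

# Example usage:
#   check_lam_install /opt/class/tight-lammpi && echo "install looks complete"
```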

Get and install the helper scripts mentioned in the HOWTO

The online HOWTO linked at the top of this page includes a download link for all of the scripts mentioned. We want to download those scripts and make them available. A sensible location would be somewhere inside the custom location where the newly built LAM-MPI binaries and files were installed. In this example, that directory was "/opt/class/tight-lammpi/":

# cd /opt/class/tight-lammpi
# mkdir helper-scripts
# cd ./helper-scripts
# wget http://gridengine.sunsource.net/howto/lam-integration/sge-lam-integration-scripts.tgz
# zcat sge-lam-integration-scripts.tgz | tar xvf -
# ls -l
total 24
-rwxr-xr-x  1  502 users  577 Feb 22  2005 lamd_wrapper
drwxr-xr-x  2  502 users 4096 Feb 24  2005 lam_loose_qrsh
drwxr-xr-x  2  502 users 4096 Feb 20  2005 lam_loose_rsh
drwxr-xr-x  2  502 users 4096 Feb 24  2005 lam_tight_qrsh
-rw-r--r--  1  502 users  358 Feb 22  2005 mpihello.c
-rw-r--r--  1 root root  3059 Mar 26  2005 sge-lam-integration-scripts.tgz

Following the HOWTO's suggestions for Additional changes to LAM/MPI we install the wrapper script in place of our newly built binary:

# cd /opt/class/tight-lammpi/helper-scripts
# mv /opt/class/tight-lammpi/bin/lamd /opt/class/tight-lammpi/bin/lamd_binary
# cp ./lamd_wrapper /opt/class/tight-lammpi/bin/
# chmod +x /opt/class/tight-lammpi/bin/lamd_wrapper
# cd /opt/class/tight-lammpi/bin
# ln -s lamd_wrapper lamd
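
Since a mistake in this step is easy to miss, the resulting layout can be verified with a few file tests. This is a minimal sketch assuming the three names used above (lamd, lamd_wrapper, lamd_binary); the function name check_lamd_wrapper is hypothetical, and readlink is assumed to be available, as it is on typical Linux systems.

```shell
#!/bin/sh
# Hypothetical helper: confirm BINDIR/lamd is a symlink to lamd_wrapper and
# that the renamed real daemon and the wrapper are both executable.
check_lamd_wrapper() {
    bindir=$1
    [ -x "$bindir/lamd_binary" ] || { echo "lamd_binary missing or not executable"; return 1; }
    [ -x "$bindir/lamd_wrapper" ] || { echo "lamd_wrapper missing or not executable"; return 1; }
    [ -L "$bindir/lamd" ] || { echo "lamd is not a symlink"; return 1; }
    target=$(readlink "$bindir/lamd")
    [ "$target" = "lamd_wrapper" ] || { echo "lamd points to $target"; return 1; }
    echo "wrapper layout OK"
}

# Example usage:
#   check_lamd_wrapper /opt/class/tight-lammpi/bin
```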

Customize the startlam.sh and stoplam.sh scripts

The startlam.sh helper script that was downloaded needs to be adjusted to match where the helper scripts reside on the system. For startlam.sh look at the code around line 111:

   if [ ! -x $rsh_wrapper ]; then
      echo "$me: can't execute $rsh_wrapper" >&2
      echo "     maybe it resides at a file system not available at this machine" >&2
      exit 1
   fi

Set "rsh_wrapper=" to point to the correct location on your system. All of the HOWTO scripts default to a location of "$SGE_ROOT/lam_tight_qrsh", so you may be able to get away with copying your files into that path.

One other change may be required, right at the end of startlam.sh:

if [ -z "`which lamboot 2>/dev/null`" ] ; then
    export PATH=/home/reuti/local/lam-7.1.1/bin:$PATH
fi
lamboot $machines

The PATH line will need to be edited to reflect where LAM-MPI was installed on the local system.

A similar minor check/edit is needed on stoplam.sh:

export PATH=/home/reuti/local/lam-7.1.1/bin:$PATH

The PATH setting will need to be altered to reflect the local installation of the LAM-MPI setup.
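
Both PATH edits can be made by hand in an editor, or scripted. The sketch below uses sed for this; the function name set_lam_bin is made up for illustration, and it assumes the old directory appears in the script exactly as "PATH=<old>:" as in the excerpts above.

```shell
#!/bin/sh
# Hypothetical helper: rewrite the hard-coded LAM bin directory in a script.
# OLD_BIN and NEW_BIN are directory paths; '|' is the sed delimiter, so the
# paths must not contain '|'. A '.' in OLD_BIN matches any character, which
# is harmless for paths like the ones on this page.
set_lam_bin() {
    script=$1
    old_bin=$2
    new_bin=$3
    tmp=$(mktemp) || return 1
    sed "s|PATH=$old_bin:|PATH=$new_bin:|" "$script" > "$tmp" &&
        cat "$tmp" > "$script"    # writing through cat keeps the script's permissions
    rc=$?
    rm -f "$tmp"
    return $rc
}

# Example usage:
#   cd /opt/class/tight-lammpi/helper-scripts/lam_tight_qrsh
#   set_lam_bin startlam.sh /home/reuti/local/lam-7.1.1/bin /opt/class/tight-lammpi/bin
#   set_lam_bin stoplam.sh  /home/reuti/local/lam-7.1.1/bin /opt/class/tight-lammpi/bin
```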

Set up the Grid Engine Parallel Environment

We need to make a few changes from the PE config shown in the HOWTO in order to create a PE suitable for use with Grid Engine 6. Slot count will be set to 12 since the cluster used in this example consists of 12 single-CPU server systems running Linux.

Using the command qconf -ap lam_tight_qrsh, a PE with the following configuration was created. Note how the PE explicitly calls the helper scripts that were downloaded and installed in the previous step:

pe_name           lam_tight_qrsh
slots             12
user_lists        NONE
xuser_lists       NONE
start_proc_args   /opt/class/tight-lammpi/helper-scripts/lam_tight_qrsh/startlam.sh -catch_rsh $pe_hostfile
stop_proc_args    /opt/class/tight-lammpi/helper-scripts/lam_tight_qrsh/stoplam.sh
allocation_rule   $round_robin
control_slaves    TRUE
job_is_first_task FALSE
urgency_slots     min

To confirm the new PE exists, list the available Grid Engine parallel environments. The new PE should join the "make" PE which is configured by default:

# qconf -spl
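
For scripted setups the same check can be automated. "qconf -spl" prints one PE name per line, so an exact whole-line match is enough; the function name pe_in_list below is only an illustration.

```shell
#!/bin/sh
# Hypothetical helper: read a PE list (one name per line, as printed by
# "qconf -spl") on stdin and succeed if NAME is present as a whole line.
pe_in_list() {
    grep -qxF -- "$1"
}

# Example usage on a host with the SGE commands in PATH:
#   qconf -spl | pe_in_list lam_tight_qrsh && echo "PE is defined"
```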

Associate the newly created Parallel Environment with a Grid Engine 6.x Cluster Queue

This is another change that differs from the HOWTO. In SGE 5.x, the PE configuration itself listed the queues in which the PE was available. Grid Engine 6.x reverses this: the queue configuration must now be edited to tell SGE that the queue supports the PE that was just created.

Issue the command qconf -mq all.q and add the "lam_tight_qrsh" PE to the existing pe_list parameter. This is how Grid Engine is told that the "all.q" cluster queue is capable of supporting jobs that request our newly created PE.

Verify the configuration change via the qconf -sq command:

# qconf -sq all.q | grep pe_list
pe_list               make lam_tight_qrsh
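
A plain grep for pe_list shows the line but does not test for one particular PE. The sketch below does that with awk. The function name queue_has_pe is hypothetical, and it assumes the simple space-separated, single-line pe_list format shown above; host-group specific entries or wrapped lines are not handled.

```shell
#!/bin/sh
# Hypothetical helper: read "qconf -sq QUEUE" output on stdin and succeed
# if the pe_list line contains the PE named by $1 as a separate word.
queue_has_pe() {
    awk -v pe="$1" '
        $1 == "pe_list" {
            for (i = 2; i <= NF; i++)
                if ($i == pe) found = 1
        }
        END { exit (found ? 0 : 1) }
    '
}

# Example usage:
#   qconf -sq all.q | queue_has_pe lam_tight_qrsh && echo "all.q supports the PE"
```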

Build a test MPI application

The helper scripts downloaded in a previous step contain a simple test program called "mpihello.c". The compiler script is the "mpiCC" script located in the bin/ directory of the new LAM-MPI installation:

$ cp /opt/class/tight-lammpi/helper-scripts/mpihello.c ./mpihello.c
$ /opt/class/tight-lammpi/bin/mpiCC ./mpihello.c
$ mv ./a.out ./mpihello

Do this as a non-root user as we don't want to run LAM programs as root. Also remember to build or at least copy the binary to a location that is shared cluster-wide. All the nodes need to be able to access the binary.

Just for kicks, try to run the program. You should see a standard "Hey! there is no LAM environment set up yet!" error message:

$ ./mpihello 

It seems that there is no lamd running on the host galaxy-demo.

This indicates that the LAM/MPI runtime environment is not operating.
The LAM/MPI runtime environment is necessary for MPI programs to run
(the MPI program tired to invoke the "MPI_Init" function).

Please run the "lamboot" command the start the LAM/MPI runtime
environment.  See the LAM/MPI documentation for how to invoke
"lamboot" across multiple machines.

Create a simple SGE job script

Note: Do not run jobs as root. We want to test and run parallel applications as normal, non-special users.

If we submit a script to our new PE, we know that all of the "LAM stuff" such as lamboot and lamhalt are going to be taken care of automatically by the configured PE start and stop scripts. This means our actual job script just has to call the correct mpirun program along with the parallel binary we are trying to launch:


#$ -cwd

/opt/class/tight-lammpi/bin/mpirun C ./mpihello
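
A slightly fuller version of such a job script might look like the sketch below, wrapped in a small function that writes it to a file. Several things here are assumptions rather than a copy of the script I actually used: the function name write_job_script, the "#!/bin/sh" line, and the "-N MPIHELLO" line (inferred from the job name that qsub reports). The $NSLOTS variable is set by Grid Engine for parallel jobs.

```shell
#!/bin/sh
# Hypothetical helper: write a job script for the lam_tight_qrsh PE to the
# file named by $1 and make it executable.
write_job_script() {
    cat > "$1" <<'EOF'
#!/bin/sh
#$ -N MPIHELLO
#$ -cwd

# NSLOTS is set by Grid Engine to the number of slots granted to this job.
echo "Running on $NSLOTS slots"

# lamboot and lamhalt are handled by the PE start and stop scripts.
/opt/class/tight-lammpi/bin/mpirun C ./mpihello
EOF
    chmod +x "$1"
}

# Example usage:
#   write_job_script ./mpi-tester.sh
```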

Submit the job

[bioadmin@galaxy-demo ~/tight-mpi-test]$ qsub -pe lam_tight_qrsh 10 ./mpi-tester.sh
Your job 10 ("MPIHELLO") has been submitted
[bioadmin@galaxy-demo ~/tight-mpi-test]$ 
[bioadmin@galaxy-demo ~/tight-mpi-test]$ 
[bioadmin@galaxy-demo ~/tight-mpi-test]$ qstat -f
queuename                      qtype used/tot. load_avg arch          states
all.q@n0000.                   BIP   0/1       5.08     lx24-x86      a
all.q@n0001                    BIP   0/1       0.11     lx24-x86      
all.q@n0002                    BIP   1/1       0.06     lx24-x86      
     10 0.55500 MPIHELLO   bioadmin     r     06/07/2006 17:31:06     1        
all.q@n0003                    BIP   1/1       0.04     lx24-x86      
     10 0.55500 MPIHELLO   bioadmin     r     06/07/2006 17:31:06     1        
all.q@n0004                    BIP   1/1       0.03     lx24-x86      
     10 0.55500 MPIHELLO   bioadmin     r     06/07/2006 17:31:06     1        
all.q@n0005                    BIP   1/1       0.02     lx24-x86      
     10 0.55500 MPIHELLO   bioadmin     r     06/07/2006 17:31:06     1        
all.q@n0006                    BIP   1/1       0.02     lx24-x86      
     10 0.55500 MPIHELLO   bioadmin     r     06/07/2006 17:31:06     1        
all.q@n0007                    BIP   1/1       0.01     lx24-x86      
     10 0.55500 MPIHELLO   bioadmin     r     06/07/2006 17:31:06     1        
all.q@n0008                    BIP   1/1       0.03     lx24-x86      
     10 0.55500 MPIHELLO   bioadmin     r     06/07/2006 17:31:06     1        
all.q@n0009                    BIP   1/1       0.02     lx24-x86      
     10 0.55500 MPIHELLO   bioadmin     r     06/07/2006 17:31:06     1        
all.q@n0010                    BIP   1/1       0.02     lx24-x86      
     10 0.55500 MPIHELLO   bioadmin     r     06/07/2006 17:31:06     1        
all.q@n0011                    BIP   1/1       0.09     lx24-x86      
     10 0.55500 MPIHELLO   bioadmin     r     06/07/2006 17:31:06     1

Signs of success

Use the command "ps -e f -o pid,ppid,pgrp,command --cols=80" to check the process table on a node running a parallel task. You should see something like the following:

  789     1   789 /export/home/bigcluster/sge/bin/lx24-x86/sge_execd
 1206   789  1206  \_ sge_shepherd-8 -bg
 1207  1206  1207      \_ /export/home/bigcluster/sge/utilbin/lx24-x86/rshd -l
 1209  1207  1209          \_ /export/home/bigcluster/sge/utilbin/lx24-x86/qrsh_start
 1252  1209  1252              \_ tcsh -c lamd_binary -H -P 32789 -n 5 -o
 1295  1252  1252                  \_ lamd_binary -H -P 32789 -n 5 -o 0 -
 1296  1295  1252                      \_ ./mpihello
 1204     1  1204 /export/home/bigcluster/sge/bin/lx24-x86/qrsh -V -inherit -nostdin 
 1208  1204  1204  \_ /export/home/bigcluster/sge/utilbin/lx24-x86/rsh -n -p 32791 n0

The above output shows successful tight integration. All of the LAM related programs including the lamd_binary and our "mpihello" program are running under the control of the Grid Engine sge_shepherd daemon.

Signs of failure

LAM is fairly chatty when installed as outlined in this document, and the SGE start and stop scripts are also verbose. If the parallel program fails to run at all, there will most likely be a clear indication of what happened. Check all of the ".o", ".e", ".po" and ".pe" files that Grid Engine creates by default.
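
By default Grid Engine names these files after the job name and job ID, for example MPIHELLO.o10 for the job submitted above. The small sketch below just prints the four expected names so they can be passed to ls or cat; the function name job_output_files is made up for this page.

```shell
#!/bin/sh
# Hypothetical helper: print the default SGE output file names for a
# parallel job, given its job name and job ID.
job_output_files() {
    name=$1
    id=$2
    for ext in o e po pe; do
        echo "$name.$ext$id"
    done
}

# Example usage, run in the directory the job was submitted from (-cwd):
#   cat $(job_output_files MPIHELLO 10)
```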

Subtle signs of failure

The following example shows the test program running OK to completion, but not under the complete control of Grid Engine.

Use the command "ps -e f -o pid,ppid,pgrp,command --cols=80" to check the process table on a node running a parallel task. Output like the following indicates the problem:

  795     1   795 /export/home/bigcluster/sge/bin/lx24-x86/sge_execd
  892   795   892  \_ sge_shepherd-6 -bg
  963   892   963      \_ -sh /export/home/bigcluster/sge/default/spool/n0008/jo
  964   963   963          \_ mpirun C ./mpihello
  952     1   952 /opt/class/tight-lammpi/bin/lamd -H -P 32779 -n 0 -
  965   952   952  \_ ./mpihello

The above example shows that the LAM daemon binary ("lamd") is not a child process of the Grid Engine sge_shepherd daemon. This means that tight integration has failed: LAM is running in its own little world. What we want to see is "lamd" running under the control and supervision of an SGE daemon. Similar output may be seen in a loose LAM-MPI environment, where the lamboot program launches the LAM daemons independently of Grid Engine.
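
The difference between the two process listings above can also be checked mechanically by walking each lamd process up its parent chain and looking for an sge_shepherd. The following is a rough sketch, not part of the HOWTO: the function name lamd_under_shepherd is invented, and it expects plain "ps -e -o pid,ppid,command" output (with the header line and without the "f" forest option) on standard input.

```shell
#!/bin/sh
# Hypothetical helper: read "ps -e -o pid,ppid,command" output on stdin and
# report, for every lamd or lamd_binary process, whether an sge_shepherd
# process is among its ancestors. Succeeds only if at least one such
# process was found and all of them are under a shepherd.
lamd_under_shepherd() {
    awk '
        NR > 1 {                       # skip the ps header line
            ppid[$1] = $2
            cmd[$1]  = $3              # first word of the command line
        }
        END {
            bad = 0; seen = 0
            for (p in cmd) {
                n = split(cmd[p], parts, "/")
                base = parts[n]
                if (base != "lamd" && base != "lamd_binary") continue
                seen = 1
                ok = 0
                # Walk up the parent chain until init (pid 1) or an unknown pid.
                for (q = ppid[p]; (q in cmd) && q != 1; q = ppid[q])
                    if (cmd[q] ~ /sge_shepherd/) { ok = 1; break }
                printf "%s pid %s: %s\n", base, p, (ok ? "tight" : "NOT under sge_shepherd")
                if (!ok) bad = 1
            }
            exit ((seen && !bad) ? 0 : 1)
        }
    '
}

# Example usage on a node running a parallel task:
#   ps -e -o pid,ppid,command | lamd_under_shepherd
```

A non-zero exit status means either that no lamd process was found or that at least one is running outside of Grid Engine's control.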