Tight-HP-MPI-Integration-Notes


Introduction

The integration has been tested with both GE 6.1 and 6.2.

  • GE 6.1 uses rsh:
    • Limitations:
      • There is a theoretical limit of 256 connections, but our tests indicate that only one connection is opened to each slave node per job.
      • No prompt in the session: we are not aware of any problem this causes in batch jobs.
  • GE 6.2 uses an internal communication protocol:
    • Limitations:
      • No prompt
      • No X11 forwarding

HP-MPI Installation

We install HP-MPI in a shared directory (/opt/cesga) so all nodes can access it:

 cn141 # rpm -ivh --test --prefix /opt/cesga/hp-mpi-2.3/ hpmpi-2.03.00.00-20081120r.ia64.rpm 
 Preparing...                ########################################### [100%]
 cn141 # rpm -ivh --prefix /opt/cesga/hp-mpi-2.3/ hpmpi-2.03.00.00-20081120r.ia64.rpm 
 Preparing...                ########################################### [100%]
    1:hpmpi                  ########################################### [100%]

A local installation on each node is also possible, as sketched below.
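
As a sketch, a local installation could be scripted over the compute nodes; the node names and the local prefix below are examples only, and the rpm is assumed to be reachable from every node (e.g. on a shared filesystem):

#!/bin/bash
# Hypothetical per-node installation loop: node names, prefix and rpm path are examples
for node in cn001 cn002 cn003; do
    ssh "$node" rpm -ivh --prefix /opt/hp-mpi-2.3/ \
        /shared/rpms/hpmpi-2.03.00.00-20081120r.ia64.rpm
done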

GE Configuration

Set up the Grid Engine Parallel Environments (PEs) using the command qconf -ap <pe_name>, e.g. qconf -ap mpi (a non-interactive alternative is sketched after the list below):

  • mpi: Basic PE using $fill_up allocation rule (all slots allocated in a node before moving to the next one)
pe_name           mpi
slots             9999
user_lists        NONE
xuser_lists       NONE
start_proc_args   <YOUR_SGE_ROOT>/mpi/startmpi.sh -catch_rsh \
                  $pe_hostfile
stop_proc_args    <YOUR_SGE_ROOT>/mpi/stopmpi.sh
allocation_rule   $fill_up
control_slaves    TRUE
job_is_first_task FALSE
urgency_slots     min
accounting_summary FALSE
  • mpi_rr: PE with $round_robin allocation (one slot per node if possible)
pe_name            mpi_rr
slots              9999
user_lists         NONE
xuser_lists        NONE
start_proc_args    <YOUR_SGE_ROOT>/mpi/startmpi.sh -catch_rsh \
                   $pe_hostfile
stop_proc_args     <YOUR_SGE_ROOT>/mpi/stopmpi.sh
allocation_rule    $round_robin
control_slaves     TRUE
job_is_first_task  FALSE
urgency_slots      min
accounting_summary FALSE
  • mpi_1p: 1 slot per node (mandatory)
pe_name            mpi_1p
slots              9999
user_lists         NONE
xuser_lists        NONE
start_proc_args    <YOUR_SGE_ROOT>/mpi/startmpi.sh -catch_rsh \
                   $pe_hostfile
stop_proc_args     <YOUR_SGE_ROOT>/mpi/stopmpi.sh
allocation_rule    1
control_slaves     TRUE
job_is_first_task  FALSE
urgency_slots      min
accounting_summary FALSE
  • mpi_2p, mpi_4p, mpi_8p, mpi_16p, etc.: 2, 4, 8, 16, etc. slots per node (depending on how many different allocation rules you want to allow and how many CPUs each node has)
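
As an alternative to the interactive qconf -ap editor, a PE can also be added non-interactively from a file; a sketch for mpi_1p (the temporary file name is only an example):

# Hypothetical non-interactive creation of the mpi_1p PE defined above
cat > /tmp/mpi_1p.pe <<'EOF'
pe_name            mpi_1p
slots              9999
user_lists         NONE
xuser_lists        NONE
start_proc_args    <YOUR_SGE_ROOT>/mpi/startmpi.sh -catch_rsh $pe_hostfile
stop_proc_args     <YOUR_SGE_ROOT>/mpi/stopmpi.sh
allocation_rule    1
control_slaves     TRUE
job_is_first_task  FALSE
urgency_slots      min
accounting_summary FALSE
EOF
qconf -Ap /tmp/mpi_1p.pe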

If you are using Modules to set up the hp-mpi environment, then use the following in the modulefile:

setenv          MPI_REMSH	"$env(TMPDIR)/rsh"

Otherwise, export the MPI_REMSH variable in the global or user environment:

export MPI_REMSH=$TMPDIR/rsh
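
A quick sanity check from inside a running parallel job (assuming MPI_REMSH has been set by one of the two methods above) is to verify that it points to the rsh wrapper that startmpi.sh -catch_rsh places in $TMPDIR:

# Hypothetical check, run from within a job script
echo "MPI_REMSH=$MPI_REMSH"
test -x "$MPI_REMSH" && echo "rsh wrapper found" || echo "rsh wrapper missing"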

Modify the starter_method and pe_list options of the queue you want to use: qconf -mq hp-mpi-queue

starter_method        /opt/cesga/sistemas/sge/tools/util/job_starter.sh
pe_list               mpi mpi_rr mpi_1p mpi_2p mpi_4p mpi_8p mpi_16p
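
Afterwards you can verify that the queue picked up both settings, for example:

qconf -sq hp-mpi-queue | egrep 'starter_method|pe_list'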

This is the script used as the job starter:

#!/bin/bash
# Author: JLC
# Purpose:
#   Wrapper to start the job script and the scripts it runs on the slave nodes using qrsh
#
# Changelog
#   05-11-2008 JLC
#     First version
#   11-11-2008 JLC
#     Added reading of user's profile
#   22-01-2009 JLC
#     Added debug functions and adapted to work with GE6.2
#   26-01-2009 JLC
#     Bug fix: set CPU variable in case profile has not done it
#   02-02-2009 JLC
#     Bug fix: export the CPU variable, otherwise the sge module does not see it
#

# Our queues are configured using shell_start_mode=posix_compliant and shell=/bin/bash
# We add the -l option to force the shell to re-read the profile.d scripts and set the correct env
# For qrsh sessions we want to load the right environment

#
# DEBUG
#
DEBUG="false" # Set it to "true" to get debug information
DEBUG_USER="uscfajlc"
# Place it on an NFS directory so you can easily get all the outputs
DEBUG_LOG="/home/usc/fa/jlc/mpi/job_starter_${HOSTNAME}_${JOB_ID}_$$.out" 

# Log output of command
debug_cmd (){
        if [[ $LOGNAME = $DEBUG_USER && $DEBUG = "true" ]]; then
                $* >> $DEBUG_LOG 2>&1
        fi
}
# Log function
debug_log (){
        if [[ $LOGNAME = $DEBUG_USER  && $DEBUG = "true" ]]; then
                echo "$*" >> $DEBUG_LOG 2>&1
        fi
}

debug_cmd env
debug_cmd module list

# We want to know the CPU type even in case /etc/profile has not been loaded
test -z "$CPU"  &&  CPU=`/bin/uname -m 2> /dev/null`
export CPU

# Export SGE_BINARY_PATH: zz-cesga.sh uses it to detect the right SGE module
export SGE_BINARY_PATH

if [[ $JOB_SCRIPT = QRSH || $JOB_SCRIPT = INTERACTIVE ]];then
        debug_log "QRSH|INTERACTIVE: Reading profile"
        # In GE6.2 PROFILEREAD=true is passed to the SLAVE. This prevents the re-reading of the profile scripts.
        # To avoid it:
        unset PROFILEREAD
        #
        # Read profile and profile.d scripts
        #
        if test -z "$PROFILEREAD" ; then
            test -r /etc/profile.d/sh.ssh   && . /etc/profile.d/sh.ssh
            test -r /etc/SuSEconfig/profile && . /etc/SuSEconfig/profile
            if test -z "$SSH_SENDS_LOCALE" ; then
                if test -r /etc/sysconfig/language -a -r /etc/profile.d/sh.utf8 ; then
                    tmp="$(. /etc/sysconfig/language; echo $AUTO_DETECT_UTF8)"
                    test "$tmp" = "yes" && . /etc/profile.d/sh.utf8
                    unset tmp
                fi
            fi
        fi
        if test -d /etc/profile.d -a -z "$PROFILEREAD" ; then
            for s in /etc/profile.d/*.sh ; do
                test -r $s && . $s
            done
            unset s
        fi

        if test -f $HOME/.profile ; then
                test -r $HOME/.profile && . $HOME/.profile
        fi

        if test -f $HOME/.bash_profile ; then
                test -r $HOME/.bash_profile && . $HOME/.bash_profile
        fi

        if test -f $HOME/.bash_login ; then
                test -r $HOME/.bash_login && . $HOME/.bash_login
        fi

        debug_log "After loading profile scripts"
        debug_cmd env
        debug_cmd module list

        # Finally we run the command
        echo $*
        $*
else
        echo /bin/bash -l $*
        /bin/bash -l $*
fi
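
To see what the starter actually does on each node, one option is to enable the DEBUG block above for your own account (DEBUG="true", DEBUG_USER set to your login) and run a small parallel job; the environment before and after loading the profiles is then appended to per-host log files in the directory configured in DEBUG_LOG:

# Hypothetical inspection of the debug output (DEBUG_LOG points to a shared
# directory, so every host of the job writes where you can read the files)
ls -l /home/usc/fa/jlc/mpi/job_starter_*.out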

Additional considerations on Itanium

The qrsh version included in GE 6.2, which adds support for builtin communication (IJS), uses threads. Its memory consumption is about 2 times the stack limit (set, for example, with ulimit -s). If there is no stack limit it uses 32MB as a reference and multiplies it by 2.

The number of threads spawned by each qrsh depends on the shell from which it is called:

  • If the shell is recognised as "sh", it spawns 3 threads.
  • If the shell is NOT recognised as "sh" (for example, "bash"), it spawns 4 threads. In this case one of the threads is in charge of the shell and consumes an extra 30MB of memory. This can be avoided with the -noshell option.

This way, if the stack is unlimited each qrsh consumes 73MB of virtual memory, and with a stack limit of 256MB each qrsh consumes more than 500MB of virtual memory (roughly 2 × 256MB).

This can lead to problems when running jobs: GE may kill them for exceeding the s_vmem memory limit on the master node simply because the qrsh processes consume too much memory (one qrsh process is launched per slave node).

For optimal memory consumption on Itanium a stack limit of 1MB can be used. In this case each qrsh consumes 10MB of virtual memory (4MB of it physical memory). To achieve this we set the limit inside the rsh script before qrsh is launched:

...
ulimit -s 1024
...
   if [ $minus_n = 1 ]; then
      exec $me -n $rhost $cmd
   else
      exec $me $rhost $cmd
   fi
...
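
One way to verify the effect on the master node is to look at the size of the qrsh processes while a parallel job is running (a sketch, assuming a standard procps ps):

# Virtual and resident size (in KB) of every qrsh launched for the job
ps -C qrsh -o pid,vsz,rss,args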

Additional Scripts Needed

  • <YOUR_SGE_ROOT>/mpi/rsh
#!/bin/sh
#
# Author: Sun, modifications by JLC
# Purpose:
#   Wrapper to run rsh commands through qrsh
#
#
#___INFO__MARK_BEGIN__
##########################################################################
#
#  The Contents of this file are made available subject to the terms of
#  the Sun Industry Standards Source License Version 1.2
#
#  Sun Microsystems Inc., March, 2001
#
#
#  Sun Industry Standards Source License Version 1.2
#  =================================================
#  The contents of this file are subject to the Sun Industry Standards
#  Source License Version 1.2 (the "License"); You may not use this file
#  except in compliance with the License. You may obtain a copy of the
#  License at http://gridengine.sunsource.net/Gridengine_SISSL_license.html
#
#  Software provided under this License is provided on an "AS IS" basis,
#  WITHOUT WARRANTY OF ANY KIND, EITHER EXPRESSED OR IMPLIED, INCLUDING,
#  WITHOUT LIMITATION, WARRANTIES THAT THE SOFTWARE IS FREE OF DEFECTS,
#  MERCHANTABLE, FIT FOR A PARTICULAR PURPOSE, OR NON-INFRINGING.
#  See the License for the specific provisions governing your rights and
#  obligations concerning the Software.
#
#  The Initial Developer of the Original Code is: Sun Microsystems, Inc.
#
#  Copyright: 2001 by Sun Microsystems, Inc.
#
#  All Rights Reserved.
#
##########################################################################
#___INFO__MARK_END__

# could be rsh or remsh
me=`basename $0`
# just_wrap=1

# remove path to wrapping rsh from path list
if [ "x$TMPDIR" != "x" ]; then
   PATH=`echo $PATH|tr : "\012"|grep -v $TMPDIR| tr "\012" :`
   export PATH
fi

# rehash 
hash -r

if [ "x$JOB_ID" = "x" ]; then
   exec $me $*
   echo command $me not found in PATH=$PATH 
fi

# extract target hostname
if [ $# -lt 1 ]; then 
   echo $me: missing hostname
   exit 1
fi  

# Handle hostname before options
rhost=
expr "$1" : "-*" >/dev/null 2>&1

if [ $? -ne 0 ]; then
   rhost=$1
   shift
fi

ruser=
minus_n=0

# parse other rsh options
while [ "$1" != "" ]; do
   case "$1" in
      -l)
         shift
         if [ $# -lt 1 ]; then 
            echo $me: option -l needs user name as argument
            exit 1
         fi  
         ruser=$1
         ;;
      -n)
         minus_n=1
         ;;
      -*)
         echo $me: Unsupported option - $1
         exit 1
         ;;
      *)
         break;
         ;;
   esac
   shift
done

# Handle hostname after options
if [ "x$rhost" = x ]; then
   if [ $# -lt 1 ]; then 
      echo $me: missing hostname
      exit 1
   fi  
   rhost=$1
   shift
fi

# should the command to be started preceded with any starter command
if [ "x$RCMD_PREFIX" = x ]; then
   cmd="$*"
else
   cmd="$RCMD_PREFIX $*"
fi

# unset TASK_ID - it might be set if a task starts another tasks 
#                 and may not be exported in this case
if [ "x$TASK_ID" = x ]; then
   unset TASK_ID
fi

# CESGA's local variables
#
CESGA_VARS=""
if [ x$LOADEDMODULES != x ]; then
   CESGA_VARS="$CESGA_VARS -v LOADEDMODULES=$LOADEDMODULES"
fi
if [ x$OMP_NUM_THREADS != x ]; then
   CESGA_VARS="$CESGA_VARS -v OMP_NUM_THREADS=$OMP_NUM_THREADS"
fi

# qrsh from GE 6.2 consumes 512MB with the default stack limit of 256MB;
# reducing the stack limit to 1024KB reduces the memory usage to 43MB
ulimit -s 1024

if [ x$just_wrap = x ]; then 
   if [ $minus_n -eq 1 ]; then
      echo $SGE_ROOT/bin/$ARC/qrsh $CESGA_VARS -inherit -nostdin $rhost $cmd  
      exec $SGE_ROOT/bin/$ARC/qrsh $CESGA_VARS -inherit -nostdin $rhost $cmd 
   else
      echo $SGE_ROOT/bin/$ARC/qrsh $CESGA_VARS -inherit $rhost $cmd 
      exec $SGE_ROOT/bin/$ARC/qrsh $CESGA_VARS -inherit $rhost $cmd  
   fi
else
   echo $me $rhost $*
   if [ $minus_n = 1 ]; then
      exec $me -n $rhost $cmd
   else
      exec $me $rhost $cmd
   fi
   echo $me not found in PATH=$PATH
fi
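
Besides the qrsh call, the wrapper honours two optional knobs: uncommenting just_wrap=1 makes it fall back to plain rsh instead of qrsh, and exporting RCMD_PREFIX prepends a starter command to whatever HP-MPI asks to run on the slaves. A sketch of the latter from a job script (the prefix command itself is only an example):

# Hypothetical use of RCMD_PREFIX: every remote command started through the
# wrapper is then launched under "env LD_BIND_NOW=1"
export RCMD_PREFIX="env LD_BIND_NOW=1"
mpirun -np $NSLOTS my-hp-mpi-program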

Testing the installation

To test the installation you can use a script similar to the following. In our case, using HP-MPI requires loading the hp-mpi and intel modules; this of course depends on how you chose to install HP-MPI:

#!/bin/bash

## HP-MPI
echo "== HP-MPI test =="
module load hp-mpi intel

# In case it is not done in the global/user environment
#export MPI_REMSH="$TMPDIR/rsh"

# You may also need the option "-hostfile $TMPDIR/machines" if it
# is not included in the global/user environment (MPIRUN_OPTIONS)
mpirun -np $NSLOTS my-hp-mpi-program
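
The test script above can then be submitted through one of the PEs defined earlier, requesting the number of slots that mpirun will use; queue, PE, slot count and script name below are just examples:

qsub -q hp-mpi-queue -pe mpi 8 test-hp-mpi.sh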