Tight-HP-MPI-Integration-Notes
Introduction
The integration has been tested with both GE 6.1 and GE 6.2.
- GE 6.1 uses rsh:
  - Limitations:
    - There is a theoretical limit of 256 connections. Our tests indicate that only one connection is opened to each slave node per job.
    - No prompt in the session. We are not aware of any problem this could cause in batch jobs.
- GE 6.2 uses an internal communication protocol:
  - Limitations:
    - No prompt
    - No X11 forwarding
HP-MPI Installation
We install HP-MPI in a shared directory (/opt/cesga) so all nodes can access it:
cn141 # rpm -ivh --test --prefix /opt/cesga/hp-mpi-2.3/ hpmpi-2.03.00.00-20081120r.ia64.rpm
Preparing... ########################################### [100%]
cn141 # rpm -ivh --prefix /opt/cesga/hp-mpi-2.3/ hpmpi-2.03.00.00-20081120r.ia64.rpm
Preparing... ########################################### [100%]
1:hpmpi ########################################### [100%]
Local installation in each node is also possible.
GE Configuration
Set up the Grid Engine Parallel Environment (PE) using the command: qconf -ap mpi
- mpi: Basic PE using $fill_up allocation rule (all slots allocated in a node before moving to the next one)
pe_name mpi
slots 9999
user_lists NONE
xuser_lists NONE
start_proc_args <YOUR_SGE_ROOT>/mpi/startmpi.sh -catch_rsh \
$pe_hostfile
stop_proc_args <YOUR_SGE_ROOT>/mpi/stopmpi.sh
allocation_rule $fill_up
control_slaves TRUE
job_is_first_task FALSE
urgency_slots min
accounting_summary FALSE
- mpi_rr: PE with $round_robin allocation (one slot per node if possible)
pe_name mpi_rr
slots 9999
user_lists NONE
xuser_lists NONE
start_proc_args <YOUR_SGE_ROOT>/mpi/startmpi.sh -catch_rsh \
$pe_hostfile
stop_proc_args <YOUR_SGE_ROOT>/mpi/stopmpi.sh
allocation_rule $round_robin
control_slaves TRUE
job_is_first_task FALSE
urgency_slots min
accounting_summary FALSE
- mpi_1p: 1 slot per node (mandatory)
pe_name mpi_1p
slots 9999
user_lists NONE
xuser_lists NONE
start_proc_args <YOUR_SGE_ROOT>/mpi/startmpi.sh -catch_rsh \
$pe_hostfile
stop_proc_args <YOUR_SGE_ROOT>/mpi/stopmpi.sh
allocation_rule 1
control_slaves TRUE
job_is_first_task FALSE
urgency_slots min
accounting_summary FALSE
- mpi_2p, mpi_4p, mpi_8p, mpi_16p, etc.: 2, 4, 8, 16, etc. slots per node (depending on how many different allocation rules you want to allow and how many CPUs each node has)
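Since the per-node variants differ only in the PE name and the allocation rule, their definitions could be generated instead of typed by hand. The following is a minimal sketch, not part of the original setup: the function name make_pe_def and the /tmp file names are hypothetical, and it assumes SGE_ROOT holds the same path used for <YOUR_SGE_ROOT> above. qconf -Ap adds a PE from a file.

```shell
# Hypothetical helper: print a PE definition with a fixed number of slots per node.
# $1 = slots per node (used in the PE name and as the allocation rule)
# Assumes SGE_ROOT is set in the environment.
make_pe_def() {
    pe_slots_per_node="$1"
    cat <<EOF
pe_name            mpi_${pe_slots_per_node}p
slots              9999
user_lists         NONE
xuser_lists        NONE
start_proc_args    ${SGE_ROOT}/mpi/startmpi.sh -catch_rsh \$pe_hostfile
stop_proc_args     ${SGE_ROOT}/mpi/stopmpi.sh
allocation_rule    ${pe_slots_per_node}
control_slaves     TRUE
job_is_first_task  FALSE
urgency_slots      min
accounting_summary FALSE
EOF
}

# Example use (commented out so the sketch has no side effects):
# for n in 2 4 8 16; do
#     make_pe_def "$n" > "/tmp/mpi_${n}p.pe"
#     qconf -Ap "/tmp/mpi_${n}p.pe"
# done
```

Only the values that fit the number of CPUs per node need to be generated.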
If you use Modules to set up the HP-MPI environment, add the following to the modulefile:
setenv MPI_REMSH "$env(TMPDIR)/rsh"
Otherwise, export the MPI_REMSH variable in the global or user environment:
export MPI_REMSH=$TMPDIR/rsh
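A job script could verify this setting before calling mpirun. The sketch below is an illustration only: the function name check_remsh is hypothetical, and it assumes that the PE start script, invoked with -catch_rsh, has already placed the rsh wrapper in $TMPDIR by the time the job script runs.

```shell
# Hypothetical sanity check, meant to be called from inside a job script.
# Succeeds only if MPI_REMSH is set and names an executable file.
check_remsh() {
    if [ -z "${MPI_REMSH:-}" ]; then
        echo "MPI_REMSH is not set" >&2
        return 1
    fi
    if [ ! -x "$MPI_REMSH" ]; then
        echo "MPI_REMSH=$MPI_REMSH is not executable" >&2
        return 1
    fi
    echo "MPI_REMSH=$MPI_REMSH"
}

# Example use inside a job script:
# check_remsh || exit 1
```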
Modify the starter_method and pe_list options of the queue you want to use: qconf -mq hp-mpi-queue
starter_method /opt/cesga/sistemas/sge/tools/util/job_starter.sh
pe_list mpi mpi_rr mpi_1p mpi_2p mpi_4p mpi_8p mpi_16p
This is the script used as the job starter:
#!/bin/bash
# Author: JLC
# Purpose:
# Wrapper to start the job script and the scripts it runs in the slave nodes using qrsh
#
# Changelog
# 05-11-2008 JLC
# First version
# 11-11-2008 JLC
# Added reading of user's profile
# 22-01-2009 JLC
# Added debug functions and adapted to work with GE6.2
# 26-01-2009 JLC
# Bug fix: set CPU variable in case profile has not done it
# 02-02-2009 JLC
# Bug fix: export CPU variable, otherwise the SGE module does not see it
#
# Our queues are configured using shell_start_mode=posix_compliant and shell=/bin/bash
# We add the -l option to force the shell to re-read the profile.d scripts and set the correct env
# For qrsh sessions we want to load the right environment
#
# DEBUG
#
DEBUG="false" # Set it to "true" to get debug information
DEBUG_USER="uscfajlc"
# Place it on an NFS directory so you can easily get all the outputs
DEBUG_LOG="/home/usc/fa/jlc/mpi/job_starter_${HOSTNAME}_${JOB_ID}_$$.out"
# Log output of command
debug_cmd (){
    if [[ $LOGNAME = $DEBUG_USER && $DEBUG = "true" ]]; then
        $* >> $DEBUG_LOG 2>&1
    fi
}
# Log function
debug_log (){
    if [[ $LOGNAME = $DEBUG_USER && $DEBUG = "true" ]]; then
        echo "$*" >> $DEBUG_LOG 2>&1
    fi
}
debug_cmd env
debug_cmd module list
# We want to know the CPU type even in case /etc/profile has not been loaded
test -z "$CPU" && CPU=`/bin/uname -m 2> /dev/null`
export CPU
# Export SGE_BINARY_PATH: zz-cesga.sh uses it to detect the right SGE module
export SGE_BINARY_PATH
if [[ $JOB_SCRIPT = QRSH || $JOB_SCRIPT = INTERACTIVE ]];then
    debug_log "QRSH|INTERACTIVE: Reading profile"
    # In GE6.2 PROFILEREAD=true is passed to the SLAVE. This prevents the re-reading of the profile scripts.
    # To avoid it:
    unset PROFILEREAD
    #
    # Read profile and profile.d scripts
    #
    if test -z "$PROFILEREAD" ; then
        test -r /etc/profile.d/sh.ssh && . /etc/profile.d/sh.ssh
        test -r /etc/SuSEconfig/profile && . /etc/SuSEconfig/profile
        if test -z "$SSH_SENDS_LOCALE" ; then
            if test -r /etc/sysconfig/language -a -r /etc/profile.d/sh.utf8 ; then
                tmp="$(. /etc/sysconfig/language; echo $AUTO_DETECT_UTF8)"
                test "$tmp" = "yes" && . /etc/profile.d/sh.utf8
                unset tmp
            fi
        fi
    fi
    if test -d /etc/profile.d -a -z "$PROFILEREAD" ; then
        for s in /etc/profile.d/*.sh ; do
            test -r $s && . $s
        done
        unset s
    fi
    if test -f $HOME/.profile ; then
        test -r $HOME/.profile && . $HOME/.profile
    fi
    if test -f $HOME/.bash_profile ; then
        test -r $HOME/.bash_profile && . $HOME/.bash_profile
    fi
    if test -f $HOME/.bash_login ; then
        test -r $HOME/.bash_login && . $HOME/.bash_login
    fi
    debug_log "After loading profile scripts"
    debug_cmd env
    debug_cmd module list
    # Finally we run the command
    echo $*
    $*
else
    echo /bin/bash -l $*
    /bin/bash -l $*
fi
Additional considerations in Itanium
The qrsh version included in GE 6.2, which adds support for builtin communication (IJS), uses threads. Its memory consumption is about twice the stack limit (set, for example, with ulimit). If there is no stack limit, it takes 32MB as the reference value and doubles it.
The number of threads spawned by each qrsh depends on the shell from which it is called:
- If the shell is recognised as "sh", it spawns 3 threads.
- Otherwise, if the shell is NOT recognised as "sh" (for example, "bash"), it spawns 4 threads. One of them is in charge of the shell and consumes an extra 30MB of memory. This can be avoided with the -noshell option.
As a result, with an unlimited stack each qrsh consumes 73MB of virtual memory, and with a stack limit of 256MB each qrsh consumes more than 500MB of virtual memory.
This can cause jobs to fail: GE kills them for exceeding the s_vmem memory limit on the master node simply because the qrsh processes consume too much memory (one qrsh process is launched per slave node).
For optimal memory consumption on Itanium, a stack limit of 1MB can be used. In this case each qrsh consumes 10MB of virtual memory (4MB of it physical memory). To apply it, we set this limit inside the rsh script before qrsh is launched:
...
ulimit -s 1024
...
if [ $minus_n = 1 ]; then
    exec $me -n $rhost $cmd
else
    exec $me $rhost $cmd
fi
...
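To judge whether a given stack limit is likely to hit s_vmem on the master node, the rule of thumb above can be turned into a quick calculation. This is only a rough sketch: the function name qrsh_vmem_estimate_mb is hypothetical, and it applies nothing more than the "about twice the stack limit" rule multiplied by the number of slave nodes. The measured figures quoted above (73MB, more than 500MB, 10MB) are somewhat higher than this estimate, so treat the result as a lower bound.

```shell
# Rough lower bound (in MB) for the virtual memory taken by the qrsh
# processes on the master node, using only the "2 x stack limit" rule.
# $1 = stack limit in KB as printed by "ulimit -s", or the word "unlimited"
# $2 = number of slave nodes (one qrsh process per slave node)
qrsh_vmem_estimate_mb() {
    est_stack_kb="$1"
    est_nodes="$2"
    if [ "$est_stack_kb" = "unlimited" ]; then
        est_stack_mb=32     # reference value used when there is no limit
    else
        est_stack_mb=$(( est_stack_kb / 1024 ))
    fi
    echo $(( 2 * est_stack_mb * est_nodes ))
}

# Example use:
# qrsh_vmem_estimate_mb "$(ulimit -s)" 16
```

With a 256MB stack limit and 4 slave nodes the estimate is already 2048MB, whereas with a 1MB limit and 16 slave nodes it is 32MB.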
Additional Scripts Needed
- <YOUR_SGE_ROOT>/mpi/rsh
#!/bin/sh
#
# Author: Sun, modifications by JLC
# Purpose:
# Wrapper to run rsh commands through qrsh
#
#
#___INFO__MARK_BEGIN__
##########################################################################
#
# The Contents of this file are made available subject to the terms of
# the Sun Industry Standards Source License Version 1.2
#
# Sun Microsystems Inc., March, 2001
#
#
# Sun Industry Standards Source License Version 1.2
# =================================================
# The contents of this file are subject to the Sun Industry Standards
# Source License Version 1.2 (the "License"); You may not use this file
# except in compliance with the License. You may obtain a copy of the
# License at http://gridengine.sunsource.net/Gridengine_SISSL_license.html
#
# Software provided under this License is provided on an "AS IS" basis,
# WITHOUT WARRANTY OF ANY KIND, EITHER EXPRESSED OR IMPLIED, INCLUDING,
# WITHOUT LIMITATION, WARRANTIES THAT THE SOFTWARE IS FREE OF DEFECTS,
# MERCHANTABLE, FIT FOR A PARTICULAR PURPOSE, OR NON-INFRINGING.
# See the License for the specific provisions governing your rights and
# obligations concerning the Software.
#
# The Initial Developer of the Original Code is: Sun Microsystems, Inc.
#
# Copyright: 2001 by Sun Microsystems, Inc.
#
# All Rights Reserved.
#
##########################################################################
#___INFO__MARK_END__
# could be rsh or remsh
me=`basename $0`
# just_wrap=1
# remove path to wrapping rsh from path list
if [ "x$TMPDIR" != "x" ]; then
    PATH=`echo $PATH|tr : "\012"|grep -v $TMPDIR| tr "\012" :`
    export PATH
fi
# rehash
hash -r
if [ "x$JOB_ID" = "x" ]; then
    exec $me $*
    echo command $me not found in PATH=$PATH
fi
# extract target hostname
if [ $# -lt 1 ]; then
    echo $me: missing hostname
    exit 1
fi
# Handle hostname before options
rhost=
expr "$1" : "-*" >/dev/null 2>&1
if [ $? -ne 0 ]; then
    rhost=$1
    shift
fi
ruser=
minus_n=0
# parse other rsh options
while [ "$1" != "" ]; do
    case "$1" in
        -l)
            shift
            if [ $# -lt 1 ]; then
                echo $me: option -l needs user name as argument
                exit 1
            fi
            ruser=$1
            ;;
        -n)
            minus_n=1
            ;;
        -*)
            echo $me: Unsupported option - $1
            exit 1
            ;;
        *)
            break;
            ;;
    esac
    shift
done
# Handle hostname after options
if [ "x$rhost" = x ]; then
    if [ $# -lt 1 ]; then
        echo $me: missing hostname
        exit 1
    fi
    rhost=$1
    shift
fi
# should the command to be started preceded with any starter command
if [ "x$RCMD_PREFIX" = x ]; then
    cmd="$*"
else
    cmd="$RCMD_PREFIX $*"
fi
# unset TASK_ID - it might be set if a task starts another tasks
# and may not be exported in this case
if [ "x$TASK_ID" = x ]; then
    unset TASK_ID
fi
# CESGA's local variables
#
CESGA_VARS=""
if [ x$LOADEDMODULES != x ]; then
    CESGA_VARS="$CESGA_VARS -v LOADEDMODULES=$LOADEDMODULES"
fi
if [ x$OMP_NUM_THREADS != x ]; then
    CESGA_VARS="$CESGA_VARS -v OMP_NUM_THREADS=$OMP_NUM_THREADS"
fi
# qrsh from GE6.2 consumes 512MB with the default value of stack of 256MB
# reducing the stack limit to 1024 the memory usage is reduced to 43MB
ulimit -s 1024
if [ x$just_wrap = x ]; then
    if [ $minus_n -eq 1 ]; then
        echo $SGE_ROOT/bin/$ARC/qrsh $CESGA_VARS -inherit -nostdin $rhost $cmd
        exec $SGE_ROOT/bin/$ARC/qrsh $CESGA_VARS -inherit -nostdin $rhost $cmd
    else
        echo $SGE_ROOT/bin/$ARC/qrsh $CESGA_VARS -inherit $rhost $cmd
        exec $SGE_ROOT/bin/$ARC/qrsh $CESGA_VARS -inherit $rhost $cmd
    fi
else
    echo $me $rhost $*
    if [ $minus_n = 1 ]; then
        exec $me -n $rhost $cmd
    else
        exec $me $rhost $cmd
    fi
    echo $me not found in PATH=$PATH
fi
Testing the installation
To test the installation, you can use a script similar to the one below. In our case HP-MPI requires the hp-mpi and intel modules to be loaded; this depends, of course, on how you chose to install HP-MPI:
#!/bin/bash
## HP-MPI
echo "== HP-MPI test =="
module load hp-mpi intel
# In case it is not done in the global/user environment
#export MPI_REMSH="$TMPDIR/rsh"
# You may also need the option "-hostfile $TMPDIR/machines" if it
# is not included in the global/user environment (MPIRUN_OPTIONS)
mpirun -np $NSLOTS my-hp-mpi-program
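When testing the different PEs it can help to confirm that the granted slots match the allocation rule ($fill_up, $round_robin or a fixed number per node). The sketch below is an assumption-laden illustration: the function names slots_per_host and total_slots are hypothetical, and it assumes the usual PE hostfile layout of one line per host with the hostname in the first column and the slot count in the second.

```shell
# Hypothetical helpers to inspect the slot layout granted to a parallel job.
# $1 = path to a PE hostfile (inside a job this is "$PE_HOSTFILE")

# Print "host slots" for each host in the hostfile.
slots_per_host() {
    awk '{ print $1, $2 }' "$1"
}

# Print the total number of slots, which should match $NSLOTS.
total_slots() {
    awk '{ n += $2 } END { print n + 0 }' "$1"
}

# Example use inside the test script, before mpirun:
# slots_per_host "$PE_HOSTFILE"
# echo "total: $(total_slots "$PE_HOSTFILE") (NSLOTS=$NSLOTS)"
```

For example, a job submitted with mpi_1p should list every host with exactly one slot.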
