STARCD STAR-CCM+ Integration

From GridWiki
Revision as of 07:20, 27 January 2010 by Olesen (talk | contribs) (created page)


These notes apply to STAR-CD v4.x only.

The integration of STAR-CD in the GridEngine is mostly free of significant issues, but some adjustments are advisable to ensure smooth operation. The main difficulty is posed by the rsh/ssh and rcp/scp transports used in STAR-CD. On many newer systems, unsafe services such as telnet, rsh and rcp are disabled by default. Even if you decide to simply activate rsh/rcp on your systems (and possibly violate the corporate security policy), there are good technical reasons not to do so. The plain rsh service would need to be activated on each cluster node, which leads to problems if this step is forgotten on new installations. Beyond this, using a plain rsh will leave behind processes on the nodes when the job is removed with qdel.

For correct behaviour, the GridEngine rsh wrapper must be used. This wrapper, which is normally found under $SGE_ROOT/mpi/rsh, simply wraps rsh to use the GridEngine qrsh with the -inherit option. The backend transport that is actually used for qrsh can itself be the GridEngine builtin (new with GridEngine 6.2), the GridEngine version of rsh that handles ports, or the standard ssh. The default GridEngine builtin version works without any known difficulties.
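
As a rough illustration of what such a wrapper does, the following sketch maps an rsh-style call onto qrsh with the -inherit option. It is not the script shipped under $SGE_ROOT/mpi/rsh: the function name sge_rsh_command and the QRSH variable are assumptions made for this example, and the real wrapper handles further rsh options.

```shell
#!/bin/sh
# Sketch only: translate "rsh [-n] host command..." into a qrsh -inherit call.
# QRSH and sge_rsh_command are hypothetical names used for illustration.
QRSH=${QRSH:-qrsh}

sge_rsh_command() {
    # drop a leading -n, which rsh callers often pass to detach stdin
    if [ "$1" = "-n" ]; then
        shift
    fi
    host=$1
    shift
    # print the command line instead of running it, so the sketch can be
    # inspected without a GridEngine installation
    echo "$QRSH -inherit $host $*"
}
```

Calling sge_rsh_command -n node01 hostname prints the resulting qrsh command line; a real wrapper would execute that command rather than print it.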

Parallel Environment

Here is an example of a parallel environment for a tight integration:

 pe_name            mpich
 slots              999
 user_lists         NONE
 xuser_lists        NONE
 start_proc_args    /opt/grid/mpi/ -catch_rsh $pe_hostfile
 stop_proc_args     /opt/grid/mpi/
 allocation_rule    $fill_up
 control_slaves     TRUE
 job_is_first_task  FALSE
 urgency_slots      min
 accounting_summary FALSE

The start_proc_args contains the -catch_rsh option, which links the $SGE_ROOT/mpi/rsh wrapper into $TMPDIR. Since this directory is automatically added to the path of the GridEngine jobs, the wrapper will generally be seen before any other rsh in the path. However, it seems more prudent to specify the use of the wrapper explicitly than to rely on the correct path order.
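
To see whether the path order actually favours the wrapper, a job script could check which rsh is resolved first. The following is a minimal sketch; the helper name rsh_is_from_dir is hypothetical, and inside a job the directory passed in would be $TMPDIR.

```shell
#!/bin/sh
# Sketch only: report whether the first rsh found in PATH is the one in a
# given directory (inside a job, that directory would be $TMPDIR).
# rsh_is_from_dir is a hypothetical helper name.
rsh_is_from_dir() {
    dir=$1
    found=$(command -v rsh) || return 1
    [ "$found" = "$dir/rsh" ]
}
```

For example, rsh_is_from_dir "$TMPDIR" || echo "warning: GridEngine rsh wrapper is not first in PATH" could be placed near the top of a job script.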

STARCD rsh wrapper

As described above, the GridEngine qrsh must be used, which is addressed by using an rsh wrapper. Unfortunately STARCD also contains its own rsh wrapper ($STARDIR/sbin/rsh) to handle switching between an rsh and an ssh transport, as well as an rcp wrapper ($STARDIR/sbin/rcp). It also relies upon modifying the path to include $STARDIR/sbin. Correct behaviour of the $STARDIR/sbin/{rcp,rsh} scripts depends upon the following environment variables:

  • REMOTECOPY
  • REMOTETASK

When set, they are used to specify secure alternatives to rcp and rsh.

Where's the problem?

Based on the description thus far, there don't seem to be any potential problems.

  • The STARCD rsh wrapper ($STARDIR/sbin/rsh) is seen first in the path (it is placed there within the star script).
  • The STARCD rsh wrapper strips the $STARDIR/sbin out of the path before calling the real rsh.
  • The real rsh is in fact the next one found in the path, which should be the $TMPDIR/rsh link to $SGE_ROOT/mpi/rsh that was placed there by the $SGE_ROOT/mpi/ starter with the -catch_rsh option.
  • The $TMPDIR/rsh (link to $SGE_ROOT/mpi/rsh) will call the GridEngine qrsh, which in turn uses rsh, ssh or builtin for the transport.
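
The path-stripping step in this chain can be sketched as follows. This is not STARCD code; strip_path_entry is a hypothetical helper that removes one directory from a PATH-style string, so that the next rsh in the path is found.

```shell
#!/bin/sh
# Sketch only: remove every occurrence of one directory from a PATH-style
# string. strip_path_entry is a hypothetical helper, not STARCD code.
# The body runs in a subshell so that the IFS change stays local.
strip_path_entry() (
    remove=$1
    path=$2
    result=
    IFS=:
    for entry in $path; do
        if [ -n "$entry" ] && [ "$entry" != "$remove" ]; then
            result=${result:+$result:}$entry
        fi
    done
    echo "$result"
)
```

A wrapper following this pattern would set PATH to the stripped value before calling rsh again, which then resolves to the $TMPDIR/rsh link.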

But what happens when there are no rsh/rcp services on the cluster? In that case, the secure equivalents must be used:

 # provide secure access
 REMOTECOPY=/usr/bin/scp; export REMOTECOPY
 REMOTETASK=/usr/bin/ssh; export REMOTETASK

Now consider what occurs:

  • The STARCD rsh wrapper resolves rsh to /usr/bin/ssh, which is then used.
  • The GridEngine qrsh mechanism will be bypassed.
  • Using qdel to kill jobs results in zombie processes!
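
The reason for the bypass can be sketched in a few lines. The helper name resolve_remote_shell is hypothetical and the logic is an assumption based on the behaviour described here, not the contents of $STARDIR/sbin/rsh.

```shell
#!/bin/sh
# Sketch only: pick the remote shell the way the text describes, i.e. use
# $REMOTETASK when it is set and fall back to plain rsh otherwise.
# resolve_remote_shell is a hypothetical name.
resolve_remote_shell() {
    echo "${REMOTETASK:-rsh}"
}
```

With REMOTETASK=/usr/bin/ssh the result is an absolute path, so the path lookup that would have found $TMPDIR/rsh never takes place.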

Thus for correct GridEngine control, we seem to require that REMOTETASK be unset:

 unset REMOTETASK

However, for the copying of files to work, we require:

 REMOTECOPY=/usr/bin/scp; export REMOTECOPY

This is not only counterintuitive, but also means that testing a parallel job without the GridEngine will fail, since rsh will be used and this service is disabled on the system.


If we rely upon the standard STARCD mechanisms, the choice of REMOTETASK results either in a configuration that works well with GridEngine but not when used directly, or else in a configuration that works well when used directly but leaves behind zombie processes when used with the GridEngine.

Required Changes

For a configuration that works without the issues described above, the following changes are required:

Within the job script, add these lines before the star command is called:

 # provide secure access
 REMOTECOPY=/usr/bin/scp; export REMOTECOPY
 REMOTETASK=/usr/bin/ssh; export REMOTETASK
 # use GridEngine rsh wrapper to call GridEngine qrsh for the mpi transport
 # hp-mpi
 MPI_REMSH=$SGE_ROOT/mpi/rsh; export MPI_REMSH
 # mpich
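
If the same script is also run outside of the GridEngine, the MPI setting could be made conditional. The following is a minimal sketch under that assumption; setup_star_transport is a hypothetical name, and the check relies on JOB_ID and SGE_ROOT being set by GridEngine for running jobs.

```shell
#!/bin/sh
# Sketch only: set the transport variables from the text, and point the MPI
# layer at the GridEngine rsh wrapper only when running inside a GridEngine
# job. setup_star_transport is a hypothetical name.
setup_star_transport() {
    # provide secure access
    REMOTECOPY=/usr/bin/scp; export REMOTECOPY
    REMOTETASK=/usr/bin/ssh; export REMOTETASK
    if [ -n "${JOB_ID:-}" ] && [ -n "${SGE_ROOT:-}" ]; then
        # hp-mpi: route remote process start-up through qrsh
        MPI_REMSH=$SGE_ROOT/mpi/rsh; export MPI_REMSH
    fi
}
```

The function would be called once before the star command is started.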

To ensure that the values are used reliably, the following changes should be made to the $STAR/bin/star script. As always, make a backup copy first; it will also be useful for determining what changes might be needed in future STARCD versions. (Unfortunately, CD-adapco will not integrate the following suggested changes due to "unforeseeable repercussions" of introducing the RCP shell variable in the script.)

Near the top of the $STAR/bin/star script, explicitly use the values of REMOTETASK and REMOTECOPY if they are set. This eliminates reliance on the order of the path. A typical diff:

--- star.orig   2009-02-26 22:14:32.000000000 +0100
+++ star        2009-04-22 14:43:38.197626000 +0200
@@ -23,13 +23,24 @@

-# Setups remote shell
+# Setup remote shell
+# <rcp>
+# </rcp>
 case `uname` in
 HP-UX) RSH=remsh;;
 *)     RSH=rsh;;
+# possibly use securer mode, even when the $STARDIR/sbin scripts are missing
+# <rcp>
+# </rcp>
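
The general idea behind this change can be sketched as a stand-alone function. This is an assumption about the approach, not the text of the actual patch; select_remote_commands is a hypothetical name.

```shell
#!/bin/sh
# Sketch only: one way the RSH and RCP shell variables could be chosen.
# This is an assumption about the general approach, not the text of the
# actual patch to the star script.
select_remote_commands() {
    case $(uname) in
    HP-UX) RSH=remsh;;
    *)     RSH=rsh;;
    esac
    RCP=rcp
    # prefer the secure alternatives whenever the user has set them
    [ -n "${REMOTETASK:-}" ] && RSH=$REMOTETASK
    [ -n "${REMOTECOPY:-}" ] && RCP=$REMOTECOPY
    return 0
}
```

After the call, $RSH and $RCP hold either the platform defaults or the values of REMOTETASK and REMOTECOPY, independent of the path order.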

Analogous to the existing RSH shell variable, an extra RCP shell variable has been introduced. The next step requires a small amount of patience, but is simple: replace the remaining occurrences of rcp with the $RCP variable. For example,

--- star.orig   2009-02-26 22:14:32.000000000 +0100
+++ star        2009-04-22 14:43:38.197626000 +0200
@@ -1773,7 +1791,9 @@
 prepare_mmboot() {
   if [ "$PNP_MMNONFS" ]; then
     $RSH $PNP_MMHOST mkdir $PNP_MMDIR > /dev/null 2>&1
-    rcp $PNP_MMCOPY $PNP_MMHOST:$PNP_MMDIR > /dev/null 2>&1
+# <rcp>
+    $RCP $PNP_MMCOPY $PNP_MMHOST:$PNP_MMDIR > /dev/null 2>&1
+# </rcp>
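
To find the occurrences that still need replacing, a small search helper may save some of that patience. The following is a minimal sketch; list_bare_rcp is a hypothetical name, and it assumes a grep that supports the -w option.

```shell
#!/bin/sh
# Sketch only: list lines of a script that still call rcp literally.
# Comment lines are skipped; grep is case-sensitive, so $RCP is not matched.
# list_bare_rcp is a hypothetical helper name.
list_bare_rcp() {
    grep -nw 'rcp' "$1" | grep -v '^[0-9]*:[[:space:]]*#'
}
```

Running list_bare_rcp on the edited star script prints each remaining line together with its line number, and prints nothing once all occurrences have been replaced.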

These changes provide a reliable GridEngine integration for STARCD. If STARCD is also to be used (in parallel) outside of the GridEngine control, the REMOTETASK and REMOTECOPY will still need to be set in your environment for the correct behaviour, but the $STAR/sbin/{rcp,rsh} wrappers are no longer required.


STAR-CCM+

No integration notes yet.