STARCD STAR-CCM+ Integration

From GridWiki
Revision as of 19:23, 9 August 2012 by LindaLopez (talk | contribs) (minor updates)

STARCD

These notes apply to STAR-CD v4.x only.

The integration of STAR-CD in the GridEngine is largely straightforward, but some adjustments are advisable to ensure smooth operation. The main difficulty is posed by the rsh/ssh and rcp/scp transports used in STAR-CD. On many newer systems, unsafe services such as telnet, rsh and rcp are disabled by default. Even if you decide to simply activate rsh/rcp on your systems (and possibly violate the corporate security policy), there are good technical reasons not to do so. The plain rsh service needs to be activated on each cluster node, which leads to problems if this step is forgotten on new installations. Beyond this, using a plain rsh will leave behind processes on the nodes when the job is removed with qdel (more details in http://gridengine.sunsource.net/howto/mpich-integration.html).

For correct behaviour, the GridEngine rsh wrapper must be used. This wrapper, which is normally found under $SGE_ROOT/mpi/rsh, simply wraps rsh to use the GridEngine qrsh with the -inherit option. The final backend transport that is actually used for qrsh can itself be the GridEngine builtin (new with GridEngine 6.2), the GridEngine version of rsh that handles ports, or the standard ssh. The default GridEngine builtin version works without any known difficulties.
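
To illustrate what wrapping rsh with qrsh means, here is a minimal sketch of the argument translation. It is not the script shipped as $SGE_ROOT/mpi/rsh, which handles more options. The function name build_qrsh_command is hypothetical, and it only prints the command line instead of executing it.

```shell
#!/bin/sh
# Simplified sketch (NOT the shipped $SGE_ROOT/mpi/rsh wrapper).
# Translates "rsh [-n] host command..." into the equivalent qrsh call
# and prints it, so the mapping can be inspected without a cluster.
build_qrsh_command() {
    nostdin=""
    # Consume leading rsh options; only -n is handled in this sketch.
    while [ $# -gt 0 ]; do
        case "$1" in
            -n) nostdin=" -nostdin"; shift ;;
            *)  break ;;
        esac
    done
    host=$1
    shift
    # -inherit starts the task inside the already running parallel job.
    echo "qrsh -inherit${nostdin} ${host} $*"
}
```

For example, `build_qrsh_command -n node02 ls /tmp` prints `qrsh -inherit -nostdin node02 ls /tmp`. Because the remote task is started through qrsh, it stays under GridEngine control and is removed together with the job.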

Parallel Environment

Here is an example of a parallel environment for a tight integration:

 pe_name            mpich
 slots              999
 user_lists         NONE
 xuser_lists        NONE
 start_proc_args    /opt/grid/mpi/startmpi.sh -catch_rsh $pe_hostfile
 stop_proc_args     /opt/grid/mpi/stopmpi.sh
 allocation_rule    $fill_up
 control_slaves     TRUE
 job_is_first_task  FALSE
 urgency_slots      min
 accounting_summary FALSE

The start_proc_args contains the -catch_rsh option, which links the $SGE_ROOT/mpi/rsh wrapper in $TMPDIR. Since this directory is automatically added to the path of the GridEngine jobs, the wrapper will generally be seen before any other rsh in the path. However, it seems more prudent to specify the wrapper explicitly than to rely on the correct path order for proper behaviour.
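
Whether the path order actually resolves to the wrapper can be checked from within a job. The following is a small sketch, assuming the link is named $TMPDIR/rsh as described above. The function name rsh_is_wrapper is hypothetical.

```shell
#!/bin/sh
# Returns 0 if the first rsh found in PATH is the link placed in $TMPDIR
# by startmpi.sh -catch_rsh, and 1 otherwise.
rsh_is_wrapper() {
    [ "$(command -v rsh 2>/dev/null)" = "${TMPDIR}/rsh" ]
}
```

In a job script this could be used as `rsh_is_wrapper || echo "warning: rsh does not resolve to the GridEngine wrapper" >&2`.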


GridEngine Configuration

To reap the benefit of the tight integration, GridEngine should be configured to kill shepherded processes via the process group:

 execd_params                 ENABLE_ADDGRP_KILL

STARCD rsh wrapper

As described above, the GridEngine qrsh must be used, which is addressed by using an rsh wrapper. Unfortunately STARCD also contains its own rsh wrapper ($STARDIR/sbin/rsh) to handle switching between an rsh and an ssh transport, as well as an rcp wrapper ($STARDIR/sbin/rcp). It also relies upon modifying the path to include $STARDIR/sbin. Correct behaviour of the $STARDIR/sbin/{rcp,rsh} scripts depends upon the following environment variables:

 REMOTECOPY
 REMOTETASK

When set, they are used to specify secure alternatives to rcp and rsh.

Where's the problem?

Based on the description thus far, there don't seem to be any potential problems.

  • The STARCD rsh wrapper ($STARDIR/sbin/rsh) is seen first in the path (it is placed there within the star script).
  • The STARCD rsh wrapper strips the $STARDIR/sbin out of the path before calling the real rsh.
  • The real rsh is in fact the next one found in the path, which should be the $TMPDIR/rsh link to $SGE_ROOT/mpi/rsh that was placed there by the $SGE_ROOT/mpi/startmpi.sh starter with the -catch_rsh option.
  • The $TMPDIR/rsh (link to $SGE_ROOT/mpi/rsh) will call the GridEngine qrsh, which in turn uses rsh, ssh or builtin for the transport.
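
The path manipulation in this chain can be sketched as follows. This is an illustration of the stripping step only, not the actual code in $STARDIR/sbin/rsh. The function name strip_path_entry and the example directories are hypothetical.

```shell
#!/bin/sh
# Print the PATH-like string $1 with every entry equal to $2 removed.
strip_path_entry() {
    result=""
    old_ifs=$IFS
    IFS=:
    for entry in $1; do
        # Skip the entry that is to be removed.
        [ "$entry" = "$2" ] && continue
        result="${result:+$result:}$entry"
    done
    IFS=$old_ifs
    printf '%s\n' "$result"
}
```

With a path of `/opt/star/sbin:/tmp/job1:/usr/bin`, removing `/opt/star/sbin` leaves `/tmp/job1:/usr/bin`, so the next rsh found is the one in the job's temporary directory.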


But what happens when there are no rsh/rcp services on the cluster? In that case, the secure equivalents must be used:

 # provide secure access
 REMOTECOPY=/usr/bin/scp; export REMOTECOPY
 REMOTETASK=/usr/bin/ssh; export REMOTETASK

Now consider what occurs:

  • The STARCD rsh wrapper resolves rsh to /usr/bin/ssh, which is then used.
  • The GridEngine qrsh mechanism will be bypassed.
  • Using qdel to kill jobs results in zombie processes!

Thus for correct GridEngine control, we seem to require that REMOTETASK be unset:

 unset REMOTETASK

However, for the copying of files to work, we require

 REMOTECOPY=/usr/bin/scp; export REMOTECOPY

This is not only counterintuitive, but also means that testing a parallel job without the GridEngine will fail, since rsh will be used and this service is disabled on the system.

Summary

If we rely upon the standard STARCD mechanisms, the choice of REMOTETASK results either in a configuration that works well with GridEngine but does not work when used directly, or else in a configuration that works well when used directly but leaves behind zombie processes when used with the GridEngine.


Required Changes

For a configuration that works without the issues described above, the following changes are required:

Within the job script, add these lines before the star command is called:

 # provide secure access
 REMOTECOPY=/usr/bin/scp; export REMOTECOPY
 REMOTETASK=/usr/bin/ssh; export REMOTETASK
 # use GridEngine rsh wrapper to call GridEngine qrsh for the mpi transport
 # hp-mpi
 MPI_REMSH=$SGE_ROOT/mpi/rsh; export MPI_REMSH
 # mpich
 P4_RSHCOMMAND=$SGE_ROOT/mpi/rsh; export P4_RSHCOMMAND
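
As a defensive variant, the same settings can be wrapped in a function that first checks that the GridEngine wrapper exists. This is only a sketch. The function name setup_star_transport is hypothetical, and the scp/ssh locations are the ones used above.

```shell
#!/bin/sh
# Export the transport settings only if the GridEngine rsh wrapper is present.
setup_star_transport() {
    wrapper="${SGE_ROOT}/mpi/rsh"
    if [ ! -x "$wrapper" ]; then
        echo "GridEngine rsh wrapper not found: $wrapper" >&2
        return 1
    fi
    # provide secure access
    REMOTECOPY=/usr/bin/scp; export REMOTECOPY
    REMOTETASK=/usr/bin/ssh; export REMOTETASK
    # hp-mpi
    MPI_REMSH=$wrapper; export MPI_REMSH
    # mpich
    P4_RSHCOMMAND=$wrapper; export P4_RSHCOMMAND
}
```

Calling `setup_star_transport || exit 1` before the star command makes the job fail early instead of silently falling back to a transport that bypasses qrsh.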


To ensure that the values are used reliably, the following changes should be made to the $STAR/bin/star script. As always, make a backup copy first. It will be useful to determine what changes might be needed in future STARCD versions (unfortunately cd-adapco will not integrate the following suggested changes due to "unforeseeable repercussions" of introducing the RCP shell variable in the script).

Near the top of the $STAR/bin/star script, explicitly use the values of REMOTETASK and REMOTECOPY if they are set. This eliminates reliance on the order of the path. A typical diff:

--- star.orig   2009-02-26 22:14:32.000000000 +0100
+++ star        2009-04-22 14:43:38.197626000 +0200
@@ -23,13 +23,24 @@
 PNP_BUILDTIME="[2009-02-26-21:19:25]"

 #
-# Setups remote shell
+# Setup remote shell
 #
+# <rcp>
+RCP=rcp
+# </rcp>
 case `uname` in
 HP-UX) RSH=remsh;;
 *)     RSH=rsh;;
 esac
+
+# possibly use securer mode, even when the $STARDIR/sbin scripts are missing
+# <rcp>
+[ -n "$REMOTECOPY" ] && RCP=$REMOTECOPY
+[ -n "$REMOTETASK" ] && RSH=$REMOTETASK
+# </rcp>

Analogous to the existing RSH shell variable, an extra RCP shell variable has been introduced. The next step requires a small amount of patience, but is simple: replace the remaining occurrences of rcp with the $RCP variable. For example,

--- star.orig   2009-02-26 22:14:32.000000000 +0100
+++ star        2009-04-22 14:43:38.197626000 +0200
@@ -1773,7 +1791,9 @@
 prepare_mmboot() {
   if [ "$ ... \n
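
To find the places that still need editing, the literal uses of rcp can be listed with their line numbers. This is a sketch only. The function name list_rcp_uses is hypothetical, and it relies on the -w option of grep, which is available in the common GNU and BSD implementations.

```shell
#!/bin/sh
# List lines of file $1 that contain "rcp" as a whole word.
# The match is case-sensitive, so uses of $RCP are not reported,
# but the default assignment RCP=rcp is.
list_rcp_uses() {
    grep -n -w rcp "$1"
}
```

Running `list_rcp_uses star` on the backup copy and on the edited script shows which occurrences remain to be reviewed by hand.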
