Integrating PEST and Grid Engine

From GridWiki
Revision as of 00:38, 26 November 2007 by Timcera

Parameter ESTimation (PEST) is an open source program that will find, if possible, the optimum set of parameters for any simulation system. It uses a template of the input data sets, varying those parameters indicated by certain flags, and compares the output from the simulation against observations. It has a wide range of capabilities and features, most beyond my understanding. The optimization procedures require multiple model runs, lending themselves quite nicely to running on computational clusters. PEST comes with a parallel mode that uses files on a shared file system to communicate messages between parallel PEST (ppest) and the slaves (pslave).

I must admit that I have yet to use PEST for a complete project. In ramping up to use it for several projects, I worked up an integration of parallel PEST and Grid Engine.

I can't include a link to PEST because of the spam protections on this wiki, but www.sspa.com/pest/ should be a good start. You can download Windows and UNIX/Linux versions there.

I have a 'qsub' wrapper called 'rj'. You should be able to replace it with 'qsub', or use the 'rj' script located at Utilities. The 'qhspf' special case handles a script used to run the Hydrologic Simulation Program Fortran (HSPF) model and can be deleted if desired.
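
If 'rj' is not available, a small stand-in along the following lines may be enough. This is only a sketch, not the actual 'rj' script: it assumes that 'rj -i FILE' means "feed FILE to the job's standard input" and that 'rj -b' means "block until the job finishes", as the comments in rppest suggest, and it simply ignores 'rj -m', whose meaning is not described on this page. The qsub options used ('-cwd', '-V', '-b y', '-sync', '-i') are standard Grid Engine options, but check them against your installation.

```shell
#!/bin/sh
# Hypothetical stand-in for the 'rj' wrapper, built on plain qsub.
# Only the options that rppest passes are handled:
#   -i FILE  assumed: use FILE as the job's standard input
#   -b       assumed: block until the job has finished
#   -m       accepted and ignored (its meaning in the real 'rj' is unknown here)
rj() {
  rj_stdin=""
  rj_sync="n"
  while [ $# -gt 0 ]; do
    case "$1" in
      -i) rj_stdin="$2"; shift 2 ;;
      -b) rj_sync="y"; shift ;;
      -m) shift ;;
      *)  break ;;
    esac
  done
  # Put the stdin option in front of the command that is left in "$@".
  if [ -n "$rj_stdin" ]; then
    set -- -i "$rj_stdin" "$@"
  fi
  # -cwd: run in the current directory; -V: export the environment;
  # -b y: treat the command as a binary rather than a job script.
  qsub -cwd -V -b y -sync "$rj_sync" "$@"
}
```

With this definition in place, 'rj -i ../runfile pslave' still prints the usual qsub submission message, so the "awk '{print $3}'" step in rppest continues to pick out the job id.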

rppest script:

#!/bin/bash
# (bash is required: the 'for (( ... ))' loops below are not POSIX sh)

# GridEngine script to run parallel PEST

# $1 is number of slaves to run
# $2 is name of PEST case (.pst)
# $3 is run_time estimate
# $4 onward is the model executable and any arguments

cmd_name=`basename $0`

if [ $# -le 3 ]; then
  echo "$cmd_name requires at least 4 arguments
  Example:
  $cmd_name 3 pest_case.pst run_time model [model_arguments]
    3               = number of slaves to use
    pest_case.pst   = name of PEST .pst file (no spaces)
    run_time        = overestimate of model run time in seconds
    model           = model executable
    model_arguments = optional arguments to model executable"
  exit 1
fi

num_slaves=$1
test_case=$2
run_time=$3
shift
shift
shift

# Strip the .pst or .PST extension from test_case
test_case=`basename ${test_case} .pst`
test_case=`basename ${test_case} .PST`

# Check to make sure *.pst exists
if [ ! -r "${test_case}.pst" ] && [ ! -r "${test_case}.PST" ]; then
  echo "${test_case}.pst or ${test_case}.PST not found."
  exit 1
fi

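# Check to make sure not asking for too many slaves.
# Sums column 4 of the 'qstat -g c' summary, then holds back one slot plus 10%.
# The arch and queue filters are site specific; adjust them for your cluster.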
max_num_slaves=`qstat -g c -l arch=lx24-amd64 -q '*_core' | awk 'NR > 2 {sum = sum + $4} END {print int(sum - 1 - 0.1*sum)}'`

if [ $num_slaves -gt $max_num_slaves ]; then
  num_slaves=$max_num_slaves
  echo "Number of slaves set to maximum allowable = ${max_num_slaves}."
fi

# Want full pathname to model executable.
program=`which $1`

# Something special for qhspf.  Since qhspf normally submits the job to the queue
# need to suppress submitting job to the queue by using the -s option.
if [ $1 = 'qhspf' ]
then
  program="${program} -s "
fi

# Add in the remainder of the arguments.
shift
program="${program} $*"


# This is the model executable or script to be run by the slaves.
if [ -f runfile ]; then
  rm runfile
fi
echo "$program" > runfile

# This removes the old Run Management File.
if [ -f "${test_case}.rmf" ]; then
  rm "${test_case}.rmf"
fi

# This creates the header for the new Run Management File (rmf).
cat << EOT > "${test_case}.rmf"
prf
$num_slaves  0  2  1
EOT

# This creates the line for each slave in the new Run Management File (rmf).
for (( count=1; count <= num_slaves; count++ ))
  do
  echo "slave${count}  ./slave${count}" >> "${test_case}.rmf"
  if [ -d "slave${count}" ]; then
    rm -r "slave${count}"
  fi
  mkdir "slave${count}"
  done

# Last line in the Run Management File (rmf) is an estimate of run times.
for (( count=1; count <= num_slaves; count++ ))
  do
  echo -n "${run_time} " >> "${test_case}.rmf"
  done
# Terminate the run-time line in the rmf with a newline.
echo "" >> "${test_case}.rmf"

# Run the slaves first.
present_dir=`pwd`
slave_qid=''
for (( count=1; count <= num_slaves; count++ ))
  do
  cd "slave${count}"
  # rj is a separate script to run jobs.
  # I can't use < redirection because it tries to feed to rj, so have to use
  # '-i' option
  # Collect the job-id in testr variable
  testr=`rj -i ../runfile pslave | awk '{print $3}'`
  # Collect all of the job-ids in slave_qid
  slave_qid="${slave_qid} ${testr}"
  cd "$present_dir"
  done

# Run parallel PEST, block, delete all slaves from queue when ppest is done.
# Whether to restart an interrupted optimization is left for later work.
(rj -b -m ppest $test_case; qdel ${slave_qid}) &
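
To make the layout of the Run Management File easier to see, the sketch below reproduces only the rmf-writing steps of rppest in a temporary directory. The case name 'mycase', the 3 slaves, and the 600 second run time are made-up values for illustration; nothing is submitted to the queue.

```shell
#!/bin/sh
# Illustration only: write a Run Management File the way rppest does,
# using hypothetical values, and print the result.
num_slaves=3      # hypothetical number of slaves
test_case=mycase  # hypothetical PEST case name (without .pst)
run_time=600      # hypothetical run-time estimate in seconds

workdir=$(mktemp -d)
rmf="${workdir}/${test_case}.rmf"

# Header lines, as in rppest.
printf 'prf\n%s  0  2  1\n' "$num_slaves" > "$rmf"

# One line per slave: slave name and its working directory.
count=1
while [ "$count" -le "$num_slaves" ]; do
  printf 'slave%s  ./slave%s\n' "$count" "$count" >> "$rmf"
  count=$((count + 1))
done

# Last line: one run-time estimate per slave, ended with a newline.
count=1
while [ "$count" -le "$num_slaves" ]; do
  printf '%s ' "$run_time" >> "$rmf"
  count=$((count + 1))
done
printf '\n' >> "$rmf"

cat "$rmf"
# Prints:
# prf
# 3  0  2  1
# slave1  ./slave1
# slave2  ./slave2
# slave3  ./slave3
# 600 600 600
```

The matching rppest call would look something like 'rppest 3 mycase.pst 600 model model.inp', where 'model' and 'model.inp' stand in for your own executable and its arguments.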