Integrating PEST and Grid Engine

From GridWiki

Parameter ESTimation (PEST) is an open-source program that will find, if possible, the optimum set of parameters for any simulation system. It works from templates of the model input files, varying the parameters marked by special delimiters, and compares the simulation output against observations. It has a wide range of capabilities and features, most beyond my understanding. The optimization procedure requires many model runs, which lends itself quite nicely to computational clusters. PEST comes with a parallel mode that uses files on a shared file system to pass messages between parallel PEST (ppest) and the slaves (pslave).
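As a quick (hypothetical) illustration of the template mechanism: a PEST template file is a copy of a model input file whose first line names a delimiter character, with each adjustable parameter bracketed by that delimiter. The parameter names below are invented.

```
ptf #
INFILT    # infilt01  #
LZSN      # lzsn01    #
```

At each model run, PEST substitutes current parameter values into the delimited fields and writes out a real input file.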

For our work with the Hydrological Simulation Program-FORTRAN (HSPF), I worked up an integration of parallel PEST and Grid Engine.

More information about PEST is available at http://pesthomepage.org. Windows and UNIX/Linux versions are available for download.

I wrote this some time ago, before I discovered array jobs. I tried rewriting the script to use array jobs, but it got really messy, so I kept my old brute-force method. If someone else has a go at it with array jobs, please post it to the wiki. This script creates a new Run Management File (*.rmf), so if you are reading in the PEST manuals about all of the files you have to create, the rmf is the one you don't have to worry about.

You should set max_num_slaves and EMAIL to something that makes sense for your system.
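For reference, here is the Run Management File the script generates for a hypothetical case run with 3 slaves and a 120-second run-time estimate (the slave names and counts follow directly from the loops in the script):

```
prf
3  0  2  1
slave1  ./slave1
slave2  ./slave2
slave3  ./slave3
120 120 120 
```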

rppest script:

#!/bin/bash
# bash (not plain sh) is required for the (( ... )) loops below

# GridEngine script to run parallel PEST

# $1 is the number of slaves to run
# $2 is the name of the PEST case (.pst)
# $3 is the run-time estimate in seconds
# $4 onward is the model executable and any arguments

# Adjust the following to match your needs
WHOAMI=`whoami`
EMAIL=${WHOAMI}@sjrwmd.com
# Sum the available slots (column 4 of `qstat -g c` output), then hold back
# one slot plus ~10% of the total as headroom.
max_num_slaves=`qstat -g c -l arch=lx24-amd64 -q '*_core' | awk 'NR > 2 {sum = sum + $4} END {print int(sum - 1 - 0.1*sum)}'`

# Shouldn't have to change anything beyond this line.  Probably.

cmd_name=`basename $0`

if [ $# -le 3 ]; then
  echo "$cmd_name requires at least 4 arguments
  Example:
  $cmd_name 3 pest_case.pst run_time model [model_arguments]
    3               = number of slaves to use
    pest_case.pst   = name of PEST .pst file (no spaces)
    run_time        = overestimate of model run time in seconds
    model           = model executable
    model_arguments = optional arguments to model executable"
  exit 1
fi

num_slaves=$1
test_case=$2
run_time=$3
shift
shift
shift

# The following allows the pest_case.pst file name to be entered
# without the .pst extension.
# Remove .pst/PST for test_case
test_case=`basename ${test_case} .pst`
test_case=`basename ${test_case} .PST`

# Check to make sure .pst/.PST file exists
if [ ! -r "${test_case}.pst" -a ! -r "${test_case}.PST" ]; then
  echo "${test_case}.pst or ${test_case}.PST not found."
  exit 1
fi

# Check to make sure not asking for too many slaves
if [ $num_slaves -gt $max_num_slaves ]; then
  num_slaves=$max_num_slaves
  echo "Number of slaves set to maximum allowable = ${max_num_slaves}."
fi

# Want full pathname to model executable.
program=`which $1`

# Add in the remainder of the arguments.
shift
program="${program} $*"


# This is the model executable or script to be run by the slaves.
if [ -f runfile ]; then
  rm runfile
fi
echo "$program" > runfile

# This removes the old Run Management File.
if [ -f "${test_case}.rmf" ]; then
  rm "${test_case}.rmf"
fi

# This creates the header for the new Run Management File (rmf).
cat << EOT > "${test_case}.rmf"
prf
$num_slaves  0  2  1
EOT

# This creates the line for each slave in the new Run Management File (rmf).
for (( count=1; count <= num_slaves; count++ ))
  do
  echo "slave${count}  ./slave${count}" >> "${test_case}.rmf"
  if [ -d "./slave${count}" ]; then
    rm -r "./slave${count}"
  fi
  mkdir "slave${count}"
  done

# Last line in the Run Management File (rmf) is an estimate of run times.
for (( count=1; count <= num_slaves; count++ ))
  do
  echo -n "${run_time} " >> "${test_case}.rmf"
  done
echo "" >> "${test_case}.rmf"

# Run the slaves first.
present_dir=`pwd`
slave_qid=''
pslave_exe=`which pslave`
for (( count=1; count <= num_slaves; count++ ))
  do
  cd "slave${count}"
  # I can't use < redirection because it tries to feed to qsub, so have to use
  # '-i' option
  # Collect the job-id in testr variable
  testr=`qsub -cwd -j y -V -o pslave.out -i ../runfile -b y ${pslave_exe} | awk '{print $3}'`
  # Collect all of the job-ids in slave_qid
  slave_qid="${slave_qid} ${testr}"
  cd "$present_dir"
  done

# Run parallel PEST, block, delete all slaves from queue when ppest is done.
# Will work later at whether to restart optimization or not.
if [ -f 'ppest.out' ]; then
  rm -f ppest.out
fi
(qsub -cwd -j y -M ${EMAIL} -V -m eas -o ppest.out -sync y -b y `which ppest` $test_case; qdel ${slave_qid}) &
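You can sanity-check the max_num_slaves calculation offline by feeding captured `qstat -g c` output to the same awk program used in the script. The queue names and slot counts below are invented for illustration; on our installation column 4 is the AVAIL count, and the formula keeps one slot plus roughly 10% of the total in reserve:

```shell
# Hypothetical `qstat -g c -l arch=lx24-amd64 -q '*_core'` output
cat > qstat_sample.txt << 'EOF'
CLUSTER QUEUE    CQLOAD  USED  AVAIL  TOTAL  aoACDS  cdsuE
----------------------------------------------------------
a_core           0.50    10    22     32     0       0
b_core           0.25    4     28     32     0       0
c_core           0.10    1     15     16     0       0
EOF

# Same awk as in rppest: skip the two header lines, sum column 4 (AVAIL),
# then leave one slot plus ~10% headroom.
max_num_slaves=`awk 'NR > 2 {sum = sum + $4} END {print int(sum - 1 - 0.1*sum)}' qstat_sample.txt`
echo $max_num_slaves    # 22 + 28 + 15 = 65 -> int(65 - 1 - 6.5) = 57
```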