Utilities

From GridWiki

Load Script

The following script simply submits a batch of jobs, which is useful for testing your settings. It does no error checking, and you will have to modify the -q option to fit your system.

#!/bin/sh


# First argument is the number of jobs
# Second argument is the seconds to sleep

QSUB_OPTIONS="  \
                -q "'*&!boinc'" \
                -cwd \
                -j y \
                -V \
                -N load \
                -o `pwd`/load.out \
"

for i in `seq $1`
do
  qsub $QSUB_OPTIONS /sjr/beodata/local/bin/vanilla_job.sh sleep $2
done
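
Since the script above does no argument checking, a guard could be run before the loop. The following is only a sketch, assuming the first argument should be a positive integer and the second a non-negative integer; the function name check_load_args is made up for illustration and is not part of the original script.

```shell
#!/bin/sh
# check_load_args is a hypothetical helper, not part of the original script.
# It returns 0 only when given exactly two integer arguments:
# a job count of at least 1 and a sleep time of at least 0 seconds.
check_load_args() {
    if [ $# -ne 2 ]; then
        echo "Usage: $0 <number_of_jobs> <seconds_to_sleep>" >&2
        return 1
    fi
    for arg in "$1" "$2"; do
        case $arg in
            ''|*[!0-9]*)
                echo "Not a non-negative integer: '$arg'" >&2
                return 1
                ;;
        esac
    done
    # At least one job must be requested.
    [ "$1" -gt 0 ]
}
```

Calling `check_load_args "$@" || exit 1` at the top of the load script would stop it before any job is submitted.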

Using Ganglia as Load Sensor

See Using Ganglia As Load Sensor for guidance on using Ganglia as a load sensor for Grid Engine.

Modding qstat

The output from qstat -ext, while complete, is often overly verbose (and over 200 characters wide). In SGE 5.3, the following command strips out the Department, Deadline, PE Master, and Array Task columns, which are frequently unused. It makes a good alias (such as "eqstat"):

 qstat -ext | cut -c 1-33,39-45,66-92,110-191
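
In Bourne-style shells a function can be more convenient than an alias. The sketch below wraps the same filter; the names eqstat and eqstat_filter are illustrative, and the column ranges are copied unchanged from the command above, so they carry the same SGE 5.3 assumption.

```shell
#!/bin/sh
# Keep only the character columns selected by the one-liner above.
# Reads qstat -ext output on stdin.
eqstat_filter() {
    cut -c 1-33,39-45,66-92,110-191
}

# Hypothetical wrapper: passes any extra arguments through to qstat.
eqstat() {
    qstat -ext "$@" | eqstat_filter
}
```

Keeping the filter in its own function means it can be tried on saved qstat output before it is used on a live cluster.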

Modding qstat Redux

A longer script with more condensed output, by Andy Schwierskott, taken from the SGE mailing list:

  #!/bin/sh
  echo "JobId     P    S  Project     User Tot-Tkt   ovrts   otckt   dtckt   ftckt   stckt  shr"
  echo "---------------------------------------------------------------------------------------"
  qstat -ext -s rs | grep -v job-ID | sed /-------------/d | \
     gawk '{ printf("%5s %4s %4s %8s %8s %7s %7s %7s %7s %7s %7s %4s\n", \
                   $1, $2, $7, $5, $4, $13, $14, $15, $16, $17, $18, $19) }'
  echo "- - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - -"
  qstat -ext -s p | grep -v job-ID | sed /-------------/d | \
      gawk '{ printf("%5s %4s %4s %8s %8s %7s %7s %7s %7s %7s %7s %4s\n", \
                  $1, $2, $7, $5, $4, $10, $11, $12, $13, $14, $15, $16 ); }'

User Management

If you have lots of users and groups to add at once, and your userlists map to Unix groups, qconf can help automate this a bit. This awk snippet looks for group names starting with "grp" and prints, for each one, a qconf command that creates a matching userlist.

 awk -F: '/^grp/{print "qconf -au ",$4,$1}' /etc/group
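
The one-liner also prints a command for groups that have no listed members, which leaves qconf without a user argument. A hedged variant follows, assuming the usual four-field group file layout (name:password:GID:members); the function name gen_userlist_cmds is invented here.

```shell
#!/bin/sh
# Print one "qconf -au members listname" command per group whose name
# starts with "grp", skipping groups with an empty member field.
# $1 is the group file to read (defaults to /etc/group).
gen_userlist_cmds() {
    awk -F: '/^grp/ && $4 != "" { printf "qconf -au %s %s\n", $4, $1 }' \
        "${1:-/etc/group}"
}
```

The commands are only printed, so they can be reviewed first and then piped to sh.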

qcd

If your users' machines share the same file space as the cluster, this alias changes directory to the current working directory of the given job ID.

It has to be an alias (or a sourced script) in order to affect the current shell.

I use tcsh, which is why there is so much backslash escaping.

Add the line below to the system shell startup file or to ~/.tcshrc:

# tcsh:
alias qcd cd\ \`qstat\ -j\ \!\*\|grep\ ^cwd\|awk\ \'\{print\ \$2\}\'\`

This might be worth revisiting to make it a little nicer with the new XML output format.
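
For Bourne-style shells (sh, bash, ksh) the same idea can be written as a shell function, which avoids the escaping. This is a sketch, assuming that qstat -j prints the working directory on a line beginning with "cwd", as the alias above expects; job_cwd is a helper name made up here.

```shell
# Extract the working directory from "qstat -j" output read on stdin.
# Paths containing spaces are not handled.
job_cwd() {
    awk '/^cwd/ { print $2; exit }'
}

# Change to the working directory of the given job ID.
# Must be defined in (or sourced into) the current shell to take effect.
qcd() {
    dir=$(qstat -j "$1" | job_cwd)
    if [ -n "$dir" ]; then
        cd "$dir"
    else
        echo "qcd: no cwd found for job $1" >&2
        return 1
    fi
}
```

Placing both functions in a shell startup file such as ~/.bashrc makes qcd available in every new shell.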

qavailable_processors

This script sums up the processors that are not currently running a task. You will have to change the qstat options to suit your configuration.

#!/bin/sh

qstat -g c -l arch=lx24-amd64 -q all.q | awk 'NR > 2 {sum = sum + $4} END {print sum}'
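
To see what the awk program does, it can be run on captured text instead of live qstat output. The sample below is invented for illustration. It assumes a layout in which the first two lines are headers and the fourth whitespace-separated field of each data row is the available slot count, which is what the one-liner above relies on; other Grid Engine versions may place that column elsewhere.

```shell
#!/bin/sh
# Sum the fourth field of "qstat -g c"-style text read on stdin,
# skipping the two header lines. Prints 0 when there are no data rows.
sum_available() {
    awk 'NR > 2 { sum += $4 } END { print sum + 0 }'
}

# Made-up sample data, not real cluster output.
sample='CLUSTER QUEUE   CQLOAD   USED  AVAIL  TOTAL
----------------------------------------------
all.q             0.25      4      4      8
long.q            0.10      2      6      8'

printf '%s\n' "$sample" | sum_available
```

With the sample data this prints 10 (4 + 6). If the result looks wrong on a real cluster, checking which field holds the available count is the first thing to try.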


qtail

The following script follows (with tail -f) the standard output file of the given job ID.

#!/bin/sh

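if [ $# -ne 1 ]; then
 echo "Usage:"
 echo "  $0 <sge_job_number>"
 exit 1
fi

# Take the value of the stdout_path_list line from the job details
out_path=`qstat -j "$1" | awk '/^stdout_path_list/ {print $2}'`

if [ -z "$out_path" ]; then
 echo "No stdout path found for job $1"
 exit 1
fi

tail -f "$out_path"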

rj: qsub wrapper script

I created a wrapper script for qsub so that my users do not need to know the details of the cluster configuration or of job submission. I called the wrapper script 'rj' (for run jobs) in order to distinguish it from the collection of 'q*' apps that come with Grid Engine.

It has the following features:

  • Detects whether the application is checkpointable (with Condor)
  • Detects whether the application is MPI (currently known to work with OpenMPI and possibly MPICH), and if so makes sure the option specifying the number of processors was given
  • Sets the maximum number of parallel processors to what is available so that the job can be dispatched immediately

Realize that this is very specific to my setup, and you will need to go through it and edit it to match what you want it to do. It also carries a lot of history and should probably be rewritten. I am approaching the idea of allowing 'rj' to accept all the arguments that 'qsub' accepts and pass them through, defaulting or forcing only those options that I need to change.


#!/bin/bash

# Default
WHOAMI=`whoami`
EMAIL=${WHOAMI}@sjrwmd.com
EXTRA_OPTIONS="  \
                 -q "'*&!boinc'" \
                 -cwd \
                 -j y \
                 -M ${EMAIL} \
                 -V \
"
PE=mpi
NP_OPTION=0
P_OPTION=0
I_OPTION=0

function usage(){
echo "
rj [-np NUM] [-i INPUT_FILE] [-mb] EXECUTABLE
    # EXECUTABLE runs on appropriate node.

rj -np NUM MPI_EXECUTABLE
     # Runs the MPI parallel MPI_EXECUTABLE on NUM nodes.
     # NUM can be a single number or a range, for example 2-8, would run the
     # MPI_EXECUTABLE on at least 2 processors up to 8 processors.

rj -i INPUT_FILE EXECUTABLE
     # Runs EXECUTABLE with interactive input supplied by INPUT_FILE.

rj [-mb] EXECUTABLE
     # -b Waits for EXECUTABLE to finish (blocks) before returning.
     # -m Sends e-mail at end of job."
exit 1
}

# Detect the project option -P and the -i option
# Didn't try getopt - suspect wouldn't work since command can have options.
while [ $# -ne 0 ]
do
   case $1 in
        -P) EXTRA_OPTIONS=" $EXTRA_OPTIONS $1 $2 "
            shift
            shift
            P_OPTION=1
            ;;
        -i) EXTRA_OPTIONS=" $EXTRA_OPTIONS $1 $2 "
            shift
            shift
            I_OPTION=1
            ;;
        -np) MAXNP=$2
            max_num_slaves=`qstat -g c -l arch=lx24-amd64 | awk 'NR > 2 {sum = sum + $4} END {print int(sum - 1 - 0.1*sum)}'`
            if [ $MAXNP -gt $max_num_slaves ]; then
              MAXNP=$max_num_slaves
              echo "Number of slaves set to maximum allowable = ${max_num_slaves}."
            fi
            EXTRA_OPTIONS=" $EXTRA_OPTIONS -pe $PE 1-$MAXNP "
            shift
            shift
            NP_OPTION=1
            ;;
         -b) EXTRA_OPTIONS=" $EXTRA_OPTIONS -sync y "
            shift
            ;;
         -m) EXTRA_OPTIONS=" $EXTRA_OPTIONS -m eas "
            shift
            ;;
         -h|--help) usage
            ;;
         *) break
            ;;
   esac
done

if [ $# -eq 0 ]; then
   usage
fi

full_path=`which $1 2> /dev/null`
if [ "x"${full_path} == "x" ]; then
        echo "Executable $1 not found in any directory in $PATH"
        usage
fi

static=`file -L ${full_path} | grep --count --max-count=1 statically`

condor=`nm ${full_path} 2>&1| grep --count --max-count=1 Condor`

mpi=`nm ${full_path} 2>&1| grep --count --max-count=1 MPI_`
if [ ${mpi} -eq 0 ]; then
    mpi=`ldd ${full_path} 2>&1| grep --count --max-count=1 libmpi`
fi

opteron=`file -L ${full_path} | grep --count --max-count=1 'AMD x86-64'`

APPEND_VAR="$NP_OPTION$P_OPTION$I_OPTION$static$condor$mpi$opteron"

. ${SGE_ROOT}/default/common/settings.sh

# JOBNAME has to be set AFTER options are processed
JOBNAME=`basename $1`

EXTRA_OPTIONS="${EXTRA_OPTIONS} -N ${JOBNAME} -o "`pwd`"/${JOBNAME}.out "

case $APPEND_VAR in
    1????0?) echo ""
            echo "You specified multiple processors but the executable is not MPI."
            echo ""
            usage
            ;;
    0????1?) echo ""
            echo "The executable is MPI, please pass the rj script the -np NUM option."
            echo ""
            usage
            ;;
    1????1?) qsub ${EXTRA_OPTIONS} \
                 /sjr/beodata/local/bin/parallel_job.sh $*
            ;;
    ???11??) qsub ${EXTRA_OPTIONS} \
                 -ckpt condor_ckpt \
                 /sjr/beodata/local/bin/ckpt_job.sh $*
            ;;
    ??????1) qsub ${EXTRA_OPTIONS} \
                 -l arch=lx24-amd64 \
                 /sjr/beodata/local/bin/vanilla_job.sh $*
            ;;
    ???00??) qsub ${EXTRA_OPTIONS} \
                 /sjr/beodata/local/bin/vanilla_job.sh $*
            ;;
    ???10??) echo ""
            echo "The executable is static, but without Condor."
            echo "Checkpointing is only available with Condor."
            echo "Use condor_compile to add the checkpointing libraries."
            echo ""
            qsub ${EXTRA_OPTIONS} \
                 /sjr/beodata/local/bin/vanilla_job.sh $*
            ;;
    *)      usage
            ;;
esac
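
The final case statement in rj keys on a seven-character string of 0/1 flags, in the order: -np given, -P given, -i given, static, Condor, MPI, x86-64. The sketch below mirrors the pattern order but only prints a label instead of calling qsub, which makes it easy to check which branch a given combination reaches. The function name and the labels are invented for this illustration.

```shell
#!/bin/sh
# classify FLAGS: FLAGS is NP P I static condor mpi opteron, each 0 or 1.
# Patterns are tested top to bottom and the first match wins, as in rj.
classify() {
    case $1 in
        1????0?) echo "error: -np given but not MPI" ;;
        0????1?) echo "error: MPI needs -np" ;;
        1????1?) echo "parallel_job.sh" ;;
        ???11??) echo "ckpt_job.sh" ;;
        ??????1) echo "vanilla_job.sh (amd64)" ;;
        ???00??) echo "vanilla_job.sh" ;;
        ???10??) echo "vanilla_job.sh (static, no Condor)" ;;
        *)       echo "usage" ;;
    esac
}

classify 1000010   # MPI binary with -np
classify 0001100   # static binary linked with Condor
classify 0000100   # Condor symbols found but binary is not static
```

The last call falls through to the catch-all branch, which in rj prints the usage text: a binary with Condor symbols that is neither static nor x86-64 matches none of the earlier patterns.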

Required supporting scripts for 'rj':

vanilla_job.sh

#!/bin/bash

echo ${JOB_ID} > sge_job_id

: > $SGE_STDOUT_PATH

#$ -S /bin/bash
$*

parallel_job.sh

#!/bin/bash

echo ${JOB_ID} > sge_job_id

: > $SGE_STDOUT_PATH

#$ -S /bin/bash

mpirun -np $NSLOTS $*

ckpt_job.sh

#!/bin/bash

#$ -S /bin/bash

cd $PWD

echo ${JOB_ID} > sge_job_id

: > $SGE_STDOUT_PATH

CHECKPOINT_DIR="/sjr/beodata/tmp/ckpt"
OUTPUT="${CHECKPOINT_DIR}/${JOB_ID}.log"

echo "-------------------------------------" >> $OUTPUT
if [ ${RESTARTED} -eq 0 ];
then
        echo "      Starting job #${JOB_ID}"         >> $OUTPUT
        # Pass only the arguments after the executable; a bare $@ would repeat $1
        $1 -_condor_ckpt ${CHECKPOINT_DIR}/${JOB_ID}.ckpt "${@:2}" &
else
        echo "      Re-Starting job #${JOB_ID}"      >> $OUTPUT
        $1 -_condor_restart ${CHECKPOINT_DIR}/${JOB_ID}.ckpt &
fi

PROC_PID=$!
echo $! > ${CHECKPOINT_DIR}/${JOB_ID}.pid

echo "-------------------------------------" >> $OUTPUT
echo " Date:     `date`"                     >> $OUTPUT
echo " Job PID:  ${PROC_PID}"                >> $OUTPUT
echo " Hostname: `hostname`"                 >> $OUTPUT
echo "-------------------------------------" >> $OUTPUT
wait