Utilities

Load Script

The following script simply creates jobs; it is useful for testing your settings. There is no error checking, and you will have to modify the -q option to suit your system.

#!/bin/sh


# First argument is the number of jobs
# Second argument is the seconds to sleep

QSUB_OPTIONS="  \
                -q "'*&!boinc'" \
                -cwd \
                -j y \
                -V \
                -N load \
                -o `pwd`/load.out \
"

for i in `seq $1`
do
  qsub $QSUB_OPTIONS /sjr/beodata/local/bin/vanilla_job.sh sleep $2
done

Using Ganglia as Load Sensor

See Using Ganglia As Load Sensor for guidance on using Ganglia as a load sensor for Grid Engine.

Modding qstat

The output from qstat -ext, while complete, is in many cases overly verbose (and over 200 characters wide). Under SGE 5.3, the following strips out the Department, Deadline, PE Master, and Array Task columns, which are frequently unused. It makes a good alias (named "eqstat", or similar):

 qstat -ext | cut -c 1-33,39-45,66-92,110-191

Modding qstat Redux

A longer script, with more condensed output, from Andy Schwierskott, pulled from the SGE mailing list:

#!/bin/sh
echo "JobId     P    S  Project     User Tot-Tkt   ovrts   otckt   dtckt   ftckt   stckt  shr"
echo "---------------------------------------------------------------------------------------"
qstat -ext -s rs | grep -v job-ID | sed /-------------/d | \
    gawk '{ printf("%5s %4s %4s %8s %8s %7s %7s %7s %7s %7s %7s %4s\n", \
                   $1, $2, $7, $5, $4, $13, $14, $15, $16, $17, $18, $19) }'
echo "- - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - -"
qstat -ext -s p | grep -v job-ID | sed /-------------/d | \
    gawk '{ printf("%5s %4s %4s %8s %8s %7s %7s %7s %7s %7s %7s %4s\n", \
                   $1, $2, $7, $5, $4, $10, $11, $12, $13, $14, $15, $16) }'

User Management

If you have lots of users and groups to add at once, and your userlists map to Unix groups, qconf can help automate this a bit. This awk snippet looks for entries starting with "grp" and generates a qconf command to create a matching userlist.

 awk -F: '/^grp/{print "qconf -au ",$4,$1}' /etc/group
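To preview what the snippet generates before pointing it at the real /etc/group, you can run it against a few sample lines (the group names and members below are invented):

```shell
# Run the userlist generator against fabricated /etc/group-style input.
# Field 4 is the comma-separated member list, field 1 the group name;
# only entries whose name starts with "grp" produce a command.
printf '%s\n' \
  'grpmodels:x:501:alice,bob' \
  'grpgis:x:502:carol' \
  'video:x:503:dave' |
awk -F: '/^grp/{print "qconf -au", $4, $1}'
# prints:
#   qconf -au alice,bob grpmodels
#   qconf -au carol grpgis
```

Piping the generated commands straight to sh would execute them; reviewing the output first is safer.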

qcd

If your users have the same shared file space as the cluster, this alias will change to the current working directory of the given job ID.

It has to be an alias (or sourced) in order to affect the current shell.

I use tcsh, which is why all the bizarre backslashing is required.

Add the line below to the system shell startup file or to ~/.tcshrc:

# tcsh:
alias qcd cd\ \`qstat\ -j\ \!\*\|awk\ \'/^cwd/\{print\ \$2\}\'\`

For bash/ksh/zsh, a shell function works well:

function qcd {
    cd `qstat -j $1 | awk '/^cwd/{print $2}'`;
}

This could be revisited to make it a little nicer using the new XML output format.
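For what it's worth, newer Grid Engine releases can emit XML (`qstat -xml -j <jobid>`), which is less fragile than matching the plain-text layout. A sketch of an XML-aware bash version follows; the element name `JB_cwd` is an assumption, so check it against your qstat's actual XML output:

```shell
# Hypothetical XML-based qcd: pull the job's working directory out of
# "qstat -xml -j" with sed. JB_cwd is assumed to be the element name
# holding the job's cwd; verify against your installation's output.
qcd() {
    local dir
    dir=$(qstat -xml -j "$1" | sed -n 's:.*<JB_cwd>\(.*\)</JB_cwd>.*:\1:p')
    [ -n "$dir" ] && cd "$dir"
}
```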

qavailable_processors

This script just sums up the processors that aren't performing a task. You will have to change the qstat options to suit your configuration.

#!/bin/sh

qstat -g c -l arch=lx24-amd64 -q all.q | awk 'NR > 2 {sum = sum + $4} END {print sum}'
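To see what the awk program is doing, here it is applied to fabricated `qstat -g c` output: `NR > 2` skips the two header lines and `$4` is the AVAIL column.

```shell
# Fabricated "qstat -g c" output; the real command's first two lines are
# headers, and field 4 (AVAIL) holds the free slot count per queue.
printf '%s\n' \
  'CLUSTER QUEUE   CQLOAD   USED  AVAIL  TOTAL aoACDS  cdsuE' \
  '----------------------------------------------------------' \
  'all.q             0.52     12      4     16      0      0' \
  'long.q            0.10      2      6      8      0      0' |
awk 'NR > 2 {sum = sum + $4} END {print sum}'
# prints 10
```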


qtail

The following script will 'tail' the end of the standard output file of the given job ID.

#!/bin/sh

if [ $# -ne 1 ]; then
  echo "Usage:"
  echo "  $0 <sge_job_number>"
  exit
fi

out_path=`qstat -j $1 | grep ^stdout_path_list | awk '{print $2}'`

if [ "X$out_path" = "X" ]; then
  exit
fi

if [ "$out_path" = "/dev/null" ]; then
  echo "Standard output is not available because it was directed to /dev/null"
  exit
fi

tail -f "$out_path"

rj: qsub wrapper script

I created a wrapper script for qsub so that my users do not need to know the complexity of the cluster configuration or of job submission. I called the wrapper script 'rj' (for run jobs) to distinguish it from the collection of 'q*' applications that come with Grid Engine.

It has the following features:

  • Detects whether the application is checkpointable (with Condor)
  • Detects whether the application is MPI (currently known to work with OpenMPI and possibly MPICH), and then checks that it is given the option specifying the number of processors
  • Sets the maximum number of parallel processors to what is available so that the job can be dispatched immediately

Be aware that this is very specific to my setup; you will need to go through it and edit it to match what you want it to do. It also carries a lot of history and should probably be rewritten. I am moving toward allowing 'rj' to accept all arguments that 'qsub' accepts and pass them through, defaulting or forcing only those options I need to change.


#!/bin/bash

# Default
WHOAMI=`whoami`
EMAIL=${WHOAMI}@sjrwmd.com
EXTRA_OPTIONS="  \
                 -q "'*&!boinc'" \
                 -cwd \
                 -j y \
                 -M ${EMAIL} \
                 -V \
"
PE=mpi
NP_OPTION=0
P_OPTION=0
I_OPTION=0

function usage(){
echo "
rj [-np NUM] [-i INPUT_FILE] [-mb] EXECUTABLE
    # EXECUTABLE runs on appropriate node.

rj -np NUM MPI_EXECUTABLE
     # Runs the MPI parallel MPI_EXECUTABLE on NUM nodes.
     # NUM can be a single number or a range, for example 2-8, would run the
     # MPI_EXECUTABLE on at least 2 processors up to 8 processors.

rj -i INPUT_FILE EXECUTABLE
     # Runs EXECUTABLE with interactive input supplied by INPUT_FILE.

rj [-mb] EXECUTABLE
     # -b Waits for EXECUTABLE to finish (blocks) before returning.
     # -m Sends e-mail at end of job."
exit 1
}

# Detect the project option -P and the -i option
# Didn't try getopt - suspect wouldn't work since command can have options.
while [ $# -ne 0 ]
do
   case $1 in
        -P) EXTRA_OPTIONS=" $EXTRA_OPTIONS $1 $2 "
            shift
            shift
            P_OPTION=1
            ;;
        -i) EXTRA_OPTIONS=" $EXTRA_OPTIONS $1 $2 "
            shift
            shift
            I_OPTION=1
            ;;
        -np) MAXNP=$2
            max_num_slaves=`qstat -g c -l arch=lx24-amd64 | awk 'NR > 2 {sum = sum + $4} END {print int(sum - 1 - 0.1*sum)}'`
            if [ $MAXNP -gt $max_num_slaves ]; then
              MAXNP=$max_num_slaves
              echo "Number of slaves set to maximum allowable = ${max_num_slaves}."
            fi
            EXTRA_OPTIONS=" $EXTRA_OPTIONS -pe $PE 1-$MAXNP "
            shift
            shift
            NP_OPTION=1
            ;;
         -b) EXTRA_OPTIONS=" $EXTRA_OPTIONS -sync y "
            shift
            ;;
         -m) EXTRA_OPTIONS=" $EXTRA_OPTIONS -m eas "
            shift
            ;;
         -h|--help) usage
            ;;
         *) break
            ;;
   esac
done

if [ $# -eq 0 ]; then
   usage
fi

full_path=`which $1 2> /dev/null`
if [ "x${full_path}" = "x" ]; then
        echo "Executable $1 not found in any directory in $PATH"
        usage
fi

static=`file -L ${full_path} | grep --count --max-count=1 statically`

condor=`nm ${full_path} 2>&1| grep --count --max-count=1 Condor`

mpi=`nm ${full_path} 2>&1| grep --count --max-count=1 MPI_`
if [ ${mpi} -eq 0 ]; then
    mpi=`ldd ${full_path} 2>&1| grep --count --max-count=1 libmpi`
fi

opteron=`file -L ${full_path} | grep --count --max-count=1 'AMD x86-64'`

APPEND_VAR="$NP_OPTION$P_OPTION$I_OPTION$static$condor$mpi$opteron"

. ${SGE_ROOT}/default/common/settings.sh

# JOBNAME has to be set AFTER options are processed
JOBNAME=`basename $1`

EXTRA_OPTIONS="${EXTRA_OPTIONS} -N ${JOBNAME} -o "`pwd`"/${JOBNAME}.out "

case $APPEND_VAR in
    1????0?) echo ""
            echo "You specified multiple processors but the executable is not MPI."
            echo ""
            usage
            ;;
    0????1?) echo ""
            echo "The executable is MPI, please pass the rj script the -np NUM option."
            echo ""
            usage
            ;;
    1????1?) qsub ${EXTRA_OPTIONS} \
                 /sjr/beodata/local/bin/parallel_job.sh $*
            ;;
    ???11??) qsub ${EXTRA_OPTIONS} \
                 -ckpt condor_ckpt \
                 /sjr/beodata/local/bin/ckpt_job.sh $*
            ;;
    ??????1) qsub ${EXTRA_OPTIONS} \
                 -l arch=lx24-amd64 \
                 /sjr/beodata/local/bin/vanilla_job.sh $*
            ;;
    ???00??) qsub ${EXTRA_OPTIONS} \
                 /sjr/beodata/local/bin/vanilla_job.sh $*
            ;;
    ???10??) echo ""
            echo "The executable is static, but without Condor."
            echo "Checkpointing is only available with Condor."
            echo "Use condor_compile to add the checkpointing libraries."
            echo ""
            qsub ${EXTRA_OPTIONS} \
                 /sjr/beodata/local/bin/vanilla_job.sh $*
            ;;
    *)      usage
            ;;
esac

Required supporting scripts for 'rj':

vanilla_job.sh

#!/bin/bash

echo ${JOB_ID} > sge_job_id

: > $SGE_STDOUT_PATH

#$ -S /bin/bash
$*

parallel_job.sh

#!/bin/bash

echo ${JOB_ID} > sge_job_id

: > $SGE_STDOUT_PATH

#$ -S /bin/bash

mpirun -np $NSLOTS $*

ckpt_job.sh

#!/bin/bash

#$ -S /bin/bash

cd $PWD

echo ${JOB_ID} > sge_job_id

: > $SGE_STDOUT_PATH

CHECKPOINT_DIR="/sjr/beodata/tmp/ckpt"
OUTPUT="${CHECKPOINT_DIR}/${JOB_ID}.log"

echo "-------------------------------------" >> $OUTPUT
if [ ${RESTARTED} -eq 0 ];
then
        echo "      Starting job #${JOB_ID}"         >> $OUTPUT
        $1 -_condor_ckpt ${CHECKPOINT_DIR}/${JOB_ID}.ckpt $@ &
else
        echo "      Re-Starting job #${JOB_ID}"      >> $OUTPUT
        $1 -_condor_restart ${CHECKPOINT_DIR}/${JOB_ID}.ckpt &
fi

PROC_PID=$!
echo $! > ${CHECKPOINT_DIR}/${JOB_ID}.pid

echo "-------------------------------------" >> $OUTPUT
echo " Date:     `date`"                     >> $OUTPUT
echo " Job PID:  ${PROC_PID}"                >> $OUTPUT
echo " Hostname: `hostname`"                 >> $OUTPUT
echo "-------------------------------------" >> $OUTPUT
wait

User Job Stats

I created this script to automate some monthly reporting I was required to do. Its features are:

  • Display figures for the week, month or year to date, or for the specified number of days
  • Display total cluster usage, globally and for each user
  • Display average cluster usage, for each user and the usage of the 'average' job across the whole cluster

It has a help function with usage examples, but if anyone needs pointers feel free to drop me a line.

NB: To determine which users have actually logged on to the cluster, it gets a list of home directories from the path specified by 'HOME_DIR_LOCATION'. You may need to alter the value of this variable to suit your system for the script to work.
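Listing home directories is only a heuristic; if your users live in the password database, one alternative sketch is to derive the user list from getent instead. The UID cutoff of 1000 below is an assumption about where regular accounts start on your system:

```shell
# Hypothetical replacement for the script's get_userlist: take every
# passwd entry with a UID of 1000 or above. Adjust the cutoff to your
# site's convention, and note that pseudo-accounts such as "nobody"
# (often UID 65534) may also match and need filtering out.
get_userlist() {
	USER_LIST=`getent passwd | awk -F: '$3 >= 1000 && $3 < 65000 {print $1}'`
}
```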

#!/bin/bash
#
#==========================================================================
# Name: User SGE Job Statistics
# Author: Chris Bingham
# Date: 11.11.2008
# Language: Bash
# External References: ls, qacct, grep, cut, expr, getent, tr, date, qconf
#
# This script will use the SGE command 'qacct' to search back though all the
# job records for the specified number of days, and return a table of
# statistics for each user's usage of the cluster, sorted highest to lowest.
#==========================================================================

# Global Variables
HOME_DIR_LOCATION="/home"

# Store the first two arguments, all other arguments will be discarded
DAYS=$1
TOTAL=$2

function display_help() {
	# Display a help message
	echo "---User SGE Job Statistics---"
	echo "This script will use the SGE command 'qacct' to search back though all the"
	echo "job records for the specified number of days, and return a table of statistics"
	echo "for each user's usage of the cluster, sorted highest to lowest."
	echo ""
	echo "--Usage--"
	echo -e "  user_job_stats.sh [DAYS|OPTION] [total]"
	echo -e "  Where 'DAYS' is a number of days to gather statistics for or 'OPTION' is one of the following;"
	echo -e "  \tweek\t\t\tGather statistics for the week so far"
	echo -e "  \tmonth\t\t\tGather statistics for the month to date"
	echo -e "  \tyear\t\t\tGather statistics for the year to date"
	echo -e "  \thelp\t\t\tDisplay this message"
	echo -e "  Specifying 'total' at the end of the line will generate total usage statistics rather than average usage statistics"
	echo ""
	echo "--Usage Examples--"
	echo -e "  user_job_stats.sh 10\t\tGet average statistics for the last 10 days"
	echo -e "  user_job_stats.sh 10 total\tGet total statistics for the last 10 days"
	echo -e "  user_job_stats.sh month\tGet average statistics for the month to date"
	echo -e "  user_job_stats.sh week total\tGet total statistics for the week to date"
	echo ""
	echo "--Definitions--"
	echo -e "  CPU Time\t\t\tThe amount of time for which jobs were using CPU resources, measured using SGE's 'cpu' metric from 'qacct'"
	echo -e "  Wallclock Time\t\tThe amount of time for which jobs were running."
	echo -e "  % of Wallclock Time\t\tThe percentage of the cluster's overall total available wallclock time used."
	echo -e "  \t\t\t\tTotal available wallclock time is calculated as: DAYS * Number of CPUs in Cluster"
}

function averages_or_totals() {
	# Determine whether the user requested totals or averages
	if [ "$TOTAL" = "total" ] ; then
		gen_totals
	elif [ -z "$TOTAL" ] ; then
		gen_averages
	else
		display_help
	fi
}

function gen_human_readable_time() {
	# Convert a time in seconds into a more human-friendly scale (hours, days etc instead of seconds)
	
	# If the time span is less than 1 hour, convert to minutes
	if [ "$TIME" -lt "3600" ] ; then
		TIME=`echo "scale=2; $TIME/60" | bc`
		TIME="$TIME minutes"
	# If the time span is less than 1 day, convert to hours
	elif [ "$TIME" -lt "86400" ] ; then
		TIME=`echo "scale=2; $TIME/3600" | bc`
		TIME="$TIME hours"
	# If the time span is less than 1 week, convert to days
	elif [ "$TIME" -lt "604800" ] ; then
		TIME=`echo "scale=2; $TIME/86400" | bc`
		TIME="$TIME days"
	# If the time span is less than 1 year, convert to weeks
	elif [ "$TIME" -lt "31449600" ] ; then
		TIME=`echo "scale=2; $TIME/604800" | bc`
		TIME="$TIME weeks"
	# If the time span is 1 year or more, convert to years
	else
		TIME=`echo "scale=2; $TIME/31449600" | bc`
		TIME="$TIME years"
	fi
}

function get_userlist() {
	# Get a list of usernames by listing the home directories
	USER_LIST=`ls $HOME_DIR_LOCATION`
}

function calc_cpu_wallclock() {
	# Calculate the total amount of available cluster time during the specified number of days
	# Get a count of the number of execution hosts from SGE
	CPU_COUNT=`qconf -sep | grep -i sum | tr -d [:alpha:][:space:]`
	
	# Calculate the total amount of wallclock for all nodes in seconds
	CPU_WALLCLOCK=$(($CPU_COUNT * $DAYS * 24 * 60 * 60))
	# Calculate 1% of the total cluster wallclock (for later use)
	CPU_WALLCLOCK_100=`expr $CPU_WALLCLOCK / 100`
}

function gen_totals() {

	# Call 'get_userlist' to get a user name list
	get_userlist
	
	# Get the total amount of available cluster time during the specified number of days
	calc_cpu_wallclock
	
	# Reset the total counters to zero
	TOTAL_PERCENT_CPU_WALLCLOCK="0"
	TOTAL_CPU_WALLCLOCK="0"
	TOTAL_CLUSTER_CPUTIME="0"
	TOTAL_JOB_COUNT="0"

	# For each username found, do the following;
	for i in $USER_LIST ; do
		
		# Use 'qacct' and 'grep' to count the total number of jobs they've submitted
		USER_USER_JOB_COUNT=`qacct -o "$i" -j -d $DAYS | grep "jobname" -c`
		TOTAL_JOB_COUNT=`expr $TOTAL_JOB_COUNT + $USER_USER_JOB_COUNT`
		
		# If the user has submitted no jobs, record their utilisation as zero,
		# else, use 'qacct', 'grep' and 'cut' to get an array of the wallclock and CPU time
		# counters for all of their jobs
		if [ "$USER_USER_JOB_COUNT" = "0" ] ; then
			USER_TOTAL_WALLCLOCK="0"
			USER_TOTAL_CPUTIME="0"
			USER_PERCENT_CPU_WALLCLOCK="0"
		else
			USER_TOTAL_WALLCLOCK=`qacct -o "$i" -d $DAYS | grep "$i" | tr -s " " "\t" | cut -f 2`
			USER_TOTAL_CPUTIME=`qacct -o "$i" -d $DAYS | grep "$i" | tr -s " " "\t" | cut -f 5`
			USER_PERCENT_CPU_WALLCLOCK=`echo "scale=2; $USER_TOTAL_WALLCLOCK/$CPU_WALLCLOCK_100" | bc`
		fi
		
		if [ -z "$USER_TOTAL_WALLCLOCK" ]  ; then
			USER_TOTAL_WALLCLOCK="0"
			USER_PERCENT_CPU_WALLCLOCK="0"
		fi
		
		if [ -z "$USER_TOTAL_CPUTIME" ] ; then
			USER_TOTAL_CPUTIME="0"
			USER_PERCENT_CPU_WALLCLOCK="0"
		fi
		
		# Add this user's percentage of cluster time consumed to the total
		TOTAL_PERCENT_CPU_WALLCLOCK=`echo "scale=2; $TOTAL_PERCENT_CPU_WALLCLOCK+$USER_PERCENT_CPU_WALLCLOCK" | bc`
		TOTAL_CPU_WALLCLOCK=`echo "scale=2; $TOTAL_CPU_WALLCLOCK+$USER_TOTAL_WALLCLOCK" | bc`
		TOTAL_CLUSTER_CPUTIME=`echo "scale=2; $TOTAL_CLUSTER_CPUTIME+$USER_TOTAL_CPUTIME" | bc`
		
		# Convert the user's username into their actual name
		USERNAME=`getent passwd "$i" | cut -d ":" -f 5`
		
		# If the passwd file didn't contain the user's actual name, use their username
		if [ -z "$USERNAME" ] ; then
			USERNAME="$i"
		fi
		
		# Start the output string
		OUT="$OUT$USER_PERCENT_CPU_WALLCLOCK \t\t\t"
		
		# If the total wallclock and CPU time are less than a million, add an extra tab
		# after them to improve the readability of the output table
		if [ "$USER_TOTAL_WALLCLOCK" -ge "1000000" ] ; then
			OUT="$OUT$USER_TOTAL_WALLCLOCK \t\t"
		else
			OUT="$OUT$USER_TOTAL_WALLCLOCK \t\t\t"
		fi
		
		if [ "$USER_TOTAL_CPUTIME" -ge "1000000" ] ; then
			OUT="$OUT$USER_TOTAL_CPUTIME \t$USER_USER_JOB_COUNT \t\t\t$USERNAME \n"
		else
			OUT="$OUT$USER_TOTAL_CPUTIME \t\t$USER_USER_JOB_COUNT \t\t\t$USERNAME \n"
		fi
		
		# Clean up all variables for the next loop
		USER_USER_JOB_COUNT=""
		USER_TOTAL_WALLCLOCK=""
		USER_TOTAL_CPUTIME=""
		USER_PERCENT_CPU_WALLCLOCK=""
		USERNAME=""
	done
	
	# Output the results table, performing a reverse-numerical sort on the results
	echo -e "% of Wallclock Time\tTotal Wallclock Time\tTotal CPU Time\tTotal Number of Jobs\tUser Name"
	echo "-------------------------------------------------------------------------------------------------"
	echo -e $OUT | sort -nr
	echo "-------------------------------------------------------------------------------------------------"
	
	# Calculate the percentage of time the cluster was idle for
	CLUSTER_IDLE=`echo "scale=2; 100-$TOTAL_PERCENT_CPU_WALLCLOCK" | bc`
	
	# Convert the total wallclock time in to a human-readable form
	TIME="$TOTAL_CPU_WALLCLOCK"
	gen_human_readable_time
	TOTAL_CPU_WALLCLOCK="$TIME"
	
	# Convert the total CPU time in to a human-readable form
	TIME="$TOTAL_CLUSTER_CPUTIME"
	gen_human_readable_time
	TOTAL_CLUSTER_CPUTIME="$TIME"
	
	# Output overall cluster utilisation statistics
	echo "$TOTAL_JOB_COUNT jobs used $TOTAL_CPU_WALLCLOCK of Wallclock Time and $TOTAL_CLUSTER_CPUTIME of CPU Time during the last $DAYS days, and the cluster was $CLUSTER_IDLE% idle."
}

function gen_averages() {

	# Call 'get_userlist' to get a user name list
	get_userlist
	
	# For each username found, do the following;
	for i in $USER_LIST ; do
		
		# Use 'qacct' and 'grep' to count the total number of jobs they've submitted
		USER_JOB_COUNT=`qacct -o "$i" -j -d $DAYS | grep "jobname" -c`
		
		# Add the user's job count to the overall total
		TOTAL_JOB_COUNT=`expr $TOTAL_JOB_COUNT + $USER_JOB_COUNT`
		
		# If the user has submitted no jobs, record their utilisation as zero,
		# else, use 'qacct', 'grep' and 'cut' to get an array of the wallclock and CPU time
		# counters for all of their jobs
		if [ "$USER_JOB_COUNT" = "0" ] ; then
			USER_JOB_WALLCLOCKS="0"
			USER_JOB_CPUTIMES="0"
		else
			USER_JOB_WALLCLOCKS=`qacct -o "$i" -j -d $DAYS |  grep "wallclock " | cut -d " " -f 2`
			USER_JOB_CPUTIMES=`qacct -o "$i" -j -d $DAYS | grep "cpu " | cut -c 14-100`
		fi
		
		# Set 'COUNT' to zero
		COUNT="0"
		
		# If the user's utilisation isn't zero, calculate their average wallclock time
		if [ "$USER_JOB_WALLCLOCKS" != "0" ] ; then
			for a in $USER_JOB_WALLCLOCKS ; do
				
				USER_TOTAL_WALLCLOCK=`expr $USER_TOTAL_WALLCLOCK + $a`
				COUNT=`expr $COUNT + 1`
				
			done
			
			# Add the user's total wallclock time to the overall total
			TOTAL_JOB_WALLCLOCK=`expr $TOTAL_JOB_WALLCLOCK + $USER_TOTAL_WALLCLOCK`
			
			USER_AVG_JOB_WALLCLOCK=`expr $USER_TOTAL_WALLCLOCK / $COUNT`
		else
			USER_AVG_JOB_WALLCLOCK="0"
		fi
			
		# Set 'COUNT' to zero
		COUNT="0"
		
		# If the user's utilisation isn't zero, calculate their average CPU time
		if [ "$USER_JOB_CPUTIMES" != "0" ] ; then
			for b in $USER_JOB_CPUTIMES ; do
				
				USER_TOTAL_CPUTIME=`expr $USER_TOTAL_CPUTIME + $b`
				COUNT=`expr $COUNT + 1`
				
			done
			
			# Add the user's total CPU time to the overall total
			TOTAL_JOB_CPUTIME=`expr $TOTAL_JOB_CPUTIME + $USER_TOTAL_CPUTIME`
			
			USER_AVG_JOB_CPUTIME=`expr $USER_TOTAL_CPUTIME / $COUNT`
		else
			USER_AVG_JOB_CPUTIME="0"
		fi
		
		# Convert the user's username into their actual name
		USERNAME=`getent passwd "$i" | cut -d ":" -f 5`
		
		# If the passwd file didn't contain the user's actual name, use their username
		if [ -z "$USERNAME" ] ; then
			USERNAME="$i"
		fi
		
		# If the average wallclock and CPU time are less than a million, add an extra tab
		# after them to improve the readability of the output table
		if [ "$USER_AVG_JOB_WALLCLOCK" -ge "1000000" ] ; then
			OUT="$OUT$USER_AVG_JOB_WALLCLOCK \t\t\t"
		else
			OUT="$OUT$USER_AVG_JOB_WALLCLOCK \t\t\t\t"
		fi
		
		if [ "$USER_AVG_JOB_CPUTIME" -ge "1000000" ] ; then
			OUT="$OUT$USER_AVG_JOB_CPUTIME \t\t$USER_JOB_COUNT \t\t\t$USERNAME \n"
		else
			OUT="$OUT$USER_AVG_JOB_CPUTIME \t\t\t$USER_JOB_COUNT \t\t\t$USERNAME \n"
		fi
		
		# Clean up all variables for the next loop
		USER_JOB_COUNT=""
		USER_JOB_WALLCLOCKS=""
		USER_JOB_CPUTIMES=""
		COUNT=""
		USER_TOTAL_WALLCLOCK=""
		USER_TOTAL_CPUTIME=""
		USER_AVG_JOB_WALLCLOCK=""
		USER_AVG_JOB_CPUTIME=""
		USERNAME=""
	done
	
	# Output the results table, performing a reverse-numerical sort on the results
	echo -e "Average Wallclock Time/Job\tAverage CPU Time/Job\tTotal Number of Jobs\tUser Name"
	echo "-------------------------------------------------------------------------------------------"
	echo -e $OUT | sort -nr
	echo "-------------------------------------------------------------------------------------------"
	
	# If the total job count is 0, record all overall averages as 0, otherwise calculate them
	if [ "$TOTAL_JOB_COUNT" = "0" ] ; then
		TOTAL_AVG_JOB_WALLCLOCK="0"
		TOTAL_AVG_JOB_CPUTIME="0"
	else
		TOTAL_AVG_JOB_WALLCLOCK=`expr $TOTAL_JOB_WALLCLOCK / $TOTAL_JOB_COUNT`
		TOTAL_AVG_JOB_CPUTIME=`expr $TOTAL_JOB_CPUTIME / $TOTAL_JOB_COUNT`
	fi
	
	# Calculate average job idle time
	TOTAL_AVG_JOB_IDLE=`echo "scale=2; 100-(($TOTAL_AVG_JOB_CPUTIME/$TOTAL_AVG_JOB_WALLCLOCK)*100)" | bc`
	
	# Convert the total wallclock time in to a human-readable form
	TIME="$TOTAL_AVG_JOB_WALLCLOCK"
	gen_human_readable_time
	TOTAL_AVG_JOB_WALLCLOCK="$TIME"
	
	# Convert the total CPU time in to a human-readable form
	TIME="$TOTAL_AVG_JOB_CPUTIME"
	gen_human_readable_time
	TOTAL_AVG_JOB_CPUTIME="$TIME"
	
	# Output the overall average job statistics
	echo "The average job during the last $DAYS days took $TOTAL_AVG_JOB_WALLCLOCK to complete, consumed $TOTAL_AVG_JOB_CPUTIME of CPU Time and was idle for $TOTAL_AVG_JOB_IDLE% of the time."
}

# Check if the first argument was null, and display help and exit if so
if [ -z "$DAYS" ] ; then
	display_help
	exit
else
	# Else, select from the following options
	case "$DAYS" in
		"week")
			# If 'week' is specified, determine how many days into the week we are, starting from Monday
			DAYS=`date +%u`
			averages_or_totals
			;;
		"month")
			# If 'month' is specified, determine how many days into the month we are
			DAYS=`date +%d`
			averages_or_totals
			;;
		"year")
			# If 'year' is specified, determine how many days into the year we are
			DAYS=`date +%j`
			averages_or_totals
			;;
		(*[0-9])
			# Otherwise, if the input contains numbers, trim out all non-numeric characters and continue
			DAYS=`echo $DAYS | tr -d [:alpha:][:punct:]`
			averages_or_totals
			;;
		"help")
			# If 'help' is specified, display the help message and exit
			display_help
			exit
			;;
		*)
			# If the input is anything else, display the help message and exit
			display_help
			exit
			;;
	esac
fi

qtime

This script was written to give people an idea of how long their job might have to wait to be executed at any given time. It iterates through all completed jobs from the given time period (week, month or year to date, or a number of days) and calculates the minimum, maximum and average wait times.

On one system I've worked on the average wait time over a month was used as a measurement of system performance in the SLA, with this script providing the figures.

Again, it has a help function with usage examples, but if anyone needs pointers feel free to drop me a line.
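The heart of the calculation is converting each pair of qacct timestamps to seconds since the epoch with GNU date's -d option and subtracting. A standalone example with made-up timestamps:

```shell
# Queuing time for one job: start_time minus qsub_time, each converted to
# epoch seconds by GNU date. The timestamps here are fabricated but follow
# the ctime-like format qacct reports.
submit='Mon Feb  9 10:00:00 2009'
start='Mon Feb  9 10:05:30 2009'
wait_time=$(( `date -d "$start" +%s` - `date -d "$submit" +%s` ))
echo "$wait_time"
# prints 330
```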

#!/bin/bash
#
#==========================================================================
# Name: Average Queuing Time
# Author: Chris Bingham
# Date: 12.02.2009
# Language: Bash
# External References: qacct, grep, cut, bc
#
# This script will calculate the average time jobs have had to spend
# queuing before being run over the specified time period
#==========================================================================

# Store the first argument
DAYS="$1"

function display_help() {
	# Display a help message
	echo "---Average Queuing Time---"
	echo "This script will calculate the average time jobs have had to spend"
	echo "queuing before being run over the specified time period"
	echo ""
	echo "--Usage--"
	echo -e "  qtime.sh [DAYS|OPTION]"
	echo -e "  Where 'DAYS' is a number of days to calculate the average for or 'OPTION' is one of the following;"
	echo -e "  \tweek\t\t\tCalculate the average for the week so far"
	echo -e "  \tmonth\t\t\tCalculate the average for the month to date"
	echo -e "  \tyear\t\t\tCalculate the average for the year to date"
	echo -e "  \thelp\t\t\tDisplay this message"
}

function gen_human_readable_time() {
	# Convert a time in seconds into a more human-friendly scale (hours, days etc instead of seconds)
	
	# If the time span is less than 1 minute, leave it in seconds
	if [ "$TIME_INT" -lt "60" ] ; then
		TIME="$TIME seconds"
	# If the time span is less than 1 hour, convert to minutes
	elif [ "$TIME_INT" -lt "3600" ] ; then
		TIME=`echo "scale=2; $TIME/60" | bc`
		TIME="$TIME minutes"
	# If the time span is less than 1 day, convert to hours
	elif [ "$TIME_INT" -lt "86400" ] ; then
		TIME=`echo "scale=2; $TIME/3600" | bc`
		TIME="$TIME hours"
	# If the time span is less than 1 week, convert to days
	elif [ "$TIME_INT" -lt "604800" ] ; then
		TIME=`echo "scale=2; $TIME/86400" | bc`
		TIME="$TIME days"
	# If the time span is less than 1 year, convert to weeks
	elif [ "$TIME_INT" -lt "31449600" ] ; then
		TIME=`echo "scale=2; $TIME/604800" | bc`
		TIME="$TIME weeks"
	# If the time span is 1 year or more, convert to years
	else
		TIME=`echo "scale=2; $TIME/31449600" | bc`
		TIME="$TIME years"
	fi
}

function calc_avg() {
	# Set the field separator for array creation to a newline
	OLDIFS=$IFS
	IFS=$'\n'
	
	# Get information for SGE using the 'qacct' command, storing submit and start times in arrays
	USER_JOB_COUNT=`qacct -j -d $DAYS | grep "jobname" -c`
	USER_SUBMIT_TIMES=($(qacct -j -d $DAYS | grep "qsub_time" | cut -d " " -f 5-9))
	USER_START_TIMES=($(qacct -j -d $DAYS | grep "start_time" | cut -d " " -f 4-9))
	
	# Get the length of one of the arrays
	USER_SUBMIT_TIMES_COUNT=${#USER_SUBMIT_TIMES[@]}
	
	# Reset the field separator to its previous value
	IFS=$OLDIFS

	# Create variables to store min and max wait times
	MIN_WAIT_TIME=""
	MAX_WAIT_TIME=""
	
	# Determine if any jobs have been completed over the specified time period
	if [ "$USER_SUBMIT_TIMES_COUNT" -gt "0" ] ; then
		# If yes, then calculate the average wait time
		
		# For each element in the arrays, do the following;
		for (( i=0; i<${USER_SUBMIT_TIMES_COUNT}; i++ )) ; do
			# Convert the submit and start time to seconds since the epoch
			SUBMIT_SECONDS=`date -d "${USER_SUBMIT_TIMES[$i]}" +%s`
			START_SECONDS=`date -d "${USER_START_TIMES[$i]}" +%s`
			
			# Calculate how long the job was queuing, and add this to the total queuing time
			WAIT_TIME=$(($START_SECONDS-$SUBMIT_SECONDS))
			TOTAL_WAIT_TIME=$((TOTAL_WAIT_TIME+$WAIT_TIME))
			
			if [ -z "$MIN_WAIT_TIME" ] ; then
				MIN_WAIT_TIME=$WAIT_TIME
				MAX_WAIT_TIME=$WAIT_TIME
			else
				if [ "$MIN_WAIT_TIME" -gt "$WAIT_TIME" ] ; then
					MIN_WAIT_TIME=$WAIT_TIME
				fi
				if [ "$MAX_WAIT_TIME" -lt "$WAIT_TIME" ] ; then
					MAX_WAIT_TIME=$WAIT_TIME
				fi
			fi
			
			# Reset all variables for the next iteration of the loop
			WAIT_TIME=""
			SUBMIT_SECONDS=""
			START_SECONDS=""
		done
		
		# Calculate the average queuing time as both an integer and floating point number
		AVG_WAIT_TIME=`echo "scale=2; $TOTAL_WAIT_TIME/$USER_SUBMIT_TIMES_COUNT" | bc`
		AVG_WAIT_TIME_INT=$(($TOTAL_WAIT_TIME/$USER_SUBMIT_TIMES_COUNT))
		
		TIME_INT=$AVG_WAIT_TIME_INT
		TIME=$AVG_WAIT_TIME
		
		gen_human_readable_time
		
		AVG_WAIT_TIME=$TIME
		
		TIME_INT=$MIN_WAIT_TIME
		TIME=$MIN_WAIT_TIME
		
		gen_human_readable_time
		
		MIN_WAIT_TIME=$TIME
		
		TIME_INT=$MAX_WAIT_TIME
		TIME=$MAX_WAIT_TIME
		
		gen_human_readable_time
		
		MAX_WAIT_TIME=$TIME
		
		# Display the average queuing time
		echo ""
		echo "During the last $DAYS days, jobs had to queue (wait to be run) for;"
		echo -e "\tOn average:\t$AVG_WAIT_TIME"
		echo -e "\tAt least:\t$MIN_WAIT_TIME"
		echo -e "\tAt most:\t$MAX_WAIT_TIME"
		echo ""
	else
		# If no, then display the wait time as 0 seconds
		echo ""
		echo "During the last $DAYS days, jobs had to queue (wait to be run) for;"
		echo -e "\tOn average:\t0 seconds"
		echo -e "\tAt least:\t0 seconds"
		echo -e "\tAt most:\t0 seconds"
		echo ""
	fi
}

# Check first argument
case "$DAYS" in
	"week")
		# If 'week' is specified, determine how many days into the week we are, starting from Monday
		DAYS=`date +%u`
		calc_avg
		;;
	"month")
		# If 'month' is specified, determine how many days into the month we are
		DAYS=`date +%d`
		calc_avg
		;;
	"year")
		# If 'year' is specified, determine how many days into the year we are
		DAYS=`date +%j`
		calc_avg
		;;
	(*[0-9])
		# If the input contains numbers, trim out all non-numeric characters and continue
		DAYS=`echo $DAYS | tr -d [:alpha:][:punct:]`
		calc_avg
		;;
	"help")
		# If 'help' is specified, display the help message and exit
		display_help
		exit
		;;
	*)
		# If the input is anything else, display the help message and exit
		display_help
		exit
		;;
esac

Queue Job Count

This one's quite simple - it determines what queues exist on the cluster, then counts up how many jobs have been submitted to each one over the specified period (week, month or year to date, or a number of days).

#!/bin/bash
#
#==========================================================================
# Name: SGE Queue Job Count
# Author: Chris Bingham
# Date: 28.11.2008
# Language: Bash
# External References: qconf, qacct, grep, date, tr
#
# This script will use the SGE command 'qconf' to get a list of queues
# configured on the cluster, then use 'qacct' to search back though job
# records for each queue for the specified number of days, and return a
# table of jobs counts for each queue, sorted highest to lowest
#==========================================================================

# Store the first argument; all other arguments will be discarded
DAYS=$1

# Use 'qconf' to get a list of queues, and flatten it to a single line
QUEUE_LIST=`qconf -sql`
QUEUE_LIST=`echo $QUEUE_LIST | tr -t ' ' " " `

# Create a variable to store the total job count
TOTAL_JOB_COUNT="0"

function get_job_count() {
	# For each queue found, do the following;
	for q in `echo -e $QUEUE_LIST` ; do
	
		# Use 'qacct' and 'grep' to get a count of the number of jobs for the
		# specified time period
		QUEUE_JOB_COUNT=`qacct -d $DAYS -q $q -j | grep "qname        $q" -c`
		
		# Add this to the total job count		
		TOTAL_JOB_COUNT=$(($TOTAL_JOB_COUNT+$QUEUE_JOB_COUNT))

		# Store the results for later output
		OUT="$OUT$QUEUE_JOB_COUNT\t\t$q\n"
	done
	
	# Output the results table, performing a reverse-numerical sort on the results
	echo -e "Job Count\tQueue"
	echo "--------------------------"
	echo -e $OUT | sort -nr
	echo "Total Job Count: $TOTAL_JOB_COUNT"
}



function display_help() {
	# Display a help message
	echo "---SGE Queue Job Count---"
	echo "This script will use the SGE command 'qacct' to search back though all the"
	echo "job records for the specified number of days, and return a table of the"
	echo "job counts for each queue configured on the system."
	echo ""
	echo "Usage: q_job_count.sh [DAYS|OPTION]"
	echo "Where 'DAYS' is a number of days to gather statistics for or 'OPTION' is one of the following;"
	echo -e "\tweek\tGather statistics for the week so far"
	echo -e "\tmonth\tGather statistics for the month to date"
	echo -e "\tyear\tGather statistics for the year to date"
	echo -e "\thelp\tDisplay this message"
}


# Check if the first argument was null, and display help and exit if so
if [ -z "$DAYS" ] ; then
	display_help
	exit
else
	# Else, select from the following options
	case "$DAYS" in
		"week")
			# If 'week' is specified, determine how many days into the week we are, starting from Monday
			DAYS=`date +%u`
			get_job_count
			;;
		"month")
			# If 'month' is specified, determine how many days into the month we are
			DAYS=`date +%d`
			get_job_count
			;;
		"year")
			# If 'year' is specified, determine how many days into the year we are
			DAYS=`date +%j`
			get_job_count
			;;
		(*[0-9])
			# Otherwise, if the input contains numbers, trim out all non-numeric characters and continue
			DAYS=`echo $DAYS | tr -d [:alpha:][:punct:]`
			get_job_count
			;;
		"help")
			# If 'help' is specified, display the help message and exit
			display_help
			exit
			;;
		*)
			# If the input is anything else, display the help message and exit
			display_help
			exit
			;;
	esac
fi

Change Queue State

A very short script that will either enable or disable all queue instances on the host it's run on. I've found it useful for quickly knocking out the queues on nodes that are being taken down for maintenance.

#!/bin/bash
#
#==========================================================================
# Name: Change SGE Queue Instance States
# Author: Chris Bingham
# Date: 28.11.2008
# Language: Bash
# External References: qselect, qmod, grep, tr
#
# This script will, based on the argument supplied, either enable or disable
# all queue instances on the current host
#==========================================================================

# Store the first argument; all other arguments will be discarded
ACTION=$1

# Convert any uppercase letters to lowercase, for the case statement below
ACTION=`echo $ACTION | tr -t [:upper:] [:lower:]`

# Determine what action to take based on the argument supplied
case "$ACTION" in
	"enable")
		# If the argument was 'enable', use 'qselect' to select all queue
		# instances on the current host and then enable them using 'qmod'
		qmod -e `qselect -q "*@$HOSTNAME"`
		;;
	"disable")
		# If the argument was 'disable', use 'qselect' to select all queue
		# instances on the current host and then disable them using 'qmod'
		qmod -d `qselect -q "*@$HOSTNAME"`
		;;
	*)
		# If the argument was anything else, display an error message
		echo "Invalid option: please enter either 'enable' or 'disable'"
		;;
esac

filter-accounting

A small Perl script to filter the accounting file by the end_time of the jobs. Mostly useful for splitting up an accounting file by year, for example.
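The Perl script itself isn't reproduced here, but the idea is easy to sketch with awk: the accounting file is colon-delimited, and end_time is a Unix timestamp in field 11 according to accounting(5) (verify the field position on your version). For example, keeping only jobs that ended during 2008, run against two fabricated records:

```shell
# Keep accounting records whose end_time (field 11, epoch seconds) falls
# in 2008: 1199145600 = 2008-01-01 UTC, 1230768000 = 2009-01-01 UTC.
# The two records below are fabricated; only the second ends in 2008.
printf '%s\n' \
  'all.q:node1:grp:alice:job1:1:sge:0:1196000000:1196000100:1196000200:0' \
  'all.q:node2:grp:bob:job2:2:sge:0:1209600000:1209600100:1209600200:0' |
awk -F: '$11 >= 1199145600 && $11 < 1230768000'
```

In practice you would run the awk program over the real accounting file (under $SGE_ROOT/default/common/) rather than sample records.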

Other Miscellaneous

Some other scripts, partially overlapping with those above, are listed at http://www.nw-grid.ac.uk/LivScripts.
