Utilities
Load Script
The following script simply submits a batch of sleep jobs, which is useful for testing your settings. It does no error checking, and you will have to modify the -q option to suit your system.
#!/bin/sh
# First argument is the number of jobs
# Second argument is the seconds to sleep
QSUB_OPTIONS=" \
-q "'*&!boinc'" \
-cwd \
-j y \
-V \
-N load \
-o `pwd`/load.out \
"
for i in `seq $1`
do
qsub $QSUB_OPTIONS /sjr/beodata/local/bin/vanilla_job.sh sleep $2
done
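Because the script does no error checking, a missing or non-numeric argument only shows up as a confusing failure from seq or qsub. A minimal sketch of an argument check that could sit at the top of the script follows; the helper name `is_positive_int` is hypothetical and not part of the original.

```shell
#!/bin/sh
# Hypothetical helper: succeeds only if $1 is a positive whole number.
is_positive_int() {
    case $1 in
        ''|*[!0-9]*) return 1 ;;   # empty, or contains a non-digit
    esac
    [ "$1" -gt 0 ]
}

# Intended use at the top of the load script:
#   if ! is_positive_int "$1" || ! is_positive_int "$2"; then
#       echo "Usage: $0 NUM_JOBS SLEEP_SECONDS" >&2
#       exit 1
#   fi
```

With this in place, running the script with no arguments prints a usage line instead of submitting nothing silently.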
Using Ganglia as Load Sensor
See Using Ganglia As Load Sensor for guidance on using Ganglia as a load sensor for Grid Engine.
Modding qstat
The output from qstat -ext, while complete, is often overly verbose (over 200 characters wide). In SGE 5.3, the command below strips out the Department, Deadline, PE Master, and Array Task columns, which are frequently unused. It makes a good alias (called "eqstat", say):
qstat -ext | cut -c 1-33,39-45,66-92,110-191
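One way to package this is as a shell function rather than an alias, so that extra qstat options can be passed through. This is only a sketch: the names `eqstat` and `eqstat_filter` are made up here, and the column ranges are the SGE 5.3 ones quoted above, so they will likely need adjusting for other releases.

```shell
# Keep only the character columns of interest
# (ranges copied from the SGE 5.3 one-liner).
eqstat_filter() {
    cut -c 1-33,39-45,66-92,110-191
}

# Hypothetical wrapper: run qstat -ext with any extra options, then trim it.
eqstat() {
    qstat -ext "$@" | eqstat_filter
}
```

Splitting the filter from the wrapper makes it easy to try new column ranges on a saved copy of the output, for example `eqstat_filter < saved_qstat.txt`.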
Modding qstat Redux
A longer script with more condensed output, from Andy Schwierskott, pulled from the SGE mailing list:
#!/bin/sh
echo "JobId P S Project User Tot-Tkt ovrts otckt dtckt ftckt stckt shr"
echo "---------------------------------------------------------------------------------------"
qstat -ext -s rs | grep -v job-ID | sed /-------------/d | \
gawk '{ printf("%5s %4s %4s %8s %8s %7s %7s %7s %7s %7s %7s %4s\n", \
$1, $2, $7, $5, $4, $13, $14, $15, $16, $17, $18, $19) }'
echo "- - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - -"
qstat -ext -s p | grep -v job-ID | sed /-------------/d | \
gawk '{ printf("%5s %4s %4s %8s %8s %7s %7s %7s %7s %7s %7s %4s\n", \
$1, $2, $7, $5, $4, $10, $11, $12, $13, $14, $15, $16 ); }'
User Management
If you have lots of users and groups to add at once, and your userlists map to Unix groups, qconf can help automate this a bit. This awk snippet looks for /etc/group entries starting with "grp" and prints, for each one, a qconf command that creates a matching userlist.
awk -F: '/^grp/{print "qconf -au ",$4,$1}' /etc/group
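The one-liner only prints the qconf commands; nothing is changed until its output is piped to a shell. A sketch of a slightly more defensive variant, assuming the usual name:password:gid:members layout of /etc/group. The function name `group_to_userlist_cmds` is hypothetical. It skips groups with no listed members, which would otherwise produce an incomplete qconf command.

```shell
# Print one "qconf -au members listname" command per matching group.
# Reads the named files, or standard input when none are given.
group_to_userlist_cmds() {
    awk -F: '/^grp/ && $4 != "" { printf "qconf -au %s %s\n", $4, $1 }' "$@"
}
```

Run `group_to_userlist_cmds /etc/group` first to review the commands, then append `| sh` once they look right.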
qcd
If your users share the same file space as the cluster, this alias changes directory to the current working directory of the given job ID.
It has to be an alias (or sourced) in order to affect the current shell.
I use tcsh, which is why there is all the bizarre backslashing.
Add the line below to the system shell startup file or to ~/.tcshrc.
# tcsh:
alias qcd cd\ \`qstat\ -j\ \!\*\|awk\ \'/^cwd/\{print\ \$2\}\'\`
For bash/ksh/zsh, a shell function works well:
function qcd {
cd `qstat -j $1 | awk '/^cwd/{print $2}'`;
}
This might be worth revisiting to make it a little nicer with the new XML format.
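When the job ID is unknown, the command substitution in the function above is empty and a bare cd goes to the home directory. A sketch of a sturdier variant for bash/ksh/zsh follows; the helper name `job_cwd` is hypothetical, and like the original it assumes the working directory contains no spaces.

```shell
# Extract the working directory from "qstat -j" style output on stdin.
job_cwd() {
    awk '/^cwd/ { print $2; exit }'
}

# Refuses to change directory when no cwd line is found for the job.
qcd() {
    dir=$(qstat -j "$1" 2>/dev/null | job_cwd)
    if [ -z "$dir" ]; then
        echo "qcd: no working directory found for job $1" >&2
        return 1
    fi
    cd "$dir" || return 1
}
```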
qavailable_processors
This script just sums up the processors that aren't performing a task. You will have to change the qstat options to suit your configuration.
#!/bin/sh
qstat -g c -l arch=lx24-amd64 -q all.q | awk 'NR > 2 {sum = sum + $4} END {print sum}'
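The awk part skips the two header lines and adds up the fourth field of each remaining row. A sketch that makes the field number a parameter, since the position of the available-slots column may differ between Grid Engine versions; check the header line of your own `qstat -g c` output. The function name `sum_column` is hypothetical.

```shell
# Sum field number $1 of "qstat -g c" style output on stdin,
# skipping the two header lines. Prints 0 when there are no rows.
sum_column() {
    awk -v col="$1" 'NR > 2 { sum += $col } END { print sum + 0 }'
}

# Intended use (the qstat options are site-specific):
#   qstat -g c -q all.q | sum_column 4
```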
qtail
The following script will 'tail' the end of the standard output file of the passed in job id.
#!/bin/sh
if [ $# -ne 1 ]; then
echo "Usage:"
echo " $0 <sge_job_number>"
exit
fi
out_path=`qstat -j "$1" | grep ^stdout_path_list | awk '{print $2}'`
# Nothing to tail if the job has no stdout path
if [ -z "$out_path" ]; then
exit
fi
if [ "$out_path" = "/dev/null" ]; then
echo "Standard output is not available because directed to /dev/null"
exit
fi
tail -f "$out_path"
rj: qsub wrapper script
I created a wrapper script for qsub that allows my users to not need to know the complexity of the cluster configurations or job submission. I called the wrapper script 'rj' (for run jobs) in order to distinguish it from the collection of 'q*' apps that come with Grid Engine.
It has the following features:
- Detects whether the application is checkpointable (with Condor)
- Detects whether the application is MPI (currently known to work with OpenMPI and possibly MPICH), and if so makes sure the user supplies the option specifying the number of processors
- Sets the maximum number of parallel processors to what is available so that the job can be dispatched immediately
Realize that this is very specific to my setup, and you will need to go through it and edit it to match what you want it to do. It also has a lot of history and should probably be rewritten. I am approaching the idea of allowing 'rj' to accept all arguments that 'qsub' accepts and pass them through, defaulting or forcing those options that I need to change.
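The -np handling in the script caps the requested slot count at the number of available slots, less one slot and a further 10%. A sketch of that arithmetic in isolation, with a hypothetical function name `max_slots`:

```shell
# max_slots AVAILABLE: the cap applied to -np, mirroring the
# int(sum - 1 - 0.1*sum) expression in the script's awk call.
max_slots() {
    awk -v avail="$1" 'BEGIN { print int(avail - 1 - 0.1 * avail) }'
}
```

For example, with 40 slots available the cap is 35, and with 16 it is 13 (13.4 truncated).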
#!/bin/bash
# Default
WHOAMI=`whoami`
EMAIL=${WHOAMI}@sjrwmd.com
EXTRA_OPTIONS=" \
-q "'*&!boinc'" \
-cwd \
-j y \
-M ${EMAIL} \
-V \
"
PE=mpi
NP_OPTION=0
P_OPTION=0
I_OPTION=0
function usage(){
echo "
rj [-np NUM] [-i INPUT_FILE] [-mb] EXECUTABLE
# EXECUTABLE runs on appropriate node.
rj -np NUM MPI_EXECUTABLE
# Runs the MPI parallel MPI_EXECUTABLE on NUM nodes.
# NUM can be a single number or a range, for example 2-8, would run the
# MPI_EXECUTABLE on at least 2 processors up to 8 processors.
rj -i INPUT_FILE EXECUTABLE
# Runs EXECUTABLE with interactive input supplied by INPUT_FILE.
rj [-mb] EXECUTABLE
# -b Waits for EXECUTABLE to finish (blocks) before returning.
# -m Sends e-mail at end of job."
exit 1
}
# Detect the project option -P and the -i option
# Didn't try getopt - suspect wouldn't work since command can have options.
while [ $# -ne 0 ]
do
case $1 in
-P) EXTRA_OPTIONS=" $EXTRA_OPTIONS $1 $2 "
shift
shift
P_OPTION=1
;;
-i) EXTRA_OPTIONS=" $EXTRA_OPTIONS $1 $2 "
shift
shift
I_OPTION=1
;;
-np) MAXNP=$2
max_num_slaves=`qstat -g c -l arch=lx24-amd64 | awk 'NR > 2 {sum = sum + $4} END {print int(sum - 1 - 0.1*sum)}'`
if [ $MAXNP -gt $max_num_slaves ]; then
MAXNP=$max_num_slaves
echo "Number of slaves set to maximum allowable = ${max_num_slaves}."
fi
EXTRA_OPTIONS=" $EXTRA_OPTIONS -pe $PE 1-$MAXNP "
shift
shift
NP_OPTION=1
;;
-b) EXTRA_OPTIONS=" $EXTRA_OPTIONS -sync y "
shift
;;
-m) EXTRA_OPTIONS=" $EXTRA_OPTIONS -m eas "
shift
;;
-h|--help) usage
;;
*) break
;;
esac
done
if [ $# -eq 0 ]; then
usage
fi
full_path=`which $1 2> /dev/null`
if [ "x"${full_path} == "x" ]; then
echo "Executable $1 not found in any directory in $PATH"
usage
fi
static=`file -L ${full_path} | grep --count --max-count=1 statically`
condor=`nm ${full_path} 2>&1| grep --count --max-count=1 Condor`
mpi=`nm ${full_path} 2>&1| grep --count --max-count=1 MPI_`
if [ ${mpi} -eq 0 ]; then
mpi=`ldd ${full_path} 2>&1| grep --count --max-count=1 libmpi`
fi
opteron=`file -L ${full_path} | grep --count --max-count=1 'AMD x86-64'`
APPEND_VAR="$NP_OPTION$P_OPTION$I_OPTION$static$condor$mpi$opteron"
. ${SGE_ROOT}/default/common/settings.sh
# JOBNAME has to be set AFTER options are processed
JOBNAME=`basename $1`
EXTRA_OPTIONS="${EXTRA_OPTIONS} -N ${JOBNAME} -o "`pwd`"/${JOBNAME}.out "
case $APPEND_VAR in
1????0?) echo ""
echo "You specified multiple processors but the executable is not MPI."
echo ""
usage
;;
0????1?) echo ""
echo "The executable is MPI, please pass the rj script the -np NUM option."
echo ""
usage
;;
1????1?) qsub ${EXTRA_OPTIONS} \
/sjr/beodata/local/bin/parallel_job.sh $*
;;
???11??) qsub ${EXTRA_OPTIONS} \
-ckpt condor_ckpt \
/sjr/beodata/local/bin/ckpt_job.sh $*
;;
??????1) qsub ${EXTRA_OPTIONS} \
-l arch=lx24-amd64 \
/sjr/beodata/local/bin/vanilla_job.sh $*
;;
???00??) qsub ${EXTRA_OPTIONS} \
/sjr/beodata/local/bin/vanilla_job.sh $*
;;
???10??) echo ""
echo "The executable is static, but without Condor."
echo "Checkpointing is only available with Condor."
echo "Use condor_compile to add the checkpointing libraries."
echo ""
qsub ${EXTRA_OPTIONS} \
/sjr/beodata/local/bin/vanilla_job.sh $*
;;
*) usage
;;
esac
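The final case statement dispatches on a seven-character flag string, one 0/1 digit per test, matched against glob patterns in order. The sketch below reproduces only that matching so the patterns can be tried without submitting anything. `classify_job` and the labels it prints are hypothetical.

```shell
# Flag string layout (same order as APPEND_VAR in rj):
#   np, project, input, static, condor, mpi, opteron
classify_job() {
    case $1 in
        1????0?) echo "error: -np given but executable is not MPI" ;;
        0????1?) echo "error: MPI executable needs -np" ;;
        1????1?) echo "parallel" ;;
        ???11??) echo "checkpoint" ;;
        ??????1) echo "vanilla-amd64" ;;
        ???00??) echo "vanilla" ;;
        ???10??) echo "vanilla-static-no-condor" ;;
        *)       echo "usage" ;;
    esac
}
```

The patterns only work because every flag is a single character, which is why the script's grep calls use `--count --max-count=1` to keep each count at 0 or 1. Order matters too: the first matching pattern wins, so the MPI checks take priority over the checkpoint and architecture cases.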
Required supporting scripts for 'rj':
vanilla_job.sh
#!/bin/bash
echo ${JOB_ID} > sge_job_id
: > $SGE_STDOUT_PATH
#$ -S /bin/bash
$*
parallel_job.sh
#!/bin/bash
echo ${JOB_ID} > sge_job_id
: > $SGE_STDOUT_PATH
#$ -S /bin/bash
mpirun -np $NSLOTS $*
ckpt_job.sh
#!/bin/bash
#$ -S /bin/bash
cd $PWD
echo ${JOB_ID} > sge_job_id
: > $SGE_STDOUT_PATH
CHECKPOINT_DIR="/sjr/beodata/tmp/ckpt"
OUTPUT="${CHECKPOINT_DIR}/${JOB_ID}.log"
echo "-------------------------------------" >> $OUTPUT
if [ ${RESTARTED} -eq 0 ];
then
echo " Starting job #${JOB_ID}" >> $OUTPUT
$1 -_condor_ckpt ${CHECKPOINT_DIR}/${JOB_ID}.ckpt $@ &
else
echo " Re-Starting job #${JOB_ID}" >> $OUTPUT
$1 -_condor_restart ${CHECKPOINT_DIR}/${JOB_ID}.ckpt &
fi
PROC_PID=$!
echo $! > ${CHECKPOINT_DIR}/${JOB_ID}.pid
echo "-------------------------------------" >> $OUTPUT
echo " Date: `date`" >> $OUTPUT
echo " Job PID: ${PROC_PID}" >> $OUTPUT
echo " Hostname: `hostname`" >> $OUTPUT
echo "-------------------------------------" >> $OUTPUT
wait
User Job Stats
I created this script to automate some monthly reporting I was required to do. Its features are:
- Display figures for the week, month or year to date, or for the specified number of days
- Display total cluster usage, globally and for each user
- Display average cluster usage, for each user and the usage of the 'average' job across the whole cluster
It has a help function with usage examples, but if anyone needs pointers feel free to drop me a line.
NB: To determine which users have actually logged on to the cluster, it gets a list of home directories from the path specified by 'HOME_DIR_LOCATION'. You may need to alter the value of this variable to suit your system for the script to work.
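In total mode, the "% of Wallclock Time" column is a user's wallclock seconds divided by the wallclock seconds the whole cluster could have delivered (days x CPUs x 86400). A sketch of that calculation on its own, using awk in place of the script's expr and bc calls; the function name `percent_of_cluster` is hypothetical.

```shell
# percent_of_cluster WALLCLOCK_SECONDS CPU_COUNT DAYS
percent_of_cluster() {
    awk -v used="$1" -v cpus="$2" -v days="$3" 'BEGIN {
        available = cpus * days * 24 * 60 * 60   # seconds the cluster offered
        printf "%.2f\n", used * 100 / available
    }'
}
```

For a 16-CPU cluster over 7 days there are 9,676,800 seconds available, so a user who consumed 483,840 wallclock seconds used 5.00% of the cluster. Note that awk rounds the result while the script's bc call truncates it, so the last digit can differ.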
#!/bin/bash
#
#==========================================================================
# Name: User SGE Job Statistics
# Author: Chris Bingham
# Date: 11.11.2008
# Language: Bash
# External References: ls, qacct, grep, cut, expr, getent, tr, date, qconf
#
# This script will use the SGE command 'qacct' to search back through all the
# job records for the specified number of days, and return a table of
# statistics for each user's usage of the cluster, sorted highest to lowest.
#==========================================================================
# Global Variables
HOME_DIR_LOCATION="/home"
# Store the first two arguments, all other arguments will be discarded
DAYS=$1
TOTAL=$2
function display_help() {
# Display a help message
echo "---User SGE Job Statistics---"
echo "This script will use the SGE command 'qacct' to search back through all the"
echo "job records for the specified number of days, and return a table of statistics"
echo "for each user's usage of the cluster, sorted highest to lowest."
echo ""
echo "--Usage--"
echo -e " user_job_stats.sh [DAYS|OPTION] [total]"
echo -e " Where 'DAYS' is a number of days to gather statistics for or 'OPTION' is one of the following;"
echo -e " \tweek\t\t\tGather statistics for the week so far"
echo -e " \tmonth\t\t\tGather statistics for the month to date"
echo -e " \tyear\t\t\tGather statistics for the year to date"
echo -e " \thelp\t\t\tDisplay this message"
echo -e " Specifying 'total' at the end of the line will generate total usage statistics rather than average usage statistics"
echo ""
echo "--Usage Examples--"
echo -e " user_job_stats.sh 10\t\tGet average statistics for the last 10 days"
echo -e " user_job_stats.sh 10 total\tGet total statistics for the last 10 days"
echo -e " user_job_stats.sh month\tGet average statistics for the month to date"
echo -e " user_job_stats.sh week total\tGet total statistics for the week to date"
echo ""
echo "--Definitions--"
echo -e " CPU Time\t\t\tThe amount of time for which jobs were using CPU resources, measured using SGE's 'cpu' metric from 'qacct'"
echo -e " Wallclock Time\t\tThe amount of time for which jobs were running."
echo -e " % of Wallclock Time\t\tThe percentage of the cluster's overall total available wallclock time used."
echo -e " \t\t\t\tTotal available wallclock time is calculated as: DAYS * Number of CPUs in Cluster"
}
function averages_or_totals() {
# Determine if the user requested totals or averages
if [ "$TOTAL" = "total" ] ; then
gen_totals
elif [ -z "$TOTAL" ] ; then
gen_averages
else
display_help
fi
}
function gen_human_readable_time() {
# Convert a time in seconds into a more human-friendly scale (hours, days etc instead of seconds)
# If the time span is less than 1 hour, convert to minutes
if [ "$TIME" -lt "3600" ] ; then
TIME=`echo "scale=2; $TIME/60" | bc`
TIME="$TIME minutes"
# If the time span is less than 1 day, convert to hours
elif [ "$TIME" -lt "86400" ] ; then
TIME=`echo "scale=2; $TIME/3600" | bc`
TIME="$TIME hours"
# If the time span is less than 1 week, convert to days
elif [ "$TIME" -lt "604800" ] ; then
TIME=`echo "scale=2; $TIME/86400" | bc`
TIME="$TIME days"
# If the time span is less than 1 year, convert to weeks
elif [ "$TIME" -lt "31449600" ] ; then
TIME=`echo "scale=2; $TIME/604800" | bc`
TIME="$TIME weeks"
# If the time span is 1 year or more, convert to years
else
TIME=`echo "scale=2; $TIME/31449600" | bc`
TIME="$TIME years"
fi
}
function get_userlist() {
# Get a list of usernames by listing the home directories
USER_LIST=`ls $HOME_DIR_LOCATION`
}
function calc_cpu_wallclock() {
# Calculate the total amount of available cluster time during the specified number of days
# Get the total processor count across all execution hosts from SGE
CPU_COUNT=`qconf -sep | grep -i sum | tr -d [:alpha:][:space:]`
# Calculate the total amount of wallclock for all nodes in seconds
CPU_WALLCLOCK=$(($CPU_COUNT * $DAYS * 24 * 60 * 60))
# Calculate 1% of the total cluster wallclock (for later use)
CPU_WALLCLOCK_100=`expr $CPU_WALLCLOCK / 100`
}
function gen_totals() {
# Call 'get_userlist' to get a user name list
get_userlist
# Get the total amount of available cluster time during the specified number of days
calc_cpu_wallclock
# Reset the total counters to zero
TOTAL_PERCENT_CPU_WALLCLOCK="0"
TOTAL_CPU_WALLCLOCK="0"
TOTAL_CLUSTER_CPUTIME="0"
TOTAL_JOB_COUNT="0"
# For each username found, do the following;
for i in $USER_LIST ; do
# Use 'qacct' and 'grep' to count the total number of jobs they've submitted
USER_USER_JOB_COUNT=`qacct -o "$i" -j -d $DAYS | grep "jobname" -c`
TOTAL_JOB_COUNT=`expr $TOTAL_JOB_COUNT + $USER_USER_JOB_COUNT`
# If the user has submitted no jobs, record their utilisation as zero,
# else, use 'qacct', 'grep' and 'cut' to get an array of the wallclock and CPU time
# counters for all of their jobs
if [ "$USER_USER_JOB_COUNT" = "0" ] ; then
USER_TOTAL_WALLCLOCK="0"
USER_TOTAL_CPUTIME="0"
USER_PERCENT_CPU_WALLCLOCK="0"
else
USER_TOTAL_WALLCLOCK=`qacct -o "$i" -d $DAYS | grep "$i" | tr -s " " "\t" | cut -f 2`
USER_TOTAL_CPUTIME=`qacct -o "$i" -d $DAYS | grep "$i" | tr -s " " "\t" | cut -f 5`
USER_PERCENT_CPU_WALLCLOCK=`echo "scale=2; $USER_TOTAL_WALLCLOCK/$CPU_WALLCLOCK_100" | bc`
fi
if [ -z "$USER_TOTAL_WALLCLOCK" ] ; then
USER_TOTAL_WALLCLOCK="0"
USER_PERCENT_CPU_WALLCLOCK="0"
fi
if [ -z "$USER_TOTAL_CPUTIME" ] ; then
USER_TOTAL_CPUTIME="0"
USER_PERCENT_CPU_WALLCLOCK="0"
fi
# Add this user's percentage of cluster time consumed to the total
TOTAL_PERCENT_CPU_WALLCLOCK=`echo "scale=2; $TOTAL_PERCENT_CPU_WALLCLOCK+$USER_PERCENT_CPU_WALLCLOCK" | bc`
TOTAL_CPU_WALLCLOCK=`echo "scale=2; $TOTAL_CPU_WALLCLOCK+$USER_TOTAL_WALLCLOCK" | bc`
TOTAL_CLUSTER_CPUTIME=`echo "scale=2; $TOTAL_CLUSTER_CPUTIME+$USER_TOTAL_CPUTIME" | bc`
# Convert the user's username into their actual name
USERNAME=`getent passwd "$i" | cut -d ":" -f 5`
# If the passwd file didn't contain the user's actual name, use their username
if [ -z "$USERNAME" ] ; then
USERNAME="$i"
fi
# Start the output string
OUT="$OUT$USER_PERCENT_CPU_WALLCLOCK \t\t\t"
# If the total wallclock and CPU time are less than a million, add an extra tab
# after them to improve the readability of the output table
if [ "$USER_TOTAL_WALLCLOCK" -ge "1000000" ] ; then
OUT="$OUT$USER_TOTAL_WALLCLOCK \t\t"
else
OUT="$OUT$USER_TOTAL_WALLCLOCK \t\t\t"
fi
if [ "$USER_TOTAL_CPUTIME" -ge "1000000" ] ; then
OUT="$OUT$USER_TOTAL_CPUTIME \t$USER_USER_JOB_COUNT \t\t\t$USERNAME \n"
else
OUT="$OUT$USER_TOTAL_CPUTIME \t\t$USER_USER_JOB_COUNT \t\t\t$USERNAME \n"
fi
# Clean up all variables for the next loop
USER_USER_JOB_COUNT=""
USER_TOTAL_WALLCLOCK=""
USER_TOTAL_CPUTIME=""
USER_PERCENT_CPU_WALLCLOCK=""
USERNAME=""
done
# Output the results table, performing a reverse-numerical sort on the results
echo -e "% of Wallclock Time\tTotal Wallclock Time\tTotal CPU Time\tTotal Number of Jobs\tUser Name"
echo "-------------------------------------------------------------------------------------------------"
echo -e $OUT | sort -nr
echo "-------------------------------------------------------------------------------------------------"
# Calculate the percentage of time the cluster was idle for
CLUSTER_IDLE=`echo "scale=2; 100-$TOTAL_PERCENT_CPU_WALLCLOCK" | bc`
# Convert the total wallclock time in to a human-readable form
TIME="$TOTAL_CPU_WALLCLOCK"
gen_human_readable_time
TOTAL_CPU_WALLCLOCK="$TIME"
# Convert the total CPU time in to a human-readable form
TIME="$TOTAL_CLUSTER_CPUTIME"
gen_human_readable_time
TOTAL_CLUSTER_CPUTIME="$TIME"
# Output overall cluster utilisation statistics
echo "$TOTAL_JOB_COUNT jobs used $TOTAL_CPU_WALLCLOCK of Wallclock Time and $TOTAL_CLUSTER_CPUTIME of CPU Time during the last $DAYS days, and the cluster was $CLUSTER_IDLE% idle."
}
function gen_averages() {
# Call 'get_userlist' to get a user name list
get_userlist
# For each username found, do the following;
for i in $USER_LIST ; do
# Use 'qacct' and 'grep' to count the total number of jobs they've submitted
USER_JOB_COUNT=`qacct -o "$i" -j -d $DAYS | grep "jobname" -c`
# Add the user's job count to the overall total
TOTAL_JOB_COUNT=`expr $TOTAL_JOB_COUNT + $USER_JOB_COUNT`
# If the user has submitted no jobs, record their utilisation as zero,
# else, use 'qacct', 'grep' and 'cut' to get an array of the wallclock and CPU time
# counters for all of their jobs
if [ "$USER_JOB_COUNT" = "0" ] ; then
USER_JOB_WALLCLOCKS="0"
USER_JOB_CPUTIMES="0"
else
USER_JOB_WALLCLOCKS=`qacct -o "$i" -j -d $DAYS | grep "wallclock " | cut -d " " -f 2`
USER_JOB_CPUTIMES=`qacct -o "$i" -j -d $DAYS | grep "cpu " | cut -c 14-100`
fi
# Set 'COUNT' to zero
COUNT="0"
# If the user's utilisation isn't zero, calculate their average wallclock time
if [ "$USER_JOB_WALLCLOCKS" != "0" ] ; then
for a in $USER_JOB_WALLCLOCKS ; do
USER_TOTAL_WALLCLOCK=`expr $USER_TOTAL_WALLCLOCK + $a`
COUNT=`expr $COUNT + 1`
done
# Add the user's total wallclock time to the overall total
TOTAL_JOB_WALLCLOCK=`expr $TOTAL_JOB_WALLCLOCK + $USER_TOTAL_WALLCLOCK`
USER_AVG_JOB_WALLCLOCK=`expr $USER_TOTAL_WALLCLOCK / $COUNT`
else
USER_AVG_JOB_WALLCLOCK="0"
fi
# Set 'COUNT' to zero
COUNT="0"
# If the user's utilisation isn't zero, calculate their average CPU time
if [ "$USER_JOB_CPUTIMES" != "0" ] ; then
for b in $USER_JOB_CPUTIMES ; do
USER_TOTAL_CPUTIME=`expr $USER_TOTAL_CPUTIME + $b`
COUNT=`expr $COUNT + 1`
done
# Add the user's total CPU time to the overall total
TOTAL_JOB_CPUTIME=`expr $TOTAL_JOB_CPUTIME + $USER_TOTAL_CPUTIME`
USER_AVG_JOB_CPUTIME=`expr $USER_TOTAL_CPUTIME / $COUNT`
else
USER_AVG_JOB_CPUTIME="0"
fi
# Convert the user's username into their actual name
USERNAME=`getent passwd "$i" | cut -d ":" -f 5`
# If the passwd file didn't contain the user's actual name, use their username
if [ -z "$USERNAME" ] ; then
USERNAME="$i"
fi
# If the average wallclock and CPU time are less than a million, add an extra tab
# after them to improve the readability of the output table
if [ "$USER_AVG_JOB_WALLCLOCK" -ge "1000000" ] ; then
OUT="$OUT$USER_AVG_JOB_WALLCLOCK \t\t\t"
else
OUT="$OUT$USER_AVG_JOB_WALLCLOCK \t\t\t\t"
fi
if [ "$USER_AVG_JOB_CPUTIME" -ge "1000000" ] ; then
OUT="$OUT$USER_AVG_JOB_CPUTIME \t\t$USER_JOB_COUNT \t\t\t$USERNAME \n"
else
OUT="$OUT$USER_AVG_JOB_CPUTIME \t\t\t$USER_JOB_COUNT \t\t\t$USERNAME \n"
fi
# Clean up all variables for the next loop
USER_JOB_COUNT=""
USER_JOB_WALLCLOCKS=""
USER_JOB_CPUTIMES=""
COUNT=""
USER_TOTAL_WALLCLOCK=""
USER_TOTAL_CPUTIME=""
USER_AVG_JOB_WALLCLOCK=""
USER_AVG_JOB_CPUTIME=""
USERNAME=""
done
# Output the results table, performing a reverse-numerical sort on the results
echo -e "Average Wallclock Time/Job\tAverage CPU Time/Job\tTotal Number of Jobs\tUser Name"
echo "-------------------------------------------------------------------------------------------"
echo -e $OUT | sort -nr
echo "-------------------------------------------------------------------------------------------"
# If the total job count is 0, record all overall averages as 0, otherwise calculate them
if [ "$TOTAL_JOB_COUNT" = "0" ] ; then
TOTAL_AVG_JOB_WALLCLOCK="0"
TOTAL_AVG_JOB_CPUTIME="0"
else
TOTAL_AVG_JOB_WALLCLOCK=`expr $TOTAL_JOB_WALLCLOCK / $TOTAL_JOB_COUNT`
TOTAL_AVG_JOB_CPUTIME=`expr $TOTAL_JOB_CPUTIME / $TOTAL_JOB_COUNT`
fi
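# Calculate average job idle time, avoiding a division by zero when no jobs ran
if [ "$TOTAL_AVG_JOB_WALLCLOCK" = "0" ] ; then
TOTAL_AVG_JOB_IDLE="0"
else
TOTAL_AVG_JOB_IDLE=`echo "scale=2; 100-(($TOTAL_AVG_JOB_CPUTIME/$TOTAL_AVG_JOB_WALLCLOCK)*100)" | bc`
fi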
# Convert the total wallclock time in to a human-readable form
TIME="$TOTAL_AVG_JOB_WALLCLOCK"
gen_human_readable_time
TOTAL_AVG_JOB_WALLCLOCK="$TIME"
# Convert the total CPU time in to a human-readable form
TIME="$TOTAL_AVG_JOB_CPUTIME"
gen_human_readable_time
TOTAL_AVG_JOB_CPUTIME="$TIME"
# Output the overall average job statistics
echo "The average job during the last $DAYS days took $TOTAL_AVG_JOB_WALLCLOCK to complete, consumed $TOTAL_AVG_JOB_CPUTIME of CPU Time and was idle for $TOTAL_AVG_JOB_IDLE% of the time."
}
# Check if the first argument was null, and display help and exit if so
if [ -z "$DAYS" ] ; then
display_help
exit
else
# Else, select from the following options
case "$DAYS" in
"week")
# If 'week' is specified, determine how many days into the week we are, starting from Monday
DAYS=`date +%u`
averages_or_totals
;;
"month")
# If 'month' is specified, determine how many days into the month we are
DAYS=`date +%d`
averages_or_totals
;;
"year")
# If 'year' is specified, determine how many days into the year we are
DAYS=`date +%j`
averages_or_totals
;;
(*[0-9])
# Otherwise, if the input contains numbers, trim out all non-numeric characters and continue
DAYS=`echo $DAYS | tr -d [:alpha:][:punct:]`
averages_or_totals
;;
"help")
# If 'help' is specified, display the help message and exit
display_help
exit
;;
*)
# If the input is anything else, display the help message and exit
display_help
exit
;;
esac
fi
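The script's gen_human_readable_time function communicates through the global variable TIME. A sketch of the same conversion as a function that takes the seconds as an argument and prints the result, using awk instead of bc; the name `human_time` is hypothetical and the thresholds are copied from the script.

```shell
# human_time SECONDS  ->  "N.NN unit"
human_time() {
    awk -v s="$1" 'BEGIN {
        if      (s < 3600)     printf "%.2f minutes\n", s / 60
        else if (s < 86400)    printf "%.2f hours\n",   s / 3600
        else if (s < 604800)   printf "%.2f days\n",    s / 86400
        else if (s < 31449600) printf "%.2f weeks\n",   s / 604800
        else                   printf "%.2f years\n",   s / 31449600
    }'
}
```

Note that the year threshold of 31449600 seconds is 52 weeks (364 days), not a calendar year, so figures reported in years are slightly overstated.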
qtime
This script was written to allow people to get an idea of how long their job might have to wait to be executed at any given time. It iterates through all completed jobs from the given time period (week, month or year to date, or a number of days) and calculates the minimum, maximum and average wait times.
On one system I've worked on the average wait time over a month was used as a measurement of system performance in the SLA, with this script providing the figures.
Again, it has a help function with usage examples, but if anyone needs pointers feel free to drop me a line.
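The core of the calculation is converting each job's submit and start timestamps to seconds since the epoch, subtracting, and then reducing the differences to a minimum, maximum and average. A sketch of those two steps as standalone functions; the names `wait_seconds` and `wait_stats` are hypothetical, and like the script this relies on the -d option of GNU date.

```shell
# wait_seconds "SUBMIT_TIME" "START_TIME"
# Both arguments are date strings in the style qacct prints.
wait_seconds() {
    submit=$(date -u -d "$1" +%s) || return 1
    start=$(date -u -d "$2" +%s) || return 1
    echo $((start - submit))
}

# Print "min max average" for one wait time per line on stdin.
wait_stats() {
    awk 'NR == 1 { min = max = $1 }
         { if ($1 < min) min = $1
           if ($1 > max) max = $1
           total += $1 }
         END { if (NR) printf "%d %d %.2f\n", min, max, total / NR }'
}
```

Doing the reduction in a single awk pass avoids the per-job shell arithmetic, which can matter when the accounting period covers many thousands of jobs.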
#!/bin/bash
#
#==========================================================================
# Name: Average Queuing Time
# Author: Chris Bingham
# Date: 12.02.2009
# Language: Bash
# External References: qacct, grep, cut, bc
#
# This script will calculate the average time jobs have had to spend
# queuing before being run over the specified time period
#==========================================================================
# Store the first argument
DAYS="$1"
function display_help() {
# Display a help message
echo "---Average Queuing Time---"
echo "This script will calculate the average time jobs have had to spend"
echo "queuing before being run over the specified time period"
echo ""
echo "--Usage--"
echo -e " qtime.sh [DAYS|OPTION]"
echo -e " Where 'DAYS' is a number of days to calculate the average for or 'OPTION' is one of the following;"
echo -e " \tweek\t\t\tCalculate the average for the week so far"
echo -e " \tmonth\t\t\tCalculate the average for the month to date"
echo -e " \tyear\t\t\tCalculate the average for the year to date"
echo -e " \thelp\t\t\tDisplay this message"
}
function gen_human_readable_time() {
# Convert a time in seconds into a more human-friendly scale (hours, days etc instead of seconds)
# If the time span is less than 1 minute, leave it in seconds
if [ "$TIME_INT" -lt "60" ] ; then
TIME="$TIME seconds"
# If the time span is less than 1 hour, convert to minutes
elif [ "$TIME_INT" -lt "3600" ] ; then
TIME=`echo "scale=2; $TIME/60" | bc`
TIME="$TIME minutes"
# If the time span is less than 1 day, convert to hours
elif [ "$TIME_INT" -lt "86400" ] ; then
TIME=`echo "scale=2; $TIME/3600" | bc`
TIME="$TIME hours"
# If the time span is less than 1 week, convert to days
elif [ "$TIME_INT" -lt "604800" ] ; then
TIME=`echo "scale=2; $TIME/86400" | bc`
TIME="$TIME days"
# If the time span is less than 1 year, convert to weeks
elif [ "$TIME_INT" -lt "31449600" ] ; then
TIME=`echo "scale=2; $TIME/604800" | bc`
TIME="$TIME weeks"
# If the time span is 1 year or more, convert to years
else
TIME=`echo "scale=2; $TIME/31449600" | bc`
TIME="$TIME years"
fi
}
function calc_avg() {
# Set the field separator for array creation to a new line
OLDIFS=$IFS
IFS=$'\n'
# Get information for SGE using the 'qacct' command, storing submit and start times in arrays
USER_JOB_COUNT=`qacct -j -d $DAYS | grep "jobname" -c`
USER_SUBMIT_TIMES=($(qacct -j -d $DAYS | grep "qsub_time" | cut -d " " -f 5-9))
USER_START_TIMES=($(qacct -j -d $DAYS | grep "start_time" | cut -d " " -f 4-9))
# Get the length of one of the arrays
USER_SUBMIT_TIMES_COUNT=${#USER_SUBMIT_TIMES[@]}
# Reset the field separator to its previous value
IFS=$OLDIFS
# Create variables to store min and max wait times
MIN_WAIT_TIME=""
MAX_WAIT_TIME=""
# Determine if any jobs have been completed over the specified time period
if [ "$USER_SUBMIT_TIMES_COUNT" -gt "0" ] ; then
# If yes, then calculate the average wait time
# For each element in the arrays, do the following;
for (( i=0; i<${USER_SUBMIT_TIMES_COUNT}; i++ )) ; do
# Convert the submit and start time to seconds since the epoch
SUBMIT_SECONDS=`date -d "${USER_SUBMIT_TIMES[$i]}" +%s`
START_SECONDS=`date -d "${USER_START_TIMES[$i]}" +%s`
# Calculate how long the job was queuing, and add this to the total queuing time
WAIT_TIME=$(($START_SECONDS-$SUBMIT_SECONDS))
TOTAL_WAIT_TIME=$((TOTAL_WAIT_TIME+$WAIT_TIME))
if [ -z "$MIN_WAIT_TIME" ] ; then
MIN_WAIT_TIME=$WAIT_TIME
MAX_WAIT_TIME=$WAIT_TIME
else
if [ "$MIN_WAIT_TIME" -gt "$WAIT_TIME" ] ; then
MIN_WAIT_TIME=$WAIT_TIME
fi
if [ "$MAX_WAIT_TIME" -lt "$WAIT_TIME" ] ; then
MAX_WAIT_TIME=$WAIT_TIME
fi
fi
# Reset all variables for the next iteration of the loop
WAIT_TIME=""
SUBMIT_SECONDS=""
START_SECONDS=""
done
# Calculate the average queuing time as both an integer and floating point number
AVG_WAIT_TIME=`echo "scale=2; $TOTAL_WAIT_TIME/$USER_SUBMIT_TIMES_COUNT" | bc`
AVG_WAIT_TIME_INT=$(($TOTAL_WAIT_TIME/$USER_SUBMIT_TIMES_COUNT))
TIME_INT=$AVG_WAIT_TIME_INT
TIME=$AVG_WAIT_TIME
gen_human_readable_time
AVG_WAIT_TIME=$TIME
TIME_INT=$MIN_WAIT_TIME
TIME=$MIN_WAIT_TIME
gen_human_readable_time
MIN_WAIT_TIME=$TIME
TIME_INT=$MAX_WAIT_TIME
TIME=$MAX_WAIT_TIME
gen_human_readable_time
MAX_WAIT_TIME=$TIME
# Display the average queuing time
echo ""
echo "During the last $DAYS days, jobs had to queue (wait to be run) for;"
echo -e "\tOn average:\t$AVG_WAIT_TIME"
echo -e "\tAt least:\t$MIN_WAIT_TIME"
echo -e "\tAt most:\t$MAX_WAIT_TIME"
echo ""
else
# If no, then display the wait time as 0 seconds
echo ""
echo "During the last $DAYS days, jobs had to queue (wait to be run) for;"
echo -e "\tOn average:\t0 seconds"
echo -e "\tAt least:\t0 seconds"
echo -e "\tAt most:\t0 seconds"
echo ""
fi
}
# Check first argument
case "$DAYS" in
"week")
# If 'week' is specified, determine how many days into the week we are, starting from Monday
DAYS=`date +%u`
calc_avg
;;
"month")
# If 'month' is specified, determine how many days into the month we are
DAYS=`date +%d`
calc_avg
;;
"year")
# If 'year' is specified, determine how many days into the year we are
DAYS=`date +%j`
calc_avg
;;
(*[0-9])
# If the input contains numbers, trim out all non-numeric characters and continue
DAYS=`echo $DAYS | tr -d [:alpha:][:punct:]`
calc_avg
;;
"help")
# If 'help' is specified, display the help message and exit
display_help
exit
;;
*)
# If the input is anything else, display the help message and exit
display_help
exit
;;
esac
Queue Job Count
This one's quite simple - it determines what queues exist on the cluster, then counts up how many jobs have been submitted to each one over the specified period (week, month or year to date, or a number of days).
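The script runs qacct once per queue. As an alternative sketch, not the author's method, the counts for every queue can be taken in one pass over the output of a single `qacct -j -d DAYS` call, assuming each job record contains one "qname" line as the script's own grep implies. The function name `count_jobs_by_queue` is hypothetical.

```shell
# Read "qacct -j" style output on stdin and print "count<TAB>queue",
# highest count first.
count_jobs_by_queue() {
    awk '$1 == "qname" { n[$2]++ }
         END { for (q in n) printf "%d\t%s\n", n[q], q }' | sort -nr
}

# Intended use:
#   qacct -j -d 30 | count_jobs_by_queue
```

Unlike the original, this only lists queues that actually ran jobs in the period; queues with a zero count are omitted.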
#!/bin/bash
#
#==========================================================================
# Name: SGE Queue Job Count
# Author: Chris Bingham
# Date: 28.11.2008
# Language: Bash
# External References: qconf, qacct, grep, date, tr
#
# This script will use the SGE command 'qconf' to get a list of queues
# configured on the cluster, then use 'qacct' to search back through job
# records for each queue for the specified number of days, and return a
# table of jobs counts for each queue, sorted highest to lowest
#==========================================================================
# Store the first argument, all other arguments will be discarded
DAYS=$1
# Use 'qconf' to get a list of queues, and convert it to a space-separated list
QUEUE_LIST=`qconf -sql`
QUEUE_LIST=`echo $QUEUE_LIST | tr -t ' ' " " `
# Create a variable to store the total job count
TOTAL_JOB_COUNT="0"
function get_job_count() {
# For each queue found, do the following;
for q in `echo -e $QUEUE_LIST` ; do
# Use 'qacct' and 'grep' to get a count of the number of jobs for the
# specified time period
QUEUE_JOB_COUNT=`qacct -d $DAYS -q $q -j | grep "qname $q" -c`
# Add this to the total job count
TOTAL_JOB_COUNT=$(($TOTAL_JOB_COUNT+$QUEUE_JOB_COUNT))
# Store the results for later output
OUT="$OUT$QUEUE_JOB_COUNT\t\t$q\n"
done
# Output the results table, performing a reverse-numerical sort on the results
echo -e "Job Count\tQueue"
echo "--------------------------"
echo -e $OUT | sort -nr
echo "Total Job Count: $TOTAL_JOB_COUNT"
}
function display_help() {
# Display a help message
echo "---SGE Queue Job Count---"
echo "This script will use the SGE command 'qacct' to search back through all the"
echo "job records for the specified number of days, and return a table of the"
echo "job counts for each queue configured on the system."
echo ""
echo "Usage: q_job_count.sh [DAYS|OPTION]"
echo "Where 'DAYS' is a number of days to gather statistics for or 'OPTION' is one of the following;"
echo -e "\tweek\tGather statistics for the week so far"
echo -e "\tmonth\tGather statistics for the month to date"
echo -e "\tyear\tGather statistics for the year to date"
echo -e "\thelp\tDisplay this message"
}
# Check if the first argument was null, and display help and exit if so
if [ -z "$DAYS" ] ; then
display_help
exit
else
# Else, select from the following options
case "$DAYS" in
"week")
# If 'week' is specified, determine how many days into the week we are, starting from Monday
DAYS=`date +%u`
get_job_count
;;
"month")
# If 'month' is specified, determine how many days into the month we are
DAYS=`date +%d`
get_job_count
;;
"year")
# If 'year' is specified, determine how many days into the year we are
DAYS=`date +%j`
get_job_count
;;
(*[0-9])
# Otherwise, if the input contains numbers, trim out all non-numeric characters and continue
DAYS=`echo $DAYS | tr -d [:alpha:][:punct:]`
get_job_count
;;
"help")
# If 'help' is specified, display the help message and exit
display_help
exit
;;
*)
# If the input is anything else, display the help message and exit
display_help
exit
;;
esac
fi
Change Queue State
A very short script that will either enable or disable all queue instances on the host it's run on - I've found it useful for quickly knocking out the queues on nodes that are being taken down for maintenance.
#!/bin/bash
#
#==========================================================================
# Name: Change SGE Queue Instance States
# Author: Chris Bingham
# Date: 28.11.2008
# Language: Bash
# External References: qselect, qmod, grep, tr
#
# This script will, based on the argument supplied, either enable or disable
# all queue instances on the current host
#==========================================================================
# Store the first argument, all other arguments will be discarded
ACTION=$1
# Convert any uppercase letters to lowercase, for the case statement below
ACTION=`echo $ACTION | tr -t '[:upper:]' '[:lower:]'`
# Determine what action to take based on the argument supplied
case "$ACTION" in
"enable")
# If the argument was 'enable', then use 'qselect' to select all
# queue instances on the current host and then enable them using 'qmod'
qmod -e `qselect -q "*@$HOSTNAME"`
;;
"disable")
# If the argument was 'disable', then use 'qselect' to select all
# queue instances on the current host and then disable them using 'qmod'
qmod -d `qselect -q "*@$HOSTNAME"`
;;
*)
# If the argument was anything else, display an error message
echo "Invalid option: please enter either 'enable' or 'disable'"
;;
esac
filter-accounting
A small Perl script to filter the accounting file by the end_time of the jobs. Mostly useful for splitting up an accounting file by years, for example.
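The Perl script itself is not reproduced here. As a rough shell sketch of the same idea, and not the original script, the filter below keeps accounting records whose end_time falls within a given range. It assumes the colon-separated accounting file format with end_time, in seconds since the epoch, as the 11th field; check accounting(5) for your release before relying on it. The function name `filter_accounting` is hypothetical.

```shell
# filter_accounting FROM_EPOCH TO_EPOCH [FILE...]
# Prints records whose end_time (assumed to be field 11) lies in
# the half-open range [FROM_EPOCH, TO_EPOCH). Comment lines are dropped.
filter_accounting() {
    from=$1
    to=$2
    shift 2
    awk -F: -v from="$from" -v to="$to" \
        '/^#/ { next } $11 >= from && $11 < to' "$@"
}
```

To extract the records for 2009 (in UTC), the bounds would be 1230768000 and 1262304000; with GNU date they can be computed as `date -u -d 2009-01-01 +%s` and `date -u -d 2010-01-01 +%s`.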
Other Miscellaneous
Some others, partially overlapping with those above, are listed at http://www.nw-grid.ac.uk/LivScripts.